This string may be stored in the computer, and the user may want to know whether it contains the word “man”. If it does, the user may then want to change the word “man” to “woman”, so that the string reads: “Here is my woman.”
Computer users have many other needs like these, and some are complex. Regular expressions (abbreviated regex) are the tool for handling such tasks. C++ comes with a library called regex, so a C++ program that uses regular expressions should begin with:
#include <regex>
using namespace std;
This article explains Regular Expression Basics in C++.
Article Content
- Regular Expression Fundamentals
- Pattern
- Character Classes
- Matching Whitespaces
- The Period (.) in the Pattern
- Matching Repetitions
- Matching Alternation
- Matching Beginning or End
- Grouping
- The icase and multiline regex_constants
- Matching the Whole Target
- The match_results Object
- Position of Match
- Search and Replace
- Conclusion
Regular Expression Fundamentals
Regex
A string like “Here is my man.” above is the target sequence or target string or simply, target. “man”, which was searched for, is the regular expression, or simply, regex.
Matching
Matching is said to occur when the word or phrase that is being searched for is located. After matching, a replacement can take place. For example, after “man” is located above, it can be replaced by “woman”.
Simple Matching
The following program shows how the word “man” is matched.
#include <iostream>
#include <regex>
using namespace std;
int main()
{
regex reg("man");
if (regex_search("Here is my man.", reg))
cout << "matched" << endl;
else
cout << "not matched" << endl;
return 0;
}
The function regex_search() returns true if there is a match and false if there is none. Here, the function takes two arguments: the first is the target string, and the second is the regex object. The regex itself is "man", in double quotes. The first statement in the main() function forms the regex object: regex is a type, and reg is the regex object. The above program's output is "matched", as "man" is found in the target string. If "man" were not found in the target, regex_search() would return false, and the output would be "not matched".
The output of the following code is “not matched”:
if (regex_search("Here is my making.", reg))
cout << "matched" << endl;
else
cout << "not matched" << endl;
There is no match because the regex "man" is not found anywhere in the target string, "Here is my making."
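The same check can be wrapped in a small function that takes the target as a string object instead of a string literal. The following is only a minimal sketch; the name containsPattern is an assumption made for illustration and is not part of the regex library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the regex library): returns true if
// the pattern is found anywhere in the target.
bool containsPattern(const std::string& target, const std::string& pattern)
{
    std::regex reg(pattern);                // form the regex object from the pattern text
    return std::regex_search(target, reg);  // true if some part of the target matches
}
```

Called from main(), containsPattern("Here is my man.", "man") returns true, while containsPattern("Here is my making.", "man") returns false.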
Pattern
The regular expression “man” above is very simple; regexes are usually not that simple. Regular expressions have metacharacters, which are characters with special meanings. A metacharacter is a character about characters. The C++ regex metacharacters (in the default ECMAScript grammar) are:
^ $ \ . * + ? ( ) [ ] { } |
A regex, with or without metacharacters, is a pattern.
Character Classes
Square Brackets
A pattern can have characters within square brackets. With this, a particular position in the target string would match any of the square brackets’ characters. Consider the following targets:
"The cat is in the room."
"The bat is in the room."
"The rat is in the room."
The regex, [cbr]at would match cat in the first target. It would match bat in the second target. It would match rat in the third target. This is because, “cat” or “bat” or “rat” begins with ‘c’ or ‘b’ or ‘r’. The following code segment illustrates this:
regex reg("[cbr]at");
if (regex_search("The cat is in the room.", reg))
cout << "matched" << endl;
if (regex_search("The bat is in the room.", reg))
cout << "matched" << endl;
if (regex_search("The rat is in the room.", reg))
cout << "matched" << endl;
The output is:
matched
matched
matched
Range of Characters
The class [cbr] in the pattern [cbr]at matches any one of several possible characters in the target: ‘c’, ‘b’, or ‘r’. If the target does not have any of ‘c’, ‘b’, or ‘r’ followed by “at”, there is no match.
Some sets of possibilities form a range. The range of digits, 0 to 9, has 10 possibilities, and the pattern for it is [0-9]. The range of lowercase letters, a to z, has 26 possibilities, and the pattern for it is [a-z]. The range of uppercase letters, A to Z, has 26 possibilities, and the pattern for it is [A-Z]. The hyphen (-) is not officially a metacharacter, but within square brackets it indicates a range. So, the following produces a match:
if (regex_search("ID6id", regex("[0-9]")))
cout << "matched" << endl;
Note how the regex has been constructed as the second argument. The match occurs between the digit 6 in the range 0 to 9 and the 6 in the target, “ID6id”. The above code is equivalent to:
if (regex_search("ID6id", regex("[0123456789]")))
cout << "matched" << endl;
The following code produces a match:
char str[] = "ID6iE";
if (regex_search(str, regex("[a-z]")))
cout << "matched" << endl;
Note that the first argument here is a string variable and not the string literal. The match is between ‘i’ in [a-z] and ‘i’ in “ID6iE”.
Do not forget that a range is a class. There can be text to the right of the range or to the left of the range in the pattern. The following code produces a match:
if (regex_search("ID2id is an ID", regex("ID[0-9]id")))
cout << "matched" << endl;
The match is between “ID[0-9]id” and “ID2id”. The rest of the target string, “ is an ID”, is not matched in this situation.
In the subject of regular expressions, the word class actually means a set. That is, any one of the characters in the set is to match.
Note: The hyphen (-) is a metacharacter only within square brackets, where it indicates a range. It is not a metacharacter in the regex outside of the square brackets.
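The two kinds of class seen so far, a list of characters and a range, can be tried out with a pair of small functions. This is only a sketch; the function names hasIdCode and hasCbrAt are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the target contains "ID", one digit, then "id".
bool hasIdCode(const std::string& target)
{
    static const std::regex reg("ID[0-9]id");  // [0-9] matches exactly one digit
    return std::regex_search(target, reg);
}

// Hypothetical helper: true if the target contains "cat", "bat" or "rat".
bool hasCbrAt(const std::string& target)
{
    static const std::regex reg("[cbr]at");    // one of c, b, r, followed by "at"
    return std::regex_search(target, reg);
}
```

Here, hasIdCode("ID2id is an ID") returns true and hasIdCode("IDxid") returns false, because ‘x’ is not in the range 0 to 9. Likewise, hasCbrAt("The rat is in the room.") returns true, while hasCbrAt("The hat is here.") returns false.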
Negation
A class, including a range, can be negated. That is, none of the characters in the set (class) should match. This is indicated with the ^ metacharacter at the beginning of the class pattern, just after the opening square bracket. So, [^0-9] means matching a character, at the appropriate position in the target, that is not any character in the range 0 to 9 inclusive. So the following code will not produce a match:
if (regex_search("0123456789101112", regex("[^0-9]")))
cout << "matched" << endl;
else
cout << "not matched" << endl;
Every character in the target, “0123456789101112”, is a digit in the range 0 to 9, so no position holds a non-digit character, and there is no match.
The following code produces a match:
if (regex_search("ABCDEFGHIJ", regex("[^0-9]")))
cout << "matched" << endl;
No character in the target, “ABCDEFGHIJ”, is a digit, so the negated class matches (at the first character, ‘A’).
[^a-z] matches any character outside the range [a-z], and so [^a-z] is the negation of [a-z].
[^A-Z] matches any character outside the range [A-Z], and so [^A-Z] is the negation of [A-Z].
Other negations exist.
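A negated class is handy for checking whether a string holds anything other than digits. The following is a minimal sketch; the function name hasNonDigit is an assumption made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if at least one character of the target
// is NOT in the range 0 to 9.
bool hasNonDigit(const std::string& target)
{
    static const std::regex reg("[^0-9]");  // any single character that is not a digit
    return std::regex_search(target, reg);
}
```

With this, hasNonDigit("0123456789101112") returns false, while hasNonDigit("ABCDEFGHIJ") and hasNonDigit("123a456") both return true. An empty target returns false, since there is no character for the class to match.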
Matching Whitespaces
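The space character ‘ ’, \t, \r, \n, and \f are whitespace characters. In the following code, the regex “\n” matches ‘\n’ in the target:
if (regex_search("line one\nline two", regex("\n")))  // illustrative target with a newline
cout << "matched" << endl;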
Matching any Whitespace Character
The pattern or class to match any whitespace character is [ \t\r\n\f]. In the following code, ‘ ’ is matched:
if (regex_search("one two", regex("[ \t\r\n\f]")))  // illustrative target with a space
cout << "matched" << endl;
Matching any Non-whitespace Character
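The pattern or class to match any non-whitespace character is [^ \t\r\n\f]. The following code produces a match because the target contains non-whitespace characters (here, it has no whitespace at all):
if (regex_search("1234abcd", regex("[^ \t\r\n\f]")))  // illustrative target without whitespace
cout << "matched" << endl;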
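The whitespace class can be wrapped in a function, and the ECMAScript grammar also offers the shorthand \s for a whitespace character. The sketch below shows both; the function names are assumptions made for illustration. Note that in a C++ string literal the shorthand has to be typed as "\\s", so that the regex engine receives \s.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the target contains a space, tab,
// carriage return, newline or form feed.
bool hasWhitespace(const std::string& target)
{
    static const std::regex reg("[ \t\r\n\f]");  // explicit whitespace class
    return std::regex_search(target, reg);
}

// Hypothetical helper: the same idea using the \s shorthand.
bool hasWhitespaceShorthand(const std::string& target)
{
    static const std::regex reg("\\s");          // the regex engine sees \s
    return std::regex_search(target, reg);
}
```

Both hasWhitespace("one two") and hasWhitespaceShorthand("a\tb") return true; both functions return false for a target such as "onetwo".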
The period (.) in the Pattern
The period (.) in the pattern matches any character including itself, except \n, in the target. A match is produced in the following code:
if (regex_search("1234abcd", regex(".")))  // illustrative target
cout << "matched" << endl;
No matching results in the following code because the target is “\n”.
if (regex_search("\n", regex(".")))
cout << "matched" << endl;
else
cout << "not matched" << endl;
Note: Inside a character class with square brackets, the period has no special meaning.
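To match a literal period rather than any character, the period can be escaped with a backslash (or placed in a class, as [.]). The following sketch contrasts the two cases; the function names are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the target has at least one character
// that the period metacharacter can match (anything except a newline).
bool hasAnyNonNewlineChar(const std::string& target)
{
    static const std::regex reg(".");
    return std::regex_search(target, reg);
}

// Hypothetical helper: true if the target contains a literal period.
bool hasLiteralPeriod(const std::string& target)
{
    static const std::regex reg("\\.");  // the regex engine sees \. (an escaped period)
    return std::regex_search(target, reg);
}
```

Here, hasAnyNonNewlineChar("1234") returns true and hasAnyNonNewlineChar("\n") returns false. hasLiteralPeriod("end.") returns true, while hasLiteralPeriod("end") returns false.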
Matching Repetitions
A character or a group of characters can occur more than once within the target string. A pattern can match this repetition. The metacharacters, ?, *, +, and {} are used to match the repetition in the target. If x is a character of interest in the target string, then the metacharacters have the following meanings:
x*: means match 'x' 0 or more times, i.e., any number of times
x+: means match 'x' 1 or more times, i.e., at least once
x?: means match 'x' 0 or 1 time
x{n,}: means match 'x' at least n times. Note the comma.
x{n}: means match 'x' exactly n times
x{n,m}: means match 'x' at least n times, but not more than m times.
These metacharacters are called quantifiers.
Illustrations
*
The * matches the preceding character or preceding group, zero or more times. “o*” matches ‘o’ in “dog” of the target string. It also matches “oo” in “book” and “looking”. The regex “o*” matches “oooo” in “The animal booooed.”. Note: “o*” also matches in “dig”, where ‘o’ occurs zero times.
+
The + matches the preceding character or preceding group, 1 or more times. Contrast it with zero or more times for *. So the regex, “e+” matches ‘e’ in “eat”, where ‘e’ occurs one time. “e+” also matches “ee” in “sheep”, where ‘e’ occurs more than one time. Note: “e+” will not match “dig” because in “dig”, ‘e’ does not occur at least once.
?
The ? matches the preceding character or preceding group, 0 or 1 time (and not more). So, “e?” matches “dig” because ‘e’ occurs in “dig” zero times. “e?” matches “set” because ‘e’ occurs in “set” one time. Note: “e?” still matches “sheep”, though there are two ‘e’s in “sheep”. There is a nuance here – see later.
{n,}
This matches at least n consecutive repetitions of a preceding character or preceding group. So the regex, “e{2,}” matches the two ‘e’s in the target, “sheep”, and the three ‘e’s in the target “sheeep”. “e{2,}” does not match “set”, because “set” has only one ‘e’.
{n}
This matches exactly n consecutive repetitions of a preceding character or preceding group. So the regex, “e{2}” matches the two ‘e’s in the target, “sheep”. “e{2}” does not match “set” because “set” has only one ‘e’. Well, “e{2}” matches two ‘e’s in the target, “sheeep”. There is a nuance here – see later.
{n,m}
This matches several consecutive repetitions of a preceding character or preceding group, anywhere from n to m, inclusive. So, “e{1,3}” matches nothing in “dig”, which has no ‘e’. It matches the one ‘e’ in “set”, the two ‘e’s in “sheep”, the three ‘e’s in “sheeep”, and three ‘e’s in “sheeeep”. There is a nuance at the last match – see later.
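One way to see exactly which characters a quantifier takes is to print the matched substring instead of just "matched". The following is a minimal sketch under stated assumptions: the function name firstMatch and the "<none>" marker are made up for illustration, and smatch is the match_results type used with string objects.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring that the pattern matches
// in the target, or "<none>" if regex_search() finds no match at all.
std::string firstMatch(const std::string& target, const std::string& pattern)
{
    std::smatch m;  // receives the match found in a std::string target
    if (std::regex_search(target, m, std::regex(pattern)))
        return m.str(0);  // the whole matched text (may be empty)
    return "<none>";
}
```

With this, firstMatch("sheeep", "e{2}") returns "ee" and firstMatch("sheeeep", "e{1,3}") returns "eee": the match succeeds, but it covers only part of the run of ‘e’s. firstMatch("dig", "e+") returns "<none>". For firstMatch("dog", "o*"), the result is an empty string: zero occurrences is acceptable to *, so the search succeeds immediately at the first position, before ‘d’, without consuming any character.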
Matching Alternation
Consider the following target string in the computer.
“The farm has pigs of different sizes.”
The programmer may want to know if this target has “goat” or “rabbit” or “pig”. The code would be as follows:
char str[] = "The farm has pigs of different sizes.";
if (regex_search(str, regex("goat|rabbit|pig")))
cout << "matched" << endl;
else
cout << "not matched" << endl;
The code produces a match. Note the use of the alternation character, |. There can be two, three, four, or more alternatives. At each character position in the target string, C++ first tries to match the first alternative, “goat”. If that fails, it tries the next alternative, “rabbit”; if that fails, it tries the next alternative, “pig”. If “pig” also fails, C++ moves on to the next position in the target and starts with the first alternative again.
In the above code, “pig” is matched.
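To find out which alternative was actually matched, the matched text can be returned instead of a yes/no answer. This is only a sketch; the function name firstAnimal is an assumption made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first of "goat", "rabbit" or "pig"
// found in the target, or an empty string if none is present.
std::string firstAnimal(const std::string& target)
{
    static const std::regex reg("goat|rabbit|pig");
    std::smatch m;
    if (std::regex_search(target, m, reg))
        return m.str(0);  // the alternative that matched
    return "";
}
```

For the farm sentence above, firstAnimal() returns "pig". For "The rabbit, the goat, the pig!", it returns "rabbit": the earliest position in the target wins, even though “goat” is listed first in the regex.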
Matching Beginning or End
Beginning
If ^ is at the beginning of the regex, then the beginning text of the target string can be matched by the regex. In the following code, the start of the target is “abc”, which is matched:
if (regex_search("abc and def", regex("^abc")))  // illustrative target starting with "abc"
cout << "matched" << endl;
No matching takes place in the following code:
if (regex_search("Yes, abc and def", regex("^abc")))  // illustrative target
cout << "matched" << endl;
else
cout << "not matched" << endl;
Here, “abc” is not at the beginning of the target.
Note: The circumflex character, ‘^’, is a metacharacter at the start of the regex, matching the start of the target string. It is still a metacharacter at the start of the character class, where it negates the class.
End
If $ is at the end of the regex, then the ending text of the target string can be matched by the regex. In the following code, the end of the target is “xyz”, which is matched:
if (regex_search("uvw and xyz", regex("xyz$")))  // illustrative target ending with "xyz"
cout << "matched" << endl;
No matching takes place in the following code:
if (regex_search("uvw and xyz final", regex("xyz$")))  // illustrative target
cout << "matched" << endl;
else
cout << "not matched" << endl;
Here, “xyz” is not at the end of the target.
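The two anchors can be tried side by side with a pair of small functions. The following is a minimal sketch; the function names and the target strings in the usage note are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the target begins with "abc".
bool startsWithAbc(const std::string& target)
{
    static const std::regex reg("^abc");  // ^ anchors the match to the start
    return std::regex_search(target, reg);
}

// Hypothetical helper: true only if the target ends with "xyz".
bool endsWithXyz(const std::string& target)
{
    static const std::regex reg("xyz$");  // $ anchors the match to the end
    return std::regex_search(target, reg);
}
```

Here, startsWithAbc("abc and def") returns true and startsWithAbc("Yes, abc and def") returns false. Similarly, endsWithXyz("uvw and xyz") returns true and endsWithXyz("uvw and xyz final") returns false.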
Grouping
Parentheses can be used to group characters in a pattern. Consider the following regex:
"a concert (pianist)"
The group here is “pianist”, surrounded by the metacharacters ( and ). It is actually a sub-group, while “a concert (pianist)” is the whole group.
Similarly, if the parentheses enclose more text, as in (pianist is good), the sub-group or sub-string is “pianist is good”.
Sub-strings with Common Parts
A bookkeeper is a person who takes care of books. Imagine a library with a bookkeeper and a bookshelf. Assume that one of the following target strings is in the computer:
"Here is the bookkeeper.";
"The bookkeeper works with the bookshelf.";
Assume that the programmer is not interested in knowing which of these sentences is in the computer, but only in whether “bookshelf” or “bookkeeper” is present in whatever target string is there. In this case, the regex can be:
"bookshelf|bookkeeper"
using alternation.
Notice that “book”, which is common to both words, has been typed twice, once in each of the two words in the pattern. To avoid typing “book” twice, the regex would be better written as:
"book(shelf|keeper)"
Here, the group is “shelf|keeper”. The alternation metacharacter is still used, but not for the two long words; it is used for the two ending parts of the two long words. C++ treats a group as an entity, so C++ will look for “shelf” or “keeper” immediately after “book”. The output of the following code is “matched”:
char str[] = "The library has a bookshelf that is admired.";  // illustrative target
if (regex_search(str, regex("book(shelf|keeper)")))
cout << "matched" << endl;
“bookshelf”, and not “bookkeeper”, has been matched.
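A group also records the text it matched, so the program can tell which ending followed “book”. The sketch below assumes a function name, bookSuffix, made up for illustration; index 1 of the match results refers to the first (and here only) group.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns "shelf" or "keeper", whichever follows the
// first matching "book" in the target, or an empty string if neither does.
std::string bookSuffix(const std::string& target)
{
    static const std::regex reg("book(shelf|keeper)");
    std::smatch m;
    if (std::regex_search(target, m, reg))
        return m.str(1);  // text matched by the group in parentheses
    return "";
}
```

With this, bookSuffix("Here is the bookkeeper.") returns "keeper". For "The bookkeeper works with the bookshelf.", it also returns "keeper", since “bookkeeper” comes first in the target. For a target such as "A notebook.", it returns an empty string.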
The icase and multiline regex_constants
icase
Matching is case sensitive by default. However, it can be made case insensitive. To achieve this, use the regex::icase constant, as in the following code:
if (regex_search("Feedback", regex("feed", regex::icase)))
cout << "matched" << endl;
The output is “matched”. So “Feedback” with uppercase ‘F’ has been matched by “feed” with lowercase ‘f’. “regex::icase” has been made the second argument of the regex() constructor. Without that, the statement would not produce a match.
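The case-insensitive search can be wrapped in a function that accepts any pattern. This is only a sketch; the function name containsIgnoreCase is an assumption made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the pattern occurs in the target,
// ignoring the difference between uppercase and lowercase letters.
bool containsIgnoreCase(const std::string& target, const std::string& pattern)
{
    std::regex reg(pattern, std::regex::icase);  // icase is the second constructor argument
    return std::regex_search(target, reg);
}
```

Here, containsIgnoreCase("Feedback", "feed") returns true, whereas a plain regex_search("Feedback", regex("feed")) returns false because of the uppercase ‘F’.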
Multiline
Consider the following code:
char str[] = "line 1\nline 2\nline 3";  // illustrative multiline target
if (regex_search(str, regex("^.*$")))
cout << "matched" << endl;
else
cout << "not matched" << endl;
The output is “not matched”. By default, the regex “^.*$” has to match the target string from its beginning to its end. “.*” means any character except \n, zero or more times. So, because of the newline characters (\n) in the target, there is no match.
The target is a multiline string. The constant regex::multiline (available since C++17) does not make ‘.’ match the newline character; instead, it makes ^ match at the beginning of each line and $ match at the end of each line. To use it, make regex::multiline the second argument of the regex() constructor. With it, “^.*$” matches the first line of the target, so the output of the following code is “matched”:
if (regex_search(str, regex("^.*$", regex::multiline)))
cout << "matched" << endl;
else
cout << "not matched" << endl;
Matching the Whole Target String
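To match the whole target string, the regex_match() function can be used. Unlike regex_search(), which succeeds if any part of the target matches, regex_match() succeeds only if the regex matches the entire target. The following code illustrates this, for a target that does not have the newline character (\n):
char str[] = "first second third";  // illustrative target
if (regex_match(str, regex(".*second.*")))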
cout << "matched" << endl;
There is a match here. However, note that the regex matches the whole target string, and the target string does not have any ‘\n’.
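The difference between the two functions shows up when the same regex is given to both. The following is a minimal sketch; the function names matchesWhole and matchesPart are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the pattern matches the ENTIRE target.
bool matchesWhole(const std::string& target, const std::string& pattern)
{
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: true if the pattern matches ANY part of the target.
bool matchesPart(const std::string& target, const std::string& pattern)
{
    return std::regex_search(target, std::regex(pattern));
}
```

For the target "first second third", matchesPart() returns true with the pattern "second", but matchesWhole() returns false with it, because “second” is only part of the target. matchesWhole() returns true with ".*second.*", which covers the target from beginning to end.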
The match_results Object
The regex_search() function can take an argument in between the target and the regex object. This argument is the match_results object. The whole matched (part) string and the sub-strings matched can be obtained from it. This object is a special array with methods. For targets that are string literals or character arrays, the match_results type is cmatch; for string objects, it is smatch.
Obtaining Matches
Consider the following code:
char str[] = "The woman you were looking for!";  // illustrative target
cmatch m;
if (regex_search(str, m, regex("w.m.n")))
cout << m[0] << endl;
The target string has the word “woman”. The output is “woman”, which corresponds to the regex, “w.m.n”. At index zero, the special array holds the only match, which is “woman”.
With class options, only the first sub-string found in the target, is sent to the special array. The following code illustrates this:
if (regex_search("The rat, the cat, the bat!", m, regex("[bcr]at")))
cout << m[0] << endl;
cout << m[1] << endl;
cout << m[2] << endl;
The output is “rat” from index zero. m[1] and m[2] are empty.
With alternatives, only the first sub-string found in the target, is sent to the special array. The following code illustrates this:
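if (regex_search("The rabbit, the goat, the pig!", m, regex("goat|rabbit|pig")))  // illustrative target
cout << m[0] << endl;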
cout << m[1] << endl;
cout << m[2] << endl;
The output is “rabbit” from index zero. m[1] and m[2] are empty.
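Since regex_search() reports only the first match, collecting every match needs a loop. One possible way, shown here only as a sketch, is the library's sregex_iterator, which steps through successive matches in a string object; the function name allMatches is an assumption made for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping substring of the
// target that the pattern matches, in order of appearance.
std::vector<std::string> allMatches(const std::string& target, const std::string& pattern)
{
    std::regex reg(pattern);  // must stay alive while the iterator is in use
    std::vector<std::string> found;
    // A default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(target.begin(), target.end(), reg), end; it != end; ++it)
        found.push_back(it->str());  // whole text of the current match
    return found;
}
```

For the target "The rat, the cat, the bat!" and the pattern "[bcr]at", the returned vector holds "rat", "cat" and "bat", in that order.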
Groupings
When groups are involved, the complete pattern matched, goes into cell zero of the special array. The next sub-string found goes into cell 1; the sub-string following, goes into cell 2; and so on. The following code illustrates this:
if (regex_search("Best bookseller today!", m, regex("book((sel)(ler))")))
cout << m[0] << endl;
cout << m[1] << endl;
cout << m[2] << endl;
cout << m[3] << endl;
The output is:
bookseller
seller
sel
ler
Note that the group (seller) comes before the group (sel).
Position of Match
The position of match for each sub-string in the cmatch array can be known. Counting begins from the first character of the target string, at position zero. The following code illustrates this:
if (regex_search("Best bookseller today!", m, regex("book((sel)(ler))")))
cout << m[0] << "->" << m.position(0) << endl;
cout << m[1] << "->" << m.position(1) << endl;
cout << m[2] << "->" << m.position(2) << endl;
cout << m[3] << "->" << m.position(3) << endl;
Note the use of the position() member function, with the cell index as an argument. The output is:
bookseller->5
seller->9
sel->9
ler->12
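The same information can be gathered in a loop over all the cells, using the size() member function of the match results. The following is a minimal sketch for string-object targets; the function name groupPositions is an assumption made for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for the first match in the target, returns each cell's
// text paired with its starting position (cell 0 is the whole match).
std::vector<std::pair<std::string, long>>
groupPositions(const std::string& target, const std::string& pattern)
{
    std::vector<std::pair<std::string, long>> out;
    std::smatch m;
    if (std::regex_search(target, m, std::regex(pattern)))
        for (std::size_t i = 0; i < m.size(); ++i)  // size() = whole match + groups
            out.emplace_back(m.str(i), static_cast<long>(m.position(i)));
    return out;
}
```

For the target "Best bookseller today!" and the pattern "book((sel)(ler))", the result has four entries: ("bookseller", 5), ("seller", 9), ("sel", 9) and ("ler", 12).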
Search and Replace
A new word or phrase can replace the match. The regex_replace() function is used for this. This time, the target in which the replacement occurs is a string object, not a string literal, so the string library has to be included in the program. Illustration:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str = "Here, comes my man. There goes your man.";
string newStr = regex_replace(str, regex("man"), "woman");
cout << newStr << endl;
return 0;
}
The regex_replace() function, as coded here, replaces all the matches. The first argument of the function is the target, the second is the regex object, and the third is the replacement string. The function returns a new string, which is a copy of the target with the replacements made. The output is:
Here, comes my woman. There goes your woman.
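If only the first match should be replaced, regex_replace() accepts a fourth argument, the flag regex_constants::format_first_only. The following is a minimal sketch of that variation; the function name replaceFirst is an assumption made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of the target in which only the
// FIRST match of the pattern has been replaced.
std::string replaceFirst(const std::string& target,
                         const std::string& pattern,
                         const std::string& replacement)
{
    return std::regex_replace(target, std::regex(pattern), replacement,
                              std::regex_constants::format_first_only);
}
```

With the target "Here, comes my man. There goes your man.", replaceFirst(target, "man", "woman") returns "Here, comes my woman. There goes your man."; the second “man” is left untouched.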
Conclusion
Regular expressions use patterns to match substrings in the target string. Patterns have metacharacters. The commonly used functions for C++ regular expressions are regex_search(), regex_match(), and regex_replace(). A regex is a pattern in double quotes. However, these functions take a regex object as an argument, not just the pattern text, so the regex must be made into a regex object before these functions can use it.