Special Symbols and Characters for REs

We will now introduce the most popular metacharacters, the special characters and symbols that give regular expressions their power and flexibility. You will find the most common of these symbols and characters in Table 15.1.
Matching more than one RE pattern with alternation ( | )

The pipe symbol ( | ), a vertical bar on your keyboard, indicates an alternation operation: it chooses among the regular expressions that are separated by the pipe symbol. For example, below are some patterns which employ alternation, along with the strings they match:
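As a hedged illustration of alternation, the following sketch uses Python's re module. The patterns and candidate strings are our own choices for demonstration, and re.fullmatch() is used so that a string counts only when it matches one alternative in its entirety.

```python
import re

# Illustrative patterns (our own picks); each alternative is a complete RE.
patterns = {
    "at|home": ["at", "home", "work"],
    "bat|bet|bit": ["bat", "bet", "bit", "but"],
}

results = {}
for pattern, candidates in patterns.items():
    # Keep only the candidates that match one of the alternatives exactly.
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]

print(results)
```

Strings such as "work" and "but" are rejected because they are not listed among the alternatives.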
With this one symbol, we have just increased the flexibility of our regular expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.

Matching any single character ( . )

The dot or period ( . ) symbol matches any single character except for NEWLINE. (Python REs have a compilation flag, S or DOTALL, which overrides this so that the dot also matches NEWLINEs.) Whether letter, number, whitespace other than "\n", printable, non-printable, or a symbol, the dot can match them all.
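A small sketch of the dot's behavior follows. The pattern "f.o" and the sample strings are assumptions made for illustration; re.DOTALL is the long name of the S flag mentioned above.

```python
import re

# "f.o" requires exactly one character, of any kind, between "f" and "o".
candidates = ["foo", "f9o", "f#o", "f\no", "fo"]
dot_hits = [s for s in candidates if re.fullmatch("f.o", s)]

# By default the dot skips NEWLINE; the DOTALL (S) flag lifts that restriction.
newline_default = re.fullmatch("f.o", "f\no")
newline_dotall = re.fullmatch("f.o", "f\no", re.DOTALL)

print(dot_hits)                      # ['foo', 'f9o', 'f#o']
print(newline_default is None)       # True
print(newline_dotall is not None)    # True
```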
Matching from the beginning or end of strings or word boundaries ( ^ / $ )

There are also symbols and related special characters to specify searching for patterns at the beginning and ending of strings. To match a pattern starting from the beginning, you must use the caret symbol ( ^ ) or the special character \A (backslash-capital "A"). The latter is useful for keyboards which do not have the caret symbol, e.g., some international keyboards; note also that the two differ under the M or MULTILINE flag, where ^ additionally matches just after each NEWLINE while \A still matches only at the very start of the string. Similarly, the dollar sign ( $ ) or \Z will match a pattern from the end of a string.

Patterns which use these symbols differ from most of the others we describe in this chapter since they dictate location or position. In the Core Note above, we noted that a distinction is made between "matching," attempting matches of entire strings starting at the beginning, and "searching," attempting matches from anywhere within a string. Because we are looking specifically at symbols and special characters which deal with position, they make sense only when applied to searching. That said, here are some examples of "edge-bound" RE search patterns:
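The anchors can be tried out with re.search(), as in the sketch below. The sample strings are invented for illustration; the last two lines show, under the assumption that the MULTILINE flag is in use, how ^ and \A diverge.

```python
import re

# ^ anchors the pattern at the start of the string, $ at the end.
starts = bool(re.search(r"^From", "From here to there"))
not_at_start = re.search(r"^From", "Sent From here")
ends = bool(re.search(r"/bin/tcsh$", "shell is /bin/tcsh"))
exact = bool(re.search(r"^Subject: hi$", "Subject: hi"))

# Under MULTILINE, ^ also matches right after a NEWLINE; \A never does.
text = "first\nFrom second"
caret_multiline = bool(re.search(r"^From", text, re.MULTILINE))
backslash_a_multiline = bool(re.search(r"\AFrom", text, re.MULTILINE))

print(starts, not_at_start, ends, exact)       # True None True True
print(caret_multiline, backslash_a_multiline)  # True False
```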
Again, if you want to match either (or both) of these characters verbatim, you must use an escaping backslash. For example, if you wanted to match any string which ended with a dollar sign, one possible RE solution would be the pattern ".*\$$".

The \b and \B special characters match the empty string, meaning that they consume no characters and only test a position. The difference is that \b matches at a word boundary, meaning the beginning or end of a word, whether there are any characters in front of it (word in the middle of a string) or not (word at the beginning of a line). Likewise, \B will match a pattern only if it appears in the middle of a word (i.e., not at a word boundary). Here are some examples:
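The following sketch exercises \b, \B, and an escaped dollar sign. The sample strings are our own; raw strings (the r prefix) are used because in an ordinary Python string "\b" is the backspace character, which the final check demonstrates.

```python
import re

samples = ["the", "then", "bite the dog", "bitethe dog", "other"]

# \bthe   : "the" beginning at a word boundary
starts_word = [s for s in samples if re.search(r"\bthe", s)]
# \bthe\b : "the" as a complete word
whole_word = [s for s in samples if re.search(r"\bthe\b", s)]
# \Bthe   : "the" that does not begin at a word boundary
inside_word = [s for s in samples if re.search(r"\Bthe", s)]

# .*\$$ : any string that ends with a literal dollar sign
ends_with_dollar = bool(re.search(r".*\$$", "price in $"))
no_trailing_dollar = re.search(r".*\$$", "cost $5")

# Without a raw string, "\b" is a backspace, so this pattern finds nothing.
backspace_pattern = re.search("\bthe", "the")

print(starts_word)   # ['the', 'then', 'bite the dog']
print(whole_word)    # ['the', 'bite the dog']
print(inside_word)   # ['bitethe dog', 'other']
```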
Creating character classes ( [ ] )

While the dot is good for allowing matches of any symbol, there may be occasions where there are specific characters you want to match. For this reason, the bracket symbols ( [ ] ) were invented. A bracketed expression matches any one of the enclosed characters. Here are some examples:
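A brief sketch of character classes follows. The pattern "b[aeiu]t" and the test strings are illustrative assumptions; "[cr][23][dp][o2]" is the RE discussed in the text, and the output shows that it accepts more than the two intended strings.

```python
import re

# b[aeiu]t: "b", one character from the set, then "t".
vowel_words = ["bat", "bet", "bit", "but", "bot"]
vowel_hits = [s for s in vowel_words if re.fullmatch("b[aeiu]t", s)]

# [cr][23][dp][o2]: each bracket pair picks one character independently.
droid_pattern = "[cr][23][dp][o2]"
droid_words = ["r2d2", "c3po", "c2do", "r3p2", "r2d3"]
droid_hits = [s for s in droid_words if re.fullmatch(droid_pattern, s)]

print(vowel_hits)   # ['bat', 'bet', 'bit', 'but']
print(droid_hits)   # ['r2d2', 'c3po', 'c2do', 'r3p2']
```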
One side note regarding the RE "[cr][23][dp][o2]": a more restrictive version of this RE would be required to allow only "r2d2" or "c3po" as valid strings. Because each pair of brackets merely implies a "logical OR" over single characters, it is not possible to use brackets to enforce such a requirement. The only solution is to use the pipe, as in "r2d2|c3po".

For single-character REs, though, the pipe and brackets are equivalent. For example, let's start with the regular expression "ab", which matches only the string with an "a" followed by a "b". If we wanted a one-letter string, i.e., either "a" or "b", we could use the RE "[ab]". Because "a" and "b" are individual strings, we can also choose the RE "a|b". However, if we wanted to match either the string "ab" or the string "cd", we cannot use the brackets because they work only for single characters. In this case, the only solution is "ab|cd", similar to the "r2d2|c3po" problem just mentioned.

Denoting ranges ( - ) and negation ( ^ )

In addition to single characters, the brackets also support ranges of characters. A hyphen between a pair of symbols enclosed in brackets is used to indicate a range of characters, e.g., A-Z, a-z, or 0-9 for uppercase letters, lowercase letters, and numeric digits, respectively. The range follows character-code order, so you are not restricted to using just alphanumeric characters. Additionally, if a caret ( ^ ) is the first character immediately inside the open left bracket, this symbolizes a directive to not match any of the characters in the given character set.
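The points above can be sketched in a few lines. The range and negation patterns and their test strings are our own examples rather than the original table's.

```python
import re

# The pipe admits only the two exact strings; brackets alone admit mixtures.
strict = [s for s in ["r2d2", "c3po", "c2do"] if re.fullmatch("r2d2|c3po", s)]

# [ab] and a|b are interchangeable for single characters.
same_single = [bool(re.fullmatch("[ab]", c)) == bool(re.fullmatch("a|b", c))
               for c in "abc"]

# z.[0-9]: "z", any one character, then one digit.
range_hit = bool(re.fullmatch("z.[0-9]", "zk7"))
# [r-u][env-y][us]: ranges and single characters can be mixed in one class.
mixed_hit = bool(re.fullmatch("[r-u][env-y][us]", "sys"))
# [^aeiou]: any one character that is not a lowercase vowel.
non_vowels = re.findall("[^aeiou]", "regex")

print(strict)       # ['r2d2', 'c3po']
print(non_vowels)   # ['r', 'g', 'x']
```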
Multiple occurrence/repetition using closure operators ( *, +, ?, { } )

We will now introduce the most common RE notations, namely, the special symbols *, +, and ?, all of which can be used to match single, multiple, or no occurrences of string patterns. The asterisk or star operator ( * ) will match zero or more occurrences of the RE immediately to its left (in language and compiler theory, this operation is known as the Kleene Closure). The plus operator ( + ) will match one or more occurrences of an RE (known as Positive Closure), and the question mark operator ( ? ) will match exactly 0 or 1 occurrences of an RE.

There are also brace operators ( { } ) with either a single value or a comma-separated pair of values. These indicate a match of exactly N occurrences (for {N}) or a range of occurrences, i.e., {M,N} will match from M to N occurrences. These symbols may also be escaped with the backslash, i.e., "\*" matches the asterisk, etc.

Finally, the question mark ( ? ) is overloaded so that if it follows any of the preceding repetition symbols, it directs the regular expression engine to match as few repetitions as possible. Here are some examples using the closure operators:
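A hedged sketch of the closure operators follows; all patterns and sample strings are chosen by us for illustration. The last two lines contrast the default greedy behavior with the non-greedy form produced by a trailing question mark.

```python
import re

# ab* allows zero or more "b"; ab+ needs at least one.
star_ok = bool(re.fullmatch("ab*", "a"))
plus_fails = re.fullmatch("ab+", "a")

# [dn]ot?: "d" or "n", then "o", then at most one "t".
words = ["do", "no", "dot", "not", "dott"]
optional_t = [s for s in words if re.fullmatch("[dn]ot?", s)]

# [0-9]{3}: exactly three digits; [0-9]{2,4}: from two to four digits.
three_digits = bool(re.fullmatch("[0-9]{3}", "123"))
digit_runs = ["1", "12", "1234", "12345"]
two_to_four = [s for s in digit_runs if re.fullmatch("[0-9]{2,4}", s)]

# Greedy (.+) takes as much as it can; non-greedy (.+?) as little as it can.
greedy = re.search("a.+c", "abcabc").group()
lazy = re.search("a.+?c", "abcabc").group()

print(optional_t)     # ['do', 'no', 'dot', 'not']
print(two_to_four)    # ['12', '1234']
print(greedy, lazy)   # abcabc abc
```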
Special characters representing character sets

We also mentioned that there are special characters which may represent character sets. Rather than using a range of "0-9", you may simply use "\d" to indicate the match of any decimal digit. Another special character, "\w", can be used to denote the entire alphanumeric character class, serving as a shortcut for "[A-Za-z0-9_]", and "\s" stands for whitespace characters. The uppercase versions of these special characters symbolize a non-match, i.e., "\D" matches any character that is not a decimal digit (same as "[^0-9]"), etc. Using these shortcuts, we will present a few more complex examples:
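The shortcuts can be combined as in the sketch below. The phone-number and e-mail shapes are deliberately simplified illustrations of our own, not validation-quality patterns.

```python
import re

# \w+-\d+ : word characters, a hyphen, then digits.
id_hit = bool(re.fullmatch(r"\w+-\d+", "abc-123"))
# \d{3}-\d{3}-\d{4} : a U.S.-style phone number layout.
phone_hit = bool(re.fullmatch(r"\d{3}-\d{3}-\d{4}", "800-555-1212"))
# \w+@\w+\.com : a very simple address shape (far from a full validator).
email_hit = bool(re.fullmatch(r"\w+@\w+\.com", "user@example.com"))

# \D is the complement of \d; \s matches whitespace.
non_digits = re.findall(r"\D", "a1 b2")
spaces = re.findall(r"\s", "a1 b2")

print(id_hit, phone_hit, email_hit)   # True True True
print(non_digits)                     # ['a', ' ', 'b']
print(spaces)                         # [' ']
```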
Note that all special characters, including the ones mentioned before such as "\A," "\B," "\d," etc., may or may not have ASCII equivalents (for example, in an ordinary Python string "\b" is the backspace character). To be sure you are using the regular expression versions, it would be a safe bet to use raw strings to escape backslash functionality (see the Core Note later in this chapter). Also, the "\w" and "\W" alphanumeric character sets are affected by the L or LOCALE compilation flag and, in Python 1.6 and newer, by Unicode flags.

Designating groups with parentheses ( ( ) )

Now, perhaps we have achieved the goal of matching a string and discarding non-matches, but in some cases, we may also be more interested in the data that we did match. Not only do we want to know whether the entire string matched our criteria, but also whether we can extract any specific strings or substrings which were part of a successful match. The answer is yes. To accomplish this, surround any RE with a pair of parentheses.

A pair of parentheses ( ( ) ) can accomplish either (or both) of the following when used with regular expressions: grouping regular expressions, and matching subgroups.
One good example for wanting to group regular expressions is when you have two different REs with which you want to compare a string. Another reason is to group an RE in order to use a repetition operator on the entire RE (as opposed to individual characters or character classes).

One side-effect of using parentheses is that the substring which matched the pattern is saved for future use. These subgroups can be recalled for the same match or search, or extracted for post-processing.

Why are matches of subgroups important? The main reason is that there are times where you want to extract the patterns you match, in addition to making a match. For example, what if we decided to match the pattern "\w+-\d+" but wanted to save the alphanumeric first part and the numeric second part individually? This may be desired because with any successful match, we may want to see just what those strings were that matched our RE patterns.

If we add parentheses to both subpatterns, i.e., "(\w+)-(\d+)", then we can access each of the matched subgroups individually. Subgrouping is preferred because the alternative is to write code to determine we have a match, then execute another separate routine (which we also had to create) to parse the entire match just to extract both parts. Why not let Python do it, since it is a supported feature of the re module, instead of "reinventing the wheel"?
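Both uses of parentheses can be sketched briefly. The pattern "(\w+)-(\d+)" comes from the text; the input string "abc-123" and the "(ab)+" repetition example are assumptions added for illustration.

```python
import re

# Subgroups: each parenthesized part of the match is saved separately.
m = re.match(r"(\w+)-(\d+)", "abc-123")
whole = m.group()      # the entire match
prefix = m.group(1)    # first subgroup
number = m.group(2)    # second subgroup
both = m.groups()      # all subgroups as a tuple

# Grouping: the + applies to the whole group "ab", not just to "b".
grouped = bool(re.fullmatch(r"(ab)+", "ababab"))
ungrouped = re.fullmatch(r"ab+", "ababab")

print(whole, prefix, number)   # abc-123 abc 123
print(both)                    # ('abc', '123')
print(grouped, ungrouped)      # True None
```

Retrieving the parts with group() and groups() avoids writing a second parsing routine for the matched text.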
© 2002, O'Reilly & Associates, Inc. |