|
The Regular Expressions Primer is a tutorial for those
completely new to regular expressions. To familiarize you with
regular expressions, this primer starts with the simple building
blocks of the syntax and, through examples, builds up to
expressions useful for solving real everyday problems, including
searching for and replacing text.
A regular expression is often called a "regex", "rx" or "re".
This primer uses the terms "regular expression" and "regex".
Unless otherwise stated, the examples in this primer are
generic, and will apply to most programming languages and tools.
However, each language and tool has its own implementation of
regular expressions, so quoting conventions, metacharacters,
special sequences, and modifiers may vary (e.g. Perl, Python,
grep, sed, and Vi have slight variations on standard regex
syntax). Consult the regular expression documentation for your
language or application for details.
Regular expressions are a syntactical shorthand for describing
patterns. They are used to find text that matches a pattern, and
to replace matched strings with other strings. They can be used
to parse files and other input, or to provide a powerful way to
search and replace. Here's a short example in Python:
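import re
# Match whole words that start with "w", ignoring case.
n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
# findall() returns every non-overlapping match as a list of strings.
print(n.findall('will match all words beginning with the letter w.'))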
Here's a more advanced regular expression from the Python
Tutorial:
# Generate statement parsing regexes.
# NOTE: `cgs` is not defined in this excerpt; it is assumed to be a
# sequence of (open, close) comment-delimiter string pairs.
# Raw strings (r'...') keep the backslashes intact for the regex engine.
stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
         r'#\s*(?P<op>else|endif)',
         r'#\s*(?P<op>error)\s+(?P<error>.*?)',
         r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
         r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
patterns = [r'^\s*%s\s*%s\s*%s\s*$'
            % (re.escape(cg[0]), stmt, re.escape(cg[1]))
            for cg in cgs for stmt in stmts]
stmtRes = [re.compile(p) for p in patterns]
Komodo can accept Python syntax regular expressions in its
various Search features.
Komodo IDE's Rx Toolkit can help you build and test regular
expressions. See Using Rx Toolkit for more
information.
Regular expressions can be used to find a particular pattern,
or to find a pattern and replace it with something else
(substitution). Since the syntax is the same for
the "find" part of the regex, we'll start with matching.
The simplest type of regex is a literal match. Letters,
numbers and most symbols in the expression will match themselves
in the text being searched; an "a" matches an "a", "cat"
matches "cat", "123" matches "123" and so on. For example:
Example: Search for the string "at".
-
Regex:
at
-
Matches:
at
-
Doesn't Match:
it
a-t
At
Note: Regular expressions are case sensitive
unless a modifier is used.
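The literal match above can be tried out in Python. This is a minimal sketch: the variable names are illustrative, and `re.search()` is used because it reports a match anywhere in the string.

```python
import re

# Literal pattern: each character matches itself, and case matters by default.
pattern = re.compile(r"at")

# Sample strings taken from the example above.
samples = ["at", "it", "a-t", "At"]

# re.search() returns a match object if the pattern occurs anywhere in the string.
literal_hits = [s for s in samples if pattern.search(s)]
print(literal_hits)  # ['at']
```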
Regex characters that perform a special function instead of
matching themselves literally are called "metacharacters". One
such metacharacter is the dot ".", or wildcard. When used in a
regular expression, "." can match any single character.
Using "." to match any character.
Example: Using '.' to find certain types of
words.
-
Regex:
t...s
-
Matches:
trees
trams
teens
-
Doesn't Match:
trucks
trains
beans
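As a rough Python sketch of the wildcard example above (the list name `words` is an assumption made for illustration):

```python
import re

# Each "." stands for exactly one character, so "t...s" needs
# a "t", any three characters, and then an "s".
pattern = re.compile(r"t...s")

words = ["trees", "trams", "teens", "trucks", "trains", "beans"]
dot_hits = [w for w in words if pattern.search(w)]
print(dot_hits)  # ['trees', 'trams', 'teens']
```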
Many non-alphanumeric characters, like the "." mentioned
above, are treated as special characters with specific functions
in regular expressions. These special characters are called
metacharacters. To search for a literal occurrence of a
metacharacter (i.e. ignoring its special regex attribute),
precede it with a backslash "\". For example, "\." matches a
literal period rather than any character.
Precede the following metacharacters with a backslash "\" to
search for them as literal characters:
^ $ + * ? . | ( ) { } [ ] \
These metacharacters take on a special function (covered
below) unless they are escaped. Conversely, some characters take
on special functions (i.e. become metacharacters) when they
are preceded by a backslash (e.g. "\d" for "any digit"
or "\n" for "newline"). These special sequences vary from
language to language; consult your language documentation for a
comprehensive list.
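The effect of escaping can be seen in a short Python sketch. The sample text is made up for illustration; `re.escape()` is the standard-library helper that inserts the backslashes for you.

```python
import re

text = "Price: 3.50 or 3x50"

# An unescaped dot is a wildcard, so it also matches the "x".
unescaped_hits = re.findall(r"3.50", text)
print(unescaped_hits)  # ['3.50', '3x50']

# Escaping the dot makes it match only a literal period.
escaped_hits = re.findall(r"3\.50", text)
print(escaped_hits)  # ['3.50']

# re.escape() escapes any metacharacters in a plain string.
auto_escaped = re.escape("3.50")
print(auto_escaped)  # 3\.50

# Going the other way, a backslash turns "d" into the
# special sequence "\d" (any digit).
digits = re.findall(r"\d", "a1b22")
print(digits)  # ['1', '2', '2']
```

Raw strings (`r"..."`) are used so that Python passes the backslashes through to the regex engine unchanged.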
Quantifiers specify how many instances of the preceding
element (which can be a character or a group) must appear in order to match.
The "?" matches 0 or 1 instances of the previous element. In
other words, it makes the element optional; it can be present,
but it doesn't have to be. For example:
-
Regex:
colou?r
-
Matches:
colour
color
-
Doesn't Match:
colouur
colur
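A small Python sketch of the optional-element example above (variable names are illustrative):

```python
import re

# "u?" makes the "u" optional: zero or one occurrence.
pattern = re.compile(r"colou?r")

spellings = ["colour", "color", "colouur", "colur"]
optional_hits = [s for s in spellings if pattern.search(s)]
print(optional_hits)  # ['colour', 'color']
```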
The "*" matches 0 or more instances of the previous element.
Using ".*" can be dangerous. It will match any number of any
character (including spaces and non-alphanumeric characters). The
quantifier is "greedy" and will match as much text as possible.
For example, the expression "www\.my.*\.com" matches
"www.mysite.com", but if more text ending in ".com" follows it,
the match extends all the way to that last ".com".
To make a quantifier "non-greedy" (matching as few characters as
possible), add a "?" after the "*". The expression
"www\.my.*?\.com" would match just "www.mysite.com", not the
longer string.
The "+" matches 1 or more instances of the previous element.
Like "*", it is greedy and will match as much as possible unless
it is followed by a "?".
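Greedy and non-greedy matching can be compared side by side in Python. This is only a sketch; the sample strings are invented for illustration.

```python
import re

text = "Visit www.mysite.com or mail info@example.com"

# Greedy: ".*" runs on to the LAST ".com" in the string.
greedy = re.search(r"www\.my.*\.com", text).group()
print(greedy)  # www.mysite.com or mail info@example.com

# Non-greedy: ".*?" stops at the FIRST ".com" it can reach.
lazy = re.search(r"www\.my.*?\.com", text).group()
print(lazy)  # www.mysite.com

# "+" behaves the same way.
tag_text = "<b>bold</b>"
greedy_plus = re.search(r"<.+>", tag_text).group()
lazy_plus = re.search(r"<.+?>", tag_text).group()
print(greedy_plus)  # <b>bold</b>
print(lazy_plus)    # <b>
```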
To match a character a specific number of times, add that
number enclosed in curly braces after the element. For
example, "a{3}" matches exactly three consecutive "a"
characters ("aaa").
To specify the minimum number of matches to find and the
maximum number of matches to allow, use a number range inside
curly braces. For example, "a{2,4}" matches between two and
four consecutive "a" characters.
| Quantifier | Description |
| ? | Matches the preceding element 0 or 1 times. |
| * | Matches the preceding element 0 or more times. |
| + | Matches the preceding element 1 or more times. |
| {num} | Matches the preceding element num times. |
| {min,max} | Matches the preceding element at least min times, but not more than max times. |
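The "+" and curly-brace quantifiers summarized above can be sketched in Python as follows. The sample strings are assumptions chosen for illustration, and `re.fullmatch()` is used so that the whole string must fit the pattern.

```python
import re

# "+" needs at least one "o" between "g" and "gle".
plus_hits = re.findall(r"go+gle", "ggle gogle google gooogle")
print(plus_hits)  # ['gogle', 'google', 'gooogle']

# "{3}" needs exactly three repetitions.
exact_hits = [s for s in ["aa", "aaa", "aaaa"] if re.fullmatch(r"a{3}", s)]
print(exact_hits)  # ['aaa']

# "{2,4}" allows between two and four repetitions.
candidates = ["a", "aa", "aaaa", "aaaaa"]
range_hits = [s for s in candidates if re.fullmatch(r"a{2,4}", s)]
print(range_hits)  # ['aa', 'aaaa']

# In Python, write the range without a space. With a space,
# "{2, 4}" is not treated as a quantifier.
spaced = re.fullmatch(r"a{2, 4}", "aaa")
print(spaced)  # None
```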
The vertical bar "|" is used to represent an "OR" condition.
Use it to separate alternate patterns or characters for matching.
For example:
-
Regex:
perl|python
-
Matches:
perl
python
-
Doesn't Match:
ruby
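A minimal Python sketch of the alternation example above (the list name is illustrative):

```python
import re

# "|" matches either the pattern on its left or the one on its right.
pattern = re.compile(r"perl|python")

languages = ["perl", "python", "ruby"]
alternation_hits = [name for name in languages if pattern.search(name)]
print(alternation_hits)  # ['perl', 'python']
```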
Parentheses "()" are used to group characters and expressions
within larger, more complex regular expressions. Quantifiers that
immediately follow the group apply to the whole group. For
example:
-
Regex:
(abc){2,3}
-
Matches:
abcabc
abcabcabc
-
Doesn't Match:
abc
abccc
Groups can be used in conjunction with alternation. For
example:
-
Regex:
gr(a|e)y
-
Matches:
gray
grey
-
Doesn't Match:
graey
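The two grouping examples above can be checked with a short Python sketch (variable names are assumptions made for illustration):

```python
import re

# A quantifier after a group applies to the whole group.
repeat = re.compile(r"(abc){2,3}")
repeat_hits = [s for s in ["abcabc", "abcabcabc", "abc", "abccc"]
               if repeat.search(s)]
print(repeat_hits)  # ['abcabc', 'abcabcabc']

# Alternation inside a group limits the choice to that position.
grey = re.compile(r"gr(a|e)y")
grey_hits = [s for s in ["gray", "grey", "graey"] if grey.search(s)]
print(grey_hits)  # ['gray', 'grey']
```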
Strings that match these groups are stored, or "delimited",
for use in substitutions or
subsequent statements. The first group is stored in the
metacharacter "\1", the second in "\2" and so on. For
example:
-
Regex:
(.{2,5}) (.{2,8}) <\1_\2@example\.com>
-
Matches:
Joe Smith <Joe_Smith@example.com>
jane doe <jane_doe@example.com>
459 33154 <459_33154@example.com>
-
Doesn't Match:
john doe <doe_john@example.com>
Jane Doe <janie88@example.com>
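The stored groups from the example above can be inspected in Python, where they are also available through the match object's `group()` method. This is a sketch using the primer's own sample lines; the variable names are illustrative.

```python
import re

# "\1" and "\2" must repeat whatever groups 1 and 2 matched earlier.
pattern = re.compile(r"(.{2,5}) (.{2,8}) <\1_\2@example\.com>")

lines = [
    "Joe Smith <Joe_Smith@example.com>",
    "jane doe <jane_doe@example.com>",
    "459 33154 <459_33154@example.com>",
    "john doe <doe_john@example.com>",
    "Jane Doe <janie88@example.com>",
]
backref_hits = [line for line in lines if pattern.search(line)]
print(len(backref_hits))  # 3

# The saved groups can be read back from a successful match.
first = pattern.search(lines[0])
saved = (first.group(1), first.group(2))
print(saved)  # ('Joe', 'Smith')
```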
Character classes indicate a set of characters to match.
Enclosing a set of characters in square brackets "[...]" means
"match any one of these characters". For example:
-
Regex:
[cbe]at
-
Matches:
cat
bat
eat
-
Doesn't Match:
sat
beat
Since a character class on its own only applies to one
character in the match, combine it with a quantifier to search
for multiple instances of the class. For example:
-
Regex:
[0123456789]{3}
-
Matches:
123
999
376
-
Doesn't Match:
W3C
2_4
If we were to try the same thing with letters, we would have
to enter all 26 letters in upper and lower case. Fortunately, we
can specify a range instead using a hyphen. For example:
-
Regex:
[a-zA-Z]{4}
-
Matches:
Perl
ruby
SETL
-
Doesn't Match:
1234
AT&T
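The character class examples above treat each sample as a whole string, so this Python sketch uses `re.fullmatch()`. With `re.search()`, "[cbe]at" would find the "eat" inside "beat". The helper name `whole_matches` is hypothetical.

```python
import re

def whole_matches(regex, samples):
    """Return the samples that the regex matches from start to end."""
    return [s for s in samples if re.fullmatch(regex, s)]

# One character from the set, followed by "at".
set_hits = whole_matches(r"[cbe]at", ["cat", "bat", "eat", "sat", "beat"])
print(set_hits)  # ['cat', 'bat', 'eat']

# A class combined with a quantifier.
digit_hits = whole_matches(r"[0123456789]{3}", ["123", "999", "376", "W3C", "2_4"])
print(digit_hits)  # ['123', '999', '376']

# Ranges save typing every letter.
letter_hits = whole_matches(r"[a-zA-Z]{4}", ["Perl", "ruby", "SETL", "1234", "AT&T"])
print(letter_hits)  # ['Perl', 'ruby', 'SETL']

# Unanchored search finds "eat" inside "beat".
inside_beat = re.search(r"[cbe]at", "beat").group()
print(inside_beat)  # eat
```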
Most languages have special patterns for representing the most
commonly used character classes. For example, Python uses "\d" to
represent any digit (same as "[0-9]") and "\w" to represent any
alphanumeric, or "word" character (same as "[a-zA-Z0-9_]"). See your
language documentation for the special sequences applicable to
the language you use.
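A brief Python sketch of these shorthand classes follows. One caveat worth labeling: in Python 3, "\d" and "\w" applied to ordinary strings also match non-ASCII digits and letters unless the `re.ASCII` flag is given, so the bracketed equivalents above hold exactly only in ASCII mode. The sample strings are invented for illustration.

```python
import re

# "\d+" picks out runs of digits.
numbers = re.findall(r"\d+", "Room 101, floor 3")
print(numbers)  # ['101', '3']

# "\w+" picks out runs of letters, digits, and underscores.
words = re.findall(r"\w+", "snake_case and x2")
print(words)  # ['snake_case', 'and', 'x2']

# With re.ASCII, "\w" is limited to [a-zA-Z0-9_].
accented = "caf\u00e9"  # "cafe" with an accented final letter
ascii_words = re.findall(r"\w+", accented, re.ASCII)
unicode_words = re.findall(r"\w+", accented)
print(ascii_words)  # ['caf']
```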
To define a group of characters you do not want to
match, use a negated character class. Adding a caret "^" to the
beginning of the character class (i.e. [^...]) means "match any
character except these". For example:
-
Regex:
[^a-zA-Z]{4}
-
Matches:
1234
$.25
#77;
-
Doesn't Match:
Perl
AT&T
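The negated class example above, sketched in Python with whole-string matching (names are illustrative):

```python
import re

# "[^a-zA-Z]" matches any single character that is NOT a letter.
pattern = re.compile(r"[^a-zA-Z]{4}")

samples = ["1234", "$.25", "#77;", "Perl", "AT&T"]
negated_hits = [s for s in samples if pattern.fullmatch(s)]
print(negated_hits)  # ['1234', '$.25', '#77;']
```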
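Anchors are used to specify where in a string or line to look
for a match. The "^" metacharacter (when not used at the
beginning of a negated character class) specifies the beginning
of the string or line. For example, "^From" matches "From" only
at the start of a string or line.
The "$" metacharacter specifies the end of a string or
line. For example, "\.org$" matches ".org" only at the end of a
string or line.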
Sometimes it's useful to anchor both the beginning and end of
a regular expression. This not only makes the expression more
specific, it often improves the performance of the search.
-
Regex:
^To: .*example.org$
-
Matches:
To: feedback@example.org
To: hr@example.net, qa@example.org
-
Doesn't Match:
To: qa@example.org, hr@example.net
Send a Message To: example.org
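Anchors can be tried out with a short Python sketch. The header lines come from the example above; the "cat" strings are invented for illustration.

```python
import re

# "^" pins the match to the start, "$" to the end.
starts_with_cat = bool(re.search(r"^cat", "catalog"))
not_at_start = bool(re.search(r"^cat", "bobcat"))
ends_with_cat = bool(re.search(r"cat$", "bobcat"))
print(starts_with_cat, not_at_start, ends_with_cat)  # True False True

# Anchoring both ends, as in the example above.
pattern = re.compile(r"^To: .*example.org$")
headers = [
    "To: feedback@example.org",
    "To: hr@example.net, qa@example.org",
    "To: qa@example.org, hr@example.net",
    "Send a Message To: example.org",
]
anchored_hits = [h for h in headers if pattern.search(h)]
print(len(anchored_hits))  # 2
```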
Regular expressions can be used as a "search and replace"
tool. This aspect of regex use is known as substitution.
There are many variations in substitution syntax depending on
the language used. This primer uses the
"s/search/replacement/modifier" convention used in Perl. In simple
substitutions, the "search" text will be a regex like the ones
we've examined above, and the "replacement" value will be a
string.
For example, to search for an old domain name and replace it
with the new domain name:
-
Regex Substitution:
s/http:\/\/www\.old-domain\.com/http://www.new-domain.com/
-
Search for:
http://www.old-domain.com
-
Replace with:
http://www.new-domain.com
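In Python, the same substitution would be written with `re.sub()` rather than the "s/.../.../" notation. This is a sketch with an invented sample sentence. Because Python does not use "/" as a delimiter, the slashes in the search pattern need no escaping here.

```python
import re

text = "Visit http://www.old-domain.com today"

# re.sub(pattern, replacement, string) returns a new string.
updated = re.sub(r"http://www\.old-domain\.com",
                 "http://www.new-domain.com",
                 text)
print(updated)  # Visit http://www.new-domain.com today
```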
Notice that the "/" and "." characters are not escaped in the
replacement string. In replacement strings, they do not need to
be. In fact, if you were to precede them with backslashes, they
would appear in the substitution literally (i.e.
"http:\/\/www\.new-domain\.com").
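The one way you can use the backslash "\" is to put saved
matches in the substitution using "\num". For
example, the substitution
"s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/"
reuses the matched protocol as "\1" and the matched top-level
domain as "\2".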
This regex will actually match a number of URLs other than
"http://old-domain.com". If we had a list of URLs with various
permutations, we could replace all of them with related versions
of the new domain name (e.g. "ftp://old-domain.net" would become
"ftp://new-domain.net"). To do this we need to use a
modifier.
Modifiers alter the behavior of the regular expression. The
previous substitution example replaces only the first occurrence
of the search string; once it finds a match, it performs the
substitution and stops. To modify this regex in order to replace
all matches in the string, we need to add the "g" modifier.
-
Substitution Regex:
s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/g
-
Target Text:
http://old-domain.com and ftp://old-domain.net
-
Result:
http://new-domain.com and ftp://new-domain.net
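Python has no "g" modifier. As a sketch of the equivalent behavior, `re.sub()` replaces every match by default, and its `count` argument limits the number of replacements, so `count=1` approximates a substitution without "g".

```python
import re

pattern = r"(ftp|http)://old-domain\.(com|net|org)"
replacement = r"\1://new-domain.\2"  # \1 and \2 insert the saved groups
text = "http://old-domain.com and ftp://old-domain.net"

# Like a substitution without "g": stop after the first match.
first_only = re.sub(pattern, replacement, text, count=1)
print(first_only)  # http://new-domain.com and ftp://old-domain.net

# Like a substitution with "g": replace every match.
replace_all = re.sub(pattern, replacement, text)
print(replace_all)  # http://new-domain.com and ftp://new-domain.net
```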
The "i" modifier causes the match to ignore the case of
alphabetic characters. For example, with the "i" modifier, the
regex "at" also matches "At" and "AT".
| Modifier | Meaning |
| i | Ignore case when matching exact strings. |
| m | Treat string as multiple lines. Allow "^" and "$" to match next to newline characters. |
| s | Treat string as single line. Allow "." to match a newline character. |
| x | Ignore whitespace and newline characters in the regular expression. Allow comments. |
| o | Compile regular expression once only. |
| g | Match all instances of the pattern in the target string. |
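In Python, the "i", "m", "s" and "x" modifiers correspond to flags in the `re` module. The sketch below uses invented sample strings. The "o" and "g" modifiers have no flag in Python, where `re.compile()` and functions such as `re.findall()` and `re.sub()` cover those roles.

```python
import re

# i: ignore case.
ignore_case = bool(re.search(r"perl", "PERL", re.IGNORECASE))

# m: "^" and "$" also match at line boundaries.
text = "one\ntwo\nthree"
single_line = re.findall(r"^\w+", text)
multi_line = re.findall(r"^\w+", text, re.MULTILINE)
print(single_line)  # ['one']
print(multi_line)   # ['one', 'two', 'three']

# s: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# x: whitespace is ignored and "#" starts a comment.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
verbose_ok = phone.fullmatch("555-1234") is not None
print(ignore_case, dot_default is None, dot_all is not None, verbose_ok)
```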
Komodo's Search features
(including "Find...", "Replace..." and "Find in Files...") can
accept plain text, glob style matching (called "wildcards" in the
drop list, but using "." and "?" differently than regex
wildcards), and Python regular expressions. A complete guide to
regexes in Python can be found in the
Python documentation. The Regular Expression
HOWTO by A.M. Kuchling is a good introduction to regular
expressions in Python.
|