The Regular Expressions Primer is a tutorial for those completely new
to regular expressions. To familiarize you with regular expressions,
this primer starts with the simple building blocks of the syntax and,
through examples, builds up to expressions useful for solving real
everyday problems. The primer later discusses how to search for and
replace text with regular expression syntax.
Regular expressions are used in programs to parse text. Here is a
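simple regular expression in Perl:
my $string = "This will match all words beginning with the letter w.";
# In list context, /g returns every match; /i makes the match case-insensitive.
my @words = $string =~ /\bw[a-z]*/ig;
print "$_\n" for @words;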
Regular expressions are used to describe patterns of characters that
match against text strings. They can be used as a tool to search for
and replace text, manipulate data, or test for a certain condition in a
string or file.
Many everyday tasks can be accomplished with regular expressions, such
as checking for the occurrence of a specific word or phrase in the body
of an e-mail message, or finding specific file extensions in a folder
or directory. Regular expressions are often called "regex", "regexes",
"regexps", or "RE". This primer uses the terms "regular expressions",
"regex", and "regexes" interchangeably.
Regular expressions use syntax elements composed of alphanumeric
characters and symbols. For example, the regex (2) searches for the
number 2, while the regex ([1-9][0-9]{2}-[0-9]{4}) matches a regular
7-digit phone number.
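As a rough illustration, the phone-number pattern above can be tried out in a short script. Python's re module is an assumption here (the primer itself is not tied to any tool), and the helper name find_phone and the sample sentences are made up for the example.

```python
import re

# Pattern from the text: a non-zero digit, two more digits, a hyphen, four digits.
PHONE = re.compile(r"[1-9][0-9]{2}-[0-9]{4}")

def find_phone(text):
    """Return the first phone-like substring in text, or None if there is none."""
    match = PHONE.search(text)
    return match.group(0) if match else None

print(find_phone("Call 555-1234 today"))  # 555-1234
print(find_phone("Call 055-1234 today"))  # None (the first digit may not be 0)
```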
There are many flavors and types of regular expression syntax used by
tools, languages and operating systems. For example, Perl, Python,
grep, sed, vi, and Unix all use variations on standard regex syntax.
This primer focuses on standard regex patterns not tied to a specific
language or tool. This standard syntax can be later applied to the
specific language, tool or application of your choice.
Complete regular expressions are constructed using characters as small
building blocks. Each building block is in itself simple, but since
these units can be combined in an infinite number of ways, knowing how
to combine them to achieve a goal takes some practice. This section
shows you how to build regexes through examples ranging from the simple
to the complex.
The simplest and most common type of regex is an alphanumeric string
that matches itself, called a "literal text match". A literal text
regex matches anywhere along a string. For example, a literal string
matches itself when placed alone, and at the beginning, middle, or end
of a larger string. Literal text matches are case sensitive.
Example 1: Search for the string "at".
Example 2: Search for the string "email".
Example 3: Search for the alphanumeric string "abcdE567".
Note: Regular expressions are case sensitive unless case is
deliberately modified.
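A minimal sketch of literal text matching, assuming Python's re module as the host tool. The helper contains and the sample strings are hypothetical; only the patterns "at", "email", and "abcdE567" come from the examples above.

```python
import re

def contains(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

print(contains("at", "cat"))                 # True: match at the end
print(contains("at", "attach"))              # True: match at the beginning
print(contains("email", "Email"))            # False: literal matches are case sensitive
print(contains("abcdE567", "xxabcdE567xx"))  # True: match in the middle
```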
There are other characters in regex syntax that match in a more
generalized way. These are called "metacharacters". Metacharacters do
not match themselves, but rather perform a specific task when used in a
regular expression. One such metacharacter is the dot ".", or wildcard.
When used in a regular expression, the wildcard matches any single
character (in most flavors, any character except a newline).
Example 1: Use "." to search for any one character before the string
"ubject:".
Example 2: Use three dots "..." to search for any three characters
within a string.
- Regex:
t...s
- Matches:
trees
tEENs
t345s
t-4-s
- Doesn't Match:
Trees
twentys
t1234s
Example 3: Use several wildcards to match characters throughout a
string.
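The wildcard behavior can be checked with a short script. This is only a sketch that assumes Python's re module; it reuses the t...s pattern and sample strings from Example 2.

```python
import re

# Three dots: exactly three characters of any kind between "t" and "s".
WILDCARD = re.compile(r"t...s")

samples = ["trees", "tEENs", "t-4-s", "Trees", "twentys", "t1234s"]
matched = [s for s in samples if WILDCARD.search(s)]
print(matched)  # ['trees', 'tEENs', 't-4-s']
```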
In regular expression syntax, many non-alphanumeric characters are
treated as special characters. These characters, called
"metacharacters", include asterisks, question marks, dots, backslashes,
etc. In order to search for a metacharacter without using its special
attribute, precede it with a backslash "\" to change it into a literal
character. For example, to build a regex to search for a .txt file,
precede the dot with a backslash \.txt to prevent the dot's special
function, a wildcard search. The backslash, called an "escape
character" in regex terminology, turns metacharacters into literal
characters.
Precede the following metacharacters with a backslash "\" to search for
them as literal characters:
^ $ + * ? . | ( ) { } [ ] \
Using the backslash "\" to escape special characters in a regular
expression.
Example 1: Escape the dollar sign "$" to find the alphanumeric
string "$100".
- Regex:
\$100
- Matches:
$100
$1000
- Doesn't Match:
$10
100
Example 2: Use the dot "." as a literal character to find a file
called "email.txt".
- Regex:
email\.txt
- Matches:
email.txt
- Doesn't Match:
email
txt
email_txt
Example 3: Escape the backslash "\" character to search for a
Windows file.
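The effect of escaping can be sketched as follows, assuming Python's re module. The sample strings are made up. re.escape() is a Python convenience that inserts the backslashes automatically; other tools may not offer an equivalent.

```python
import re

# Escaped dot: matches only a literal "."
dot_literal_hit = bool(re.search(r"email\.txt", "email.txt"))
dot_literal_miss = bool(re.search(r"email\.txt", "email_txt"))
print(dot_literal_hit, dot_literal_miss)  # True False

# Unescaped dot: the wildcard also accepts "_"
wildcard_hit = bool(re.search(r"email.txt", "email_txt"))
print(wildcard_hit)  # True

# re.escape() puts a backslash in front of each metacharacter
escaped = re.escape("$100")
print(escaped)  # \$100
print(bool(re.search(escaped, "Total: $100")))  # True
```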
Regex syntax includes metacharacters which specify the number of times
a particular character or string must match. This group of
metacharacters is called "quantifiers"; they influence the quantity of
matches found. Quantifiers act on the element immediately preceding
them, which can be a digit, a letter, a space, or another metacharacter
such as the dot ".". This section demonstrates how quantifiers search
using ranges and repetition.
Ranges are considered "counting quantifiers" in regular expressions.
This is because they specify the minimum number of matches to find and
the maximum number of matches to allow. Use ranges in regex searches
when a bound, or a limit, should be placed on search results. For
example, the range {3,5} matches an item at least 3 times, but not more
than 5 times. When this range is applied to the character "a", as in
a{3,5}, the strings "aaa", "aaaa", and "aaaaa" are successfully matched.
If only a single number is given within the curly braces, as in {3},
the pattern matches exactly three items. For example, the regex b{3}
matches the string "bbb".
Using ranges to identify search patterns.
Example 1: Match the preceding "0" at least 3 times with a maximum
of 5 times.
Example 2: Using the "." wildcard to match any character sequence
two or three characters long.
- Regex:
.{2,3}
- Matches:
404
44
com
w3
- Doesn't Match (as a complete string):
4
a
aaaa
Example 3: Match the preceding "e" exactly twice.
- Regex:
be{2}t
- Matches:
beet
- Doesn't Match:
bet
beat
eee
Example 4: Match the preceding "w" exactly three times.
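A small sketch of range quantifiers, assuming Python's re module. The helper whole is hypothetical; it uses re.fullmatch so that the entire string must fit the pattern, which makes the upper bound visible.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# a{3,5}: at least three and at most five "a" characters
in_range = [s for s in ["aa", "aaa", "aaaaa", "aaaaaa"] if whole(r"a{3,5}", s)]
print(in_range)  # ['aaa', 'aaaaa']

# be{2}t: exactly two "e" characters
print(whole(r"be{2}t", "beet"), whole(r"be{2}t", "bet"))  # True False

# .{2,3}: any two or three characters
print(whole(r".{2,3}", "w3"), whole(r".{2,3}", "aaaa"))  # True False
```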
Unlike range quantifiers, the repetition quantifiers (question mark
"?", asterisk "*", and plus "+") have few limits when performing regex
searches. These quantifiers settle for the minimum number of required
matches, but always attempt to match as many times as possible, up to
the maximum allowed (this behavior is often called "greedy" matching).
For example, the question mark "?" matches the preceding character 0 or
1 times, the asterisk "*" matches the preceding character 0 or more
times, and the plus "+" matches the preceding character 1 or more times.
Using repetition to search for repeated characters with few limits.
Example 1: Use "?" to match the "u" character 0 or 1 times.
- Regex:
colou?r
- Matches:
colour
color
- Doesn't Match:
colouur
Colour
Example 2: Use "*" to match the preceding item 0 or more times; use
"." to match any character.
Example 3: Use "+" to match the preceding "5" at least once.
The following table defines the various regex quantifiers. Note that
each quantifier requires a different minimum and allows a different
maximum number of repetitions.
Quantifier   Description
{num}        Matches the preceding element exactly num times.
{min,max}    Matches the preceding element at least min times, but not
             more than max times.
?            Matches the preceding element 0 or 1 times.
*            Matches the preceding element 0 or more times.
+            Matches the preceding element 1 or more times.
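The repetition quantifiers in the table can be compared side by side. The sketch below assumes Python's re module. The patterns ab*c, ab+c, and 5+ and their sample strings are made up for illustration; colou?r comes from Example 1.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# "?": the "u" may appear 0 or 1 times
optional_u = [whole(r"colou?r", s) for s in ["color", "colour", "colouur"]]
print(optional_u)  # [True, True, False]

# "*": 0 or more; "+": 1 or more
print(whole(r"ab*c", "ac"), whole(r"ab+c", "ac"))  # True False

# Quantifiers take as many characters as they can
longest = re.search(r"5+", "a5555b").group(0)
print(longest)  # 5555
```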
Conditional expressions help qualify and restrict regex searches,
increasing the probability of a desirable match. The vertical bar "|"
symbol, meaning "OR", places a condition on the regex to search for
either one character or string or another. Because the regex has a
list of alternative choices to evaluate, this technique is called
"alternation". To search for either one pattern or another, insert a
vertical bar "|" between the alternatives.
Example 1: Use "|" to alternate a search for various spellings of a
string.
- Regex:
gray|grey
- Matches:
gray
grey
- Doesn't Match:
GREY
Gray
Example 2: Use "|" to alternate a search for either email or Email
or EMAIL or e-mail.
- Regex:
email|Email|EMAIL|e-mail
- Matches:
email
Email
EMAIL
e-mail
- Doesn't Match:
EmAiL
E-Mail
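Alternation can be sketched like this, assuming Python's re module. The pattern and word list are the ones from Example 2; fullmatch is used so each word is tested as a whole.

```python
import re

# Four alternatives separated by "|"
SPELLINGS = re.compile(r"email|Email|EMAIL|e-mail")

words = ["email", "Email", "EMAIL", "e-mail", "EmAiL", "E-Mail"]
accepted = [w for w in words if SPELLINGS.fullmatch(w)]
print(accepted)  # ['email', 'Email', 'EMAIL', 'e-mail']
```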
Use parentheses to enclose a group of related search elements.
Parentheses limit scope on alternation and create substrings to enhance
searches with metacharacters. For example, use parentheses to group the
expression (abc), then apply the range quantifier {3} to find instances
of the string "abcabcabc".
Using parentheses to group regular expressions.
Example 1: Use parentheses and a range quantifier to find instances
of the string "abcabcabc".
- Regex:
(abc){3}
- Matches:
abcabcabc
abcabcabcabc
- Doesn't Match:
abc
abcabc
Example 2: Use parentheses to limit the scope of alternative matches
on the words gray and grey.
- Regex:
gr(a|e)y
- Matches:
gray
grey
- Doesn't Match:
gry
graey
Example 3: Use parentheses and "|" to locate past correspondence in
a mail-filtering program. This regex finds a 'To:' or a 'From:' line
followed by a space and then either the word 'Smith' or the word
'Chan'.
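A sketch of grouping, assuming Python's re module. The HEADER pattern is an assumed reconstruction of the regex that Example 3 describes in words, not a quotation from the primer, and the "Cc:" line is a made-up counterexample.

```python
import re

# Assumed pattern: "To:" or "From:", a space, then "Smith" or "Chan".
HEADER = re.compile(r"(To:|From:) (Smith|Chan)")

m = HEADER.search("From: Chan")
print(m.group(1), m.group(2))      # From: Chan
print(HEADER.search("Cc: Smith"))  # None

# Parentheses limit the alternation to a single letter
print(bool(re.fullmatch(r"gr(a|e)y", "grey")))   # True
print(bool(re.fullmatch(r"gr(a|e)y", "graey")))  # False

# A quantifier after a group repeats the whole group
print(bool(re.search(r"(abc){3}", "abcabcabc")), bool(re.search(r"(abc){3}", "abcabc")))  # True False
```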
"Character classes" are used to specify a group of characters, enclosed
in square brackets "[]", which can match in a specific place. Any of
the characters specified in the class can match in that place. The
class can be used to match a single character or, when used in
conjunction with quantifiers, a string.
The most basic type of character class is a set of alphanumeric
characters within square brackets "[]". For example, the regular
expression [bcr]at matches the words "bat", "cat", or "rat" because it
uses a character class (containing "b", "c", and "r") as its first
character. Character classes match a single character unless a
quantifier is placed after the closing bracket. For examples using
quantifiers with character classes, see Compound Character Classes.
Note: When placed inside a character class, the hyphen "-"
metacharacter denotes a continuous sequence of letters or numbers in a
range. For example, [a-d] is a range of letters denoting the continuous
sequence of a, b, c, and d. When a hyphen is used elsewhere in a regex,
it matches a literal hyphen.
Using simple character classes to perform regex searches.
Example 1: Use a character class to match either case of the letter
"s".
- Regex:
Java[Ss]cript
- Matches:
JavaScript
Javascript
- Doesn't Match:
javascript
javaScript
Example 2: Use a character class to limit the scope of alternative
matches on the words gray and grey.
- Regex:
gr[ae]y
- Matches:
gray
grey
- Doesn't Match:
gry
graey
Example 3: Use a character class to match any one digit in the list.
- Regex:
[0123456789]
- Matches:
5
0
9
- Doesn't Match:
x
?
F
Example 4: To simplify the previous example, use a hyphen "-" within
a character class to denote a range for matching any one digit in the
list.
- Regex:
[0-9]
- Matches:
5
0
9
- Doesn't Match (as a complete string):
234
42
Example 5: Use a hyphen "-" within a character class to denote an
alphabetic range for matching various words ending in "mail".
- Regex:
[A-Z]mail
- Matches:
Email
Xmail
Zmail
- Doesn't Match:
email
mail
Example 6: Match any three or more digits listed in the character
class.
- Regex:
[0-9]{3,}
- Matches:
012
1234
555
98754378623
- Doesn't Match:
10
7
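Simple character classes can be exercised with re.findall, which returns every match in a string. This is a sketch that assumes Python's re module; the sample sentences are made up, while the patterns are taken from the examples above.

```python
import re

# One character from the set b, c, r, followed by "at"
rhymes = re.findall(r"[bcr]at", "bat cat rat hat")
print(rhymes)  # ['bat', 'cat', 'rat']

# Either case of "s"
scripts = re.findall(r"Java[Ss]cript", "JavaScript, Javascript, javascript")
print(scripts)  # ['JavaScript', 'Javascript']

# A range plus a quantifier: runs of three or more digits
runs = re.findall(r"[0-9]{3,}", "10 555 7 98754378623")
print(runs)  # ['555', '98754378623']
```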
Previous examples used character classes to specify exact sequences to
match. Character classes can also be used to prevent, or negate,
matches with undesirable strings. To prevent a match, use a leading
caret "^" (meaning NOT) within square brackets: [^...]. For example,
the regex [^a] matches any single character except the letter "a".
Note: The caret symbol must be the first character within the square
brackets to negate a character class.
Using character classes to prevent a sequence from matching.
Example 1: Prevent a match on any numeric string. Use the "*" to
match an item 0 or more times. Because "*" also permits an empty match,
only non-empty matches are counted below.
- Regex:
[^0-9]*
- Matches:
abc
c
Mail
u-see
a4a
- Doesn't Match:
1
42
100
23000000
Example 2: Search for a text file beginning with any character that is
not a lower-case letter.
- Regex:
[^a-z]\.txt
- Matches:
A.txt
4.txt
Z.txt
- Doesn't Match:
r.txt
a.txt
Aa.txt
Example 3: Prevent a match on the numbers "10" and "12".
- Regex:
1[^02]
- Matches:
13
11
19
17
1a
- Doesn't Match:
10
12
42
a1
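Negated classes can be sketched as follows, assuming Python's re module. The first part reuses the 1[^02] pattern from Example 3. The second part shows a detail worth knowing: because "*" allows zero repetitions, [^0-9]* can succeed with an empty match even on an all-digit string, whereas [^0-9]+ requires at least one non-digit.

```python
import re

NOT_10_OR_12 = re.compile(r"1[^02]")

samples = ["13", "11", "1a", "10", "12", "42", "a1"]
kept = [s for s in samples if NOT_10_OR_12.search(s)]
print(kept)  # ['13', '11', '1a']

# "*" permits an empty match; "+" does not
empty = re.search(r"[^0-9]*", "42").group(0)
print(repr(empty))                  # ''
print(re.search(r"[^0-9]+", "42"))  # None
```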
Character classes are a versatile tool when combined with various
pieces of the regex syntax. Compound character classes can help clarify
and define sophisticated searches, test for certain conditions in a
program, and filter wanted e-mail from spam. This section uses compound
character classes to build meaningful expressions with the regex
syntax.
Using compound character classes with the regex syntax.
Example 1: Find a partial e-mail address. Use a character class to
denote a match for any digit from 0 to 9. Use a range to restrict
the number of times a digit matches.
- Regex:
smith[0-9]{2}@
- Matches:
smith44@
smith42@
- Doesn't Match:
Smith34
smith6
Smith0a
Example 2: Search an HTML file to find each instance of a header
tag. Allow matches on whitespace after the tag but before the ">".
- Regex:
(<[Hh][1-6] *>)
- Matches:
<H1>
<h6>
<H3 >
<h2 >
- Doesn't Match:
<H1
< h2>
<a1>
Example 3: Match a regular 7-digit phone number. Prevent the digit
"0" from leading the string.
Example 4: Match a valid web-based protocol. Escape the two forward
slashes.
Example 5: Match a valid e-mail address.
- Regex:
[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+
- Matches:
j_smith@foo.com
j.smith@bc.canada.ca
smith99@foo.co.uk
1234@mydomain.net
- Doesn't Match:
@foo.com
.smith@foo.net
smith.@foo.org
www.myemail.com
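The e-mail pattern from Example 5 can be tested against its own sample lists. This sketch assumes Python's re module and a hypothetical helper named is_email; it uses fullmatch because the example validates a complete address rather than searching inside a longer string.

```python
import re

EMAIL = re.compile(r"[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+")

def is_email(text):
    """Return True if the whole string fits the primer's e-mail pattern."""
    return EMAIL.fullmatch(text) is not None

good = ["j_smith@foo.com", "j.smith@bc.canada.ca", "smith99@foo.co.uk", "1234@mydomain.net"]
bad = ["@foo.com", ".smith@foo.net", "smith.@foo.org", "www.myemail.com"]

print(all(is_email(s) for s in good))  # True
print(any(is_email(s) for s in bad))   # False
```

Note that this pattern is a teaching example: it accepts only lower-case letters and rejects many addresses that real mail systems allow.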
The following table defines various character class sequences. Use
these alphanumeric patterns to simplify your regex searches.
Character Class   Description
[0-9]             Matches any digit from 0 to 9.
[a-zA-Z]          Matches any alphabetic character.
[a-zA-Z0-9]       Matches any alphanumeric character.
[^0-9]            Matches any non-digit.
[^a-zA-Z]         Matches any non-alphabetic character.
At times, the pattern to be matched appears at either the very
beginning or end of a string. In these cases, use a caret "^" to match
a desired pattern at the beginning of a string, and a dollar sign "$"
for the end of the string. For example, the regular expression email
matches anywhere along the following strings: "email", "emailing",
"bogus_emails", and "smithsemailaddress". However, the regex ^email
only matches the strings "email" and "emailing". The caret "^" in this
example is used to effectively anchor the match to the start of the
string. For this reason, both the caret "^" and dollar sign "$" are
referred to as anchors in the regex syntax.
Note: The caret "^" has more than one meaning in regular expressions.
Its function is determined by its context. The caret can be used as an
anchor to match patterns at the beginning of a string, for example:
(^File). The caret can also be used as a logical "NOT" to negate
content in a character class, for example: [^...].
Using anchors to match at the beginning or end of a string.
Example 1: Use "$" to match the ".com" pattern at the end of a
string.
Example 2: Use "^" to match "inter" at the beginning of a string,
"$" to match "ion" at the end of a string, and ".*" to match any number
of characters within the string.
Example 3: Use "^" inside parentheses to match "To" and "From" at
the beginning of the string.
- Regex:
(^To:|^From:)(Smith|Chan)
- Matches:
From:Chan
To:Smith
From:Smith
To:Chan
- Doesn't Match:
From: Chan
from:Smith
To Chan
Example 4: Performing the same search as #3, place the caret "^"
outside the parentheses this time for similar results.
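A sketch of anchors, assuming Python's re module. The word list comes from the text above. The host names and the "outside" pattern are assumptions: the latter is one plausible way to write Example 4, with the caret placed once in front of the group.

```python
import re

words = ["email", "emailing", "bogus_emails", "smithsemailaddress"]
anywhere = [w for w in words if re.search(r"email", w)]
at_start = [w for w in words if re.search(r"^email", w)]
print(len(anywhere))  # 4
print(at_start)       # ['email', 'emailing']

# "$" anchors the match to the end of the string
print(bool(re.search(r"\.com$", "example.com")))     # True
print(bool(re.search(r"\.com$", "example.com.au")))  # False

# Caret inside each alternative versus once outside the group
inside = re.compile(r"(^To:|^From:)(Smith|Chan)")
outside = re.compile(r"^(To:|From:)(Smith|Chan)")
samples = ["From:Chan", "To:Smith", "From: Chan", "from:Smith", "To Chan"]
same = all(bool(inside.search(s)) == bool(outside.search(s)) for s in samples)
print(same)  # True
```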
Regular expressions are often used to search and replace text strings.
Up until this point, the preceding examples have centered on matching a
string using regex syntax. This section examines the search and replace
operation as a prominent feature of regular expressions and solves
standard problems using the substitution syntax.
As with building regular expressions, there are many variations on
substitution syntax depending on the language used. This primer focuses
on general search and replace syntax. This standard syntax can be later
applied to the specific language, tool or application of your choice.
Substitutions search for a pattern of text and replace it.
Substitutions are performed using the s/// operator, "s" standing for
substitution. The regular expression goes between the first and second
forward slashes, and the replacement text goes between the second and
third.
For example:
s/<regex>/<substitution-string>/
Use the s/// operator to search for and replace a simple text string.
Note: these searches only replace the first instance of the string
found.
Example 1: Search for the string "email" and replace it with
"e-mail".
Example 2: Search for an old domain name and replace it with the new
domain name. Using regex syntax, escape "." and "/" characters.
Example 3: Search for a single string starting with any lowercase
letter and ending with "mail". Replace the string with "Email".
- Regex Substitution:
s/[a-z]mail/Email/
- Search for:
email
zmail
xmail
- Replace with:
Email
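Python has no s/// operator, so the sketch below uses re.sub as a stand-in (an assumption made for illustration only). Passing count=1 mirrors the behavior described above, where only the first instance is replaced. The sample sentence is made up.

```python
import re

# Roughly s/email/e-mail/ : replace the first occurrence only
first_only = re.sub(r"email", "e-mail", "email me your email", count=1)
print(first_only)  # e-mail me your email

# Roughly s/[a-z]mail/Email/
replaced = [re.sub(r"[a-z]mail", "Email", w, count=1) for w in ["email", "zmail", "xmail"]]
print(replaced)  # ['Email', 'Email', 'Email']
```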
The previous substitution examples focused on small searches, such as
replacing a single lower-case word in a single line of text. Extend the
scope and flexibility of substitution searches through the use of
modifiers. The modifier parameter is appended to the end of the s///
operator as follows:
s/<regex>/<substitution-string>/<modifier>
Use the modifier "i" to ignore case in alphabetic searches, "m" to
allow multiple lines in a string, "s" to treat a pattern as a single
line, "x" to allow for whitespace and comments, and "g" to search
globally for all occurrences of the pattern in a file rather than just
the first instance found.
Use various modifiers with the s/// operator to search for and replace
text strings.
Example 1: Using the "g" modifier, search globally through all .htm
instances in a file and replace them with ".html". Using "$", only
substitute the ".htm" string when it appears at the end of a line. An
example file where this substitution succeeds:
/manual/mod_python/pythonapi.htm
/manual/mod_python/more-comp.htm
/manual/mod_python/overview.htm
- Regex Substitution:
s/\.htm$/\.html/g
- Search for:
.htm
- Replace with:
.html
Example 2: Using the "g" modifier, remove all HTML tags in a file
by replacing them with an empty string.
- Regex Substitution:
s/<[^>]+>//g
- Search for:
<code>Tag</code>
- Replace with:
Tag
Example 3: Perform a case insensitive search for various instances
of "login" and replace with the string "password".
- Regex Substitution:
s/LOGIN/password/i
- Search for:
LOGIN
login
LoGiN
Login
- Replace with:
password
Modifiers change how a match is performed. Use these modifiers to
expand the scope and versatility of your substitutions.
- i: Ignore case when matching exact strings.
- m: Treat string as multiple lines. Allow "^" and "$" to match next
  to newline characters.
- s: Treat string as single line. Allow "." to match a newline
  character.
- x: Ignore whitespace and newline characters in the regular
  expression. Allow comments.
- o: Compile regular expression once only.
- g: Match all instances of the pattern in the target string.
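Several of these modifiers have counterparts in other tools. The sketch below assumes Python's re module, where re.I, re.M, and re.S play the roles of "i", "m", and "s", and re.sub replaces every match by default (similar to "g"). The file names in the sample text are made up.

```python
import re

text = "a.htm\nb.htm\nc.htm"

# Without re.M, "$" matches only at the end of the whole string
end_only = re.sub(r"\.htm$", ".html", text)
print(end_only.split("\n"))    # ['a.htm', 'b.htm', 'c.html']

# With re.M, "$" also matches at the end of every line
every_line = re.sub(r"\.htm$", ".html", text, flags=re.M)
print(every_line.split("\n"))  # ['a.html', 'b.html', 'c.html']

# re.I ignores case
print(re.sub(r"LOGIN", "password", "LoGiN", flags=re.I))  # password

# Replacing every tag with an empty string
print(re.sub(r"<[^>]+>", "", "<code>Tag</code>"))  # Tag

# re.S lets "." match a newline
print(bool(re.search(r"a.b", "a\nb")), bool(re.search(r"a.b", "a\nb", flags=re.S)))  # False True
```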