Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the re module under
two names, a long name such as IGNORECASE, and a short,
one-letter form such as I. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of re.VERBOSE is re.X, for example.)
Multiple flags can be specified by bitwise OR-ing them; re.I |
re.M sets both the I and M flags, for example.
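
As a minimal sketch of combining flags, the snippet below compiles one pattern with both I and M set. The sample text and variable names are made up for illustration.

```python
import re

# The long and short names refer to the same flag value.
same_flag = (re.IGNORECASE == re.I) and (re.MULTILINE == re.M)

# Combine two flags with bitwise OR: case-insensitive and multi-line.
pattern = re.compile(r"^hello", re.I | re.M)

text = "Hello world\nHELLO again\nsay hello"

# Only lines that *start* with "hello" (in any case) match.
matches = pattern.findall(text)
print(matches)  # ['Hello', 'HELLO']
```

The third line contains "hello" but does not begin with it, so it is not matched.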
Here's a table of the available flags, followed by a more detailed explanation of each one.
| Flag | Meaning |
|---|---|
| DOTALL, S | Make . match any character, including newlines |
| IGNORECASE, I | Do case-insensitive matches |
| LOCALE, L | Do a locale-aware match |
| MULTILINE, M | Multi-line matching, affecting ^ and $ |
| VERBOSE, X | Enable verbose REs, which can be organized more cleanly and understandably |
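
To make the first two rows of the table concrete, here is a small sketch of DOTALL and IGNORECASE. The sample strings are arbitrary and chosen only for illustration.

```python
import re

text = "start\nend"

# By default "." does not match a newline, so this pattern cannot span lines.
no_dotall = re.match(r"start.end", text)
# With DOTALL, "." also matches the newline between the two words.
with_dotall = re.match(r"start.end", text, re.DOTALL)

# Without IGNORECASE, letters must match in the same case.
case_sensitive = re.search(r"spam", "SPAM and eggs")
# With IGNORECASE, "spam" matches "SPAM".
case_insensitive = re.search(r"spam", "SPAM and eggs", re.IGNORECASE)

print(no_dotall, bool(with_dotall))            # None True
print(case_sensitive, case_insensitive.group())  # None SPAM
```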
Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but for 8-bit (bytes) patterns \w only matches the character class [A-Za-z0-9_]; it won't match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.
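
The following sketch illustrates the difference under Python 3, where LOCALE may only be used with bytes patterns and str patterns are Unicode-aware by default. The helper `locale_words` and the locale name `fr_FR.ISO8859-1` are assumptions made for this example; whether that locale is installed depends on the system, so the helper returns None when it is not available.

```python
import locale
import re

text = "café garçon"
data = text.encode("latin-1")  # b'caf\xe9 gar\xe7on'

# Bytes pattern without LOCALE: \w covers only ASCII [A-Za-z0-9_],
# so the accented bytes split the words.
ascii_words = re.findall(rb"\w+", data)
print(ascii_words)  # [b'caf', b'gar', b'on']

# str pattern in Python 3: \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['café', 'garçon']


def locale_words(raw, locale_name="fr_FR.ISO8859-1"):
    """Hypothetical helper: match with re.LOCALE under an assumed locale.

    Returns None if the requested locale is not installed.
    """
    previous = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        return re.findall(rb"\w+", raw, re.LOCALE)
    finally:
        # Always restore the locale that was active before.
        locale.setlocale(locale.LC_CTYPE, previous)


print(locale_words(data))
```

On a system where the assumed locale exists, the last call should treat the accented bytes as letters; elsewhere it simply prints None.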
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When the MULTILINE flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string or at the end of each line (immediately preceding each newline).
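
A short sketch of the effect of MULTILINE on ^ and $ follows; the three-line sample text is invented for the example.

```python
import re

text = "first line\nsecond line\nthird line\n"

# Without MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Without MULTILINE, $ matches only at the end of the string
# (here, just before the final newline).
ends_default = [m.start() for m in re.finditer(r"line$", text)]
# With MULTILINE, $ matches before every newline.
ends_multiline = [m.start() for m in re.finditer(r"line$", text, re.MULTILINE)]

print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'second', 'third']
print(len(ends_default), len(ends_multiline))  # 1 3
```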
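When the VERBOSE flag is specified, whitespace within the RE is ignored (except inside a character class or when preceded by an unescaped backslash), and a # that is neither in a character class nor escaped starts a comment that runs to the end of the line. This lets you indent and annotate a RE. For example, here's a RE that uses re.VERBOSE; see how much easier it is to read?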
    charref = re.compile(r"""
     &[#]                           # Start of a numeric entity reference
     (
         [0-9]+[^0-9]               # Decimal form
       | 0[0-7]+[^0-7]              # Octal form
       | x[0-9a-fA-F]+[^0-9a-fA-F]  # Hexadecimal form
     )
    """, re.VERBOSE)
Without the verbose setting, the RE would look like this:
    charref = re.compile("&#([0-9]+[^0-9]"
                         "|0[0-7]+[^0-7]"
                         "|x[0-9a-fA-F]+[^0-9a-fA-F])")
In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to understand than the version using re.VERBOSE.
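
As a quick check that the two spellings describe the same RE, the sketch below compiles both and compares their behaviour on a handful of inputs. The names `verbose`, `compact`, and `samples` are illustrative only.

```python
import re

# Verbose form: whitespace is ignored and "#" starts a comment,
# except inside a character class such as [#].
verbose = re.compile(r"""
 &[#]                           # Start of a numeric entity reference
 (
     [0-9]+[^0-9]               # Decimal form
   | 0[0-7]+[^0-7]              # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F]  # Hexadecimal form
 )
""", re.VERBOSE)

# Compact form, built from adjacent string literals.
compact = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

samples = ["&#65;", "&#x41;", "&#0101;", "&amp;", "&#;"]

# For each sample, record whether each pattern matches at the start.
results = [(s, bool(verbose.match(s)), bool(compact.match(s)))
           for s in samples]

for sample, v, c in results:
    print(sample, v, c)
```

Both patterns accept the three numeric references and reject the other two samples. Note that the group also captures the terminating character, so `verbose.match("&#65;").group(1)` is `"65;"`.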