Product Documentation

ActivePython 2.5 Documentation

4.4 Lookahead Assertions

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:

(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn't match at the current position in the string.

An example will help make this concrete by demonstrating a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a ".". For example, in "news.rc", "news"is the base name, and "rc" is the filename's extension.

The pattern to match this is quite simple:

.*[.].*$

Notice that the "." needs to be treated specially because it's a metacharacter; I've put it inside a character class. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches "foo.bar" and "autoexec.bat" and "sendmail.cf" and "printers.conf".

Now, consider complicating the problem a bit; what if you want to match filenames where the extension is not "bat"? Some incorrect attempts:

.*[.][^b].*$

The first attempt above tries to exclude "bat" by requiring that the first character of the extension is not a "b". This is wrong, because the pattern also doesn't match "foo.bar".

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn't "b"; the second character isn't "a"; or the third character isn't "t". This accepts "foo.bar" and rejects "autoexec.bat", but it requires a three-letter extension and won't accept a filename with a two-letter extension such as "sendmail.cf". We'll complicate the pattern again in an effort to fix it.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as "sendmail.cf".

The pattern's getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both "bat" and "exe" as extensions, the pattern would get even more complicated and confusing.

A negative lookahead cuts through all this:

.*[.](?!bat$).*$

The lookahead means: if the expression bat doesn't match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. The trailing $ is required to ensure that something like "sample.batch", where the extension only starts with "bat", will be allowed.

Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either "bat" or "exe":

.*[.](?!bat$|exe$).*$