python-Tutor
Re: [Tutor] newbie re question
Jul 8 2003 5:41PM
hi Danny,

ah yes, I have seen Ping at various parties (and wearing a PythonLabs
shirt no less!).  But I digress.  I am still confused why you provided for
a negative lookahead.  I looked at amk's definition of a negative lookahead,
and it seems to say the regex will not match if the negative lookahead
condition is met.  So:

>>> testsearch = re.compile('tetsuro(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('hitetsuroone')
<_sre.SRE_Match object at 0x860e4a0>

Now in the case of:

>>> myre = re.compile(r'http://[\w\.-]+\.?(?![\w.-/])')

you are looking for 'http://', then one or more word characters, periods
and hyphens, then an optional period, and then a negative lookahead of a
word character, any character, a hyphen and a forward slash.  Granted,
your regex may have been a sloppy example and you might have meant a
negative lookahead of a word character, a period, a hyphen and a forward
slash.  I still do not understand why you provided for one, or, if you had
a good reason, why the sfgate url would match at all, since it clearly
has a word character, period, or hyphen following the set of characters
you were allowing for, including the optional period.  Here is an example
of something similar that perplexes me:

>>> testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('tetsuro.hello')
<_sre.SRE_Match object at 0x8612028>
>>> match = testsearch.search('tetsuro.hello')
>>> match.group()
'tetsuro'
>>> match = testsearch.search('tetsuro..hello')
>>> match.group()
'tetsuro.'

Why wasn't the first period caught?
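
A likely explanation for the session above, offered as a sketch rather than a definitive answer: the optional period is allowed to match nothing.  When taking the period would put 'hello' right after the match, the lookahead fails, so the engine backtracks and tries again with zero periods.  The next characters are then '.hello', not 'hello', and the lookahead passes.  The helper `matched` and the pattern `strict` below are hypothetical names added for illustration; they do not come from the thread.

```python
import re

# The pattern from the session above.
testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)


def matched(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None


# No period is present, so 'hello' always follows directly: no match.
print(matched(testsearch, 'tetsurohello'))

# Taking the period leaves 'hello' next, so the lookahead fails.
# The engine backtracks to zero periods; '.hello' follows, which is
# not 'hello', so the lookahead passes and only 'tetsuro' is matched.
print(matched(testsearch, 'tetsuro.hello'))

# Here the period can be taken, because '.hello' (not 'hello') follows it.
print(matched(testsearch, 'tetsuro..hello'))

# Hypothetical stricter variant (an assumption, not from the thread):
# the lookahead also allows for an optional period, so backtracking
# over the period no longer lets 'tetsuro.hello' slip through.
strict = re.compile(r'tetsuro\.?(?!\.?hello)', re.IGNORECASE)
print(matched(strict, 'tetsuro.hello'))
print(matched(strict, 'hitetsuroone'))
```

Whether the stricter form is what was wanted depends on the intent; it only shows that the lookahead has to account for everything the optional part might give back.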

On Mon, 30 Jun 2003, Danny Yoo wrote:

> 
> 
>  On Mon, 30 Jun 2003 tpc@[...].edu wrote:
> 
>  > hi Danny, I had a question about your quick intro to Python lesson sheet
>  > you gave out back in March 2001.
> 
>  Hi tpc,
> 
> 
> 
>  Some things will never die.  *grin*
> 
> 
> 
>  > The last page talks about formulating a regular expression to handle
>  > URLs, and you have the following:
>  >
>  > myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')
> 
> 
> 
>  Ok.  Let's split that up using verbose notation:
> 
> 
>  ###
>  myre = re.compile(r'''http://            ## protocol
>                        [\w\.-/]+          ## followed by a bunch of "word"
>                                           ## characters
> 
>                        \.?                ## followed by an optional
>                                           ## period.
> 
>                        (?!                ## Topped with a negative
>                                           ## lookahead for
>                              [\w.-/]      ## "word" character.
> 
>                        )''', re.VERBOSE)
>  ###
> 
> 
>  The page:
> 
>      http://www.python.org/doc/lib/re-syntax.html
> 
>  has more details about some of the regular expression syntax.  AMK has
>  written a nice regex HOWTO here:
> 
>      http://www.amk.ca/python/howto/regex/regex.html
> 
>  which you might find useful.
> 
> 
> 
> 
>  > I understand \w stands for any word character, and \. means escaped
>  > period, and ? means zero or one instances of a character or set.  I did
>  > a myre.search('http://www.hotmail.com') which yielded a result, but I am
>  > confused as to why
>  >
>  > myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>  >
>  > would work, since there is a '=' and you don't provide for one in the
>  > regular expression.
> 
> 
>  Very true.  It should, however, match against the negative lookahead ---
>  the regex tries to look ahead to see that it can match something like:
> 
>      "This is an url: http://python.org.  Isn't that neat?"
> 
> 
>  The negative lookahead should match right here:
> 
>      "This is an url: http://python.org.  Isn't that neat?"
>                                         ^
> 
>  In your url above, the negative lookahead should actually hit the question
>  mark first before it sees '='.  That regex was a sloppy example; I should
>  have been more careful with it, but I was in a hurry when I wrote that
>  intro...  *grin*
> 
> 
> 
>  If you're in the Berkeley area, by the way, you might want to see if
>  Ka-Ping Yee is planning another CS 198 class in the future:
> 
>      http://zesty.ca/bc/info.html
> 
> 
> 
> 
> 
>  Anyway, we can experiment with this more easily by introducing a group
>  into the regular expression:
> 
>  ###
>  myre = re.compile(r'''
>                      (                  ## group 1
> 
>                        http://            ## protocol
>                        [\w\.-/]+          ## followed by a bunch of "word"
>                                           ## characters
> 
>                        \.?                ## followed by an optional
>                                           ## period.
> 
>                      )                  ## end group
> 
> 
>                        (?!                ## Topped with a negative
>                                           ## lookahead for
>                              [\w.-/]      ## "word" character.
> 
>                        )''', re.VERBOSE)
>  ###
> 
> 
> 
>  Let's check it now:
> 
>  ###
>  >>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>  >>> match.group(1)
>  'http://www.sfgate.com/cgi'
>  ###
> 
> 
> 
>  Oiii!  The regular expression is broken.  What has happened is that I've
>  incorrectly defined the hyphen in the character group.  That is, instead
>  of
> 
> 
>      [\w\.-/]+
> 
> 
>  I should have done:
> 
>      [\w./-]+
> 
> 
>  instead, to keep the regex engine from treating the hyphen as a span of
>  characters (like "[a-z]", or "[0-9]").  You can then introduce the other
>  characters into the "word" character class, and then it should correctly
>  match the sfgate url.
> 
> 
> 
>  I hope this helps!
> 
> 
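
The hyphen fix described in the quoted message can be checked with a short sketch.  It leaves out the optional period and the lookahead so that only the character class is being tested, which is a simplification and not the regex from the lesson sheet.  The names `broken`, `fixed` and `extended` are illustrative, and the exact set of extra characters in `extended` is an assumption about what "the other characters" would be for this one url.

```python
import re

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Inside [...], '.-/' is read as a range from '.' to '/'.  Those two
# characters are adjacent in ASCII, so the range holds only '.' and '/',
# and the hyphen itself is not in the class.
broken = re.compile(r'http://[\w\.-/]+')

# With the hyphen last in the class, it is taken literally.
fixed = re.compile(r'http://[\w./-]+')

# Assumed extension: also allow '?' and '=' so the query string matches.
extended = re.compile(r'http://[\w./?=-]+')

broken_match = broken.search(url).group()
fixed_match = fixed.search(url).group()
extended_match = extended.search(url).group()

print(broken_match)    # stops at the hyphen in 'cgi-bin'
print(fixed_match)     # stops at the '?'
print(extended_match)  # the whole url
```

Placing the hyphen first or last in a character class, or escaping it as `\-`, avoids the accidental range.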


_______________________________________________
Tutor maillist  -  Tutor@[...].org
http://mail.python.org/mailman/listinfo/tutor