Re: [Tutor] newbie re question
Jul 8 2003 5:41PM
hi Danny,
ah yes, I have seen Ping at various parties (and wearing a PythonLabs
shirt no less!). But I digress. I am still confused about why you included
a negative lookahead. I looked at amk's definition of a negative lookahead,
and it seems to say the regex will not match if the negative lookahead
condition is met. So:
>>> testsearch = re.compile('tetsuro(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('hitetsuroone')
<_sre.SRE_Match object at 0x860e4a0>
Now in the case of:
>>> myre = re.compile(r'http://[\w\.-]+\.?(?![\w.-/])')
you are looking for 'http://', then one or more word characters, periods
and hyphens, then an optional period, and then a negative lookahead for a
word character, any character, a hyphen or a forward slash. Granted,
your regex may have been a sloppy example and you might have meant a
negative lookahead for a word character, a period, a hyphen or a forward
slash. I still do not understand why you included one and, if you had
a good reason, why the sfgate url would match at all, since a word
character, period, or hyphen clearly follows the set of characters
you were allowing for, including the optional period. Here is an example
of something similar that perplexes me:
>>> testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('tetsuro.hello')
<_sre.SRE_Match object at 0x8612028>
>>> match = testsearch.search('tetsuro.hello')
>>> match.group()
'tetsuro'
>>> match = testsearch.search('tetsuro..hello')
>>> match.group()
'tetsuro.'
Why wasn't the first period caught?
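
One likely explanation is backtracking. Because '\.?' is optional, when the
lookahead fails after the period has been consumed, the engine retries with
zero periods; the text ahead is then '.hello' rather than 'hello', so the
negative lookahead passes and the match is just 'tetsuro'. A minimal sketch
reproducing the three searches above (the helper name 'show' and the
'strict' pattern are my own hypothetical additions, not something from
Danny's sheet):

```python
import re

# Same pattern as in the interactive session above.
testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)

def show(text):
    """Return the matched text, or None when the search fails."""
    m = testsearch.search(text)
    return m.group() if m else None

results = {
    # No period to give back, and 'hello' follows: no match at all.
    'tetsurohello': show('tetsurohello'),
    # With the period consumed, 'hello' follows and the lookahead fails;
    # the engine backtracks to zero periods, sees '.hello', and succeeds.
    'tetsuro.hello': show('tetsuro.hello'),
    # The first period is consumed and '.hello' follows, so the lookahead
    # passes straight away.
    'tetsuro..hello': show('tetsuro..hello'),
}

# Hypothetical stricter variant: the lookahead also skips an optional
# period, so giving the period back no longer lets the match through.
strict = re.compile(r'tetsuro\.?(?!\.?hello)', re.IGNORECASE)

for text, matched in results.items():
    print(repr(text), '->', repr(matched))
print(strict.search('tetsuro.hello'))
```

Moving the optional period into the lookahead is only one way to reject
'tetsuro.hello'; whether that is the behavior you want depends on what the
pattern is meant to exclude.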
On Mon, 30 Jun 2003, Danny Yoo wrote:
>
>
> On Mon, 30 Jun 2003 tpc@[...].edu wrote:
>
> > hi Danny, I had a question about your quick intro to Python lesson sheet
> > you gave out back in March 2001.
>
> Hi tpc,
>
>
>
> Some things will never die. *grin*
>
>
>
> > The last page talks about formulating a regular expression to handle
> > URLs, and you have the following:
> >
> > myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')
>
>
>
> Ok. Let's split that up using verbose notation:
>
>
> ###
> myre = re.compile(r'''http:// ## protocol
> [\w\.-/]+ ## followed by a bunch of "word"
> ## characters
>
> \.? ## followed by an optional
> ## period.
>
> (?! ## Topped with a negative
> ## lookahead for
> [\w.-/] ## "word" character.
>
> )''', re.VERBOSE)
> ###
>
>
> The page:
>
> http://www.python.org/doc/lib/re-syntax.html
>
> has more details about some of the regular expression syntax. AMK has
> written a nice regex HOWTO here:
>
> http://www.amk.ca/python/howto/regex/regex.html
>
> which you might find useful.
>
>
>
>
> > I understand \w stands for any word character, and \. means escaped
> > period, and ? means zero or one instances of a character or set. I did
> > a myre.search('http://www.hotmail.com') which yielded a result, but I am
> > confused as to why
> >
> > myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
> >
> > would work, since there is a '=' and you don't provide for one in the
> > regular expression.
>
>
> Very true. It should, however, match against the negative lookahead ---
> the regex tries to look ahead to see that it can match something like:
>
> "This is an url: http://python.org. Isn't that neat?"
>
>
> The negative lookahead should match right here:
>
> "This is an url: http://python.org. Isn't that neat?"
>                                    ^
>
> In your url above, the negative lookahead should actually hit the question
> mark first before it sees '='. That regex was a sloppy example; I should
> have been more careful with it, but I was in a hurry when I wrote that
> intro... *grin*
>
>
>
> If you're in the Berkeley area, by the way, you might want to see if
> Ka-Ping Yee is planning another CS 198 class in the future:
>
> http://zesty.ca/bc/info.html
>
>
>
>
>
> Anyway, we can experiment with this more easily by introducing a group
> into the regular expression:
>
> ###
> myre = re.compile(r'''
> ( ## group 1
>
> http:// ## protocol
> [\w\.-/]+ ## followed by a bunch of "word"
> ## characters
>
> \.? ## followed by an optional
> ## period.
>
> ) ## end group
>
>
> (?! ## Topped with a negative
> ## lookahead for
> [\w.-/] ## "word" character.
>
> )''', re.VERBOSE)
> ###
>
>
>
> Let's check it now:
>
> ###
> >>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
> >>> match.group(1)
> 'http://www.sfgate.com/cgi'
> ###
>
>
>
> Oiii! The regular expression is broken. What has happened is that I've
> incorrectly defined the hyphen in the character group. That is, instead
> of
>
>
> [\w\.-/]+
>
>
> I should have done:
>
> [\w./-]+
>
>
> so that the regex engine doesn't treat the hyphen as marking a span of
> characters (like "[a-z]", or "[0-9]"). You can then introduce the other
> characters into the "word" character class, and then it should correctly
> match the sfgate url.
>
>
>
> I hope this helps!
>
>
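
To check the hyphen fix described above, here is a small sketch comparing
the two character classes on the sfgate url. The variable names and the
third pattern, which adds '?' and '=' to the class, are my own assumptions
about what "the other characters" would be, not Danny's final regex:

```python
import re

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Original class: '.-/' is read as the range '.' to '/', so '-' is excluded.
broken = re.compile(r'(http://[\w\.-/]+\.?)(?![\w.-/])')

# Hyphen moved to the end of the class, where it is a literal hyphen.
fixed = re.compile(r'(http://[\w./-]+\.?)(?![\w./-])')

# Assumed extension: also allow '?' and '=' so query strings are matched.
extended = re.compile(r'(http://[\w./?=-]+\.?)(?![\w./?=-])')

broken_match = broken.search(url).group(1)
fixed_match = fixed.search(url).group(1)
extended_match = extended.search(url).group(1)

print(broken_match)    # stops before the hyphen in 'cgi-bin'
print(fixed_match)     # stops before the '?'
print(extended_match)  # covers the whole url
```

A real URL matcher would need more characters than this (and probably a
different approach altogether); the point here is only where the hyphen
sits inside the brackets.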
_______________________________________________
Tutor maillist - Tutor@[...].org
http://mail.python.org/mailman/listinfo/tutor