python-Tutor
Re: [Tutor] newbie re question
by Danny Yoo
Jun 30 2003 6:06PM
On Mon, 30 Jun 2003 tpc@[...].edu wrote:

>  hi Danny, I had a question about your quick intro to Python lesson sheet
>  you gave out back in March 2001.

Hi tpc,



Some things will never die.  *grin*



>  The last page talks about formulating a regular expression to handle
>  URLs, and you have the following:
> 
>  myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')



Ok.  Let's split that up using verbose notation:


###
import re

myre = re.compile(r'''http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###


The page:

    http://www.python.org/doc/lib/re-syntax.html

has more details about some of the regular expression syntax.  AMK has
written a nice regex HOWTO here:

    http://www.amk.ca/python/howto/regex/regex.html

which you might find useful.
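
If the verbose notation is new to you, here is a small sketch of what
re.VERBOSE does: whitespace in the pattern is ignored and '#' starts a
comment, so the one-line form and the spread-out form should behave the
same.  The names "compact", "verbose", and "sample", and the sample string
itself, are made up for illustration.

```python
import re

# The one-line regex from the lesson sheet, unchanged.
compact = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

# The same regex spread out.  re.VERBOSE ignores whitespace outside of
# character classes and treats '#' to end-of-line as a comment.
verbose = re.compile(r'''http://        # protocol
                         [\w\.-/]+      # character class, one or more
                         \.?            # optional period
                         (?![\w.-/])    # negative lookahead
                      ''', re.VERBOSE)

sample = 'Try http://www.hotmail.com today'   # made-up test string
print(compact.search(sample).group())
print(verbose.search(sample).group())
```

Both searches should print the same URL, which is a quick way to check
that nothing was lost while splitting the pattern up.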




>  I understand \w stands for any word character, and \. means escaped
>  period, and ? means zero or one instances of a character or set.  I did
>  a myre.search('http://www.hotmail.com') which yielded a result, but I am
>  confused as to why
> 
>  myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
> 
>  would work, since there is a '=' and you don't provide for one in the
>  regular expression.


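Very true.  The '=' isn't in the character class, so the regex can't match
through it.  The part that matters here is the negative lookahead: once the
character class runs out, the regex peeks ahead to make sure the next
character isn't another URL character.  The idea was to pull a URL out of
a sentence like: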

    "This is an url: http://python.org.  Isn't that neat?"


The negative lookahead should succeed right here:

    "This is an url: http://python.org.  Isn't that neat?"
                                       ^

In your URL above, the match should actually stop at the question mark,
before it ever reaches the '='.  That regex was a sloppy example; I should
have been more careful with it, but I was in a hurry when I wrote that
intro...  *grin*
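
To see the lookahead in action, here is a minimal sketch that runs the
regex, exactly as quoted, against the example sentence.  It prints the text
that was matched and the character sitting right after the match, which is
the one the lookahead examined.  The variable names are just for
illustration.

```python
import re

# (?!...) succeeds only if the pattern inside does NOT match at the
# current position.  It never consumes any characters itself.
myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

sentence = "This is an url: http://python.org.  Isn't that neat?"
match = myre.search(sentence)

print(repr(match.group()))           # the text that was matched
print(repr(sentence[match.end()]))   # the character the lookahead looked at
```

Since '.' is itself inside the character class, the matched text here keeps
the sentence's trailing period, and the lookahead passes because the next
character is a space.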



If you're in the Berkeley area, by the way, you might want to see if
Ka-Ping Yee is planning another CS 198 class in the future:

    http://zesty.ca/bc/info.html





Anyway, we can experiment with this more easily by introducing a group
into the regular expression:

###
myre = re.compile(r'''
                    (                  ## group 1

                      http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                    )                  ## end group


                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###



Let's check it now:

###
>>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>>> match.group(1)
'http://www.sfgate.com/cgi'
###



Oiii!  The regular expression is broken.  What has happened is that I've
put the hyphen in the wrong spot in the character class, so the match
stops dead at the '-' in "cgi-bin".  That is, instead of


    [\w\.-/]+


I should have done:

    [\w./-]+


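instead, to keep the regex engine from treating the hyphen as a range of
characters (like "[a-z]", or "[0-9]").  In the broken version, '.-/' means
"everything from '.' through '/'", which is only those two characters, so
a literal hyphen never matches.  With that fixed, you can add the other
characters you need (like '?' and '=') to the character class, and then it
should correctly match the sfgate url.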
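
Here is a rough sketch of the difference.  The string "text" is made up to
contain one each of '-', '.', and '/', and the "extended" pattern assumes
that '?' and '=' are the extra characters you want to allow; adjust to
taste.

```python
import re

text = 'a-b.c/d'   # made-up string with one '-', one '.', one '/'

# In the middle of a class, '.-/' is a range from '.' to '/'.  That
# range covers only those two characters, not the hyphen.
print(re.findall(r'[.-/]', text))

# With the hyphen last in the class, it is a literal hyphen.
print(re.findall(r'[./-]', text))

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# The regex with the hyphen moved to the end of both classes.
fixed = re.compile(r'http://[\w./-]+\.?(?![\w./-])')

# Assumption: '?' and '=' are the "other characters" to add.
extended = re.compile(r'http://[\w./?=-]+\.?(?![\w./?=-])')

print(fixed.search(url).group())
print(extended.search(url).group())
```

The "fixed" pattern should now get past "cgi-bin" and stop at the '?',
while the "extended" one should take in the whole URL.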



I hope this helps!


_______________________________________________
Tutor maillist  -  Tutor@[...].org
http://mail.python.org/mailman/listinfo/tutor