Re: advanced regex, was: Re: scanf style parsing
by Skip Montanaro
Oct 5 2001 2:14AM
Hans-Peter> Well, yesterday, I tried to parse some simple hexdump,
Hans-Peter> produced by tcpdump -xs1500 port 80. The idea was, filter
Hans-Peter> the hexcodes, and display any 7 bit ascii codes like a
Hans-Peter> little advanced hex monitors do.
Hans-Peter> As I'm fairly new to advanced regex constructs, would
Hans-Peter> somebody enlighten me how to efficiently parse lines like:
Hans-Peter> 2067 726f 7570 732e 2e2e 3c2f 613e 3c2f
Hans-Peter> 666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c
Hans-Peter> 7472 3e3c 7464 2062 6763 6f6c 6f72 3d23
Hans-Peter> 6666 6363 3333 2063 6f6c 7370 616e 3d34
Hans-Peter> 3e3c 494d 4720 6865 6967 6874 3d31 2073
Hans-Peter> 7263 3d22 2f69 6d61 6765 732f 636c 6561
Hans-Peter> 7264 6f74 2e67 6966 2220 7769 6474 683d
Hans-Peter> 3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74
Hans-Peter> 6162 6c65 3e3c 703e 3c66 6f6e 7420 7369
Hans-Peter> 7a65 3d2d 313e 4172 6520 796f 7520 6120
Hans-Peter> with respect to varying column numbers. I will refrain to
Hans-Peter> show my stupid beginnings, but I wasn't able to get that
Hans-Peter> _one_ regex right, with all columns in matchobj.groups()
Hans-Peter> listed.
I'm not sure quite what you're looking for, but this data is so regular I
wouldn't use regular expressions to parse it (no pun intended).
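For what it's worth, the likely reason a single regex never lists every column in matchobj.groups() is that a repeated group only remembers its last repetition. A rough sketch of that behaviour, and of re.findall as the usual way around it when the column count varies (the sample line is the first line of your dump; the patterns and the name hex_fields are just my illustration, not anything from your code):

```python
import re

line = "2067 726f 7570 732e 2e2e 3c2f 613e 3c2f"

# One group repeated with "+": the match succeeds, but the group
# only keeps the text of its final repetition.
m = re.match(r"(?:([0-9a-f]{4})\s*)+$", line)
print(m.groups())

# findall returns every non-overlapping match, however many
# columns the line happens to have.
hex_fields = re.findall(r"[0-9a-f]{2,4}", line)
print(hex_fields)
```

For data this regular, findall ends up returning the same list as line.split().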
Assuming the above stream is coming in on stdin and I wanted to display
any printable ASCII characters, I'd start with something like this:
    import sys

    for line in sys.stdin.readlines():
        line = line.strip()
        fields = line.split()
        printing = []
        for pair in fields:
            # each field is four hex digits, i.e. two bytes
            first = chr(int(pair[:2], 16))
            second = chr(int(pair[2:], 16))
            # show anything outside printable ASCII as "."
            if first < " " or first > "~":
                first = "."
            if second < " " or second > "~":
                second = "."
            printing.extend([first, second])
        print("%s %s" % (line, "".join(printing)))
The above hex data fed to this code produces
    2067 726f 7570 732e 2e2e 3c2f 613e 3c2f  groups...</a></
    666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c font></td></tr><
    7472 3e3c 7464 2062 6763 6f6c 6f72 3d23 tr><td bgcolor=#
    6666 6363 3333 2063 6f6c 7370 616e 3d34 ffcc33 colspan=4
    3e3c 494d 4720 6865 6967 6874 3d31 2073 ><IMG height=1 s
    7263 3d22 2f69 6d61 6765 732f 636c 6561 rc="/images/clea
    7264 6f74 2e67 6966 2220 7769 6474 683d rdot.gif" width=
    3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74 1 ></td></tr></t
    6162 6c65 3e3c 703e 3c66 6f6e 7420 7369 able><p><font si
    7a65 3d2d 313e 4172 6520 796f 7520 6120 ze=-1>Are you a 
on stdout.
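The same split-and-convert idea can be written more compactly. This is only a sketch, assuming a Python 3 interpreter where bytes.fromhex is available; the function name ascii_column is made up for illustration:

```python
def ascii_column(hex_line):
    """Return the printable-ASCII rendering of one line of hex groups."""
    # Join the groups before decoding, so a short trailing group
    # (two hex digits, i.e. one byte) is handled as well.
    data = bytes.fromhex("".join(hex_line.split()))
    # Keep bytes 0x20 through 0x7e, show everything else as "."
    return "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in data)


line = "2067 726f 7570 732e 2e2e 3c2f 613e 3c2f"
print(line, ascii_column(line))
```

Joining the groups first also means a line ending in a lone byte does not need a special case, which the pair[:2] / pair[2:] slicing would.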
--
Skip Montanaro (skip@pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/