Title: Normalizing newlines between Windows/Unix/Macs
Submitter: Ori Peleg
Last Updated: 2005/07/02
Version no: 1.0
Category: Text
Description:
Text generated on different platforms uses different newline conventions, which gets in the way when comparing it. This recipe normalizes any string to use Unix-style newlines.
This code is used in the TestOOB unit testing framework (http://testoob.sourceforge.net).
Source:
    def _normalize_newlines(string):
        import re
        return re.sub(r'(\r\n|\r|\n)', '\n', string)
Discussion:
I've tested this on POSIX and Windows. Anyone with an old Mac care to try it? :-)
Number of comments: 4
Speed up by precompiling regular expression, Andreas Kloss, 2005/07/20
At the expense of one more line (and of having the re module and the compiled regular expression in your namespace), you can gain some speed by pulling almost everything out of the function. On my PC, for the contents of a random Python script, it finishes in a third of the time. Of course, this pays off most if you call the function often.
    import re

    # Compiled once at import time; matches CRLF, lone CR, or LF.
    _newlines_re = re.compile(r'(\r\n|\r|\n)')

    def _normalize_newlines(string):
        return _newlines_re.sub('\n', string)
Good point, Ori Peleg, 2005/08/05
When this function shows up in my profiler I'll probably do this.
Until it does, I prefer the greater readability -- in my eyes -- of not precompiling the expression.
Don't use regular expressions when not really needed, Not specified, 2005/07/20
It's even better to do two replace calls:
    #!/usr/bin/env python
    import profile, re, random

    s = "".join([random.choice(" \n\r") for i in range(10000)])

    def use_re_sub():
        global r1
        for i in range(1000):
            r1 = re.sub(r'(\r\n|\r|\n)', '\n', s)

    _newlines_re = re.compile(r'(\r\n|\r|\n)')

    def use_re_compile():
        global r2
        for i in range(1000):
            r2 = _newlines_re.sub('\n', s)

    def use_replace():
        global r3
        for i in range(1000):
            r3 = s.replace('\r\n', '\n').replace('\r', '\n')

    profile.run('use_re_sub()')
    profile.run('use_re_compile()')
    profile.run('use_replace()')

    assert r1 == r2 == r3
The last version is several times faster (of course this also depends on the string you convert).
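The profile module adds its own per-call overhead, so for a comparison like this one the standard library's timeit module may give a cleaner picture. The following is a minimal sketch under that assumption, not the commenter's script: the function names, the fixed seed, and the repeat count are illustrative choices, and the timings it prints will vary with the machine and the input.

```python
import random
import re
import timeit

random.seed(0)  # fixed seed (an arbitrary choice) so the sample is repeatable
sample = "".join(random.choice(" \n\r") for _ in range(10000))

_newlines_re = re.compile(r'(\r\n|\r|\n)')

def with_re_sub(text):
    # Pattern passed as a string on every call.
    return re.sub(r'(\r\n|\r|\n)', '\n', text)

def with_compiled(text):
    # Pattern compiled once at module level.
    return _newlines_re.sub('\n', text)

def with_replace(text):
    # Two plain string replacements, CRLF first.
    return text.replace('\r\n', '\n').replace('\r', '\n')

candidates = (with_re_sub, with_compiled, with_replace)

# All three approaches should produce the same normalized text.
results = [func(sample) for func in candidates]

for func in candidates:
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print("%-14s %.4f s" % (func.__name__, seconds))
```

Collecting the results before timing keeps the correctness check separate from the measurement, much as the assert at the end of the script above does.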
The regular expression isn't there for special features, Ori Peleg, 2005/08/05
It's there for readability.
Replacing (\r\n|\r|\n) with whatever (I arbitrarily chose '\n') sits in my mind fairly well. And I understand that regex at a single glance.
I agree that using two replaces, noting that '\n' need not be replaced with '\n', is both efficient and clever.
I'll probably stay with the regex, though, because I find it easier to understand, and it isn't a performance hit in my application yet.
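The two-replace approach depends on the order of the calls: '\r\n' has to be handled before the lone '\r'. The sketch below illustrates this with a made-up sample string; both function names are hypothetical, and wrong_order exists only as a counter-example.

```python
def normalize_with_replace(text):
    # CRLF first, then any CR that is left over; '\n' needs no replacement.
    return text.replace('\r\n', '\n').replace('\r', '\n')

def wrong_order(text):
    # Hypothetical counter-example: handling '\r' first turns each CRLF
    # into two newlines, after which the '\r\n' replace finds nothing.
    return text.replace('\r', '\n').replace('\r\n', '\n')

sample = "a\r\nb\rc\n"
print(repr(normalize_with_replace(sample)))  # 'a\nb\nc\n'
print(repr(wrong_order(sample)))             # 'a\n\nb\nc\n'
```

The regex version avoids this pitfall in a similar way, by listing '\r\n' before '\r' in the alternation.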