
Recipe 435882: Normalizing newlines between windows/unix/macs


Text generated on different platforms uses different newline conventions, which gets in the way when comparing it. This recipe normalizes any string to use Unix-style newlines.

This code is used in the TestOOB unit testing framework (http://testoob.sourceforge.net).

Python
def _normalize_newlines(string):
    import re
    return re.sub(r'(\r\n|\r|\n)', '\n', string)

Discussion

I've tested this on POSIX and Windows. Anyone with an old Mac care to try it? :-)

Comments

  1. At 12:30 a.m. on 20 jul 2005, Andreas Kloss said:

    Speed this up by precompiling the regular expression. At the expense of one more line (plus the re module and the compiled expression being added to your namespace), you can gain some speed by pulling almost everything out of the function. On my PC, for the contents of a random Python script, it finishes in a third of the time. Of course, this pays off most if you call the function often.

    import re
    _newlines_re = re.compile(r'(\r\n|\r|\n)')
    def _normalize_newlines(string):
        return _newlines_re.sub('\n', string)
    
  2. At 2:21 a.m. on 20 jul 2005, Anonymous said:

    Don't use regular expressions when they aren't really needed. It's even better to do two replace calls:

    #!/usr/bin/env python
    import profile, re, random
    
    s = "".join([random.choice(" \n\r") for i in range(10000)])
    
    def use_re_sub():
        global r1
        for i in range(1000):
            r1 = re.sub(r'(\r\n|\r|\n)', '\n', s)
    
    _newlines_re = re.compile(r'(\r\n|\r|\n)')
    def use_re_compile():
        global r2
        for i in range(1000):
            r2 = _newlines_re.sub('\n', s)
    
    def use_replace():
        global r3
        for i in range(1000):
            r3 = s.replace('\r\n', '\n').replace('\r', '\n')
    
    profile.run('use_re_sub()')
    profile.run('use_re_compile()')
    profile.run('use_replace()')
    assert r1 == r2 == r3
    

    The last version is several times faster (of course this also depends on the string you convert).

  3. At 9:22 a.m. on 5 aug 2005, Ori Peleg (the author) said:

    Good point. When this function shows up in my profiler I'll probably do this.

    Until it does, I prefer the greater readability -- in my eyes -- of not precompiling the expression.

  4. At 9:28 a.m. on 5 aug 2005, Ori Peleg (the author) said:

    The regular expression isn't there for special features. It's there for readability.

    Replacing (\r\n|\r|\n) with whatever (I arbitrarily chose '\n') sits in my mind fairly well. And I understand that regex at a single glance.

    I agree that using two replaces, noting that '\n' need not be replaced with '\n', is both efficient and clever.

    I'll probably stay with the regex, though, because I find it easier to understand, and it isn't a performance hit in my application yet.
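The trade-off discussed in the comments can be tried out with a short script. This is only a sketch: it uses timeit rather than the profile module from the comment above, the function names are made up for illustration, and the timings will vary with the machine and the input string.

```python
import random
import re
import timeit

_newlines_re = re.compile(r'\r\n|\r|\n')

def normalize_with_regex(text):
    # Precompiled variant of the recipe's substitution.
    return _newlines_re.sub('\n', text)

def normalize_with_replace(text):
    # Order matters: collapse CRLF first, so the second pass only sees lone CRs.
    return text.replace('\r\n', '\n').replace('\r', '\n')

# Random mix of spaces and newline characters, as in the comment's benchmark.
random.seed(0)
sample = "".join(random.choice(" \n\r") for _ in range(10000))

# Both approaches should produce identical output.
assert normalize_with_regex(sample) == normalize_with_replace(sample)

regex_seconds = timeit.timeit(lambda: normalize_with_regex(sample), number=100)
replace_seconds = timeit.timeit(lambda: normalize_with_replace(sample), number=100)
print("regex:   %.4f s" % regex_seconds)
print("replace: %.4f s" % replace_seconds)
```

Reversing the two replace calls would break the equivalence: replacing '\r' first turns every '\r\n' into '\n\n', doubling Windows newlines.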
