
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"

This takes a UNICODE string and replaces Latin-1 characters with something equivalent in 7-bit ASCII and returns a plain ASCII string. This function makes a best effort to convert Latin-1 characters into ASCII equivalents. It does not just strip out the Latin-1 characters. All characters in the standard 7-bit ASCII range are preserved. In the 8th bit range all the Latin-1 accented letters are converted to unaccented equivalents. Most symbol characters are converted to something meaningful. Anything not converted is deleted.

Python, 86 lines
#!/usr/bin/env python
"""
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"

This takes a UNICODE string and replaces Latin-1 characters with
something equivalent in 7-bit ASCII. This returns a plain ASCII string. 
This function makes a best effort to convert Latin-1 characters into 
ASCII equivalents. It does not just strip out the Latin1 characters.
All characters in the standard 7-bit ASCII range are preserved. 
In the 8th bit range all the Latin-1 accented letters are converted to 
unaccented equivalents. Most symbol characters are converted to 
something meaningful. Anything not converted is deleted.

Background:

One of my clients gets address data from Europe, but most of their systems 
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe, 
all I needed to do was to prepare addresses for a shipping system.
After getting headaches trying to deal with this problem using Python's 
built-in UNICODE support I gave up and decided to use some brute force.
This function converts all accented letters to their unaccented equivalents. 
I realize this is dirty, but for my purposes the mail gets delivered.
"""

def latin1_to_ascii (unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
        something equivalent in 7-bit ASCII. It returns a plain ASCII string. 
        This function makes a best effort to convert Latin-1 characters into 
        ASCII equivalents. It does not just strip out the Latin-1 characters.
        All characters in the standard 7-bit ASCII range are preserved. 
        In the 8th bit range all the Latin-1 accented letters are converted 
        to unaccented equivalents. Most symbol characters are converted to 
        something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>', 
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += str(i)
    return r

if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c),'latin-1')
    plain_ascii = latin1_to_ascii(s)
    
    print 'INPUT type:', type(s)
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT type:', type(plain_ascii)
    print 'OUTPUT:'
    print plain_ascii

One of my clients gets address data from Europe, but most of their systems cannot handle Latin-1 characters. With all due respect to the umlaut, scharfes s, cedilla, and all the other fine accented characters of Europe, all I needed to do was to prepare addresses for a shipping system. After getting headaches trying to deal with this problem using Python's built-in UNICODE support I gave up and decided to use some brute force. This function converts all accented letters to their unaccented equivalents. I realize this is dirty, but for my purposes the mail gets delivered.

If you run this script from the command line it will run a demo. It will create a UNICODE string with all the Latin-1 characters from 32 to 255. Then it will convert that string to a plain ASCII Python string and print the results.
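The listing above targets Python 2 (`has_key`, `unicode`, the `print` statement). As a rough sketch only, the same loop might look like this in Python 3. The `XLATE` table here is a small illustrative subset chosen for the example, not the full table from the recipe, and the sample address is made up.

```python
# Python 3 sketch of the recipe's loop. XLATE is an illustrative subset of
# the recipe's table (an assumption for brevity), not a replacement for it.
XLATE = {
    0xC4: "A", 0xD6: "O", 0xDC: "U",   # capital A, O, U with diaeresis
    0xE4: "a", 0xF6: "o", 0xFC: "u",   # small a, o, u with diaeresis
    0xDF: "ss",                        # sharp s
    0xE9: "e", 0xE7: "c", 0xF1: "n",
    0xBD: "{1/2}",
}


def latin1_to_ascii(text):
    """Map known Latin-1 characters, keep ASCII, drop everything else."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in XLATE:
            out.append(XLATE[code])
        elif code < 0x80:
            out.append(ch)
        # any other non-ASCII character is silently dropped
    return "".join(out)


if __name__ == "__main__":
    print(latin1_to_ascii("Müller, Königstraße 5"))  # Muller, Konigstrasse 5
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which is the part of the original loop most likely to slow down on long inputs.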

13 comments

Harvey Thomas 20 years, 5 months ago  # | flag

Better method. For the application for which this was written, the code given is OK, but it would creak a lot with a long string to convert. I'm sure the following is much faster:

import re

xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
...    }

nonasciire = re.compile(u'([\x00-\x7f]+)|([^\x00-\x7f])', re.UNICODE).sub

def latin1_to_ascii (unicrap):
    return str(nonasciire(lambda x: x.group(1) or xlate.setdefault(ord(x.group(2)), ''), unicrap))
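For readers on Python 3, a minimal sketch of the same regex idea follows. It is an adaptation, not the commenter's code: the table is a small illustrative subset, the pattern matches only non-ASCII characters, and `dict.get` is used so that the lookup does not add entries to the table the way `setdefault` does.

```python
import re

# Illustrative subset of the recipe's table (assumption: in practice the
# full table from the recipe would be used here).
XLATE = {0xE9: "e", 0xFC: "u", 0xDF: "ss", 0xAB: "<<", 0xBB: ">>"}

# Match one non-ASCII character at a time; ASCII text never reaches the callback.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def latin1_to_ascii(text):
    # Unmapped characters are replaced by the empty string, i.e. deleted.
    return NON_ASCII.sub(lambda m: XLATE.get(ord(m.group()), ""), text)


if __name__ == "__main__":
    print(latin1_to_ascii("«café» für"))  # <<cafe>> fur
```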
Tiago Henriques 20 years, 5 months ago  # | flag

unicodedata is your friend. You can save a lot of time by using unicodedata.name and unicodedata.normalize.

The following code collects all unicode characters whose name starts with 'LATIN' in a dictionary:

import unicodedata

latin_chars={}
for i in range(0xffff):
    u=unichr(i)
    try:
        n=unicodedata.name(u)
        if n.startswith('LATIN '):
            latin_chars[u]=n
    except ValueError:
        pass

This gives you a dictionary mapping each Latin character to its name, covering all the characters you might want to reduce to plain ASCII.

Remember that you can use unicode character names in python unicode strings:

unicode_a = u'\N{LATIN SMALL LETTER A}'
unicode_a_with_acute = u'\N{LATIN SMALL LETTER A WITH ACUTE}'
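As a hedged illustration for Python 3 (where `str` is already Unicode, so neither the `u''` prefix nor `unichr` is needed), the name-based lookups might be exercised like this. The variable names are made up for the example, and the scan is limited to the Latin-1 range that the recipe's table covers.

```python
import unicodedata

# Python 3 sketch: \N{...} escapes work in ordinary string literals.
a_acute = "\N{LATIN SMALL LETTER A WITH ACUTE}"

# name() and lookup() are inverses of each other.
name = unicodedata.name(a_acute)        # 'LATIN SMALL LETTER A WITH ACUTE'
same_char = unicodedata.lookup(name)    # back to the original character

# Collect the LATIN-named characters in the Latin-1 range only (0x00-0xFF).
latin1_letters = {}
for code in range(0x100):
    ch = chr(code)
    ch_name = unicodedata.name(ch, "")  # "" for unnamed control characters
    if ch_name.startswith("LATIN "):
        latin1_letters[ch] = ch_name

if __name__ == "__main__":
    print(latin1_letters["\u00df"])     # LATIN SMALL LETTER SHARP S
```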

Also, you could use the unicodedata.normalize function to decompose precomposed unicode characters into their components. For instance,

>>> unicodedata.normalize('NFKD', unicode_a_with_acute)
u'a\u0301'
>>> unicodedata.name(u'\u0301')
'COMBINING ACUTE ACCENT'

A possible approach to your problem might be:

  1. Replace every unicode character in your text with its KD normal form using unicodedata.normalize;

  2. Create a dictionary associating each unicode character that occurs in your text with its unicode name, using unicodedata.name;

  3. Discard from the dictionary all items that correspond to plain vanilla ASCII characters;

  4. Remove all unicode characters in the dictionary from your text, or replace them with some ASCII representation (e.g. u'\u0301' -> u'\N{COMBINING ACUTE ACCENT}');

Hope that helps.
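The four steps above could be sketched roughly as follows in Python 3. This is one possible reading of the outline, not the commenter's code: the function name `strip_accents` and its `describe` flag are hypothetical, and step 2 (the name dictionary) is folded into a direct `unicodedata.name` call.

```python
import unicodedata


def strip_accents(text, describe=False):
    """Hypothetical helper following the outline above."""
    # Step 1: decompose, e.g. 'ñ' becomes 'n' + COMBINING TILDE.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if ord(ch) < 0x80:
            # Step 3: plain ASCII characters pass through untouched.
            out.append(ch)
        elif describe:
            # Step 4, variant: replace the character with its Unicode name.
            out.append("{" + unicodedata.name(ch, "UNKNOWN") + "}")
        # Step 4, default: drop the non-ASCII character.
    return "".join(out)


if __name__ == "__main__":
    print(strip_accents("señor"))                 # senor
    print(strip_accents("señor", describe=True))  # sen{COMBINING TILDE}or
```

Characters that have no decomposition, such as the sharp s, are simply dropped by this sketch, so a hand-written table is still needed for them.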

Tiago Henriques 20 years, 4 months ago  # | flag

Improve readability by using unicode character names. If you don't want to make any changes to the behaviour of your code, you can still make it more readable by replacing the "xlate" dictionary with the following:

xlate = {
 u'\N{ACUTE ACCENT}': "'",
 u'\N{BROKEN BAR}': '|',
 u'\N{CEDILLA}': '{cedilla}',
 u'\N{CENT SIGN}': '{cent}',
 u'\N{COPYRIGHT SIGN}': '{C}',
 u'\N{CURRENCY SIGN}': '{currency}',
 u'\N{DEGREE SIGN}': '{degrees}',
 u'\N{DIAERESIS}': '{umlaut}',
 u'\N{DIVISION SIGN}': '/',
 u'\N{FEMININE ORDINAL INDICATOR}': '{^a}',
 u'\N{INVERTED EXCLAMATION MARK}': '!',
 u'\N{INVERTED QUESTION MARK}': '?',
 u'\N{LATIN CAPITAL LETTER A WITH ACUTE}': 'A',
 u'\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}': 'A',
 u'\N{LATIN CAPITAL LETTER A WITH DIAERESIS}': 'A',
 u'\N{LATIN CAPITAL LETTER A WITH GRAVE}': 'A',
 u'\N{LATIN CAPITAL LETTER A WITH RING ABOVE}': 'A',
 u'\N{LATIN CAPITAL LETTER A WITH TILDE}': 'A',
 u'\N{LATIN CAPITAL LETTER AE}': 'Ae',
 u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}': 'C',
 u'\N{LATIN CAPITAL LETTER E WITH ACUTE}': 'E',
 u'\N{LATIN CAPITAL LETTER E WITH CIRCUMFLEX}': 'E',
 u'\N{LATIN CAPITAL LETTER E WITH DIAERESIS}': 'E',
 u'\N{LATIN CAPITAL LETTER E WITH GRAVE}': 'E',
 u'\N{LATIN CAPITAL LETTER ETH}': 'Th',
 u'\N{LATIN CAPITAL LETTER I WITH ACUTE}': 'I',
 u'\N{LATIN CAPITAL LETTER I WITH CIRCUMFLEX}': 'I',
 u'\N{LATIN CAPITAL LETTER I WITH DIAERESIS}': 'I',
 u'\N{LATIN CAPITAL LETTER I WITH GRAVE}': 'I',
 u'\N{LATIN CAPITAL LETTER N WITH TILDE}': 'N',
 u'\N{LATIN CAPITAL LETTER O WITH ACUTE}': 'O',
 u'\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}': 'O',
 u'\N{LATIN CAPITAL LETTER O WITH DIAERESIS}': 'O',
 u'\N{LATIN CAPITAL LETTER O WITH GRAVE}': 'O',
 u'\N{LATIN CAPITAL LETTER O WITH STROKE}': 'O',
 u'\N{LATIN CAPITAL LETTER O WITH TILDE}': 'O',
 u'\N{LATIN CAPITAL LETTER THORN}': 'th',
 u'\N{LATIN CAPITAL LETTER U WITH ACUTE}': 'U',
 u'\N{LATIN CAPITAL LETTER U WITH CIRCUMFLEX}': 'U',
 u'\N{LATIN CAPITAL LETTER U WITH DIAERESIS}': 'U',
 u'\N{LATIN CAPITAL LETTER U WITH GRAVE}': 'U',
 u'\N{LATIN CAPITAL LETTER Y WITH ACUTE}': 'Y',
 u'\N{LATIN SMALL LETTER A WITH ACUTE}': 'a',
 u'\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}': 'a',
 u'\N{LATIN SMALL LETTER A WITH DIAERESIS}': 'a',
 u'\N{LATIN SMALL LETTER A WITH GRAVE}': 'a',
 u'\N{LATIN SMALL LETTER A WITH RING ABOVE}': 'a',
 u'\N{LATIN SMALL LETTER A WITH TILDE}': 'a',
 u'\N{LATIN SMALL LETTER AE}': 'ae',
 u'\N{LATIN SMALL LETTER C WITH CEDILLA}': 'c',
 u'\N{LATIN SMALL LETTER E WITH ACUTE}': 'e',
 u'\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}': 'e',
 u'\N{LATIN SMALL LETTER E WITH DIAERESIS}': 'e',
 u'\N{LATIN SMALL LETTER E WITH GRAVE}': 'e',
 u'\N{LATIN SMALL LETTER ETH}': 'th',

 u'\N{LATIN SMALL LETTER I WITH ACUTE}': 'i',
 u'\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}': 'i',
 u'\N{LATIN SMALL LETTER I WITH DIAERESIS}': 'i',
 u'\N{LATIN SMALL LETTER I WITH GRAVE}': 'i',
 u'\N{LATIN SMALL LETTER N WITH TILDE}': 'n',
 u'\N{LATIN SMALL LETTER O WITH ACUTE}': 'o',
 u'\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}': 'o',
 u'\N{LATIN SMALL LETTER O WITH DIAERESIS}': 'o',
 u'\N{LATIN SMALL LETTER O WITH GRAVE}': 'o',
 u'\N{LATIN SMALL LETTER O WITH STROKE}': 'o',
 u'\N{LATIN SMALL LETTER O WITH TILDE}': 'o',
 u'\N{LATIN SMALL LETTER SHARP S}': 'ss',
 u'\N{LATIN SMALL LETTER THORN}': 'th',
 u'\N{LATIN SMALL LETTER U WITH ACUTE}': 'u',
 u'\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}': 'u',
 u'\N{LATIN SMALL LETTER U WITH DIAERESIS}': 'u',
 u'\N{LATIN SMALL LETTER U WITH GRAVE}': 'u',
 u'\N{LATIN SMALL LETTER Y WITH ACUTE}': 'y',
 u'\N{LATIN SMALL LETTER Y WITH DIAERESIS}': 'y',
 u'\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}': '<<',
 u'\N{MACRON}': '_',
 u'\N{MASCULINE ORDINAL INDICATOR}': '{^o}',
 u'\N{MICRO SIGN}': '{micro}',
 u'\N{MIDDLE DOT}': '*',
 u'\N{MULTIPLICATION SIGN}': '*',
 u'\N{NOT SIGN}': '{not}',
 u'\N{PILCROW SIGN}': '{paragraph}',
 u'\N{PLUS-MINUS SIGN}': '{+/-}',
 u'\N{POUND SIGN}': '{pound}',
 u'\N{REGISTERED SIGN}': '{R}',
 u'\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}': '>>',
 u'\N{SECTION SIGN}': '{section}',
 u'\N{SOFT HYPHEN}': '-',
 u'\N{SUPERSCRIPT ONE}': '{^1}',
 u'\N{SUPERSCRIPT THREE}': '{^3}',
 u'\N{SUPERSCRIPT TWO}': '{^2}',
 u'\N{VULGAR FRACTION ONE HALF}': '{1/2}',
 u'\N{VULGAR FRACTION ONE QUARTER}': '{1/4}',
 u'\N{VULGAR FRACTION THREE QUARTERS}': '{3/4}',
 u'\N{YEN SIGN}': '{yen}'
}
Thorsten Kohnhorst 20 years, 4 months ago  # | flag

almost. the code has to change a little bit ... since the dictionary keys are now unicode characters, two lines must be changed to:

if xlate.has_key(i):
   r += xlate[i]

it's even more readable now.

thanks a lot to both of you for the nice recipe.

Andrew Dalke 19 years, 6 months ago  # | flag

a cleaner solution. Unicode strings have a 'translate' function which takes the dictionary mapping unicode values to new text. If not present the character is left unchanged. If the mapped value is None then the character is deleted.
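A small Python 3 sketch of that `translate` behaviour, under the assumption that `str.translate` is used instead of the Python 2 `unicode.translate`. The table contents and the `DropUnknown` class are illustrative names for this example; the subclass relies on `dict.__missing__` to delete unmapped non-ASCII characters and to cache that decision.

```python
# Keys are code points; values are replacement text or None.
table = {
    ord("é"): "e",    # mapped: replaced by the given text
    ord("ß"): "ss",   # the replacement may be longer than one character
    ord("¿"): None,   # None: the character is deleted
}

# 'ï' is not in the table, so a plain dict leaves it unchanged.
basic = "¿Café, Straße, naïve?".translate(table)


class DropUnknown(dict):
    """Hypothetical table that deletes any unmapped non-ASCII character."""

    def __missing__(self, code):
        # ASCII maps to itself; everything else maps to None (deleted).
        value = code if code < 0x80 else None
        self[code] = value  # cache so the next lookup is a plain dict hit
        return value


hammered = "¿Café, Straße, naïve?".translate(DropUnknown(table))

if __name__ == "__main__":
    print(basic)     # Cafe, Strasse, naïve?
    print(hammered)  # Cafe, Strasse, nave?
```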

Here's one way to use it to solve this problem. Call the fix_unicode() function defined at the end of this comment. It takes the unicode string and returns the hammered ASCII string.

# If the character doesn't exist in the dictionary, add it as None
# and also return None.  This tells the translate to delete the character
# and makes the next lookup for that character faster.
class XLate(dict):
    def __getitem__(self, c):
        try:
            return dict.__getitem__(self, c)
        except KeyError:
            self[c] = None
            return None

# Define the translation table.  I needed to hammer unicode going to
# NCBI's web services (for Biopython's EUtils package) so I used the
# table defined at
#  http://www.nlm.nih.gov/databases/dtd/medline_character_database.utf8
# This is not as extensive as the original conversion set.
# Convert these unicode characters into ASCII
xlate = XLate({
    # The note at the bottom of the page says "the inverted question
    # mark represents a questionable character found as a result of
    # NLM's conversion from its legacy extended EBCDIC character set
    # to UNICODE UTF-8."  I do not use it but leave it here for
    # completeness.
    ord(u"\N{INVERTED QUESTION MARK}"): None,

    ord(u"\N{LATIN CAPITAL LETTER O WITH STROKE}"): u"O",
    ord(u"\N{LATIN SMALL LETTER A WITH GRAVE}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH ACUTE}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH TILDE}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH RING ABOVE}"): u"a",
    ord(u"\N{LATIN SMALL LETTER C WITH CEDILLA}"): u"c",
    ord(u"\N{LATIN SMALL LETTER E WITH GRAVE}"): u"e",
    ord(u"\N{LATIN SMALL LETTER E WITH ACUTE}"): u"e",
    ord(u"\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}"): u"e",
    ord(u"\N{LATIN SMALL LETTER E WITH DIAERESIS}"): u"e",
    ord(u"\N{LATIN SMALL LETTER I WITH GRAVE}"): u"i",
    ord(u"\N{LATIN SMALL LETTER I WITH ACUTE}"): u"i",
    ord(u"\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}"): u"i",
    ord(u"\N{LATIN SMALL LETTER I WITH DIAERESIS}"): u"i",
    ord(u"\N{LATIN SMALL LETTER N WITH TILDE}"): u"n",
    ord(u"\N{LATIN SMALL LETTER O WITH GRAVE}"): u"o",

    ord(u"\N{LATIN SMALL LETTER O WITH ACUTE}"): u"o",
    ord(u"\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}"): u"o",
    ord(u"\N{LATIN SMALL LETTER O WITH TILDE}"): u"o",
    ord(u"\N{LATIN SMALL LETTER O WITH DIAERESIS}"): u"o",
    ord(u"\N{LATIN SMALL LETTER O WITH STROKE}"): u"o",
    ord(u"\N{LATIN SMALL LETTER U WITH GRAVE}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH ACUTE}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH DIAERESIS}"): u"u",
    ord(u"\N{LATIN SMALL LETTER Y WITH ACUTE}"): u"y",
    ord(u"\N{LATIN SMALL LETTER Y WITH DIAERESIS}"): u"y",
    ord(u"\N{LATIN SMALL LETTER A WITH MACRON}"): u"a",
    ord(u"\N{LATIN SMALL LETTER A WITH BREVE}"): u"a",
    ord(u"\N{LATIN SMALL LETTER C WITH ACUTE}"): u"c",
    ord(u"\N{LATIN SMALL LETTER C WITH CIRCUMFLEX}"): u"c",
    ord(u"\N{LATIN SMALL LETTER E WITH MACRON}"): u"e",
    ord(u"\N{LATIN SMALL LETTER E WITH BREVE}"): u"e",
    ord(u"\N{LATIN SMALL LETTER G WITH CIRCUMFLEX}"): u"g",
    ord(u"\N{LATIN SMALL LETTER G WITH BREVE}"): u"g",
    ord(u"\N{LATIN SMALL LETTER G WITH CEDILLA}"): u"g",
    ord(u"\N{LATIN SMALL LETTER H WITH CIRCUMFLEX}"): u"h",
    ord(u"\N{LATIN SMALL LETTER I WITH TILDE}"): u"i",
    ord(u"\N{LATIN SMALL LETTER I WITH MACRON}"): u"i",
    ord(u"\N{LATIN SMALL LETTER I WITH BREVE}"): u"i",
    ord(u"\N{LATIN SMALL LETTER J WITH CIRCUMFLEX}"): u"j",
    ord(u"\N{LATIN SMALL LETTER K WITH CEDILLA}"): u"k",
    ord(u"\N{LATIN SMALL LETTER L WITH ACUTE}"): u"l",
    ord(u"\N{LATIN SMALL LETTER L WITH CEDILLA}"): u"l",
    ord(u"\N{LATIN CAPITAL LETTER L WITH STROKE}"): u"L",
    ord(u"\N{LATIN SMALL LETTER L WITH STROKE}"): u"l",
    ord(u"\N{LATIN SMALL LETTER N WITH ACUTE}"): u"n",
    ord(u"\N{LATIN SMALL LETTER N WITH CEDILLA}"): u"n",
    ord(u"\N{LATIN SMALL LETTER O WITH MACRON}"): u"o",
    ord(u"\N{LATIN SMALL LETTER O WITH BREVE}"): u"o",
    ord(u"\N{LATIN SMALL LETTER R WITH ACUTE}"): u"r",
    ord(u"\N{LATIN SMALL LETTER R WITH CEDILLA}"): u"r",
    ord(u"\N{LATIN SMALL LETTER S WITH ACUTE}"): u"s",
    ord(u"\N{LATIN SMALL LETTER S WITH CIRCUMFLEX}"): u"s",
    ord(u"\N{LATIN SMALL LETTER S WITH CEDILLA}"): u"s",
    ord(u"\N{LATIN SMALL LETTER T WITH CEDILLA}"): u"t",
    ord(u"\N{LATIN SMALL LETTER U WITH TILDE}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH MACRON}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH BREVE}"): u"u",
    ord(u"\N{LATIN SMALL LETTER U WITH RING ABOVE}"): u"u",
    ord(u"\N{LATIN SMALL LETTER W WITH CIRCUMFLEX}"): u"w",
    ord(u"\N{LATIN SMALL LETTER Y WITH CIRCUMFLEX}"): u"y",
    ord(u"\N{LATIN SMALL LETTER Z WITH ACUTE}"): u"z",

    ord(u"\N{LATIN SMALL LETTER W WITH GRAVE}"): u"w",
    ord(u"\N{LATIN SMALL LETTER W WITH ACUTE}"): u"w",
    ord(u"\N{LATIN SMALL LETTER W WITH DIAERESIS}"): u"w",
    ord(u"\N{LATIN SMALL LETTER Y WITH GRAVE}"): u"y",
    })

# These are the ASCII characters NCBI knows about.  Note that I'm
# building one unicode string here, and not a tuple of unicode
# characters.
for c in (u"\N{SPACE}"
          u"\N{EXCLAMATION MARK}"
          u"\N{QUOTATION MARK}"
          u"\N{NUMBER SIGN}"
          u"\N{DOLLAR SIGN}"
          u"\N{PERCENT SIGN}"
          u"\N{AMPERSAND}"
          u"\N{APOSTROPHE}"
          u"\N{LEFT PARENTHESIS}"
          u"\N{RIGHT PARENTHESIS}"
          u"\N{ASTERISK}"
          u"\N{PLUS SIGN}"
          u"\N{COMMA}"
          u"\N{HYPHEN-MINUS}"
          u"\N{FULL STOP}"
          u"\N{SOLIDUS}"
          u"\N{DIGIT ZERO}"
          u"\N{DIGIT ONE}"
          u"\N{DIGIT TWO}"
          u"\N{DIGIT THREE}"
          u"\N{DIGIT FOUR}"
          u"\N{DIGIT FIVE}"
          u"\N{DIGIT SIX}"
          u"\N{DIGIT SEVEN}"
          u"\N{DIGIT EIGHT}"
          u"\N{DIGIT NINE}"
          u"\N{COLON}"
          u"\N{SEMICOLON}"
          u"\N{LESS-THAN SIGN}"
          u"\N{EQUALS SIGN}"
          u"\N{GREATER-THAN SIGN}"
          u"\N{QUESTION MARK}"
          u"\N{COMMERCIAL AT}"
          u"\N{LATIN CAPITAL LETTER A}"
          u"\N{LATIN CAPITAL LETTER B}"
          u"\N{LATIN CAPITAL LETTER C}"
          u"\N{LATIN CAPITAL LETTER D}"
          u"\N{LATIN CAPITAL LETTER E}"
          u"\N{LATIN CAPITAL LETTER F}"
          u"\N{LATIN CAPITAL LETTER G}"
          u"\N{LATIN CAPITAL LETTER H}"
          u"\N{LATIN CAPITAL LETTER I}"
          u"\N{LATIN CAPITAL LETTER J}"
          u"\N{LATIN CAPITAL LETTER K}"
          u"\N{LATIN CAPITAL LETTER L}"
          u"\N{LATIN CAPITAL LETTER M}"
          u"\N{LATIN CAPITAL LETTER N}"
          u"\N{LATIN CAPITAL LETTER O}"
          u"\N{LATIN CAPITAL LETTER P}"
          u"\N{LATIN CAPITAL LETTER Q}"
          u"\N{LATIN CAPITAL LETTER R}"
          u"\N{LATIN CAPITAL LETTER S}"
          u"\N{LATIN CAPITAL LETTER T}"
          u"\N{LATIN CAPITAL LETTER U}"
          u"\N{LATIN CAPITAL LETTER V}"
          u"\N{LATIN CAPITAL LETTER W}"
          u"\N{LATIN CAPITAL LETTER X}"
          u"\N{LATIN CAPITAL LETTER Y}"
          u"\N{LATIN CAPITAL LETTER Z}"
          u"\N{LEFT SQUARE BRACKET}"
          u"\N{REVERSE SOLIDUS}"
          u"\N{RIGHT SQUAR
Aaron Bentley 18 years, 3 months ago  # | flag

Using NFKD. A very simple, and obviously-correct way to do this is like so:

unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')

It has the advantage that you don't need to enumerate any particular conversions-- any accented latin characters will be reduced to their base form, and non-ascii characters will be stripped.

By normalizing to NFKD, we transform precomposed characters like \u00C0 (LATIN CAPITAL LETTER A WITH GRAVE) into pairs of base letter \u0041 (A) and combining character \u0300 (GRAVE accent).

Converting to ascii using 'ignore' strips all non-ascii characters, e.g. the combining characters. However, it will also strip other non-ascii characters, so if there are no latin characters in the input, the output will be empty.
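A Python 3 sketch of the one-liner, with one caveat made visible: letters that have no Unicode decomposition (sharp s, ae ligature, o with stroke, eth, thorn) are dropped rather than converted. The second function is an assumption of this example, not part of the comment: it applies a small hand-written table, reusing the recipe's replacements, before normalizing.

```python
import unicodedata


def nfkd_to_ascii(text):
    # Decompose, then discard everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Letters without a decomposition, mapped as in the recipe's table.
NO_DECOMPOSITION = str.maketrans({
    "ß": "ss", "Æ": "Ae", "æ": "ae", "Ø": "O", "ø": "o",
    "Ð": "Th", "ð": "th", "Þ": "th", "þ": "th",
})


def hammer(text):
    # Hypothetical combination: table first, NFKD for everything else.
    return nfkd_to_ascii(text.translate(NO_DECOMPOSITION))


if __name__ == "__main__":
    print(nfkd_to_ascii("Málaga"))      # Malaga
    print(nfkd_to_ascii("Ærøskøbing"))  # rskbing
    print(hammer("Ærøskøbing"))         # Aeroskobing
```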

Martin Blais 17 years, 6 months ago  # | flag

Another solution. BTW this is very, very useful. Thanks for this thread.

I had been solving this issue by using a modified version of Skip Montanaro's latscii, which creates a string encoding (a codec) that does that, but I like your unicodedata solution better::

# -*- coding: latin-1 -*-
""" Character mapping codec which removes accents from latin-1 characters

Written by Skip Montanaro (skip@pobox.com) using the autogenerated cp1252
codec as an example.

(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
(c) Copyright 2000 Guido van Rossum.

"""#"

import codecs

### Codec APIs

class Codec(codecs.Codec):

    def encode(self,input,errors='strict'):

        return codecs.charmap_encode(input,errors,encoding_map)

    def decode(self,input,errors='strict'):

        return codecs.charmap_decode(input,errors,decoding_map)

class StreamWriter(Codec,codecs.StreamWriter):
    pass

class StreamReader(Codec,codecs.StreamReader):
    pass

### encodings module API

def getregentry():

    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

### Decoding Map

decoding_map = codecs.make_identity_dict(range(256))
for x in range(0x80, 0xa0):
    decoding_map[x] = ord('?') # undefined
decoding_map.update({
     0x00a1: ord('!'), # ¡
    0x00a2: ord('c'), # ¢
    0x00a3: ord('#'), # £
    0x00a4: ord('o'), # ¤
    0x00a5: ord('Y'), # ¥
    0x00a6: ord('|'), # ¦
    0x00a7: ord('S'), # §
    0x00a8: ord('"'), # ¨
    0x00a9: ord('c'), # ©
    0x00aa: ord('a'), # ª
    0x00ab: ord('


Chris Lasher 16 years, 2 months ago  # | flag

Nice solution with NFKD. This is a really elegant solution. Thank you!

Created by Noah Spurrier on Mon, 10 Nov 2003 (PSF)