ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos

Title: Auto-detect XML encoding
Submitter: Paul Prescod
Last Updated: 2001/03/14
Version no: 1.0
Category: XML

 


Description:

The XML specification describes the outlines of an algorithm for detecting the
Unicode encoding that an XML document uses. This function will do that.

Source:

"""Caller hands this library a buffer and asks it to auto-detect the
encoding of the XML document it holds."""

import codecs

# None represents a potentially variable byte ("##" in the XML spec).
autodetect_dict = {
    # byte pattern           : encoding name
    (0x00, 0x00, 0xFE, 0xFF): "ucs4_be",
    (0xFF, 0xFE, 0x00, 0x00): "ucs4_le",
    (0xFE, 0xFF, None, None): "utf_16_be",
    (0xFF, 0xFE, None, None): "utf_16_le",
    (0x00, 0x3C, 0x00, 0x3F): "utf_16_be",
    (0x3C, 0x00, 0x3F, 0x00): "utf_16_le",
    (0x3C, 0x3F, 0x78, 0x6D): "utf_8",
    (0x4C, 0x6F, 0xA7, 0x94): "EBCDIC",
}

def autoDetectXMLEncoding(buffer):
    """buffer -> encoding_name

    The buffer should be at least 4 bytes long.
    Returns "utf_8", the XML default, if no other encoding can be
    detected. Note that encoding_name might not have an installed
    decoder (e.g. EBCDIC).
    """
    # a more efficient implementation would not decode the whole
    # buffer at once but otherwise we'd have to decode a character at
    # a time looking for the quote character...that's a pain

    encoding = "utf_8" # according to the XML spec, this is the default
                          # this code successively tries to refine the default
                          # whenever it fails to refine, it falls back to 
                          # the last place encoding was set.
    # bytearray yields integer byte values on both Python 2 and Python 3;
    # "prefix" avoids shadowing the built-in name "bytes"
    prefix = (byte1, byte2, byte3, byte4) = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(prefix, None)

    if not enc_info: # try autodetection again removing potentially
                     # variable bytes
        prefix = (byte1, byte2, None, None)
        enc_info = autodetect_dict.get(prefix)

        
    if enc_info:
        encoding = enc_info # we've got a guess... these are
                            #the new defaults

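        # try to find a more precise encoding using xml declaration
        try:
            secret_decoder_ring = codecs.lookup(encoding)[1]
        except LookupError:
            # no installed decoder (e.g. EBCDIC): keep the byte-pattern guess
            return encoding
        # "replace" keeps undecodable bytes from raising; only the
        # declaration is inspected
        (decoded,length) = secret_decoder_ring(buffer, "replace")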
        first_line = decoded.split("\n")[0]
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos!=-1:
                # look for double quote
                quote_pos=first_line.find('"', encoding_pos) 

                if quote_pos==-1:                 # look for single quote
                    quote_pos=first_line.find("'", encoding_pos) 

                if quote_pos>-1:
                    quote_char,rest=(first_line[quote_pos],
                                                first_line[quote_pos+1:])
                    encoding=rest[:rest.find(quote_char)]

    return encoding

Discussion:

This code detects a variety of encodings, including some that are
not supported by Python's Unicode decoder. So the fact that you can
decipher the encoding does not guarantee that you can decipher the
document itself!
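
One way to act on this caveat is to ask the codec registry whether a detected name can be decoded before relying on it. The sketch below is only an illustration: has_decoder is a hypothetical helper that is not part of the recipe, and the names in the loop are just sample inputs.

```python
import codecs

def has_decoder(encoding_name):
    """Return True if a codec is registered under encoding_name.

    Hypothetical helper, not part of the recipe above.
    """
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        return False
    return True

# Sample names: two that the recipe can return, plus one made-up name.
for name in ("utf_8", "utf_16_le", "no_such_codec_xyz"):
    print(name, has_decoder(name))
```

A caller could run this check on the recipe's return value and fall back to an error message, or to an external converter, when it is False.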




Number of comments: 3

Good, but..., Mike Brown, 2003/03/27
It makes the assumption that the XML declaration is the only thing on the first line, but this is not necessarily going to be the case; there might not be any line breaks at all. For example, the encoding of '<?xml version="1.0"?><foo encoding="x-bar"/>' is detected as 'x-bar' instead of 'utf-8'. Using a regular expression to find the XML declaration would be more reliable.
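
A minimal sketch of the regular-expression approach suggested in this comment, assuming the buffer has already been decoded to text. The names XML_DECL, ENCODING and declared_encoding are hypothetical, and the patterns are a simplification rather than a full rendering of the XML grammar: they confine the search to the declaration itself, so attributes on later elements are never consulted.

```python
import re

# Hypothetical patterns: the declaration must open the document, and the
# encoding pseudo-attribute is looked for only inside it.
XML_DECL = re.compile(r"<\?xml(\s[^>]*?)\?>")
ENCODING = re.compile(r"""\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1""")

def declared_encoding(text, default="utf-8"):
    """Return the encoding named in the XML declaration, else default."""
    decl = XML_DECL.match(text)
    if decl is None:
        return default          # no XML declaration at all
    enc = ENCODING.search(decl.group(1))
    return enc.group(2) if enc else default

print(declared_encoding('<?xml version="1.0"?><foo encoding="x-bar"/>'))
print(declared_encoding("<?xml version='1.0' encoding='ISO-8859-1'?><a/>"))
```

With the example from the comment, the first call falls back to the default because the declaration carries no encoding pseudo-attribute.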

Error in 2nd edition, Mike Brown, 2005/10/02
The discussion of this on page 469 of the 2nd print edition of the Python Cookbook acted upon my previous comment incorrectly. The book asserts that the XML declaration must be terminated by a linefeed, and it implies that the recipe does not need to handle such cases of malformed "almost-XML". This is entirely wrong; there does not need to be a linefeed at all, as the XML grammar makes clear in all three editions of XML 1.0.

Also, Lars Tiede has submitted a regex-based version at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841.
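
The point about linefeeds can be checked with any conforming parser. The snippet below is a small illustration using the standard library's xml.dom.minidom, chosen here only for convenience: it parses a document that has no line break after the declaration.

```python
from xml.dom import minidom

# No linefeed anywhere: the root element follows the declaration directly.
source = b'<?xml version="1.0"?><foo encoding="x-bar"/>'

doc = minidom.parseString(source)
root = doc.documentElement

print(root.tagName)                    # the root element parsed normally
print(root.getAttribute("encoding"))   # an ordinary attribute, not the document encoding
```

The parse succeeds, and "encoding" shows up as a plain attribute of the root element, which is why a detector has to stop looking at the end of the declaration.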

Some changes, Lars Tiede, 2005/01/19
I made some changes to the code. Besides plain renaming and minor cosmetic edits, these are worth mentioning:

- I could not find a source for some of the BOM byte patterns you used, so I removed them.
- The patterns for the 4-byte schemes match the name UTF-32 rather than UCS-4.
- The algorithm that searches the XML declaration is wrong. I worked out a regex that should handle all halfway-correct XML 1.0 headers.

My code: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841
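
The byte order marks that Python itself knows about are available as constants in the codecs module, which gives a way to cross-check the 4-byte patterns against the UTF-32 names. The sketch below is not Lars Tiede's code; BOMS and encoding_from_bom are hypothetical names, and the UTF-8 BOM is included only because the module defines it.

```python
import codecs

# BOM constants from the standard library, mapped to Python codec names.
BOMS = {
    codecs.BOM_UTF32_BE: "utf_32_be",   # 00 00 FE FF
    codecs.BOM_UTF32_LE: "utf_32_le",   # FF FE 00 00
    codecs.BOM_UTF16_BE: "utf_16_be",   # FE FF
    codecs.BOM_UTF16_LE: "utf_16_le",   # FF FE
    codecs.BOM_UTF8: "utf_8",           # EF BB BF
}

def encoding_from_bom(buffer):
    """Return the codec name implied by a leading BOM, or None."""
    # Longest BOMs first, so UTF-32 LE (FF FE 00 00) is not mistaken
    # for UTF-16 LE (FF FE).
    for bom in sorted(BOMS, key=len, reverse=True):
        if buffer.startswith(bom):
            return BOMS[bom]
    return None

print(encoding_from_bom(b"\xff\xfe\x00\x00<\x00\x00\x00"))
print(encoding_from_bom(b"\xff\xfe<\x00?\x00"))
print(encoding_from_bom(b"<?xml version='1.0'?>"))
```

A buffer without a BOM yields None, at which point the caller would still need the byte-pattern and declaration checks.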



© 2006 ActiveState Software Inc. All rights reserved.