Description:
The XML specification outlines an algorithm for detecting the
Unicode encoding that an XML document uses. This function implements that algorithm.
Source:
"""Caller will hand this library a buffer and ask it to either convert
it or auto-detect the type."""
import codecs

# Leading byte patterns mapped to encoding names. None acts as a
# placeholder for "any byte" in the two-byte BOM patterns.
autodetect_dict = {
    (0x00, 0x00, 0xFE, 0xFF): "ucs4_be",
    (0xFF, 0xFE, 0x00, 0x00): "ucs4_le",
    (0xFE, 0xFF, None, None): "utf_16_be",
    (0xFF, 0xFE, None, None): "utf_16_le",
    (0x00, 0x3C, 0x00, 0x3F): "utf_16_be",
    (0x3C, 0x00, 0x3F, 0x00): "utf_16_le",
    (0x3C, 0x3F, 0x78, 0x6D): "utf_8",
    (0x4C, 0x6F, 0xA7, 0x94): "EBCDIC",
}

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
    The buffer should be at least 4 bytes long.
    Falls back to "utf_8" if no other encoding can be detected.
    Note that encoding_name might not have an installed
    decoder (e.g. EBCDIC)
    """
    encoding = "utf_8"
    # bytearray yields integer byte values on both Python 2 and Python 3
    first_bytes = (byte1, byte2, byte3, byte4) = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(first_bytes, None)
    if not enc_info:
        # no four-byte match: retry with only the first two bytes (BOM)
        first_bytes = (byte1, byte2, None, None)
        enc_info = autodetect_dict.get(first_bytes)
    if enc_info:
        encoding = enc_info
    try:
        secret_decoder_ring = codecs.lookup(encoding)[1]
    except LookupError:
        # no decoder installed, so the XML declaration cannot be read
        return encoding
    # "replace" keeps undecodable bytes later in the buffer from raising
    (decoded, length) = secret_decoder_ring(buffer, "replace")
    # a byte order mark, if present, decodes to U+FEFF; skip it
    first_line = decoded.lstrip(u"\ufeff").split(u"\n")[0]
    # only the first line is searched for the encoding pseudo-attribute
    if first_line and first_line.startswith(u"<?xml"):
        encoding_pos = first_line.find(u"encoding")
        if encoding_pos != -1:
            quote_pos = first_line.find('"', encoding_pos)
            if quote_pos == -1:
                quote_pos = first_line.find("'", encoding_pos)
            if quote_pos > -1:
                quote_char, rest = (first_line[quote_pos],
                                    first_line[quote_pos + 1:])
                encoding = rest[:rest.find(quote_char)]
    return encoding
Discussion:
This code detects a variety of encodings, including some that are
not supported by Python's Unicode decoder. So the fact that you can
decipher the encoding does not guarantee that you can decipher the
document itself!
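
One way to act on this caveat is to ask Python whether a codec is registered under the detected name before trying to decode the document. The following is a minimal sketch; the helper name has_decoder is made up for this illustration and is not part of the recipe.

```python
import codecs

def has_decoder(encoding_name):
    """Return True if Python has a codec registered under this name.

    has_decoder is a hypothetical helper, not part of the recipe.
    """
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names
        return False
    return True
```

For example, has_decoder("utf_8") is true, while has_decoder("EBCDIC") is false because Python registers EBCDIC code pages under specific names such as "cp037" rather than under the generic name.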
Number of comments: 3
Good, but..., Mike Brown, 2003/03/27
It makes the assumption that the XML declaration is the only thing on the first line, but this is not necessarily going to be the case; there might not be any line breaks at all. For example, the encoding of '<?xml version="1.0"?><foo encoding="x-bar"/>' is detected as 'x-bar' instead of 'utf-8'. Using a regular expression to find the XML declaration would be more reliable.
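
A rough sketch of the regular-expression approach suggested here, assuming the text has already been decoded. This is not Lars Tiede's recipe; the pattern and the names XML_DECL_RE and declared_encoding are hypothetical, and the pattern is deliberately simpler than the full XML grammar.

```python
import re

# Hypothetical pattern: an XML declaration anchored at the very start of
# the text, with a mandatory version and an optional encoding.
XML_DECL_RE = re.compile(
    r"""^<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*')"""
    r"""(?:\s+encoding\s*=\s*(?P<quote>["'])(?P<enc>[A-Za-z][A-Za-z0-9._\-]*)(?P=quote))?"""
)

def declared_encoding(text, default="utf_8"):
    """Return the encoding named in the XML declaration, else default."""
    match = XML_DECL_RE.match(text)
    if match and match.group("enc"):
        return match.group("enc")
    return default
```

Because the pattern only looks inside the declaration itself, declared_encoding('<?xml version="1.0"?><foo encoding="x-bar"/>') returns the default "utf_8" rather than "x-bar", with or without line breaks in the document.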
Error in 2nd edition, Mike Brown, 2005/10/02
The discussion of this on page 469 of the 2nd print edition of the Python Cookbook acted upon my previous comment incorrectly. The book asserts that the XML declaration must be terminated by a linefeed, and it implies that the recipe does not need to handle such cases of malformed "almost-XML". This is entirely wrong; there does not need to be a linefeed at all, as the XML grammar makes clear in all three editions of XML 1.0.
Also, Lars Tiede has submitted a regex-based version at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841.
Some changes, Lars Tiede, 2005/01/19
I made some changes to the code. Besides plain renaming and minor cosmetic changes, these are worth mentioning:
- I couldn't find some of the BOM byte patterns you used, so I removed them.
- The patterns for the 4-byte schemes match the name UTF-32 rather than UCS-4.
- The algorithm that searches the XML declaration is wrong. I worked out a regex which should work for all halfway correct XML 1.0 headers.
My code: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841
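
As an illustration of the UTF-32 naming point, the four-byte patterns in the recipe are the same bytes as the UTF-32 BOM constants in the standard codecs module. The sketch below is not the code from the linked recipe; BOMS and encoding_from_bom are hypothetical names, and it only covers documents that start with a byte order mark.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
)

def encoding_from_bom(data):
    """Return a codec name for the BOM at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Unlike "ucs4_be" and "ucs4_le", the names "utf_32_be" and "utf_32_le" are registered Python codecs, so a buffer detected this way can also be decoded.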