|
When encoding a Unicode string into a byte string, unencodable
characters may be encountered. So far, Python has allowed specifying
the error processing as either ``strict'' (raising
UnicodeError), ``ignore'' (skipping the character), or
``replace'' (substituting a question mark in the output string), with
``strict'' being the default behavior. It may be desirable to specify
alternative processing of such errors, such as inserting an XML
character reference or HTML entity reference into the converted
string.
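The three built-in strategies can be illustrated with a short sketch. It assumes a current Python 3 interpreter, where str is the Unicode string type and encoding failures surface as UnicodeEncodeError, a subclass of UnicodeError. The sample text is made up for illustration.

```python
# Illustrative input: the character U+00E9 cannot be encoded as ASCII.
text = "caf\u00e9"

# "strict" is the default, so encoding without an errors argument raises.
try:
    text.encode("ascii")
    strict_result = "no error"
except UnicodeError as exc:
    strict_result = type(exc).__name__

# "ignore" silently drops the unencodable character.
ignored = text.encode("ascii", "ignore")

# "replace" substitutes a question mark for it.
replaced = text.encode("ascii", "replace")

print(strict_result, ignored, replaced)
```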
Python now has a flexible framework for adding new processing
strategies. New error handlers can be registered under a name with
codecs.register_error, and codecs can then look up a handler by
that name with codecs.lookup_error. An equivalent C API has
been added for codecs written in C. The error handler receives the
necessary state information, such as the string being converted, the
position in the string where the error was detected, and the target
encoding. The handler can then either raise an exception or return a
replacement string.
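A custom handler registered through this framework might look like the following minimal sketch. It assumes a current Python 3 interpreter. The handler name "hextag" and the function hex_tag_handler are hypothetical, chosen only for illustration. In the codecs module, the state information arrives as attributes of the exception object (object, start, end, encoding), and the handler returns a tuple of the replacement string and the position at which encoding should resume.

```python
import codecs


def hex_tag_handler(exc):
    """Hypothetical handler: write each unencodable character as <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only deals with encoding errors; re-raise anything else.
        raise exc
    # exc.object is the string being converted; start/end delimit the
    # run of characters that could not be encoded.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement and the index where encoding should resume.
    return replacement, exc.end


# Make the handler available under an illustrative name.
codecs.register_error("hextag", hex_tag_handler)

encoded = "caf\u00e9 \u20ac5".encode("ascii", "hextag")
print(encoded)
```

A codec, or any other code, can retrieve the same callable by name with codecs.lookup_error("hextag").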
Two additional error handlers have been implemented using this
framework: ``backslashreplace'' uses Python backslash quoting to
represent unencodable characters and ``xmlcharrefreplace'' emits
XML character references.
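The two new handlers can be compared on the same input. This is a small sketch assuming a current Python 3 interpreter; the sample text is made up for illustration.

```python
# Illustrative input: the euro sign U+20AC cannot be encoded as ASCII.
text = "\u20ac100"

# "backslashreplace" emits a Python backslash escape for the character.
backslashed = text.encode("ascii", "backslashreplace")

# "xmlcharrefreplace" emits a decimal XML character reference.
xml_refs = text.encode("ascii", "xmlcharrefreplace")

print(backslashed, xml_refs)
```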
|