[Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull
stephen at xemacs.org
Thu Apr 7 06:20:53 CEST 2005
>>>>> "Walter" == Walter Dörwald <walter at livinglogic.de> writes:
Walter> Not really. In every encoding where a sequence of more
Walter> than one byte maps to one Unicode character, you will
Walter> always need some kind of buffering. If we remove the
Walter> handling of initial BOMs from the codecs (except for
Walter> UTF-16 where it is required), this wouldn't change any
Walter> buffering requirements.
Sure. My point is that codecs should be stateful only to the extent
needed to assemble semantically meaningful units (i.e., multioctet coded
characters). In particular, they should not need to know about their
location at the beginning, middle, or end of some stream---because in
the context of operating on a string they _can't_.
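To make the distinction concrete, here is a minimal sketch (using Python's incremental decoder machinery, which is one way this shows up in practice) of the only state such a codec legitimately needs: the tail bytes of a partially received multioctet character.

```python
import codecs

# The decoder's only necessary state is the buffered tail of an
# incomplete multioctet sequence -- no notion of "start of stream".
dec = codecs.getincrementaldecoder("utf-8")()

# A 3-byte UTF-8 character (U+20AC EURO SIGN) split across two chunks:
part1 = dec.decode(b"abc\xe2\x82")        # ends mid-character; bytes buffered
part2 = dec.decode(b"\xac", final=True)   # completes the character

print(repr(part1))  # 'abc'
print(repr(part2))  # the euro sign
```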
>> I don't know whether that's really feasible in the short
>> run---I suspect there may be a lot of stream-like modules that
>> would need to be updated---but it would be a saner in the long
Walter> I'm not exactly sure what you're proposing here. That all
Walter> codecs (even UTF-16) pass the BOM through and some other
Walter> infrastructure is responsible for dropping it?
Not exactly. I think that at the lowest level codecs should not
implement complex mode-switching internally, but rather explicitly
abdicate responsibility to a more appropriate codec.
For example, autodetecting UTF-16 on input would be implemented by a
Python program that does something like
    data = stream.read()
    for detector in ["utf-16-signature", "utf-16-statistical"]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec: break
    more_out, data, codec = data.decode(codec)
    out = out + more_out
    # a real program would complain about it
where decode("utf-16-signature") would be implemented as

    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":
            return (u"", data[2:], "utf-16-le")
        # note: data is undisturbed if the detector fails
        return (None, data, None)
The main point is that the detector is just a codec that stops when it
figures out what the next codec should be, touches only data that
would be incorrect to pass to the next codec, and leaves the data
alone if detection fails. utf-16-signature only handles the BOM (if
present), and does not handle arbitrary "chunks" of data. Instead, it
passes on the rest of the data (including the first chunk) to be
handled by the appropriate utf-16-?e codec.
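The chaining described above can be sketched as runnable code. Note that the `utf_16_signature` detector and `decode_autodetect` driver below are hypothetical helpers illustrating the proposal, not part of Python's actual codec registry:

```python
import codecs

def utf_16_signature(data):
    # Detector "codec": consume only the BOM, name the next codec,
    # and leave the data untouched on failure.
    if data[:2] == b"\xfe\xff":
        return (u"", data[2:], "utf-16-be")
    if data[:2] == b"\xff\xfe":
        return (u"", data[2:], "utf-16-le")
    return (None, data, None)

def decode_autodetect(data):
    out, rest, codec = utf_16_signature(data)
    if codec is None:
        # a real program would fall back to other detectors here
        raise UnicodeError("no UTF-16 signature found")
    # hand the *entire* remainder to the codec the detector named
    return out + codecs.decode(rest, codec)

print(decode_autodetect(b"\xff\xfeh\x00i\x00"))  # prints: hi
```

The detector never buffers or reinterprets the payload; it only strips the two signature bytes and delegates.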
I think that the temptation to encapsulate this logic in a utf-16
codec that "simplifies" things by calling the appropriate utf-16-?e
codec itself should be deprecated, but YMMV. What I would really like
is for the above style to be easier to achieve than it currently is.
BTW, I appreciate your patience in exploring this; after Martin's
remark about different mental models I have to suspect this approach
is just somehow un-Pythonic, but fleshing it out this way I can see
how it will be useful in the context of a different project.
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.