[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Thu Apr 7 06:20:53 CEST 2005


>>>>> "Walter" == Walter Dörwald <walter at livinglogic.de> writes:

    Walter> Not really. In every encoding where a sequence of more
    Walter> than one byte maps to one Unicode character, you will
    Walter> always need some kind of buffering. If we remove the
    Walter> handling of initial BOMs from the codecs (except for
    Walter> UTF-16 where it is required), this wouldn't change any
    Walter> buffering requirements.

Sure.  My point is that codecs should be stateful only to the extent
needed to assemble semantically meaningful units (i.e., multioctet coded
characters).  In particular, they should not need to know about
location at the beginning, middle, or end of some stream---because in
the context of operating on a string they _can't_.
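
To make that concrete, here is a minimal sketch of the only kind of
state I have in mind (decode_step is a made-up helper, not part of any
codec API): a UTF-16-LE step decoder whose entire state is at most one
trailing odd byte.

    def decode_step(data):
        # decode the largest even-length prefix of DATA as UTF-16-LE;
        # the leftover trailing byte, if any, must be prepended to the
        # next chunk.  (A real codec would also buffer a lone high
        # surrogate; ignored here for brevity.)
        cut = len(data) - (len(data) % 2)
        return data[:cut].decode("utf-16-le"), data[cut:]

    text, rest = decode_step("a\x00b\x00c")
    # text == u"ab", rest == "c"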

    >> I don't know whether that's really feasible in the short
    >> run---I suspect there may be a lot of stream-like modules that
    >> would need to be updated---but it would be saner in the long
    >> run.

    Walter> I'm not exactly sure, what you're proposing here. That all
    Walter> codecs (even UTF-16) pass the BOM through and some other
    Walter> infrastructure is responsible for dropping it?

Not exactly.  I think that at the lowest level codecs should not
implement complex mode-switching internally, but rather explicitly
abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a
Python program that does something like

    # proposed protocol: decode() returns (output, unconsumed data,
    # name of the next codec); the next codec is None when done
    data = stream.read()
    for detector in ["utf-16-signature", "utf-16-statistical"]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec: break
    else:
        # no detector matched; a real program would complain here
        raise UnicodeError("unable to detect the encoding")
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about the leftover bytes
        pass
    process(out)

where decode("utf-16-signature") would be implemented as

def utf_16_signature_internal(data):
    if data[0:2] == "\xfe\xff":
        return (u"", data[2:], "utf-16-be")
    elif data[0:2] == "\xff\xfe":
        return (u"", data[2:], "utf-16-le")
    else:
        # note: data is undisturbed if the detector fails
        return (None, data, None)
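
For example, a chunk that starts with a little-endian BOM:

    out, rest, codec = utf_16_signature_internal("\xff\xfeh\x00i\x00")
    # out == u"", rest == "h\x00i\x00", codec == "utf-16-le"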

The main point is that the detector is just a codec that stops when it
figures out what the next codec should be, touches only data that
would be incorrect to pass to the next codec, and leaves the data
alone if detection fails.  utf-16-signature only handles the BOM (if
present), and does not handle arbitrary "chunks" of data.  Instead, it
passes on the rest of the data (including the first chunk) to be
handled by the appropriate utf-16-?e codec.
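
Concretely, the hand-off to the real codec might look like this
(utf_16_signature_internal is the sketch above; codecs.getdecoder is
the stdlib lookup that exists today):

    import codecs

    out, rest, codec = utf_16_signature_internal(data)
    if codec:
        decode = codecs.getdecoder(codec)
        # stdlib decoders return (unicode output, bytes consumed)
        more_out, consumed = decode(rest)
        out = out + more_out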

I think that the temptation to encapsulate this logic in a utf-16
codec that "simplifies" things by calling the appropriate utf-16-?e
codec itself should be resisted, but YMMV.  What I would really like
is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin's
remark about different mental models I have to suspect this approach
is just somehow un-Pythonic, but fleshing it out this way I can see
how it will be useful in the context of a different project.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

