"Walter" == Walter Dörwald email@example.com writes:
    Walter> Not really. In every encoding where a sequence of more
    Walter> than one byte maps to one Unicode character, you will
    Walter> always need some kind of buffering. If we remove the
    Walter> handling of initial BOMs from the codecs (except for
    Walter> UTF-16 where it is required), this wouldn't change any
    Walter> buffering requirements.
Sure. My point is that codecs should be stateful only to the extent needed to assemble semantically meaningful units (ie, multioctet coded characters). In particular, they should not need to know about location at the beginning, middle, or end of some stream---because in the context of operating on a string they _can't_.
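To make "stateful only to the extent needed" concrete, here is a minimal sketch using the incremental decoders in the current Python 3 standard library (`codecs.getincrementaldecoder`). The sample bytes and the way they are split into chunks are my own choice for illustration. The only state the decoder carries between calls is the unfinished multioctet character; it has no notion of where in a stream it is.

```python
import codecs

# "é" is two octets in UTF-8 (0xC3 0xA9); split it across two chunks.
chunks = [b"caf\xc3", b"\xa9!"]

decoder = codecs.getincrementaldecoder("utf-8")()
# The dangling 0xC3 is buffered after the first call and completed by the second.
pieces = [decoder.decode(chunk) for chunk in chunks]
# Flush: with final=True an incomplete trailing character would raise an error.
pieces.append(decoder.decode(b"", final=True))

text = "".join(pieces)
print(pieces)  # ['caf', 'é!', '']
print(text)    # café!
```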
    >> I don't know whether that's really feasible in the short
    >> run---I suspect there may be a lot of stream-like modules that
    >> would need to be updated---but it would be saner in the long
    >> run.
    Walter> I'm not exactly sure, what you're proposing here. That all
    Walter> codecs (even UTF-16) pass the BOM through and some other
    Walter> infrastructure is responsible for dropping it?
Not exactly. I think that at the lowest level codecs should not implement complex mode-switching internally, but rather explicitly abdicate responsibility to a more appropriate codec.
For example, autodetecting UTF-16 on input would be implemented by a Python program that does something like
    data = stream.read()
    for detector in [ "utf-16-signature", "utf-16-statistical" ]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec:
            break
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about it
        pass
    process(out)
where decode("utf-16-signature") would be implemented as
    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":        # big-endian BOM
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":      # little-endian BOM
            return (u"", data[2:], "utf-16-le")
        else:
            # note: data is undisturbed if the detector fails
            return (None, data, None)
The main point is that the detector is just a codec that stops when it figures out what the next codec should be, touches only data that would be incorrect to pass to the next codec, and leaves the data alone if detection fails. utf-16-signature only handles the BOM (if present), and does not handle arbitrary "chunks" of data. Instead, it passes on the rest of the data (including the first chunk) to be handled by the appropriate utf-16-?e codec.
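The three-value `decode` used above is not an existing API, so here is a self-contained Python 3 sketch of the same idea that actually runs. Everything named here (`Step`, `utf16_signature`, `plain_codec`, `STEPS`, `decode_chain`) is hypothetical and exists only for illustration; the statistical detector is left out, and the utf-16-?e steps simply wrap the stock codecs and decode everything in one go.

```python
from typing import Callable, Optional, Tuple

# (decoded text or None, unconsumed bytes, name of the next codec or None)
Step = Tuple[Optional[str], bytes, Optional[str]]


def utf16_signature(data: bytes) -> Step:
    """Detector: consume a UTF-16 BOM, if present, and name the next codec."""
    if data[:2] == b"\xfe\xff":
        return "", data[2:], "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "", data[2:], "utf-16-le"
    # Detection failed: hand the data back untouched.
    return None, data, None


def plain_codec(name: str) -> Callable[[bytes], Step]:
    """Wrap a stock codec as a terminal step: decode all data, name no successor."""
    def step(data: bytes) -> Step:
        return data.decode(name), b"", None
    return step


# Hypothetical registry mapping step names to step functions.
STEPS = {
    "utf-16-signature": utf16_signature,
    "utf-16-be": plain_codec("utf-16-be"),
    "utf-16-le": plain_codec("utf-16-le"),
}


def decode_chain(data: bytes, detectors=("utf-16-signature",)) -> Tuple[Optional[str], bytes]:
    """Run the detectors, then follow the chain of codecs each step names."""
    out, codec = None, None
    for detector in detectors:
        out, data, codec = STEPS[detector](data)
        if codec:
            break
    while codec:
        more_out, data, codec = STEPS[codec](data)
        out += more_out
    return out, data


print(decode_chain(b"\xff\xfeh\x00i\x00"))  # ('hi', b'')
print(decode_chain(b"h\x00i\x00"))          # (None, b'h\x00i\x00')
```

In the second call no signature is found, so the caller gets the data back undisturbed and can try another detector or complain.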
I think that the temptation to encapsulate this logic in a utf-16 codec that "simplifies" things by calling the appropriate utf-16-?e codec itself should be deprecated, but YMMV. What I would really like is for the above style to be easier to achieve than it currently is.
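For comparison, this is how the stock codecs behave in current Python 3, shown on a sample byte string of my own choosing. The plain "utf-16" codec does the encapsulated thing: it reads the BOM, picks the byte order, and drops the signature. The utf-16-?e codecs do no detection at all, so a BOM handed to them comes through as an ordinary U+FEFF character.

```python
data = b"\xff\xfeh\x00i\x00"  # little-endian BOM followed by "hi"

# "utf-16" selects the byte order from the BOM and removes it from the output.
auto = data.decode("utf-16")
# "utf-16-le" has a fixed byte order; the BOM survives as U+FEFF.
explicit = data.decode("utf-16-le")

print(repr(auto))      # 'hi'
print(repr(explicit))  # '\ufeffhi'
```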
BTW, I appreciate your patience in exploring this; after Martin's remark about different mental models I have to suspect this approach is just somehow un-Pythonic, but fleshing it out this way I can see how it will be useful in the context of a different project.