[Python-3000] encoding hell

"Martin v. Löwis" martin at v.loewis.de
Wed Sep 13 06:20:12 CEST 2006


tomer filiba wrote:
> # read 3 UTF8 *characters*
> f.read(3)
> 
> # this will seek by AT LEAST 7 *bytes*, until resynched
> f.substream.seekby(7)
> 
> # we can resume reading of UTF8 *characters*
> f.read(3)
> 
> heck, i even like this idea :)

Note that resyncing is a really tricky operation, and
should not be expected to work for all encodings. For
example, the iso-2022 encodings are stateful: you have to
know what character set you are "in", and to find out you
have to read forward/backward until you find an escape
sequence that switches the character set.
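
To illustrate why the current character set matters, here is a
minimal sketch using Python's "iso2022_jp" codec. The sample text
and the variable names are made up for the example. It decodes the
same bytes starting just past the switching escape sequence, as a
blind seek might do:

```python
# Hypothetical sample text; "iso2022_jp" is Python's codec name for ISO-2022-JP.
text = "abc 日本語 def"
data = text.encode("iso2022_jp")

# Escape sequence that switches the stream to JIS X 0208 (two bytes per character).
ESC_TO_JIS = b"\x1b$B"

# Pretend a seek landed just past the escape sequence, so the decoder
# never sees it and still believes it is in ASCII.
start = data.index(ESC_TO_JIS) + len(ESC_TO_JIS)
tail = data[start:].decode("iso2022_jp")

print(repr(text))  # what was written
print(repr(tail))  # what a decoder with the wrong state reads back
```

The decoder raises no error here: the two-byte codes are all in the
printable 7-bit range, so they are silently read as ASCII characters.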

There is an RFC-imposed requirement that each line
of input is "neutral" with respect to character set
switching, so you can typically synchronize at a line break.
Still, this could require skipping an arbitrary amount of text.
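
Synchronizing at a line break could be sketched roughly as follows.
The helper name resync_at_line_break is hypothetical, and the sketch
assumes ISO-2022-JP data in which every line ends in the ASCII state
and the byte 0x0A never occurs inside a multi-byte character:

```python
def resync_at_line_break(data: bytes, offset: int) -> int:
    """Return the first position >= offset that begins a line.

    Hypothetical helper: assumes each line ends in the ASCII state,
    so decoding may safely restart right after any newline byte.
    Returns len(data) if no later line start exists.
    """
    if offset <= 0:
        return 0
    # Searching from offset - 1 also accepts an offset that already
    # sits directly after a newline.
    newline = data.find(b"\n", offset - 1)
    return len(data) if newline == -1 else newline + 1


# Hypothetical three-line sample, encoded with Python's iso2022_jp codec.
text = "一行目 first\n二行目 second\n三行目 third\n"
data = text.encode("iso2022_jp")

requested = 3                      # lands in the middle of the first line
start = resync_at_line_break(data, requested)
skipped = start - requested        # bytes thrown away to get back in sync

print(skipped)
print(repr(data[start:].decode("iso2022_jp")))
```

The number of skipped bytes depends entirely on where the next line
break happens to be, which is the "arbitrary amount of text" problem.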

Regards,
Martin

