[Python-3000] encoding hell

Sun Sep 3 00:29:25 CEST 2006

[Talin]
> The Java version, java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
> Sounds to me like copying the Java model would work.

then there's no need for it at all... just read() and discard the return value.
we don't need a special API for that.
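
to make the "read and discard" point concrete, here is a minimal sketch.
it uses the io module from current python as a stand-in for a text
layer (an assumption; it is not the iostack API), and skip_chars is a
hypothetical helper name, not part of any library.

```python
import io

def skip_chars(stream, n, chunk_size=4096):
    """Skip up to n characters by reading and discarding them.

    Returns the number of characters actually skipped, which is
    smaller than n only if the stream ends first.
    """
    skipped = 0
    while skipped < n:
        data = stream.read(min(chunk_size, n - skipped))
        if not data:          # end of stream
            break
        skipped += len(data)
    return skipped

# a UTF-8 byte stream wrapped in a text decoder
text = io.TextIOWrapper(io.BytesIO("héllo wörld".encode("utf-8")),
                        encoding="utf-8")
skip_chars(text, 6)           # discards "héllo "
rest = text.read()            # whatever follows the skipped characters
```

the skip counts characters, not bytes, and never needs to know how wide
each character is, because the decoder does that work.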

on the other hand, the .NET version has a BaseStream attribute holding
the underlying stream over which the StreamReader operates... this
means you *can* change the position if the underlying stream supports
seeking.

i read through the msdn but found no explicit definition for what happens
in the case of seeking in text-encoded streams, but they noted
somewhere they use a "best fit" decoder, which, to the best of my
understanding, may skip some bytes until it's in synch with the stream.

that's a *horrible* design, imho, but that's microsoft. i say let's leave it
below layer 3, at the byte level. if users find seeking very important,
we can come up with a layer-2 ReSyncLayer, which will attempt to
resynchronize with a specified encoding.

for example:

f = TextAdapter(
    ReSyncLayer(
        BufferedLayer(
            FileStream("blah", "r")
        ),
        encoding = "utf8"
    ),
    encoding = "utf8"
)

# read 3 UTF8 *characters*
f.read(3)

# this will seek by AT LEAST 7 *bytes*, until resynched
f.substream.seekby(7)

# we can resume reading of UTF8 *characters*
f.read(3)
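
for UTF-8 specifically, the resync step could look something like the
sketch below. it relies on the fact that UTF-8 continuation bytes always
have the bit pattern 10xxxxxx, so a reader that lands mid-character can
skip forward to the next byte that can start a character. resync_utf8 is
a hypothetical helper, not part of iostack, and the trick only works for
self-synchronizing encodings.

```python
def resync_utf8(data, pos):
    """Return the smallest index >= pos where a UTF-8 character can start.

    Continuation bytes match 0b10xxxxxx, so they are skipped until a
    lead byte (or the end of the data) is reached.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

raw = "aé€b".encode("utf-8")      # 1 + 2 + 3 + 1 bytes
start = resync_utf8(raw, 4)       # offset 4 is in the middle of "€"
tail = raw[start:].decode("utf-8")
```

this is why the seek above is "at least" 7 bytes: the layer may have to
move a few bytes further before decoding can resume.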

heck, i even like this idea :)
thanks for the pointers.

-tomer

On 9/2/06, Talin <talin at acm.org> wrote:
> tomer filiba wrote:
> > i'm quite finished with the base of iostack (streams and layers), and
> > have moved to implementing the adapters layer (especially the dreaded
> > TextAdapter).
> >
> > as was discussed earlier, streams and layers work with bytes, while
> > adapters may work with arbitrary objects (be it struct-style records,
> > serialized objects, characters and whatnot).
> >
> > the question that arises is -- how far should we stretch this abstraction?
> > for example, the TextAdapter reads and writes characters to the
> > stream, after they go through encoding or decoding, so from the programmer's
> > point of view, he's working with *characters*, not *bytes*.
> > that means the programmer need not be aware of how the characters
> > are "physically" stored in the underlying stream.
> >
> > that's all very nice, but what do we do when it comes to seek()ing?
> > do you want to seek by character position or by byte position?
> > logically you are working with characters, but it would be impossible
> > to implement without first decoding the entire stream in-memory...
> > which is unacceptable of course.
> >
> > and if seek()ing is byte-oriented, then you must somehow seek
> > only to the beginning of a multibyte character sequence... how
> > would you do that?
> >
> > my solution would be completely leaving seek() and tell() out of the
> > 3rd layer -- it's a byte-level operation.
> >
> > does anyone think differently? if so, what's your solution?
>
> Well, for comparison with other APIs:
>
> The .Net equivalent, System.IO.TextReader, does not have a "seek" method
> at all.
>
> > The Java version, java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
>
> Sounds to me like copying the Java model would work.
>
> -- Talin
>