read a unicode file

Martin v. Löwis martin at v.loewis.de
Tue Jun 10 01:29:13 EDT 2003


Alan Kennedy <alanmk at hotmail.com> writes:

> So, a question: Why would the 'utf-16' codec not support readline?
> Looking at the 'Lib\encodings\utf-16.py' module gives no hints. Is
> there a problem with knowing what constitutes a line ending in that
> encoding?

The problem is that the codec's .readline usually invokes the
.readline of the underlying stream. For UTF-16, this fails. On some
systems, .readline of the stream will break at the next \n byte
(0x0A), which means that there might be a dangling second byte. That
byte might not even be NUL, in which case .readline has misinterpreted
a 0x0A byte that belongs to some other character. On other systems,
.readline looks for \r\n and may fail to find it (since there are
interspersed NUL bytes), or it may find a \r\n byte sequence that is
not a line break at all, but the single character U+0D0A (or U+0A0D,
depending on byte order).
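
To make the failure concrete, here is a small sketch in current
Python 3 syntax (an assumption on my part; the thread predates it).
It uses io.BytesIO as a stand-in for the underlying byte stream and
shows where a byte-level readline cuts UTF-16 data.

```python
import io

# "ab\ncd\n" in UTF-16-LE: every ASCII character is followed by a NUL byte.
data = "ab\ncd\n".encode("utf-16-le")

raw = io.BytesIO(data)
first = raw.readline()    # stops right after the 0x0A byte
second = raw.readline()   # starts with the dangling NUL

# U+010A encodes as the bytes 0x0A 0x01 in UTF-16-LE, so a byte-level
# readline splits this character in the middle.
tricky = io.BytesIO("x\u010ay".encode("utf-16-le"))
chunk = tricky.readline()

# The bytes 0x0D 0x0A read as one big-endian code unit are U+0D0A.
not_a_newline = b"\r\n".decode("utf-16-be")
```

Here first has an odd length, so it cannot be decoded as UTF-16 on its
own, and chunk ends in the middle of a character.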

So for UTF-16, the codec would have to read ahead to find the line
ends, and it would have to take into account the line breaking
conventions. It may then happen that it has read more data than
desired, so it also needs to implement buffering.

Nobody has yet contributed code to do so.
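
For illustration only, the read-ahead and buffering described above
could be sketched as follows in current Python 3. The class name
BufferedLineReader and its chunk_size parameter are hypothetical; this
is not the codec's actual implementation. It relies on
codecs.getincrementaldecoder, decodes fixed-size chunks, and looks for
line ends only in the decoded text. As a simplification it treats only
\n as a line end, so \r\n lines come back intact but a lone \r is not
recognized.

```python
import codecs
import io


class BufferedLineReader:
    """Hypothetical helper: decode in chunks, split lines after decoding."""

    def __init__(self, stream, encoding="utf-16", chunk_size=64):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.buffer = ""    # decoded text read ahead but not yet returned
        self.eof = False

    def readline(self):
        while True:
            pos = self.buffer.find("\n")
            if pos >= 0:
                # Return one line and keep the surplus for the next call.
                line = self.buffer[:pos + 1]
                self.buffer = self.buffer[pos + 1:]
                return line
            if self.eof:
                # No more input: return whatever is left (possibly "").
                line, self.buffer = self.buffer, ""
                return line
            chunk = self.stream.read(self.chunk_size)
            if chunk:
                # The decoder holds back incomplete code units itself.
                self.buffer += self.decoder.decode(chunk)
            else:
                self.eof = True
                self.buffer += self.decoder.decode(b"", final=True)


data = "first\r\nsecond\nthird".encode("utf-16")
reader = BufferedLineReader(io.BytesIO(data), chunk_size=3)
lines = [reader.readline() for _ in range(4)]
```

The odd chunk_size in the usage lines is deliberate: it forces reads
that end in the middle of a code unit, which the incremental decoder
has to carry over to the next chunk.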

Regards,
Martin




