[Tutor] Composing codecs using codecs.EncodedFile / UTF-16 DOS format converted to Unix ASCII

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Wed Jan 8 14:14:03 2003


On Tue, 7 Jan 2003, Poor Yorick wrote:

> >Ah, I see what you mean now.  No, as far as I understand, Python
> >doesn't do this automatic conversion of newlines.  However, Python
> >2.3's "Universal Newline" support is probably what you're looking for.
> >
>
> Thanks for the sample code!  It'll take me a while to digest....
>
> But about newlines, I thought that '\n' was already a sort of universal
> newline for Python.  On windows platforms, both open.read and
> open.readlines already transform '\r\n' into '\n' unless you use binary
> mode.  That's why I thought it was a discrepancy for codecs.open to
> return '\r\n'.

Ah!  I completely forgot about that!


You're right: when we open a file in "text" mode, which is what we've been
doing in the past examples, there's a platform-dependent automatic
conversion of the line endings:

    http://www.wdvl.com/Authoring/Languages/Python/Quick/python4_2.html


However, this "\r\n"->"\n" conversion won't take effect on UTF-16-encoded
files, because the bytes in the file aren't '\r\n' but rather the
four-byte sequence '\x00\r\x00\n' (that's the big-endian order; a
little-endian file holds '\r\x00\n\x00' instead).

That is, there's padding involved because each character now takes up
two bytes!  So even when we open the file in text mode, Python's file
operations don't catch the sequence and don't do the platform-dependent
conversion here.
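
Here's a small sketch that shows the padding directly.  It assumes a
present-day Python 3 interpreter rather than the Python of this thread,
and the sample string is made up for illustration:

```python
# A made-up sample with DOS line endings.
text = "one\r\ntwo\r\n"

# Encode it both ways; the explicit -be/-le codecs add no byte order mark.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

# The two-byte pair '\r\n' never shows up next to each other...
print(b"\r\n" in big, b"\r\n" in little)

# ...because every character is padded out with a zero byte.
print(b"\x00\r\x00\n" in big)
print(b"\r\x00\n\x00" in little)
```

A byte-level search for '\r\n' finds nothing in either encoding, which is
why a text-mode conversion that works on bytes has nothing to match.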


Even the Universal Newlines support of Python 2.3 won't help us here,
since its normalization looks for '\r\n' and will pass over those four
bytes without touching them.  Or even worse, it may convert what looks
like a lone "\r" into another newline, since that looks like a Macintosh
newline.  So we may really need to open UTF-16 files in binary mode after
all to be careful.
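
One way to do that, as a rough sketch: read the raw bytes, decode them,
and only then fix up the line endings, once we're dealing with real
characters.  This assumes a present-day Python 3 interpreter and a file
that starts with a byte order mark; the function names are made up for
this example and aren't part of the codecs module:

```python
def utf16_bytes_to_unix_text(raw):
    """Decode UTF-16 bytes, then turn DOS line endings into Unix ones."""
    # The "utf-16" codec uses the byte order mark to pick the byte order
    # and drops the mark from the result.
    text = raw.decode("utf-16")
    # Now '\r\n' really is two adjacent characters, so a plain replace works.
    return text.replace("\r\n", "\n")


def read_utf16_as_unix_text(path):
    """Read a UTF-16 file in binary mode and return newline-normalized text."""
    # Binary mode keeps the file layer from touching any '\r' or '\n' bytes.
    with open(path, "rb") as handle:
        return utf16_bytes_to_unix_text(handle.read())
```

If the text is pure ASCII, calling .encode("ascii") on the result gives
the Unix ASCII bytes to write back out, again in binary mode.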


Hmmm... perhaps this is a bug!  Perhaps the utf-16 decoder should really
do the '\r\n' normalization if it's running on a Windows platform.  I
haven't been able to find other pages on Google that talk about this
issue, so I have no idea what other people think about it.  You might
want to bring it up on the i18n-sig:

    http://www.python.org/sigs/i18n-sig/

It would be an interesting thing to talk about.  I'm sorry I can't give a
definitive answer on this one; I really don't know what the "right" thing
to do is in this case.


Good luck to you!