ascii to unicode line endings

fidtz at clara.co.uk
Thu May 3 08:45:26 EDT 2007


On 3 May, 13:00, Jean-Paul Calderone <exar... at divmod.com> wrote:
> On 3 May 2007 04:30:37 -0700, f... at clara.co.uk wrote:
>
>
>
> >On 2 May, 17:29, Jean-Paul Calderone <exar... at divmod.com> wrote:
> >> On 2 May 2007 09:19:25 -0700, f... at clara.co.uk wrote:
>
> >> >The code:
>
> >> >import codecs
>
> >> >udlASCII = file("c:\\temp\\CSVDB.udl",'r')
> >> >udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")
>
> >> >udlUNI.write(udlASCII.read())
>
> >> >udlUNI.close()
> >> >udlASCII.close()
>
> >> >This doesn't seem to generate the correct line endings. Instead of
> >> >converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as
> >> >0x0D/0x0A
>
> >> >I have tried various 2 byte unicode encoding but it doesn't seem to
> >> >make a difference. I have also tried modifying the code to read and
> >> >convert a line at a time, but that didn't make any difference either.
>
> >> >I have tried to understand the unicode docs but nothing seems to
> >> >indicate why a seemingly incorrect conversion is being done.
> >> >Obviously I am missing something blindingly obvious here, any help
> >> >much appreciated.
>
> >> Consider this simple example:
>
> >>   >>> import codecs
> >>   >>> f = codecs.open('test-newlines-file', 'w', 'utf16')
> >>   >>> f.write('\r\n')
> >>   >>> f.close()
> >>   >>> f = file('test-newlines-file')
> >>   >>> f.read()
> >>   '\xff\xfe\r\x00\n\x00'
>
> >> And how it differs from your example.  Are you sure you're examining
> >> the resulting output properly?
>
> >> By the way, "\r\0\n\0" isn't a "unicode line ending", it's just the UTF-16
> >> encoding of "\r\n".
>
> >> Jean-Paul
>
> >I am not sure what you are driving at here, since I started with an
> >ascii file, whereas you just write a unicode file to start with. I
> >guess the direct question is "is there a simple way to convert my
> >ascii file to a utf16 file?". I thought either string.encode() or
> >writing to a utf16 file would do the trick but it probably isn't that
> >simple!
>
> There's no such thing as a unicode file.  The only difference between
> the code you posted and the code I posted is that mine is self-contained
> and demonstrates that the functionality works as you expected it to work,
> whereas the code you posted requires external resources which are not
> available to run, and produces external results which are not available
> to be checked for correctness.
>
> So what I'm driving at is that both your example and mine are doing it
> correctly (because they are doing the same thing), and mine demonstrates
> that it is correct, but we have to take your word on the fact that yours
> doesn't work. ;)
>
> Jean-Paul

Thanks for the advice. I cannot prove what is going on. The following
code seems to work fine as far as the console output goes, but the
actual bit patterns of the files on disk are not what I am expecting
(or what the ultimate consumer of the converted file expects as input).
Which, of course, I can't prove either.

>>> import codecs
>>> testASCII = file("c:\\temp\\test1.txt",'w')
>>> testASCII.write("\n")
>>> testASCII.close()
>>> testASCII = file("c:\\temp\\test1.txt",'r')
>>> testASCII.read()
'\n'
Bit pattern on disk: 0x0D 0x0A
>>> testASCII.seek(0)
>>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
>>> testUNI.write(testASCII.read())
>>> testUNI.close()
>>> testUNI = file("c:\\temp\\test2.txt",'r')
>>> testUNI.read()
'\xff\xfe\n\x00'
Bit pattern on disk: 0xFF 0xFE 0x0A 0x00
Bit pattern I was expecting: 0xFF 0xFE 0x0D 0x00 0x0A 0x00
>>> testUNI.close()
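[A note on the transcript above: the likely culprit is text-mode newline
translation on Windows. Opening the input file with mode 'r' lets the C
runtime collapse the on-disk 0x0D/0x0A pair into a bare '\n' before it ever
reaches the encoder, so only '\n' gets encoded to UTF-16; codecs.open itself
opens its file in binary and never re-expands it. Reading the input in
binary mode ('rb') preserves the pair. A minimal, self-contained sketch of
the fix — the temp-file paths here are illustrative, not the original
c:\temp names:]

```python
import codecs
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "test1.txt")
dst = os.path.join(tmpdir, "test2.txt")

# Write a CR/LF pair in *binary* mode so nothing is translated on the way out.
with open(src, "wb") as f:
    f.write(b"\r\n")

# Read in binary mode too: 'rb' hands back the bytes exactly as stored,
# whereas text mode 'r' on Windows would collapse \r\n to \n before encoding.
with open(src, "rb") as f:
    data = f.read()

# Encode to UTF-16; codecs.open writes a BOM followed by two-byte code units
# and performs no newline translation of its own.
with codecs.open(dst, "w", "utf16") as f:
    f.write(data.decode("ascii"))

with open(dst, "rb") as f:
    raw = f.read()

# On a little-endian machine this is the BOM plus \r\0\n\0,
# i.e. the byte pattern the original post was expecting.
print(repr(raw))
```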

Dom




More information about the Python-list mailing list