Python UTF-8 and codecs

Mike Currie dev at null.com
Tue Jun 27 22:38:22 CEST 2006


I did make a mistake; it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with newline and tab characters 
inside quoted fields.  The idea is to convert all the newline and tab 
characters to 0x85 and 0x88 respectively, then process the files.  Finally, 
right before importing them into a database, convert them back to newlines 
and tabs, thus preserving the field values.
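
A minimal sketch of that round trip, to make the idea concrete.  It assumes 
Python 3 (where every str is Unicode), double-quoted fields with no escaped 
quotes inside them, and the two placeholder code points mentioned above.  The 
names protect, restore, NEWLINE_MARK and TAB_MARK are made up for this 
illustration and are not taken from any library.

```python
# Placeholder code points taken from the message above; any characters that
# never occur in the real data would serve equally well.
NEWLINE_MARK = "\u0085"
TAB_MARK = "\u0088"


def protect(text, quote='"'):
    """Replace newlines and tabs that occur inside quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == quote:
            # Assumption: quotes are never escaped or doubled inside a field.
            in_quotes = not in_quotes
        elif in_quotes and ch == "\n":
            ch = NEWLINE_MARK
        elif in_quotes and ch == "\t":
            ch = TAB_MARK
        out.append(ch)
    return "".join(out)


def restore(text):
    """Turn the placeholders back into newline and tab."""
    return text.replace(NEWLINE_MARK, "\n").replace(TAB_MARK, "\t")


sample = 'id\t"line one\nline two"\t"a\tb"\n'
protected = protect(sample)
restored = restore(protected)
```

One thing to watch when processing the protected text: str.splitlines() 
treats U+0085 (NEL) as a line boundary, so splitting on "\n" explicitly is the 
safer choice while the placeholders are in place.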

Will python not handle the control characters correctly?


"Serge Orlov" <serge.orlov at gmail.com> wrote in message 
news:mailman.7516.1151440194.27775.python-list at python.org...
> On 6/27/06, Mike Currie <dev at null.com> wrote:
>> I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
>> them.  Every configuration I try I get a UnicodeError: ascii codec can't
>> decode byte 0x85 in position 255: ordinal not in range(128)
>>
>> I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', 
>> errors='strict')
>> and that doesn't work and I've also tried wrapping the file in an 
>> utf8_writer
>> using codecs.lookup('utf8')
>>
>> Any clues?
>
> Use unicode strings for non-ascii characters. The following program 
> "works":
>
> import codecs
>
> c1 = unichr(0x85)
> f = codecs.open('foo.txt', 'wU', 'utf-8')
> f.write(c1)
> f.close()
>
> But unichr(0x85) is a control character, are you sure you want it?
> What is the encoding of your data? 
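
For readers on Python 3, here is a rough equivalent of the quoted program, 
offered as a sketch rather than a drop-in replacement.  It assumes the 
built-in open() with an explicit encoding in place of codecs.open, and chr / 
"\u0085" in place of unichr.  The temporary path is a stand-in for foo.txt.  
The last part reproduces the kind of failure reported in the original 
question: a raw 0x85 byte cannot be decoded as ASCII.

```python
import os
import tempfile

c1 = "\u0085"  # same code point as unichr(0x85) in the quoted program

# Hypothetical scratch location standing in for 'foo.txt'.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# newline="" disables newline translation, so the data is written untouched.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(c1)

# Read the raw bytes: UTF-8 stores U+0085 as two bytes.
with open(path, "rb") as f:
    raw = f.read()

# Read it back as text: the original character comes back unchanged.
with open(path, encoding="utf-8", newline="") as f:
    text = f.read()

# A lone 0x85 byte is not valid ASCII, hence the error in the question.
try:
    b"\x85".decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
```

On disk the character is the two-byte sequence 0xC2 0x85, not the single byte 
0x85, so any later step that expects one byte per placeholder would need to 
decode the file as UTF-8 first.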




