utf8 and ftplib

Thu Jun 16 14:06:50 EDT 2005

"Richard Lewis" <richardlewis at fastmail.co.uk> wrote in message 
news:mailman.540.1118935910.10512.python-list at python.org...
> Hi there,
>
> I'm having a problem with unicode files and ftplib (using Python 2.3.5).
>
> I've got this code:
>
> xml_source = codecs.open("foo.xml", 'w+b', "utf8")
> #xml_source = file("foo.xml", 'w+b')
>
> ftp.retrbinary("RETR foo.xml", xml_source.write)
> #ftp.retrlines("RETR foo.xml", xml_source.write)
>
> It opens a new local file using utf8 encoding and then reads from a file
> on an FTP server (also utf8 encoded) into that local file. It comes up
> with an error, however, on calling the xml_source.write callback (I
> think) saying that:
>
> "File "myscript.py", line 75, in get_content
>  ftp.retrbinary("RETR foo.xml", xml_source.write)
> File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary
>  callback(data)
> File "/usr/lib/python2.3/codecs.py", line 400, in write
>  return self.writer.write(data)
> File "/usr/lib/python2.3/codecs.py", line 178, in write
>  data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76:
> ordinal not in range(128)"
>
> I've tried using both the commented lines of code in the above example
> (i.e. using file() instead of codecs.open() and retrlines() instead of
> retrbinary()). retrlines() makes no difference, but if I use file()
> instead of codecs.open() I can open the file, but the extended
> characters from the source file (e.g. foreign characters, copyright
> symbol, etc.) all appear with an extra character in front of them
> (because of the two char width in utf8?).
>
> Is the xml_source.write callback causing the problem here? Or is it
> something else? Is there any way that I can correctly retrieve a utf8
> encoded file from an FTP server?

It looks like there are at least two problems here. The major one
is that you seem to have a misconception about utf-8 encoding.

The _disk_ version of the file is what is encoded in utf-8, and it has
to be decoded to unicode when it is read later. In other words,
the bytes you received from the server are exactly what you should
have put on disk, without any conversion. As you noted, when you
did that (using file()), the FTP part of the process worked.
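
To make the point above concrete, here is a minimal sketch, written
for Python 3 rather than the 2.3.5 in the question. It assumes no real
FTP server: fake_retrbinary is a hypothetical stand-in for
ftp.retrbinary, and the sample document is made up. The bytes go to
disk untouched and are decoded only when the file is read back.

```python
import os
import tempfile

def fake_retrbinary(payload, callback, blocksize=8):
    """Hypothetical stand-in for ftp.retrbinary: feeds raw byte chunks to callback."""
    for start in range(0, len(payload), blocksize):
        callback(payload[start:start + blocksize])

# Bytes as the server would send them: UTF-8 text containing a copyright sign.
remote_bytes = "<doc>\u00a9 2005</doc>".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# Store the bytes untouched: binary mode, no codec wrapper.
with open(path, "wb") as xml_source:
    fake_retrbinary(remote_bytes, xml_source.write)

# Decode from UTF-8 only when the file is read back as text.
with open(path, "r", encoding="utf-8") as reader:
    text = reader.read()

# Re-read the raw bytes to confirm nothing was converted on the way in.
with open(path, "rb") as raw:
    on_disk = raw.read()
```

With a real connection, the callback passed to ftp.retrbinary would
be the same xml_source.write of a file opened in binary mode.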

Whatever program you are using to read it has to then decode
it from utf-8 into unicode. Failure to do this is what is causing
the extra characters on output.
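
A small illustration of where the extra character probably comes
from, again as a Python 3 sketch. It assumes the viewing program
treats the file as Latin-1, which is only a guess about your setup;
any single-byte encoding would show a similar effect.

```python
# The copyright sign is one character but two bytes in UTF-8.
copyright_bytes = "\u00a9".encode("utf-8")

# Assumed misreading: each byte is taken as its own Latin-1 character,
# so an extra character appears in front of the copyright sign.
as_latin1 = copyright_bytes.decode("latin-1")

# Correct reading: the two bytes are decoded together as UTF-8.
as_utf8 = copyright_bytes.decode("utf-8")
```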

The object returned by codecs.open raised an exception
because its write method expects a unicode string; it got a
byte string already encoded in utf-8. The internal mechanism
first tries to decode that into unicode before encoding it
into utf-8. Unfortunately, the default codec for implicit
encoding or decoding (outside of special contexts) is
7-bit ASCII, so any byte outside the ASCII range, such as
the 0xc2 in your traceback, is invalid.
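
The decode step can be reproduced by hand. This is a Python 3
sketch, where the ASCII decode has to be spelled out explicitly
because Python 3 no longer does it implicitly; the 76 filler bytes
are made up so that the failure lands at the same position as in
the traceback.

```python
# 76 plain ASCII bytes followed by a UTF-8 encoded copyright sign.
data = b"x" * 76 + "\u00a9".encode("utf-8")

# Decoding as ASCII fails on the first byte above 0x7F.
try:
    data.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Decoding with the right codec, then re-encoding, gives the same bytes back.
round_trip = data.decode("utf-8").encode("utf-8")
```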

Amusingly, this would have worked:

xml_source = codecs.EncodedFile(file("foo.xml", "w+b"), "utf-8", "utf-8")

It is, of course, an expensive way of doing nothing, but
it at least has the virtue of being good documentation.
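
A runnable sketch of that round trip, in Python 3, assuming
EncodedFile is handed an already-open binary file object (which is
what its first parameter expects). The path and sample bytes are
made up for illustration.

```python
import codecs
import os
import tempfile

# UTF-8 bytes such as a binary FTP transfer would deliver.
payload = "<doc>\u00a9 2005</doc>".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "foo.xml")

with open(path, "wb") as raw:
    # Bytes written are decoded from utf-8, then re-encoded to utf-8.
    xml_source = codecs.EncodedFile(raw, "utf-8", "utf-8")
    xml_source.write(payload)

# Read the file back as raw bytes to compare with the input.
with open(path, "rb") as check:
    on_disk = check.read()
```

One caveat I believe applies: the recoder appears to decode each
write() call on its own, so a chunk boundary falling inside a
multi-byte character could raise an error. A file opened in plain
binary mode has no such issue.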

HTH

John Roth

>
> Cheers,
> Richard