[lxml-dev] File name encoding problems
Hi!

I'm not sure if it's really an lxml problem, but it looks like one. I'm using lxml version 1.0.3 (from the Debian package). My default system encoding is ISO-8859-2. I have a simple program:

    #!/usr/bin/python
    # -*- coding: iso-8859-2 -*-
    import lxml.etree
    d = lxml.etree.parse('/tmp/ó.xml')

The problem is with the letter 'ó' (or any other non-ASCII letter) in the file name. When I run the program I get something like this:

    Traceback (most recent call last):
      File "./p.py", line 6, in ?
        d = lxml.etree.parse('/tmp/łódź.xml')
      File "etree.pyx", line 1615, in etree.parse
      File "parser.pxi", line 687, in etree._parseDocument
      File "apihelpers.pxi", line 343, in etree._utf8
    AssertionError: All strings must be Unicode or ASCII

If I try to write the path using UTF-8 encoding, the file cannot be found (because a path with the UTF-8 encoded name does not exist). This did not happen with Python 2.3; the problem occurs only with Python 2.4.

There's a workaround: reading the file into a StringIO object and then parsing the XML from that object. It works fine, but it's silly.

I would be very grateful for any suggestions.

Paweł Pałucha
Hi Paweł,

first thing to note: lxml uses UTF-8 internally, also for filenames, as libxml2 requires a char sequence to represent them. If your system can't handle that, we'll have to figure out a way to make it work. This part of lxml has not been tested much, so it would be nice if you could help us get this straight.

Paweł Pałucha wrote:
> I'm using lxml version 1.0.3 (from Debian package). My default system encoding is ISO-8859-2. I have a simple program:
>
> #!/usr/bin/python
> # -*- coding: iso-8859-2 -*-
> import lxml.etree
> d = lxml.etree.parse('/tmp/ó.xml')
Ok, so you're using 8-bit encoded filenames.
> Traceback (most recent call last):
>   File "./p.py", line 6, in ?
>     d = lxml.etree.parse('/tmp/łódź.xml')
>   File "etree.pyx", line 1615, in etree.parse
>   File "parser.pxi", line 687, in etree._parseDocument
>   File "apihelpers.pxi", line 343, in etree._utf8
> AssertionError: All strings must be Unicode or ASCII
Right, I guess that treating the filename with the _utf8() function is not the right thing to do for 8-bit strings. We should have a separate way of treating filenames. I'll look into it.
> If I try to write path using UTF-8 encoding, the file cannot be found (because the path with UTF-8 encoded name does not exists).
I assume you get the same file-not-found error if you pass the filename as a unicode string?

    d = lxml.etree.parse(u'/tmp/ó.xml')
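A small sketch of why the UTF-8 spelling of the path does not match a file that was created under ISO-8859-2: the same name encodes to different byte sequences. It is written for present-day Python 3 rather than the Python 2.4 used in this thread, and it does not touch lxml at all; the path is the one from the report and the variable names are only illustrative.

```python
# The file name from the report, written with an escape so that this
# snippet does not depend on the encoding of the source file itself.
name = "/tmp/\u00f3.xml"  # '/tmp/ó.xml'

# The bytes an ISO-8859-2 system stores for this name ...
latin2_path = name.encode("iso-8859-2")

# ... and the bytes produced when the same name is encoded as UTF-8.
utf8_path = name.encode("utf-8")

print(latin2_path)  # b'/tmp/\xf3.xml'
print(utf8_path)    # b'/tmp/\xc3\xb3.xml'

# The two byte strings differ, so a lookup with the UTF-8 bytes cannot
# find a file whose name was stored as ISO-8859-2 bytes.
print(latin2_path == utf8_path)  # False
```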
> This did not happen with python 2.3 - the problem is only with python2.4. There's a workaround - reading file to StringIO object and then parsing XML from that object. It works fine but it's silly.
You can also pass an opened file object.

Stefan
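Both workarounds from this exchange (parsing from an in-memory buffer and passing an opened file object) can be sketched roughly as follows. This is an illustration in present-day Python 3, not the Python 2.4 code from the thread. The helper names, the temporary directory and the sample XML are made up for the example, and the fallback to the standard library's ElementTree is only an assumption so that the snippet also runs where lxml is not installed.

```python
import io
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption for this sketch only: fall back to the standard library,
    # whose parse() also accepts file objects.
    import xml.etree.ElementTree as etree


def parse_from_file_object(path):
    """Let Python open the file, then hand the open file object to the parser."""
    with open(path, "rb") as f:
        return etree.parse(f)


def parse_from_memory(path):
    """Read the whole file first, then parse from an in-memory buffer."""
    with open(path, "rb") as f:
        data = f.read()
    return etree.parse(io.BytesIO(data))


# Demonstration with a non-ASCII file name in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "\u00f3.xml")
    with open(path, "wb") as f:
        f.write(b"<root><child/></root>")

    tag_from_file = parse_from_file_object(path).getroot().tag
    tag_from_memory = parse_from_memory(path).getroot().tag

print(tag_from_file, tag_from_memory)
```

In both variants the file name is resolved by Python's own `open()`, so the parser never has to interpret the path itself.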
Hi Paweł,

Stefan Behnel wrote:
> first thing to note: lxml uses UTF-8 internally, also for filenames, as libxml2 requires a char sequence for their representation. If your system can't handle that, we'll have to figure out a way to make it work.
>
> I guess that treating the filename with the _utf8() function is not the right thing to do for 8-bit strings. We should have a separate way of treating filenames. I'll look into it.
Here's a patch that might fix your problem. However, it's against the current trunk (i.e. 1.1 beta), as fixing this problem requires a behavioural change that will not make it into 1.0.

The web page has information on how to build lxml on Linux; it's pretty easy.

Stefan
Participants (2): Paweł Pałucha, Stefan Behnel