
Hi everyone,

I just released lxml 3.1.0. This is a stable feature release that mainly adds a new incremental XML serialisation API, but also fixes a couple of bugs. Upgrading from 3.0.x should be smooth. The complete changelog follows below.

Note that this release is called "3.1.0" instead of just "3.1" in order to simplify version sorting.

You can get it from PyPI:

http://pypi.python.org/pypi/lxml/3.1.0

The documentation is here:

http://lxml.de/

Download: http://lxml.de/files/lxml-3.1.0.tgz

Signature: http://lxml.de/files/lxml-3.1.0.tgz.asc

GitHub: https://github.com/lxml/lxml/commit/f76cca898b7edefe91be584540883d158416e6cb

This release was built using Cython 0.18. Note that the build no longer uses Cython, even if it is installed. Recompilation of the sources has to be requested explicitly with the setup.py option "--with-cython".

If you are interested in commercial support or customisations for the lxml package, please contact me directly.

Have fun,

Stefan


3.1.0 (2013-02-10)
==================

Features added
--------------

* GH#89: lxml.html.clean allows overriding the set of attributes that it considers 'safe'. Patch by Francis Devereux.

Bugs fixed
----------

* LP#1104370: ``copy.copy(el.attrib)`` raised an exception. It now returns a copy of the attributes as a plain Python dict.

* GH#95: When used with namespace prefixes, the ``el.find*()`` methods always used the first namespace mapping that was provided for each path expression instead of using the one that was actually passed in for the current run.

* LP#1092521, GH#91: Fix undefined C symbol in Python runtimes compiled without threading support. Patch by Ulrich Seidl.

Other changes
-------------


3.1beta1 (2012-12-21)
=====================

Features added
--------------

* New build-time option ``--with-unicode-strings`` for Python 2 that makes the API always return Unicode strings for names and text instead of byte strings for plain ASCII content.

* New incremental XML file writing API ``etree.xmlfile()``.

* E factory in lxml.objectify is callable to simplify the creation of tags with non-identifier names without having to resort to getattr().

Bugs fixed
----------

* When starting from a non-namespaced element in lxml.objectify, searching for a child without explicitly specifying a namespace incorrectly found namespaced elements with the requested local name, instead of restricting the search to non-namespaced children.

* GH#85: Deprecation warnings were fixed for Python 3.x.

* GH#33: lxml.html.fromstring() failed to accept bytes input in Py3.

* LP#1080792: Static build of libxml2 2.9.0 failed due to missing file.

Other changes
-------------

* The externally useless class ``_ObjectifyElementMakerCaller`` was removed from the module API of lxml.objectify.

* LP#1075622: lxml.builder is faster for adding text to elements with many children. Patch by Anders Hammarquist.
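
As a rough illustration of the incremental writing API named in the changelog, ``etree.xmlfile()``, the sketch below streams a small document into an in-memory buffer. The helper name ``write_items`` and the element names are made up for this example, and the exact behaviour may differ between lxml versions.

```python
import io
from lxml import etree


def write_items(stream, texts):
    """Write <root> with one <item> per text, without building the whole tree.

    Hypothetical helper for illustration only; not part of lxml.
    """
    with etree.xmlfile(stream, encoding="utf-8") as xf:
        with xf.element("root"):
            for text in texts:
                item = etree.Element("item")
                item.text = text
                xf.write(item)  # each element is serialised as it is written


buffer = io.BytesIO()
write_items(buffer, ["one", "two"])
result = buffer.getvalue()  # the serialised document as bytes
```

The point of this style is that only one ``item`` element needs to exist in memory at a time, which matters when the output is large.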

My apologies in advance for this ignorant question: Does the fact that you can download the latest version of lxml from pypi mean that you (or rather I) no longer have to worry about the versions of the libxml libraries that come with OS Lion? Or do I still have to go through the special Mac OS routines described on the lxml site? Martin Mueller Professor of English and Classics Northwestern University On 2/10/13 12:28 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 10.02.2013 21:47:
No, it has nothing to do with that. lxml has always been available from PyPI.
Or do I still have to go through the special Mac OS routines described on the lxml site?
I don't know, can't test it. It's certainly safest to enable STATIC_DEPS, the rest should work with pip these days. That being said, the installation instructions can always use some fresh air and a bit of a spring cleanup. Stefan

I have been trying to install the latest version of lxml on OS Lion. I've had varying success. It worked for Python2.7, but didn't work for Python3.3.

I used the installation instructions given on the lxml website:

    STATIC_DEPS=true sudo pip install lxml

I got an error message about a missing "llvm-gcc-4.2", but there was good advice on how to handle this at

http://waqasshabbir.tumblr.com/post/19073648382/llvm-gcc-4-2-exe-error-on-mac-osx-lion-when-building

My goal was to install lxml in Python3.3, but the command with a specified target installed lxml in Python2.7. I learned how to use the target flag and how to install lxml in a particular version of Python with this command:

    STATIC_DEPS=true sudo pip install lxml --target /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages

The lxml package shows up properly in the site-packages directory: the same files in the same order in the 2.7 and 3.3 installation. But when running a simple script that calls on lxml, the following error message appears:

    ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/lxml/etree.so, 2): Symbol not found: _PyBaseString_Type
      Referenced from: /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/lxml/etree.so
      Expected in: flat namespace
     in /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/lxml/etree.so

In looking more closely at the two directories, I observe that each of them has a directory called "lxml-3.1.0-py2.7.egg-info." Is this right or should the pip command have imported a Python3 specific version of lxml?

Stymied and grateful for help

MM

Martin Mueller, 13.02.2013 19:19:
That's the wrong way round. You are asking a Py2.7 pip to install its package into the site-packages directory of your Py3.3 installation. Rather, you want to run pip with the Py3.3 "python" executable so that it can build a proper Py3.3 package for you. When you install pip under Py3.3, it usually creates a "pip-3.3" that you can just run (and there should be a "pip-2.7" that's identical to your current "pip"). Stefan

Actually, I did install pip with Python3.3 in the site-packages directory of Python3.3, and it shows up as pip-1.2.1-py3.3.egg. But there doesn't seem to be a way of calling it. 'pip' will do the 2.7 routine, and pip3.3 or pip-3.3 are not recognized as commands. I checked http://www.pip-installer.org/en/latest/usage.html#pip-install for any information about using different versions of pip with different versions of Python. But there is no such information. I installed pip by running 'python3.3 setup.py install' in the downloaded pip directory. That worked in the sense that it put various pip files in the appropriate site-packages directory. Martin Mueller Professor of English and Classics Northwestern University On 2/13/13 1:07 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

In article <7D89C1F94A704245AF5254CFAE7AA1A8097EBB7C@evcspmbx4.ads.northwestern.edu
If you are using a framework build of Python on OS X, most, including Pythons from the python.org binary installers, are configured to install scripts into the framework bin directory. For example, after installing distribute, pip, and virtualenv for 3.3:

$ cd /Library/Frameworks/Python.framework/Versions/3.3/bin
$ ls
2to3@              pip-3.3*           python3.3-32*       pythonw3.3-32*
2to3-3.3*          pydoc3@            python3.3-config@   pyvenv@
easy_install*      pydoc3.3*          python3.3m*         pyvenv-3.3*
easy_install-3.3*  python3@           python3.3m-config*  virtualenv*
idle3@             python3-32@        pythonw3@           virtualenv-3.3*
idle3.3*           python3-config@    pythonw3-32@
pip*               python3.3*         pythonw3.3*

The most convenient way to ensure that you pick up the right version is to ensure that that bin directory comes first on your shell PATH:

export PATH=/Library/Frameworks/Python.framework/Versions/3.3/bin:$PATH

The python.org installers provide a script in /Applications/Python x.y called "Update Shell Profile.command" to do that "permanently". The Python 2.x installers run their script by default automatically during installation. The 3.x installers do not.

BTW, the Apple-supplied system Pythons do not follow this pattern. They install scripts into /usr/local/bin.

--
Ned Deily, nad@acm.org

I am baffled by the behaviour of lxml 3.1.0 when you feed it a perfectly ordinary TEI file that begins with the standard xml declaration: <?xml version="1.0" encoding="UTF-8"?> It produces the error message "Unicode strings with encoding declaration are not supported." If you take out the line, things work. How does that make sense? The xml declaration is a standard part of an XML file. If you run xqueries with oXygen on the file, the processor does not complain. This seems to be an lxml rather than a Python feature: if you do something with the file that doesn't involve lxml, it doesn't complain about the declaration, and it handles Unicode characters just fine. Martin Mueller Professor of English and Classics Northwestern University On 2/13/13 1:07 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 16.02.2013 21:24:
It looks like you're feeding it Unicode data instead of a byte stream. If you're really reading it from a file, make sure you open it in binary mode, or just pass the filename in to lxml's parse() function. Stefan
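
The two options mentioned above (binary mode, or passing the file name) could look roughly like this. The temporary file and its TEI-like content are invented for the example; "\u017f" is the long s.

```python
import os
import tempfile
from lxml import etree

# Made-up sample document, stored as UTF-8 bytes with an XML declaration.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<TEI><p>Wi\u017fdom</p></TEI>\n"
).encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # Option 1: pass the file name and let lxml read the bytes itself.
    tree_by_name = etree.parse(path)

    # Option 2: open in binary mode ("rb") so the parser sees undecoded bytes.
    with open(path, "rb") as f:
        tree_by_handle = etree.parse(f)
finally:
    os.remove(path)

text_by_name = tree_by_name.getroot()[0].text
text_by_handle = tree_by_handle.getroot()[0].text
```

In both cases the parser reads the encoding from the XML declaration, and the text in the resulting tree is ordinary Unicode text.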

Thank you. That works, but there are other ways in which lxml 3 behaves quite differently from 2, and in ways that don't seem to be documented in the tutorial. For instance, its default prefaces every line with a b', which I take it stands for binary, and it transforms utf-8 characters into character entities. Also, when I worked with 2.7 and a pre 3 version of lxml, printed output behaved in a predictable way. But in lxml 3 the etree.tostring output makes a mess of white space and prints newline characters in the form "\n" I work a lot with Python3.3 because I don't have to worry about UTF-8: the scripts handle text files as if they were ASCII, and I don't have to worry about the encoding. The same is true of Xquery. But lxml seems to be very fussy about character issues, and oddly enough it seems even fussier in Python 3 than in Python 2. Is there a set of simple rules about what to do or not to do when you routinely work with Utf-8 and multilingual documents? Sorry to be so difficult about this. I thought lxml was going to be easier than xquery. And the ability to combine XML parsing with other operations in one program is certainly an advantage. But the management of character problems seems to be a real headache, and it's a lot harder than ordinary operations in Python. Martin Mueller Professor of English and Classics Northwestern University On 2/16/13 2:28 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

On 02/16/2013 04:31 PM, Martin Mueller wrote:
The simplest rule is: always encode your data (preferably to UTF-8) before passing it into the lxml parsing mechanism.
See the FAQ: http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com

Tres Seaver, 17.02.2013 04:41:
"bytes" actually, which is a builtin type in Python.
and it transforms utf-8 characters into character entities.
There is no such thing as "utf-8 characters". "UTF-8" is an encoding, i.e. a mapping from Unicode text (i.e. characters) to a byte sequence (i.e. bytes). And there's no difference in lxml between Py2 and Py3 in its behaviour regarding character escaping. By default, it does it in both, because the default encoding is ASCII with character references for characters that ASCII cannot represent. If you want plain, unescaped, UTF-8 encoded output, pass "utf-8" into the "encoding" option.
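
A small sketch of the difference described here, using a made-up one-element document that contains the long s (U+017F):

```python
from lxml import etree

# Parse from bytes; "\u017f" is the long s used in the thread's example.
root = etree.fromstring("<w>wi\u017fdom</w>".encode("utf-8"))

# No encoding given: ASCII bytes, non-ASCII characters become character references.
ascii_default = etree.tostring(root)

# Explicit UTF-8: the character is written as its UTF-8 byte sequence.
utf8_bytes = etree.tostring(root, encoding="utf-8")
```

Both results are byte strings that describe the same document; only the byte-level representation of the long s differs.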
That has nothing to do with lxml. It's how repr() works for bytes objects. And it's actually quite predictable and deterministic.
Hmm, actually, it's the other way round: "do not decode your XML data yourself". That's a part of the work of an XML parser. Just pass it in unchanged, it'll always do the right thing, by specification.

XML is explicitly defined as a stream of bytes. It's not Unicode text. Take a look at the spec, it's all about byte sequences and how to map them to text and structure.

The error that lxml emits is explicitly there to keep people from doing the wrong thing. The ability to parse XML from Unicode text input is only there to simplify parsing fragments from code, not to do anything major with it. Specifically not to parse whole documents and (that's where the error comes from) specifically not whole documents that lie about their encoding.

Basically, what lxml and libxml2 have to do when they see Unicode text input is to

1) decode the underlying system specific byte buffer to Unicode
2) encode that to UTF-8
3) parse the result as XML.

That's a bit inefficient. Even worse, if the input comes from an actual file that is opened in text mode, the process becomes

1) Python reads bytes from the file
2) Python decodes the data to a unicode object
3) lxml and libxml2 decode the underlying buffer to Unicode
4) encode it to UTF-8
5) parse the result as XML.

For straight bytes input from a UTF-8 encoded file, it's more like

1) read bytes from the file
2) parse them as XML

This is very efficient, especially when you pass the input file by file name instead of letting Python open it for you (but the latter is ok, too).

Stefan
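
To make the error discussed above concrete, here is a minimal sketch with an invented document: the same content is rejected as decoded text (because it still carries an encoding declaration) and accepted as bytes.

```python
from lxml import etree

# Made-up document with an encoding declaration, held as a Python str.
document = '<?xml version="1.0" encoding="UTF-8"?>\n<TEI><p>text</p></TEI>'

try:
    etree.fromstring(document)  # decoded text that still declares an encoding
    rejected = False
except ValueError:
    rejected = True  # "Unicode strings with encoding declaration are not supported."

# The same document as bytes is parsed without complaint.
root = etree.fromstring(document.encode("utf-8"))
```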

I'm still not quite out of the woods, and perhaps I should stay in the woods of Python2.7

If I parse an XML file with lxml under Python2.7, iterate through all its elements and use the command print etree.tostring(element, encoding = 'utf-8'), the output of the print operation is identical with the input, as it should be.

If I do the same thing with Python3.3 and add no encoding information, the output of the print operation is not identical with the input. The lines are prefaced by b' and Unicode characters show up as character entities (&#383;).

If I now use the command print etree.tostring(element, encoding = 'utf-8'), the output is also not identical, but the utf-8 characters show up differently (the long 's' (ſ) shows up as '\xc5\xbf') and newline characters show up as \n.

What do I do to decode the bytes version so that I can parse a file with lxml and then print it out unchanged? That's a basic operation for me: before you introduce changes, you want to make sure that you can put a file through a series of operations without change.

The section on "serializing to Unicode strings" has a sentence that says "However, if you want to save the result to a file or pass it over the network, you should use write() or tostring() with a byte encoding (typically UTF-8) to serialize the XML." If that means something other than adding encoding = "utf-8", I don't know what it would be.

If I simply open an XML file in 'rb' mode and then cycle through its lines, giving a print command like print(line.decode('utf-8')), the output transforms the bytes input into ordinary utf-8. There must surely be something that does the same for the parsed XML tree, but what is it?

The scripts I write mix XML parsing with other ordinary string operations. This is unproblematic in 2.7, once you've learned about u'something' and encoding='utf-8'. It seems to be quite hard with Python3.3

Martin Mueller Professor of English and Classics Northwestern University On 2/17/13 12:55 AM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 19.02.2013 06:38:
I'm still not quite out of the woods, and perhaps I should stay in the woods of Python2.7
"woods" seems like the right word. Once you've seen the light, it's hard to go back there. :)
As I said, that's unrelated to Py2/3. If you leave out the encoding option in Py2, you'll get the same form of escaping there. The only thing that changes is the repr() of bytes strings in Py3, which prefixes them with the 'b' prefix. But you'll only see that when you *print* them. It doesn't change your data, it just makes it clearer what you have in your hands.
The output is escaped by repr(). You get the same thing in Py2 when you *explicitly* call repr() on the object.
What do I do to decode the bytes version so that I can parse a file with lxml and then print it out unchanged?
If all you want to do is to print it, pass encoding="unicode". That doesn't really make sense semantically, because "unicode" is not an encoding, but it makes lxml serialise the result to a Python unicode string, which prints properly in both Py2 and Py3.
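
A brief sketch of that option next to byte serialisation, again with a made-up element containing the long s:

```python
from lxml import etree

root = etree.fromstring("<w>wi\u017fdom</w>".encode("utf-8"))

# For human eyes: a Python text string, no b'' prefix and no escaping.
as_text = etree.tostring(root, encoding="unicode")

# For files and network connections: an encoded byte string.
as_bytes = etree.tostring(root, encoding="utf-8")
```

Passing ``as_text`` to print() shows the markup as written, provided the terminal can display the characters; ``as_bytes`` is what would be written to a file opened in binary mode.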
Different use cases. Printing data to make it readable to humans is something entirely different from serialising it for machine processing.
In Py2, you mean. That's a bug that was fixed in Py3. What happens is that the line first gets decoded (!) into Unicode behind your back, using a default encoding (which may change, meaning that the operation may fail arbitrarily) and only then it does what you asked it for, namely it encodes the result into a UTF-8 byte sequence. The implicit decoding is a step that leads to rather unpredictable and hard to understand behaviour in real life, especially because the codec it uses is system specific. Users can change it and thus break your code on their installation. These problems eventually led to the decision to separate byte strings from Unicode strings in Py3.
See my comment on the two use cases above.
The scripts I write mix XML parsing with other ordinary string operations.
Could you give an example? It's not all that common to do that.
This is unproblematic in 2.7
It's actually *very* problematic in Py2. You just didn't notice yet because you were lucky, and most likely also because you were not dealing with Unicode data outside of the ASCII range. Those who do were almost certainly bitten more than once by the infamous Py2 auto-decoding bug. It's also a bit problematic in Py3, but for a different reason. Byte strings no longer support the complete string API there, which means that some operations that used to work in Py2 are no longer available in Py3. Some of them because they simply don't make sense on byte strings, but another few were removed because they seemed inappropriate and hard to keep alive, above all '%' formatting of byte strings, which would still have made a nice feature. There are portable ways to do it differently (e.g. the struct module), but they are more complex.
once you've learned about u'something' and encoding='utf-8'. It seems to be quite hard with Python3.3
It's actually quite simple. Py3.3 has backwards compatibility support for the 'u' prefix (to reduce the syntax impact when porting code), so you can just leave your Unicode strings as they are. Alternatively, strip the 'u' prefix, it's completely redundant in Py3.

Now, you need to be aware when you are dealing with byte strings, but that's easy for XML because all serialised XML *is* byte strings. The encoding="unicode" feature above is the only exception, and a rather explicit one.

Basically it's like this:

1) encoded byte data gets parsed into a tree
2) XML trees are completely Unicode
3) XML trees get serialised into byte strings

So you're only dealing with two worlds: a nice and shiny Unicode world with parsed in-memory XML trees, and a somewhat less shiny bytes world before and after parsing.

All of this being said, it's worth using Py3, but you'll have to learn something new for it, something that users of Py2 often sneaked their way around. String processing in Py2 was something seemingly simple that could quickly get you into encoding hell for no apparent reason. String processing in Py3 is way safer and cleaner, but you have to get used to being aware of what you are dealing with: bytes or text. The upside is that you'll write better code when you're aware of that.

Stefan

Stefan Behnel, 19.02.2013 07:39:
Sorry, misread your "decode" as "encode", maybe because you said "into utf8". It doesn't transform it into utf8, it decodes it *from* utf8 into Unicode. Printing Unicode is not a problem, because it can just be encoded into whatever your terminal accepts, usually UTF-8 these days, on most systems, which means that the above code would first decode the data explicitly and then encode the result implicitly. So, there is an implicit encoding step involved in print(), because it needs to convert (Unicode) text into a byte sequence to pass it through the standard output stream of your Python interpreter into your terminal. Now, encoding Unicode text is a natural thing to do, but encoding byte data makes no sense, and your terminal may not be able to handle the bare data it contains. It could be a TCP/IP dump, for example, or a JPEG compressed image, or some CJK encoded text, or a serialised XML document. This was actually a real problem in Py2, where you could get an exception about a failed encoding or decoding step, and it would not be able to tell you what happened with what data, because the exception message simply wasn't printable. That's finally no longer a problem in Py3. Because of all these ambiguities between bytes and text in Py2, it was decided for Py3 to go the only safe and sane route which always works, i.e. to use an escaped ASCII-compatible representation when converting a byte string to a printable representation, prefixed with 'b' to make it clear what happened. Does this explanation help? Stefan
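
The escaped representation described above can be reproduced without lxml at all. This sketch uses only built-in types; the sample text is invented.

```python
# What a serialiser hands back: bytes. "\u017f" is the long s.
data = "wi\u017fdom\n".encode("utf-8")

# What Python 3 displays for a bytes object: an escaped, ASCII-only form
# with a b prefix. The data itself is unchanged.
shown = repr(data)

# Explicit decoding turns the bytes back into printable text.
text = data.decode("utf-8")
```

Here ``shown`` contains the characters ``\xc5\xbf`` and ``\n`` spelled out literally, which matches the output reported earlier in the thread, while ``text`` contains the long s and a real newline.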

Hi, I’m a bit late to this party but Martin, if you want to write a byte string to the standard output with Python 3, use sys.stdout.buffer.write(). This is what you want if e.g. you’re piping the output to a file. print() on Python 3 insists on getting Unicode strings, and sys.stdout has its own logic to try and detect the appropriate encoding for showing text on the terminal. -- Simon Sapin
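
A minimal sketch of that suggestion. The helper name ``write_bytes`` is hypothetical, and an in-memory buffer stands in for the real standard output so that the example does not depend on the terminal.

```python
import io
import sys


def write_bytes(data, stream=None):
    """Write raw bytes to a binary stream (hypothetical helper for illustration).

    With no stream given, the bytes go to the binary layer underneath sys.stdout.
    """
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


# Stand-in for stdout; in a script one would call write_bytes(serialised) instead.
demo = io.BytesIO()
serialised = "<w>wi\u017fdom</w>\n".encode("utf-8")
write_bytes(serialised, demo)
```

Because the bytes bypass print(), no decoding or re-encoding takes place, so the output is exactly what the serialiser produced.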

Tres Seaver, 17.02.2013 04:41:
http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
I've updated that FAQ section, I hope it's clearer now. Stefan

My apologies in advance for this ignorant question: Does the fact that you can download the latest version of lxml from pypi mean that you (or rather I) no longer have to worry about the versions of the libxml libraries that come with OS Lion? Or do I still have to go through the special Mac OS routines described on the lxml site? Martin Mueller Professor of English and Classics Northwestern University On 2/10/13 12:28 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 10.02.2013 21:47:
No, it has nothing to do with that. lxml has always been available from PyPI.
Or do I still have to go through the special Mac OS routines described on the lxml site?
I don't know, can't test it. It's certainly safest to enable STATIC_DEPS, the rest should work with pip these days. That being said, the installation instructions can always use some fresh air and a bit of a spring cleanup. Stefan

I have been trying to install the latest version of lxml on OS Lion. I've had varying success. It worked for Python2.7, but didn't work for Python3.3. I used the installation instructions given on the lxml website: STATIC_DEPS=true sudo pip install lxml I got an error message about a missing "llvm-gcc-4.2", but there was good advice on how to handle this at http://waqasshabbir.tumblr.com/post/19073648382/llvm-gcc-4-2-exe-error-on-m ac-osx-lion-when-building My goal was to install lxml in Python3.3, but the command with a specified target installed lxml in Python2.7. I learned how to use the target flag and how to install lxml in a particular version of Python with this command: STATIC_DEPS=true sudo pip install lxml --target /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packag es The lxml package shows up properly in the site-packages directory: the same files in the same order in the 2.7 and 3.3 installation. But when running a simple script that calls on lxml, the following error message appears: ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site -packages/lxml/etree.so, 2): Symbol not found: _PyBaseString_Type Referenced from: /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packag es/lxml/etree.so Expected in: flat namespace in /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packag es/lxml/etree.so In looking more closely at the two directories, I observe that each of them has a directory called "lxml-3.1.0-py2.7.egg-info." Is this right or should the pip command have imported a Python3 specific version of lxml? Stymied and grateful for help MM

Martin Mueller, 13.02.2013 19:19:
That's the wrong way round. You are asking a Py2.7 pip to install its package into the site-packages directory of you Py3.3 installation. Rather, you want to run pip with the Py3.3 "python" executable so that it can build a proper Py3.3 package for you. When you install pip under Py3.3, it usually creates a "pip-3.3" that you can just run (and there should be a "pip-2.7" that's identical to your current "pip"). Stefan

Actually, I did install pip with Python3.3 in the site-packages directory of Python3.3, and it shows up as pip-1.2.1-py3.3.egg. But there doesn't seem to be a way of calling it. 'pip' will do the 2.7 routine, and pip3.3 or pip-3.3 are not recognized as commands. I checked the http://www.pip-installer.org/en/latest/usage.html#pip-install for any information about using different versions of pip with different versions of Python. But there is no such information. I installed pip by running 'python3.3. setup.py install' in the downloaded pip directory. That worked in the sense that it put various pip files in the appropriate site packages directory Martin Mueller Professor of English and Classics Northwestern University On 2/13/13 1:07 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

In article <7D89C1F94A704245AF5254CFAE7AA1A8097EBB7C@evcspmbx4.ads.northwestern.edu
If you are using a framework build of Python on OS X, most, including Pythons from the python.org binary installers, are configured to install scripts into the framework bin directory. For example, after installing distribute, pip, and virtualenv for 3.3: $ cd /Library/Frameworks/Python.framework/Versions/3.3/bin $ ls 2to3@ pip-3.3* python3.3-32* pythonw3.3-32* 2to3-3.3* pydoc3@ python3.3-config@ pyvenv@ easy_install* pydoc3.3* python3.3m* pyvenv-3.3* easy_install-3.3* python3@ python3.3m-config* virtualenv* idle3@ python3-32@ pythonw3@ virtualenv-3.3* idle3.3* python3-config@ pythonw3-32@ pip* python3.3* pythonw3.3* The most convenient way to ensure that you pick up the right version is to ensure that that bin directory comes first on your shell PATH: export PATH=/Library/Frameworks/Python.framework/Versions/3.3/bin:$PATH The python.org installers provide a script in /Applications/Python x.y called "Update Shell Profile.command" to do that "permanently". The Python 2.x installers run their script by default automatically during installation. The 3.x installers do not. BTW, the Apple-supplied system Pythons do not follow this pattern. They install scripts into /usr/local/bin. -- Ned Deily, nad@acm.org

I am baffled by the behaviour of lxml 3.1.0 when you feed it a perfectly ordinary TEI file that begins with the standard xml declaration: <?xml version="1.0" encoding="UTF-8"?> It produces the error message "Unicode strings with encoding declaration are not supported." If you take out the line, things work. How does that make sense? The xml declaration is a standard part of an XML file. If you run xqueries with oXygen on the file, the processor does not complain. This seems to be an lxml rather than a Python feature: if you do something with the file that doesn't involve lxml, it doesn't complain about the declaration,and it handles Unicode characters just fine. Martin Mueller Professor of English and Classics Northwestern University On 2/13/13 1:07 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 16.02.2013 21:24:
It looks like you're feeding it Unicode data instead of a byte stream. If you're really reading it from a file, make sure you open it in binary mode, or just pass the filename in to lxml's parse() function. Stefan

Thank you. That works, but there are other ways in which lxml 3 behaves quite differently from 2, and in ways that don't seem to be documented in the tutorial. For instance, its default prefaces every line with a b', which I take it stands for binary, and it transforms utf-8 characters into character entities. Also, when I worked with 2.7 and a pre 3 version of lxml, printed output behaved in a predictable way. But in lxml 3 the etree.tostring output makes a mess of white space and prints newline characters in the form "\n" I work a lot with Python3.3 because I don't have to worry about UTF-8: the scripts handle text files as if they were ASCII, and I don't have to worry about the encoding. The same is true of Xquery. But lxml seems to be very fussy about character issues, and oddly enough it seems even fussier in Python 3 than in Python 2. Is there a set of simple rules about what to do or not to do when you routinely work with Utf-8 and multilingual documents? Sorry to be so difficult about this. I thought lxml was going to be easier than xquery. And the ability to combine XML parsing with other operations in one program is certainly an advantage. But the management of character problems seems to be a real headache, and it's a lot harder than ordinary operations in Python. Martin Mueller Professor of English and Classics Northwestern University On 2/16/13 2:28 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 02/16/2013 04:31 PM, Martin Mueller wrote:
The simplest rule is: always encode your data (preferably to UTF-8) before passing it into the lxml parsing mechanism.
See the FAQ: http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlEgUXUACgkQ+gerLs4ltQ55ZgCg1TNoDKo8AUcuQiLV6S6TjQLR 56UAoMiLKSzAp098bCrdJLHTHgEFPTJw =xEoQ -----END PGP SIGNATURE-----

Tres Seaver, 17.02.2013 04:41:
"bytes" actually, which is a builtin type in Python.
and it transforms utf-8 characters into character entities.
There is no such thing as "utf-8 characters". "UTF-8" is an encoding, i.e. a mapping from Unicode text (i.e. characters) to a byte sequence (i.e. bytes). And there's no difference in lxml between Py2 and Py3 in its behaviour regarding character escaping. By default, it does it in both, because the default encoding is ASCII with character references for characters that ASCII cannot represent. If you want plain, unescaped, UTF-8 encoded output, pass "utf-8" into the "encoding" option.
That has nothing to do with lxml. It's how repr() works for bytes objects. And it's actually quite predictable and deterministic.
Hmm, actually, it's the other way round: "do not decode your XML data yourself". That's a part of the work of an XML parser. Just pass it in unchanged, it'll always do the right thing, by specification. XML is explicitly defined as a stream of bytes. It's not Unicode text. Take a look at the spec, it's all about byte sequences and how to map them to text and structure. The error that lxml emits is explicitly there to keep people from doing the wrong thing. The ability to parse XML from Unicode text input is only there to simplify parsing fragments from code, not to do anything major with it. Specifically not to parse whole documents and (that's where the error comes from) specifically not whole documents that lie about their encoding. Basically, what lxml and libxml2 have to do when they see Unicode text input is to 1) decode the underlying system specific byte buffer to Unicode 2) encode that to UTF-8 3) parse the result as XML. That's a bit inefficient. Even worse, if the input comes from an actual file that is opened in text mode, the process becomes 1) Python reads bytes from the file 2) Python decodes the data to a unicode object 3) lxml and libxml2 decode the underlying buffer to Unicode 4) encode it to UTF-8 5) parse the result as XML. For straight bytes input from a UTF-8 encoded file, it's more like 1) read bytes from the file 2) parse them as XML This is very efficient, especially when you pass the input file by file name instead of letting Python open it for you (but the latter is ok, too). Stefan

I'm still not quite out of the woods, and perhaps I should stay in the woods of Python2.7

If I parse an XML file with lxml under Python2.7, iterate through all its elements and use the command print etree.tostring(element, encoding = 'utf-8'), the output of the print operation is identical with the input, as it should be.

If I do the same thing with Python3.3 and add no encoding information, the output of the print operation is not identical with the input. The lines are prefaced by b' and Unicode characters show up as character entities (&#383;).

If I now use the command print etree.tostring(element, encoding = 'utf-8'), the output is also not identical, but the utf-8 characters show up differently: the long 's' (ſ) shows up as '\xc5\xbf' and new line characters show up as \n.

What do I do to decode the bytes version so that I can parse a file with lxml and then print it out unchanged? That's a basic operation for me: before you introduce changes, you want to make sure that you can put a file through a series of operations without change.

The section on "serializing to Unicode strings" has a sentence that says "However, if you want to save the result to a file or pass it over the network, you should use write() or tostring() with a byte encoding (typically UTF-8) to serialize the XML." If that means something other than adding encoding = "utf-8", I don't know what it would be.

If I simply open an XML file in 'rb' mode and then cycle through its lines, giving a print command like print(line.decode('utf-8')), the output transforms the bytes input into ordinary utf-8. There must surely be something that does the same for the parsed XML tree, but what is it?

The scripts I write mix XML parsing with other ordinary string operations. This is unproblematic in 2.7, once you've learned about u'something' and encoding='utf-8'. It seems to be quite hard with Python3.3

Martin Mueller
Professor of English and Classics
Northwestern University

On 2/17/13 12:55 AM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller, 19.02.2013 06:38:
I'm still not quite out of the woods, and perhaps I should stay in the woods of Python2.7
"woods" seems like the right word. Once you've seen the light, it's hard to go back there. :)
As I said, that's unrelated to Py2/3. If you leave out the encoding option in Py2, you'll get the same form of escaping there. The only thing that changes is the repr() of byte strings in Py3, which prefixes them with 'b'. But you'll only see that when you *print* them. It doesn't change your data, it just makes it clearer what you have in your hands.
The output is escaped by repr(). You get the same thing in Py2 when you *explicitly* call repr() on the object.
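A small sketch of that point, using only the standard library. The sample data (a long s followed by a newline) is chosen to match the characters mentioned in the question.

```python
data = "\u017f\n".encode("utf-8")  # long s plus a newline, as UTF-8 bytes

# In Py3, printing a bytes object shows its repr(): escaped, with a 'b' prefix.
shown = repr(data)
print(shown)  # b'\xc5\xbf\n'

# The data itself is unchanged: three bytes that decode back to the text.
assert len(data) == 3
assert data.decode("utf-8") == "\u017f\n"
```

The escapes exist only in the printed representation, not in the bytes.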
What do I do to decode the bytes version so that I can parse a file with lxml and then print it out unchanged?
If all you want to do is to print it, pass encoding="unicode". That doesn't really make sense semantically, because "unicode" is not an encoding, but it makes lxml serialise the result to a Python unicode string, which prints properly in both Py2 and Py3.
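As a rough illustration, assuming lxml is importable (with the standard library's ElementTree as a fallback, since it accepts the same option); the sample element is made up:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree accepts encoding="unicode" the same way.
    import xml.etree.ElementTree as etree

el = etree.fromstring("<w>Wa\u017f\u017fer</w>".encode("utf-8"))

# encoding="unicode" returns text (str in Py3), not a byte string.
text = etree.tostring(el, encoding="unicode")
assert isinstance(text, str)

# print() encodes the text for the terminal, so no b'' prefix and no escapes.
print(text)
```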
Different use cases. Printing data to make it readable to humans is something entirely different from serialising it for machine processing.
In Py2, you mean. That's a bug that was fixed in Py3. What happens is that the line first gets decoded (!) into Unicode behind your back, using a default encoding (which may change, meaning that the operation may fail arbitrarily) and only then it does what you asked it for, namely it encodes the result into a UTF-8 byte sequence. The implicit decoding is a step that leads to rather unpredictable and hard to understand behaviour in real life, especially because the codec it uses is system specific. Users can change it and thus break your code on their installation. These problems eventually led to the decision to separate byte strings from Unicode strings in Py3.
See my comment on the two use cases above.
The scripts I write mix XML parsing with other ordinary string operations.
Could you give an example? It's not all that common to do that.
This is unproblematic in 2.7
It's actually *very* problematic in Py2. You just didn't notice yet because you were lucky, and most likely also because you were not dealing with Unicode data outside of the ASCII range. Those who do were almost certainly bitten more than once by the infamous Py2 auto-decoding bug. It's also a bit problematic in Py3, but for a different reason. Byte strings no longer support the complete string API there, which means that some operations that used to work in Py2 are no longer available in Py3. Some of them because they simply don't make sense on byte strings, but another few were removed because they seemed inappropriate and hard to keep alive, above all '%' formatting of byte strings, which would still have made a nice feature. There are portable ways to do it differently (e.g. the struct module), but they are more complex.
once you've learned about u'something' and encoding='utf-8'. It seems to be quite hard with Python3.3
It's actually quite simple. Py3.3 has backwards compatibility support for the 'u' prefix (to reduce the syntax impact when porting code), so you can just leave your Unicode strings as they are. Alternatively, strip the 'u' prefix, it's completely redundant in Py3.

Now, you need to be aware when you are dealing with byte strings, but that's easy for XML because all serialised XML *is* byte strings. The encoding="unicode" feature above is the only exception, and a rather explicit one. Basically it's like this:

1) encoded byte data gets parsed into a tree
2) XML trees are completely Unicode
3) XML trees get serialised into byte strings

So you're only dealing with two worlds: a nice and shiny Unicode world with parsed in-memory XML trees, and a somewhat less shiny bytes world before and after parsing.

All of this being said, it's worth using Py3, but you'll have to learn something new for it, something that users of Py2 often sneaked their way around. String processing in Py2 was something seemingly simple that could quickly get you into encoding hell for no apparent reason. String processing in Py3 is way safer and cleaner, but you have to get used to being aware of what you are dealing with: bytes or text. The upside is that you'll write better code when you're aware of that.

Stefan
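The three steps above might look like this in a script that mixes parsing with ordinary string operations. This is only a sketch: the element name and the replacement are made up, and the import falls back to the standard library's ElementTree if lxml is missing.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree behaves the same for these calls.
    import xml.etree.ElementTree as etree

# Bytes world: encoded data as it would come from a file or the network.
raw_in = "<line>Wa\u017f\u017fer</line>".encode("utf-8")

# 1) encoded byte data gets parsed into a tree
root = etree.fromstring(raw_in)

# 2) inside the tree everything is Unicode text, so str methods just work
assert isinstance(root.text, str)
root.text = root.text.replace("\u017f", "s")

# The 'u' prefix is accepted in Py3.3+ and changes nothing.
assert root.text == u"Wasser"

# 3) the tree gets serialised back into a byte string
raw_out = etree.tostring(root, encoding="utf-8")
assert isinstance(raw_out, bytes)
```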

Stefan Behnel, 19.02.2013 07:39:
Sorry, misread your "decode" as "encode", maybe because you said "into utf8". It doesn't transform it into utf8, it decodes it *from* utf8 into Unicode.

Printing Unicode is not a problem, because it can just be encoded into whatever your terminal accepts, usually UTF-8 these days, on most systems, which means that the above code would first decode the data explicitly and then encode the result implicitly. So, there is an implicit encoding step involved in print(), because it needs to convert (Unicode) text into a byte sequence to pass it through the standard output stream of your Python interpreter into your terminal.

Now, encoding Unicode text is a natural thing to do, but encoding byte data makes no sense, and your terminal may not be able to handle the bare data it contains. It could be a TCP/IP dump, for example, or a JPEG compressed image, or some CJK encoded text, or a serialised XML document.

This was actually a real problem in Py2, where you could get an exception about a failed encoding or decoding step, and it would not be able to tell you what happened with what data, because the exception message simply wasn't printable. That's finally no longer a problem in Py3.

Because of all these ambiguities between bytes and text in Py2, it was decided for Py3 to go the only safe and sane route which always works, i.e. to use an escaped ASCII-compatible representation when converting a byte string to a printable representation, prefixed with 'b' to make it clear what happened.

Does this explanation help?

Stefan
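To make the explicit decode and the implicit encode visible, here is a sketch that uses an in-memory text stream as a stand-in for the terminal. The UTF-8 stream encoding and the sample line are assumptions for the sake of the example; a real sys.stdout picks its own encoding.

```python
import io

# A line as read from a file opened in 'rb' mode (made-up sample data).
raw_line = "Die Wa\u017f\u017fer\n".encode("utf-8")

# Explicit step: decode the bytes *from* UTF-8 into Unicode text.
text = raw_line.decode("utf-8")
assert isinstance(text, str)

# Stand-in for the terminal: a text stream that encodes to UTF-8 on write.
sink = io.BytesIO()
stream = io.TextIOWrapper(sink, encoding="utf-8", newline="")

# Implicit step: print() hands text to the stream, which encodes it to bytes.
print(text, end="", file=stream)
stream.flush()

# The bytes that reach the "terminal" are the ones we started with.
assert sink.getvalue() == raw_line
```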

Hi, I’m a bit late to this party but Martin, if you want to write a byte string to the standard output with Python 3, use sys.stdout.buffer.write(). This is what you want if, e.g., you’re piping the output to a file. print() on Python 3 insists on getting Unicode strings, and sys.stdout has its own logic to try and detect the appropriate encoding for showing text on the terminal. -- Simon Sapin
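A possible way to wrap that suggestion, as a sketch only: `write_xml` is a hypothetical helper name, not part of lxml, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
import io
import sys

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree behaves the same for these calls.
    import xml.etree.ElementTree as etree


def write_xml(root, out=None):
    """Write root as UTF-8 bytes to a binary stream (hypothetical helper)."""
    if out is None:
        out = sys.stdout.buffer  # the byte layer underneath sys.stdout
    out.write(etree.tostring(root, encoding="utf-8"))
    out.write(b"\n")


root = etree.fromstring(b"<w>ok</w>")

# Demonstrated with an in-memory binary stream instead of the real stdout.
capture = io.BytesIO()
write_xml(root, capture)
```

Calling `write_xml(root)` without a second argument sends the same bytes to the standard output, bypassing print() and its text encoding.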

Tres Seaver, 17.02.2013 04:41:
http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
I've updated that FAQ section, I hope it's clearer now. Stefan