[lxml-dev] tounicode(), again

Hi all,

we had a lengthy discussion yesterday and I guess we found a few use cases where tounicode() makes sense and a few counter-arguments why it might not be a good idea to expose that API at a similarly visible place as tostring().

I'm still convinced that it's a good idea to have that API, but as one of the arguments was that "people who don't understand unicode" (PeWDUUs) would be more likely to write broken code, I added this paragraph to api.txt, in the section that describes the unicode support of lxml.

"""
Note that the unicode strings returned by ``tounicode()`` never have an XML declaration and therefore do not specify an encoding. This makes it possible to pass them back into the lxml parsers. However, you may have to add a declaration yourself if you want to serialize such a unicode string to a byte stream later. In contrast, the ``tostring()`` function automatically adds a declaration as needed that reflects the encoding of the returned byte string.
"""

I hope that makes it clear enough for PeWDUUs what the advantage of using tostring() over tounicode() is and that you have to take care what you do with unicode strings.

So, I propose leaving the API (and implementation) just as it is now.

Regards,
Stefan
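For readers who want to see the difference described in the api.txt paragraph, here is a minimal sketch. It is an illustration only: it uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made so the snippet runs without lxml installed), where encoding="unicode" plays the role of tounicode(). The sample document and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with one non-ASCII character.
root = ET.fromstring("<doc><title>caf\u00e9</title></doc>")

# Text serialization: a Python string with no XML declaration,
# so it makes no claim about any byte encoding.
as_text = ET.tostring(root, encoding="unicode")

# Byte serialization in a non-default encoding: the serializer adds
# a declaration that names the encoding of the returned bytes.
as_bytes = ET.tostring(root, encoding="iso-8859-1")

# The undeclared text can be handed straight back to the parser.
reparsed = ET.fromstring(as_text)

# Turning the text into a Latin-1 byte stream later means writing
# the declaration by hand before encoding.
declared = "<?xml version='1.0' encoding='iso-8859-1'?>\n" + as_text
manual_bytes = declared.encode("iso-8859-1")
```

The last step is the one that is easy to get wrong: encoding the text to Latin-1 without prepending a matching declaration would produce bytes that a parser is free to read as UTF-8.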

Stefan Behnel wrote:
Hi all,
we had a lengthy discussion yesterday and I guess we found a few use cases where tounicode() makes sense and a few counter-arguments why it might not be a good idea to expose that API at a similarly visible place as tostring().
I'm still convinced that it's a good idea to have that API, but as one of the arguments was that "people who don't understand unicode" (PeWDUUs) would be more likely to write broken code, I added this paragraph to api.txt, in the section that describes the unicode support of lxml.
""" Note that the unicode strings returned by ``tounicode()`` never have an XML declaration and therefore do not specify an encoding. This makes it possible to pass them back into the lxml parsers. However, you may have to add a declaration yourself if you want to serialize such a unicode string to a byte stream later. In contrast, the ``tostring()`` function automatically adds a declaration as needed that reflects the encoding of the returned byte string. """
I hope that makes it clear enough for PeWDUUs what the advantage of using tostring() over tounicode() is and that you have to take care what you do with unicode strings.
Maybe we want to alter this to something like this:

"""
Normally you use tostring() with an encoding argument (typically UTF-8) to create XML, which is typically a stream of bytes. You can then safely save it to a file, pass it over the network, etc. If you're not sure about the way to go, use tostring(). Using tostring() with UTF-8 is also typically faster.

In some exceptional use cases it might be useful to obtain XML in a Python unicode string, in which case you can use tounicode(). Only use this if you are confident in your understanding of Python unicode and that your application needs serialized XML in a Python unicode string.
"""

this way we relate it to use cases, and make clear that tostring() is the way to go for most people. This way people who do not understand what's up with unicode still get a clear hint that they're not supposed to use tounicode(), and that it's even faster not to do so. :)

Regards,
Martijn
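The "normal" path recommended in the proposed wording could look roughly like the sketch below. As an assumption for the sake of a dependency-free example, it again uses the standard library's xml.etree.ElementTree instead of lxml, and an in-memory io.BytesIO buffer stands in for a file or network connection; the element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Build a small hypothetical tree containing a non-ASCII character.
root = ET.Element("doc")
ET.SubElement(root, "title").text = "caf\u00e9"

# Serialize to UTF-8 bytes; bytes can be written to a binary file
# or a socket without any further encoding step.
payload = ET.tostring(root, encoding="utf-8")

# BytesIO stands in for a file opened in binary mode.
buffer = io.BytesIO()
buffer.write(payload)

# Read the bytes back and parse them; the parser decodes them itself.
buffer.seek(0)
roundtrip = ET.parse(buffer).getroot()
```

No Python unicode string of serialized XML appears anywhere in this round trip, which is the point of recommending tostring() as the default.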
participants (2): Martijn Faassen, Stefan Behnel