[Note: I haven't looked thoroughly at our handling yet, hence I raise the
question.]
This got posted on the Unicode list. Does it seem interesting for Python
itself? The UTF-8 to UTF-16 transcoding might be:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
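The page describes a branchless, table-driven DFA decoder: every input byte is classified and a small transition table drives the state machine. As a rough illustration of that per-byte state-machine idea, here is a simplified Python sketch. The classes, states, and error checks below are my own simplification, NOT Hoehrmann's compressed tables, and it omits some overlong/surrogate rejections a production decoder needs.

```python
def classify(byte):
    """Map a byte to a character class (simplified)."""
    if byte < 0x80: return 'ascii'
    if byte < 0xC0: return 'cont'     # continuation byte 10xxxxxx
    if byte < 0xC2: return 'invalid'  # overlong lead bytes C0/C1
    if byte < 0xE0: return 'lead2'
    if byte < 0xF0: return 'lead3'
    if byte < 0xF5: return 'lead4'
    return 'invalid'

def decode_utf8(data):
    """Decode bytes to a list of code points with a per-byte state machine."""
    out, codepoint, need = [], 0, 0
    for b in data:
        cls = classify(b)
        if need == 0:
            if cls == 'ascii':
                out.append(b)
            elif cls == 'lead2':
                codepoint, need = b & 0x1F, 1
            elif cls == 'lead3':
                codepoint, need = b & 0x0F, 2
            elif cls == 'lead4':
                codepoint, need = b & 0x07, 3
            else:
                raise ValueError("invalid lead byte")
        else:
            if cls != 'cont':
                raise ValueError("expected continuation byte")
            # accumulate six payload bits per continuation byte
            codepoint = (codepoint << 6) | (b & 0x3F)
            need -= 1
            if need == 0:
                out.append(codepoint)
    if need:
        raise ValueError("truncated sequence")
    return out
```

Hoehrmann's real decoder collapses all of the branching above into two table lookups per byte, which is what makes it fast on non-ASCII input.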
--
Jeroen Ruigrok van der Werven
> This got posted on the Unicode list, does it seem interesting for Python
> itself, the UTF-8 to UTF-16 transcoding might be?
If you have some time on your hands, you could try benchmarking it against
Python 3.1's (py3k) decoder. There are two cases to consider:
- mostly non-ASCII input, such as the "utf-8 demo" file mentioned in the page above
- mostly ASCII input, such as will happen very often (think HTML, XML, log files, etc.)
The py3k utf-8 decoder is optimized for the latter.

Regards

Antoine.
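The comparison Antoine suggests can be run against CPython's own C decoder simply by timing `bytes.decode("utf-8")`. A minimal sketch, with synthetic stand-ins for the two kinds of input (the actual test files from the page would be better):

```python
import timeit

# ~1 MB of pure ASCII, shaped like markup
ascii_heavy = b"<p>hello world</p>\n" * 50_000

# ~1 MB of Devanagari text once encoded (3 bytes per character)
non_ascii = ("\u0939\u093f\u0928\u094d\u0926\u0940 " * 60_000).encode("utf-8")

for name, data in [("ascii", ascii_heavy), ("non-ascii", non_ascii)]:
    t = timeit.timeit(lambda: data.decode("utf-8"), number=100)
    print(f"{name:10s} {len(data)/1e6:.1f} MB  {t:.3f}s for 100 decodes")
```

On a decoder with an ASCII fast path, the first case should come out markedly faster per byte than the second.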
-On [20090414 16:43], Antoine Pitrou (solipsis@pitrou.net) wrote:
> If you have some time on your hands, you could try benchmarking it against
> Python 3.1's (py3k) decoder. There are two cases to consider:
Bjoern actually did it himself already:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance
(results are Large, Medium, Tiny)
PyUnicode_DecodeUTF8Stateful (3.1a2), Visual C++ 7.1 -Ox -Ot -G7
4523ms 5686ms 3138ms
Manually inlined transcoder (see above), Visual C++ 7.1 -Ox -Ot -G7
4277ms 4998ms 4640ms
So on medium and large datasets Bjoern's decoder is very interesting, but in
the tiny case (just Bjoern's name) it is quite a bit slower. The other cases
seem more typical of what the average use in Python would be.
--
Jeroen Ruigrok van der Werven
> So on medium and large datasets Bjoern's decoder is very interesting, but in
> the tiny case (just Bjoern's name) it is quite a bit slower. The other cases
> seem more typical of what the average use in Python would be.
Keep in mind what the datasets are:

« The large buffer is a April 2009 Hindi Wikipedia article XML dump, the
medium buffer Markus Kuhn's UTF-8-demo.txt, and the tiny buffer my name »

It would be interesting to test with mostly ASCII data to see what that gives.
Now the good thing is that, even with wildly non-ASCII data, our current
decoder is very efficient.

Regards

Antoine.
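The "mostly ASCII" test Antoine asks for would use data like real-world markup: overwhelmingly ASCII with occasional non-ASCII characters, which is the shape the py3k fast path targets. A minimal sketch with made-up sample data:

```python
import timeit

# ASCII markup with an occasional multibyte character (é) sprinkled in;
# this sample line is invented for illustration
line = b'<entry id="42">commit log message, caf\xc3\xa9 included</entry>\n'
mostly_ascii = line * 20_000  # ~1 MB, well over 90% ASCII bytes

t = timeit.timeit(lambda: mostly_ascii.decode("utf-8"), number=100)
print(f"{t:.3f}s for 100 decodes of {len(mostly_ascii)/1e6:.1f} MB")
```

Comparing this against the Hindi Wikipedia dump would show how much each decoder's performance depends on the ASCII ratio of the input.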
participants (2)
- Antoine Pitrou
- Jeroen Ruigrok van der Werven