[Note: I haven't looked thoroughly at our handling yet, hence I raise the
question.]
This got posted on the Unicode list. Does it seem interesting for Python
itself? The UTF-8 to UTF-16 transcoding might be:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
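The page describes a branchless, table-driven DFA decoder: every input byte is classified and a small transition table drives the state machine. As a rough illustration of that per-byte state-machine idea, here is a simplified Python sketch. The classes, states, and error checks below are my own simplification, NOT Hoehrmann's compressed tables, and it omits some overlong/surrogate rejections a production decoder needs.

```python
def classify(byte):
    """Map a byte to a character class (simplified)."""
    if byte < 0x80: return 'ascii'
    if byte < 0xC0: return 'cont'     # continuation byte 10xxxxxx
    if byte < 0xC2: return 'invalid'  # overlong lead bytes C0/C1
    if byte < 0xE0: return 'lead2'
    if byte < 0xF0: return 'lead3'
    if byte < 0xF5: return 'lead4'
    return 'invalid'

def decode_utf8(data):
    """Decode bytes to a list of code points with a per-byte state machine."""
    out, codepoint, need = [], 0, 0
    for b in data:
        cls = classify(b)
        if need == 0:
            if cls == 'ascii':
                out.append(b)
            elif cls == 'lead2':
                codepoint, need = b & 0x1F, 1
            elif cls == 'lead3':
                codepoint, need = b & 0x0F, 2
            elif cls == 'lead4':
                codepoint, need = b & 0x07, 3
            else:
                raise ValueError("invalid lead byte")
        else:
            if cls != 'cont':
                raise ValueError("expected continuation byte")
            # accumulate six payload bits per continuation byte
            codepoint = (codepoint << 6) | (b & 0x3F)
            need -= 1
            if need == 0:
                out.append(codepoint)
    if need:
        raise ValueError("truncated sequence")
    return out
```

Hoehrmann's real decoder collapses all of the branching above into two table lookups per byte, which is what makes it fast on non-ASCII input.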
--
Jeroen Ruigrok van der Werven
> This got posted on the Unicode list, does it seem interesting for Python
> itself, the UTF-8 to UTF-16 transcoding might be?
If you have some time on your hands, you could try benchmarking it against
Python 3.1's (py3k) decoder. There are two cases to consider:
- mostly non-ASCII input, such as the "utf-8 demo" file mentioned in the page above
- mostly ASCII input, such as will happen very often (think HTML, XML, log files, etc.)
The py3k utf-8 decoder is optimized for the latter.

Regards

Antoine.
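The comparison Antoine suggests can be run against CPython's own C decoder simply by timing `bytes.decode("utf-8")`. A minimal sketch, with synthetic stand-ins for the two kinds of input (the actual test files from the page would be better):

```python
import timeit

# ~1 MB of pure ASCII, shaped like markup
ascii_heavy = b"<p>hello world</p>\n" * 50_000

# ~1 MB of Devanagari text once encoded (3 bytes per character)
non_ascii = ("\u0939\u093f\u0928\u094d\u0926\u0940 " * 60_000).encode("utf-8")

for name, data in [("ascii", ascii_heavy), ("non-ascii", non_ascii)]:
    t = timeit.timeit(lambda: data.decode("utf-8"), number=100)
    print(f"{name:10s} {len(data)/1e6:.1f} MB  {t:.3f}s for 100 decodes")
```

On a decoder with an ASCII fast path, the first case should come out markedly faster per byte than the second.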
-On [20090414 16:43], Antoine Pitrou (solipsis@pitrou.net) wrote:
> If you have some time on your hands, you could try benchmarking it against
> Python 3.1's (py3k) decoder. There are two cases to consider:
Bjoern actually did it himself already:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance
(results are Large, Medium, Tiny)
PyUnicode_DecodeUTF8Stateful (3.1a2), Visual C++ 7.1 -Ox -Ot -G7
4523ms 5686ms 3138ms
Manually inlined transcoder (see above), Visual C++ 7.1 -Ox -Ot -G7
4277ms 4998ms 4640ms
So on medium and large datasets Bjoern's decoder is very interesting, but in
the tiny case (just Bjoern's name) it is quite a bit slower. The other cases
seem more typical of what the average use in Python would be.
--
Jeroen Ruigrok van der Werven
> So on medium and large datasets Bjoern's decoder is very interesting, but in
> the tiny case (just Bjoern's name) it is quite a bit slower. The other cases
> seem more typical of what the average use in Python would be.
Keep in mind what the datasets are:

« The large buffer is a April 2009 Hindi Wikipedia article XML dump, the
medium buffer Markus Kuhn's UTF-8-demo.txt, and the tiny buffer my name »

It would be interesting to test with mostly ASCII data to see what that gives.
Now the good thing is that, even with wildly non-ASCII data, our current
decoder is very efficient.

Regards

Antoine.
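The "mostly ASCII" test Antoine asks for would use data like real-world markup: overwhelmingly ASCII with occasional non-ASCII characters, which is the shape the py3k fast path targets. A minimal sketch with made-up sample data:

```python
import timeit

# ASCII markup with an occasional multibyte character (é) sprinkled in;
# this sample line is invented for illustration
line = b'<entry id="42">commit log message, caf\xc3\xa9 included</entry>\n'
mostly_ascii = line * 20_000  # ~1 MB, well over 90% ASCII bytes

t = timeit.timeit(lambda: mostly_ascii.decode("utf-8"), number=100)
print(f"{t:.3f}s for 100 decodes of {len(mostly_ascii)/1e6:.1f} MB")
```

Comparing this against the Hindi Wikipedia dump would show how much each decoder's performance depends on the ASCII ratio of the input.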
participants (2)
- Antoine Pitrou
- Jeroen Ruigrok van der Werven