[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 11:45:33 CEST 2005

Keir Mierle wrote:

> Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg
> Wilson this summer.
> 
> We're having a very strange problem with Python's unicode parsing of source
> files. Basically, our CGI script was running extremely slowly on our production
> box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCSI drives). Slow to the tune
> of 6-10 seconds per request. I eventually tracked this down to imports of our
> source tree; the actual request was completing in 300ms, the rest of the time
> was spent in __import__.

This is caused by the changes to the codecs in 2.4. Basically the codecs 
no longer rely on C's readline() to do line splitting (which can't work 
for UTF-16), but do it themselves (via unicode.splitlines()).
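
To illustrate why splitting at the byte level can't work for UTF-16, here is
a minimal sketch. It is written in Python 3 spelling rather than the 2.4
code under discussion, and the sample text is made up for the example. The
character U+0A0A encodes to two 0x0A bytes in UTF-16, so a byte-oriented
split sees "newlines" that are not there, while a codec stream reader splits
only after decoding.

```python
import codecs
import io

# Hypothetical sample: U+0A0A is an ordinary letter, but its UTF-16
# encoding consists of two 0x0A bytes, the same byte value as "\n".
text = "first \u0a0a line\nsecond line\n"
data = text.encode("utf-16-le")

# Byte-level splitting (roughly what a C-level readline() would do)
# cuts inside the encoded character and inside the encoded "\n".
byte_chunks = data.split(b"\n")

# A codec stream reader decodes first and then splits the decoded text.
reader = codecs.getreader("utf-16-le")(io.BytesIO(data))
decoded_lines = reader.readlines()

print(len(byte_chunks))   # more chunks than there are real lines
print(decoded_lines)
```

The byte-level split produces five fragments for two real lines; the
decoding reader returns exactly the two lines.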

> After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
> getting called 51 million times. Our code is 1.2 million characters, so I
> hardly think it makes sense to call IsLinebreak 50 times for each character;
> and we're not even importing our entire source tree on every invocation.

But if you're using CGI, you're importing your source on every 
invocation. Switching to a different server-side technology might help. 
Nevertheless, 50 million calls seems to be a bit much.

> Our code is a fork of Trac, and originally had these lines at the top:
> 
> # -*- coding: iso8859-1 -*-  
> 
> This made me suspicious, so I removed all of them. The CGI execution time
> immediately dropped to ~1 second. gprof revealed that
> _PyUnicodeUCS2_IsLinebreak is not called at all anymore.
> 
> Now that our code works fast enough, I don't really care about this, but I
> thought python-dev might want to know something weird is going on with unicode
> splitlines.

I wonder if we should switch back to a simple readline() implementation 
for those codecs that don't require the current implementation 
(basically every charmap codec). AFAIK source files are opened in 
universal newline mode, so at least we'd get proper treatment of "\n", 
"\r" and "\r\n" line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", 
u"\x85", u"\u2028" and u"\u2029" (which are line terminators according 
to unicode.splitlines()).
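
The difference can be sketched as follows. This uses Python 3 spelling
(str.splitlines() and io.TextIOWrapper instead of the 2.4 unicode type and
file objects), and the sample string is an assumption made up for the
example, so treat it as an illustration rather than the actual codec code.

```python
import io

# The extra terminators listed above, written as Python 3 str literals.
EXTRA_TERMINATORS = ["\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]

# Hypothetical sample mixing "\x85" with the three classic line ends.
text = "a\x85b\rc\r\nd\n"

# Universal newline mode only recognizes "\n", "\r" and "\r\n"
# (and translates them all to "\n").
stream = io.TextIOWrapper(
    io.BytesIO(text.encode("utf-8")), encoding="utf-8", newline=None
)
universal_lines = stream.readlines()

# splitlines() additionally breaks at "\x85".
split_lines = text.splitlines()

# Each of the extra terminators splits "x<ch>y" into two pieces.
extra_results = {ch: ("x" + ch + "y").splitlines() for ch in EXTRA_TERMINATORS}

print(universal_lines)
print(split_lines)
```

With universal newlines the "\x85" stays inside the first line, whereas
splitlines() treats it as a line boundary and yields four lines.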

> I documented my investigation of this problem; if anyone wants further details,
> just email me. (I'm not on python-dev)
> http://www.third-bit.com/trac/argon/ticket/525

Bye,
    Walter Dörwald