The Cost of Dynamism (was Re: Python 2.x or 3.x, which is faster?)
BartC
bc at freeuk.com
Mon Mar 21 13:12:45 EDT 2016
On 21/03/2016 12:59, Chris Angelico wrote:
> On Mon, Mar 21, 2016 at 11:34 PM, BartC <bc at freeuk.com> wrote:
>> For Python I would have used a table of 0..255 functions, indexed by the
>> ord() code of each character. So all 52 letter codes map to the same
>> name-handling function. (No Dict is needed at this point.)
>>
>
> Once again, you forget that there are not 256 characters - there are
> 1114112. (Give or take.)
The original code for this test expected the data to be a series of
bytes, mostly ASCII. Any non-ASCII text in the input was expected to be
UTF-8-encoded.
Since this was designed to tokenise C: as far as I know, C doesn't
support Unicode except within comments and within string literals. For
those purposes, it is not necessary to do anything with UTF-8 multi-byte
sequences except ignore them or pass them through unchanged. (I'm
ignoring 'wide' string and char literals.)
But it doesn't make any difference: you process a byte at a time, and
trap codes C0 to FF, which mark the start of a UTF-8 multi-byte sequence.
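Roughly, a minimal sketch of that kind of byte dispatch in Python might
look like this (the handler names - read_name, read_number, skip_other,
pass_utf8 - and the tokenise() driver are just illustrative, not taken
from my actual code):

# 256-entry function table, indexed by the byte value.
NAME_CHARS = frozenset(b'abcdefghijklmnopqrstuvwxyz'
                       b'ABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789')
DIGITS = frozenset(b'0123456789')

def read_name(data, i):
    # All 52 letter codes (plus '_') land here; consume an identifier.
    start = i
    while i < len(data) and data[i] in NAME_CHARS:
        i += 1
    return ('name', data[start:i]), i

def read_number(data, i):
    start = i
    while i < len(data) and data[i] in DIGITS:
        i += 1
    return ('number', data[start:i]), i

def skip_other(data, i):
    # Whitespace, punctuation etc. - heavily simplified here.
    return None, i + 1

def pass_utf8(data, i):
    # Bytes C0..FF start a UTF-8 multi-byte sequence; just step over
    # them (or copy them through unchanged).
    return None, i + 1

dispatch = [skip_other] * 256
for c in b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    dispatch[c] = read_name
for c in b'0123456789':
    dispatch[c] = read_number
for c in range(0xC0, 0x100):
    dispatch[c] = pass_utf8

def tokenise(data):                 # data is a bytes object
    i, tokens = 0, []
    while i < len(data):
        tok, i = dispatch[data[i]](data, i)
        if tok is not None:
            tokens.append(tok)
    return tokens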
I understand that Python 3, when reading files in text mode, can do this
decoding automatically and give you a string that might contain code
points above 127. That's not a problem: you can still treat the first
128 code points exactly as I have, and give special treatment to the
rest. But you /will/ need to know whether the data is a raw UTF-8
stream or has already been decoded into Unicode.
(I'm talking about 'top-level' character dispatch, where you're looking
for the start of a token.)
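And for input that Python 3 has already decoded to str, a rough sketch
of what I mean (again, handle_name, handle_other, handle_nonascii and
next_token are just names I've made up for illustration):

IDENT_CHARS = set('abcdefghijklmnopqrstuvwxyz'
                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789')

def handle_name(text, i):
    start = i
    while i < len(text) and text[i] in IDENT_CHARS:
        i += 1
    return ('name', text[start:i]), i

def handle_other(text, i):
    return None, i + 1          # whitespace, punctuation etc. (simplified)

def handle_nonascii(text, i):
    return None, i + 1          # code point >= 128: step over it unchanged

dispatch128 = [handle_other] * 128
for c in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    dispatch128[ord(c)] = handle_name

def next_token(text, i):
    c = ord(text[i])
    handler = dispatch128[c] if c < 128 else handle_nonascii
    return handler(text, i)

# The caller has to know which form it is dealing with
# ('test.c' is just a placeholder filename):
#   raw UTF-8 bytes:   data = open('test.c', 'rb').read()
#   decoded Unicode:   text = open('test.c', encoding='utf-8').read()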
Note that my test data was 5,964,784 bytes on disk, of which 14 had
values above 127: probably 3 or 4 Unicode characters, and most likely in
comments.
Given that 99.9998% of the input bytes, and 99.9999% of the characters,
in this data are ASCII, is it unreasonable to concentrate on that
0..127 range?
--
Bartc