The Cost of Dynamism (was Re: Python 2.x or 3.x, which is faster?)
BartC
bc at freeuk.com
Mon Mar 21 13:12:45 EDT 2016
On 21/03/2016 12:59, Chris Angelico wrote:
> On Mon, Mar 21, 2016 at 11:34 PM, BartC <bc at freeuk.com> wrote:
>> For Python I would have used a table of 0..255 functions, indexed by the
>> ord() code of each character. So all 52 letter codes map to the same
>> name-handling function. (No Dict is needed at this point.)
>>
>
> Once again, you forget that there are not 256 characters - there are
> 1114112. (Give or take.)
The original code for this test expected the data to be a series of
bytes, mostly ASCII. Any non-ASCII text in the input was expected to be
UTF-8-encoded.
Since this was designed to tokenise C: as far as I know, C doesn't
support Unicode except within comments and within string literals. For
those purposes, it is not necessary to do anything with UTF-8 multi-byte
sequences except ignore them or pass them through unchanged. (I'm
ignoring 'wide' string and char literals.)
But it doesn't make any difference: you process a byte at a time, and
trap codes C0 to FF, which mark the start of a UTF-8 multi-byte sequence.
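Roughly, a minimal sketch of that kind of byte dispatch in Python might
look like this (the handler names - read_name, read_number, skip_other,
pass_utf8 - and the tokenise() driver are just illustrative, not taken
from my actual code):

# 256-entry function table, indexed by the byte value.
NAME_CHARS = frozenset(b'abcdefghijklmnopqrstuvwxyz'
                       b'ABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789')
DIGITS = frozenset(b'0123456789')

def read_name(data, i):
    # All 52 letter codes (plus '_') land here; consume an identifier.
    start = i
    while i < len(data) and data[i] in NAME_CHARS:
        i += 1
    return ('name', data[start:i]), i

def read_number(data, i):
    start = i
    while i < len(data) and data[i] in DIGITS:
        i += 1
    return ('number', data[start:i]), i

def skip_other(data, i):
    # Whitespace, punctuation etc. - heavily simplified here.
    return None, i + 1

def pass_utf8(data, i):
    # Bytes C0..FF start a UTF-8 multi-byte sequence; just step over
    # them (or copy them through unchanged).
    return None, i + 1

dispatch = [skip_other] * 256
for c in b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    dispatch[c] = read_name
for c in b'0123456789':
    dispatch[c] = read_number
for c in range(0xC0, 0x100):
    dispatch[c] = pass_utf8

def tokenise(data):                 # data is a bytes object
    i, tokens = 0, []
    while i < len(data):
        tok, i = dispatch[data[i]](data, i)
        if tok is not None:
            tokens.append(tok)
    return tokens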
I understand that Python 3, when reading files in text mode, can do this
decoding automatically and give you a string that might contain code
points above 127. That's not a problem: you can still treat the first
128 code points exactly as I have, and give special treatment to the
rest. But you /will/ need to know whether the data is a raw UTF-8
stream or has already been decoded into Unicode.
(I'm talking about 'top-level' character dispatch, where you're looking
for the start of a token.)
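And for input that Python 3 has already decoded to str, a rough sketch
of what I mean (again, handle_name, handle_other, handle_nonascii and
next_token are just names I've made up for illustration):

IDENT_CHARS = set('abcdefghijklmnopqrstuvwxyz'
                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789')

def handle_name(text, i):
    start = i
    while i < len(text) and text[i] in IDENT_CHARS:
        i += 1
    return ('name', text[start:i]), i

def handle_other(text, i):
    return None, i + 1          # whitespace, punctuation etc. (simplified)

def handle_nonascii(text, i):
    return None, i + 1          # code point >= 128: step over it unchanged

dispatch128 = [handle_other] * 128
for c in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    dispatch128[ord(c)] = handle_name

def next_token(text, i):
    c = ord(text[i])
    handler = dispatch128[c] if c < 128 else handle_nonascii
    return handler(text, i)

# The caller has to know which form it is dealing with
# ('test.c' is just a placeholder filename):
#   raw UTF-8 bytes:   data = open('test.c', 'rb').read()
#   decoded Unicode:   text = open('test.c', encoding='utf-8').read()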
Note that my test data was 5,964,784 bytes on disk, of which 14 had
values above 127: probably 3 or 4 Unicode characters, and most likely in
comments.
Given that 99.9998% of the input bytes, and 99.9999% of the characters,
in this data are ASCII, is it unreasonable to concentrate on that
0..127 range?
--
Bartc