how to avoid leading white spaces

Chris Torek nospam at torek.net
Fri Jun 3 00:30:46 EDT 2011


>In article <is9ikg083h at news1.newsguy.com>,
> Chris Torek <nospam at torek.net> wrote:
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).

In article <roy-751FAC.23443902062011 at news.panix.com>
Roy Smith  <roy at panix.com> wrote:
>I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't 
>unicode inherently 32 bit?

Well, not exactly.  As I understand it, Python is normally built
with a 16-bit "unicode character" type though (using either UCS-2
or UTF-16 internally; but I admit I have been far too lazy to look
up stuff like surrogates here :-) ).

>In any case, while I could imagine building a 2^16 entry jump table, 
>clearly it's infeasible (with today's hardware) to build a 2^32 entry 
>table. But, there's nothing that really requires you to build a table at 
>all.  If I understand the algorithm right, all that's really required is 
>that you can map a character to a shift value.

Right.  See the URL I included for an example.  The point here,
though, is ... well:

>For an 8 bit character set, an indexed jump table makes sense.  For a 
>larger character set, I would imagine you would do some heuristic 
>pre-processing to see if your search string consisted only of characters 
>in one unicode plane and use that fact to build a table which only 
>indexes that plane.  Or, maybe use a hash table instead of a regular 
>indexed table.

Just so.  You have to pay for one scan through the string to build
a hash-table of offsets -- an expense similar to that for building
the 256-entry 8-bit table, perhaps, depending on string length --
but then you pay again for each character looked-at, since:

    skip = hashed_lookup(table, this_char);

is a more complex operation than:

    skip = table[this_char];

(where table is a simple array, hence the C-style semicolons: this
is not Python pseudo-code :-) ).  Hence, a "penalty".

>Not as fast, but only slower by a small constant factor, 
>which is not a horrendous price to pay in a fully i18n world :-)

Indeed.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html



More information about the Python-list mailing list