how to avoid leading white spaces
Roy Smith
roy at panix.com
Thu Jun 2 23:44:40 EDT 2011
In article <is9ikg083h at news1.newsguy.com>,
Chris Torek <nospam at torek.net> wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).
I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
unicode inherently 32 bit? Or at least 20-something bit? Things like
UTF-16 are just one way to encode it.
In any case, while I could imagine building a 2^16 entry jump table,
clearly it's infeasible (with today's hardware) to build a 2^32 entry
table. But, there's nothing that really requires you to build a table at
all. If I understand the algorithm right, all that's really required is
that you can map a character to a shift value.
For an 8 bit character set, an indexed jump table makes sense. For a
larger character set, I would imagine you would do some heuristic
pre-processing to see if your search string consisted only of characters
in one unicode plane and use that fact to build a table which only
indexes that plane. Or, maybe use a hash table instead of a regular
indexed table. Not as fast, but only slower by a small constant factor,
which is not a horrendous price to pay in a fully i18n world :-)
More information about the Python-list
mailing list