How best to handle Unicode where only 8-bit chars are now?

Fredrik Lundh fredrik at pythonware.com
Sat Feb 8 15:54:41 CET 2003


Skip Montanaro wrote:

> The csv module supporting PEP 305 doesn't do Unicode yet.  All string
> manipulation is currently done using null-terminated C strings for speed.
> I'm looking for suggestions about how best to incorporate Unicode string
> handling into the code.  I see three possibilities:
>
>     1 Try to treat unicode objects the same as string objects - extract the
>       raw data and handle them as bytes.
>
>     2 Provide two different state machines, the current fast one which
>       operates only on C strings representing ASCII data and a slow one
>       which operates on unicode objects.
>
>     3 Rewrite the state machine to operate at the level of string or unicode
>       objects even though it will slow down the common case significantly.
>
> Option 1 seems doomed because you'd be trying to mix processing of wide and
> narrow characters.  Option 2 seems the least disruptive, but if somehow a
> unicode object snuck into the system (say, a single field was unicode or the
> delimiter was specified as u'"' even though it was really ASCII), the whole
> system might mysteriously slow down.  Option 3 seems the cleanest, but would
> slow everything down significantly because character extraction and
> comparison would require a function call instead of an array index operation
> or a simple comparison.

what makes you think 8-bit == fast and unicode == slow?

have you looked at SRE?  it compiles portions of itself twice, to get 8-bit
and unicode versions of the core engine.  on modern machines, the unicode
version often runs *faster* than the corresponding 8-bit code.

</F>