How best to handle Unicode where only 8-bit chars are now?
fredrik at pythonware.com
Sat Feb 8 15:54:41 CET 2003
Skip Montanaro wrote:
> The csv module supporting PEP 305 doesn't do Unicode yet. All string
> manipulation is currently done using null-terminated C strings for speed.
> I'm looking for suggestions about how best to incorporate Unicode string
> handling into the code. I see three possibilities:
> 1 Try to treat unicode objects the same as string objects - extract the
> raw data and handle them as bytes.
> 2 Provide two different state machines, the current fast one which
> operates only on C strings representing ASCII data and a slow one
> which operates on unicode objects.
> 3 Rewrite the state machine to operate at the level of string or unicode
> objects even though it will slow down the common case significantly.
> Option 1 seems doomed because you'd be trying to mix processing of wide and
> narrow characters. Option 2 seems the least disruptive, but if somehow a
> unicode object snuck into the system (say, a single field was unicode or the
> delimiter was specified as u'"' even though it was really ASCII), the whole
> system might mysteriously slow down. Option 3 seems the cleanest, but would
> slow everything down significantly because character extraction and
> comparison would require a function call instead of an array index operation
> or a simple comparison.
what makes you think 8-bit == fast and unicode == slow?
have you looked at SRE? it compiles portions of itself twice, to get 8-bit
and unicode versions of the core engine. on modern machines, the unicode
version often runs *faster* than the corresponding 8-bit code.