unicode and the FSR [was: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Ethan Furman ethan at stoneleaf.us
Fri Mar 29 05:56:05 CET 2013

On 03/28/2013 08:34 PM, Neil Hodgson wrote:
> Steven D'Aprano:
>> Any string method that takes a starting offset requires the method to
>> walk the string byte-by-byte. I've even seen languages put responsibility
>> for dealing with that onto the programmer: the "start offset" is given in
>> *bytes*, not characters. I don't remember what language this was... it
>> might have been Haskell? Whatever it was, it horrified me.
>     It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.

Horrifying or not, I am willing to give up a small amount of speed for correctness.  Heck, I'm willing to give up a lot
of speed for correctness.  Once I have my slow but correct prototype going I can recode in a faster language (if needed)
and compare its blazingly fast output with my slowly-generated but known-good output.

> You can wrap access in iterators that hide the byte offsets if you like. This then ensures that all operations on
> those iterators are safe only allowing the iterator to point at the start/end of valid characters.

Sure.  Or I can let Python handle it for me.
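
A minimal sketch of the difference, assuming Python 3 where str offsets count code points and bytes offsets count bytes.
The sample text and the helper name splits_character are made up for illustration; they are not from any library.

```python
text = "d\u00e9j\u00e0 vu \u2211"      # "deja vu" with accents plus a summation sign, 9 code points
data = text.encode("utf-8")            # the same text as 13 UTF-8 bytes

def splits_character(raw, offset):
    """Hypothetical helper: True if a byte offset lands inside a multi-byte character."""
    try:
        raw[offset:].decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# str offsets count characters, so every offset is a valid character boundary.
char_pos = text.find("vu")             # 5
# bytes offsets count bytes, so the same search gives a different number.
byte_pos = data.find(b"vu")            # 7

print(char_pos, byte_pos)
print(splits_character(data, 1), splits_character(data, 2))
```

With byte offsets the caller has to know that offset 2 falls in the middle of the second character; with str that
situation cannot arise.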

>     The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside
> Latin-1 will double in size as a Python string.

True.  But how often do you have the entire document as a single string?  Use readlines() instead of read(), so that
only the line containing the symbol gets the wider representation.  Besides, memory is cheap.
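
A rough way to check the size argument, assuming CPython 3.3 or later (the PEP 393 flexible string representation).
sys.getsizeof reports implementation-specific numbers, so treat this as a sketch rather than a benchmark.  The helper
name bytes_per_char and the sample lines are made up for illustration.  Note that a symbol in the BMP doubles the width
of a Latin-1 string, while an emoji outside the BMP takes it to four bytes per character.

```python
import sys

def bytes_per_char(ch):
    """Hypothetical helper: storage per code point for a string made only of ch."""
    # Comparing two lengths cancels out the fixed per-object header.
    return (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) // 1000

widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\u00e9"),       # e with acute accent
    "bmp": bytes_per_char("\u2211"),          # summation sign
    "astral": bytes_per_char("\U0001F600"),   # an emoji outside the BMP
}
print(widths)

# A made-up "French document": 1000 Latin-1 lines, one of which gains a math symbol.
lines = ["Ligne %04d : " % i + "\u00e9" * 200 + "\n" for i in range(1000)]
lines[500] = lines[500].replace("\n", " \u2211\n")

whole = "".join(lines)                               # like read(): one big string
whole_size = sys.getsizeof(whole)                    # every character stored 2 bytes wide
per_line_size = sum(sys.getsizeof(s) for s in lines) # like readlines(): only one line is wide

print(whole_size, per_line_size)
```

The per-line total still pays a fixed header for every string, so the saving depends on how long the lines are; with
very short lines the headers can outweigh it.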

