unicode and the FSR [was: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
ethan at stoneleaf.us
Fri Mar 29 05:56:05 CET 2013
On 03/28/2013 08:34 PM, Neil Hodgson wrote:
> Steven D'Aprano:
>> Any string method that takes a starting offset requires the method to
>> walk the string byte-by-byte. I've even seen languages put responsibility
>> for dealing with that onto the programmer: the "start offset" is given in
>> *bytes*, not characters. I don't remember what language this was... it
>> might have been Haskell? Whatever it was, it horrified me.
> It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.
Horrifying or not, I am willing to give up a small amount of speed for correctness. Heck, I'm willing to give up a lot
of speed for correctness. Once I have my slow but correct prototype going, I can recode in a faster language (if needed)
and compare its blazingly fast output with my slowly-generated but known-good output.
> You can wrap
> access in iterators that hide the byte offsets if you like. This then ensures that all operations on those iterators are
> safe only allowing the iterator to point at the start/end of valid characters.
Sure. Or I can let Python handle it for me.
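To illustrate what "let Python handle it" means here, a minimal sketch with a made-up sample string: str offsets count code points, while offsets into the UTF-8 encoded bytes count bytes, and a byte offset can land in the middle of a character.

```python
# Hypothetical sample text; "ï" and "é" each take two bytes in UTF-8.
text = "naïve café"
needle = "café"

# Offset into the str: counted in code points.
char_index = text.find(needle)

# Offset into the encoded form: counted in bytes.
data = text.encode("utf-8")
byte_offset = data.find(needle.encode("utf-8"))

print(char_index, byte_offset)

# Slicing a str by code point index can never split a character.
print(text[char_index:])

# Slicing bytes at an arbitrary offset can: data[:3] ends inside "ï".
try:
    data[:3].decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    split_detected = True
print(split_detected)
```

The two offsets differ as soon as any character before the match needs more than one byte, which is the bookkeeping the byte-offset approach leaves to the programmer.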
> The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside
> Latin-1 will double in size as a Python string.
True. But how often do you have the entire document as a single string? Use readlines() instead of read(). Besides,
memory is cheap.
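A rough way to see both effects, assuming CPython 3.3 or later (where the FSR from PEP 393 applies); the sample strings are made up and exact sizes vary by build, so none are quoted here:

```python
import sys

# Hypothetical sample lines of equal length.
latin1_line = "é" * 1000        # every character fits in Latin-1
wide_line = "é" * 999 + "∑"     # one character (U+2211) outside Latin-1

size_latin1 = sys.getsizeof(latin1_line)
size_wide = sys.getsizeof(wide_line)
print(size_latin1, size_wide)   # the second is roughly twice the first

# Kept as separate strings, only the line holding the symbol is widened.
lines = [latin1_line, latin1_line, wide_line]
separate = sum(sys.getsizeof(line) for line in lines)

# Joined into one string, every character is stored at the wider width.
joined = sys.getsizeof("".join(lines))
print(separate, joined)
```

One caveat on the per-line approach: every str object carries a fixed overhead, so with very short lines that overhead can outweigh the saving.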