Unicode and Python - how often do you index strings?
Peter Otten
__peter__ at web.de
Wed Jun 4 06:10:41 EDT 2014
Mark Lawrence wrote:
> On 04/06/2014 01:39, Chris Angelico wrote:
>> A current discussion regarding Python's Unicode support centres (or
>> centers, depending on how close you are to the cent[er]{2} of the
>> universe) around one critical question: Is string indexing common?
>>
>> Python strings can be indexed with integers to produce characters
>> (strings of length 1). They can also be iterated over from beginning
>> to end. Lots of operations can be built on either one of those two
>> primitives; the question is, how much can NOT be implemented
>> efficiently over iteration, and MUST use indexing? Theories are great,
>> but solid use-cases are better - ideally, examples from actual
>> production code (actual code optional).
>>
>> I know the collective experience of python-list can't fail to bring up
>> a few solid examples here :)
>>
>> Thanks in advance, all!!
>>
>> ChrisA
>>
>
> Single characters quite often, iteration rarely if ever, slicing all the
> time, but does that last one count?
The indices used for slicing typically don't come out of nowhere. A simple
example would be
def strip_prefix(text, prefix):
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text
If both prefix and text use UTF-8 internally, the byte offset is already
known: it is simply the byte length of prefix. The question is then how we
can preserve that information.
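To make the point concrete, here is a minimal sketch that works on explicit
UTF-8 bytes objects as a stand-in for a str type with a UTF-8 internal
representation. The name strip_prefix_utf8 is made up for illustration; it
only shows that the slice position falls out of the prefix length with no
scan over the text.

```python
def strip_prefix_utf8(text: bytes, prefix: bytes) -> bytes:
    """Strip prefix from UTF-8 encoded text (hypothetical helper)."""
    if text.startswith(prefix):
        # len(prefix) is a byte count, so it is directly usable as the
        # byte offset into text; no character counting is needed.
        text = text[len(prefix):]
    return text


text = "αλφα-beta".encode("utf-8")
prefix = "αλφα".encode("utf-8")        # 4 characters, 8 bytes

stripped = strip_prefix_utf8(text, prefix).decode("utf-8")
unchanged = strip_prefix_utf8(text, b"beta").decode("utf-8")
```

Because a valid UTF-8 prefix always ends on a character boundary, the
remaining bytes are valid UTF-8 as well.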
The first approach that comes to mind is an int subtype:
>>> for i, c in enumerate("123αλφα"):
... print(i, byteoffset(i), c)
...
0 0 1
1 1 2
2 2 3
3 3 α
4 5 λ
5 7 φ
6 9 α
This would work in the strip_prefix() example, but lead to data corruption
in most other cases unless limited to a specific string -- in which case it
would no longer work with strip_prefix().
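A rough sketch of the int subtype idea and of the corruption risk, again
using explicit UTF-8 bytes in place of a real UTF-8 based str. OffsetIndex
and indexed() are hypothetical names, not an existing API.

```python
class OffsetIndex(int):
    """Hypothetical int subtype: a character index that also remembers
    the matching UTF-8 byte offset."""

    def __new__(cls, index, byteoffset):
        self = super().__new__(cls, index)
        self.byteoffset = byteoffset
        return self


def indexed(text):
    """Like enumerate(), but yield OffsetIndex objects as the index."""
    offset = 0
    for i, c in enumerate(text):
        yield OffsetIndex(i, offset), c
        offset += len(c.encode("utf-8"))


rows = [(int(i), i.byteoffset, c) for i, c in indexed("123αλφα")]

# The pitfall: the byte offset is only meaningful for the string it came
# from. Applied to a different string it can land inside a character.
i = [i for i, c in indexed("123αλφα")][4]    # index 4, byte offset 5
other = "αλφα123".encode("utf-8")
corrupted = False
try:
    other[:i.byteoffset].decode("utf-8")     # cuts "φ" in half
except UnicodeDecodeError:
    corrupted = True
```

Falling back to the plain integer value whenever the index is used with a
different string would avoid the corruption, but then the shortcut is lost.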
So a new interface would be needed. My second try, an object with two byte
offsets linked to a specific string:
>>> span("foobar").startswith("oob")
>>> p = span("foobar").startswith("foo")
>>> p.replace("baz")
'bazbar'
>>> p.before()
''
>>> p.after()
'bar'
>>> span("foo bar baz").find("bar").replace("spam")
'foo spam baz'
I have no idea if that could work out...
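For what it's worth, a toy version of such an interface is easy to write on
top of explicit UTF-8 bytes. This is only a sketch under that assumption:
the Match class and all internals are invented here, and only the method
names (startswith, find, replace, before, after) follow the session above.

```python
class Match:
    """Two byte offsets tied to one specific UTF-8 buffer (hypothetical)."""

    def __init__(self, data, start, end):
        self._data = data
        self.start = start
        self.end = end

    def before(self):
        return self._data[:self.start].decode("utf-8")

    def after(self):
        return self._data[self.end:].decode("utf-8")

    def replace(self, new):
        return self.before() + new + self.after()


class span:
    """Wrap a string so that searches return Match objects or None."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def startswith(self, prefix):
        p = prefix.encode("utf-8")
        if self._data.startswith(p):
            return Match(self._data, 0, len(p))
        return None

    def find(self, sub):
        s = sub.encode("utf-8")
        start = self._data.find(s)
        if start < 0:
            return None
        # A valid UTF-8 needle can only match on a character boundary,
        # so both offsets are safe to slice at.
        return Match(self._data, start, start + len(s))


m = span("123αλφα").find("λφ")
result = m.replace("--")
```

Since the offsets never leave the object that owns the buffer, they cannot
be applied to the wrong string by accident.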