how to avoid leading white spaces
Chris Torek
nospam at torek.net
Wed Jun 8 22:32:08 EDT 2011
>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table. The re.compile() gets to re-use
>> the table every time.
In article <mailman.2508.1307394262.9059.python-list at python.org>
Ian <hobson42 at gmail.com> wrote:
>Is that true? My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.
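As I understand it, strings in Python 3 are Unicode internally and
(apparently) use wchar_t. Byte strings in Python 3 are of course
byte strings, not UTF-8 encoded. Either way each element has a
fixed width, so skipping N characters is plain index arithmetic
and the string does not have to be walked.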
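
To make the skip-table point concrete, here is a minimal sketch of the
Horspool simplification of Boyer-Moore. It is an illustration only, not
what CPython's "in" operator or the re module actually implements, and the
names build_skip_table and horspool_find are made up for this example. It
assumes the text is an indexable fixed-width sequence (str or bytes), so
that each skip is a single addition to the position.

```python
def build_skip_table(pattern):
    """Map each element of the pattern (except the last) to its
    distance from the end of the pattern. Later duplicates overwrite
    earlier ones, so the rightmost occurrence wins."""
    last = len(pattern) - 1
    return {ch: last - i for i, ch in enumerate(pattern[:-1])}


def horspool_find(text, pattern, skip=None):
    """Return the index of the first match of pattern in text, or -1.

    Pass a precomputed table as `skip` to reuse the setup work
    across many searches for the same pattern."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if skip is None:
        skip = build_skip_table(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Look at the text element aligned with the pattern's last
        # element; jump by its table entry, or by the whole pattern
        # length if it does not occur in the pattern at all.
        pos += skip.get(text[pos + m - 1], m)
    return -1


table = build_skip_table("world")        # built once, reusable
print(horspool_find("hello world", "world", table))
print(horspool_find(b"hello world", b"world"))
```

Building the table once and passing it in mirrors what re.compile() would
be able to do, whereas a bare "s1 in s2" would have to pay the setup cost
on every call.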
>Or am I misunderstanding something.
Here's Python 2.7 on a Linux box:
>>> import sys
>>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
38 39 40
>>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
56 60 64
Each extra character adds 1 byte to a str and 4 bytes to a unicode
object. This implies that strings in Python 2.x are just byte strings
(same as b"..." in Python 3.x) and never actually contain Unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
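
The same measurement can be repeated in Python 3 syntax. The sketch below
(the helper name bytes_per_item is made up here) estimates per-element
storage from the size difference between two lengths of the same repeated
unit. The results are specific to the interpreter and build: as far as I
know, CPython 3.3 and later pick 1, 2 or 4 bytes per character depending
on the widest character in the string, so a str will not necessarily show
the 4-byte step seen above.

```python
import sys


def bytes_per_item(unit, count=8):
    """Estimate storage per element as the growth in sys.getsizeof()
    when one more copy of `unit` is appended."""
    longer = sys.getsizeof(unit * (count + 1))
    shorter = sys.getsizeof(unit * count)
    return longer - shorter


# A bytes object, an ASCII-only str, and a str of non-BMP characters.
print("bytes:     ", bytes_per_item(b"a"))
print("ascii str: ", bytes_per_item("a"))
print("astral str:", bytes_per_item("\U0001F600"))
```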
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html