how to avoid leading white spaces

Chris Torek nospam at torek.net
Wed Jun 8 22:32:08 EDT 2011


>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table.  The re.compile() gets to re-use
>> the table every time.

In article <mailman.2508.1307394262.9059.python-list at python.org>
Ian  <hobson42 at gmail.com> wrote:
>Is that true?  My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

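As I understand it, strings in Python 3 are Unicode internally and
(apparently) use wchar_t, i.e., fixed-width code units, so skipping
ahead N characters is plain index arithmetic with nothing to walk.
Byte strings in Python 3 are of course byte strings, not UTF-8 encoded.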
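
To make the skip-table point concrete, here is a minimal sketch of the
Horspool simplification of Boyer-Moore.  It is only an illustration, not
what CPython's "in" operator or the re module actually do, and the names
build_skip_table and horspool_find are made up for this example.  It
assumes the text is indexable in constant time, which holds for both str
and bytes objects.

```python
def build_skip_table(pattern):
    """Map each element of pattern (except the last) to its distance
    from the end of the pattern.  Built once, reusable for every search."""
    last = len(pattern) - 1
    return {ch: last - i for i, ch in enumerate(pattern[:-1])}


def horspool_find(text, pattern, skip=None):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if skip is None:
        skip = build_skip_table(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # The skip is a plain index addition: elements are fixed-width,
        # so nothing between pos and pos + skip has to be examined.
        pos += skip.get(text[pos + m - 1], m)
    return -1


table = build_skip_table("world")            # set up once, as re.compile() could
print(horspool_find("hello world", "world", table))
print(horspool_find(b"hello world", b"world"))
```

Passing a prebuilt table mirrors the re.compile() case above: the setup
cost is paid once and each later search only does the skipping.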

>Or am I misunderstanding something.

Here's Python 2.7 on a Linux box:

    >>> import sys
    >>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
    38 39 40
    >>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
    56 60 64

Each additional character adds 1 byte to a str and 4 bytes to a
unicode object.  This implies that strings in Python 2.x are just
byte strings (same as b"..." in Python 3.x) and never actually
contain Unicode; and unicode strings (same as "..." in Python 3.x)
use 4-byte "characters" per that box's wchar_t.
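
The same measurement can be repeated on Python 3 with a small helper.
This is only a sketch: bytes_per_extra_char is a name invented here,
sys.getsizeof() reports CPython implementation details, and the numbers
depend on the interpreter version and build.  As an assumption beyond
the session above, CPython 3.3 and later pick the storage width per
string, so different str samples may report different growth.

```python
import sys


def bytes_per_extra_char(sample):
    """Growth in sys.getsizeof() when a one-element sample is doubled."""
    return sys.getsizeof(sample * 2) - sys.getsizeof(sample)


# One-element samples: a byte, an ASCII character, a BMP character,
# and a character outside the BMP.
for sample in (b"a", "a", "\u20ac", "\U0001F600"):
    print(repr(sample), bytes_per_extra_char(sample))
```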
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html


