[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

Steve Willoughby steve at alchemy.com
Sun Nov 20 21:30:16 CET 2011


It's customary to copy the list with answers, so everyone who may run 
into the same issue can benefit, too.

On 20-Nov-11 11:38, dave selby wrote:
> It came from some automated HTML generation app ... I just had the
> idea of looking at it with ghex .... every other character is \00
> !!!!, that's mad. OK will try and replace('\00', '') in the string
> before splitting

Those bytes are there for a reason; it's not mad.  The data is using wide 
characters, probably a two-byte Unicode encoding such as UTF-16.  If there 
are special characters involved (multinational applications or whatever), 
you'll destroy them by killing the null bytes, and you won't handle the 
case of the high-order byte being something other than zero.
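
To make that concrete, here is a small sketch in Python 3 syntax. The 
sample string is made up, and the big-endian byte order is an assumption: 
the fragment you posted doesn't show which byte order the file really uses.

```python
# Hypothetical sample: "Tria" plus a euro sign, encoded as UTF-16
# big-endian (the byte order is an assumption for illustration).
raw = "Tria\u20ac".encode("utf-16-be")

# Stripping the null bytes happens to work for the ASCII letters...
stripped = raw.replace(b"\x00", b"")

# ...but the euro sign (U+20AC) has a non-zero high-order byte, so it is
# left behind as two unrelated bytes instead of one character.
decoded = raw.decode("utf-16-be")

print(repr(stripped))  # the euro sign is gone, two stray bytes remain
print(repr(decoded))   # decoding keeps every character intact
```

The ASCII letters survive the replace() only because their high-order 
byte happens to be zero.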

Check out Python's Unicode handling, and character set encode/decode 
features for a robust way to translate the output you're getting.
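
As a rough sketch of what I mean (Python 3 syntax; the sample markup is 
made up, and I'm assuming the file is UTF-16 with a byte order mark, which 
you'd need to confirm for your data): decode the bytes first, then split.

```python
import re

# Hypothetical stand-in for the generated file: UTF-16 with a BOM.
raw = "<html><b>Tria</b></html>".encode("utf-16")

# Decode the bytes to text first. The "utf-16" codec reads the BOM to pick
# the byte order; use "utf-16-le" or "utf-16-be" if the file has no BOM.
html = raw.decode("utf-16")

# Now the original split works on real characters; drop the empty pieces
# that re.split leaves between adjacent tags.
text = [piece for piece in re.split(r"<.*?>", html) if piece]

print(text.index("Tria"))
```

Splitting the raw bytes instead cuts each two-byte character in half at 
the tag boundaries, which is why the pieces come out with stray \x00 bytes 
on the ends.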


>
> Cheers
>
> Dave
>
> On 20 November 2011 19:15, Steve Willoughby<steve at alchemy.com>  wrote:
>> Where did the string come from?  It looks at first glance like you have two bytes for each character instead of the one you expect.  Is this perhaps a Unicode string instead of ASCII?
>>
>> On 2011/11/20, at 10:28, dave selby<dave6502 at gmail.com>  wrote:
>>
>>> Hi All,
>>>
>>> I have a long string which is an HTML file, I strip the HTML tags away
>>> and make a list with
>>>
>>> text = re.split('<.*?>', HTML)
>>>
>>> I then tried to search for a string with text.index(...) but it was
>>> not found. Printing HTML to a terminal I get what I expect, a block of
>>> tags and text, but when I split the HTML and print text I get loads of
>>>
>>> \x00T\x00r\x00i\x00a\x00  i.e. I get \x00 breaking up every character.
>>>
>>> Any idea what is happening and how to get back to a list of ASCII strings?
>>>
>>> Cheers
>>>
>>> Dave


-- 
Steve Willoughby / steve at alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C

