[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?
Steve Willoughby
steve at alchemy.com
Sun Nov 20 21:30:16 CET 2011
It's customary to copy the list with answers, so that everyone who may
run into the same issue can benefit, too.
On 20-Nov-11 11:38, dave selby wrote:
> It came from some automated HTML generation app ... I just had the
> idea of looking at it with ghex .... every other character is \00
> !!!!, that's mad. OK, will try and replace('\00', '') in the string
> before splitting
Those bytes are there for a reason; it's not mad. The data is using wide
(two-byte) characters, probably because it's in a 16-bit Unicode encoding
such as UTF-16. If there are special characters involved (multinational
applications or whatever), you'll destroy them by killing the null bytes,
and you won't handle the case of that high-order byte being something
other than zero.
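To illustrate the point about destroying characters, here is a small
sketch in Python 3 syntax. The sample text and the choice of UTF-16
big-endian are assumptions made up for illustration; they are not taken
from Dave's actual file.

```python
# Hypothetical sample: "Tria" followed by one non-ASCII character (the euro sign).
sample = "Tria\u20ac"

# Assumed encoding: UTF-16 big-endian, i.e. two bytes per character, high byte first.
raw = sample.encode("utf-16-be")

# Naive approach: throw away every null byte.
stripped = raw.replace(b"\x00", b"")

# Robust approach: decode with the encoding the data was written in.
decoded = raw.decode("utf-16-be")

print(raw)       # ASCII characters carry a zero high-order byte
print(stripped)  # the euro sign has turned into two unrelated bytes
print(decoded == sample)
```

The ASCII letters survive the stripping, but the euro sign (U+20AC) has a
non-zero high-order byte, so it comes out as a space followed by a stray
0xAC byte instead of the original character.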
Check out Python's Unicode handling, and character set encode/decode
features for a robust way to translate the output you're getting.
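As a rough sketch of what that could look like (Python 3 syntax): decode
the bytes to text first, then do the split on the decoded text. The
sample data and the "utf-16-be" codec are assumptions based on the
\x00T\x00r... pattern; the real file may use a different encoding, so
check before relying on this.

```python
import re

# Hypothetical stand-in for the bytes read from the generated HTML file.
# UTF-16 big-endian is an assumption, not something confirmed in this thread.
raw_html = "<p>Trial</p><p>caf\u00e9</p>".encode("utf-16-be")

# Decode first, so the rest of the program works on real text.
html = raw_html.decode("utf-16-be")

# The same split as in the original post, now applied to decoded text.
text = re.split("<.*?>", html)

print(text)
print(text.index("Trial"))  # the search now finds the string
```

If the file starts with a byte order mark, the plain "utf-16" codec will
use it to pick the byte order, which saves guessing between the -be and
-le variants.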
>
> Cheers
>
> Dave
>
> On 20 November 2011 19:15, Steve Willoughby<steve at alchemy.com> wrote:
>> Where did the string come from? It looks at first glance like you have two bytes for each character instead of the one you expect. Is this perhaps a Unicode string instead of ASCII?
>>
>> On 2011/11/20, at 10:28, dave selby<dave6502 at gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a long string which is an HTML file. I strip the HTML tags away
>>> and make a list with
>>>
>>> text = re.split('<.*?>', HTML)
>>>
>>> I then tried to search for a string with text.index(...) but it was
>>> not found. Printing HTML to a terminal, I get what I expect: a block of
>>> tags and text. But when I split the HTML and print text, I get loads of
>>>
>>> \x00T\x00r\x00i\x00a\x00 i.e. I get \x00 breaking up every character.
>>>
>>> Any idea what is happening and how to get back to a list of ASCII strings?
>>>
>>> Cheers
>>>
>>> Dave
>>>
>>
>
>
>
--
Steve Willoughby / steve at alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C