[Tutor] parse text file

Wed Feb 3 09:18:23 CET 2010

On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine <norman at khine.net> wrote:

> i am no expert, but there seems to be a bigger difference.
> 
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
> 
> where as you get
> 
> Sat\xc3\xa9re Maw\xc3\xa9
> 
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue that mixes python str, unicode strings, and their repr().
Kent is right that the *python string* "\xc3\xa9" is the UTF-8 encoding of 'é' (2 bytes), whereas \xe9 is the *unicode code point* of 'é' (U+00E9), which should only appear in a unicode string.
So:
   unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
   u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
   unicode("\xc3\xa9", "utf8") == u"\u00e9"	-- decoding
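
To check the round trip yourself, here is a minimal sketch. It assumes Python 3 syntax, where str plays the role of the unicode type used above and bytes plays the role of the old str; the variable names are only for illustration.

```python
# Python 3 sketch: str is the unicode type, bytes is the byte string.
text = "\u00e9"                    # one code point, U+00E9 ('é')
encoded = text.encode("utf-8")     # two bytes: 0xC3 0xA9
decoded = encoded.decode("utf-8")  # back to the one-character string

print(len(text), len(encoded))     # character count vs byte count
print(decoded == text)             # the round trip preserves the text
```

The character count and the byte count differ, which is the whole difference between \xe9 and \xc3\xa9.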

The question is: what do you want to do with the result? You need either the UTF-8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for processing). What you actually have is a mix of the two: the (python str) repr of a unicode string, which is why repr() shows a doubled backslash (\\xe9).
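
Here is a rough sketch of how such a mix can arise and how to get out of it. It assumes (I cannot check this from here) that your value really is the escaped repr of a unicode string. It is written in Python 3 syntax, where ascii() behaves like the old repr() on a unicode string; the variable names are made up for the example.

```python
name = "Sat\u00e9re Maw\u00e9"   # the real text: 'Satére Mawé'

# ascii() yields the escaped repr: quotes plus literal backslash, x, e, 9.
escaped = ascii(name)

# The UTF-8 form, suitable for output to a file or terminal.
utf8 = name.encode("utf-8")

# Going back: strip the quotes, then let the 'unicode_escape' codec
# interpret the literal backslash sequences again.
recovered = escaped[1:-1].encode("ascii").decode("unicode_escape")

print(escaped)
print(utf8)
print(recovered == name)
```

If the assumption holds, recovered is the string to process, and utf8 is the form to write out.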

> also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked re.DOTALL? (Without it, '.' does not match \n, so a pattern cannot match across line breaks.)
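
A small sketch of the effect. The sample text and the pattern are invented for the illustration; they are not your actual data or regex.

```python
import re

# Hypothetical input in which the wanted content spans two lines.
text = "<td>first line\nsecond line</td>"
pattern = r"<td>(.*?)</td>"

# By default '.' does not match a newline, so nothing is found.
without_flag = re.findall(pattern, text)

# With re.DOTALL, '.' also matches '\n' and the match can cross lines.
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag)
```

An empty list from findall on multi-line input is the typical symptom of the missing flag.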

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/