csv reader

Tue Dec 15 23:27:09 EST 2009

On Tue, 15 Dec 2009 19:12:01 -0300, Emmanuel <manouchk at gmail.com> wrote:

> Then my problem is different!
>
> In fact I'm reading a csv file saved from openoffice oocalc using
> UTF-8 encoding. I get a list of lists (let's call it tab) with the csv
> data.
> If I do:
>
> print tab[2][4]
> In ipython, I get:
> equação de Toricelli. Tarefa exercícios PVR 1 e 2 ; PVP 1
>
> If I only do:
> tab[2][4]
>
> In ipython, I get:
> 'equa\xc3\xa7\xc3\xa3o de Toricelli. Tarefa exerc\xc3\xadcios PVR 1 e
> 2 ; PVP 1'
>
> Does that mean that my problem is not the one I'm thinking?

Yes. You have a real problem, but not this one. When you say `print  
something`, you get a nice view of `something`, basically the result of  
doing `str(something)`. When you say `something` alone in the interpreter,  
you get a more formal representation, the result of calling  
`repr(something)`:

py> x = "ecuação"
py> print x
ecuação
py> x
'ecua\x87\xc6o'
py> print repr(x)
'ecua\x87\xc6o'

Those '' around the text and the \xNN notation allow for an unambiguous
representation. Two strings may look the same but be different, and
repr shows that.
('ecua\x87\xc6o' is encoded in cp850, a Windows console code page; you
should see 'equa\xc3\xa7\xc3\xa3o' for your string in utf-8)
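
To see how the same text turns into different bytes, here is a small sketch. The variable names are made up for illustration, and the text is written with escapes so the result does not depend on the console encoding. The bytes \x87\xc6 correspond to cp850; windows-1252 would give \xe7\xe3 instead.

```python
# -*- coding: utf-8 -*-
# "equação" as unicode text, written with escapes (U+00E7 = ç, U+00E3 = ã)
text = u"equa\xe7\xe3o"

# The same text encoded three ways gives three different byte strings.
utf8_bytes = text.encode("utf-8")
cp850_bytes = text.encode("cp850")
cp1252_bytes = text.encode("cp1252")

# repr() shows the bytes unambiguously, whatever the terminal displays.
print(repr(utf8_bytes))
print(repr(cp850_bytes))
print(repr(cp1252_bytes))

# Decoding with the matching encoding recovers the original text...
assert cp850_bytes.decode("cp850") == text
# ...while decoding with the wrong one silently gives different text.
wrong = cp850_bytes.decode("cp1252")
assert wrong != text
```

The last two lines are the reason repr is useful here: the wrong decoding raises no error, it just produces other characters.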

> My real problem is when I use that kind of UTF-8 encoded (?) string with
> selenium here.
> If I just replace the following line:
> self.sel.type("q", "equação")
>
> with:
> self.sel.type("q", u"equação")
>
>
> It works fine!

Yes: you should work with unicode most of the time. The "recipe" for
having as few unicode problems as possible says:

- convert the input data (read from external sources, like a file) from  
bytes to unicode, using the (known) encoding of those bytes

- handle unicode internally everywhere in your program

- and convert from unicode to bytes as late as possible, when writing
output (to screen, other files, etc) using the encoding expected by those
destinations.

See the Unicode How To: http://docs.python.org/howto/unicode.html
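
The three steps of the recipe above could be sketched like this. It is only an illustration: the bytes are kept in memory instead of coming from a real file, and the names (raw_input_bytes, process) are invented for the example.

```python
# -*- coding: utf-8 -*-

def process(text):
    """Hypothetical processing step; it only ever sees unicode text."""
    return text.upper()

# Bytes as they would arrive from a UTF-8 encoded file (assumed encoding).
raw_input_bytes = b"equa\xc3\xa7\xc3\xa3o"

# 1. Decode on input: bytes -> unicode, using the known encoding.
text = raw_input_bytes.decode("utf-8")

# 2. Work with unicode internally.
result = process(text)

# 3. Encode on output: unicode -> bytes, as late as possible.
raw_output_bytes = result.encode("utf-8")

# The byte string is longer than the text: each accented letter takes
# two bytes in UTF-8 but is a single character once decoded.
print(len(raw_input_bytes), len(text))
```

With real files, io.open(path, encoding="utf-8") does steps 1 and 3 for you, so the rest of the program never touches raw bytes.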

> The problem is that the csv.reader gives a "equação" and not a
> u"equação"

The csv module cannot handle unicode text directly, but see the last  
example in the csv documentation for a simple workaround:  
http://docs.python.org/library/csv.html
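
If the file is already UTF-8, as yours is, the idea boils down to decoding each cell after csv.reader has split the row. A minimal sketch follows, assuming the reader yields byte strings as in your session. decode_row is a hypothetical helper written for this message, not something from the csv module or its documentation, and the sample row stands in for what the reader returned.

```python
# -*- coding: utf-8 -*-

def decode_row(row, encoding="utf-8"):
    """Return a new list with every byte-string cell decoded to unicode."""
    return [cell.decode(encoding) for cell in row]

# Stand-in for one row as returned by csv.reader on a UTF-8 file:
# a list of byte strings (values taken from the session shown earlier).
row = [b"equa\xc3\xa7\xc3\xa3o de Toricelli", b"PVR 1 e 2"]

decoded = decode_row(row)

# Each cell is now unicode text and can be passed to code that expects
# u"..." strings.
print(repr(decoded[0]))
```

Applied to your data it would be something like tab = [decode_row(r) for r in tab], done once right after reading the file.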

-- 
Gabriel Genellina