raw_input() and utf-8 formatted chars
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Fri Nov 2 07:07:05 EDT 2007
On Thu, 01 Nov 2007 19:21:03 -0700, 7stud wrote:
> BeautifulSoup can convert an html entity representing an 'A' with
> umlaut, e.g.:
>
> &Auml;
>
> into an Ä without ever touching my keyboard. How does BeautifulSoup
> do it?
It maps the HTML entity names to Unicode characters. Take a look at the
`htmlentitydefs` module, in particular its `name2codepoint` dictionary.
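As a rough sketch of that lookup (not BeautifulSoup's actual code): the
snippet below assumes Python 3, where `htmlentitydefs` was renamed to
`html.entities`. The helper name `entity_to_char` is made up for
illustration.

```python
# Python 3 sketch; htmlentitydefs is called html.entities there.
from html.entities import name2codepoint


def entity_to_char(name):
    """Hypothetical helper: map an entity name such as 'Auml' to its character."""
    return chr(name2codepoint[name])


char = entity_to_char("Auml")
print(name2codepoint["Auml"])      # 196, i.e. U+00C4
print(repr(char.encode("utf-8")))  # b'\xc3\x84'
```

The second line of output is the UTF-8 byte sequence for the character,
which is what you get once the Unicode character is encoded.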
> from BeautifulSoup import BeautifulStoneSoup as bss
>
>
> s1 = "<h1>&Auml;</h1>" #&_Auml;_
> #I added the comment after the line to show the
> #format of the html entity. In case a browser
> #might render the comment into the actual character,
> #I added underscores to the html entity:
>
> soup = bss(s1)
> #gets the 'A' with umlaut out of the html:
> text = soup.contents[0].string
>
> new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
> print repr(new_s)
> print new_s
>
> I see the same output for both print statements, and what I see is an
> 'A' with umlaut. I expected that the first print statement would show
> the utf-8 encoding for the character.
Well, it does. Apparently your terminal, or wherever the output goes,
decodes that UTF-8 encoded 'Ä' and displays it. If you expected the output
'\xc3\x84', remember that you asked the soup object for its
representation, not a string for its representation. The object itself
decides what `repr(obj)` returns, and soup objects represent themselves as
UTF-8 encoded strings.
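To illustrate the point about `repr()`, here is a minimal sketch in
Python 3. `FakeSoup` is a hypothetical stand-in, not BeautifulSoup's real
class; it only shows that an object can return its plain text from
`__repr__`, whereas the `repr()` of a byte string shows escapes.

```python
class FakeSoup(object):
    """Hypothetical stand-in for a soup object, for illustration only."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    # The object decides what repr() returns; here it reuses __str__,
    # so no escape sequences ever appear in its representation.
    __repr__ = __str__


obj = FakeSoup(u"\u00c4")
encoded = u"\u00c4".encode("utf-8")

print(repr(obj) == str(obj))  # True: both are the bare character
print(repr(encoded))          # b'\xc3\x84': a byte string shows escapes
```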
Ciao,
Marc 'BlackJack' Rintsch