Unicode -> String problem
michael at stroeder.com
Tue Jul 10 15:21:05 CEST 2001
Jay Parlar wrote:
> My task is to create an HTML parser that will pull full text from HTML
Basically, your parser has to honour the charset declared in the HTTP
Content-Type header or in a <meta> tag, e.g.:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
Your parser should use the declared charset to decode the raw byte
strings into Unicode objects. HTML character entities also have to be
resolved to Unicode characters and added to those same Unicode objects,
so that all text ends up in one consistent representation.
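
A minimal sketch of the two steps above, written for a current Python 3
rather than the Python of 2001. The helper names (sniff_charset,
decode_html), the fallback charset, and the regular expression are
illustrative assumptions, not part of any existing parser. The regex is a
rough heuristic for finding the charset in a <meta> tag, and calling
html.unescape on the whole document is a simplification: a real parser
would resolve entities only in text nodes and attribute values.

```python
import html
import re

# Hypothetical heuristic: find "charset=..." inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_charset(raw, http_charset=None, default="iso-8859-1"):
    """Prefer the HTTP header charset, then a <meta> tag, then a default."""
    if http_charset:
        return http_charset
    match = _META_CHARSET.search(raw[:4096])  # only look near the top
    if match:
        return match.group(1).decode("ascii")
    return default  # assumed fallback, adjust to your data

def decode_html(raw, http_charset=None):
    """Decode raw bytes with the declared charset, then resolve entities."""
    text = raw.decode(sniff_charset(raw, http_charset), errors="replace")
    # Turns &auml;, &#228; etc. into the corresponding Unicode characters.
    return html.unescape(text)
```

With this, both the UTF-8 bytes in the page and entities such as &auml;
end up as ordinary characters in one Unicode string.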
> Now, whenever I'm given HTML from IE's cache, it is unicode. There is no doubt
> about that.
Are you sure? Which encoding of Unicode? UTF-16, UTF-8, ...
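
One way to check is to look for a byte order mark (BOM) at the start of
the raw data. The following is only a sketch in current Python 3; the
function name guess_unicode_encoding is made up for illustration, and data
without a BOM (which is common for UTF-8) cannot be identified this way.

```python
import codecs

def guess_unicode_encoding(raw):
    """Guess the encoding from a leading BOM; return None if there is none."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec uses the BOM to pick the byte order
    return None
```

If the function returns a codec name, raw.decode(name) yields the text
without the BOM; if it returns None, fall back to the declared charset.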