Using lxml to screen scrape a site, problem with charset

Tim Arnold tim.arnold at
Mon Feb 2 18:45:25 CET 2009

"Дамјан Георгиевски" <gdamjan at> wrote in message 
news:ciqh56-ses.ln1 at
> So, I'm using lxml to screen scrape a site that uses the cyrillic
> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
> ..content-type.. charset=..> header, but does have an HTTP header that
> specifies the charset... so they are standards compliant enough.
> Now when I run this code:
> from lxml import html
> doc = html.parse('')
> root = doc.getroot()
> title = root.cssselect('head title')[0]
> print title.text
> the title.text is a unicode string, but it has been wrongly decoded as
> latin-1 -> unicode.
> So.. is this a deficiency/bug in lxml, or am I doing something wrong?
> Also, what are my other options here?
> I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.
> -- 
> дамјан ( )
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." - Brian W. Kernighan

The way I do that is to open the file with codecs (encoding='cp1251'), read 
it into a variable, and feed that decoded text to the parser.
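A minimal sketch of what I mean, using a locally written windows-1251 test 
page to stand in for the downloaded site (the filename and the cyrillic 
title are made up for the example). The second approach, passing an 
explicit encoding to lxml's HTMLParser, should also avoid the latin-1 
fallback:

```python
import codecs
import os
import tempfile
from lxml import html

# Build a small windows-1251 encoded page to stand in for the
# remote site (the title is cyrillic for "Hello").
page = (u'<html><head><title>\u0417\u0434\u0440\u0430\u0432\u043e'
        u'</title></head><body></body></html>')
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with codecs.open(path, 'w', encoding='cp1251') as f:
    f.write(page)

# Approach 1: decode the bytes yourself with codecs, then hand the
# resulting unicode string to lxml.
with codecs.open(path, encoding='cp1251') as f:
    text = f.read()
root = html.fromstring(text)
assert root.find('.//title').text == u'\u0417\u0434\u0440\u0430\u0432\u043e'

# Approach 2: tell lxml the encoding explicitly via HTMLParser,
# so it does not guess wrong.
parser = html.HTMLParser(encoding='cp1251')
root2 = html.parse(path, parser=parser).getroot()
assert root2.find('.//title').text == u'\u0417\u0434\u0440\u0430\u0432\u043e'
```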


More information about the Python-list mailing list