> So, I'm using lxml to screen scrape a site that uses the Cyrillic
> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
> ..content-type.. charset=..> header, but does have an HTTP header that
> specifies the charset... so they are standards compliant enough.
> Now when I run this code:
> from lxml import html
> doc = html.parse('')
> root = doc.getroot()
> title = root.cssselect('head title')[0]
> print title.text
> the title.text is a unicode string, but it has been wrongly decoded as
> latin1 -> unicode
> So.. is this a deficiency/bug in lxml, or am I doing something wrong?
> Also, what are my other options here?
> I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if that matters.
The way I do that is to open the file with codecs, using encoding=cp1251,
read it into a variable, and feed that to the parser.
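
Roughly, that looks like the sketch below. It assumes the page has already been saved to a local file; the helper names read_cp1251 and page_title are made up for illustration, and I haven't checked how this behaves on lxml 2.1.4 specifically.

```python
import codecs


def read_cp1251(path):
    """Read a saved page and decode it from windows-1251 to Unicode text."""
    # codecs.open decodes while reading, so the result is already Unicode.
    with codecs.open(path, "r", encoding="cp1251") as f:
        return f.read()


def page_title(path):
    """Return the <title> text of a cp1251-encoded HTML file."""
    # Imported here so read_cp1251 stays usable without lxml installed.
    from lxml import html

    # The parser receives decoded text, so it has no encoding to guess.
    root = html.fromstring(read_cp1251(path))
    return root.findtext(".//title")
```

Since the text is decoded before lxml sees it, the parser never has to fall back to a default encoding. Calling page_title on the saved file should then give the title as a proper unicode string.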


