Using lxml to screen scrape a site, problem with charset

Tim Arnold tim.arnold at sas.com
Mon Feb 2 12:45:25 EST 2009


"?????? ???????????" <gdamjan at gmail.com> wrote in message 
news:ciqh56-ses.ln1 at archaeopteryx.softver.org.mk...
> So, I'm using lxml to screen scrape a site that uses the Cyrillic
> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
> ..content-type.. charset=..> header, but does have an HTTP header that
> specifies the charset... so they are standards compliant enough.
>
> Now when I run this code:
>
> from lxml import html
> doc = html.parse('http://a1.com.mk/')
> root = doc.getroot()
> title = root.cssselect('head title')[0]
> print title.text
>
> the title.text is a unicode string, but it has been wrongly decoded
> (as latin1 -> unicode instead of windows-1251 -> unicode).
>
> So, is this a deficiency/bug in lxml, or am I doing something wrong?
> Also, what are my other options here?
>
>
> I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.
>
> -- 
> Дамјан ( http://softver.org.mk/damjan/ )
>
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." - Brian W. Kernighan
>

The way I do that is to open the file with codecs, using encoding='cp1251',
read it into a variable, and feed that to the parser.
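
A minimal sketch of that approach, assuming the page has either been saved
to a local file or fetched as raw bytes. The helper names parse_html_file
and parse_html_bytes are made up for illustration, and io.open stands in
for codecs.open (both decode while reading). The point is only that the
parser is handed already-decoded text, so it never has to guess the charset:

```python
import io

from lxml import html


def parse_html_file(path, encoding="cp1251"):
    """Read a saved page with an explicit encoding and return the root element."""
    # Decoding happens here, while reading, not inside the HTML parser.
    with io.open(path, encoding=encoding) as f:
        text = f.read()
    return html.fromstring(text)


def parse_html_bytes(raw, encoding="cp1251"):
    """Same idea for bytes already in memory (e.g. an HTTP response body)."""
    return html.fromstring(raw.decode(encoding))


if __name__ == "__main__":
    # Hypothetical sample page: a Cyrillic title encoded as windows-1251.
    sample = (u"<html><head><title>\u0412\u0435\u0441\u0442\u0438</title></head>"
              u"<body><p>x</p></body></html>")
    root = parse_html_bytes(sample.encode("cp1251"))
    print(root.findtext(".//title"))
```

For a live site, the encoding would come from the charset in the HTTP
Content-Type header rather than being hard-coded as it is here.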

--Tim




