Using lxml to screen scrape a site, problem with charset
gdamjan at gmail.com
Mon Feb 2 01:15:39 CET 2009
So, I'm using lxml to screen scrape a site that uses the Cyrillic
alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
..content-type.. charset=..> header, but does have an HTTP header that
specifies the charset... so they are standards compliant enough.
Now when I run this code:
from lxml import html
doc = html.parse('http://a1.com.mk/')
root = doc.getroot()
title = root.cssselect('head title')[0]  # cssselect() returns a list
title.text is a unicode string, but it has been wrongly decoded as
latin1 -> unicode
So.. is this a deficiency/bug in lxml, or am I doing something wrong?
Also, what are my other options here?
I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.
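One option that might work, as a rough sketch only: fetch the raw bytes yourself, read the charset from the HTTP Content-Type header, decode to unicode, and only then hand the text to lxml. The helper names below (charset_from_content_type, decode_body, parse_page) and the iso-8859-1 fallback are assumptions made for illustration, not part of lxml, and the sketch is written for Python 3 rather than the 2.6 mentioned above.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumed fallback) when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode the response body using the charset named in the HTTP header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


def parse_page(raw_bytes, content_type):
    """Parse already-decoded text so lxml does not have to guess the encoding."""
    from lxml import html  # imported here so the helpers above need only the stdlib
    return html.fromstring(decode_body(raw_bytes, content_type))
```

With the body and the Content-Type header taken from the HTTP response, parse_page(body, content_type).findtext('.//title') should then return the title as proper Cyrillic text, provided the server's header names the right charset.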
дамјан ( http://softver.org.mk/damjan/ )
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan