Charset issue with etree - lxml - The Python XML Toolkit

10 Dec 2013

Hello,
parsing HTML files with etree.HTMLParser works fine for UTF-8 files
saved from my Linux OS. But for ISO-8859-2 files saved from other
operating systems, I get a charset issue. I have to convert the files to
UTF-8 with a command such as "iconv -f latin1 -t utf8". I would like to
do this with Python only.

Do you already have a function to detect the encoding and change it
automatically? Something like:

------------------------------------------------------------------------
import chardet
import glob

files = glob.glob('*html')

for filename in files:
    encoding = chardet.detect(filename)['encoding']
    if encoding != 'utf-8':
        filename.decode(encoding,'replace').encode('utf-8')
------------------------------------------------------------------------

Since the last line of the example above does not resolve my issue, do
you have any good practices to recommend, perhaps specific to lxml? I
have seen some links on the web, but I can hardly find a clean solution
for this issue.

Thanks for your help,

-- 
                Alexandre Delanoë
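
For reference, the conversion asked about above (the Python counterpart of
the iconv call) can be sketched with the standard library alone. Note that
the snippet in the post passes the file *name* to chardet.detect() and to
decode(), whereas both need the file's *bytes*. The sketch below is only
one possible approach and not an lxml API: the function name
convert_to_utf8 and the fallback parameter are hypothetical, and it
assumes that any file which is not valid UTF-8 is in the given fallback
encoding (ISO-8859-2 here, taken from the post) instead of guessing the
encoding.

```python
from pathlib import Path


def convert_to_utf8(path, fallback="iso-8859-2"):
    """Rewrite the file at `path` as UTF-8; return the encoding it was read as.

    `fallback` is an assumption: the encoding to use when the bytes
    are not valid UTF-8.
    """
    raw = Path(path).read_bytes()       # operate on the content, not the name
    try:
        text = raw.decode("utf-8")      # strict decoding rejects non-UTF-8 input
        source = "utf-8"
    except UnicodeDecodeError:
        text = raw.decode(fallback, "replace")
        source = fallback
    if source != "utf-8":
        # Only files that were actually in the legacy encoding are rewritten.
        Path(path).write_bytes(text.encode("utf-8"))
    return source
```

It could be applied to every file in a directory with something like
"for f in Path('.').glob('*.html'): convert_to_utf8(f)". Two caveats: a
charset declaration inside the HTML (a meta tag) is left untouched and may
need updating, and, as an alternative that avoids rewriting files,
etree.HTMLParser accepts an encoding keyword argument that overrides the
encoding used for parsing.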
