Hello,

parsing HTML files with etree.HTMLParser works fine for UTF-8 files saved from my Linux OS. But for ISO-8859-2 files saved from other operating systems, I get a charset issue. I have to convert the files to UTF-8 with a command such as "iconv -f latin1 -t utf8". I would like to do it with Python only.

Do you already have a function to detect the encoding and change it automatically? Something like:

------------------------------------------------------------------------
import chardet
import glob

files = glob.glob('*html')

for filename in files:
    encoding = chardet.detect(filename)['encoding']
    if encoding != 'utf-8':
        filename.decode(encoding,'replace').encode('utf-8')
------------------------------------------------------------------------

Since the last line of the example above does not resolve my issue, do you have any good practices to advise, maybe specific to lxml? I have seen some links on the web, but I can hardly find a nice solution for this issue.

Thanks for your help,
-- 
Alexandre Delanoë
Alexandre Delanoë, 10.12.2013 14:07:
> parsing HTML files with etree.HTMLParser works fine for UTF-8 files saved from my Linux OS. But for ISO-8859-2 files saved from other operating systems, I get a charset issue. I have to convert the files to UTF-8 with a command such as "iconv -f latin1 -t utf8". I would like to do it with Python only.
>
> Do you already have a function to detect the encoding and change it automatically? Something like:
>
> ------------------------------------------------------------------------
> import chardet
> import glob
>
> files = glob.glob('*html')
>
> for filename in files:
>     encoding = chardet.detect(filename)['encoding']
>     if encoding != 'utf-8':
>         filename.decode(encoding,'replace').encode('utf-8')
> ------------------------------------------------------------------------
>
> Since the last line of the example above does not resolve my issue, do you have any good practices to advise, maybe specific to lxml? I have seen some links on the web, but I can hardly find a nice solution for this issue.
It seems to me that this has nothing specifically to do with lxml. Your problem is badly/differently encoded file names. The solution might be to make sure you only use one encoding for file names, i.e. the one that your operating system (read: file system) expects.

Stefan
On Friday, 13 December 2013, at about 07:09, Stefan Behnel wrote:
> It seems to me that this has nothing specifically to do with lxml. Your problem is badly/differently encoded file names. The solution might be to make sure you only use one encoding for file names, i.e. the one that your operating system (read: file system) expects.
You are right, but the parser is supposed to parse files from different operating systems. So I have patched the parser with a bash test that transcodes files from latin1 to UTF-8 (it may be ugly, but it works).

Many thanks for your reply.
-- 
Alexandre Delanoë
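The bash transcoding step described above can also be done in Python alone. The following is a minimal sketch, not a definitive implementation: `to_utf8` and `convert_file` are hypothetical helper names, and the ISO-8859-2 fallback is an assumption taken from the encoding named in this thread. It works on file contents (bytes), not on file names, and it does not guess among several legacy encodings.

```python
from pathlib import Path


def to_utf8(data, fallback="iso-8859-2"):
    """Return `data` (bytes) re-encoded as UTF-8.

    If the bytes are already valid UTF-8 they are returned unchanged.
    Otherwise they are assumed (an assumption, not a detection) to be
    in the `fallback` encoding and are transcoded.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(fallback, "replace").encode("utf-8")


def convert_file(path, fallback="iso-8859-2"):
    """Rewrite the file at `path` in place as UTF-8 (hypothetical helper)."""
    path = Path(path)
    path.write_bytes(to_utf8(path.read_bytes(), fallback))
```

If the encoding of a given file is known in advance, transcoding may not be needed at all: lxml's `etree.HTMLParser` accepts an `encoding` keyword argument that overrides the encoding used for parsing.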
Alexandre Delanoë, 16.12.2013 16:57:
> On Friday, 13 December 2013, at about 07:09, Stefan Behnel wrote:
>> It seems to me that this has nothing specifically to do with lxml. Your problem is badly/differently encoded file names. The solution might be to make sure you only use one encoding for file names, i.e. the one that your operating system (read: file system) expects.
>
> You are right, but the parser is supposed to parse files from different operating systems.
Well, it does, as long as you stay within the bounds of each operating system. Note that the parser is only ever running on one operating system at a time. (Although, as I said, the problem you describe is not an OS issue but a file system issue.)

As soon as you start transferring files between different systems, it's your own responsibility to adapt the files and/or their names as needed. For example, you may have to adapt the encoding that a file system is mounted with in order to integrate it properly into the currently running system.

Basically, using different encodings on the same file system is just screaming for trouble in all sorts of places. Imagine the case where a directory name is encoded in one encoding and a file name in that directory uses a different encoding. Then there is simply no way to decode the complete file path any more.

Stefan
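The mixed-encoding path case described above can be illustrated with a small sketch. The directory and file names are made up for the example, and `try_decode` is a hypothetical helper; the point is only that no single encoding decodes the whole path correctly.

```python
# Hypothetical path: the directory name is encoded as UTF-8,
# the file name inside it as ISO-8859-2.
dir_name = "données".encode("utf-8")
file_name = "żółw.html".encode("iso-8859-2")
raw_path = dir_name + b"/" + file_name


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding the whole path as UTF-8 fails: the file name is not valid UTF-8.
as_utf8 = try_decode(raw_path, "utf-8")

# Decoding the whole path as ISO-8859-2 does not raise, but the
# directory name comes out garbled, so the result is still wrong.
as_latin2 = try_decode(raw_path, "iso-8859-2")

# Only decoding each component with its own encoding recovers the names,
# and that requires knowing which encoding each component uses.
recovered = dir_name.decode("utf-8") + "/" + file_name.decode("iso-8859-2")
```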