Need help with file encoding-decoding
philip at semanchuk.com
Fri Sep 23 15:14:15 CEST 2011
On Sep 23, 2011, at 7:44 AM, Yaşar Arabacı wrote:
> I'm trying to write a mass HTML downloader that processes files after it has
> downloaded them. I have problems with encoding and decoding. Sometimes I
> get UnicodeDecodeErrors, or
> I get half-pages after the processing part. Or, more generally, some things
> don't feel right. Can you check my approach and give me some feedback,
> please? Here is what I am doing.
> 1) send a HEAD request to file's source to get file encoding, set encoding
> variable accordingly.
This is a pretty optimistic algorithm, at least by the statistics from 2008 (see below).
> 2) if server doesn't provide an encoding, set encoding variable as utf-8
This is statistically a good guess but it doesn't follow the HTTP specification.
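For reference, HTTP/1.1 (RFC 2616, section 3.7.1) says that text/* media types sent without an explicit charset default to ISO-8859-1, not UTF-8. Here is a minimal sketch, written for Python 3, of pulling the charset out of a Content-Type header value with an explicit fallback. The function name and the fallback parameter are just illustrative choices of mine, not part of any library:

```python
from email.message import Message


def charset_from_content_type(header_value, fallback="iso-8859-1"):
    """Return the charset named in a Content-Type header value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter lowercased,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or fallback
```

Whether you fall back to ISO-8859-1 (what the spec says) or UTF-8 (what is statistically likely) is a judgment call; passing the fallback explicitly at least makes the choice visible.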
> 4) in this step, I need to parse the content I get, because I will search
> for further links.
> I feed content to parser (subclass of HTMLParser.HTMLParser) like
Does HTMLParser.HTMLParser handle broken HTML? Because there's lots of it out there.
I used to run an automated site validator, and I wrote a couple of articles you might find interesting. One is about how to get the encoding of a Web page:
I also wrote an article examining the statistics I'd seen run through the crawler/validator. One thing I saw was that almost 2/3 of Web pages specified the encoding in the META HTTP-EQUIV Content-Type tag rather than in the HTTP Content-Type header. Mind you, this was three years ago so the character of the Web has likely changed since then, but probably not too dramatically.
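Since so many pages declare their encoding only in the markup, it can be worth sniffing the raw bytes for a META declaration before decoding. The following is a rough Python 3 sketch, not a full implementation of any standard's sniffing rules; the regular expression, the function name, and the 4096-byte scan limit are all assumptions of mine:

```python
import re

# Matches charset=... inside a <meta> tag, covering both the HTTP-EQUIV form
# (content="text/html; charset=...") and the shorter <meta charset=...> form.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def charset_from_meta(raw_bytes, scan_limit=4096):
    """Return the charset declared in a META tag near the top of the page, or None."""
    match = META_CHARSET.search(raw_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

This works on bytes rather than text because the declaration has to be found before you know how to decode the page. It will be fooled by a META tag inside a comment or script, so treat the result as a hint.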
You can also do some straightforward debugging. Save the raw bytes you get from each site, and when you encounter a decode error, check the raw bytes. Are they really in the encoding specified? Webmasters make all kinds of mistakes.
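To make that kind of debugging easier, you can catch the UnicodeDecodeError and look at the bytes around the failure. This is a small Python 3 sketch; the function name, the returned dictionary, and the context size are my own invention for illustration:

```python
def explain_decode_failure(raw_bytes, encoding, context=16):
    """Try to decode raw_bytes; return details about the first failure, or None on success."""
    try:
        raw_bytes.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the offending bytes in the input.
        start = max(err.start - context, 0)
        return {
            "offset": err.start,
            "reason": err.reason,
            "nearby": raw_bytes[start:err.end + context],
        }
    return None
```

Note that an unknown encoding name raises LookupError rather than UnicodeDecodeError, and this sketch deliberately lets that propagate, since a misspelled charset from a server is a different problem from mis-encoded content.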
Hope this helps
> this -> content.decode(encoding)
> 5) open a file in binary mode: open(file_path,"wb")
> 6) I write as I read, without modifying.
> # After processing part....
> (Note: the encoding variable is the same as in the downloading part)
> 1) open local file in binary mode for reading file_name =
> 2) decode the file contents into a variable => decoded_content =
> 3) send decoded content to a parser, parser constructs new html content. (as
> 4) open same file for writing, in binary mode, write the parser's output like
> this: file_name.write(parser.output.encode(encoding))
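The read, decode, process, encode, write cycle described in the quoted steps could be sketched in Python 3 roughly as follows. This is only an illustration of the approach as I understand it: reprocess_file and its transform argument are hypothetical names standing in for your parser step, not your actual code.

```python
def reprocess_file(file_path, encoding, transform):
    """Decode a saved page, run it through transform, and write it back in the same encoding."""
    # Read the raw bytes and close the file before reopening it for writing.
    with open(file_path, "rb") as handle:
        decoded_content = handle.read().decode(encoding)

    new_content = transform(decoded_content)

    # Encode with the same encoding that was used for decoding.
    with open(file_path, "wb") as handle:
        handle.write(new_content.encode(encoding))
```

Using "with" guarantees that each file handle is closed, and the output flushed, before the next step touches the same path.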