Handling text lines from files with some (few) strange chars

Sat Jun 5 23:05:54 EDT 2010

On Jun 6, 12:14 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Paulo da Silva wrote:
> > Em 06-06-2010 00:41, Chris Rebert escreveu:
> >> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
> >> <psdasilva.nos... at netcabonospam.pt> wrote:
> > ...
>
> >> Specify the encoding of the text when opening the file using the
> >> `encoding` parameter. For example, for Windows-1252:
>
> >> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>
> > OK! This fixes my current problem. I used encoding="iso-8859-15". This
> > is how my text files are encoded.
> > But what about a more general case where the encoding of the text file
> > is unknown? Is there anything like "autodetect"?
>
> An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
> How could you tell which was the correct encoding?
>
> Well, if the file contained words in a certain language and some of the
> characters were wrong, then you'd know that the encoding was wrong. This
> does imply, though, that you'd need to know what the language should
> look like!
>
> You could try different encodings, and for each one try to identify what
> could be words, then look them up in dictionaries for various languages
> to see whether they are real words...

This has been automated (semi-successfully, with caveats) by the
chardet package; see http://chardet.feedparser.org/
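
For a quick stdlib-only illustration of the ambiguity MRAB describes, here
is a rough sketch (not chardet's method, which is statistical). The helper
name decode_with_fallback and the candidate list are made up for this
example. It returns the first candidate encoding that decodes the bytes
without error, which only proves the bytes are *legal* in that encoding,
not that the resulting text is what the author wrote.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252", "iso-8859-15")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Order matters: strict encodings such as UTF-8 go first, because
    single-byte encodings accept almost any byte sequence.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


# The same byte is valid in both code pages but is a different letter:
# U+00E6 (ae ligature) in cp1252, U+0107 (c with acute) in cp1250.
raw = b"\xe6"
as_cp1252 = raw.decode("cp1252")
as_cp1250 = raw.decode("cp1250")
print(ascii(as_cp1252), ascii(as_cp1250))

# Bytes that are not valid UTF-8 fall through to the next candidate.
text, used = decode_with_fallback(b"caf\xe9")
print(ascii(text), used)
```

Neither decode of b"\xe6" raises an error, so trial decoding alone cannot
choose between cp1252 and cp1250; that takes knowledge of the language,
which is the part chardet tries to estimate. Its detect() function takes
the raw bytes and returns a guessed encoding together with a confidence
value.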