Encoding detection in the standard library?

Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one? After some googling, I've come across this: http://mail.python.org/pipermail/python-3000/2006-September/003537.html but I can't find any changes that resulted from that thread.

David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?

No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it. Skip

skip@pobox.com wrote:
The only approach I know of is a heuristic-based one, e.g. http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml (which was 'borrowed' from docutils in the first place). Michael Foord
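A minimal sketch of that kind of heuristic -- try a list of candidate encodings in order and take the first that decodes cleanly. The candidate list and BOM handling here are illustrative assumptions, not the actual docutils/voidspace code:

    import codecs

    CANDIDATES = ("ascii", "utf-8")   # illustrative priority order, not the real recipe

    def guess_decode(data):
        # A UTF-8 BOM is one of the few unambiguous signals available.
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
        # Try the candidates in order; the first clean decode wins.
        for name in CANDIDATES:
            try:
                return data.decode(name), name
            except UnicodeDecodeError:
                continue
        # Latin-1 maps every byte to a character, so this fallback never fails.
        return data.decode("latin-1"), "latin-1"

    text, encoding = guess_decode("café".encode("utf-8"))
    print(encoding)   # 'utf-8'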

On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
This isn't the only approach, although you're right that in general you have to rely on heuristics. See the charset detection features of ICU: http://www.icu-project.org/userguide/charsetDetection.html I think OSAF's pyicu exposes these APIs: http://pyicu.osafoundation.org/ Jean-Paul
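For reference, calling the ICU detector from Python might look roughly like this; the method names follow ICU4C's CharsetDetector API, which PyICU mirrors, so treat the exact binding details as an assumption rather than documentation:

    from icu import CharsetDetector   # third-party PyICU package

    def icu_guess(data):
        detector = CharsetDetector()
        detector.setText(data)          # raw bytes to examine
        match = detector.detect()       # best single match
        return match.getName(), match.getConfidence()

    print(icu_guess("résumé".encode("latin-1")))   # e.g. ('ISO-8859-1', <0-100 confidence>)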

Michael> The only approach I know of is a heuristic-based one, e.g.
Michael> http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml
Michael> (which was 'borrowed' from docutils in the first place).

Yes, I implemented a heuristic approach for the Musi-Cal web server. I was able to rely on domain knowledge to guess correctly almost all the time. The heuristic was that almost all form submissions came from the US, and those that didn't came from Western Europe. Python could never embed such a narrow-focused heuristic in its core distribution. Skip

On 21-Apr-08, at 12:44 PM, skip@pobox.com wrote:
Sorry, I wasn't very clear about what I was asking. I was thinking about making an educated guess -- just like chardet (http://chardet.feedparser.org/). This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or an email message), and you want to do something useful with it (say, search the content).

At 1:14 PM -0400 4/21/08, David Wolever wrote:
Feedparser.org's chardet can't guess 'latin1', so it should be used as a last resort, just as the docs say. -- TonyN. <mailto:tonynelson@georgeanelson.com> <http://www.georgeanelson.com/>

On Mon, Apr 21, 2008 at 06:37:20PM -0300, Rodrigo Bernardo Pimentel wrote:
The famous chardet returns a probability for its guess.
Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

On 21-Apr-08, at 5:31 PM, Martin v. Löwis wrote:
IMO, encoding estimation is something that many web programs will have to deal with, so it might as well be built in; I would prefer the option to run `text=input.encode('guess')` (or something similar) rather than relying on an external dependency or, worse yet, using a hand-rolled algorithm.
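There is no 'guess' codec, but a small wrapper over chardet gives roughly that behaviour (decoding rather than encoding, in Python 3 terms). The 0.5 confidence threshold and the latin-1 fallback below are arbitrary choices made for illustration, not part of any standard:

    import chardet   # third-party package from chardet.feedparser.org

    def decode_guess(data, fallback="latin-1"):
        result = chardet.detect(data)      # {'encoding': ..., 'confidence': ...}
        encoding = result["encoding"]
        if encoding and result["confidence"] >= 0.5:
            try:
                return data.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass
        # Fall back to something that cannot fail, accepting possible mojibake.
        return data.decode(fallback, errors="replace")

    print(decode_guess("naïve text".encode("utf-8")))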

IMO, encoding estimation is something that many web programs will have to deal with
Can you please explain why that is? Web programs should not normally need to detect the encoding; instead, it should always be specified - unless you are talking about browsers specifically, which need to support web pages that specify the encoding incorrectly.
Ok, let me put it differently, then. Please feel free to post a patch to bugs.python.org, and let other people rip it apart. For example, I don't think it should be a codec, as I can't imagine it working on streams. Regards, Martin

On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
[...] the page to the browser), but no guarantees. Email is a smaller problem, because it usually has a helpful content-type header, but that's no guarantee. Now, at the moment, the only data I have to support this claim is my experience with DrProject in non-English locations. If I'm the only one who has had these sorts of problems, I'll go back to "Unicode for Dummies".
As things frequently are, this seems to be a much larger problem than I originally believed. I'll go back and take another look at the problem, then come back if new revelations appear.

When a web browser POSTs data, there is no standard way of communicating which encoding it's using.
That's just not true. Web browsers should and do use the encoding of the web page that originally contained the form.
Not true. The latter is guaranteed (unless you assume bugs - but if you do, can you present a specific browser that has that bug?)
Email is a smaller problem, because it usually has a helpful content-type header, but that's no guarantee.
Then assume windows-1252. Mailers that don't use MIME for non-ASCII characters mostly died out 10 years ago; those people who continue to use them can likely accept occasional mojibake (or else they would have switched long ago).
For web forms, I always encode the pages in UTF-8, and that always works. For email, I once added encoding processing to pipermail (the Mailman archiver), and that also always works.
I'll go back and take another look at the problem, then come back if new revelations appear.
Good luck! Martin

Since the site that receives the POST doesn't necessarily have access to the Web page that originally contained the form, that's not really helpful. However, POSTs can use the MIME type "multipart/form-data" for non-Latin-1 content, and should. That contains facilities for indicating the encoding and other things as well. Bill

Bill Janssen writes:
You must be very special to get only compliant email. About half my colleagues use RFC 2047 to encode Japanese file names in MIME attachments (a MUST NOT behavior according to RFC 2047), and a significant fraction of the rest end up with binary Shift JIS or EUC or MacRoman in there. And those are just the most widespread violations I can think of off the top of my head. Not to mention that I find this: =?X-UNKNOWN?Q?Martin_v=2E_L=F6wis?= <martin@v.loewis.de>, in the header I got from you. (I'm not ragging on you, I get Martin's name wrong a significant portion of the time myself. :-( )

martin@v.loewis.de writes:
I wonder if the discussion is confusing two different things. Take a look at <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4>. There are two prescribed ways of sending form data: application/x-www-form-urlencoded, which can only be used with ASCII data, and multipart/form-data. ``The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.'' It's true that the page containing the form may specify which of these two forms to use, but the character encodings are determined by the choice.
For web forms, I always encode the pages in UTF-8, and that always works.
Should work, if you use the "multipart/form-data" format. Bill

On 2008-04-21 23:31, Martin v. Löwis wrote:
+1

I also think that it's better to educate people to add (correct) encoding information to their text data, rather than give them a guess mechanism...

http://chardet.feedparser.org/docs/faq.html#faq.yippie

chardet is based on the Mozilla algorithm, and at least in my experience that algorithm doesn't work too well. The Mozilla algorithm may work for Asian encodings, because those encodings are usually also bound to a specific language (so you can use character and word frequency analysis), but for encodings that can encode far more than just a single language (e.g. UTF-8 or Latin-1), the correct detection rate is rather low.

The problem becomes even more difficult when you leave the plain-text domain or mix languages in the same text, e.g. when trying to detect source code with comments written in a non-ASCII encoding.

The "trick" of just passing the text through a codec and seeing whether it round-trips doesn't necessarily help either: Latin-1, for example, will always round-trip, since Latin-1 is a subset of Unicode.

IMHO, more research has to be done in this area before a "standard" module can be added to Python's stdlib... and who knows, perhaps we'll be lucky and by then everyone will be using UTF-8 anyway :-)

-- Marc-Andre Lemburg, eGenix.com
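A short demonstration of the point about the round-trip "trick" (the bytes are arbitrary and chosen only for illustration):

    import os

    data = bytes(range(256)) + os.urandom(64)   # arbitrary, even random, bytes
    text = data.decode("latin-1")               # never raises: every byte maps to a code point
    assert text.encode("latin-1") == data       # and it always round-trips

    # UTF-8, by contrast, rejects most arbitrary byte strings.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        print("utf-8 rejected the data, as expected")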

I walked over to our computational linguistics group and asked. Encoding detection is often combined with language guessing (which uses a similar approach, but with characters instead of bytes), and apparently it can usually be done with high confidence. Of course, they're usually looking at clean texts, not random "stuff". I'll see if I can get some references and report back -- most of the research on this was done in the '90s. Bill

The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami, Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go about this. They're looking at "LSE"s, language-script-encoding triples; a "script" is a way of using a particular character set to write in a particular language. Their system has these requirements:

R1. the response must be either "correct answer" or "unable to detect", where "unable to detect" includes "other than registered" [i.e. outside the registered set of LSEs];
R2. applicable to multi-LSE texts;
R3. never accept a wrong answer, even when the program does not have enough data on an LSE; and
R4. applicable to any LSE text.

So, no wrong answers. The biggest disadvantage would seem to be that the registration data for a particular LSE is kind of bulky: on the order of 10,000 shift-codons of three bytes each, about 30K uncompressed.

http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

Bill
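A toy illustration of the byte n-gram idea: compare the trigrams of an input against registered per-LSE profiles and refuse to answer unless one profile matches well enough. The profiles here are built from tiny hard-coded samples purely for demonstration; the real method registers large corpora per LSE:

    from collections import Counter

    def trigrams(data):
        return Counter(data[i:i + 3] for i in range(len(data) - 2))

    def overlap(sample, profile):
        # Fraction of the sample's trigrams that also occur in the profile.
        shared = sum(min(sample[g], profile[g]) for g in sample)
        return shared / max(1, sum(sample.values()))

    PROFILES = {
        "de/latin-1": trigrams("Grüße aus München, schöne Straße".encode("latin-1")),
        "de/utf-8": trigrams("Grüße aus München, schöne Straße".encode("utf-8")),
    }

    def classify(data, threshold=0.2):
        scores = {lse: overlap(trigrams(data), p) for lse, p in PROFILES.items()}
        best = max(scores, key=scores.get)
        # In the spirit of R1/R3: refuse to answer rather than risk a wrong one.
        return best if scores[best] >= threshold else "unable to detect"

    print(classify("Größe und Maße".encode("utf-8")))   # 'de/utf-8'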

On 2008-04-22 18:33, Bill Janssen wrote:
Thanks for the reference. Looks like the existing research on this just hasn't made it into the mainstream yet. Here's their current project: http://www.language-observatory.org/ Looks like they are focusing more on language detection.

Another interesting paper using n-grams: "Language Identification in Web Pages" by Bruno Martins and Mário J. Silva http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf

And one using compression: "Text Categorization Using Compression Models" by Eibe Frank, Chang Chui, Ian H. Witten http://portal.acm.org/citation.cfm?id=789742
For a server-based application that doesn't sound too large. Unless you're aiming for a very broad scope, I don't think you'd need more than a few hundred LSEs for a typical application - nothing you'd want to put in the Python stdlib, though.
-- Marc-Andre Lemburg, eGenix.com

On 22-Apr-08, at 3:31 AM, M.-A. Lemburg wrote:
That is a fallacious alternative: the programmers that need encoding detection are not the same people who are omitting encoding information. I only have a small opinion on whether charset detection should appear in the stdlib, but I am somewhat perplexed by the arguments in this thread. I don't see how inclusion in the stdlib would make people more inclined to think that the algorithm is always correct. In terms of the need for this functionality, Martin wrote: [...]
Any program that needs to examine the contents of documents/feeds/whatever on the web needs to deal with incorrectly-specified encodings (which, sadly, is rather common). The set of programs that need this functionality is probably the same set that needs BeautifulSoup--I think that set is larger than just browsers <grin> -Mike

That's not true. Most programs that need to examine the contents of a web page don't need to guess the encoding. In most such programs, the encoding can be hard-coded if the declared encoding is not correct. Most such programs *know* what page they are webscraping, or else they couldn't extract the information they want from it. As for feeds - can you give examples of incorrectly encoded ones? (I don't ever use feeds, so I honestly don't know whether they are typically encoded incorrectly. I've heard they are often XML, in which case I strongly doubt they are incorrectly encoded.) As for "whatever" - can you give specific examples?
Again, can you give *specific* examples that are not web browsers? Programs needing BeautifulSoup may still not need encoding guessing, since they still might be able to hard-code the encoding of the web page they want to process. In any case, I'm very skeptical that a general "guess encoding" module would do a meaningful thing when applied to incorrectly encoded HTML pages. Regards, Martin

On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:
I certainly agree that if the target set of documents is small enough, it is possible to hand-code the encoding. There are many applications, however, that need to examine the content of an arbitrary, or at least non-small, set of web documents. To name a few such applications:
- web search engines
- translation software
- document/bookmark management systems
- other kinds of document analysis (market research, SEO, etc.)
I also don't have much experience with feeds. My statement is based on the fact that chardet, the tool that has been cited most in this thread, was written specifically for use with the author's feed parsing package.
As for "whatever" - can you give specific examples?
Not that I can substantiate. Documents and feeds cover a lot of what is on the web--I was only trying to make the point that on the web, whenever an encoding can be specified, it will be specified incorrectly for a significant chunk of exemplars.
Indeed, if it is only one site, it is pretty easy to work around. My main use of Python is processing and analyzing hundreds of millions of web documents, so it is pretty easy to see applications (which I have listed above). I think that libraries like Mark Pilgrim's FeedParser and BeautifulSoup are possible consumers of guessing as well.
Well, it does. I wish I could easily provide data on how often it is necessary over the whole web, but that would be somewhat difficult to generate. I can say that it is much more important to be able to parse all the different kinds of encoding _specification_ on the web (Content-Type and Content-Encoding headers, <meta http-equiv> tags, etc.), and the malformed cases of these. I can also think of good arguments for excluding encoding detection for maintenance reasons: is every case of the algorithm guessing wrong a bug that needs to be fixed in the stdlib? That is an unbounded commitment. -Mike
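A sketch of the "parse the declared encoding" side Mike mentions: pull a charset out of a Content-Type header and, failing that, out of a <meta> tag. The regex is deliberately tolerant and far from exhaustive; it is illustrative only:

    import re
    from email.message import Message

    def charset_from_content_type(header_value):
        msg = Message()
        msg["Content-Type"] = header_value
        return msg.get_content_charset()      # None if no charset parameter

    META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*['\"]?\s*([-\w.:]+)", re.IGNORECASE)

    def charset_from_html(body):
        match = META_RE.search(body[:4096])   # only scan the head of the page
        return match.group(1).decode("ascii", "replace").lower() if match else None

    print(charset_from_content_type("text/html; charset=UTF-8"))            # 'utf-8'
    print(charset_from_html(b'<meta http-equiv="Content-Type" '
                            b'content="text/html; charset=iso-8859-1">'))   # 'iso-8859-1'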

I'll question whether these are "many" programs. Web search engines and translation software have many more challenges to master, and they are fairly specialized, so I would expect them to need to find their own answer to character set detection anyway (see Bill Janssen's answer on machine translation, also).
- document/bookmark management systems
- other kinds of document analysis (market research, SEO, etc.)
Not sure what specifically you have in mind; however, I expect that these also have their own challenges. For example, I would expect that MS-Word documents are frequent. You don't need character set detection there (Word is all Unicode), but you need an API to look into the structure of .doc files.
I firmly believe this assumption is false. If the encoding comes out of software (which it often does), it will be correct most of the time. It's incorrect only if the content editor has to type it.
Ok. What advantage would you (or somebody working on a similar project) gain if chardet was part of the standard library? What if it was not chardet, but some other algorithm?
Indeed, that's what I meant by my initial remark. People will expect that it works correctly - with the consequence that they unknowingly proceed with an incorrect result, and then complain when they find out that it did produce an incorrect answer. For chardet specifically, my usual standard-library remark applies: it can't become part of the standard library unless the original author contributes it, anyway. I would then hope that he or a group of people would volunteer to maintain it, with the threat of removing it from the stdlib again if these volunteers go away and too many problems show up. Regards, Martin

""Martin v. Löwis"" <martin@v.loewis.de> wrote in message news:480EC376.8070406@v.loewis.de... |> I certainly agree that if the target set of documents is small enough it | | Ok. What advantage would you (or somebody working on a similar project) | gain if chardet was part of the standard library? What if it was not | chardet, but some other algorithm? It seems to me that since there is not a 'correct' algorithm but only competing heuristics, encoding detection modules should be made available via PyPI and only be considered for stdlib after a best of breed emerges with community support.

On 2008-04-23 07:26, Terry Reedy wrote:
+1

Though in practice, determining the "best of breed" often becomes a problem (see e.g. the JSON implementation discussion).

-- Marc-Andre Lemburg, eGenix.com

"Martin v. Löwis" writes:
That depends on whether you can get meaningful information about the language from the fact that you're looking at the page. In the browser context, for one, 99.44% of users are monolingual, so you only have to distinguish among the encodings for their language. In this context, a two-stage process of determining a category of encoding (e.g. ISO 8859, ISO 2022 7-bit, ISO 2022 8-bit multibyte, UTF-8, etc.), and then picking an encoding from the category according to a user-specified configuration, has served Emacs/MULE users very well for about 20 years. It does *not* work in a context where multiple encodings from the same category are in use (e.g. the email folder of a Polish Gastarbeiter in Berlin). Nonetheless, it is pretty useful for user agents like mail clients, web browsers, and editors.
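A rough sketch of that two-stage idea; the category tests and the preference table are simplified assumptions for illustration, not MULE's actual rules:

    PREFERENCES = {                    # hypothetical per-user configuration
        "8-bit": "iso-8859-2",         # e.g. a Polish user
        "iso-2022": "iso-2022-jp",     # e.g. a Japanese mail folder
    }

    def category_of(data):
        if b"\x1b$" in data or b"\x1b(" in data:    # ISO 2022 escape sequences
            return "iso-2022"
        if not any(byte & 0x80 for byte in data):
            return "ascii"
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "8-bit"                          # some ISO 8859 / Windows codepage

    def pick_encoding(data):
        category = category_of(data)
        if category in ("ascii", "utf-8"):
            return category                         # unambiguous by structure
        return PREFERENCES.get(category, "latin-1") # stage 2: user preference decides

    print(pick_encoding("zażółć gęślą jaźń".encode("iso-8859-2")))   # 'iso-8859-2'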

To the contrary, an encoding-guessing module is often needed, and guessing can be done with a pretty high success rate. Other Unicode libraries (e.g. ICU) contain guessing modules. I suppose the API could return two values: the guessed encoding and a confidence indicator. Note that the locale settings might figure in the guess.
--Guido van Rossum (home page: http://www.python.org/~guido/)
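One possible shape for that two-value API, with the locale's preferred encoding as one candidate. The function name, candidate list, and confidence numbers are all made up for illustration; nothing like this exists in the stdlib:

    import locale

    def guess_encoding(data):
        candidates = [
            ("ascii", 0.9),
            ("utf-8", 0.8),
            (locale.getpreferredencoding(False), 0.4),
            ("latin-1", 0.1),          # always decodes, hence the lowest confidence
        ]
        for encoding, confidence in candidates:
            try:
                data.decode(encoding)
                return encoding, confidence
            except (UnicodeDecodeError, LookupError):
                continue

    print(guess_encoding("déjà vu".encode("utf-8")))   # ('utf-8', 0.8)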

Guido> Note that the locale settings might figure in the guess.

Alas, locale settings in a web server have little or nothing to do with the locale settings in the client submitting the form. Skip

Participants (17):
- "Martin v. Löwis"
- Bill Janssen
- Christian Heimes
- David Wolever
- Georg Brandl
- Greg Wilson
- Guido van Rossum
- Jean-Paul Calderone
- M.-A. Lemburg
- Michael Foord
- Mike Klaas
- Oleg Broytmann
- Rodrigo Bernardo Pimentel
- skip@pobox.com
- Stephen J. Turnbull
- Terry Reedy
- Tony Nelson