I have a patch ready (http://bugs.python.org/issue1399) that adds an XML codec. This codec implements encoding detection as specified in http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing and could be used for the decoding phase of an XML parser. Other use cases are:

- The codec could be used for transcoding an XML input before passing it to the real parser, if the parser itself doesn't support the encoding in question.

- A text editor could use the codec to decode an XML file. When the user changes the XML declaration and resaves the file, it would be saved in the correct encoding.

I'd like to have this codec in 2.6 and 3.0. Any comments?

Servus, Walter
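For reference, the auto-detection described in that section of the spec boils down to checking for a BOM and then matching the first few bytes of the document. A minimal sketch of those rules (the function name and the exact spellings of the encoding names are illustrative, not the API from the patch):

```python
def detect_xml_encoding(data):
    """Guess the encoding of an XML byte string from its first bytes,
    following the heuristics in Appendix F of the XML spec (sketch)."""
    # With a byte order mark (order matters: UTF-32 BOMs start like UTF-16 ones)
    boms = [
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),
        (b'\xfe\xff', 'utf-16-be'),
        (b'\xff\xfe', 'utf-16-le'),
        (b'\xef\xbb\xbf', 'utf-8'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # Without a BOM: match the start of the "<?xml" declaration
    patterns = [
        (b'\x00\x00\x00<', 'utf-32-be'),
        (b'<\x00\x00\x00', 'utf-32-le'),
        (b'\x00<\x00?', 'utf-16-be'),
        (b'<\x00?\x00', 'utf-16-le'),
        (b'<?xm', 'utf-8'),              # any ASCII superset; refine from the declaration
        (b'\x4c\x6f\xa7\x94', 'cp037'),  # '<?xm' in EBCDIC
    ]
    for prefix, name in patterns:
        if data.startswith(prefix):
            return name
    return 'utf-8'  # the XML default when nothing else matches
```

A real implementation would additionally read the encoding name out of the XML declaration once it knows the byte width, since e.g. the `'utf-8'` branch also covers iso-8859-1 and friends.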
Any comments?
-1.

First, (as already discussed on the tracker,) "xml" is a bad name for an encoding. How would you encode "Hello" "in xml"?

Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings anyway, and this is where architecturally this functionality belongs.

For a text editor, much more useful than a codec would be a routine (say, xml.detect_encoding) which performs the auto-detection.

Finally, I think the codec is incorrect. When saving XML to a file (e.g. in a text editor), there should rarely be encoding errors, since one could use character references in many cases. Also, the XML spec talks about detecting EBCDIC, which I believe your implementation doesn't.

Regards, Martin
Martin v. Löwis wrote:
Any comments?
-1. First, (as already discussed on the tracker,) "xml" is a bad name for an encoding. How would you encode "Hello" "in xml"?
Then how about the suggested "xml-auto-detect"?
Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings, anyway, and this is where architecturally this functionality belongs.
But not all XML parsers support all encodings. The XML codec makes it trivial to add this support to an existing parser. Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
For a text editor, much more useful than a codec would be a routine (say, xml.detect_encoding) which performs the auto-detection.
There's a (currently undocumented) codecs.detect_xml_encoding() in the patch. We could document this function and make it public. But if there's no codec that uses it, this function IMHO doesn't belong in the codecs module. Should this function be available from xml/__init__.py or should we put it into something like xml/utils.py?
Finally, I think the codec is incorrect. When saving XML to a file (e.g. in a text editor), there should rarely be encoding errors, since one could use character references in many cases.
This requires some intelligent fiddling with the errors attribute of the encoder.
Also, the XML spec talks about detecting EBCDIC, which I believe your implementation doesn't.
Correct, but as long as Python doesn't have an EBCDIC codec, that won't help much. Adding *detection* of EBCDIC to detect_xml_encoding() is rather simple though. Servus, Walter
Then how about the suggested "xml-auto-detect"?
That is better.
Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings, anyway, and this is where architecturally this functionality belongs.
But not all XML parsers support all encodings. The XML codec makes it trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser that doesn't support a specific encoding and where adding such a codec solves that problem? In particular, why would that parser know how to process Python Unicode strings?
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO.
There's a (currently undocumented) codecs.detect_xml_encoding() in the patch. We could document this function and make it public. But if there's no codec that uses it, this function IMHO doesn't belong in the codecs module. Should this function be available from xml/__init__.py or should we put it into something like xml/utils.py?
Either - or.
Finally, I think the codec is incorrect. When saving XML to a file (e.g. in a text editor), there should rarely be encoding errors, since one could use character references in many cases.
This requires some intelligent fiddling with the errors attribute of the encoder.
Much more than that, I think - you cannot use a character reference in an XML Name. So the codec would have to parse the output stream to know whether or not a character reference could be used.
Correct, but as long as Python doesn't have an EBCDIC codec, that won't help much. Adding *detection* of EBCDIC to detect_xml_encoding() is rather simple though.
But it does! cp037 is EBCDIC, and supported by Python. Regards, Martin
Martin v. Löwis wrote:
Then how about the suggested "xml-auto-detect"?
That is better.
OK.
Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings, anyway, and this is where architecturally this functionality belongs. But not all XML parsers support all encodings. The XML codec makes it trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser that doesn't support a specific encoding
It seems that e.g. expat doesn't support UTF-32:

    from xml.parsers import expat

    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    p.Parse(s, True)

This fails with:

    Traceback (most recent call last):
      File "gurk.py", line 6, in <module>
        p.Parse(s, True)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Replace "utf-32" with "utf-16" and the problem goes away.
and where adding such a codec solves that problem?
In particular, why would that parser know how to process Python Unicode strings?
It doesn't have to. You can use an XML encoder to reencode the unicode string into bytes (forcing an encoding that the parser knows):

    import codecs
    from xml.parsers import expat

    ci = codecs.lookup("xml-auto-detect")
    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
    p.Parse(s, True)
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO.
But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
There's a (currently undocumented) codecs.detect_xml_encoding() in the patch. We could document this function and make it public. But if there's no codec that uses it, this function IMHO doesn't belong in the codecs module. Should this function be available from xml/__init__.py or should we put it into something like xml/utils.py?
Either - or.
OK, so should I put the C code into a _xml module?
Finally, I think the codec is incorrect. When saving XML to a file (e.g. in a text editor), there should rarely be encoding errors, since one could use character references in many cases. This requires some intelligent fiddling with the errors attribute of the encoder.
Much more than that, I think - you cannot use a character reference in an XML Name. So the codec would have to parse the output stream to know whether or not a character reference could be used.
That's what I meant by "intelligent" fiddling. But I agree this is way beyond what a text editor should do. AFAIK it is way beyond what existing text editors do. However, using the XML codec would at least guarantee that the encoding specified in the XML declaration and the encoding used for encoding the file stay consistent.
Correct, but as long as Python doesn't have an EBCDIC codec, that won't help much. Adding *detection* of EBCDIC to detect_xml_encoding() is rather simple though.
But it does! cp037 is EBCDIC, and supported by Python.
I didn't know that. I'm going to update the patch. Servus, Walter
Walter Dörwald wrote:
Martin v. Löwis wrote:
[...]
Correct, but as long as Python doesn't have an EBCDIC codec, that won't help much. Adding *detection* of EBCDIC to detect_xml_encoding() is rather simple though. But it does! cp037 is EBCDIC, and supported by Python.
I didn't know that. I'm going to update the patch.
Done: http://bugs.python.org/1399 I also renamed the codec to xml_auto_detect. Servus, Walter
On 11/8/07, Walter Dörwald <walter@livinglogic.de> wrote:
Martin v. Löwis wrote:
Then how about the suggested "xml-auto-detect"?
That is better.
OK.
Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings, anyway, and this is where architecturally this functionality belongs. But not all XML parsers support all encodings. The XML codec makes it trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser that doesn't support a specific encoding
It seems that e.g. expat doesn't support UTF-32:
from xml.parsers import expat
    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    p.Parse(s, True)
This fails with:
    Traceback (most recent call last):
      File "gurk.py", line 6, in <module>
        p.Parse(s, True)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Replace "utf-32" with "utf-16" and the problem goes away.
and where adding such a codec solves that problem?
In particular, why would that parser know how to process Python Unicode strings?
It doesn't have to. You can use an XML encoder to reencode the unicode string into bytes (forcing an encoding that the parser knows):
    import codecs
    from xml.parsers import expat

    ci = codecs.lookup("xml-auto-detect")
    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
    p.Parse(s, True)
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO.
But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc. codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. It's not even sufficient for XML:

1) Round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked.

2) The encoding may be specified externally from the file/stream [1]. The XML parser needs to handle these out-of-band encodings anyway.

[1] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html

-- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
On 11/8/07, Walter Dörwald <walter@livinglogic.de> wrote:
[...]
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library? I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO. But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though.
So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
It's not even sufficient for XML:
1) round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked.
The chosen encoding is available from the incremental encoder:

    import codecs

    e = codecs.getincrementalencoder("xml-auto-detect")()
    e.encode(u"<?xml version='1.0' encoding='utf-32'?><foo/>", True)
    print e.encoding

This prints utf-32.
2) the encoding may be specified externally from the file/stream[1]. The xml parser needs to handle these out-of-band encodings anyway.
It does. You can pass an encoding to the stateless decoder, the incremental decoder and the streamreader. It will then use this encoding instead of the one detected from the byte stream. It will even put the correct encoding into the XML declaration (if there is one):

    import codecs

    d = codecs.getdecoder("xml-auto-detect")
    print d("<?xml version='1.0' encoding='iso-8859-1'?><foo/>", encoding="utf-8")[0]

This prints:

    <?xml version='1.0' encoding='utf-8'?><foo/>

Servus, Walter
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though.
So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding. Regards, Martin
Martin v. Löwis wrote:
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding.
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder? Servus, Walter
On 2007-11-09 14:10, Walter Dörwald wrote:
Martin v. Löwis wrote:
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec? Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding.
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
FWIW: I'm +1 on adding such a codec.

It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.

Whether it needs to be in C or not is another question (I would have done this in Python since performance is not really an issue), but since the code is already written, why not use it?

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 09 2007)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611
M.-A. Lemburg wrote:
On 2007-11-09 14:10, Walter Dörwald wrote:
Martin v. Löwis wrote:
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec? Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding. And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
FWIW: I'm +1 on adding such a codec.
It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.
Exactly. I have a version of sgmlop lying around that does that.
Whether it needs to be in C or not is another question (I would have done this in Python since performance is not really an issue), but since the code is already written, why not use it ?
Servus, Walter
On Nov 9, 2007, at 8:22 AM, M.-A. Lemburg wrote:
FWIW: I'm +1 on adding such a codec.
I'm undecided, and really don't feel strongly either way.
It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.
Which is fine if you want to write a new parser. I've no interest in that myself.
Whether it needs to be in C or not is another question (I would have done this in Python since performance is not really an issue), but since the code is already written, why not use it ?
The reason not to use C is the usual one: the codec is more portable across Python implementations if it's written in Python. This makes it more useful with Jython, IronPython, and PyPy. That seems a pretty good reason to me. -Fred -- Fred Drake <fdrake at acm.org>
It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.
Having the functionality indeed makes things easier. However, I don't find

    s.decode(xml.detect_encoding(s))

particularly more difficult than

    s.decode("xml-auto-detection")
Whether it needs to be in C or not is another question (I would have done this in Python since performance is not really an issue), but since the code is already written, why not use it ?
It's a maintenance issue. Regards, Martin
Martin v. Löwis wrote:
It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.
Having the functionality indeed makes things easier. However, I don't find
s.decode(xml.detect_encoding(s))
particularly more difficult than
s.decode("xml-auto-detection")
Not really, but the codec has more control over what happens to the stream, i.e. it's easier to implement look-ahead in the codec than to do the detection and then try to push the bytes back onto the stream (which may or may not be possible depending on the nature of the stream).
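As a sketch of what that look-ahead could look like inside an incremental decoder (the class name, the `_guess` helper and its toy detection rules are all hypothetical, not the code from the patch): the decoder simply buffers bytes until it has enough to decide, then delegates everything to the detected codec, so nothing ever has to be pushed back onto the stream.

```python
import codecs

def _guess(data):
    # Toy detection for the sketch: real code would apply the full
    # Appendix F rules (BOMs, UTF-32 patterns, EBCDIC, ...).
    if data.startswith(b'\x00<'):
        return 'utf-16-be'
    if data.startswith(b'<\x00'):
        return 'utf-16-le'
    return 'utf-8'

class AutoDetectIncrementalDecoder(codecs.IncrementalDecoder):
    """Buffers input until the encoding is known, then delegates."""

    def __init__(self, errors='strict'):
        codecs.IncrementalDecoder.__init__(self, errors)
        self.buffer = b''
        self.decoder = None

    def decode(self, input, final=False):
        if self.decoder is None:
            self.buffer += input
            if len(self.buffer) < 4 and not final:
                return ''  # look-ahead: not enough bytes yet, keep buffering
            encoding = _guess(self.buffer)
            self.decoder = codecs.getincrementaldecoder(encoding)(self.errors)
            input, self.buffer = self.buffer, b''
        return self.decoder.decode(input, final)
```

Feeding the document in arbitrary chunks works because the buffering is internal to the decoder; the caller never needs a seekable stream.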
Whether it needs to be in C or not is another question (I would have done this in Python since performance is not really an issue), but since the code is already written, why not use it ?
It's a maintenance issue.
I'm sure Walter will do a great job in maintaining the code :-)

Regards, -- Marc-Andre Lemburg
On Nov 9, 2007 3:59 PM, M.-A. Lemburg <mal@egenix.com> wrote:
Martin v. Löwis wrote:
It makes working with XML data a lot easier: you simply don't have to bother with the encoding of the XML data anymore and can just let the codec figure out the details. The XML parser can then work directly on the Unicode data.
Having the functionality indeed makes things easier. However, I don't find
s.decode(xml.detect_encoding(s))
particularly more difficult than
s.decode("xml-auto-detection")
Not really, but the codec has more control over what happens to the stream, ie. it's easier to implement look-ahead in the codec than to do the detection and then try to push the bytes back onto the stream (which may or may not be possible depending on the nature of the stream).
io.BufferedReader() standardizes a .peek() API, making it trivial. I don't see why we couldn't require it. (As an aside, .peek() will fail to do what detect_xml_encoding() needs if BufferedReader's buffer size is too small. I do wonder if that limitation is appropriate.) -- Adam Olsen, aka Rhamphoryncus
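For illustration, a rough sketch of that approach (assuming the io module's semantics; note that peek() only promises to return *some* buffered bytes, not necessarily the count requested, which is exactly the caveat mentioned above):

```python
import io

# Wrap a byte stream so the first bytes can be inspected without
# consuming them; the parser later reads from position 0 as usual.
raw = io.BytesIO(b"<?xml version='1.0' encoding='utf-8'?><foo/>")
stream = io.BufferedReader(raw)

head = stream.peek(4)[:4]      # look-ahead; peek() may return more than 4 bytes
looks_utf16be = head.startswith(b'\x00<')

data = stream.read()           # stream position was not advanced by peek()
```

The same pattern works for any raw stream that BufferedReader can wrap, e.g. a socket file object, which is the non-seekable case under discussion.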
Not really, but the codec has more control over what happens to the stream, ie. it's easier to implement look-ahead in the codec than to do the detection and then try to push the bytes back onto the stream (which may or may not be possible depending on the nature of the stream).
YAGNI. Regards, Martin
Martin v. Löwis wrote:
Not really, but the codec has more control over what happens to the stream, ie. it's easier to implement look-ahead in the codec than to do the detection and then try to push the bytes back onto the stream (which may or may not be possible depending on the nature of the stream).
YAGNI.
A non-seekable stream is not all that uncommon in network processing. I usually end up either reading the complete data into memory or doing the needed buffering by hand.

Regards, -- Marc-Andre Lemburg
On 2007-11-10 09:54, Martin v. Löwis wrote:
A non-seekable stream is not all that uncommon in network processing.
Right. But what is the relationship to XML encoding autodetection?
It pops up whenever you need to detect the encoding of the incoming XML data on the network connection, e.g. in XML RPC or data upload mechanisms. Even though XML data mostly uses UTF-8 in real life applications, a standards compliant XML interface must also support other possible encodings. It is also not always feasible to load all data into memory, so some form of buffering must be used. Since incremental codecs already implement buffering, it's only natural to let them take care of the auto-detection. This approach is also needed if you want to stack stream codecs (not sure whether this is still possible in Py3, but that's how I designed them for Py2).

Regards, -- Marc-Andre Lemburg
A non-seekable stream is not all that uncommon in network processing. Right. But what is the relationship to XML encoding autodetection?
It pops up whenever you need to detect the encoding of the incoming XML data on the network connection, e.g. in XML RPC or data upload mechanisms.
No, it doesn't. For XML-RPC, you pass the XML payload of the HTTP request to the XML parser, and it deals with the encoding.
It is also not always feasible to load all data into memory, so some form of buffering must be used.
Again, I don't see the use case. For XML-RPC, it's very feasible and standard procedure to have the entire document in memory (in a processed form).
This approach is also needed if you want to stack stream codecs (not sure whether this is still possible in Py3, but that's how I designed them for Py2).
The design of the Py2 codecs is fairly flawed, unfortunately. Regards, Martin
On 2007-11-11 14:51, Martin v. Löwis wrote:
A non-seekable stream is not all that uncommon in network processing. Right. But what is the relationship to XML encoding autodetection? It pops up whenever you need to detect the encoding of the incoming XML data on the network connection, e.g. in XML RPC or data upload mechanisms.
No, it doesn't. For XML-RPC, you pass the XML payload of the HTTP request to the XML parser, and it deals with the encoding.
First, XML-RPC is not the only mechanism using XML over a network connection. Second, you don't want to do this if you're dealing with several 100 MB of data just because you want to figure out the encoding.
It is also not always feasible to load all data into memory, so some form of buffering must be used.
Again, I don't see the use case. For XML-RPC, it's very feasible and standard procedure to have the entire document in memory (in a processed form).
You may not see the use case, but that doesn't really mean anything if the use cases exist in real life applications, right ?!
This approach is also needed if you want to stack stream codecs (not sure whether this is still possible in Py3, but that's how I designed them for Py2).
The design of the Py2 codecs is fairly flawed, unfortunately.
Fortunately, this sounds like a fairly flawed argument to me ;-)

-- Marc-Andre Lemburg
First, XML-RPC is not the only mechanism using XML over a network connection. Second, you don't want to do this if you're dealing with several 100 MB of data just because you want to figure out the encoding.
That's my original claim/question: what SPECIFIC application do you have in mind that transfers XML over a network and where you would want to have such a stream codec?

If I have 100MB of XML in a file, using the detection API, I do

    f = open(filename)
    s = f.read(100)
    while True:
        coding = xml.utils.detect_encoding(s)
        if coding is not undetermined:
            break
        s += f.read(100)
    f.close()

Having the loop here is paranoia: in my application, I might be able to know that 100 bytes are sufficient to determine the encoding always.
Again, I don't see the use case. For XML-RPC, it's very feasible and standard procedure to have the entire document in memory (in a processed form).
You may not see the use case, but that doesn't really mean anything if the use cases exist in real life applications, right ?!
Right. However, I will remain opposed to adding this to the standard library until I see why one would absolutely need to have that. Not every piece of code that is useful in some application should be added to the standard library. Regards, Martin
On 2007-11-11 18:56, Martin v. Löwis wrote:
First, XML-RPC is not the only mechanism using XML over a network connection. Second, you don't want to do this if you're dealing with several 100 MB of data just because you want to figure out the encoding.
That's my original claim/question: what SPECIFIC application do you have in mind that transfers XML over a network and where you would want to have such a stream codec?
XML-based web services used for business integration, e.g. based on ebXML. A common use case from our everyday consulting business is e.g. passing market and trading data to portfolio pricing web services.
If I have 100MB of XML in a file, using the detection API, I do
    f = open(filename)
    s = f.read(100)
    while True:
        coding = xml.utils.detect_encoding(s)
        if coding is not undetermined:
            break
        s += f.read(100)
    f.close()
Having the loop here is paranoia: in my application, I might be able to know that 100 bytes are sufficient to determine the encoding always.
Doing the detection with files is easy, but that was never questioned.
Again, I don't see the use case. For XML-RPC, it's very feasible and standard procedure to have the entire document in memory (in a processed form). You may not see the use case, but that doesn't really mean anything if the use cases exist in real life applications, right ?!
Right. However, I will remain opposed to adding this to the standard library until I see why one would absolutely need to have that. Not every piece of code that is useful in some application should be added to the standard library.
Agreed, but the application space of web services is large enough to warrant this.

-- Marc-Andre Lemburg
First, XML-RPC is not the only mechanism using XML over a network connection. Second, you don't want to do this if you're dealing with several 100 MB of data just because you want to figure out the encoding. That's my original claim/question: what SPECIFIC application do you have in mind that transfers XML over a network and where you would want to have such a stream codec?
XML-based web services used for business integration, e.g. based on ebXML.
A common use case from our everyday consulting business is e.g. passing market and trading data to portfolio pricing web services.
I still don't see the need for this feature from this example. First, in ebXML messaging, the messages are typically *not* large (i.e. much smaller than 100 MB). Furthermore, the typical processing of such a message would be to pass it directly to the XML parser, so there is no need for the functionality under discussion.
Right. However, I will remain opposed to adding this to the standard library until I see why one would absolutely need to have that. Not every piece of code that is useful in some application should be added to the standard library.
Agreed, but the application space of web services is large enough to warrant this.
If that was the case, wouldn't the existing Python web service libraries already include such a functionality? Regards, Martin
On 2007-11-11 23:22, Martin v. Löwis wrote:
First, XML-RPC is not the only mechanism using XML over a network connection. Second, you don't want to do this if you're dealing with several 100 MB of data just because you want to figure out the encoding. That's my original claim/question: what SPECIFIC application do you have in mind that transfers XML over a network and where you would want to have such a stream codec? XML-based web services used for business integration, e.g. based on ebXML.
A common use case from our everyday consulting business is e.g. passing market and trading data to portfolio pricing web services.
I still don't see the need for this feature from this example. First, in ebXML messaging, the messages are typically *not* large (i.e. much smaller than 100 MB). Furthermore, the typical processing of such a message would be to pass it directly to the XML parser; there is no need for the functionality under discussion.
I don't see the point in continuing this discussion. If you think you know better, that's fine. Just please don't generalize this to everyone else working with Python and XML.
Right. However, I will remain opposed to adding this to the standard library until I see why one would absolutely need to have that. Not every piece of code that is useful in some application should be added to the standard library. Agreed, but the application space of web services is large enough to warrant this.
If that was the case, wouldn't the existing Python web service libraries already include such a functionality?
No. To finalize this: We have a -1 from Martin and a +1 from Walter, Guido and myself. Pretty clear vote if you ask me. I'd say we end the discussion here and move on. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 12 2007)
On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
We have a -1 from Martin and a +1 from Walter, Guido and myself. Pretty clear vote if you ask me. I'd say we end the discussion here and move on.
If we're counting, you've got a -1 on the codec from me as well. Martin's right: there's no value to embedding the logic of auto-detection into the codec. A function somewhere in the xml package is all that's warranted. -Fred -- Fred Drake <fdrake at acm.org>
Fred Drake wrote:
On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
We have a -1 from Martin and a +1 from Walter, Guido and myself. Pretty clear vote if you ask me. I'd say we end the discussion here and move on.
If we're counting, you've got a -1 on the codec from me as well. Martin's right: there's no value to embedding the logic of auto-detection into the codec.
It isn't "embedded". codecs.detect_xml_encoding() is callable without any problems (though not documented).
A function somewhere in the xml package is all that's warranted.
Who would use such a function for what? Servus, Walter
On Nov 12, 2007, at 8:56 AM, Walter Dörwald wrote:
It isn't "embedded". codecs.detect_xml_encoding() is callable without any problems (though not documented).
"Not documented" means not available, I think.
Who would use such a function for what?
Being able to detect the encoding can be useful anytime you want information about a file, actually. In particular, presenting encoding information in a user interface (yes, you can call that contrived, but some people want to be able to see such things, and for them it's a requirement). If you want to parse the XML and re-encode, it's common to want to re-encode in the original encoding; it's needed for that as well. If you just want to toss the text into an editor, the encoding is also needed. In that case, the codec approach *might* be acceptable (depending on the rest of the editor implementation), but the same re-encoding issue applies as well. Simply, it's sometimes desired to know the encoding for purposes that don't require immediate decoding. A function would be quite handy in these cases. -Fred -- Fred Drake <fdrake at acm.org>
On Nov 12, 2007, at 10:54 AM, Bill Janssen wrote:
In os.path? os.path.encoding(location)?
I wasn't thinking it would be that general; determining the encoding for an arbitrary text file is a larger problem than it is for an XML file. An implementation based strictly on the rules from the XML specification should be in the xml package (somewhere). Determining that the file is an XML file is separate. I doubt this really makes sense in os.path. -Fred -- Fred Drake <fdrake at acm.org>
Fred Drake wrote:
On Nov 12, 2007, at 8:56 AM, Walter Dörwald wrote:
It isn't "embedded". codecs.detect_xml_encoding() is callable without any problems (though not documented).
"Not documented" means not available, I think.
I just didn't think that anyone would want the detection function but not the codec, so I left the function undocumented.
Who would use such a function for what?
Being able to detect the encoding can be useful anytime you want information about a file, actually. In particular, presenting encoding information in a user interface (yes, you can call that contrived, but some people want to be able to see such things, and for them it's a requirement).
And if you want to display the XML you'd need to decode it. An example might be a text viewer, e.g. Apple's QuickLook.
If you want to parse the XML and re-encode, it's common to want to re-encode in the origin encoding; it's needed for that as well. If you just want to toss the text into an editor, the encoding is also needed. In that case, the codec approach *might* be acceptable (depending on the rest of the editor implementation), but the same re-encoding issue applies as well.
Simply, it's sometimes desired to know the encoding for purposes that don't require immediate decoding. A function would be quite handing in these cases.
So the consensus seems to be: Add an encoding detection function (implemented in Python) to the xml module? Servus, Walter
On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
We have a -1 from Martin and a +1 from Walter, Guido and myself. Pretty clear vote if you ask me. I'd say we end the discussion here and move on.
If we're counting, you've got a -1 on the codec from me as well. Martin's right: there's no value to embedding the logic of auto-detection into the codec. A function somewhere in the xml package is all that's warranted.
I agree with Fred here - it should be a function in the xml package, not a codec. -1 -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
Walter Dörwald wrote:
Martin v. Löwis wrote:
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec? Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding.
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
In fact, we already have such a codec. The utf-16 decoder looks at the first two bytes and then decides to forward the rest to either a utf-16-be or a utf-16-le decoder. Servus, Walter
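As a concrete illustration of that BOM dispatch (in modern Python 3 syntax, which postdates this thread):

```python
# The single "utf-16" codec inspects the BOM and then behaves like
# utf-16-be or utf-16-le -- the dispatch Walter describes.
import codecs

be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Both byte strings decode under the one codec name:
assert be.decode("utf-16") == "hi"
assert le.decode("utf-16") == "hi"
```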
In fact, we already have such a codec. The utf-16 decoder looks at the first two bytes and then decides to forward the rest to either a utf-16-be or a utf-16-le decoder.
That's different. UTF-16 is a proper encoding that is just specified to use the BOM. "xml-auto-detection" is not an encoding. Regards, Martin
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
Because it is the XML parser that does the decoding, not the application. Also, it is better to provide functionality in a modular manner (i.e. encoding detection separately from encodings), and leaving integration of modules to the application, in particular if the integration is trivial. Regards, Martin
"Martin v. Löwis" sagte:
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
Because it is the XML parser that does the decoding, not the application. Also, it is better to provide functionality in a modular manner (i.e. encoding detection separately from encodings),
It is separate. Detection is done by codecs.detect_xml_encoding(), decoding is done by the codec.
and leaving integration of modules to the application, in particular if the integration is trivial.
Servus, Walter
On Nov 9, 2007 6:10 AM, Walter Dörwald <walter@livinglogic.de> wrote:
Martin v. Löwis wrote:
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding.
And what do you do once you've detected the encoding? You decode the input, so why not combine both into an XML decoder?
It seems to me that parsing XML requires 3 steps: 1) determine encoding 2) decode byte stream 3) parse XML (including handling of character references). All an xml codec does is make the first part a side-effect of the second part. Rather than this:

    encoding = detect_encoding(raw_data)
    decoded_data = raw_data.decode(encoding)
    tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

    e = codecs.getincrementaldecoder("xml-auto-detect")()
    decoded_data = e.decode(raw_data, True)
    tree = parse_xml(decoded_data, e.encoding)  # Verifies encoding

It's clear to me that detecting an encoding is actually the simplest part of all this (so long as there's an API to do it!) Putting it inside a codec seems like the wrong subdivision of responsibility. (An example using streams would end up closer, but it still seems wrong to me. Encoding detection is always one way, while codecs are always two way (even if lossy).) -- Adam Olsen, aka Rhamphoryncus
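A runnable version of the first variant might look like this (detect_encoding here is a toy stand-in, not a real API, and xml.etree substitutes for the parse_xml placeholder):

```python
import xml.etree.ElementTree as ET

def detect_encoding(raw):
    # Toy stand-in: real detection would implement Appendix F of the XML spec.
    if raw[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"
    return "utf-8"

raw_data = "<?xml version='1.0'?><foo/>".encode("utf-8")
encoding = detect_encoding(raw_data)        # 1) determine encoding
decoded_data = raw_data.decode(encoding)    # 2) decode byte stream
tree = ET.fromstring(decoded_data)          # 3) parse XML
assert tree.tag == "foo"
```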
"Martin v. Löwis" writes:
It's clear to me that detecting an encoding is actually the simplest part of all this (so long as there's an API to do it!) Putting it inside a codec seems like the wrong subdivision of responsibility.
In case it isn't clear - this is exactly my view also.
But is there an API to do it? As MAL points out that API would have to return not an encoding, but a pair of an encoding and the rewound stream. For non-seekable, non-peekable streams (if any), what you'd need would be a stream that consisted of a concatenation of the buffered data used for detection and the continuation of the stream.
In case it isn't clear - this is exactly my view also.
But is there an API to do it? As MAL points out that API would have to return not an encoding, but a pair of an encoding and the rewound stream.
The API wouldn't operate on streams. Instead, you pass a string, and it either returns the detected encoding, or an information telling that it needs more data. No streams.
For non-seekable, non-peekable streams (if any), what you'd need would be a stream that consisted of a concatenation of the buffered data used for detection and the continuation of the stream.
The application would read data out of the stream, and pass it to the detection. It then can process it in whatever manner it meant to process it in the first place. Regards, Martin
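The read-and-feed loop Martin describes could be sketched like this (all names here are illustrative; the detector is assumed to return None while it still needs more data):

```python
import io

def detect_from_stream(stream, detect, chunk_size=16):
    # Accumulate chunks until the detector has enough bytes to decide.
    buffered = b""
    while True:
        chunk = stream.read(chunk_size)
        buffered += chunk
        enc = detect(buffered)
        if enc is not None or not chunk:
            # Return the already-read data too, so the application can
            # process it in whatever manner it meant to in the first place.
            return enc, buffered

def toy_detect(data):
    if len(data) < 4:
        return None  # need more data
    return "utf-16-le" if data.startswith(b"\xff\xfe") else "utf-8"

enc, consumed = detect_from_stream(io.BytesIO(b"<?xml version='1.0'?><a/>"), toy_detect)
assert enc == "utf-8"
```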
Martin v. Löwis wrote:
In case it isn't clear - this is exactly my view also.
But is there an API to do it? As MAL points out that API would have to return not an encoding, but a pair of an encoding and the rewound stream.
The API wouldn't operate on streams. Instead, you pass a string, and it either returns the detected encoding, or an information telling that it needs more data. No streams.
But in many cases you read the data out of a stream and pass it to an incremental XML parser. So if you're transcoding the input (either because the XML parser can't handle the encoding in question or because there's an external encoding specified, but it's not possible to pass that to the parser), a codec makes the most sense.
For non-seekable, non-peekable streams (if any), what you'd need would be a stream that consisted of a concatenation of the buffered data used for detection and the continuation of the stream.
The application would read data out of the stream, and pass it to the detection. It then can process it in whatever manner it meant to process it in the first place.
Servus, Walter
    ci = codecs.lookup("xml-auto-detect")
    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
    p.Parse(s, True)
So how come the document being parsed is recognized as UTF-8?
OK, so should I put the C code into a _xml module?
I don't see the need for C code at all. Regards, Martin
Martin v. Löwis wrote:
    ci = codecs.lookup("xml-auto-detect")
    p = expat.ParserCreate()
    e = "utf-32"
    s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
    s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
    p.Parse(s, True)
So how come the document being parsed is recognized as UTF-8?
Because you can force the encoder to use a specified encoding. If you do this and the unicode string starts with an XML declaration, the encoder will put the specified encoding into the declaration:

    import codecs
    e = codecs.getencoder("xml-auto-detect")
    print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>", encoding="utf-8")[0]

This prints:

    <?xml version='1.0' encoding='utf-8'?><foo/>
OK, so should I put the C code into a _xml module?
I don't see the need for C code at all.
Doing the bit fiddling for Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the right thing to do. Servus, Walter
Because you can force the encoder to use a specified encoding. If you do this and the unicode string starts with an XML declaration
So what if the unicode string doesn't start with an XML declaration? Will it add one? If so, what version number will it use?
OK, so should I put the C code into a _xml module? I don't see the need for C code at all.
Doing the bit fiddling for Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the right thing to do.
Hmm. I don't think a sequence like

    + if (strlen>0)
    + {
    +     if (*str++ != '<')
    +         return 1;
    +     if (strlen>1)
    +     {
    +         if (*str++ != '?')
    +             return 1;
    +         if (strlen>2)
    +         {
    +             if (*str++ != 'x')
    +                 return 1;
    +             if (strlen>3)
    +             {
    +                 if (*str++ != 'm')
    +                     return 1;
    +                 if (strlen>4)
    +                 {
    +                     if (*str++ != 'l')
    +                         return 1;
    +                     if (strlen>5)
    +                     {
    +                         if (*str != ' ' && *str != '\t' && *str != '\r' && *str != '\n')
    +                             return 1;

is well-maintainable C. I feel it is much better writing

    if not s.startswith("<?xml"):
        return 1

What bit fiddling are you referring to specifically that you think is better done in C than in Python? Regards, Martin
Martin v. Löwis wrote:
Because you can force the encoder to use a specified encoding. If you do this and the unicode string starts with an XML declaration
So what if the unicode string doesn't start with an XML declaration? Will it add one?
No.
If so, what version number will it use?
If we added this we could add an extra argument version to the encoder constructor defaulting to '1.0'.
OK, so should I put the C code into a _xml module? I don't see the need for C code at all. Doing the bit fiddling for Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the right thing to do.
Hmm. I don't think a sequence like
    + if (strlen>0)
    + {
    +     if (*str++ != '<')
    +         return 1;
    +     if (strlen>1)
    +     {
    +         if (*str++ != '?')
    +             return 1;
    +         if (strlen>2)
    +         {
    +             if (*str++ != 'x')
    +                 return 1;
    +             if (strlen>3)
    +             {
    +                 if (*str++ != 'm')
    +                     return 1;
    +                 if (strlen>4)
    +                 {
    +                     if (*str++ != 'l')
    +                         return 1;
    +                     if (strlen>5)
    +                     {
    +                         if (*str != ' ' && *str != '\t' && *str != '\r' && *str != '\n')
    +                             return 1;
is well-maintainable C. I feel it is much better writing
    if not s.startswith("<?xml"):
        return 1
The point of this code is not just to return whether the string starts with "<?xml" or not. There are actually three cases:
* The string does start with "<?xml"
* The string starts with a prefix of "<?xml", i.e. we can only decide if it starts with "<?xml" if we have more input.
* The string definitely doesn't start with "<?xml".
What bit fiddling are you referring to specifically that you think is better done in C than in Python?
The code that checks the byte signature, i.e. the first part of detect_xml_encoding_str(). Servus, Walter
So what if the unicode string doesn't start with an XML declaration? Will it add one?
No.
Ok. So the XML document would be ill-formed then unless the encoding is UTF-8, right?
The point of this code is not just to return whether the string starts with "<?xml" or not. There are actually three cases:
Still, it's overly complex for that matter:
* The string does start with "<?xml"
if s.startswith("<?xml"): return Yes
* The string starts with a prefix of "<?xml", i.e. we can only decide if it starts with "<?xml" if we have more input.
if "<?xml".startswith(s): return Maybe
* The string definitely doesn't start with "<?xml".
return No
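Folded into one function, the three branches might read like this (the Yes/Maybe/No strings are placeholders for whatever sentinel values the real code would use):

```python
def starts_with_xml_decl(s):
    # Three-valued check: definite match, possible match given more
    # data, or definite non-match.
    if s.startswith("<?xml"):
        return "Yes"
    if "<?xml".startswith(s):
        return "Maybe"  # s is a proper prefix; need more input
    return "No"

assert starts_with_xml_decl("<?xml version='1.0'?>") == "Yes"
assert starts_with_xml_decl("<?x") == "Maybe"
assert starts_with_xml_decl("<html/>") == "No"
```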
What bit fiddling are you referring to specifically that you think is better done in C than in Python?
The code that checks the byte signature, i.e. the first part of detect_xml_encoding_str().
I can't see any *bit* fiddling there, except for the bit mask of candidates. For the candidate list, I cannot quite understand why you need a bit mask at all, since the candidates are rarely overlapping. I think there could be a much simpler routine to have the same effect.
- if it's less than 4 bytes, answer "need more data".
- otherwise, implement annex F "literally". Make a dictionary of all prefixes that are exactly 4 bytes, i.e.

    prefixes4 = {"\x00\x00\xFE\xFF": "utf-32be", ...
                 ..., "\0\x3c\0\x3f": "utf-16le"}
    try:
        return prefixes4[s[:4]]
    except KeyError:
        pass
    if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
    ...
    if s.startswith("<?xml"):
        return get_encoding_from_declaration(s)
    return "utf-8"

Regards, Martin
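Fleshed out, that sketch might look roughly as follows (Python 3, where the utf-32 codecs exist; the function name and table are illustrative, and the declaration-scanning branch is left as the simple default):

```python
import codecs

# Four-byte prefixes from Appendix F of the XML spec (with and
# without a BOM); not an exhaustive transcription.
PREFIXES4 = {
    b"\x00\x00\xfe\xff": "utf-32-be",
    b"\xff\xfe\x00\x00": "utf-32-le",
    b"\x00\x00\x00<":    "utf-32-be",
    b"<\x00\x00\x00":    "utf-32-le",
    b"\x00<\x00?":       "utf-16-be",
    b"<\x00?\x00":       "utf-16-le",
}

def detect_xml_encoding(data):
    if len(data) < 4:
        return None  # need more data
    enc = PREFIXES4.get(data[:4])
    if enc is not None:
        return enc
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(b"<?xml"):
        # A real implementation would read encoding= from the
        # declaration here (get_encoding_from_declaration).
        return "utf-8"
    return "utf-8"  # spec default in the absence of other information
```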
"Martin v. Löwis" sagte:
So what if the unicode string doesn't start with an XML declaration? Will it add one?
No.
Ok. So the XML document would be ill-formed then unless the encoding is UTF-8, right?
I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration, is not in UTF-8 or UTF-16, but there's external encoding info? If it is, then yes, the document would be ill-formed.
The point of this code is not just to return whether the string starts with "<?xml" or not. There are actually three cases:
Still, it's overly complex for that matter:
* The string does start with "<?xml"
if s.startswith("<?xml"): return Yes
* The string starts with a prefix of "<?xml", i.e. we can only decide if it starts with "<?xml" if we have more input.
if "<?xml".startswith(s): return Maybe
* The string definitely doesn't start with "<?xml".
return No
This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
What bit fiddling are you referring to specifically that you think is better done in C than in Python?
The code that checks the byte signature, i.e. the first part of detect_xml_encoding_str().
I can't see any *bit* fiddling there, except for the bit mask of candidates. For the candidate list, I cannot quite understand why you need a bit mask at all, since the candidates are rarely overlapping.
I tried many variants and that seemed to be the most straightforward one.
I think there could be a much simpler routine to have the same effect. - if it's less than 4 bytes, answer "need more data".
Can there be an XML document that is less than 4 bytes? I guess not.
- otherwise, implement annex F "literally". Make a dictionary of all prefixes that are exactly 4 bytes, i.e.
    prefixes4 = {"\x00\x00\xFE\xFF": "utf-32be", ...
                 ..., "\0\x3c\0\x3f": "utf-16le"}
    try:
        return prefixes4[s[:4]]
    except KeyError:
        pass
    if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
    ...
    if s.startswith("<?xml"):
        return get_encoding_from_declaration(s)
    return "utf-8"
get_encoding_from_declaration() would have to do the same yes/no/maybe decision. But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted? Servus, Walter
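For illustration, a minimal get_encoding_from_declaration() might look like this (hypothetical; it assumes the whole declaration is already in the buffer, i.e. the yes/no/maybe prefix handling sits in front of it):

```python
import re

# Matches <?xml version='...' encoding='...' and captures the encoding
# name; whitespace and quote handling is simplified.
_DECL = re.compile(
    rb"^<\?xml\s+version\s*=\s*['\"][^'\"]*['\"]"
    rb"(?:\s+encoding\s*=\s*['\"]([A-Za-z][A-Za-z0-9._-]*)['\"])?")

def get_encoding_from_declaration(data):
    m = _DECL.match(data)
    if m is not None and m.group(1):
        return m.group(1).decode("ascii")
    return "utf-8"  # no declaration, or no encoding= pseudo-attribute

assert get_encoding_from_declaration(
    b"<?xml version='1.0' encoding='iso-8859-1'?><foo/>") == "iso-8859-1"
```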
I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration, is not in UTF-8 or UTF-16, but there's external encoding info?
If there is external encoding info, matching the actual encoding, it would be well-formed. Of course, preserving that information would be up to the application.
This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
I'm still opposed to making this a codec. Right - for a pure Python solution, the processing of the XML declaration would still need to be implemented.
I think there could be a much simpler routine to have the same effect. - if it's less than 4 bytes, answer "need more data".
Can there be an XML document that is less than 4 bytes? I guess not.
No, the smallest document has exactly 4 characters (e.g. "<f/>"). However, external entities may be smaller, such as "x".
But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted?
I could agree to a Python implementation of this algorithm as long as it's not packaged as a codec. Regards, Martin
Martin v. Löwis wrote:
I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration, is not in UTF-8 or UTF-16, but there's external encoding info?
If there is external encoding info, matching the actual encoding, it would be well-formed. Of course, preserving that information would be up to the application.
OK. When the application passes an encoding to the decoder this is supposed to be the external encoding info, so for the decoder it makes sense to assume that the encoding passed to the encoder is the external encoding info and will be transmitted along with the encoded bytes.
This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
I'm still opposed to making this a codec. Right - for a pure Python solution, the processing of the XML declaration would still need to be implemented.
I think there could be a much simpler routine to have the same effect. - if it's less than 4 bytes, answer "need more data". Can there be an XML document that is less than 4 bytes? I guess not.
No, the smallest document has exactly 4 characters (e.g. "<f/>"). However, external entities may be smaller, such as "x".
But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted?
I could agree to a Python implementation of this algorithm as long as it's not packaged as a codec.
I still can't understand your objection to a codec. What's the difference between UTF-16 decoding and XML decoding? In fact PEP 263 IMHO does specify how to decode Python source, so in theory it could be a codec (in practice this probably wouldn't work because of bootstrapping problems). Servus, Walter
participants (8)
- "Martin v. Löwis"
- Adam Olsen
- Andrew McNamara
- Bill Janssen
- Fred Drake
- M.-A. Lemburg
- Stephen J. Turnbull
- Walter Dörwald