Unicode byte order mark decoding
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.
>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
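For reference, the same asymmetry is still reproducible today; a minimal Python 3 sketch of the session above (bytes literals instead of Python 2 str):

```python
import codecs

# The UTF-8 decoder keeps the BOM as U+FEFF (ZERO WIDTH NO-BREAK SPACE)...
utf8_result = codecs.BOM_UTF8.decode("utf-8")

# ...while the auto-endian UTF-16 decoder consumes the BOM entirely.
utf16_result = codecs.BOM_UTF16.decode("utf-16")

print(repr(utf8_result))   # '\ufeff'
print(repr(utf16_result))  # ''
```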
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character? The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string. This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. [...] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else:
        return s.decode( "ascii" )

As a second issue, the UTF-16LE and UTF-16BE decoders almost do the right thing: they turn the BOM into a character, just like the Unicode specification says they should.
>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)
My interpretation of the specification is that Python should silently remove the character, resulting in a zero length Unicode string. Similarly, both of the following lines should also result in a zero length Unicode string:
>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
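For what it's worth, current Python 3 still takes the permissive route here: the UTF-16 decoder strips the leading BOM and then passes noncharacters through unchanged, neither removing them nor raising. A quick check (same byte sequences as above, as bytes literals):

```python
# LE BOM (FF FE) followed by the noncharacters U+FFFE and U+FFFF.
s1 = b"\xff\xfe\xfe\xff".decode("utf-16")
s2 = b"\xff\xfe\xff\xff".decode("utf-16")

print(repr(s1))  # '\ufffe' -- the noncharacter survives decoding
print(repr(s2))  # '\uffff'
```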
Thanks for your feedback, Evan Jones
Evan Jones wrote:
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.
>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character?
The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files). It is not needed for the UTF-8 because that format doesn't rely on the byte order and the BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width non breakable space) code point. The "utf-16" codec detects and removes the mark, while the two others "utf-16-le" (little endian byte order) and "utf-16-be" (big endian byte order) don't.
The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string.
-1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.
This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. [...] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:
def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else:
        return s.decode( "ascii" )
Well, I'd say that's a very English way of dealing with encoded text ;-) BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?
As a second issue, the UTF-16LE and UTF-16BE decoders almost do the right thing: they turn the BOM into a character, just like the Unicode specification says they should.
>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)
My interpretation of the specification is that Python should silently remove the character, resulting in a zero length Unicode string. Similarly, both of the following lines should also result in a zero length Unicode string:
>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?! Other than that: +1 on fixing this case. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 01 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).
Well, its origins do not really matter, since at this point the BOM is firmly encoded in the Unicode standard. It seems to me that it is in everyone's best interest to support it.
It is not needed for the UTF-8 because that format doesn't rely on the byte order and the BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width non breakable space) code point.
You are correct: it is a legitimate character. However, its use as a ZWNBSP character has been deprecated:
The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics is intended.
Also, the Unicode specification is ambiguous on what an implementation should do about a leading ZWNBSP that is encoded in UTF-8. Like I mentioned, if you look at the Unicode standard, version 4, section 15.9, it says:
2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.
This seems to indicate that it is permitted to strip the BOM from the beginning of UTF-8 text.
-1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.
This is clearly incorrect. The UTF-8 BOM is specified in the Unicode standard, version 4, section 15.9:
In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.
Many Windows applications produce UTF-8 files with a BOM when you save a text file as UTF-8. I think that Notepad or WordPad does this, for example. I think UltraEdit also does the same thing. I know that Scintilla definitely does.
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic.

Well, I'd say that's a very English way of dealing with encoded text ;-)
Please note I am saying only that something like this may be considered for addition to the documentation, and not to the Python standard library. This example function more closely replicates the logic that is used in those Windows applications when opening ".txt" files. It uses the default locale if there is no BOM:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else:
        return s.decode()
BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?
Well, the same argument could be applied to the UTF-16 decoder: how does it know that the string came from the start of a file, and not from slicing some already loaded file? The standard states that:
In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.
So it is perfectly permissible to perform this type of processing if you consider a string to be equivalent to a stream.
My interpretation of the specification is that Python should silently remove the character, resulting in a zero length Unicode string.

Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?!
Well, either one is possible, however the Unicode standard suggests, but does not require, silently removing them:
It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters.
I would prefer silently ignoring them from the str.decode() function, since I believe in "be strict in what you emit, but liberal in what you accept." I think that this only applies to str.decode(). Any other attempt to create non-characters, such as unichr( 0xffff ), *should* raise an exception because clearly the programmer is making a mistake.
Other than that: +1 on fixing this case.
Cool! Evan Jones
"MAL" == M
writes:
MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).

The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds them to existing UTF-8 files lacking them.

MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
MAL> codecs module was probably a mistake to begin with. You
MAL> usually only get UTF-8 files with BOM marks as the result of
MAL> recoding UTF-16 files into UTF-8.

There is a standard for UTF-8 _signatures_, however. I don't have the most recent version of the ISO-10646 standard, but Amendment 2 (which defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to Annex F of that standard. Evan quotes Version 4 of the Unicode standard, which explicitly defines the UTF-8 signature.

So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip it is a good idea. However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode).

MAL> BTW, how do you know that s came from the start of a file and
MAL> not from slicing some already loaded file somewhere in the
MAL> middle ?

The programmer or the application might, but Python's codecs don't. The point is that this is also true of raw strings that happen to contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec shouldn't strip leading BOMs either, unless it has been told it has the beginning of the string.

MAL> Evan Jones wrote:
>> This is *not* a valid Unicode character. The Unicode
>> specification (version 4, section 15.8) says the following
>> about non-characters:
>>
>>> Applications are free to use any of these noncharacter code
>>> points internally but should never attempt to exchange them.
>>> If a noncharacter is received in open interchange, an
>>> application is not required to interpret it in any way. It is
>>> good practice, however, to recognize it as a noncharacter and
>>> to take appropriate action, such as removing it from the text.
>>> Note that Unicode conformance freely allows the removal of
>>> these characters. (See C10 in Section 3.2, Conformance
>>> Requirements.)
>>
>> My interpretation of the specification is that Python should
>> silently remove the character, resulting in a zero length
>> Unicode string.

The specification _permits_ silent removal; it does not recommend it.

>> Similarly, both of the following lines should also result in a
>> zero length Unicode string:
>>
>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
>> u'\ufffe'
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
>> u'\uffff'

I strongly disagree; these decisions should be left to a higher layer. In the case of specified UTFs, the codecs should simply invert the UTF to Python's internal encoding.

MAL> Hmm, wouldn't it be better to raise an error ? After all, a
MAL> reversed BOM mark in the stream looks a lot like you're
MAL> trying to decode a UTF-16 stream assuming the wrong byte
MAL> order ?!

+1 on (optionally) raising an error. -1 on removing it or anything like that, unless under control of the application (ie, the program written in Python, not Python itself). It's far too easy for software to generate broken Unicode streams[1], and the choice of how to deal with those should be with the application, not with the implementation language.

Footnotes:
[1] An egregious example was the Outlook Express distributed with early Win2k betas, which produced MIME bodies with apparent Content-Type: text/html; charset=utf-16, but the HTML tags and newlines were 7-bit ASCII!
-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea.
I would personally like to see an "utf-8-bom" codec (perhaps better named "utf-8-sig"), which strips the BOM on reading (if present) and generates it on writing.
However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode).
With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice. Regards, Martin
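A codec along these lines did land in Python under the name "utf-8-sig" (Python 2.5 and later); a minimal sketch of its behavior:

```python
import codecs

# Encoding with "utf-8-sig" prepends the UTF-8 signature...
data = "hi".encode("utf-8-sig")
print(data)  # b'\xef\xbb\xbfhi'
assert data[:3] == codecs.BOM_UTF8

# ...and decoding strips the signature if present, but does not require it.
print(data.decode("utf-8-sig"))   # 'hi'
print(b"hi".decode("utf-8-sig"))  # 'hi'
```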
Martin v. Löwis wrote:
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea.
I would personally like to see an "utf-8-bom" codec (perhaps better named "utf-8-sig"), which strips the BOM on reading (if present) and generates it on writing.
+1.
However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode).
With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in the UTF-16 codec: it removes the BOM mark on the first call to the StreamReader .decode() method and writes a BOM mark on the first call to .encode() on a StreamWriter. Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 05 2005)
M.-A. Lemburg wrote:
[...] With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in the UTF-16 codec: it removes the BOM mark on the first call to the StreamReader .decode() method and writes a BOM mark on the first call to .encode() on a StreamWriter.
Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.
I've started writing such a codec. Making the BOM optional on decoding definitely simplifies the implementation. Bye, Walter Dörwald
Walter Dörwald sagte:
M.-A. Lemburg wrote:
[...] With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in the UTF-16 codec: it removes the BOM mark on the first call to the StreamReader .decode() method and writes a BOM mark on the first call to .encode() on a StreamWriter.
Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.
I've started writing such a codec. Making the BOM optional on decoding definitely simplifies the implementation.
OK, here is the patch: http://www.python.org/sf/1177307 The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters. A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now. Bye, Walter Dörwald
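A final argument of exactly this kind later became part of the codecs incremental decoder API (Python 2.5+): IncrementalDecoder.decode(input, final=False). A sketch of how it resolves the buffering problem for the "utf-8-sig" decoder:

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8-sig")()

# b'\xef' could still be the start of a BOM, so nothing is emitted yet...
assert dec.decode(b"\xef") == ""

# ...but once the bytes can no longer be a BOM, they decode normally.
assert dec.decode(b"\xbb\xbfhi") == "hi"

# final=True tells the decoder the stream has ended, so any pending
# bytes must be handled now (here the buffer is already empty).
assert dec.decode(b"", final=True) == ""
```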
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.
Shouldn't the decoder be capable of doing a partial match and quitting early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is <ef> <bb> <bf>. If it did this type of partial matching, this issue would be avoided except in rare situations.
A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.
This functionality is provided by a flush() method on similar objects, such as the zlib compression objects. Evan Jones
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
This functionality is provided by a flush() method on similar objects, such as the zlib compression objects.
Or by close() on other objects (htmllib, HTMLParser, the SAX incremental parser, etc.). Too bad there's more than one way to do it. :-( -Fred -- Fred L. Drake, Jr. <fdrake at acm.org>
Evan Jones sagte:
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.
Shouldn't the decoder be capable of doing a partial match and quitting early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is <ef> <bb> <bf>. If it did this type of partial matching, this issue would be avoided except in rare situations.
A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.
This functionality is provided by a flush() method on similar objects, such as the zlib compression objects.
Theoretically the name is unimportant, but read(..., final=True) or flush() or close() should subject the pending bytes to normal error handling and must return the result of decoding these pending bytes just like the other methods do. This would mean that we would have to implement a decodeclose(), a readclose() and a readlineclose(). IMHO it would be best to add this argument to decode, read and readline directly. But I'm not sure what this would mean for iterating through a StreamReader. Bye, Walter Dörwald
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.
This can be improved, of course: If the first byte is "a", it most definitely is *not* a UTF-8 signature. So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes *anyway*, so we can always decide with the first byte alone whether we need to wait for three bytes.
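This rule — keep buffering only while the pending bytes could still grow into a full signature — can be sketched as a tiny predicate (must_buffer is a hypothetical name, not a stdlib API):

```python
import codecs

def must_buffer(pending: bytes) -> bool:
    """True if `pending` might still grow into a full UTF-8 BOM,
    so the decoder cannot yet decide whether to strip it."""
    return len(pending) < 3 and codecs.BOM_UTF8.startswith(pending)

assert must_buffer(b"")              # nothing seen yet
assert must_buffer(b"\xef")          # could still become EF BB BF
assert must_buffer(b"\xef\xbb")
assert not must_buffer(b"a")         # 'a' can never start a BOM
assert not must_buffer(b"\xef\xbb\xbf")  # full BOM: decision can be made
```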
A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.
Shouldn't an empty read from the underlying stream be taken as an EOF? Regards, Martin
Martin v. Löwis sagte:
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.
This can be improved, of course: If the first byte is "a", it most definitely is *not* a UTF-8 signature.
So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes *anyway*, so we can always decide with the first byte only whether we need to wait for three bytes.
OK, I've updated the patch so that the first bytes will only be kept in the buffer if they are a prefix of the BOM.
A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.
Shouldn't an empty read from the underlying stream be taken as an EOF?
There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report. Bye, Walter Dörwald
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report.
Yes, but these are not file-like objects. In the IncrementalParser, it is *not* the case that a read operation returns an empty string. Instead, the application repeatedly feeds data explicitly. For a file-like object, returning "" indicates EOF. Regards, Martin
Martin v. Löwis sagte:
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report.
Yes, but these are not file-like objects.
True, on the outside there are no file-like objects. But the IncrementalParser gets passed the XML bytes in chunks, so it has to use a stateful decoder for decoding. Unfortunately this means that it has to use a stream API. (See http://www.python.org/sf/1101097 for a patch that somewhat fixes that.) (Another option would be to completely ignore the stateful API and handcraft stateful decoding (or only support stateless decoding), like most XML parsers for Python do now.)
In the IncrementalParser, it is *not* the case that a read operation returns an empty string. Instead, the application repeatedly feeds data explicitly.
That's true, but the parser has to wrap this data into an object that can be passed to the StreamReader constructor. (See the Queue class in Lib/test/test_codecs.py for an example.)
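In later Python versions the Queue-plus-StreamReader workaround became unnecessary: an incremental decoder can be fed chunks directly, holding incomplete multi-byte sequences between calls. A minimal sketch:

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# U+00E9 is the two-byte sequence C3 A9; feed it one byte at a time.
assert dec.decode(b"\xc3") == ""       # incomplete sequence is buffered
assert dec.decode(b"\xa9") == "\xe9"   # completed on the next chunk
```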
For a file-like object, returning "" indicates EOF.
Not necessarily. In the example above the IncrementalParser gets fed a chunk of data, it stuffs this data into the Queue, so that the StreamReader can decode it. Once the data from the Queue is exhausted, there won't be any further data until the user calls feed() on the IncrementalParser again. Bye, Walter Dörwald
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.
I've actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP-100 says:

    'utf-16': 16-bit variable length encoding (little/big endian)

and:

    Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least in the current Unicode standard. 'utf-16', as defined by the Unicode standard, is big-endian in the absence of a BOM:

---
3.10.D42: UTF-16 encoding scheme:
...
* The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.
---

The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.

-- Nick
Nicholas Bastin wrote:
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.
I've actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP-100 says:
'utf-16': 16-bit variable length encoding (little/big endian)
and:
Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
But this appears to be in error, at least in the current unicode standard. 'utf-16', as defined by the unicode standard, is big-endian in the absence of a BOM:
--- 3.10.D42: UTF-16 encoding scheme: ... * The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian. ---
The problem is "in the absence of a higher level protocol": the codec doesn't know anything about a protocol - it's the application using the codec that knows which protocol gets used. It's a lot safer to require the BOM for UTF-16 streams and raise an exception to have the application decide whether to use UTF-16-BE or the by far more common UTF-16-LE. Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration parameter, not merely a signature.

In terms of history, I don't recall whether your quote was already in the standard at the time I wrote the PEP. You are the first to have reported a problem with the current implementation (which has been around since 2000), so I believe that application writers are more comfortable with the way the UTF-16 codec is currently implemented. Explicit is better than implicit :-)
The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it doesn't write a BOM before reading from the file.
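This write-side behavior is easy to observe (a sketch; which BOM bytes appear depends on the platform's native endianness):

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)

writer.write("hi")   # first write: BOM is emitted, then the data
writer.write("hi")   # subsequent writes: no further BOM

data = buf.getvalue()
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert len(data) == 2 + 2 * len("hihi")  # exactly one BOM in the stream
```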
I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.
See above. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 07 2005)
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it doesn't write a BOM before reading from the file.
Yes, see, I read a *lot* of UTF-16 that comes from other sources. It's not a matter of writing with python and reading with python. -- Nick
Nicholas Bastin wrote:
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it doesn't write a BOM before reading from the file.
Yes, see, I read a *lot* of UTF-16 that comes from other sources. It's not a matter of writing with python and reading with python.
Ok, but I don't really follow you here: you are suggesting to relax the current UTF-16 behavior and to start defaulting to UTF-16-BE if no BOM is present - that's most likely going to cause more problems than it seems to solve: namely complete garbage if the data turns out to be UTF-16-LE encoded and, what's worse, garbage that enters the application undetected.

If you do have UTF-16 without a BOM, it's much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or -BE.

-- Marc-Andre Lemburg
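Marc-Andre's "short function" could look something like the following sketch. The BOM check is straight from the thread; the NUL-position heuristic for BOM-less data is my assumption (it only works for mostly-ASCII text) and is not something Marc-Andre specified:

```python
import codecs

def split_utf16(data):
    """Return (codec_name, payload) guessed from raw UTF-16 bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:
        return "utf-16-be", data[2:]
    if data[:2] == codecs.BOM_UTF16_LE:
        return "utf-16-le", data[2:]
    # No BOM: guess from NUL placement. Big-endian encodes 'A' as
    # b'\x00A', little-endian as b'A\x00'.
    if data[:1] == b"\x00":
        return "utf-16-be", data
    return "utf-16-le", data
```

The caller then decodes the payload with the explicit codec, e.g. `codec, payload = split_utf16(raw); text = payload.decode(codec)`.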
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
Ok, but I don't really follow you here: you are suggesting to relax the current UTF-16 behavior and to start defaulting to UTF-16-BE if no BOM is present - that's most likely going to cause more problems than it seems to solve: namely complete garbage if the data turns out to be UTF-16-LE encoded and, what's worse, garbage that enters the application undetected.
The crux of my argument is that the spec declares that UTF-16 without a BOM is BE. If a file is encoded in UTF-16LE and has no BOM, it doesn't deserve to be processed correctly. That being said, treating LE data as UTF-16BE will produce a lot of invalid code points, so it should be obvious that something has gone wrong.
If you do have UTF-16 without a BOM, it's much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or -BE.
This is about what we do now - we catch UnicodeError, add a BOM to the file, and read it again. We know our files are UTF-16BE if they don't have a BOM, as the files are written by code which observes the spec. We can't use UTF-16BE all the time, because sometimes they're UTF-16LE, and in those cases the BOM is set.

It would be nice if you could optionally specify that the codec should assume UTF-16BE if no BOM is present, and not raise UnicodeError in that case. That would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.

I'm not saying that you can't work around the issue now; what I'm saying is that you shouldn't *have* to. There is a reasonable expectation that the UTF-16 codec conforms to the spec; it is the users who want something else who should have to come up with a workaround. -- Nick
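Nick's prepend-the-BOM workaround can be sketched as below. The helper name is hypothetical, and the explicit BOM check stands in for catching UnicodeError (the stateless decoder in later Python versions guesses a byte order rather than raising):

```python
import codecs

def read_utf16_be_default(raw):
    # BOM-less files from spec-observing writers are big-endian, so give
    # them the BOM the utf-16 codec wants to see before decoding.
    if raw[:2] not in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        raw = codecs.BOM_UTF16_BE + raw
    return raw.decode("utf-16")
```

Files that do carry a BOM (including UTF-16LE ones) pass through unchanged and decode according to their BOM.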
Nicholas Bastin wrote:
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
[...]
If you do have UTF-16 without a BOM, it's much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or -BE.
This is about what we do now - we catch UnicodeError and then add a BOM to the file, and read it again. We know our files are UTF-16BE if they don't have a BOM, as the files are written by code which observes the spec. We can't use UTF-16BE all the time, because sometimes they're UTF-16LE, and in those cases the BOM is set.
It would be nice if you could optionally specify that the codec should assume UTF-16BE if no BOM is present, and not raise UnicodeError in that case, which would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.
It should be feasible to implement your own codec for that based on Lib/encodings/utf_16.py. Simply replace the line in StreamReader.decode():

    raise UnicodeError,"UTF-16 stream does not start with BOM"

with:

    self.decode = codecs.utf_16_be_decode

and you should be done.
[...]
Bye, Walter Dörwald
Walter Dörwald wrote:
Nicholas Bastin wrote:
It should be feasible to implement your own codec for that based on Lib/encodings/utf_16.py. Simply replace the line in StreamReader.decode():

    raise UnicodeError,"UTF-16 stream does not start with BOM"

with:

    self.decode = codecs.utf_16_be_decode

and you should be done.
Oops, this only works if you have a big-endian system. Otherwise you have to redecode the input with:

    codecs.utf_16_ex_decode(input, errors, 1, False)

Bye, Walter Dörwald
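Walter's two messages can be put together as a standalone function. A sketch, under the assumption that `codecs.utf_16_ex_decode` returns `(text, consumed, byteorder)` with byteorder 0 meaning no BOM was seen (as it does in CPython):

```python
import codecs

def decode_utf16_default_be(data, errors="strict"):
    """Decode UTF-16: honour a BOM if present, default to BE otherwise."""
    text, consumed, byteorder = codecs.utf_16_ex_decode(data, errors, 0, True)
    if byteorder == 0:
        # No BOM found: redecode forcing big-endian (byteorder=1), the
        # Unicode 4.0 default, instead of the machine's native order.
        text = codecs.utf_16_ex_decode(data, errors, 1, True)[0]
    return text
```

This avoids subclassing the stream reader entirely, at the cost of redecoding BOM-less input once.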
Nicholas Bastin wrote:
It would be nice if you could optionally specify that the codec should assume UTF-16BE if no BOM is present, and not raise UnicodeError in that case, which would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.
Alternatively, the UTF-16BE codec could support the BOM, and do UTF-16LE if the "other" BOM is found. This would also support your use case, and in a better way. The Unicode assertion that UTF-16 is BE by default is void these days - there is *always* a higher-layer protocol, and it more often than not specifies (perhaps not in English words, but only in the source code of the generator) that the default should be LE. Regards, Martin
Martin v. Löwis wrote:
Nicholas Bastin wrote:
It would be nice if you could optionally specify that the codec should assume UTF-16BE if no BOM is present, and not raise UnicodeError in that case, which would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.
Alternatively, the UTF-16BE codec could support the BOM, and do UTF-16LE if the "other" BOM is found.
That would violate the Unicode standard - the BOM character for UTF-16-LE and -BE must be interpreted as ZWNBSP.
This would also support your use case, and in a better way. The Unicode assertion that UTF-16 is BE by default is void these days - there is *always* a higher-layer protocol, and it more often than not specifies (perhaps not in English words, but only in the source code of the generator) that the default should be LE.
I've checked the various versions of the Unicode standard docs: it seems that the quote you have was silently introduced between 3.0 and 4.0. Python currently uses version 3.2.0 of the standard, and I don't think enough people are aware of the change in the standard to make a case for dropping the exception raised when the UTF-16 codec finds a stream without a BOM. By the time we switch to 4.1 or later, we can then make the change in the native UTF-16 codec as you requested.

Personally, I think that the Unicode consortium should not have introduced a default for the UTF-16 encoding byte order. Using big endian as default in a world where most Unicode data is created on little endian machines is not very realistic either. Note that the UTF-16 codec starts reading data in the machine's native byte order and then learns a possibly different byte order by looking for BOMs. Implementing a codec which implements the 4.0 behavior is easy, though.

-- Marc-Andre Lemburg
"MvL" == "Martin v. Löwis"
writes:
MvL> This would also support your use case, and in a better way.
MvL> The Unicode assertion that UTF-16 is BE by default is void
MvL> these days - there is *always* a higher layer protocol, and
MvL> it more often than not specifies (perhaps not in English
MvL> words, but only in the source code of the generator) that the
MvL> default should be LE.

That is _not_ a protocol. A protocol is a published specification, not merely a frequent accident of implementation. Anyway, both ISO 10646 and the Unicode standard consider that "internal use", and there is no requirement at all placed on those data. And such generators typically take great advantage of that freedom---have you looked in a .doc file recently? Have you noticed how many different options (previous implementations) of .doc are offered in the Import menu?
"MAL" == "M.-A. Lemburg"
writes:
MAL> I've checked the various versions of the Unicode standard MAL> docs: it seems that the quote you have was silently MAL> introduced between 3.0 and 4.0. Probably because ISO 10646 was _always_ BE until the standards were unified. But note that ISO 10646 standardizes only use as a communications medium. Neither ISO 10646 nor Unicode makes any specification about internal usage. Conformance in internal processing is a matter of the programmer's convenience in producing conforming output. MAL> Python currently uses version 3.2.0 of the standard and I MAL> don't think enough people are aware of the change in the MAL> standard There's only one (corporate) person that matters: Microsoft. MAL> By the time we switch to 4.1 or later, we can then make the MAL> change in the native UTF-16 codec as you requested. While in principle I sympathize with Nick, pragmatically Microsoft is unlikely to conform. They will take the position that files created by Windows are "internal" to the Windows environment, except where explicitly intended for exchange with arbitrary platforms, and only then will they conform. As Martin points out, that is what really matters for these defaults. I think you should look to see what Microsoft does. MAL> Personally, I think that the Unicode consortium should not MAL> have introduced a default for the UTF-16 encoding byte MAL> order. Using big endian as default in a world where most MAL> Unicode data is created on little endian machines is not very MAL> realistic either. It's not a default for the UTF-16 encoding byte order. It's a default for the UTF-16 encoding byte order _when UTF-16 is a communications medium_. Given that the generic network byte order is bigendian, I think it would be insane to specify littleendian as Unicode's default. 
MAL> I've checked the various versions of the Unicode standard
MAL> docs: it seems that the quote you have was silently
MAL> introduced between 3.0 and 4.0.

Probably because ISO 10646 was _always_ BE until the standards were unified. But note that ISO 10646 standardizes only use as a communications medium. Neither ISO 10646 nor Unicode makes any specification about internal usage. Conformance in internal processing is a matter of the programmer's convenience in producing conforming output.

MAL> Python currently uses version 3.2.0 of the standard and I
MAL> don't think enough people are aware of the change in the
MAL> standard

There's only one (corporate) person that matters: Microsoft.

MAL> By the time we switch to 4.1 or later, we can then make the
MAL> change in the native UTF-16 codec as you requested.

While in principle I sympathize with Nick, pragmatically Microsoft is unlikely to conform. They will take the position that files created by Windows are "internal" to the Windows environment, except where explicitly intended for exchange with arbitrary platforms, and only then will they conform. As Martin points out, that is what really matters for these defaults. I think you should look to see what Microsoft does.

MAL> Personally, I think that the Unicode consortium should not
MAL> have introduced a default for the UTF-16 encoding byte
MAL> order. Using big endian as default in a world where most
MAL> Unicode data is created on little endian machines is not very
MAL> realistic either.

It's not a default for the UTF-16 encoding byte order. It's a default for the UTF-16 encoding byte order _when UTF-16 is a communications medium_. Given that the generic network byte order is big-endian, I think it would be insane to specify little-endian as Unicode's default.

With Unicode same as network, you specify UTF-16 strings internally as an array of uint16_t, and when you put them on the wire (including saving them to a file that might be put on the wire as octet-stream) you apply htons(3) to it. On reading, you apply ntohs(3) to it. The source code is portable, the file is portable. How can you beat that? -- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.
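Stephen's htons/ntohs round trip can be sketched directly; `socket.htons` and `socket.ntohs` are Python's bindings for the C functions he names:

```python
import socket

unit = 0x0041                 # 'A' as a UTF-16 code unit, native order
wire = socket.htons(unit)     # big-endian (network) order for the wire
back = socket.ntohs(wire)     # native order again on read
# On a big-endian host both conversions are the identity; on a
# little-endian host they swap bytes. Either way the round trip holds.
assert unit == back
```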
"Martin" == Martin v Löwis
writes:
Martin> Stephen J. Turnbull wrote:
>> However, this option should be part of the initialization of an
>> IO stream which produces Unicodes, _not_ an operation on
>> arbitrary internal strings (whether raw or Unicode).

Martin> With the UTF-8-SIG codec, it would apply to all operation
Martin> modes of the codec, whether stream-based or from strings.

I had in mind the ability to treat a string as a stream.

Martin> Whether or not to use the codec would be the application's
Martin> choice.

What I think should be provided is a stateful object encapsulating the codec. Ie, to avoid the need to write

    out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
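As a historical aside: the stateful object Stephen asks for exists in later Pythons as an incremental encoder (`utf-8-sig` and `codecs.getincrementalencoder` arrived in Python 2.5, after this thread):

```python
import codecs

# The signature is emitted once, on the first call, so the caller need
# not treat chunk[0] specially.
encoder = codecs.getincrementalencoder("utf-8-sig")()
chunks = [u"Hallo ", u"Welt"]  # sample data, not from the thread
out = b"".join(encoder.encode(c) for c in chunks)
# out begins with the 3-byte UTF-8 signature, exactly once
```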
Stephen J. Turnbull wrote:
Martin> With the UTF-8-SIG codec, it would apply to all operation Martin> modes of the codec, whether stream-based or from strings.
I had in mind the ability to treat a string as a stream.
Hmm. A string is not a stream, but it could be the contents of a stream. A typical application of codecs goes like this:

    data = stream.read()

[analyze data, e.g. by checking whether there is encoding= in
Martin> Whether or not to use the codec would be the application's Martin> choice.
What I think should be provided is a stateful object encapsulating the codec. Ie, to avoid the need to write
out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
No. People who want streaming should use cStringIO, i.e.
>>> s = cStringIO.StringIO()
>>> s1 = codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Regards, Martin
"Martin" == Martin v Löwis
writes:
Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supported.

Of course it must be supported. My point is that many strings (in my applications, all but those strings that result from slurping in a file or process output in one go -- example, not a statistically valid sample!) are not the beginning of "what once was a stream". It is error-prone (not to mention unaesthetic) to not make that distinction. "Explicit is better than implicit."

Martin> Whether or not to use the codec would be the application's
Martin> choice.

>> What I think should be provided is a stateful object
>> encapsulating the codec. Ie, to avoid the need to write
>> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

Martin> No. People who want streaming should use cStringIO, i.e.
>>> s = cStringIO.StringIO()
>>> s1 = codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Yes! Exactly (except in reverse: we want to _read_ from the slurped stream-as-string, not write to one)! ... and there's no need for a utf-8-sig codec for strings, since you can support the usage in exactly this way.
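The reading direction Stephen describes looks like this (a sketch; it uses io.BytesIO in place of cStringIO, and the BOM strip is the explicit application-level step he argues for):

```python
import codecs
import io

raw = codecs.BOM_UTF8 + u"Hallo".encode("utf-8")  # bytes with a signature
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
text = reader.read()
# The plain utf-8 reader keeps the signature as U+FEFF; dropping it is
# then an explicit choice made by the application, not by the codec.
if text[:1] == u"\ufeff":
    text = text[1:]
```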
Stephen J. Turnbull wrote:
Of course it must be supported. My point is that many strings (in my applications, all but those strings that result from slurping in a file or process output in one go -- example, not a statistically valid sample!) are not the beginning of "what once was a stream". It is error-prone (not to mention unaesthetic) to not make that distinction.
"Explicit is better than implicit."
I can't put these two paragraphs together. If you think that explicit is better than implicit, why do you not want to make different calls for the first chunk of a stream, and the subsequent chunks?
>>> s = cStringIO.StringIO()
>>> s1 = codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Yes! Exactly (except in reverse, we want to _read_ from the slurped stream-as-string, not write to one)! ... and there's no need for a utf-8-sig codec for strings, since you can support the usage in exactly this way.
However, if there is a utf-8-sig codec for streams, there is currently no way of *preventing* this codec from also being available for strings. The very same code is used for streams and for strings, and automatically so. Regards, Martin
"Martin" == Martin v Löwis
writes:
Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Martin> and the subsequent chunks?

Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind. The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter's conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense. I don't know whether that's really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.

>> Yes! Exactly (except in reverse, we want to _read_ from the
>> slurped stream-as-string, not write to one)! ... and there's
>> no need for a utf-8-sig codec for strings, since you can
>> support the usage in exactly this way.

Martin> However, if there is an utf-8-sig codec for streams, there
Martin> is currently no way of *preventing* this codec to also be
Martin> available for strings. The very same code is used for
Martin> streams and for strings, and automatically so.

And of course it should be. But if it's not possible to move the -sig facility out of the codecs into the streams, that would be a shame. I think we should encourage people to use streams where initialization or finalization semantics are non-trivial, as they are with signatures. But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and utf-8-sig are available, I can program as I want to (and refer those whose strings get cratered by stray BOMs to you<wink>).
Stephen J. Turnbull wrote:
"Martin" == Martin v Löwis
writes: Martin> I can't put these two paragraphs together. If you think Martin> that explicit is better than implicit, why do you not want Martin> to make different calls for the first chunk of a stream, Martin> and the subsequent chunks?
Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.
The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.
I think it probably also would solve Walter's conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.
Not really. In every encoding where a sequence of more than one byte maps to one Unicode character, you will always need some kind of buffering. If we remove the handling of initial BOMs from the codecs (except for UTF-16 where it is required), this wouldn't change any buffering requirements.
I don't know whether that's really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.
I'm not exactly sure, what you're proposing here. That all codecs (even UTF-16) pass the BOM through and some other infrastructure is responsible for dropping it?
[...]
Bye, Walter Dörwald
"Walter" == Walter Dörwald
writes:
Walter> Not really. In every encoding where a sequence of more
Walter> than one byte maps to one Unicode character, you will
Walter> always need some kind of buffering. If we remove the
Walter> handling of initial BOMs from the codecs (except for
Walter> UTF-16 where it is required), this wouldn't change any
Walter> buffering requirements.

Sure. My point is that codecs should be stateful only to the extent needed to assemble semantically meaningful units (ie, multioctet coded characters). In particular, they should not need to know about location at the beginning, middle, or end of some stream---because in the context of operating on a string they _can't_.

>> I don't know whether that's really feasible in the short
>> run---I suspect there may be a lot of stream-like modules that
>> would need to be updated---but it would be saner in the long
>> run.

Walter> I'm not exactly sure what you're proposing here. That all
Walter> codecs (even UTF-16) pass the BOM through and some other
Walter> infrastructure is responsible for dropping it?

Not exactly. I think that at the lowest level codecs should not implement complex mode-switching internally, but rather explicitly abdicate responsibility to a more appropriate codec.
For example, autodetecting UTF-16 on input would be implemented by a Python program that does something like

    data = stream.read()
    for detector in ["utf-16-signature", "utf-16-statistical"]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec:
            break
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about it
        pass
    process(out)

where decode("utf-16-signature") would be implemented

    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":
            return (u"", data[2:], "utf-16-le")
        else:
            # note: data is undisturbed if the detector fails
            return (None, data, None)

The main point is that the detector is just a codec that stops when it figures out what the next codec should be, touches only data that would be incorrect to pass to the next codec, and leaves the data alone if detection fails. utf-16-signature only handles the BOM (if present), and does not handle arbitrary "chunks" of data. Instead, it passes on the rest of the data (including the first chunk) to be handled by the appropriate utf-16-?e codec. I think that the temptation to encapsulate this logic in a utf-16 codec that "simplifies" things by calling the appropriate utf-16-?e codec itself should be deprecated, but YMMV. What I would really like is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin's remark about different mental models I have to suspect this approach is just somehow un-Pythonic, but fleshing it out this way I can see how it will be useful in the context of a different project.
Stephen J. Turnbull wrote:
Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.
I'm sorry, but I'm losing track as to what precisely you are trying to say. You seem to be using a mental model that is entirely different from mine.
The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.
But what follows from that point? So it shows some kind of matter... what does that mean for actual changes to Python API?
I think it probably also would solve Walter's conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.
I don't know whether that's really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.
What is "that" which might be really feasible? To "solve Walter's conundrum"? That "signatures make sense"? So I can't really respond to your message in a meaningful way; I just let it rest... Regards, Martin
Stephen J. Turnbull wrote:
"MAL" == "M.-A. Lemburg"
writes: MAL> The BOM (byte order mark) was a non-standard Microsoft MAL> invention to detect Unicode text data as such (MS always uses MAL> UTF-16-LE for Unicode text files).
The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds them to existing UTF-8 files lacking them.
Is that an MS application? AFAIK, notepad, wordpad and MS Office always use UTF-16-LE + BOM when saving text as "Unicode text".
MAL> -1; there's no standard for UTF-8 BOMs - adding it to the MAL> codecs module was probably a mistake to begin with. You MAL> usually only get UTF-8 files with BOM marks as the result of MAL> recoding UTF-16 files into UTF-8.
There is a standard for UTF-8 _signatures_, however. I don't have the most recent version of the ISO-10646 standard, but Amendment 2 (which defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to Annex F of that standard. Evan quotes Version 4 of the Unicode standard, which explicitly defines the UTF-8 signature.
Ok, as a signature the BOM does make some sense - whether stripping signatures from a document is a good idea or not is a different matter, though. Here's the Unicode Consortium FAQ on the subject: http://www.unicode.org/faq/utf_bom.html#22 They also explicitly warn about adding BOMs to UTF-8 data, since it can break applications and protocols that do not expect such a signature.
So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea.
However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode).
Right.
MAL> BTW, how do you know that s came from the start of a file and MAL> not from slicing some already loaded file somewhere in the MAL> middle ?
The programmer or the application might, but Python's codecs don't. The point is that this is also true of rawstrings that happen to contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec shouldn't strip leading BOMs either, unless it has been told it has the beginning of the string.
The UTF-16 stream codecs implement this logic. The UTF-16 encode and decode functions will however always strip the BOM mark from the beginning of a string. If the application doesn't want this stripping to happen, it should use the UTF-16-LE or -BE codec resp.
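The asymmetry Marc-Andre describes can be seen directly (a quick illustration, using the hyphenated spelling of the codec names):

```python
import codecs

# The native utf-16 codec consumes a leading BOM; the explicit -LE/-BE
# variants decode those same bytes as U+FEFF (ZWNBSP) and keep them.
data = codecs.BOM_UTF16_LE + u"A".encode("utf-16-le")
print(repr(data.decode("utf-16")))     # BOM stripped
print(repr(data.decode("utf-16-le")))  # BOM kept as u'\ufeff'
```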
MAL> Evan Jones wrote:
>> This is *not* a valid Unicode character. The Unicode >> specification (version 4, section 15.8) says the following >> about non-characters: >> >>> Applications are free to use any of these noncharacter code >>> points internally but should never attempt to exchange >>> them. If a noncharacter is received in open interchange, an >>> application is not required to interpret it in any way. It is >>> good practice, however, to recognize it as a noncharacter and >>> to take appropriate action, such as removing it from the >>> text. Note that Unicode conformance freely allows the removal >>> of these characters. (See C10 in Section3.2, Conformance >>> Requirements.) >> >> My interpretation of the specification means that Python should
The specification _permits_ silent removal; it does not recommend.
>> silently remove the character, resulting in a zero length >> Unicode string. Similarly, both of the following lines should >> also result in a zero length Unicode string:
>>> '\xff\xfe\xfe\xff'.decode("utf16")
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode("utf16")
u'\uffff'
I strongly disagree; these decisions should be left to a higher layer. In the case of specified UTFs, the codecs should simply invert the UTF to Python's internal encoding.
MAL> Hmm, wouldn't it be better to raise an error ? After all, a MAL> reversed BOM mark in the stream looks a lot like you're MAL> trying to decode a UTF-16 stream assuming the wrong byte MAL> order ?!
+1 on (optionally) raising an error.
The advantage of raising an error is that the application can deal with the situation in whatever way seems fit (by registering a special error handler or by simply using "ignore" or "replace"). I agree that much of this lies outside the scope of codecs and should be handled at an application or protocol level.
-1 on removing it or anything like that, unless under control of the application (ie, the program written in Python, not Python itself). It's far too easy for software to generate broken Unicode streams[1], and the choice of how to deal with those should be with the application, not with the implementation language.
Footnotes: [1] An egregious example was the Outlook Express distributed with early Win2k betas, which produced MIME bodies with apparent Content-Type: text/html; charset=utf-16, but the HTML tags and newlines were 7-bit ASCII!
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 05 2005)
"MAL" == "M.-A. Lemburg"
writes:
MAL> Stephen J. Turnbull wrote:
>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.

MAL> Is that an MS application? AFAIK, notepad, wordpad and MS
MAL> Office always use UTF-16-LE + BOM when saving text as
MAL> "Unicode text".

Yes, it is an MS application. I'll have to borrow somebody's box to check, but IIRC UTF-8 is the native "text" encoding for Japanese now. (Japanized applications generally behave differently from everything else, as there are so many "standards" for encoding Japanese.)

M> The UTF-16 stream codecs implement this logic.
M> The UTF-16 encode and decode functions will however always
M> strip the BOM mark from the beginning of a string.
M> If the application doesn't want this stripping to happen, it
M> should use the UTF-16-LE or -BE codec resp.

That sounds like it would work fine almost all the time. If it doesn't, it's straightforward to work around, and certainly would be more convenient for the non-standards-geek programmer.
participants (7)
-
"Martin v. Löwis"
-
Evan Jones
-
Fred Drake
-
M.-A. Lemburg
-
Nicholas Bastin
-
Stephen J. Turnbull
-
Walter Dörwald