Mon Jan 18 10:37:10 CET 2010
Thanks for all the answers! I will try to sum up all ideas here.
(1) Change default open() behaviour or make it optional?
Guido would like to add an option and keep open() unchanged. He wrote that
checking for a BOM and using the system locale are too different to be the same
option.
Antoine would like to check for a BOM by default because, for him, both options
(system locale vs. checking for a BOM) are the same thing.
About Antoine's choice (encoding=None): which file modes would check for a BOM?
I would like to answer "only the read-only mode", but then open(filename, "r")
and open(filename, "r+") would behave differently?
=> 1 point for Guido (encoding="BOM" is more explicit)
Antoine's choice has the advantage of directly supporting UTF-8+BOM, UTF-16 and
UTF-32 for all applications and all modules using open(filename).
=> 1 point for Antoine
(2) Check for a BOM while reading or detect it before?
Everybody agrees that checking for a BOM is an interesting option and should
not be limited to open().
Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file
name or a binary file-like object: it returns the encoding and seeks to the
file start or just after the BOM.
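To make the proposal concrete, here is a rough sketch of what such a helper
could look like. It is only my reading of the idea: guess_file_encoding(), its
"default" parameter and the _BOMS table are hypothetical and do not exist in
the codecs module. The sketch only handles an already opened, seekable binary
file object.

```python
import codecs
import io

# Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_file_encoding(binary_file, default=None):
    """Hypothetical helper (not a real codecs function).

    Return the encoding announced by a BOM and leave the file positioned
    just after the BOM, or return `default` and rewind to where we started.
    """
    start = binary_file.tell()          # needs a seekable file
    head = binary_file.read(4)          # extra read()
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            binary_file.seek(start + len(bom))   # extra seek()
            return encoding
    binary_file.seek(start)
    return default


bio = io.BytesIO(b'\xff\xfea\x00b\x00')
encoding = guess_file_encoding(bio)
text = io.TextIOWrapper(bio, encoding=encoding).read()
```

The tell(), read() and seek() calls are the extra file operations, and the
seek() is what fails on a pipe.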
I dislike this function because it requires extra file operations (open
(optional), read() and seek()) and it doesn't work if the file is not seekable
(e.g. a pipe). I prefer to check for a BOM at the first read in TextIOWrapper
to avoid extra file operations.
Note: I implemented the BOM check in TextIOWrapper; so it's already usable for
any file-like object.
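The idea of checking at the first read can be illustrated outside
TextIOWrapper. This is not the code of my patch, only a simplified sketch:
read_text_detecting_bom(), NonSeekableBytes and the "utf-8" fallback are names
and choices made up for the example. It assumes a buffered binary object whose
read(4) returns 4 bytes unless the end of file is reached.

```python
import codecs
import io

# Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


class NonSeekableBytes(io.BytesIO):
    """Stand-in for a pipe: any attempt to seek or tell fails."""

    def seekable(self):
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

    def tell(self):
        raise io.UnsupportedOperation("tell")


def read_text_detecting_bom(binary_file, default="utf-8"):
    """Hypothetical helper: decode a whole stream in one pass, no seek()."""
    head = binary_file.read(4)               # the first read doubles as the BOM check
    encoding, skip = default, 0
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding, skip = name, len(bom)
            break
    decoder = codecs.getincrementaldecoder(encoding)()
    # Bytes read past the BOM are fed to the decoder instead of seeking back.
    data = head[skip:] + binary_file.read()
    return encoding, decoder.decode(data, final=True)


pipe = NonSeekableBytes(b'\xff\xfea\x00b\x00')
result = read_text_detecting_bom(pipe)
```

Because the bytes following the BOM are kept and decoded, the stream never has
to be rewound.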
(3) tell() and seek() on a text file starting with a BOM
To be consistent with Antoine's example:
>>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
>>> f = io.TextIOWrapper(bio, encoding='utf-16')
* tell() should return zero at file start,
* seek(0) should go back to file start,
* and the BOM should always be "ignored".
with open("utf8bom.txt", encoding="BOM") as fp:
    assert fp.tell() == 0
    text = fp.read()  # no BOM here
    fp.seek(0)        # rewind, otherwise the second read() returns ''
    assert fp.read() == text
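The same three expectations can be checked with codecs that already consume a
BOM, "utf-16" and "utf-8-sig", since encoding="BOM" is only the spelling
proposed here. The helper name check_bom_is_invisible() is made up for this
sketch.

```python
import codecs
import io


def check_bom_is_invisible(data, encoding):
    """Check tell()/seek(0)/read() on a text stream starting with a BOM."""
    f = io.TextIOWrapper(io.BytesIO(data), encoding=encoding)
    assert f.tell() == 0          # tell() is zero at file start
    text = f.read()               # the BOM is not part of the result
    f.seek(0)                     # go back to file start
    assert f.read() == text       # the BOM is skipped again after seek(0)
    return text


utf16_text = check_bom_is_invisible(b'\xff\xfea\x00b\x00', 'utf-16')
utf8_text = check_bom_is_invisible(codecs.BOM_UTF8 + b'ab', 'utf-8-sig')
```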
About my patch:
- BOM check is explicit: open(filename, encoding="BOM")
- tell() / seek(0) work as expected
- BOM bytes are always skipped in read() / readlines() result
Hum, I don't know if this email can be called a sum up ;-)