"Martin" == Martin v Löwis email@example.com writes:
Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Martin> and the subsequent chunks?
Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.
The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.
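To make the split concrete, here is a rough sketch of signature handling done once at stream initialization, with everything after it decoded as plain UTF-8. The names open_utf8_stream and skip_signature are hypothetical, invented for illustration, and the sketch assumes a seekable binary stream; it is not how the codecs module actually does it.

```python
import codecs
import io


def open_utf8_stream(raw, skip_signature=True):
    """Hypothetical helper: deal with the signature once, up front.

    Assumes `raw` is a seekable binary stream. After this call the
    caller sees homogeneous UTF-8 text, with no special first chunk.
    """
    start = raw.tell()
    head = raw.read(len(codecs.BOM_UTF8))
    if not (skip_signature and head == codecs.BOM_UTF8):
        # No signature, or the caller wants to keep it: rewind so the
        # bytes are decoded like any other data.
        raw.seek(start)
    # Plain UTF-8 from here on; the decoder knows nothing about signatures.
    return codecs.getreader("utf-8")(raw)


signed = io.BytesIO(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
text = open_utf8_stream(signed).read()
```

Whether the default is to skip or keep the signature is then a policy decision on the stream, made in one place, and the decoding calls that follow are identical for every chunk.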
I think it probably also would solve Walter's conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.
I don't know whether that's really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.
>> Yes! Exactly (except in reverse, we want to _read_ from the
>> slurped stream-as-string, not write to one)! ... and there's
>> no need for a utf-8-sig codec for strings, since you can
>> support the usage in exactly this way.
Martin> However, if there is an utf-8-sig codec for streams, there
Martin> is currently no way of *preventing* this codec to also be
Martin> available for strings. The very same code is used for
Martin> streams and for strings, and automatically so.
And of course it should be. But if it's not possible to move the -sig facility out of the codecs into the streams, that would be a shame. I think we should encourage people to use streams where initialization or finalization semantics are non-trivial, as they are with signatures.
But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and utf-8-sig are available, I can program as I want to (and refer those whose strings get cratered by stray BOMs to you<wink>).
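For illustration, a small sketch of the two choices side by side, assuming a Python in which a codec named utf-8-sig is available for both strings and streams (as it is in current releases). The variable names are just for the example.

```python
import codecs
import io

# A whole "file" slurped into a byte string, signature included.
data = codecs.BOM_UTF8 + "spam".encode("utf-8")

# Plain utf-8 treats the signature as ordinary data: it survives as U+FEFF.
plain = data.decode("utf-8")

# utf-8-sig strips a leading signature when decoding a string directly.
stripped = data.decode("utf-8-sig")

# The same choice made through a stream wrapped around the slurped string.
streamed = codecs.getreader("utf-8-sig")(io.BytesIO(data)).read()
```

The stray U+FEFF at the front of `plain` is the kind of thing that craters strings downstream, which is why the choice of codec has to be a deliberate one.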