gagsl-py2 at yahoo.com.ar
Mon Sep 17 10:17:31 CEST 2007
On 17 sep, 02:55, "mensana... at aol.com" <mensana... at aol.com> wrote:
> On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> > On Sun, 16 Sep 2007 21:58:09 -0300, mensana... at aol.com
> > <mensana... at aol.com> wrote:
> > >> I'm eagerly awaiting publication of your professional specification
> > >> for correctly detecting the encoding of an arbitrary stream of
> > >> bytes
> > > The very presence of an algorithm to detect encoding is a bug.
> > > Files with the .txt extension should always be treated as ANSI
> > > even if they contain binary data.
> > Why ANSI?
> Because that's the absence of encoding?
Are you kidding?
> > Because it's convenient to *you*?
> No, it's ANSI unless told otherwise.
Oh, yes, it's a joke surely.
(Anyway, *which* ANSI standard? AFAIK, the Windows character set has
never been standardized by ANSI).
> > What about the rest of the world that don't speak
> > English or even worse, don't use the Latin alphabet?
> When the rest of the world creates the next
> generation of computers, THEY can choose the
> > What do you mean by "binary data"?
> 8-bit, ASCII is only 7-bit.
Being "binary" as opposed to "text" has nothing to do with the number
of bits. "¡Olé!" is text, and contains characters outside the ASCII
set. A signal with range 0-63 can be encoded into 6 bits, but it's
binary data, not text.
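
To make the distinction concrete, here is a minimal Python 3 sketch. The
variable names are only illustrative. It shows that the same text maps to
different byte sequences depending on the encoding, while a 6-bit signal
fits entirely in the 7-bit range and still is not text:

```python
# Text becomes bytes only through an encoding; bit width says nothing.
text = "¡Olé!"
latin1 = text.encode("latin-1")  # one byte per character for this string
utf8 = text.encode("utf-8")      # two bytes for each non-ASCII character

# A signal with range 0-63: every byte is below 128, yet it is not text.
signal = bytes([0, 17, 63, 42, 5])
all_seven_bit = all(b < 128 for b in signal)

print(latin1, utf8, all_seven_bit)
```

Whether bytes are "text" depends on what they are meant to represent, not
on whether the high bit is ever set.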
> > Notepad is not interpreting the file as
> > "binary", it's text,
> And will treat non-ASCII data as if it were ASCII.
I think you were complaining about the opposite situation.
> > but interpreted using the wrong encoding.
> So that's not a serious bug? To decide that a file
> is Unicode despite the absence of the appropriate
Which are "the appropriate markers"? A BOM is not always required, and
Notepad supported Unicode even before the BOM was invented.
Please redirect your bug reports to bugs at microsoft.com
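
On the BOM point: checking for one is easy, but its absence proves
nothing. A rough Python 3 sketch, where the helper name `sniff_bom` is
hypothetical and this is certainly not how Notepad does it:

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    # UTF-32 must be checked before UTF-16, because the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None  # no BOM: the encoding is still unknown, not "ANSI"
```

A `None` result only means "no marker found"; the file may still be
UTF-16 or UTF-8 without a BOM, which is why programs fall back to
guessing.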
> > As every character goes into the same code block the heuristic concludes
> > that the text is some Eastern language encoded in UTF16.
> But...but...Notepad doesn't have a UTF16 option.
What it calls "Unicode" is in fact UTF16, or UCS2 on some previous
> > This is the "Well you are speed" phrase interpreted as UTF16:
> > u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
> How can you tell from that that it's UTF16? If there's
> something stored in addition to those 18 bytes, you're
> being misleading.
*I* can tell it's not, but Notepad (which presumably calls
IsTextUnicode) cannot, and I can't blame it given such a small sample of
less than 20 bytes.
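
There is nothing stored besides those 18 bytes. A short Python 3 sketch
reproduces the string quoted above. It only performs the decoding step,
assuming little-endian byte order, and makes no claim about what
IsTextUnicode actually checks:

```python
raw = "Well you are speed".encode("ascii")  # 18 bytes, no BOM
as_utf16 = raw.decode("utf-16-le")          # pair up the bytes, low byte first

# Every resulting code point lands in the CJK Unified Ideographs block
# (U+4E00 to U+9FFF), so the result looks like plausible Han text.
in_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)

print(len(raw), ascii(as_utf16), in_cjk)
```

Two ASCII letters side by side form a 16-bit value in the 0x2020-0x7A7A
range, and for lowercase letters that is mostly Han territory, so a short
all-ASCII sample can easily pass for UTF-16.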
> > > Notepad should never be
> > > allowed to try to decide what the encoding is if the open
> > > dialog has the encoding set to ANSI.
> > I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
> > that's exactly what happens. I have to explicitly select Unicode in order
> > to see those Han characters.
> So which is worse, you having to tell it that it's
> Unicode or Notepad deciding on its own that a file
> is Unicode when it isn't.
I don't know, and I don't care, and I don't use Notepad.