[Python-Dev] email package status in 3.X

Terry Reedy tjreedy at udel.edu
Tue Jun 22 03:46:27 CEST 2010


On 6/21/2010 2:46 PM, P.J. Eby wrote:

> This ignores the existence of use cases where what you have is text that
> can't be properly encoded in unicode.

I think it depends on what you mean by 'properly'. I will try to explain 
with English examples.

1. Unicode represents a finite set of characters and symbols and a few 
control or markup operators. Since the potential set is unbounded, 
Unicode also includes a Private Use Area. I include use of that area in 
'properly'. I suspect the statement above does not, since any byte or 
short byte sequence that does not translate can instead be mapped into 
the Private Use Area.

2. Unicode disclaims direct representation of font and style 
information, leaving that to markup either in or out of the text stream. 
(It made an exception for Japanese narrow and wide ASCII characters, 
which I consider to be essentially duplicate font variations of the 
normal ASCII codes.) HTML uses both in-band markup and out-of-band 
markup (CSS). Stripping markup is a loss of information; if one wants 
it, one must keep it in one form or another.

I believe that some early editors such as WordStar used high-bit-set 
bytes to toggle bold, underline, and italic. Assuming I have the example 
right, can WordStar text be 'properly encoded in unicode'? If that means 
replacing each format-markup byte with a single defined character in the 
Basic Multilingual Plane, then no. If it allows replacement by <bold>, 
</bold>, and so on, then yes.
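A toy version of that second answer, assuming (purely for illustration, I have not checked the actual WordStar file format) that a single control byte 0x02 toggles bold on and off:

```python
# Hypothetical: byte 0x02 toggles bold in a WordStar-like file.
# Translating the in-band control byte to markup tags preserves the
# formatting information in plain Unicode text.

def wordstar_bold_to_markup(data: bytes) -> str:
    parts, bold = [], False
    for b in data:
        if b == 0x02:                       # assumed toggle byte
            parts.append('</bold>' if bold else '<bold>')
            bold = not bold
        else:
            parts.append(chr(b))
    return ''.join(parts)

print(wordstar_bold_to_markup(b'plain \x02bold\x02 text'))
# -> plain <bold>bold</bold> text
```

Nothing is lost; the information has merely moved from a private one-byte convention to explicit markup.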

3. Unicode disclaims direct representation of glyphic variants (though 
again, exceptions were made for Asian acceptance). For example, in 
English, mechanically printed 'a' and 'g' differ from manually printed 
'a' and 'g'. Representing both by the same codepoint, in itself, loses 
information. One who wishes to preserve the distinction must instead use 
a font tag or perhaps a <handprinted> tag. Similarly, older English had 
a significantly different glyph for 's' (the long s), which looks more 
like a modern 'f'.

If IBM's EBCDIC had had codes for these glyph variants, IBM might have 
insisted that Unicode include them too, so that character-for-character 
round-tripping would be possible. It did not, and Unicode does not. 
(WordStar and the other 1980s editor publishers were mostly defunct or 
too weak to be in a position to make such demands.)

If one wants to write on the history of glyph evolution, say of Latin 
characters, one must either number the variants 'e-0', 'e-1', etc., or 
resort to the Private Use Area. In either case, proprietary software 
would be needed to actually print the variants alongside other text.

> I know, it's a hard thing to wrap
> one's head around, since on the surface it sounds like unicode is the
> programmer's savior. Unfortunately, real-world text data exists which
> cannot be safely roundtripped to unicode,

I do not believe that. Digital information can always be recoded one way 
or another. As it is, the rules were bent for Japanese, in a way they 
were not for English, to aid round-tripping of the major public 
encodings. I can, however, believe that there are private encodings for 
which round-tripping is more difficult. But there are also difficulties 
for old proprietary, and even private, English encodings.
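For what it is worth, Python 3 already ships one general answer to this: the 'surrogateescape' error handler from PEP 383, which smuggles undecodable bytes through str as lone surrogates U+DC80-U+DCFF and restores them exactly on encoding. A small demonstration:

```python
# PEP 383: lossless round-tripping of arbitrary bytes through str.
# Bytes that do not decode become lone surrogates and encode back
# to the original bytes unchanged.

raw = b'ascii \xfe\xff mixed'
text = raw.decode('ascii', 'surrogateescape')

# The undecodable bytes 0xFE and 0xFF appear as U+DCFE and U+DCFF.
assert '\udcfe' in text and '\udcff' in text

# Encoding with the same handler restores the original bytes exactly.
assert text.encode('ascii', 'surrogateescape') == raw
```

The resulting str is not valid Unicode for interchange (lone surrogates cannot be written as UTF-8), but within a program it round-trips any byte sequence.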


-- 
Terry Jan Reedy
