[I18n-sig] Re: How does Python Unicode treat surrogates?

Machin, John JMachin@Colonial.com.au
Mon, 25 Jun 2001 23:51:29 +1000


Marc-Andre,

> I should have added "please correct me if I'm wrong", sorry.

I'm sorry too; I didn't intend to be rude; it's just
that I normally operate under a protocol where that
licence ("please correct me if I'm wrong") is the
default and doesn't need to be stated explicitly in each paragraph.

> Say you have a Unicode string which contains the following data:
>
>        U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066
>       ("a"    "b"    "c"    ?      "d"    "e"    "f")
>
> Would you consider this sequence a Unicode string or not ? 

I think you are using "Unicode string" with two different meanings here.

However, the pragmatic question is what Python should do when given
such a sequence.
Do we permit such a sequence to be held internally as a "Unicode string"?
Is u"\udc00" legal in source code or should Python throw a syntax error?
Same question for u"\uffff".
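
For illustration only, here is a minimal sketch of how the questions above
play out, assuming a present-day Python 3 interpreter rather than the one
this thread was written against. The helper names has_surrogate and
utf8_strict_ok are hypothetical, introduced just for this example. Under
that assumption both literals are accepted and held in the string as-is;
it is the strict UTF-8 codec that rejects the lone surrogate.

```python
# Assumes Python 3, where every str is a sequence of Unicode code points
# and the u"" prefix is optional.

# The sequence from the example: a lone low surrogate between "abc" and "def".
s = "abc\udc00def"


def has_surrogate(text):
    """Hypothetical helper: True if any code point lies in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def utf8_strict_ok(text):
    """Hypothetical helper: True if the default (strict) UTF-8 codec accepts text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(len(s))                    # the surrogate counts as one item
print(has_surrogate(s))          # the string does hold a surrogate code point
print(utf8_strict_ok(s))         # strict UTF-8 encoding refuses it
print(utf8_strict_ok("\uffff"))  # the noncharacter U+FFFF is encoded without complaint

# The "surrogatepass" error handler encodes the surrogate anyway.
print(s.encode("utf-8", "surrogatepass"))
```

So the string type itself is permissive, and strictness is a property of
the individual codec and error handler chosen at the boundary.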

We *do* need to consider UTF encodings, because Unicode *expressly* allows
decoding UTF sequences that become unpaired surrogates, or other "not 100%
valid" scalars such as 0xffff and 0xfffe. So, given that Python supports
Unicode, not ISO 10646, we must IMO permit such sequences in our internal
representation. It follows that we should stop worrying about these
irregular values -- it's less programming that way. Unicode 3.1 will create
enough extra programming as it is, because we now have variable-length
characters again -- just what Unicode was going to save us from :-(
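
To make "variable-length characters" concrete, here is a rough sketch of
the UTF-16 surrogate arithmetic: a code point above U+FFFF is split into a
high and a low 16-bit unit, and the pair is recombined on the way back. The
function names are hypothetical and chosen for this example only; the
offsets are the standard UTF-16 ones.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

A value such as 0xDC00 on its own fails the pair check in
from_surrogate_pair, which is exactly the lone-surrogate case under
discussion.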

Cheers,
John

-----Original Message-----
From: M.-A. Lemburg [mailto:mal@lemburg.com]
Sent: Monday, 25 June 2001 22:56
To: Machin, John
Cc: 'Gaute B Strokkenes'; Tim Peters; i18n-sig@python.org;
unicode@unicode.org
Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates?


"Machin, John" wrote:
> 
> MAL and Gaute,
> 
> Can I please take the middle ground (and risk having both of you throw
> things at me)?

Sure :-)
 
> => Lone surrogates are not 'true Unicode char points
>  in their own right' [MAL] -- they don't represent characters.

I should have added "please correct me if I'm wrong", sorry.

Let me put this into an example:
Say you have a Unicode string which contains the following data:

        U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066
       ("a"    "b"    "c"    ?      "d"    "e"    "f")

Would you consider this sequence a Unicode string or not ? Please
note that I am not talking about some UTF-n encoding here. The
above snippet is simply to be seen as sequence of data entries
which are referenced by the Unicode database.

> On the other hand, UTF code sequences that would decode into lone
> surrogates are not "illegal".
> Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is
> further clarified by Unicode 3.1, which expressly lists legal UTF-8
> sequences; these encompass lone surrogates.
> 
> -----Original Message-----
> From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk]
> Sent: Monday, 25 June 2001 22:04
> To: M.-A. Lemburg
> Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org
> Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
> 
> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments.]
> 
> On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > Tim Peters wrote:
> >>
> >> [M.-A. Lemburg]
> >> > ...
> >> > 2. What to do when slicing of Unicode strings would break
> >> >    a surrogate pair ?
> >>
> >> To me a string is a sequence of characters, and s[0] returns the
> >> first, s[1] the second, and so on.  The internal details of how the
> >> implementation chooses to torture itself <0.7 wink> should be
> >> invisible.  That is, breaking a surrogate via slicing should be
> >> impossible: s[i:j] returns j-i characters, and that's that.
> >
> > It's not that simple: lone surrogates are true Unicode char points
> > in their own right; it's just that they are pretty useless without
> > their resp. partners in the data stream. And with this "feature"
> > they are in good company: the Unicode combining characters (e.g. the
> > combining acute) have the same property.
> 
> This is completely and totally wrong.  The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.
> 
> The precise definition of "illegal" in this context is given
> elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:
> 
>   0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
>   value of the right form, it is illegal.
> 
> (Unicode here should read UTF-16, of course.  The reason it does not
> is that the language of the technical report has not been updated to
> that of 3.1)
> 
> --
> Big Gaute                               http://www.srcf.ucam.org/~gs234/
> Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..
> 

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/