[Python-Dev] Python and the Unicode Character Database

Thu Dec 2 19:55:31 CET 2010

On Thursday, 2 December 2010 at 13:14 -0500, Alexander Belopolsky wrote:
> > I don't understand why you think Arabic or Hebrew text is any different
> > from Western text. Surely right-to-left isn't more conceptually
> > complicated than left-to-right, is it?
> >
> 
> No, but a mix of LTR and RTL is certainly more difficult than either
> of the two.  I invite you to digest Unicode Standard Annex #9 before
> we continue this discussion.
> 
> See <http://unicode.org/reports/tr9/>.

“This annex describes specifications for the *positioning* of characters
flowing from right to left” (emphasis mine)

Looks like something for implementors of rendering engines, which
python-dev is not AFAICT.

> Same users may want to be able to cut and paste their decimals as
> well.  More importantly, however, legacy formats may not have support
> for mixed-direction text and may require that "John is 41" be stored
> as "41 si nhoJ" and Unicode converter would turn it into "[RTL]John is
> 14"  that will still display as  "41 si nhoJ", but int(s[-2:]) will
> return 14, not 41.

The legacy format argument looks like a red herring to me. When
converting from one format to another, it is the programmer's job to
do his/her job right.
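
For illustration, here is a minimal sketch of the scenario quoted above.
It assumes that "[RTL]" stands for U+202E RIGHT-TO-LEFT OVERRIDE and that
the converter merely reverses the stored string; naive_convert() is a
hypothetical helper, not an existing API.

```python
# Assumption: "[RTL]" in the quoted example is U+202E RIGHT-TO-LEFT OVERRIDE.
RLO = "\u202e"


def naive_convert(visual: str) -> str:
    """Hypothetical converter: reverse the visually ordered legacy
    string and prepend a directional override so it displays the same."""
    return RLO + visual[::-1]


legacy = "41 si nhoJ"        # legacy storage, as given in the quoted example
s = naive_convert(legacy)    # RLO followed by "John is 14"
age = int(s[-2:])            # slicing works on stored order, not display order

print(repr(s), age)
```

The slice sees the characters in the order the converter stored them, so
the result depends entirely on how the conversion was written.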

> >> If we've got it right for Arabic, it is by
> >> chance and not by design.  This still leaves us with 41 other types of
> >> digits for at least 30 different languages.
> >
> > So why do you trust the Unicode standard on other things and not on this
> > one?
> 
> What other things?

Everything which the Unicode database stores and that we already rely
on.

> As far as I understand the only str method that was
> designed to comply with Unicode recommendations was str.isidentifier().

I don't think so.  str.split() and str.splitlines() are also defined in
conformance to the SPEC, AFAIK.  They certainly try to.
And, outside of str itself, the re module tries to follow Unicode
categories as well (for example, "\d" should match non-ASCII digits).
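
A quick sketch of what I mean, assuming Python 3 str semantics; the
sample strings are made up for the example:

```python
import re
import unicodedata

# ARABIC-INDIC DIGIT FOUR followed by ARABIC-INDIC DIGIT ONE
arabic_41 = "\u0664\u0661"

# Both characters are in the Unicode category Nd (decimal digit).
categories = [unicodedata.category(ch) for ch in arabic_41]

# On str patterns, \d matches any Nd character by default...
unicode_match = re.fullmatch(r"\d+", arabic_41) is not None
# ...unless re.ASCII restricts it to [0-9].
ascii_match = re.fullmatch(r"\d+", arabic_41, re.ASCII) is not None

# int() accepts the same digits.
value = int(arabic_41)

# str.split() without arguments splits on Unicode whitespace
# (here U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE).
fields = "one\u2003two\u3000three".split()

# str.splitlines() recognizes U+2028 LINE SEPARATOR and
# U+2029 PARAGRAPH SEPARATOR as line boundaries.
lines = "a\u2028b\u2029c".splitlines()

print(categories, unicode_match, ascii_match, value, fields, lines)
```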

Regards

Antoine.