Mailman 3 Unicode: When Things Get Hairy - Python-Dev


Unicode: When Things Get Hairy


Moshe Zadka

11 Mar 2000

9:10 a.m.

The following "problem" is easy to fix. However, what I wanted to know is if people (Skip and Guido most importantly) think it is a problem:

>>> "a" in u"bbba"
1
>>> u"a" in "bbba"
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: string member test needs char left operand

Suggested fix: in stringobject.c, explicitly allow a unicode char left operand.

--
Moshe Zadka
http://www.oreilly.com/news/prescod_0300.html


M.-A. Lemburg

11 Mar 2000

10:24 a.m.

Moshe Zadka wrote:

> The following "problem" is easy to fix. However, what I wanted to know is if people (Skip and Guido most importantly) think it is a problem:
>
> >>> "a" in u"bbba"
> 1
> >>> u"a" in "bbba"
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> TypeError: string member test needs char left operand
>
> Suggested fix: in stringobject.c, explicitly allow a unicode char left operand.

Hmm, this must have been introduced by your contains code... it did work before.

The normal action taken by the Unicode and the string code in these mixed type situations is to first convert everything to Unicode and then retry the operation. Strings are interpreted as UTF-8 during this conversion.

To simplify this task, I added method APIs to the Unicode object which do the conversion for you (they apply all the necessary coercion business to all arguments). I guess adding another PyUnicode_Contains() wouldn't hurt :-)

Perhaps I should also add a tp_contains slot to the Unicode object which then uses the above API as well.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
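The "convert everything to Unicode, then retry" rule described in the message above can be modelled in a few lines. This is only a sketch in present-day Python 3, where `bytes` and `str` stand in for the 8-bit string and Unicode types under discussion; the helper names `coerce_to_unicode` and `concat` are hypothetical and are not the C-level APIs mentioned in the thread.

```python
def coerce_to_unicode(obj, encoding="utf-8"):
    """Return obj as text, decoding byte strings (UTF-8 by default)."""
    if isinstance(obj, str):
        return obj
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode(encoding)
    raise TypeError("cannot coerce %s to Unicode" % type(obj).__name__)


def concat(left, right):
    """Mixed-type operation: coerce both operands, then retry as text."""
    return coerce_to_unicode(left) + coerce_to_unicode(right)


# A byte string on either side is decoded before the operation runs.
print(concat(b"abc", "def"))
print(concat("gr", b"\xc3\xbc\xc3\x9f"))  # the bytes are UTF-8 for "üß"
```

Concatenation is used here only because it is the simplest binary operation to show; the same coerce-then-retry shape would apply to comparison or membership.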

Moshe Zadka

11:05 a.m.

On Sat, 11 Mar 2000, M.-A. Lemburg wrote:

> Hmm, this must have been introduced by your contains code... it did work before.

Nope: the string "in" semantics were forever special-cased. Guido beat me soundly for trying to change the semantics...

> The normal action taken by the Unicode and the string code in these mixed type situations is to first convert everything to Unicode and then retry the operation. Strings are interpreted as UTF-8 during this conversion.

Hmmm....PySequence_Contains doesn't do any conversion of the arguments. Should it? (Again, it didn't before). If it does, then the order of testing for seq_contains and seq_getitem and conversions

> Perhaps I should also add a tp_contains slot to the Unicode object which then uses the above API as well.

But that wouldn't help at all for

u"a" in "abbbb"

PySequence_Contains only dispatches on the container argument :-(

(BTW: I discovered it while contemplating adding a seq_contains (not tp_contains) to unicode objects to optimize the searching for a bit.)

PS: MAL: thanks for a great birthday present! I'm enjoying the unicode patch a lot.

--
Moshe Zadka
http://www.oreilly.com/news/prescod_0300.html
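The point about dispatch can be illustrated with a small model. The sketch below uses Python 3's `__contains__` hook, which, like the slot discussed above, is looked up on the container only. `ByteString`, `UnicodeString` and `try_membership` are hypothetical names made up for this illustration.

```python
class ByteString:
    """Hypothetical stand-in for the 8-bit string type."""

    def __init__(self, data: bytes):
        self.data = data

    def __contains__(self, element):
        # `x in container` only ever calls the container's method,
        # so a text element is rejected here and never gets a say.
        if not isinstance(element, bytes) or len(element) != 1:
            raise TypeError("string member test needs char left operand")
        return element in self.data


class UnicodeString:
    """Hypothetical stand-in for the Unicode type with a contains slot."""

    def __init__(self, text: str):
        self.text = text

    def __contains__(self, element):
        # The Unicode container can coerce a byte-string element itself.
        if isinstance(element, bytes):
            element = element.decode("utf-8")
        if not isinstance(element, str) or len(element) != 1:
            raise TypeError("string member test needs char left operand")
        return element in self.text


def try_membership(element, container):
    """Return the result of `element in container`, or the error name."""
    try:
        return element in container
    except TypeError as exc:
        return type(exc).__name__


print(try_membership(b"a", UnicodeString("bbba")))  # handled by UnicodeString
print(try_membership("a", ByteString(b"bbba")))     # handled by ByteString
```

Giving the Unicode type a contains slot fixes the first case but cannot help the second, because the byte-string container is the one that gets asked.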

Guido van Rossum

12:16 p.m.

[Moshe discovers that u"a" in "bbba" raises TypeError]

[Marc-Andre]
> > Hmm, this must have been introduced by your contains code... it did work before.
>
> Nope: the string "in" semantics were forever special-cased. Guido beat me soundly for trying to change the semantics...

But I believe that Marc-Andre added a special case for Unicode in PySequence_Contains. I looked for evidence, but the last snapshot that I actually saved and built before Moshe's code was checked in is from 2/18 and it isn't in there. Yet I believe Marc-Andre. The special case needs to be added back to string_contains in stringobject.c.

> > The normal action taken by the Unicode and the string code in these mixed type situations is to first convert everything to Unicode and then retry the operation. Strings are interpreted as UTF-8 during this conversion.
>
> Hmmm....PySequence_Contains doesn't do any conversion of the arguments. Should it? (Again, it didn't before). If it does, then the order of testing for seq_contains and seq_getitem and conversions

Or it could be done this way.

> > Perhaps I should also add a tp_contains slot to the Unicode object which then uses the above API as well.

Yes.

> But that wouldn't help at all for
>
> u"a" in "abbbb"

It could if PySequence_Contains would first look for a string and a unicode argument (in either order) and in that case convert the string to unicode.
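One way to read this suggestion is as a pre-check in the generic membership routine, before any dispatch on the container. The following is a rough Python 3 model of that idea, not the C function itself: `sequence_contains` is a hypothetical name, and `bytes`/`str` stand in for the string and Unicode types.

```python
def sequence_contains(container, element):
    """Generic membership test with a mixed string/Unicode pre-check."""
    pair = (container, element)
    has_bytes = any(isinstance(x, bytes) for x in pair)
    has_text = any(isinstance(x, str) for x in pair)
    if has_bytes and has_text:
        # One operand of each kind, in either order: convert the
        # byte string to text (interpreted as UTF-8) before testing.
        container, element = (
            x.decode("utf-8") if isinstance(x, bytes) else x for x in pair
        )
    # All other combinations fall through to the normal test.
    return element in container


print(sequence_contains(b"bbba", "a"))
print(sequence_contains("bbba", b"a"))
print(sequence_contains([1, 2, 3], 2))
```

The trade-off is that the generic routine now has to know about two specific types, which is what the alternative of a special case inside the string type avoids.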

> PySequence_Contains only dispatches on the container argument :-(
>
> (BTW: I discovered it while contemplating adding a seq_contains (not tp_contains) to unicode objects to optimize the searching for a bit.)

You may beat Marc-Andre to it, but I'll have to let him look at the code anyway -- I'm not sufficiently familiar with the Unicode stuff myself yet.

BTW, I added a tag "pre-unicode" to the CVS tree to the revisions before the Unicode changes were made.

--Guido van Rossum (home page: http://www.python.org/~guido/)

M.-A. Lemburg

1:32 p.m.

Guido van Rossum wrote:

> [Moshe discovers that u"a" in "bbba" raises TypeError]
>
> [Marc-Andre]
> > > Hmm, this must have been introduced by your contains code... it did work before.
> >
> > Nope: the string "in" semantics were forever special-cased. Guido beat me soundly for trying to change the semantics...
>
> But I believe that Marc-Andre added a special case for Unicode in PySequence_Contains. I looked for evidence, but the last snapshot that I actually saved and built before Moshe's code was checked in is from 2/18 and it isn't in there. Yet I believe Marc-Andre. The special case needs to be added back to string_contains in stringobject.c.

Moshe was right: I had probably not checked the code because the obvious combinations worked out of the box... the only combination which doesn't work is "unicode in string". I'll fix it next week. BTW, there's a good chance that the string/Unicode integration is not complete yet: just keep looking for them.

> > > The normal action taken by the Unicode and the string code in these mixed type situations is to first convert everything to Unicode and then retry the operation. Strings are interpreted as UTF-8 during this conversion.
> >
> > Hmmm....PySequence_Contains doesn't do any conversion of the arguments. Should it? (Again, it didn't before). If it does, then the order of testing for seq_contains and seq_getitem and conversions
>
> Or it could be done this way.
>
> > > Perhaps I should also add a tp_contains slot to the Unicode object which then uses the above API as well.
>
> Yes.
>
> > But that wouldn't help at all for
> >
> > u"a" in "abbbb"
>
> It could if PySequence_Contains would first look for a string and a unicode argument (in either order) and in that case convert the string to unicode.

I think the right way to do this is to add a special case to seq_contains in the string implementation. That's how most other auto-coercions work too. Instead of raising an error, the implementation would then delegate the work to PyUnicode_Contains().
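The delegation described in the paragraph above can be sketched as follows. This is a Python 3 model under stated assumptions, not the C code: `bytes`/`str` stand in for the string and Unicode types, and `string_contains` and `unicode_contains` are illustrative names that only loosely mirror the C-level functions discussed in the thread.

```python
ERROR = "string member test needs char left operand"


def unicode_contains(container, element):
    """Coerce both operands to text, then test a one-character element."""
    if isinstance(container, bytes):
        container = container.decode("utf-8")
    if isinstance(element, bytes):
        element = element.decode("utf-8")
    if len(element) != 1:
        raise TypeError(ERROR)
    return element in container


def string_contains(container, element):
    """Membership test owned by the byte-string type."""
    if not isinstance(element, bytes):
        # Special case: instead of raising an error, hand any
        # non-byte-string element over to the Unicode routine.
        return unicode_contains(container, element)
    if len(element) != 1:
        raise TypeError(ERROR)
    return element in container


print(string_contains(b"bdba", "a"))   # text element, delegated
print(string_contains(b"bdba", b"a"))  # byte element, handled locally
```

Keeping the special case inside the string type leaves the generic membership routine untouched, which matches how the message says most other auto-coercions are handled.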

> > PySequence_Contains only dispatches on the container argument :-(
> >
> > (BTW: I discovered it while contemplating adding a seq_contains (not tp_contains) to unicode objects to optimize the searching for a bit.)
>
> You may beat Marc-Andre to it, but I'll have to let him look at the code anyway -- I'm not sufficiently familiar with the Unicode stuff myself yet.

I'll add that one too.

BTW, Happy Birthday, Moshe :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg

1:57 p.m.

I couldn't resist :-) Here's the patch...

BTW, how should we proceed with future patches ? Should I wrap them together about once a week, or send them as soon as they are done ?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x *.bak -x *.s -x DEADJOE -x Demo -x CVS CVS-Python/Include/unicodeobject.h Python+Unicode/Include/unicodeobject.h
--- CVS-Python/Include/unicodeobject.h Fri Mar 10 23:33:05 2000
+++ Python+Unicode/Include/unicodeobject.h Sat Mar 11 14:45:59 2000
@@ -683,6 +683,17 @@
     PyObject *args /* Argument tuple or dictionary */
     );
 
+/* Checks whether element is contained in container and return 1/0
+   accordingly.
+
+   element has to coerce to an one element Unicode string. -1 is
+   returned in case of an error. */
+
+extern DL_IMPORT(int) PyUnicode_Contains(
+    PyObject *container, /* Container string */
+    PyObject *element /* Element string */
+    );
+
 /* === Characters Type APIs =============================================== */
 
 /* These should not be used directly. Use the Py_UNICODE_IS* and
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x *.bak -x *.s -x DEADJOE -x Demo -x CVS CVS-Python/Lib/test/test_unicode.py Python+Unicode/Lib/test/test_unicode.py
--- CVS-Python/Lib/test/test_unicode.py Sat Mar 11 00:23:20 2000
+++ Python+Unicode/Lib/test/test_unicode.py Sat Mar 11 14:52:29 2000
@@ -219,6 +219,19 @@
 test('translate', u"abababc", u'iiic', {ord('a'):None, ord('b'):ord('i')})
 test('translate', u"abababc", u'iiix', {ord('a'):None, ord('b'):ord('i'), ord('c'):u'x'})
 
+# Contains:
+print 'Testing Unicode contains method...',
+assert ('a' in 'abdb') == 1
+assert ('a' in 'bdab') == 1
+assert ('a' in 'bdaba') == 1
+assert ('a' in 'bdba') == 1
+assert ('a' in u'bdba') == 1
+assert (u'a' in u'bdba') == 1
+assert (u'a' in u'bdb') == 0
+assert (u'a' in 'bdb') == 0
+assert (u'a' in 'bdba') == 1
+print 'done.'
+
 # Formatting:
 print 'Testing Unicode formatting strings...',
 assert u"%s, %s" % (u"abc", "abc") == u'abc, abc'
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x *.bak -x *.s -x DEADJOE -x Demo -x CVS CVS-Python/Misc/unicode.txt Python+Unicode/Misc/unicode.txt
--- CVS-Python/Misc/unicode.txt Sat Mar 11 00:14:11 2000
+++ Python+Unicode/Misc/unicode.txt Sat Mar 11 14:53:37 2000
@@ -743,8 +743,9 @@
 stream codecs as available through the codecs module should be used.
 
-XXX There should be a short-cut open(filename,mode,encoding) available which
-    also assures that mode contains the 'b' character when needed.
+The codecs module should provide a short-cut open(filename,mode,encoding)
+available which also assures that mode contains the 'b' character when
+needed.
 
 File/Stream Input:
@@ -810,6 +811,10 @@
 Introduction to Unicode (a little outdated by still nice to read):
     http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
 
+For comparison:
+    Introducing Unicode to ECMAScript --
+    http://www-4.ibm.com/software/developer/library/internationalization-support...
+
 Encodings:
 
    Overview:
@@ -832,7 +837,7 @@
 History of this Proposal:
 -------------------------
-1.2:
+1.2: Removed POD about codecs.open()
 1.1: Added note about comparisons and hash values. Added note about
      case mapping algorithms. Changed stream codecs .read() and
      .write() method to match the standard file-like object methods
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x *.bak -x *.s -x DEADJOE -x Demo -x CVS CVS-Python/Objects/stringobject.c Python+Unicode/Objects/stringobject.c
--- CVS-Python/Objects/stringobject.c Sat Mar 11 10:55:09 2000
+++ Python+Unicode/Objects/stringobject.c Sat Mar 11 14:47:45 2000
@@ -389,7 +389,9 @@
 {
     register char *s, *end;
     register char c;
-    if (!PyString_Check(el) || PyString_Size(el) != 1) {
+    if (!PyString_Check(el))
+        return PyUnicode_Contains(a, el);
+    if (PyString_Size(el) != 1) {
         PyErr_SetString(PyExc_TypeError,
             "string member test needs char left operand");
         return -1;
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x *.bak -x *.s -x DEADJOE -x Demo -x CVS CVS-Python/Objects/unicodeobject.c Python+Unicode/Objects/unicodeobject.c
--- CVS-Python/Objects/unicodeobject.c Fri Mar 10 23:53:23 2000
+++ Python+Unicode/Objects/unicodeobject.c Sat Mar 11 14:48:52 2000
@@ -2737,6 +2737,49 @@
     return -1;
 }
 
+int PyUnicode_Contains(PyObject *container,
+                       PyObject *element)
+{
+    PyUnicodeObject *u = NULL, *v = NULL;
+    int result;
+    register const Py_UNICODE *p, *e;
+    register Py_UNICODE ch;
+
+    /* Coerce the two arguments */
+    u = (PyUnicodeObject *)PyUnicode_FromObject(container);
+    if (u == NULL)
+        goto onError;
+    v = (PyUnicodeObject *)PyUnicode_FromObject(element);
+    if (v == NULL)
+        goto onError;
+
+    /* Check v in u */
+    if (PyUnicode_GET_SIZE(v) != 1) {
+        PyErr_SetString(PyExc_TypeError,
+                        "string member test needs char left operand");
+        goto onError;
+    }
+    ch = *PyUnicode_AS_UNICODE(v);
+    p = PyUnicode_AS_UNICODE(u);
+    e = p + PyUnicode_GET_SIZE(u);
+    result = 0;
+    while (p < e) {
+        if (*p++ == ch) {
+            result = 1;
+            break;
+        }
+    }
+
+    Py_DECREF(u);
+    Py_DECREF(v);
+    return result;
+
+onError:
+    Py_XDECREF(u);
+    Py_XDECREF(v);
+    return -1;
+}
+
 /* Concat to string or Unicode object giving a new Unicode object. */
 
 PyObject *PyUnicode_Concat(PyObject *left,
@@ -3817,6 +3860,7 @@
     (intintargfunc) unicode_slice, /* sq_slice */
     0, /* sq_ass_item */
     0, /* sq_ass_slice */
+    (objobjproc)PyUnicode_Contains, /*sq_contains*/
 };
 
 static int
