unicode strings and such

Thu Sep 13 11:55:11 EDT 2001

On Thu, 13 Sep 2001, Garth Grimm wrote:

> If true, that clears up one misconception.  It would mean that the following initialization:
> repairList = [
>   ( '(ãf~ãf«ãf--)(\d{3})', '\g<1> \g<2>' ),
>   ( 'ã??ã? ã?.ã?".*ãf~ãf«ãf--', 'hit me' ),
>   ( '^ã??ã? ã?.ã?"$', '\g<0> hit me again' ),
> ]
>
> should be thought of as making each element of the tuples an array of bytes that is being treated as
> a String with each byte representing a Latin-1 character code point.  The u'' notation just
> stipulates that the array of bytes should be treated as a String with byte values representing UCS-2
> character code points.  Correct?

Well, wchar_t values, but otherwise yes.

> It would also mean that since the data file was written in UTF-8, the array of bytes wouldn't
> really represent anything useful.

Bingo.
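
A minimal sketch of that point. The sample text is hypothetical (three
katakana characters whose UTF-8 bytes each begin with \xe3), and the variable
names are made up for illustration. It shows that the same bytes count as
three characters when decoded as UTF-8 but as nine when each byte is read as
one Latin-1 character, which is why a pattern built from the raw bytes does
not describe the text you intended:

```python
# Hypothetical sample: three katakana characters, written with escapes so
# the encoding of this source file does not matter.
text = u'\u30d8\u30eb\u30d7'

raw = text.encode('utf-8')          # the bytes a UTF-8 data file would hold
as_latin1 = raw.decode('latin-1')   # the same bytes, one character per byte

print(len(text))        # 3 characters
print(len(raw))         # 9 bytes
print(len(as_latin1))   # 9 "characters", none of them the intended ones

# Decoding with the encoding the file was written in recovers the text.
roundtrip = raw.decode('utf-8')
print(roundtrip == text)
```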

> Any idea of what the str() is actually doing here?  If I remove the str() statements from the
> following code, I get different results (i.e. no substitutions will take place in qt).  So either
> pattern and patch (after the assignment at the beginning of the loop) aren't actually string
> objects, or str() is doing something more than the API docs state -- "For strings, this returns the
> string itself."

I still can't read the UTF-8 properly. Do the following and show me what your
results are:

---
>>> a='ãf~ãf«ãf--'
>>> a
'\xe3f~\xe3f\xab\xe3f--'
>>> str(a)
'\xe3f~\xe3f\xab\xe3f--'
>>> unicode(str(a), 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: invalid data
>>>
---
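
Why the decode fails here, as a rough sketch: in UTF-8, \xe3 is a lead byte
that announces a three-byte sequence, and the two bytes after it must fall in
the range \x80-\xbf. In the string above it is followed by plain ASCII ('f',
'~'), so the bytes that reached this interpreter are not valid UTF-8. The
helper name and the "intact" sample below are assumptions for illustration
only:

```python
damaged = b'\xe3f~\xe3f\xab\xe3f--'   # the bytes shown in the session above
intact = b'\xe3\x83\x98'              # hypothetical well-formed 3-byte sequence

def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeError:
        return False

print(decodes_as_utf8(damaged))   # False: 'f' is not a continuation byte
print(decodes_as_utf8(intact))    # True
```

If the same check succeeds on your machine with the pattern strings read
straight from your data file, the bytes on your side are probably intact and
were only altered somewhere in transit.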

-- 
Ignacio Vazquez-Abrams  <ignacio at openservices.net>