[Tutor] unicode problem

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon Apr 28 12:57:06 2003


On Mon, 28 Apr 2003, Paul Tremblay wrote:

> When I use Sax, I am getting a unicode problem.
>
> If I put an "ö" in my file, then sax translates this to a
> unicode string:
>
> u'?' (some value)


Hi Paul,

Sounds ok so far.



> I then cannot parse the string. If I try to add to it:
>
> my_string = my_string + '\n'
>
> Then I get this error:
>
>
>  File "/home/paul/lib/python/paul/format_txt.py", line 159, in r_border
>     line = line + filler + padding + border + "\n"
> UnicodeError: ASCII decoding error: ordinal not in range(128)




Hmm... Let's see:

###
>>> from xml.dom.pulldom import parseString
>>> for e, n in parseString("<h>ello&#xf6;</h>"):
...     print e
...     print n
...
START_DOCUMENT
<xml.dom.minidom.Document instance at 0x82d671c>
START_ELEMENT
<DOM Element: h at 137193460>
CHARACTERS
<DOM Text node "ello">
CHARACTERS

Traceback (most recent call last):
  File "<stdin>", line 3, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
###



It looks like the 'print' statement doesn't like non-ASCII characters in a
unicode string: it tries to encode them with the default ASCII codec.
Let's double check:

###
>>> print u"\xf6"

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
###



Ok, so we're getting a similar error here.
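
To take 'print' out of the picture, here is a minimal sketch of what I
believe is happening underneath: the unicode string has to be encoded to
bytes before it can be written out, and the ASCII codec has no byte for
u'\xf6'.  The variable names are my own, and the snippet assumes a Python
recent enough to understand b'' literals.

```python
# Sketch: the failure comes from encoding to ASCII, not from 'print' itself.
weird_char = u"\xf6"  # LATIN SMALL LETTER O WITH DIAERESIS

try:
    weird_char.encode("ascii")
    ascii_failed = False
except UnicodeError:
    # ASCII only covers code points 0-127, and 0xf6 is 246.
    ascii_failed = True

# Choosing an explicit encoding (or an error handler) avoids the exception.
as_latin1 = weird_char.encode("latin-1")                 # single byte 0xf6
as_utf8 = weird_char.encode("utf-8")                     # two bytes 0xc3 0xb6
as_ascii_lossy = weird_char.encode("ascii", "replace")   # becomes "?"
```

So encoding explicitly before printing or writing is one way around the
error, provided you know which encoding the terminal or file expects.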


But what I'm still trying to figure out is why doing a string
concatenation is making that error pop up for you.  Let me play around
with this more.

###
>>> e_n_pairs = list(parseString("<h>ello&#xf6;</h>"))
>>> e_n_pairs
[('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x82d2b4c>),
 ('START_ELEMENT', <DOM Element: h at 137178148>),
 ('CHARACTERS', <DOM Text node "ello">),
 ('CHARACTERS', <DOM Text node "\xf6">),
 ('END_ELEMENT', <DOM Element: h at 137178148>)]
>>>
>>> e_n_pairs[3]
('CHARACTERS', <DOM Text node "\xf6">)
>>> e_n_pairs[3][1]
<DOM Text node "\xf6">
>>> e_n_pairs[3][1].data
u'\xf6'
###


I've used pulldom to isolate that umlauted character.  Ok, I will try to
elicit your errors by doing the string concatenation by hand.

###
>>> weird_char = e_n_pairs[3][1].data
>>> weird_char + weird_char
u'\xf6\xf6'
>>> weird_char + weird_char + "foobar"
u'\xf6\xf6foobar'
###


Odd.  I'm having problems getting it to break.  *grin*



The error message you report:

>  File "/home/paul/lib/python/paul/format_txt.py", line 159, in r_border
>     line = line + filler + padding + border + "\n"
> UnicodeError: ASCII decoding error: ordinal not in range(128)


doesn't smell right to me --- for the life of me, I can't imagine why
string concatenation would raise that kind of error.  Are 'line',
'filler', 'padding' and 'border' all strings?

I'd expect that error from something like a 'print', a file.write(), or a
string.encode(), but string concatenation should be pretty safe as long as
every piece is either unicode or pure ASCII.  If your program is short, it
might help us see more clearly what Python is doing if you post the
program on Tutor.  I'm baffled, and I think we need to see some more
source code to understand what's happening.
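
One guess, and it is only a guess until we see the code: if any one of
those values is a plain byte string holding the raw byte 0xf6 (say, read
straight from a file) while another is a unicode string, Python 2 will
try to decode the byte string as ASCII before joining them, and that
would give a *decoding* error like yours.  The sketch below spells out
that hidden step explicitly so that it behaves the same on newer Pythons.
'raw_bytes' and 'border' are made-up stand-ins, not names from your
program, and Latin-1 is just an assumed file encoding.

```python
# Sketch of the conversion Python 2 performs implicitly in  str + unicode.
raw_bytes = b"\xf6"    # hypothetical: a byte string read from a Latin-1 file
border = u"|"          # hypothetical: a unicode string from the parser

try:
    # This is the hidden step: decode the byte string with the ASCII codec.
    line = raw_bytes.decode("ascii") + border
    decode_failed = False
except UnicodeError:
    decode_failed = True

# Decoding explicitly, with the encoding the bytes are really in, works.
line = raw_bytes.decode("latin-1") + border
```

If that turns out to be the situation, decoding every byte string once,
right where it enters the program, should keep the concatenation safe.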



By the way, you might find the 'unicode_escape' encodings useful:

###
>>> weird_char
u'\xf6'
>>> weird_char.encode('raw_unicode_escape')
'\xf6'
>>> print weird_char.encode('raw_unicode_escape')
ö
>>> weird_char.encode('raw_unicode_escape').decode('raw_unicode_escape')
u'\xf6'
>>>
>>>
>>> weird_char.encode('unicode_escape')
'\\xf6'
>>> print weird_char.encode('unicode_escape')
\xf6
###
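
As a small sketch of why 'unicode_escape' can be handy for debugging: the
encoded form is pure ASCII, so it can be printed or logged anywhere, and
it decodes back to the original character.  The variable names here are
only for illustration.

```python
weird_char = u"\xf6"

# The escaped form is four ASCII bytes: backslash, 'x', 'f', '6'.
escaped = weird_char.encode("unicode_escape")

# Every byte is below 128, so an ASCII-only terminal can show it.
is_ascii_safe = all(byte < 128 for byte in bytearray(escaped))

# Decoding with the same codec gives back the original unicode string.
restored = escaped.decode("unicode_escape")
```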



I hope we can help fix this problem fast.  Talk to you later!