[Tutor] Unicode strings

Kent Johnson kent37 at tds.net
Fri Aug 22 21:01:27 CEST 2008


On Fri, Aug 22, 2008 at 2:23 PM, eShopping
<etrade.griffiths at dsl.pipex.com> wrote:
> Hi
>
> I am trying to read in non-ASCII data from file using Unicode, with this
> test app:
>
> vocab=[("abends","in the evening"),
> ("die Auff\xFCrung","performance (of a play)"),
> ("der Au\xDFenhandel","foreign trade")

The \x escapes are interpreted by the Python parser; they are not part
of the string. In other words, the string contains the actual latin-1
byte values (for example, \xFC becomes the single byte 0xFC).
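
A quick way to see this for yourself, as a rough sketch: compare a
normal literal with one where the backslash is doubled so the parser
leaves it alone. The names 'parsed' and 'literal' are just for
illustration, and I use b"..." literals so the snippet behaves the same
on newer Pythons, where plain strings are no longer byte strings.

```python
# Illustration only: 'parsed' and 'literal' are made-up names.

# In a normal literal the parser turns \xDF into one byte (0xDF).
parsed = b"der Au\xDFenhandel"

# Doubling the backslash keeps the four characters \, x, D, F as they are.
literal = b"der Au\\xDFenhandel"

print(len(parsed))    # 15
print(len(literal))   # 18

# Only the parsed version decodes to the word with a real sharp s.
print(parsed.decode("latin-1") == u"der Au\u00dfenhandel")   # True
```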

> The data in the file "eng_ger.txt" is listed below.  When I parse the data
> from the list, I get the correct text displayed but when reading it from
> file, the encoding into unicode does not occur.  I would be really grateful
> if someone could explain why the string-> unicode conversion works with
> lists but not with files!
>
> Thanks in advance
>
> Alun Griffiths
>
> Contents of "eng_ger.txt"
>
> abends,in the evening
> die Auff\xFCrung,performance (of a play)
> der Au\xDFenhandel,foreign trade

Here, the Python parser is not interpreting the \x escapes, so the file
contains the literal characters \, x and two hex digits rather than
latin-1 characters.
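
To check what really comes back from the file, you could try something
like the sketch below. It assumes the file was typed exactly as posted;
the temporary file is only a stand-in for eng_ger.txt, and the variable
names are mine.

```python
import os
import tempfile

# Hypothetical stand-in for eng_ger.txt: one line typed exactly as posted,
# i.e. with a literal backslash, 'x', 'F', 'C' in the text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"die Auff\\xFCrung,performance (of a play)\n")

# Read the line back as raw bytes, the way the data arrives from disk.
with open(path, "rb") as f:
    line = f.readline().rstrip()
os.remove(path)

word = line.split(b",")[0]
print(repr(word))         # the backslash is still there
print(b"\\xFC" in word)   # True: four separate ASCII characters
print(b"\xFC" in word)    # False: there is no 0xFC byte in the data
```

Decoding that data as latin-1 cannot produce the umlaut, because the
byte that stands for it was never written to the file.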

Two options:
- Create the file with actual latin-1 characters
- Use the special 'string-escape' codec to interpret the data from the
file, e.g.
  print "   ",words[0],unicode(words[0].decode('string-escape'),"latin1")
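
Here is a rough sketch of both options, not the only way to do it. The
function names are made up for illustration. Note one substitution: I
use the 'unicode_escape' codec instead of 'string-escape', since
'string-escape' only exists in Python 2; 'unicode_escape' treats the
remaining bytes as latin-1 and returns unicode text directly. It also
interprets other escapes such as \n and \t, so it only suits data
where that is what you want.

```python
import io
import os
import tempfile


def decode_escaped(raw):
    """Option 2: turn literal \\xNN sequences in raw bytes into text.

    Substitutes 'unicode_escape' for the 'string-escape' codec.
    """
    return raw.decode("unicode_escape")


def write_latin1(path, pairs):
    """Option 1: store the real characters, encoded as latin-1."""
    with io.open(path, "w", encoding="latin-1") as f:
        for german, english in pairs:
            f.write(u"%s,%s\n" % (german, english))


def read_latin1(path):
    """Read the file back, decoding latin-1 bytes into unicode text."""
    with io.open(path, "r", encoding="latin-1") as f:
        return [tuple(line.rstrip(u"\n").split(u",", 1)) for line in f]


# Option 2 on data that contains a literal backslash-x sequence.
print(decode_escaped(b"die Auff\\xFCrung") == u"die Auff\u00fcrung")  # True

# Option 1: round trip through a temporary file (stand-in for eng_ger.txt).
pairs = [(u"die Auff\u00fcrung", u"performance (of a play)")]
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
write_latin1(tmp_path, pairs)
print(read_latin1(tmp_path) == pairs)  # True
os.remove(tmp_path)
```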

Kent

