utf-8 encoding issue

Fri Sep 19 08:34:16 EDT 2003

Marc Petitmermet wrote:

> In a web form, the user enters "öttinger" and wants to search with this
> search string. My idea is now to convert the search string (which also
> could be e.g. some cyrillic text) into unicode and then to utf-8:
>
>    unicode(search_string).encode('utf-8')
>
> This gives me the utf-8 encoded version of the string but not yet in the
> correct representation. How can I get the correct one (is this the hex
> version? I don't know the correct terminology.)?
>
> In short: how do I e.g. convert a string containing a "ö" into a string
> containing a "%&#xd6;"?

that's not UTF-8, that's HTML/XML-style charrefs.

if mysql translates the charrefs to unicode characters, you can simply
use:

    s = u.encode("ascii", "xmlcharrefreplace")

where "u" is a unicode string.
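
as a quick check of what that call produces, here's a minimal sketch. it assumes python 3, where a plain str is already a unicode string and encode() returns bytes; the sample string is just the one from the question.

```python
# python 3 assumption: str is unicode, encode() returns bytes
u = "öttinger"

# every character outside ASCII becomes a decimal charref
s = u.encode("ascii", "xmlcharrefreplace")
print(s)
```

"ö" is U+00F6, i.e. 246 in decimal, so it ends up as a decimal charref and the rest of the string is left alone.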

if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all stored charrefs are hexadecimal,
you can use something like:

    import re
    # "&#246" -> "&#xf6"; the trailing ";" is left untouched by the pattern
    def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
    s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
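
the same two steps can be sketched for python 3 as well. this is only an illustration under that assumption; the helper name to_hex_charrefs is made up here and is not part of any library.

```python
import re

def to_hex_charrefs(text):
    # hypothetical helper, assuming python 3 (str is unicode)
    # step 1: map all non-ASCII characters to decimal charrefs
    ascii_text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # step 2: rewrite each decimal charref as a hexadecimal one
    return re.sub(r"&#(\d+);",
                  lambda m: "&#x%x;" % int(m.group(1)),
                  ascii_text)

print(to_hex_charrefs("öttinger"))
```

note that any decimal charref already present in the input text is rewritten too, since step 2 cannot tell it apart from the ones produced by step 1.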

decoding the charrefs *before* you add the strings to the database
is a better idea, though.
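
one way to do that decoding, sketched for python 3.4 or later, is html.unescape from the standard library. the helper name decode_charrefs is hypothetical, and note that html.unescape also expands named entities such as "&amp;", which may or may not be what you want for your data.

```python
import html

def decode_charrefs(text):
    # hypothetical helper: turns "&#246;" / "&#xf6;" (and named
    # entities) back into the characters they stand for
    return html.unescape(text)

stored = decode_charrefs("&#xf6;ttinger")

# the decoded string can then be encoded as UTF-8 for storage
print(stored.encode("utf-8"))
```

with real characters in the database, the search string only needs the same treatment (decode to unicode, encode as UTF-8) to compare equal.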

</F>