Handle foreign character web input
Alan Meyer
ameyer2 at yahoo.com
Sat Jun 29 15:56:19 EDT 2019
On 6/28/19 4:25 PM, Tobiah wrote:
> A guy comes in and enters his last name as RÖnngren.
>
> So what did the browser really give me; is it encoded
> in some way, like latin-1? Does it depend on whether
> the name was cut and pasted from a Word doc. etc?
> Should I handle these internally as unicode? Right
> now my database tables are latin-1 and things seem
> to usually work, but not always.
>
> Also, what do people do when searching for a record.
> Is there some way to get 'Ronngren' to match the other
> possible foreign spellings?
The first thing I'd want to do is put a front end on the input that
discovers the character set (latin-1, whatever) and converts everything
to one standard encoding, UTF-8. For example, with data holding the raw
bytes:
    data.decode('latin1').encode('utf8')
That gets rid of character set variations in the data, simplifying
things before any of the hard work has to be done.
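A slightly fuller sketch of that front end, in case it helps. It assumes the input arrives as raw bytes and that the only likely encodings are UTF-8 and Windows-1252 / latin-1; the name to_text and the fallback order are my own choices for illustration, not anything a browser or framework guarantees. Trial decoding is only a guess, so if the page or request declares a charset, trust that first.

```python
def to_text(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode raw form bytes by trying each candidate encoding in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a code point, so this cannot fail.
    return raw.decode("latin-1")


def to_utf8(raw: bytes) -> bytes:
    """Normalize incoming bytes to UTF-8 for storage."""
    return to_text(raw).encode("utf-8")
```

UTF-8 is tried first because non-ASCII latin-1 text is rarely valid UTF-8, while the reverse mistake (reading UTF-8 as latin-1) never raises an error and silently produces garbage.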
Then you have a choice - store and index everything as utf-8, or
transliterate some or all strings to 7-bit US-ASCII. You may have to
perform the same processing on input search strings.
I have not used it myself, but there is a Python package called
Unidecode, a port of Sean M. Burke's Perl module Text::Unidecode. It
transliterates non-ASCII strings into ASCII using reasonable
substitutions for the non-ASCII characters. I believe that there are
other packages that can also do this.
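To give a feel for what transliteration does without installing anything, here is a rough stand-in that uses only the standard library's unicodedata module. This is not Unidecode and is much cruder: it decomposes accented letters and drops the accents, and any character with no decomposition (ß or ø, for example) is simply discarded rather than replaced. The function name fold_to_ascii is made up for the example.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping the marks."""
    # NFKD splits 'Ö' into 'O' followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII has no simple base letter; drop it.
    return without_marks.encode("ascii", "ignore").decode("ascii")
```

With this, fold_to_ascii("RÖnngren") gives "ROnngren". A real transliteration package should do better on letters that have no accent-free base form.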
The easy way to use packages like this is to transliterate entire
records before putting them into your database, but then you may perplex
or even offend some users who will look at a record and say "What's
this? That's not French!" You'll also have to transliterate all input
search strings.
A more sophisticated way is to leave the records in Unicode, but add
transliterated copies of any index strings that wind up containing
non-ASCII characters.
There are various ways to do this that trade off time, space, and
programming effort. You can store two versions of each record, search
one and display the other. You can just process index strings and add
the transliterations to the record. What to choose depends on your
needs and resources.
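One of those options, sketched with sqlite3 so it runs standalone: keep the name as entered for display, and store a folded copy in a second, indexed column that searches run against. The table and column names (people, name, name_key) and the helper names are invented for the example, and the folding is the same crude accent stripping from the standard library, not a proper transliteration package.

```python
import sqlite3
import unicodedata


def search_key(text: str) -> str:
    """Fold a string to a lowercase ASCII key used for matching only."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks are non-ASCII, so 'ignore' drops them here.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.casefold()


def build_db(names):
    """Create an in-memory table holding each name and its folded key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, name_key TEXT)")
    conn.execute("CREATE INDEX people_key ON people (name_key)")
    conn.executemany(
        "INSERT INTO people (name, name_key) VALUES (?, ?)",
        [(name, search_key(name)) for name in names],
    )
    return conn


def find(conn, query):
    """Return the stored display names whose key matches the folded query."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name_key = ?", (search_key(query),)
    )
    return [row[0] for row in rows]
```

The search string goes through the same search_key function as the stored data, so 'Ronngren', 'rönngren' and 'RÖnngren' all land on the same key, while the user still sees the name exactly as it was entered.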
And of course all bets are off if some of your data is Chinese,
Japanese, Hebrew, or maybe even Russian or Greek.
Sometimes I think, Why don't we all just learn Esperanto? But we all
know that that isn't going to happen.
Alan