character sets? unicode?

Thu Feb 3 12:40:52 EST 2005

Michael wrote:

> I'm trying to import text from email I've received, run some regular expressions on it, and save 
> the text into a database. I'm trying to figure out how to handle the issue of character sets. I've 
> had some problems with my regular expressions on email that has interesting character sets. Korean 
> text seems to be filled with a lot of '=3D=21' type of stuff.

looks like quoted-printable encoding:

    http://python.org/doc/lib/module-quopri.html

plus perhaps some character set encoding on top of that.
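
a minimal sketch of the two decoding steps, assuming Python 3 and assuming
the Korean text is in EUC-KR (the original mail doesn't say which character
set was used, so that part is a guess for illustration only):

```python
import quopri

# the fragment quoted in the question decodes to two plain ASCII bytes
fragment = quopri.decodestring(b"=3D=21")

# round trip for Korean text; EUC-KR is an assumed character set
original = "\ud55c\uae00"                    # the word "hangul" in Korean
raw = original.encode("euc-kr")              # 8-bit bytes in the charset
wire = quopri.encodestring(raw)              # 7-bit safe quoted-printable form

decoded_bytes = quopri.decodestring(wire)    # step 1: undo quoted-printable
text = decoded_bytes.decode("euc-kr")        # step 2: bytes -> Unicode string
```

quopri only undoes the transfer encoding; the result is still bytes until
you decode it with the right character set.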

instead of rolling your own message handling code, consider using this
package:

    http://python.org/doc/lib/module-email.html

in either case, the MIME specification is required reading here (for a link,
see the quopri page above).
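
for illustration, here's roughly how the email package can do the same work,
assuming Python 3.  the message below is made up for the example (addresses,
subject, and the EUC-KR charset are all assumptions, not taken from the
original mail):

```python
import email

# a hand-written sample message; every header value here is hypothetical
raw_message = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="euc-kr"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=C7=D1=B1=DB =3D=21\r\n"
)

msg = email.message_from_bytes(raw_message)

# the charset declared in the Content-Type header, if any
charset = msg.get_content_charset() or "ascii"

# decode=True undoes the Content-Transfer-Encoding and returns bytes
payload = msg.get_payload(decode=True)

# errors="replace" keeps going if the declared charset turns out to be wrong
text = payload.decode(charset, errors="replace")
```

this only handles a single-part message; for multipart mail you'd loop over
msg.walk() and treat each text part the same way.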

> Do I need to do anything special when passing text with non-ascii
> characters to re

depends on your patterns.  by default, RE operators like \w and \s only
recognize ASCII characters.  to match non-ASCII text, convert your text to
Unicode before passing it to the RE module, and use the (?u) flag.
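
a small sketch of the difference, assuming Python 3, where string patterns
are already Unicode-aware (so (?u) is accepted but redundant) and re.ASCII
gives you the ASCII-only behaviour described above.  the sample text is made
up for the example:

```python
import re

# decode first; the regular expression then works on a Unicode string
raw = "\ud55c\uae00 caf\u00e9".encode("utf-8")
text = raw.decode("utf-8")

# Unicode-aware matching: \w covers Korean and accented letters
unicode_words = re.findall(r"\w+", text)

# the explicit (?u) flag gives the same result in Python 3
flagged_words = re.findall(r"(?u)\w+", text)

# ASCII-only matching: \w stops at the first non-ASCII character
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
```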

> Is it better to save the text as-is in my db and save the character set type
> too or should I try to convert all text to some default format like UTF-8?

depends on your application; using a standard encoding has many advantages,
but storing the original text "as is" guarantees that no information is lost, even if
you have bugs in your conversion code.  when in doubt, save the original and
do the conversion on the way out.
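
one way to sketch "save the original, convert on the way out", assuming
Python 3 and an SQLite database.  the table layout and the helper names
(save_message, load_text) are invented for this example; adapt them to
whatever database you actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"
    " raw BLOB NOT NULL,"      # the undecoded bytes, exactly as received
    " charset TEXT)"           # the charset the message claimed to use
)

def save_message(conn, raw_bytes, charset):
    """Store the original bytes together with their declared charset."""
    cur = conn.execute(
        "INSERT INTO messages (raw, charset) VALUES (?, ?)",
        (raw_bytes, charset),
    )
    return cur.lastrowid

def load_text(conn, message_id):
    """Decode on the way out; a conversion bug never touches stored data."""
    raw, charset = conn.execute(
        "SELECT raw, charset FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    return bytes(raw).decode(charset or "ascii", errors="replace")

# EUC-KR is an assumed charset, used only for this example
message_id = save_message(conn, b"\xc7\xd1\xb1\xdb", "euc-kr")
text = load_text(conn, message_id)
```

if the decoding logic turns out to be wrong, you can fix load_text and read
the same rows again, since the stored bytes were never modified.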

</F>