Python and UTF-8

Martin v. Loewis martin at v.loewis.de
Sat Jan 5 02:27:43 EST 2002


Dave Pawson <DaveP at dpawsonNOSPam.freeserve.co.uk> writes:

> > You have to know the encoding the data is currently in, say
> > current_encoding. Then, to convert it to UTF-8, you write
> > 
> > data = unicode(data, current_encoding).encode('utf-8')
> 
> If I have a file encoded in ISO-8859-1, can I use the same
> approach?

Certainly. Just use 'iso-8859-1' as current_encoding, and data as the
file contents.
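
As a minimal sketch in current Python 3 (where the unicode() builtin no
longer exists and bytes.decode() takes its place), the same conversion
might look like the following. The helper name to_utf8 and the sample
bytes are made up for illustration only.

```python
def to_utf8(data: bytes, current_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode raw bytes from current_encoding into UTF-8."""
    # Decode to a text string first, then encode that string as UTF-8.
    return data.decode(current_encoding).encode("utf-8")


# Sample input: "Smørrebrød" as ISO-8859-1 bytes (ø is the single byte 0xF8).
latin1_bytes = b"Sm\xf8rrebr\xf8d"
utf8_bytes = to_utf8(latin1_bytes)
```

For a file, read it in binary mode (open(path, "rb")), pass the contents
to the helper, and write the result back in binary mode as well.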

> This is prior to XSLT processing, with older HTML files
> originating in Scandinavia, which blow up when XSLT
> gets hold of them with no encoding specified!

If these are HTML files, they likely have other problems beyond
encoding, such as missing closing tags. Notice that, AFAIK, the
default encoding of HTML is Latin-1, so not specifying any other
encoding implies Latin-1. An HTML processor should know that.

If you want to convert HTML documents to XHTML, I recommend using
HTML Tidy.

Regards,
Martin


