Ascii to Unicode.

Carey Tilden carey.tilden at gmail.com
Thu Jul 29 14:18:41 EDT 2010


On Thu, Jul 29, 2010 at 10:59 AM, Joe Goldthwaite <joe at goldthwaites.com> wrote:
> Hi Ulrich,
>
> Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
> few characters above the 128 range that are causing Postgresql Unicode
> errors.  Those characters work fine in the Windows world but they're not the
> correct byte representation for Unicode. What I'm attempting to do is
> translate those upper range characters into the correct Unicode
> representations so that they look the same in the Postgresql database as
> they did in the CSV file.

Having bytes outside of the ASCII range means, by definition, that the
file is not ASCII encoded.  ASCII only defines bytes 0-127.  Bytes
outside of that range mean either that the file is corrupt or that
it's in a different encoding.  In this case, you've been able to
determine the correct encoding (latin-1) for those errant bytes, so
the file itself is known to be in that encoding.
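
To make that concrete, here is a rough sketch (Python 3 syntax) of what
treating the data as latin-1 looks like.  The helper name and the sample
bytes are made up for illustration, and I'm assuming the database wants
UTF-8; adjust to whatever your setup actually expects.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Decode Latin-1 bytes to text, then re-encode that text as UTF-8."""
    text = raw.decode("latin-1")   # every byte 0-255 maps to one code point
    return text.encode("utf-8")    # code points above 127 become two bytes


# Hypothetical sample: "cafe" with an e-acute stored as the single byte 0xE9.
sample = b"caf\xe9"

# The byte 0xE9 is outside 0-127, so the data is not valid ASCII.
try:
    sample.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

converted = latin1_to_utf8(sample)
print(is_ascii)     # False
print(converted)    # b'caf\xc3\xa9'
```

The same decode-then-encode step applies to a whole file: read it as
bytes (or open it with encoding="latin-1"), and write it back out in
the encoding the database expects.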

Carey


