What extended ASCII character set uses 0x9D?

MRAB python at mrabarnett.plus.com
Thu Aug 17 22:24:27 EDT 2017


On 2017-08-18 01:30, John Nagle wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
>   >      I'm cleaning up some data which has text description fields from
>   > multiple sources.
> A few more cases:
> 
> bytearray(b'miguel \xe3\x81ngel santos')
> bytearray(b'lidija kmeti\xe4\x8d')
> bytearray(b'\xe5\x81ukasz zmywaczyk')
> bytearray(b'M\x81\x81\xfcnster')

I suspect that it's b'M\xc3\xbcnster', i.e. 'Münster'.encode('utf-8').
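
A quick way to check that guess, as a minimal sketch. The names `expected` and `observed` are just illustrative; `observed` is the byte string quoted above. It only shows what clean UTF-8 would look like and that the trailing 0xFC is 'ü' in Latin-1. It doesn't explain where the two 0x81 bytes came from.

```python
# What 'Münster' looks like when encoded cleanly as UTF-8.
expected = 'Münster'.encode('utf-8')
print(expected)

# The bytes actually seen in the data (copied from the quoted case).
observed = b'M\x81\x81\xfcnster'

# The observed bytes are not valid UTF-8 ...
try:
    observed.decode('utf-8')
    observed_is_utf8 = True
except UnicodeDecodeError:
    observed_is_utf8 = False
print(observed_is_utf8)

# ... but the tail decodes as Latin-1, where 0xFC is 'ü'.
tail = observed[3:].decode('latin-1')
print(tail)
```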

> bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
> bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
> bytearray(b'petr urban\xe4\x8d\xe3\xadk')
> 
> 0x9d is the most common; that occurs in English text. The others
> seem to be in some Eastern European character set.
> 
> Understand, there's no metadata available to disambiguate this. What I
> have is a big CSV file in which different character sets are mixed.
> Each field has a uniform character set, so I need character set
> detection on a per-field basis.
> 
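
For per-field detection, one possible starting point is to try a strict UTF-8 decode first and fall back to single-byte encodings only when that fails. This is a rough sketch, not a real detector: `decode_field` is a made-up helper, and the candidate list (UTF-8, then cp1252, then Latin-1) is an assumption that would need tuning for the Eastern European fields.

```python
def decode_field(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of `candidates` is a guess, strictest first.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes, try the next one
    raise ValueError('no candidate encoding could decode %r' % (raw,))


# Valid UTF-8 is accepted by the first candidate.
print(decode_field(b'M\xc3\xbcnster'))

# 0xE9 on its own is not valid UTF-8, so cp1252 picks it up.
print(decode_field(b'caf\xe9'))

# Python's cp1252 codec leaves 0x9D undefined, so this falls through to latin-1.
print(decode_field(bytearray(b'\x9d')))
```

Latin-1 never fails, so as the last candidate it is only a catch-all: a field that ends up there probably needs a closer look rather than being trusted as-is.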


