What extended ASCII character set uses 0x9D?
MRAB
python at mrabarnett.plus.com
Thu Aug 17 22:24:27 EDT 2017
On 2017-08-18 01:30, John Nagle wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
> > I'm cleaning up some data which has text description fields from
> > multiple sources.
> A few more cases:
>
> bytearray(b'miguel \xe3\x81ngel santos')
> bytearray(b'lidija kmeti\xe4\x8d')
> bytearray(b'\xe5\x81ukasz zmywaczyk')
> bytearray(b'M\x81\x81\xfcnster')
I suspect that it's b'M\xc3\xbcnster', i.e. 'Münster'.encode('utf-8').
> bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
> bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
> bytearray(b'petr urban\xe4\x8d\xe3\xadk')
>
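One hypothesis that fits several of these samples (a guess from the bytes alone, nothing in the data confirms it): the text was UTF-8, got decoded as Latin-1, was lowercased, and was then written back out as Latin-1. Lowercasing would turn the UTF-8 lead bytes 0xC3/0xC4/0xC5 into 0xE3/0xE4/0xE5 and leave the continuation bytes alone. A minimal sketch that undoes this; repair_lowercased_utf8 is a made-up helper name, not a library function:

```python
import re

# Assumption: a byte in 0xE2-0xFE (except 0xF7) directly followed by a UTF-8
# continuation byte (0x80-0xBF) is a 2-byte lead byte (0xC2-0xDE) that was
# lowercased while the text was wrongly treated as Latin-1.
_LOWERED_LEAD = re.compile(rb'([\xe2-\xf6\xf8-\xfe])(?=[\x80-\xbf])')

def repair_lowercased_utf8(raw):
    """Hypothetical helper: undo the suspected damage, then decode as UTF-8.

    Raises UnicodeDecodeError if the result is still not valid UTF-8.
    """
    fixed = _LOWERED_LEAD.sub(
        lambda m: bytes([m.group(1)[0] - 0x20]),  # 0xE3 -> 0xC3, and so on
        bytes(raw),
    )
    return fixed.decode('utf-8')

if __name__ == '__main__':
    for sample in (b'lidija kmeti\xe4\x8d', b'petr urban\xe4\x8d\xe3\xadk'):
        print(repair_lowercased_utf8(bytearray(sample)))
```

This is only a heuristic: it would corrupt genuine 3-byte UTF-8 sequences, and it does not explain b'M\x81\x81\xfcnster', which still fails to decode after the substitution.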
> 0x9d is the most common; that occurs in English text. The others
> seem to be in some Eastern European character set.
>
> Understand, there's no metadata available to disambiguate this. What I
> have is a big CSV file in which different character sets are mixed.
> Each field has a uniform character set, so I need character set
> detection on a per-field basis.
>
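For the per-field part, one simple approach is to try a list of candidate encodings in order and keep the first that decodes cleanly. This is a sketch, not real detection: the order of candidates is my assumption, and decode_field is a made-up name. Note that Python's cp1252 codec leaves 0x9D unassigned, so a field containing that byte falls through to latin-1, where it becomes the control character U+009D.

```python
def decode_field(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Hypothetical helper: return (text, encoding) for the first candidate
    encoding that decodes the field without error.

    The candidate order is an assumption. latin-1 accepts every byte value,
    so with it last in the list the loop always returns.
    """
    data = bytes(raw)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding could decode %r' % (data,))

if __name__ == '__main__':
    for field in (b'M\xc3\xbcnster', b'caf\xe9', b'odd \x9d byte'):
        print(decode_field(bytearray(field)))
```

A clean decode only means the bytes are valid in that encoding, not that the encoding is the right one, so the result still needs checking against what the text ought to say.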