I am not unfamiliar with this problem. I still work with files whose fields are supposed to be EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky), or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary, at least), but you don't often get a program that runs correctly or treats the data properly.
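To make that concrete (a toy illustration, not from any particular file I work with): bytes that are really UTF-8, decoded unconditionally as latin-1, come back without an exception but silently wrong:

    raw = "naïve".encode("utf-8")   # b'na\xc3\xafve'
    raw.decode("latin-1")           # 'naÃ¯ve'  <- no error, but mojibake
    raw.decode("utf-8")             # 'naïve'   <- correct only because we knew the real encoding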
Can you walk us through the problems you are having working with these columns as arrays of `bytes`?
> So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution.
Well, I outlined a solution: work with `bytes` arrays with utilities to convert to/from the Unicode-aware string dtypes (or `object`).

A UTF-8-specific dtype and maybe a string-specialized `object` dtype address the very real and consequential problems that I face (namely and respectively, working with HDF5 and in-memory manipulation of string datasets).

I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal).
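In code, the kind of utility I mean looks roughly like this (a sketch only; the helper names are made up, though np.char.decode/np.char.encode exist today):

    import numpy as np

    def decode_to_unicode(arr, encoding="utf-8"):
        # bytes ('S') array -> Unicode ('U') array; raises at this
        # boundary if the data is not actually in the claimed encoding
        return np.char.decode(arr, encoding)

    def encode_as_bytes(arr, encoding="utf-8"):
        # Unicode ('U') array -> fixed-width bytes ('S') array
        return np.char.encode(arr, encoding)

    raw = np.array([b"na\xc3\xafve", b"plain"], dtype="S8")
    text = decode_to_unicode(raw)          # array(['naïve', 'plain'], dtype='<U5')
    round_tripped = encode_as_bytes(text)  # back to a fixed-width bytes array

The point is that the decode step is explicit: the caller has to say what encoding the bytes are in, rather than the dtype silently assuming latin-1.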
--
Robert Kern