[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com
Fri Jan 17 07:35:42 EST 2014


On Fri, Jan 17, 2014 at 5:59 AM, Pauli Virtanen <pav at iki.fi> wrote:
> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
>> - inconvenience in dealing with strings in python 3.
>>
>> bytes are not strings in python3 which means ascii data is either a byte
>> array which can be inconvenient to deal with or 4 byte unicode which
>> wastes space.
>>
>> A proposal to fix this would be to add a one or two byte dtype with a specific
>> encoding that behaves similar to bytes but converts to string when outputting
>> to python for comparisons etc.
>>
>> For backward compatibility we *cannot* change S. Maybe we could change
>> the meaning of 'a' but it would be safer to add a new dtype, possibly
>> 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype.
>>
>> The main issue is probably: is it worth it and who does the work?
>
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not just
> use the bytes data type throughout the program?
>
> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)
>
> Christopher Barker wrote:
>>
>> How do you spell the dtype that 'S' give you????
>>
>
> 'S' is bytes.
>
> dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent.


That 'S' is bytes is a feature, not a bug, I thought.
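
To make the quoted equivalence concrete, here is a minimal sketch. The sample labels are invented for illustration. It checks that the three spellings give the same dtype, and compares per-item storage with the 4-byte-per-character unicode dtype mentioned in the quoted text:

```python
import numpy as np

# The three spellings quoted above resolve to the same bytes dtype.
same = np.dtype('S') == np.dtype(bytes) == np.dtype(np.bytes_)

# Illustrative data only: short ASCII labels.
as_bytes = np.array([b'abc', b'de'])   # inferred dtype is S3
as_text = np.array(['abc', 'de'])      # inferred dtype is U3

print(same)
print(as_bytes.dtype, as_bytes.dtype.itemsize)  # 1 byte per character
print(as_text.dtype, as_text.dtype.itemsize)    # 4 bytes per character
```

For three-character items this is 3 bytes against 12, which is the space trade-off the thread is arguing about.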

I didn't pay much attention to the two threads because I don't use
loadtxt. But I think the same issue is in genfromtxt, recfromtxt, ...

I don't have a lot of experience with python 3, but in the initial
python 3 compatibility conversion of statsmodels, I followed numpy's
lead and used the numpy helper functions and converted all strings to
bytes.

Everything loaded by genfromtxt or similar reads bytes; files are
opened with "rb".
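
A rough sketch of what opening with "rb" implies, using only plain file reads rather than genfromtxt itself (whose defaults may differ between NumPy versions). The file contents and variable names are made up for the example:

```python
import os
import tempfile

import numpy as np

# Write a tiny sample file (contents invented for this example).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as f:
    f.write(b'red\ngreen\nred\n')
    path = f.name

try:
    # Binary mode: every line comes back as bytes.
    with open(path, 'rb') as fh:
        raw_lines = [line.strip() for line in fh]
    # Text mode: every line is decoded to str.
    with open(path, 'r', encoding='ascii') as fh:
        text_lines = [line.strip() for line in fh]
finally:
    os.remove(path)

bytes_arr = np.array(raw_lines)    # bytes ('S') dtype
text_arr = np.array(text_lines)    # unicode ('U') dtype
print(bytes_arr.dtype, text_arr.dtype)
```

The same file therefore ends up as an 'S' array or a 'U' array depending only on the mode it was opened in.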

In most places our code doesn't really care, as long as numpy.unique
and similar functions work either way. But in some cases working with
bytes produced some strange behavior.
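
As an illustration only: the snippet below shows numpy.unique working on bytes, plus one possible source of surprise in Python 3, namely that bytes and str never compare equal. Whether this matches the strange cases mentioned above is an assumption; the labels are invented.

```python
import numpy as np

# Illustrative labels stored as bytes, as they would be after an "rb" read.
labels = np.array([b'b', b'a', b'b'])

# np.unique sorts and deduplicates bytes just as it does str.
uniq = np.unique(labels)

# A plain Python comparison between bytes and str is never equal in Python 3.
first = labels.tolist()[0]       # a plain Python bytes object
matches_str = first == 'b'       # False
matches_bytes = first == b'b'    # True

# Decoding explicitly gives a unicode array that compares against str.
decoded = np.char.decode(labels, 'ascii')

print(uniq, matches_str, matches_bytes, decoded)
```

Code that tests values against str literals would silently stop matching on bytes data unless the literals are bytes too or the data is decoded first.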

There are also some weirder cases with non-ASCII "strings", and I also
have problems in interactive work when the interpreter encoding
interferes.

Also maybe related: our Stata data file reader genfromdta handles
Cyrillic languages (Russian, IIRC) in the same way as ASCII. I don't
know the details, but Skipper fixed a bug so it works.

I'm pretty sure the statsmodels/pandas/patsy interaction has problems/bugs
with non-ASCII support in variable names, but my impression is that
string data as bytes causes few problems.


Josef

>
> --
> Pauli Virtanen
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


