[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Wed Jan 15 07:38:57 EST 2014

On 01/15/2014 11:25 AM, Daπid wrote:
> On 15 January 2014 11:12, Hedieh Ebrahimi <hedieh.ebrahimi at amphos21.com
> <mailto:hedieh.ebrahimi at amphos21.com>> wrote:
>
>     I try to print my fileContent array after I read it and it looks
>     like this :
>
>     ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
>       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
>       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>
>     Why is this happening and how can I prevent it ?
>     Also if I have a line that starts like this in my file, python will
>     crash on me. how can i fix this ?
>
>
> What is wrong with this case? If you are concerned about the multiple
> backslashes, they are there because they are special symbols, and so
> they have to be escaped (you actually want a backslash, not whatever
> else they could mean).
>

You are seeing the repr of a bytes object stored as a string, with the
backslashes escaped a second time.
It's due to unicode strings in Python 3.
A workaround that only works for ASCII is:

np.loadtxt(file, dtype=bytes).astype(str)
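
A minimal sketch of that workaround, assuming a small ASCII-only file of
paths like the one in the question (the file name and its contents are
made up for illustration, and the exact behaviour may vary between numpy
versions):

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data mirroring the paths in the question.
paths = [
    r"C:\Users\Documents\Project\mytextfile1.txt",
    r"C:\Users\Documents\Project\mytextfile2.txt",
    r"C:\Users\Documents\Project\mytextfile3.txt",
]

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "paths.txt")
    with open(fname, "w", encoding="ascii") as f:
        f.write("\n".join(paths) + "\n")

    # Read the fields as bytes first, then convert to str.
    # The conversion assumes the data is pure ASCII.
    as_bytes = np.loadtxt(fname, dtype=bytes)
    as_str = as_bytes.astype(str)

# Each element is now a plain string with single backslashes.
for p in as_str:
    print(p)
```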

For non-ASCII data I guess you should use Python directly, as numpy
would also require a Python loop with explicit decoding.
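
One way the plain-Python route could look, as a sketch only: it assumes
one value per line and that the file is UTF-8 encoded, and the file name
and sample names are invented for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical non-ASCII sample data, one entry per line.
names = ["café.txt", "naïve.txt", "日本語.txt"]

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "names.txt")
    with open(fname, "wb") as f:
        f.write("\n".join(names).encode("utf-8") + b"\n")

    # Read raw bytes and decode every line explicitly.
    # The encoding (UTF-8 here) has to be known in advance.
    with open(fname, "rb") as f:
        decoded = [line.rstrip(b"\r\n").decode("utf-8") for line in f]

# Building the array from Python strings gives a unicode ('U') dtype.
arr = np.array(decoded)
print(arr.dtype.kind)
```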

Currently, handling strings in Python 3 with numpy is even worse than
before: you always have to go through bytes and do explicit decodes to
get Python strings out of ASCII data.

What we might need in numpy is new string dtypes specifying encodings,
to allow sane conversion to Python 3 strings without the excessive
memory usage of 4-byte unicode (UCS-4).
E.g. if it's ASCII, reuse 'a' (which currently maps to bytes):

     np.loadtxt(file, dtype='a')

For UTF-8 data:

     d = np.loadtxt(file, dtype='utf8')

so that type(d[0]) would be unicode rather than bytes, which is what
you currently get if you don't want to store your arrays at 4 bytes per
character.
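
To illustrate the trade-off with what exists today (this is not the
proposed dtype, just a sketch with made-up sample words): a bytes array
uses one byte per character but yields bytes elements, while decoding
it gives string elements at four bytes per character.

```python
import numpy as np

# Hypothetical ASCII sample data.
words = ["alpha", "beta", "gamma"]

# Bytes ('S') array: 1 byte per character, elements are bytes.
as_bytes = np.array([w.encode("ascii") for w in words])

# Unicode ('U') array: 4 bytes per character, elements are str.
as_unicode = np.char.decode(as_bytes, "ascii")

print(as_bytes.dtype, as_bytes.dtype.itemsize)
print(as_unicode.dtype, as_unicode.dtype.itemsize)
print(type(as_bytes[0]), type(as_unicode[0]))
```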