[Numpy-discussion] Use of NameValidator in np.genfromtxt is inconsistent with the rules for naming structured array fields

Alistair Muldal alimuldal at gmail.com
Sat Mar 21 10:32:51 EDT 2015


Hi all,

I originally posted this to the issue tracker 
(https://github.com/numpy/numpy/issues/5686), and am posting here as 
well at the request of charris.

Currently, np.genfromtxt uses a numpy.lib._iotools.NameValidator which 
mangles field names by replacing spaces and stripping out certain 
non-alphanumeric characters etc.:

     import numpy as np
     from io import BytesIO

     # bytes literal, so that BytesIO(s) also works on Python 3
     s = b'name,name with spaces,2*(x-1)!\n1,2,3\n4,5,6'
     x = np.genfromtxt(BytesIO(s), delimiter=',', names=True)
     print(repr(x))
     # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
     #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'),
     #              ('2x1', '<f8')])
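
For reference, the mangling shown above can be approximated in a few lines of pure Python. This is only a sketch, not NumPy's implementation: `mangle` and `DELETECHARS` are names made up for illustration, the character set is an assumption modelled on what genfromtxt appears to strip, and the real NameValidator does more (excluded names, duplicate handling, case conversion).

```python
# Simplified approximation of the field-name mangling described above.
# NOT NumPy's code: DELETECHARS is an assumed set of characters to strip.
DELETECHARS = set("~!@#$%^&*()-=+\\|]}[{';: /?.>,<")


def mangle(name, replace_space="_", deletechars=DELETECHARS):
    """Strip whitespace, replace spaces, then drop 'illegal' characters."""
    name = name.strip()
    if replace_space:
        name = name.replace(" ", replace_space)
    return "".join(c for c in name if c not in deletechars)


names = ["name", "name with spaces", "2*(x-1)!"]
mangled = [mangle(n) for n in names]
print(mangled)
# ['name', 'name_with_spaces', '2x1']
```

Note that the last result, '2x1', starts with a digit, so the output is still not guaranteed to be a valid Python identifier.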

This behaviour has been the cause of some confusion in the past, e.g. 
http://stackoverflow.com/q/29097917/1461210, 
http://stackoverflow.com/q/16020137/1461210. Part of the issue is that 
it's currently not very well covered by the documentation for 
np.genfromtxt - at best, it's alluded to in the descriptions for some 
of the keyword arguments ('deletechars', 'autostrip', 'replace_space' etc.).

However, I think the more fundamental problem is that this behaviour 
seems to be inconsistent with the rules for naming the fields in 
structured arrays. In the example above, all of the original field names 
are perfectly legal:

     names = ['name', 'name with spaces', '2*(x-1)!']
     types = ('f',) * 3
     # list() is needed on Python 3, where zip() returns an iterator
     dtype = list(zip(names, types))

     x2 = np.empty(2, dtype=dtype)
     x2[:] = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
     print(repr(x2))
     # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
     #       dtype=[('name', '<f4'), ('name with spaces', '<f4'),
     #              ('2*(x-1)!', '<f4')])
     print(x2['2*(x-1)!'])
     # [3. 6.]

What is the rationale behind the use of NameValidator here? One possible 
reason would be to ensure that the field names would also be legal for 
an np.recarray. However, this doesn't make sense for several reasons:

Firstly, the names above also seem to be legal field names for a recarray:

     xr = x2.view(np.recarray)
     print(repr(xr))
     # rec.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
     #       dtype=[('name', '<f4'), ('name with spaces', '<f4'),
     #              ('2*(x-1)!', '<f4')])

Obviously if the field names aren't valid Python identifiers then it 
won't be possible to access them via 'xr.fieldname' syntax, but 
dict-style indexing is still fine, e.g. xr['2*(x-1)!']. Also, if the 
goal of NameValidator were to ensure that the field names were always 
valid Python identifiers then it currently fails at this anyway, since 
in my first example, '2x1' is not a valid Python identifier.

What is perhaps most confusing is the fact that np.genfromtxt will even 
mangle field names that you pass in directly via the 'names' keyword 
argument. Suppose you wanted to specify field names that NameValidator 
doesn't like. You might try something like this:

     print(repr(np.genfromtxt(BytesIO(s), delimiter=',', names=names,
                              skip_header=1)))
     # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
     #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'),
     #              ('2x1', '<f8')])

Or even this:

     print(repr(np.genfromtxt(BytesIO(s), delimiter=',', names=names,
                              skip_header=1, deletechars=[],
                              replace_space=False, excludelist=[],
                              autostrip=False)))
     # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
     #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'),
     #              ('2x1', '<f8')])

Still no luck! As far as I can tell, there is no option in np.genfromtxt 
that allows you to preserve field names that don't conform to 
NameValidator's seemingly arbitrary rules.

What should be done about this? Personally, I think that either things 
like spaces and non-alphanumeric characters should be disallowed in 
structured array field names altogether (and my second example should 
raise an exception), or np.genfromtxt should leave field names alone by 
default.

It would also be a good idea to raise a SyntaxWarning in a case where 
the user creates a recarray containing field names that are not valid 
Python identifiers (and are therefore incompatible with the dot indexing 
syntax). This is essentially what PyTables does for non-conforming HDF5 
node names: 
https://github.com/PyTables/PyTables/blob/13047c897d28b7278cbeab732f12feadbfef3f22/tables/exceptions.py#L285-L294.
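
To make the warning idea concrete, here is a rough pure-Python sketch. The helper `warn_on_non_identifier_fields` is hypothetical (nothing like it exists in NumPy as far as this post is concerned), and the message wording is only a placeholder; it simply flags names that could not be used with dot access.

```python
import keyword
import warnings


def warn_on_non_identifier_fields(field_names):
    """Hypothetical helper: warn about names unusable with dot access."""
    bad = [name for name in field_names
           if not name.isidentifier() or keyword.iskeyword(name)]
    for name in bad:
        warnings.warn(
            "field name %r is not a valid Python identifier; "
            "use rec[%r] instead of attribute access" % (name, name),
            SyntaxWarning,
            stacklevel=2,
        )
    return bad


names = ['name', 'name with spaces', '2*(x-1)!', '2x1']
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure every warning is recorded
    flagged = warn_on_non_identifier_fields(names)

print(flagged)
# ['name with spaces', '2*(x-1)!', '2x1']
```

A check like this leaves the field names untouched and only tells the user which ones need dict-style indexing.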

Any thoughts on this?

Alistair
