Bug in genfromtxt with usecols and converters
Hi,
I tried to load data from a CSV file into numpy using genfromtxt. I need only a subset of the columns and want to apply some conversions to the data. Attached is a minimal script showing the error.
In brief, I want to load columns 1, 2 and 4. But in the converter function for the 4th column, I get the 3rd value. The issue does not occur if I also load the 3rd column.
Did I somehow misunderstand how the function is supposed to work, or is this indeed a bug?
I'm using Python 3.3.1 with numpy 1.8.1.
Regards,
Adrian
Hi Adrian,
Not sure whether to call it a bug; the error seems to arise before any actual data is read (it occurs even when reading from an empty string). When genfromtxt checks the filling_values used to substitute missing or invalid data, it apparently tests the converters on default testing values of 1 or -1, which your conversion scheme does not know about. Although I think it is rather the user’s responsibility to provide valid converters, the documentation should probably at least be updated to make users aware of this requirement.
I see two possible fixes/workarounds:
provide a keyword argument filling_values=[0,0,'1:1']
or add the default filling values to your relEnum dictionary, e.g. { … '-1':-1, '1':-1}
Could you check if this works for your case?
HTH,
Derek
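The second workaround can be sketched roughly as follows. This is only an illustration: the contents of REL_ENUM are hypothetical (the real relEnum dictionary lives in the attached script and is not shown in this thread), and rel_converter is a made-up name. Instead of adding '-1' and '1' as extra keys, the sketch falls back to -1 for any value the mapping does not know, which has the same effect for the test values mentioned above.

```python
# Hypothetical mapping; the real relEnum from the attached script is not shown.
REL_ENUM = {"1:1": 0, "1:n": 1, "n:1": 2, "n:n": 3}


def rel_converter(raw, default=-1):
    """Map a relation label to an integer, tolerating unknown test values."""
    # genfromtxt may hand the converter bytes or str depending on the
    # NumPy version and encoding settings, so accept both.
    text = raw.decode() if isinstance(raw, bytes) else raw
    # Unknown labels (such as the test values '1' or '-1') map to the default
    # instead of raising KeyError.
    return REL_ENUM.get(text.strip(), default)


print(rel_converter(b"1:1"))
print(rel_converter(b"-1"))
```

Such a function could be passed in the converters dictionary in place of the lambda, at the cost of silently mapping genuinely invalid labels to the default as well.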
Hi Derek,
thanks for your answer.
> provide a keyword argument filling_values=[0,0,'1:1']
This workaround seems to work, but I doubt that the actual problem is the converter function I pass. The '-1' that is used as the testing value comes from first_values for the 3rd column (line 1574 in npyio.py), but the converter is defined for column 4. Setting filling_values to an array of length 3 obviously makes the problem disappear. But I think that if the first row is used, the value should also be taken from the column for which the converter is defined.
Best,
Adrian
Hi Adrian,
It is certainly related to the converter function, because a KeyError is raised for the dictionary you provide:
  File "test.py", line 13, in <module>
    3: lambda rel: relEnum[rel.decode()]})
  File "/sw/lib/python3.4/site-packages/numpy/lib/npyio.py", line 1581, in genfromtxt
    missing_values=missing_values[i],)
  File "/sw/lib/python3.4/site-packages/numpy/lib/_iotools.py", line 784, in update
    tester = func(testing_value or asbytes('1'))
  File "test.py", line 13, in <lambda>
    3: lambda rel: relEnum[rel.decode()]})
KeyError: '-1'
But you are right that the problem with using first_values, which should of course be valid, somehow stems from the use of usecols. It seems that in the loop
    for (i, conv) in user_converters.items():
the i in user_converters and the i in usecols get out of sync. This certainly looks like a bug; the entire way of modifying i inside the loop appears a bit dangerous to me. I’ll have a look at whether I can make this safer.
As long as your data don’t actually contain any missing values, you might also simply use np.loadtxt.
Cheers,
Derek
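The index mismatch described above can be illustrated with a small stand-alone model. This is a simplified sketch, not NumPy's actual code: the sample row is hypothetical, and the variable names i and j are chosen here only to keep the position among the selected columns apart from the original column index.

```python
# Simplified model of the lookup discussed above; not NumPy's actual code.
first_values = ["10", "20", "-1", "1:1"]  # hypothetical first row, split on ","
usecols = [0, 1, 3]                       # load columns 1, 2 and 4
user_converters = {3: "converter for column 4"}  # keyed by original column index

for j, conv in user_converters.items():
    i = usecols.index(j)      # position of column j among the selected columns
    wrong = first_values[i]   # unfiltered row indexed by position
    right = first_values[j]   # unfiltered row indexed by original column
    print(f"column {j}: position {i}, tested on {wrong!r}, expected {right!r}")
```

Because the first row is not filtered by usecols, indexing it with the position among the selected columns picks the value from the skipped 3rd column, which matches the KeyError: '-1' in the traceback.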
Hi Derek,
> But you are right that the problem with using the first_values, which should of course be valid, somehow stems from the use of usecols, it seems that in that loop
>     for (i, conv) in user_converters.items():
> i in user_converters and in usecols get out of sync. This certainly looks like a bug, the entire way of modifying i inside the loop appears a bit dangerous to me. I’ll have a look if I can make this safer.
Thanks.
> As long as your data don’t actually contain any missing values you might also simply use np.loadtxt.
Ok, wasn't aware of that function so far. I will try that!
Best wishes,
Adrian
On 26 Aug 2014, at 09:05 pm, Adrian Altenhoff wrote:
>> This certainly looks like a bug, the entire way of modifying i inside the loop appears a bit dangerous to me. I’ll have a look if I can make this safer.
> Thanks.
It was first_values that needed to be addressed by the original indices. I have created a short test from your case and submitted a fix at https://github.com/numpy/numpy/pull/5006
Cheers,
Derek
Participants (2): Adrian Altenhoff, Derek Homeier