Hi Matthew,
This may not be the best place to discuss the problems of pandas, but to show that I am not missing something, let's consider a simple example.
import numpy
import pandas

# simplest DataFrame
x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))
# simplest indexing. Can you predict results without looking at comments?
x[:2] # returns two first rows, as expected
x[[0, 1]] # returns copy of x, whole dataframe
x[numpy.array(2)] # fails with IndexError: indices are out-of-bounds (can you guess why?)
x[[0, 1], :] # unhashable type: list
Just in case: I know about .loc and .iloc. But when you write code with many subroutines, you concentrate on numpy inputs, and at some point you simply forget to convert some of the data to numpy. The code continues to work, but it yields wrong results (you tested everything, but only with numpy inputs). Checking all the inputs in each small subroutine would be strange.
Ok, a bit more:
x[x['a'] > 5] # works as expected
x[x['a'] > 5, :] # 'Series' objects are mutable, thus they cannot be hashed
lookup = numpy.arange(10)
x[lookup[x['a']] > 5] # works as expected
x[lookup[x['a']] > 5, :] # TypeError: unhashable type: 'numpy.ndarray'
x[lookup]['a'] # IndexError
x['a'][lookup] # works as expected
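For contrast, here is a minimal sketch of the same selections written with the explicit indexers (.iloc for positions, .loc for masks). It assumes a reasonably recent pandas where Series.to_numpy() is available; the variable names are only illustrative.

```python
import numpy
import pandas

x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))

# Positional row selection: rows at positions 0 and 1
first_two = x.iloc[[0, 1]]

# Boolean mask plus an explicit column slice
mask = (x['a'] > 5).to_numpy()
filtered = x.loc[mask, :]

# Integer lookup array used as row positions, then a column
lookup = numpy.arange(10)
by_lookup = x.iloc[lookup]['a']
```

This does not remove the problem described above (plain [] still accepts the same inputs silently), it only shows the spelling that leaves no room for guessing.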
Now let's go a bit further and split the data into train and test sets for machine learning (again, one of the most frequent operations)
from sklearn.model_selection import train_test_split
x1, x2 = train_test_split(x, random_state=42)
# compare the following operations with the pandas.DataFrame ones above
col = x1['a']
print(col[:2]) # first two elements
print(col[[0, 1]]) # doesn't fail (although there is no row with index 0), fills it with NaN
print(col[numpy.arange(2)]) # same as previous
print(col[col > 4]) # as expected
print(col[col.values > 4]) # as expected
print(col.values[col > 4]) # converts boolean to int, uses int indexing, but at least raises a warning
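One way to sidestep the label/position ambiguity after a shuffle is to drop to plain numpy before indexing. A minimal sketch, assuming a pandas version with Series.to_numpy(); a fixed permutation stands in for train_test_split here so that the example has no scikit-learn dependency.

```python
import numpy
import pandas

x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))

# Shuffle the rows with a fixed permutation (stand-in for a train/test split)
rng = numpy.random.RandomState(42)
order = rng.permutation(len(x))
x1 = x.iloc[order[:7]]
col = x1['a']

# Convert once, then index with ordinary numpy semantics
values = col.to_numpy()
first_two = values[:2]          # first two elements by position
selected = values[values > 4]   # boolean mask built from the same array
```

Here col.iloc[[0, 1]] gives the same two elements as values[:2], because both are purely positional.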
Mistakes caused by such silent misbehavior are not easy to detect (when your data pipeline consists of several steps), their source is quite hard to locate, and it is almost impossible to be sure that you have indeed avoided all such pitfalls. Code review turns into a paranoid process (if you care about the result, of course).
Things are even worse, because I've demonstrated this for my installation, and if you run this with some other pandas installation, you will probably get different results (and these were really basic operations). So things that worked fine in one version may work differently in another, and this becomes completely intractable.
Pandas may be nice if you need a report and you need to get it done by tomorrow. Then you'll throw away the code. When we initially used pandas as the main data storage in yandex/rep, it looked like a good idea, but a year later it was obvious that this was the wrong decision. When you build a data pipeline or research code that should still work several years later (on some other installation, run by someone else), usage of pandas should be minimal.
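Since the results depend on the installation, it helps to record the versions next to any such experiment. A small sketch using only the standard version attributes:

```python
import sys

import numpy
import pandas

# Collect the versions that determine the indexing behaviour shown above
versions = {
    "python": sys.version.split()[0],
    "numpy": numpy.__version__,
    "pandas": pandas.__version__,
}
print(versions)
```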
That's why I am looking for a reliable pandas substitute, which should be:
- completely consistent with numpy, and it should fail when something isn't implemented or is impossible
- fewer new abstractions: nobody wants to learn one-more-way-to-manipulate-the-data, especially other researchers
- it may be less convenient for interactive data munging
- in particular, fewer methods is ok
- written code should be interpretable and hard to misinterpret
- not super slow; datasets of 1-10 gigabytes are a normal situation
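To make the first requirements concrete, here is a rough sketch of the kind of container meant: a dict of equal-length numpy arrays whose row indexing is delegated to numpy and which refuses everything else. StrictTable is a hypothetical name invented for this illustration, not an existing library.

```python
import numpy


class StrictTable:
    """Hypothetical column store: equal-length 1-D numpy arrays keyed by name."""

    def __init__(self, **columns):
        self._columns = {name: numpy.asarray(values) for name, values in columns.items()}
        if any(col.ndim != 1 for col in self._columns.values()):
            raise ValueError("every column must be one-dimensional")
        if len({len(col) for col in self._columns.values()}) > 1:
            raise ValueError("all columns must have the same length")

    def __getitem__(self, key):
        # A string selects one column and returns the plain numpy array
        if isinstance(key, str):
            return self._columns[key]
        # Rows are selected by numpy itself, so semantics are numpy's
        if isinstance(key, (slice, list, numpy.ndarray)):
            return StrictTable(**{name: col[key] for name, col in self._columns.items()})
        # Anything else (tuples, Series, ...) fails loudly
        raise TypeError("unsupported index type: %s" % type(key).__name__)


t = StrictTable(a=numpy.arange(10), b=numpy.arange(10, 20))
head = t[:2]['a']
picked = t[[0, 1]]['b']
masked = t[t['a'] > 5]['a']
```

An index such as t[[0, 1], :] raises TypeError here instead of being reinterpreted, and a 0-d array index raises ValueError because the result is no longer a column.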
Well, that's it.
Sorry for the long letter.
Alex.
Alex,
Can you please post some code showing exactly what you are trying to do and any issues you are having, particularly the "irritating problems with its row indexing and some other problems" you quote above?
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion