small improvement idea for the CSV module

Hello Python-devs,

The csv module is probably heavily used by newcomers to Python, since CSV is a very popular data exchange format. Although there are better tools for processing tabular data, like SQLite or Pandas, I suspect this is still a very popular module.

There are many examples floating around of how one can read and process CSV with the csv module. Quite a few tutorials show how to use namedtuple to save memory and gain speed over DictReader. Python's own documentation has a recipe in the collections module [1].

Hence, I was wondering: why not go the extra step and add a new class, NamedTupleReader, to the csv module? This class would do a good service for Python's users, especially newcomers who are not yet aware of modules like collections.

Would someone be willing to sponsor and review such a PR from me? As a smaller change, we could simply add a link from the csv module's documentation to the recipe in the collections module.

What do you think?

Best regards
Oz

[1]: https://docs.python.org/3/library/collections.html?highlight=namedtuple%20csv#collections.namedtuple

---
Imagine there's no countries
it isn't hard to do
Nothing to kill or die for
And no religion too
Imagine all the people
Living life in peace
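For readers following the thread, here is a minimal sketch of what the proposed reader might look like. It is not the collections recipe and not an existing csv API: the name `namedtuple_reader` and its `typename` parameter are hypothetical, and the sketch assumes the first row of the file is a header.

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(csvfile, typename="Row", **reader_kwargs):
    """Yield each CSV record as a namedtuple built from the header row.

    Hypothetical helper for illustration; not part of the csv module.
    """
    reader = csv.reader(csvfile, **reader_kwargs)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    # rename=True replaces invalid or duplicate field names with _0, _1, ...
    row_type = namedtuple(typename, header, rename=True)
    for record in reader:
        if record:  # skip blank lines, as DictReader does
            yield row_type._make(record)


data = io.StringIO("id,name,price\n1,cornflakes,0.99\n2,oats,1.50\n")
for row in namedtuple_reader(data):
    print(row.id, row.name, row.price)
```

Note that `_make` raises `TypeError` when a record has a different number of fields than the header, so a real implementation would have to decide how to treat short or long rows.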

Since 3.7 it may be that dataclasses offer a cleaner implementation of the functionality you suggest. It shouldn't be too difficult to produce code that uses dataclasses in 3.7+ but falls back to namedtuples when necessary. You may wish to consider such an implementation strategy.

Best wishes,
Steve Holden
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GRPUTYZO... Code of Conduct: http://python.org/psf/codeofconduct/

Hi Steve,

Thanks for your reply. While dataclasses provide a cleaner API than the dict rows returned by DictReader (you can access `row.id` instead of `row["id"]`), they still use the built-in `__dict__` instead of `__slots__`:

```
>>> from dataclasses import dataclass
>>> @dataclass
... class InventoryItem:
...     '''Class for keeping track of an item in inventory.'''
...     name: str
...     unit_price: float
...     quantity_on_hand: int = 0
...
>>> cf = InventoryItem("cornflakes", 0.99, 123)
>>> cf
InventoryItem(name='cornflakes', unit_price=0.99, quantity_on_hand=123)
>>> cf.__dict__
{'name': 'cornflakes', 'unit_price': 0.99, 'quantity_on_hand': 123}
```
This means that users reading large files won't see the suggested memory improvements.

On the other hand, I'm willing to implement CSV reader classes for both. `DataClassCSVReader` offers the benefit of mutable row instances, while `NamedTupleCSVReader` can be useful for people leaning toward a functional programming style, where queries on CSV data are only meant to find items or calculate quantities quickly without actually modifying the rows.

I would be more than happy to know whether such a PR would be accepted.

Best regards
Oz
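A minimal sketch of the `__slots__` point, for comparison: a dataclass can declare `__slots__` by hand, in which case its instances carry no per-instance `__dict__` while remaining mutable. The class name `SlottedItem` is hypothetical, and the fields have no defaults because a default value would clash with the slot of the same name.

```python
from dataclasses import dataclass


@dataclass
class SlottedItem:
    """Hypothetical row type: a dataclass that declares __slots__ by hand."""
    __slots__ = ("name", "unit_price", "quantity_on_hand")
    name: str
    unit_price: float
    quantity_on_hand: int  # no default: it would conflict with __slots__


item = SlottedItem("cornflakes", 0.99, 123)
item.quantity_on_hand -= 1           # rows stay mutable
print(item)
print(hasattr(item, "__dict__"))     # False: no per-instance dict
```

From Python 3.10 onward, `@dataclass(slots=True)` generates the slots automatically and also permits field defaults.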

If using a dictionary but still requiring attribute access, techniques such as those used at https://github.com/holdenweb/hw can be used to simplify client code.

Kind regards,
Steve Holden

On 10/30/2019 02:53 AM, Steve Holden wrote:
> If using a dictionary but still requiring attribute access, techniques such as those used at https://github.com/holdenweb/hw can be used to simplify client code.

Unless I'm missing something, that doesn't have the memory improvement that namedtuples would.

--
~Ethan~

On Wed, Oct 30, 2019 at 11:55 PM Oz Tiram <oz.tiram@gmail.com> wrote:
> Hi Steve,
> Thanks for your reply. While dataclasses provide a cleaner API than the dict rows returned by DictReader (you can access `row.id` instead of `row["id"]`), they still use the built-in `__dict__` instead of `__slots__`.
> This means that users reading large files won't see the suggested memory improvements.

FWIW, there are memory improvements thanks to the key-sharing dictionary. See PEP 412 [1]. I have an idea about utilizing the key-sharing dictionary in DictReader, but I have not implemented it yet.

[1]: https://www.python.org/dev/peps/pep-0412/
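The memory claims in this part of the thread can be checked empirically. The following is a rough sketch, assuming CPython and using `tracemalloc` to count the bytes allocated while building many rows of each kind. All names (`TupleRow`, `DataclassRow`, `dict_row`, `allocated_bytes`) are hypothetical, and the absolute numbers depend on the interpreter version, so none are quoted here.

```python
import tracemalloc
from collections import namedtuple
from dataclasses import dataclass

FIELDS = ("name", "unit_price", "quantity_on_hand")

# Three hypothetical row representations holding the same fields.
TupleRow = namedtuple("TupleRow", FIELDS)


@dataclass
class DataclassRow:  # attributes live in a per-instance __dict__
    name: str
    unit_price: float
    quantity_on_hand: int


def dict_row(name, unit_price, quantity_on_hand):
    # Plain dict, similar in spirit to what DictReader yields.
    return {"name": name, "unit_price": unit_price,
            "quantity_on_hand": quantity_on_hand}


def allocated_bytes(make_row, n=10_000):
    """Return the bytes tracemalloc attributes to building n rows."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    rows = [make_row("cornflakes", 0.99, i) for i in range(n)]
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del rows
    return after - before


for label, make_row in [("dict", dict_row),
                        ("dataclass", DataclassRow),
                        ("namedtuple", TupleRow)]:
    print(f"{label:>10}: {allocated_bytes(make_row)} bytes")
```

The measurement includes the list holding the rows and the integer field values, which are the same for all three variants, so only the differences between the totals are meaningful.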

On Wed, Oct 30, 2019, 09:43 Steve Holden wrote:
> Since 3.7 it may be that dataclasses offer a cleaner implementation of the functionality you suggest.
Actually, IMO in this case it would be more useful and fitting to use namedtuples rather than dataclasses, since CSV rows are naturally tuple-like, and it would be compatible with existing code written for the tuple-based interface.
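The compatibility argument can be illustrated with a short sketch. The `Row` type and its field names are hypothetical; the point is only that a namedtuple row still behaves like the plain tuple or list row that existing csv code expects, while adding attribute access.

```python
from collections import namedtuple

# Hypothetical row type, standing in for what a namedtuple-based reader yields.
Row = namedtuple("Row", ["id", "name", "price"])
row = Row("1", "cornflakes", "0.99")

# Existing tuple-style code keeps working unchanged.
ident, name, price = row          # unpacking
second_column = row[1]            # positional indexing
print(isinstance(row, tuple), len(row), second_column)

# New code can use field names instead of positions.
print(row.name, row.price)
```

A dataclass row would support the attribute access but not the unpacking or indexing, which is the incompatibility being pointed out.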

29.10.19 22:37, Oz Tiram wrote:
> Quite a few tutorials show how to use namedtuple to gain memory saving and speed, over the DictReader. Python's own documentation has got a recipe in the collections modules[1] Hence, I was wondering why not go the extra step and add a new class to the CSV module NamedTupleReader? This class would do a good service for Python's users, especially newcomers who are still not aware of modules like the collections module.

See https://bugs.python.org/issue1818

Hi Serhiy,

Thanks! Now I am feeling confused. On the one hand, it was already tried 10 years ago. On the other hand, people obviously do wish to have it. I'm going to send a PR on GitHub. Let's see whether a new PR with some documentation will be appreciated.

Oz

On Wed, Oct 30, 2019, 16:25 Serhiy Storchaka <storchaka@gmail.com> wrote:
> See https://bugs.python.org/issue1818

On 29Oct2019 21:37, Oz Tiram <oz.tiram@gmail.com> wrote:
> Quite a few tutorials show how to use namedtuple to gain memory saving and speed, over the DictReader.
> [...]
> Python's own documentation has got a recipe in the collections modules[1] Hence, I was wondering why not go the extra step and add a new class to the CSV module NamedTupleReader? This class would do a good service for Python's users, especially newcomers who are still not aware of modules like the collections module.

Just for some context: it is often suggested that several things which are just documentation examples in modules like collections or itertools get promoted to presupplied functions/classes. It tends to fail. Personally I think at least some of these would be a good thing. (Disclaimer: I am not a core dev.)

The tricky bit is the bikeshedding: what nice-but-superfluous features or corner cases should it support? I'd be for such things provided they lent themselves to extension, i.e. "here's a basic implementation which is easily subclassable for extra features".

The other thing is that there are probably examples of csv->namedtuple scattered throughout PyPI; users might just need to find them. I know I've written such a thing for myself: https://pypi.org/project/cs.csvutils/

I entirely agree this would be easier to find and use in the stdlib. And mine is probably overfeatured and underclean for use in the stdlib.

Cheers,
Cameron Simpson <cs@cskk.id.au>
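To make the "basic implementation which is easily subclassable" idea concrete, here is one possible shape, offered only as a sketch. The class names and the `make_row_type` hook are hypothetical and are not taken from cs.csvutils, bpo-1818, or any stdlib proposal.

```python
import csv
import io
from collections import namedtuple


class NamedTupleReader:
    """Hypothetical base class: iterate over CSV records as namedtuples."""

    def __init__(self, csvfile, **reader_kwargs):
        self._reader = csv.reader(csvfile, **reader_kwargs)
        header = next(self._reader, None)
        self.row_type = None if header is None else self.make_row_type(header)

    def make_row_type(self, header):
        # Extension point: subclasses may clean up or rename header fields.
        return namedtuple("Row", header, rename=True)

    def __iter__(self):
        if self.row_type is None:  # empty input
            return
        for record in self._reader:
            if record:  # skip blank lines
                yield self.row_type._make(record)


class LowercaseReader(NamedTupleReader):
    """Example extension: lower-case header names and replace spaces."""

    def make_row_type(self, header):
        cleaned = [name.strip().lower().replace(" ", "_") for name in header]
        return super().make_row_type(cleaned)


data = io.StringIO("Item Name,Unit Price\ncornflakes,0.99\n")
for row in LowercaseReader(data):
    print(row.item_name, row.unit_price)
```

Keeping the base class this small leaves the bikeshedding questions (type conversion, ragged rows, header normalisation) to subclasses rather than to the core implementation.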
participants (7)
- Cameron Simpson
- Ethan Furman
- Inada Naoki
- Oz Tiram
- Serhiy Storchaka
- Steve Holden
- Tal Einat