Proposal to change the default number of rows for DataFrame display (lower max_rows)
*[Note for those reading it on the pydata mailing list, please reply to pandas-dev@python.org to keep discussion centralised there]*

Hi all,

I am reposting the mail of Clemens below, but with a slightly changed focus, as I think the main discussion point is about the number of rows.

The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to lower the default number of rows shown when displaying a Series or DataFrame from 60 to 20. Thoughts on that?

Best,
Joris

2017-11-28 11:57 GMT+01:00 Clemens Brunner <clemens.brunner@gmail.com>:
Hello!
We're currently discussing a change in how data frames are displayed by default in https://github.com/pandas-dev/pandas/pull/17023. There are two proposed changes:
(1) Set pd.options.display.max_columns=0 (previously this was set to 20).
(2) Set pd.options.display.max_rows=20 (previously this was set to 60).

Change (1) means that the number of printed columns is adapted to fit within the width of the terminal. If there are too many columns, an ellipsis is shown to indicate collapsed columns in the middle of the data frame. This doesn't work if Python is run as a Jupyter kernel (e.g. in a Jupyter notebook or in the IPython QtConsole), in which case the maximum number of columns remains 20.
Example:
========
import pandas as pd
import numpy as np
pd.DataFrame(np.random.rand(5, 10))

Output before (in a terminal with 100 chars width):
---------------------------------------------------
          0         1         2         3         4         5         6  \
0  0.643979  0.690414  0.018603  0.991478  0.707534  0.376765  0.670848
1  0.547836  0.810972  0.054448  0.415112  0.268120  0.904528  0.839258
2  0.582256  0.732149  0.284208  0.405197  0.213591  0.715367  0.150106
3  0.197348  0.317159  0.051669  0.738405  0.821046  0.179270  0.245793
4  0.483466  0.583330  0.999213  0.882883  0.315169  0.045712  0.897048

          7         8         9
0  0.891467  0.494220  0.713369
1  0.601304  0.449880  0.266205
2  0.113262  0.360580  0.238833
3  0.798063  0.077769  0.471169
4  0.262779  0.530565  0.992084

Output after:
-------------
          0         1         2         3  ...         6         7         8         9
0  0.673621  0.211505  0.943201  0.946548  ...  0.900453  0.612182  0.861933  0.710967
1  0.670855  0.834449  0.796273  0.785976  ...  0.609954  0.686663  0.684582  0.837505
2  0.544736  0.814827  0.352893  0.459556  ...  0.650993  0.735943  0.279110  0.840203
3  0.440125  0.554323  0.745462  0.940896  ...  0.544576  0.224175  0.852603  0.509837
4  0.225551  0.791834  0.476059  0.321857  ...  0.391165  0.423213  0.290683  0.954423

[5 rows x 10 columns]
Change (2) means that fewer rows are displayed before auto-hiding takes place. I find that 60 rows almost always cause the terminal to scroll (most terminals have between 25 and 40 rows), so reducing the value to 20 increases the chance that a data frame fits on one terminal page. I'm not including a before/after output since it should be easy to imagine how this change affects the output.

Both changes would make pandas behave similarly to R's Tidyverse (which I really like), but this should not be the main reason why these changes are a good idea. I mainly like them because these settings make (large) data frames much nicer to look at.
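For readers who do want to see the effect of change (2), here is a small sketch. It assumes a 100-row frame of random numbers (an arbitrary choice for illustration) and applies the proposed value of 20 only temporarily via pd.option_context. The exact number of rows printed in the truncated output may differ between pandas versions and other display settings, so the sketch only inspects the overall shape of the output.

```python
import numpy as np
import pandas as pd

# A frame that is clearly longer than the proposed limit (size chosen arbitrarily).
df = pd.DataFrame(np.random.rand(100, 3))

# Apply the proposed max_rows value only inside this block.
with pd.option_context("display.max_rows", 20):
    text = repr(df)

lines = text.splitlines()
print(len(lines))   # far fewer lines than the 100 rows in the frame
print(lines[-1])    # the summary line that pandas appends to a truncated repr
```

The last line of a truncated repr reports the full dimensions, so no information about the size of the frame is lost.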
Note that these changes affect the default values. Of course, users are free to change them back in their active Python session.
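As a minimal sketch of what changing them back could look like, using pandas' pd.set_option, pd.get_option and pd.option_context (the values 60 and 20 are the previous defaults mentioned above):

```python
import pandas as pd

# Restore the previous defaults for the rest of the session.
pd.set_option("display.max_rows", 60)
pd.set_option("display.max_columns", 20)

# Or override a setting temporarily for a single block of code.
with pd.option_context("display.max_rows", 20):
    inside = pd.get_option("display.max_rows")

# Outside the block the session-wide value applies again.
after = pd.get_option("display.max_rows")
print(inside, after)
```

The context-manager form is handy when only one particular data frame needs a different limit.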
Comments on both proposed changes are highly welcome (either here on the mailing list or at https://github.com/pandas-dev/pandas/pull/17023).

Clemens

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
On Fri, Dec 8, 2017 at 8:54 AM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to lower the default number of rows shown when displaying a Series or DataFrame from 60 to 20. Thoughts on that?
Personally, I always set the max rows to 10 or 20, so I'd be OK with it if the community is on board.

Tom
Coming back to this (we are again discussing a concrete PR proposing this change: https://github.com/pandas-dev/pandas/pull/20514)

2017-12-08 16:11 GMT+01:00 Tom Augspurger <tom.augspurger88@gmail.com>:
Personally, I always set the max rows to 10 or 20, so I'd be OK with it if the community is on board.
I also often set this at a lower value like that (e.g. typically for tutorials), so I am also in favor of changing *something*. However, my main 'problem' is that, in interactive usage, a lower default makes it very cumbersome to actually look at more data (changing the setting just to inspect some data). For example, if the new max_rows default were 10, doing df.head(20) to quickly inspect some more data would still only show 10 rows.

We cannot change what a function like head does (it is still a normal repr following the same options, since it needs to actually return a dataframe, not only display it), so I have another proposal:

- We have 2 thresholds instead of 1 (the current 'max_rows'): a number of rows to show *in* a truncated repr, and a max number of rows to show without truncating
- For 'big' dataframes, we show a truncated repr. And then I would go even lower than 20 and only show first/last 5 (so like a max_rows of 10)
- For 'small' dataframes, we show the full dataframe without truncating, up to the threshold.

Of course, then the difficulty is to determine what we call 'big' and 'small', so what is the threshold to show a truncated repr (and this part will again get more subjective :)). But for example, using the current max_rows of 60: we could show a full repr up to 60 rows, and once the number of rows > 60, we only show 10 (first/last 5).

You can then still set both thresholds at the same number (like 20) to not get this variable behaviour.

This is actually similar to what numpy arrays do (but with a bigger threshold: e.g. np.random.randn(1000) shows all 1000 elements, np.random.randn(1001) shows the first/last 3).

It's just an idea, but I think this might be a way to satisfy more use cases at once (and more possibility to fine tune the behaviour).

Joris
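To make the two-threshold idea concrete, here is a rough sketch in plain Python. It is not pandas code: the function rows_to_show and the parameter name min_rows are hypothetical, chosen only for this illustration, and the sketch assumes min_rows is not larger than max_rows.

```python
def rows_to_show(n_rows, max_rows=60, min_rows=10):
    """Return the row positions that would be displayed.

    max_rows: largest frame that is still shown in full.
    min_rows: hypothetical name for the number of rows shown once the
              frame is truncated (split between the head and the tail).
    """
    if n_rows <= max_rows:
        # 'Small' frame: show everything.
        return list(range(n_rows))
    # 'Big' frame: show only the first and last few rows.
    head = min_rows // 2
    tail = min_rows - head
    return list(range(head)) + list(range(n_rows - tail, n_rows))


print(len(rows_to_show(60)))  # 60 rows: still shown in full
print(rows_to_show(61))       # 61 rows: first 5 and last 5 only
```

Setting both thresholds to the same number, e.g. rows_to_show(n, max_rows=20, min_rows=20), gives back the single-threshold behaviour.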
2018-03-28 12:16 GMT+02:00 Joris Van den Bossche <jorisvandenbossche@gmail.com>:
This is actually similar to what numpy arrays do (but with a bigger threshold: e.g. np.random.randn(1000) shows all 1000 elements, np.random.randn(1001) shows the first/last 3).

And it seems this is also what R tibbles do: they have "print_min" and "print_max" options with exactly this behaviour, only their thresholds are lower (10 and 20, respectively):
options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
(from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
Joris
participants (2)
- Joris Van den Bossche
- Tom Augspurger