[Tutor] Most efficient way to replace ", " with "." in a array and/or dataframe
Cameron Simpson
cs at cskk.id.au
Sun Sep 22 04:20:52 EDT 2019
On 22Sep2019 07:39, Albert-Jan Roskam <sjeik_appie at hotmail.com> wrote:
>On 22 Sep 2019 04:27, Cameron Simpson <cs at cskk.id.au> wrote:
>On 21Sep2019 20:42, Markos <markos at c2o.pro.br> wrote:
>>I have a table.csv file with the following structure:
>>
>>, Polyarene conc ,, mg L-1 ,,,,,,,
>>Spectrum, Py, Ace, Anth,
>>1, "0,456", "0,120", "0,168"
>>2, "0,456", "0,040", "0,280"
>>3, "0,152", "0,200", "0,280"
>>
>>I open as dataframe with the command:
>>data = pd.read_csv ('table.csv', sep = ',', skiprows = 1)
>[...]
>>And the data_array variable gets the fields in string format:
>>[['0,456' '0,120' '0,168']
>[...]
>
>>Please see the documentation for the read_csv function here:
>>
>>  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
>
>Do you think it's a deliberate design choice that decimal and thousands
>were used here as params, and not a 'locale' param? It seems nice to
>be able to specify e.g. locale='dutch' and then all the right
>lc_numeric, lc_monetary, lc_time were used. Or even
>locale='nl_NL.1252' and you also wouldn't need 'encoding' as a separate
>param. Or might that be bad on Windows where there's no locale-gen?
>Just wondering...
Locales are tricky; I don't know enough.
A locale parameter might be convenient for some things, but such things
are table driven. From an arbitrary Linux box nearby:
% locale -a
C
C.UTF-8
POSIX
en_AU.utf8
No "dutch" or similar there.
I doubt pandas would ship with such a thing. And the OP probably doesn't
know the originating locale anyway. Nor do _we_ know that those values
themselves were derived from some well-known locale table.
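To illustrate the "table driven" point, here is a minimal sketch using only the standard library's locale module. The helper name locale_available is made up for this example, and which names succeed depends entirely on what is installed on the machine running it:

```python
import locale


def locale_available(name):
    """Return True if this system can activate `name` for LC_NUMERIC.

    Hypothetical helper for illustration; the answer depends on the
    locale tables installed on the host, not on Python itself.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
    except locale.Error:
        return False
    finally:
        # Always restore the previous setting; setlocale is process-wide.
        locale.setlocale(locale.LC_NUMERIC, saved)
    return True


# "C" is always present; names such as "nl_NL.UTF-8" may or may not be.
for candidate in ("C", "nl_NL.UTF-8", "dutch"):
    print(candidate, locale_available(candidate))
```

On a box like the one above, only the locales listed by "locale -a" would be expected to report True.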
The advantage of specific decimal= and thousands= parameters is that
they do exactly what they say, rather than looking up a locale and
hoping for a specific side effect. So the specific parameters offer
better control.
The thousands= itself is a little parochial (for example, in India a
factor of 100 is a common division point[1]), but it may merely be used
to strip this character from the left portion of the number.
[1] https://en.wikipedia.org/wiki/Indian_numbering_system
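As a rough sketch of that "just strip the character" idea, in plain Python: parse_number below is a hypothetical helper written for this message, not pandas' actual implementation, and I am only assuming pandas behaves similarly. If the separator is simply removed, the grouping convention stops mattering:

```python
def parse_number(text, decimal=".", thousands=None):
    """Hypothetical helper: parse a number using explicit separator characters."""
    if thousands:
        # Drop every grouping character, wherever it appears.
        text = text.replace(thousands, "")
    if decimal != ".":
        # Normalise the decimal mark to what float() expects.
        text = text.replace(decimal, ".")
    return float(text)


# Western grouping (thousands) and Indian grouping (lakhs) of the same value.
western = parse_number("1,234,567.5", thousands=",")
indian = parse_number("12,34,567.5", thousands=",")
print(western == indian)

# The OP's style of value: a comma as the decimal mark, no grouping.
print(parse_number("0,456", decimal=","))
```

Note that this sketch does no validation at all; a stricter parser might reject separators in odd positions.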
So while I am not a pandas person, I would expect that decimal= and
thousands= are useful parameters for specific lexical situations (like
the OP's CSV data) and work regardless of any locale knowledge.
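For the OP's case, something like the following sketch should do. The sample text is a simplified version of the OP's table (title row dropped, no spaces after the delimiters, no trailing comma), so treat it as an assumption about the real file rather than a drop-in fix:

```python
import io

import pandas as pd

# Simplified sample modelled on the OP's table; the real file has an extra
# title row (hence skiprows=1 in the original call) and spaces after commas.
csv_text = '''Spectrum,Py,Ace,Anth
1,"0,456","0,120","0,168"
2,"0,456","0,040","0,280"
3,"0,152","0,200","0,280"
'''

# decimal="," tells read_csv to treat the comma inside the quoted fields
# as the decimal mark, so the columns come back as floats, not strings.
data = pd.read_csv(io.StringIO(csv_text), sep=",", decimal=",")
data_array = data[["Py", "Ace", "Anth"]].to_numpy()
print(data_array.dtype)
print(data_array)
```

If the real file does have spaces between the commas and the opening quotes, skipinitialspace=True may also be needed so that the quoting is recognised.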
Cheers,
Cameron Simpson <cs at cskk.id.au>