Preprocessing a not-quite-fixed-width file before parsing
Loris Bennett
loris.bennett at fu-berlin.de
Fri Nov 25 02:09:52 EST 2022
Thomas Passin <list1 at tompassin.net> writes:
> On 11/24/2022 9:06 AM, Loris Bennett wrote:
>> Thomas Passin <list1 at tompassin.net> writes:
>>
>>> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>>>> Hi,
>>>> I am using pandas to parse a file with the following structure:
>>>> Name fileset type KB quota limit in_doubt grace | files quota limit in_doubt grace
>>>> shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none
>>>> gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none
>>>> nametoolong sharedhome USR 27418496 524288000 545259520 0 none | 11581 0 0 0 none
>>>> I was initially able to use
>>>> df = pandas.read_csv(file_name, delimiter=r"\s+")
>>>> because all the values for 'grace' were 'none'. Now, however,
>>>> non-"none" values have appeared and this fails.
>>>> I can't use
>>>> pandas.read_fwf
>>>> even with an explicit colspec, because the names in the first column
>>>> which are too long for the column will displace the rest of the data to
>>>> the right.
>>>> The report which produces the file could in fact also generate a
>>>> properly delimited CSV file, but I have a lot of historical data in the
>>>> readable but poorly parsable format above that I need to deal with.
>>>> If I were doing something similar in the shell, I would just pipe the
>>>> file through sed or something to replace '5 days' with, say, '5_days'.
>>>> How could I achieve a similar sort of preprocessing in Python, ideally
>>>> without having to generate a lot of temporary files?
>>>
>>> This is really annoying, isn't it? A space-separated line with spaces
>>> in data entries. If the example you give is typical, I don't think
>>> there is a general solution. If you know there are only certain
>>> values like that, then you could do a search-and-replace for them in
>>> Python just like the example you gave for "5 days".
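[A minimal sketch of that kind of replacement, assuming values of the form '<n> days' are the only multi-word entries; the function name and the exact pattern are illustrative, not part of the original report format:]

```python
import re

def collapse_grace(line):
    """Join '<n> days' into '<n>_days' so every field is space-free."""
    return re.sub(r"(\d+) (days)", r"\1_\2", line)
```

After this, line.split() yields the same number of fields on every row, so a whitespace-delimited parse works again.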
>>>
>>> If you know that the field that might contain entries with spaces is
>>> the same one, e.g., the one just before the "|" marker, you could make
>>> use of that. But it could be tricky.
>>>
>>> I don't know how many files like this you will need to process, nor
>>> how many rows they might contain. If I were to tackle this job, I
>>> would probably do some quality checking first. Using this example
>>> file, figure out how many fields there are supposed to be. First,
>>> split the file into lines:
>>>
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> # Check space-separated fields defined in first row:
>>> fields = lines[0].split()
>>> num_fields = len(fields)
>>> print(num_fields)  # e.g., 100
>>>
>>> # Find lines that have the wrong number of fields
>>> bad_lines = []
>>> for line in lines:
>>>     fields = line.split()
>>>     if len(fields) != num_fields:
>>>         bad_lines.append(line)
>>>
>>> print(len(bad_lines))
>>>
>>> # Inspect a sample
>>> for line in bad_lines[:10]:
>>>     print(line)
>>>
>>> This will give you an idea of how many problem lines there are, and
>>> if they can all be fixed by a simple replacement. If they can and
>>> this is the only file you need to handle, just fix it up and run it.
>>> I would replace the spaces with tabs or commas. Splitting a line on
>>> spaces (split()) takes care of the issue of having a variable number
>>> of spaces, so that's easy enough.
>>>
>>> If you will need to handle many files, and you can automate the fixes
>>> - possibly with a regular expression - then you should preprocess each
>>> file before giving it to pandas. Something like this:
>>>
>>> def fix_line(line):
>>>     """Test line for field errors and fix errors if any."""
>>>     # ....
>>>     return fixed
>>>
>>> # For each file
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> fixed_lines = []
>>> for line in lines:
>>>     fixed = fix_line(line)
>>>     fields = fixed.split()
>>>     tabified = '\t'.join(fields) + '\n'  # Could be done by fix_line()
>>>     fixed_lines.append(tabified)
>>>
>>> # Now use an io.StringIO to feed the data to pandas
>>> f = io.StringIO()
>>> f.writelines(fixed_lines)
>>> f.seek(0)  # rewind before reading
>>>
>>> # Give f to pandas as if it were an external file
>>> # ...
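[Filled in, the sketch above runs as follows; fix_line here is a stand-in that only handles the '5 days' case from the sample data, and the two input lines are shortened for illustration:]

```python
import io

def fix_line(line):
    """Stand-in fixer (assumption): join '<n> days' into '<n>_days'."""
    return line.replace(" days", "_days")

lines = [
    "shortname sharedhome USR 14097664 0 none | 107110 0 none\n",
    "gracedays sharedhome USR 774858944 0 5 days | 1115717 0 none\n",
]

fixed_lines = []
for line in lines:
    fixed = fix_line(line)
    fields = fixed.split()
    tabified = "\t".join(fields) + "\n"  # writelines() adds no newlines itself
    fixed_lines.append(tabified)

buf = io.StringIO()
buf.writelines(fixed_lines)
buf.seek(0)  # rewind so the next reader starts at the top

# buf can now be handed to pandas as if it were an open file:
# df = pandas.read_csv(buf, sep="\t")
```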
>>>
>> Thanks to both Gerard and Thomas for the pointer to io.StringIO. I
>> ended up just reading the file line-by-line, using a regex to replace
>>     '<n> <units> |'
>> with
>>     '<n><units> |'
>> and writing the new lines to a StringIO, which I then passed to
>> pandas.read_csv.
>> The wrapper approach looks interesting, but it looks like I need to
>> read up more on contexts before adding that to my own code, otherwise
>> I may not understand it in a month's time.
>
> Glad that StringIO works for you here. I seem to remember that after
> writing to the StringIO, you have to seek to 0 before reading from
> it. Better check that point!
Stefan (whom I forgot to thank: my apologies, Stefan!) mentioned seek(0),
so fortunately I was primed when I read the Python documentation for
StringIO.
Cheers,
Loris
--
This signature is currently under construction.