Preprocessing not quite fixed-width file before parsing

Wed Nov 23 16:36:53 EST 2022

On 11/23/2022 11:00 AM, Loris Bennett wrote:
> Hi,
> 
> I am using pandas to parse a file with the following structure:
> 
> Name       fileset    type             KB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace
> shortname  sharedhome USR        14097664  524288000  545259520          0     none |   107110       0        0        0     none
> gracedays  sharedhome USR       774858944  524288000  775946240          0   5 days |  1115717       0        0        0     none
> nametoolong sharedhome USR        27418496  524288000  545259520          0     none |    11581       0        0        0     none
> 
> I was initially able to use
> 
>    df = pandas.read_csv(file_name, delimiter=r"\s+")
> 
> because all the values for 'grace' were 'none'.  Now, however,
> non-"none" values have appeared and this fails.
> 
> I can't use
> 
>    pandas.read_fwf
> 
> even with an explicit colspec, because the names in the first column
> which are too long for the column will displace the rest of the data to
> the right.
> 
> The report which produces the file could in fact also generate a
> properly delimited CSV file, but I have a lot of historical data in the
> readable but poorly parsable format above that I need to deal with.
> 
> If I were doing something similar in the shell, I would just pipe the
> file through sed or something to replace '5 days' with, say '5_days'.
> How could I achieve a similar sort of preprocessing in Python, ideally
> without having to generate a lot of temporary files?

This is really annoying, isn't it?  A space-separated line with spaces 
in data entries.   If the example you give is typical, I don't think 
there is a general solution.  If you know there are only certain values 
like that, then you could do a search-and-replace for them in Python 
just like the example you gave for "5 days".

If you know that the field that might contain entries with spaces is the 
same one, e.g., the one just before the "|" marker, you could make use 
of that. But it could be tricky.

I don't know how many files like this you will need to process, nor how 
many rows they might contain. If I were to do tackle this job, I would 
probably do some quality checking first.  Using this example file, 
figure out how many fields there are supposed to be.  First, split the 
file into lines:

with open("filename") as f:
     lines = f.readlines()

# Check space-separated fields defined in first row:
fields = lines[0].split()
num_fields = len(fields)
print(num_fields)   # e.g., 100)

# Find lines that have the wrong number of fields
bad_lines = []
for line in lines:
    fields = line.split()
    if len(fields) != num_fields:
      bad_lines.append(line)

print(len(bad_lines))

# Inspect a sample
for line in bad_lines[:10]:
     print(line)

This will give you an idea of how many problems lines there are, and if 
they can all be fixed by a simple replacement.  If they can and this is 
the only file you need to handle, just fix it up and run it.  I would 
replace the spaces with tabs or commas.  Splitting a line on spaces 
(split()) takes care of the issue of having a variable number of spaces, 
so that's easy enough.

If you will need to handle many files, and you can automate the fixes - 
possibly with a regular expression - then you should preprocess each 
file before giving it to pandas.  Something like this:

def fix_line(line):
    """Test line for field errors and fix errors if any."""
    # ....
    return fixed

# For each file
with open("filename") as f:
     lines = f.readlines()

fixed_lines = []
for line in lines:
     fixed = fix_line(line)
     fields = fixed.split()
     tabified = '\t'.join(fields) # Could be done by fix_line()
     fixed_lines.append(tabified)

# Now use an IOString to feed the file to pandas
# From memory, some details may not be right
f = IOString()
f.writelines(fixed_lines)

# Give f to pandas as if it were an external file
# ...