Preprocessing not quite fixed-width file before parsing

Wed Nov 23 15:38:49 EST 2022

This seems to work. I’m assuming the | is present in every line that needs to be fixed.

import pandas
import logging

class Wrapper:
    """Wrap file to fix up data"""

    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        self.fh = open(self.filename, 'r')
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.fh.close()

    def __iter__(self):
        """pandas appears to check for __iter__ when deciding whether this is file-like, but never seems to call it"""
        raise ValueError("Unsupported operation")

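    def read(self, n: int):
        """Read one line, ignoring n. Replace spaces in a multi-word 'grace' value before | with underscores"""
        try:
            data = self.fh.readline()
            # Split into the part before the first | and the remainder
            ht = data.split('|', maxsplit=1)
            if len(ht) == 2:
                head, tail = ht
                # Fields 1-7 are whitespace-separated; whatever remains is the 'grace' value
                hparts = head.split(maxsplit=7)
                assert len(hparts) == 8
                if ' ' in hparts[7].strip():
                    hparts[7] = hparts[7].strip().replace(' ', '_')
                    fixed_data = f"{' '.join(hparts)} | {tail}"
                    return fixed_data

            return data
        except Exception:
            # Log and re-raise so pandas does not receive None in place of data
            logging.exception("read")
            raise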

logging.basicConfig()
with Wrapper('data.txt') as f:
    df = pandas.read_csv(f, delimiter=r"\s+")
print(df)
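
If the files are small enough to hold in memory, a sed-like substitution into an io.StringIO buffer might be simpler than the wrapper. This is only a sketch: the function name preprocess is made up, and the pattern assumes multi-word grace values always look like a number followed by days, hours or minutes, which I have not checked against real data.

```python
import io
import re

# Assumption: multi-word values look like "5 days", "23 hours" or "10 minutes".
# The unit list is a guess and may need extending for real reports.
GRACE = re.compile(r"(\d+) (days?|hours?|minutes?)")


def preprocess(lines):
    """Return an in-memory text buffer with e.g. '5 days' rewritten as '5_days'."""
    fixed = (GRACE.sub(r"\1_\2", line) for line in lines)
    return io.StringIO("".join(fixed))


# Possible use with pandas; no temporary file is written:
# with open('data.txt') as fh:
#     df = pandas.read_csv(preprocess(fh), delimiter=r"\s+")
```

Because the regex requires exactly one space between the number and the unit, the padded numeric columns should be left alone, but a stricter pattern anchored on the | may be safer.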

From: Python-list <python-list-bounces+gweatherby=uchc.edu at python.org> on behalf of Loris Bennett <loris.bennett at fu-berlin.de>
Date: Wednesday, November 23, 2022 at 2:00 PM
To: python-list at python.org <python-list at python.org>
Subject: Preprocessing not quite fixed-width file before parsing

Hi,

I am using pandas to parse a file with the following structure:

Name       fileset    type             KB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace
shortname  sharedhome USR        14097664  524288000  545259520          0     none |   107110       0        0        0     none
gracedays  sharedhome USR       774858944  524288000  775946240          0   5 days |  1115717       0        0        0     none
nametoolong sharedhome USR        27418496  524288000  545259520          0     none |    11581       0        0        0     none

I was initially able to use

  df = pandas.read_csv(file_name, delimiter=r"\s+")

because all the values for 'grace' were 'none'.  Now, however,
non-"none" values have appeared and this fails.

I can't use

  pandas.read_fwf

even with an explicit colspec, because the names in the first column
which are too long for the column will displace the rest of the data to
the right.

The report which produces the file could in fact also generate a
properly delimited CSV file, but I have a lot of historical data in the
readable but poorly parsable format above that I need to deal with.

If I were doing something similar in the shell, I would just pipe the
file through sed or something to replace '5 days' with, say '5_days'.
How could I achieve a similar sort of preprocessing in Python, ideally
without having to generate a lot of temporary files?

Cheers,

Loris

--
This signature is currently under construction.