[Tutor] Finding unique strings.
Cameron Simpson
cs at cskk.id.au
Fri May 3 18:43:59 EDT 2019
On 03May2019 22:07, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>I have a list of strings which has been downloaded from my bank. I am trying
>to build a program to find the unique string patterns which I want to use
>with a dictionary. So I can group the different transactions together. Below
>are example unique strings which I have manually extracted from the data.
>Everything after the example text is different. I cannot show the full data
>due to privacy.
>
>WITHDRAWAL AT HANDYBANK
>PAYMENT BY AUTHORITY
>WITHDRAWAL BY EFTPOS
>WITHDRAWAL MOBILE
>DEPOSIT ACCESSPAY
>
>Note: Some of the entries, have an store name contained in the string
>towards the end. For example:
>
>WITHDRAWAL BY EFTPOS 0304479 KMART 1075 CASTLE HILL 24/09
>
>Thus I want to extract the KMART as part of the unique key. As the shown
>example transaction always has a number. I was going to use a test condition
>for the above to test for the number. Then the next word would be added to
>the string for the key.
[...]
I'm assuming you're handed the text as one string, for example this:
WITHDRAWAL BY EFTPOS 0304479 KMART 1075 CASTLE HILL 24/09
I'm assuming that is a single column from a CSV of transactions.
I've got 2 observations:
1: For your unique key, if it is a string (it needn't be), you just
need to put the relevant parts into your key. For the above, perhaps
that might be:
    WITHDRAWAL 0304479 KMART
or:
    WITHDRAWAL KMART 1075
etc, depending on what the relevant parts are.
2: To pull out the relevant words from the description I would be
inclined to do a more structured parse. Consider something like the
following (untested):

    # example data
    desc = 'WITHDRAWAL BY EFTPOS 0304479 KMART 1075 CASTLE HILL 24/09'

    # various things which may be recognised
    method = None
    terminal = None
    vendor = None
    vendor_site = None

    # inspect the description
    # note: pop from the "words" list; "desc" is a str and has no pop()
    words = desc.split()
    flavour = words.pop(0)   # "WITHDRAWAL" etc
    word0 = words.pop(0)
    if word0 in ('BY', 'AT'):
        method = word0 + ' ' + words.pop(0)     # "BY EFTPOS"
    elif word0 in ('MOBILE', 'ACCESSPAY'):
        method = word0
    word0 = words.pop(0)
    if word0.isdigit():
        # probably really part of the "BY EFTPOS" description
        terminal = word0
        word0 = words.pop(0)
    vendor = word0
    word0 = words.pop(0)
    if word0.isdigit():
        vendor_site = word0
        word0 = words.pop(0)
    # ... more dissection ...

    # assemble the key - include only the items that matter
    # eg maybe leave out terminal and vendor_site, etc
    key = (flavour, method, terminal, vendor, vendor_site)

This is all rather open ended, and totally dependent on your bank's
reporting habits. Also, it needs some length checks: words.pop(0) will
raise an exception when "words" is empty, as it will be for the shorter
descriptions at some point.
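
As a rough sketch of those length checks (also untested against real bank
data), one option is to wrap the pop in a small helper which returns None
once the words run out. The names parse_description and next_word are made
up for this illustration, and treating every field after the first as
optional is an assumption about the data, not something the bank promises:

```python
def parse_description(desc):
    """Return (flavour, method, terminal, vendor, vendor_site).

    Any field which cannot be found is left as None.
    """
    words = desc.split()

    def next_word():
        # Hypothetical helper: the next word, or None when none are left.
        return words.pop(0) if words else None

    method = terminal = vendor = vendor_site = None

    flavour = next_word()               # "WITHDRAWAL" etc
    word0 = next_word()
    if word0 in ('BY', 'AT'):
        follow = next_word()
        method = word0 if follow is None else word0 + ' ' + follow
    elif word0 in ('MOBILE', 'ACCESSPAY'):
        method = word0
    word0 = next_word()
    if word0 is not None and word0.isdigit():
        # probably really part of the "BY EFTPOS" description
        terminal = word0
        word0 = next_word()
    vendor = word0
    word0 = next_word()
    if word0 is not None and word0.isdigit():
        vendor_site = word0
    return (flavour, method, terminal, vendor, vendor_site)


full = parse_description(
    'WITHDRAWAL BY EFTPOS 0304479 KMART 1075 CASTLE HILL 24/09')
short = parse_description('WITHDRAWAL MOBILE')
```

With this shape a short description such as "WITHDRAWAL MOBILE" just yields
None for the missing fields instead of raising IndexError.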
The important point is to get a structured key containing just the
relevant field values: being assembled as a tuple from strings
(immutable hashable Python values) it is usable as a dictionary key.
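
To illustrate why the tuple matters, here is a tiny sketch: a tuple of
strings works as a dictionary key, while a list holding the same strings
is rejected because lists are mutable and so not hashable. The key values
and the "counts" dictionary are just made-up examples:

```python
key = ('WITHDRAWAL', 'BY EFTPOS', 'KMART')

counts = {}
counts[key] = counts.get(key, 0) + 1    # fine: a tuple of strings is hashable

list_rejected = False
try:
    counts[list(key)] = 1               # a list is mutable, so not hashable
except TypeError:
    list_rejected = True
```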
For more ease of use you can make the key a namedtuple:

    from collections import defaultdict, namedtuple
    ........
    KeyType = namedtuple('KeyType', 'flavour method vendor')
    transactions = defaultdict(list)
    ........ loop over the CSV data ...
        key = KeyType(flavour, method, vendor)
        transactions[key].append(transaction info here...)

which gets you a dictionary "transactions" containing lists of
transaction records (in whatever form you make them, which might be simply
the row from the CSV data as a first cut).
The nice thing about a namedtuple is that the values are available as
attributes: you can use "key.flavour" etc to inspect the tuple.
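
Putting the pieces together, a self-contained sketch of the grouping might
look like the following. Everything about the data here is an assumption:
the CSV column names (date, description, amount), the sample rows (invented
from the one example description, not real bank data) and the cut-down
make_key function, which is a hypothetical stand-in for whatever parse
suits your bank's descriptions:

```python
import csv
import io
from collections import defaultdict, namedtuple

KeyType = namedtuple('KeyType', 'flavour method vendor')

# Invented sample data; a real program would open the bank's CSV file.
SAMPLE_CSV = """\
date,description,amount
24/09,WITHDRAWAL BY EFTPOS 0304479 KMART 1075 CASTLE HILL 24/09,-35.00
25/09,WITHDRAWAL BY EFTPOS 0304480 KMART 1075 CASTLE HILL 25/09,-12.50
26/09,PAYMENT BY AUTHORITY,-80.00
"""


def make_key(desc):
    # Hypothetical cut-down parse: FLAVOUR [BY|AT METHOD] [digits] VENDOR ...
    words = desc.split()
    flavour = words.pop(0) if words else None
    method = None
    if words and words[0] in ('BY', 'AT'):
        method = ' '.join(words[:2])
        del words[:2]
    # skip a leading terminal number, then take the next word as the vendor
    if words and words[0].isdigit():
        words.pop(0)
    vendor = words[0] if words else None
    return KeyType(flavour, method, vendor)


transactions = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    key = make_key(row['description'])
    transactions[key].append(row)       # first cut: keep the whole CSV row

for key, rows in transactions.items():
    # the namedtuple fields are available as attributes
    print(key.flavour, key.method, key.vendor, len(rows))
```

Because the terminal number is left out of the key, the two KMART rows
land in the same list even though their terminal numbers differ.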
Cheers,
Cameron Simpson <cs at cskk.id.au>