[Tutor] Finding unique strings.

Fri May 3 18:43:59 EDT 2019

On 03May2019 22:07, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>I have a list of strings which has been downloaded from my bank. I am trying
>to build a program to find the unique string patterns which I want to use
>with a dictionary. So I can group the different transactions together. Below
>are example unique strings which I have manually extracted from the data.
>Everything after the example text is different. I cannot show the full data
>due to privacy.
>
>WITHDRAWAL AT HANDYBANK
>PAYMENT BY AUTHORITY
>WITHDRAWAL BY EFTPOS
>WITHDRAWAL MOBILE
>DEPOSIT          ACCESSPAY
>
>Note: Some of the entries, have an store name contained in the string
>towards the end. For example:
>
>WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09
>
>Thus I want to extract the KMART as part of the unique key. As the shown
>example transaction always has a number. I was going to use a test condition
>for the above to test for the number. Then the next word would be added to
>the string for the key.
[...]

I'm assuming you're handed the text as one string, for example this:

  WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09

which I'm assuming is a single column from a CSV of transactions.

I've got 2 observations:

1: For your unique key, if it is a string (it needn't be), you just 
need to put the relevant parts into your key. For the above, perhaps 
that might be:

  WITHDRAWAL 0304479 KMART

or:

  WITHDRAWAL KMART 1075

etc depending on what the relevant parts are.

2: To pull out the relevant words from the description I would be 
inclined to do a more structured parse. Consider something like the 
following (untested):

  # example data
  desc = 'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09'
  # various things which may be recognised
  method = None
  terminal = None
  vendor = None
  vendor_site = None
  # inspect the description
  words = desc.split()
  flavour = words.pop(0)    # "WITHDRAWAL" etc
  word0 = words.pop(0)
  if word0 in ('BY', 'AT'):
    method = word0 + ' ' + words.pop(0)   # "BY EFTPOS"
  elif word0 in ('MOBILE', 'ACCESSPAY'):
    method = word0
  word0 = words.pop(0)
  if word0.isdigit():
    # probably really part of the "BY EFTPOS" description
    terminal = word0
    word0 = words.pop(0)
  vendor = word0
  word0 = words.pop(0)
  if word0.isdigit():
    vendor_site = word0
    word0 = words.pop(0)
  # ... more dissection ...
  # assemble the key - include only the items that matter
  # eg maybe leave out terminal and vendor_site, etc
  key = (flavour, method, terminal, vendor, vendor_site)

This is all rather open ended, and totally dependent on your bank's 
reporting habits. Also, it needs some length checks: words.pop(0) will 
raise an exception when "words" is empty, as it will be for the shorter 
descriptions at some point.
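
As a rough sketch of those length checks (the function name 
parse_description and the helper next_word are my own inventions, not 
anything standard), one option is to wrap the pop in a small helper that 
returns None once the words run out, instead of raising IndexError. 
Note one deliberate difference from the code above: if the second word 
is not a recognised method, this version keeps it as a candidate for 
the terminal or vendor rather than discarding it.

```python
def parse_description(desc):
    """Split a description into (flavour, method, terminal, vendor, vendor_site).

    Fields which are not present in the description are returned as None.
    """
    words = desc.split()

    def next_word():
        # Return the next word, or None once the description is exhausted.
        return words.pop(0) if words else None

    method = terminal = vendor = vendor_site = None
    flavour = next_word()                 # "WITHDRAWAL" etc
    word0 = next_word()
    if word0 in ('BY', 'AT'):
        follow = next_word()
        method = word0 if follow is None else word0 + ' ' + follow
        word0 = next_word()
    elif word0 in ('MOBILE', 'ACCESSPAY'):
        method = word0
        word0 = next_word()
    # otherwise word0 is kept and examined below
    if word0 is not None and word0.isdigit():
        terminal = word0
        word0 = next_word()
    vendor = word0
    word0 = next_word()
    if word0 is not None and word0.isdigit():
        vendor_site = word0
    return flavour, method, terminal, vendor, vendor_site
```

With this, a short description such as 'WITHDRAWAL MOBILE' comes back 
with None in the trailing fields rather than blowing up part way 
through.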

The important point is to get a structured key containing just the 
relevant field values: being assembled as a tuple from strings 
(immutable hashable Python values) it is usable as a dictionary key.
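
To make the hashability point concrete, here is a small sketch. The 
keys are hypothetical values of the sort a parse might produce; None is 
also hashable, so missing fields are fine inside the tuple. A list, 
being mutable, is rejected as a key.

```python
# Hypothetical keys, as a parse of the descriptions might produce them.
keys = [
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART'),
    ('WITHDRAWAL', 'MOBILE', None),
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART'),
]

# Count how often each key occurs: equal tuples land on the same entry.
counts = {}
for key in keys:
    counts[key] = counts.get(key, 0) + 1

# A list is mutable and therefore unhashable, so it cannot be a key.
list_key_rejected = False
try:
    counts[['WITHDRAWAL', 'MOBILE']] = 1
except TypeError:
    list_key_rejected = True
```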

For more ease of use you can make the key a namedtuple:

  from collections import defaultdict, namedtuple
  ........
  KeyType = namedtuple('KeyType', 'flavour method vendor')
  transactions = defaultdict(list)
  ........ loop over the CSV data ...
    key = KeyType(flavour, method, vendor)
    transactions[key].append(transaction info here...)

which gets you a dictionary "transactions" containing lists of 
transaction records (in whatever form you make them, which might be 
simply the row from the CSV data as a first cut).

The nice thing about a namedtuple is that the values are available as 
attributes: you can use "key.flavour" etc to inspect the tuple.
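
Here is a runnable sketch of that grouping. The "rows" list is made up 
for illustration: it stands in for whatever your CSV loop and parse 
produce, and the second KMART line is simply a copy of your example with 
a different date. I've stored the raw description as the transaction 
record.

```python
from collections import defaultdict, namedtuple

KeyType = namedtuple('KeyType', 'flavour method vendor')

# Hypothetical pre-parsed rows: (flavour, method, vendor, description).
rows = [
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART',
     'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09'),
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART',
     'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 25/09'),
    ('WITHDRAWAL', 'MOBILE', None,
     'WITHDRAWAL MOBILE'),
]

# Group the records by key; defaultdict makes the empty list on first use.
transactions = defaultdict(list)
for flavour, method, vendor, description in rows:
    key = KeyType(flavour, method, vendor)
    transactions[key].append(description)

# The key fields are available by name when reporting.
for key, records in transactions.items():
    print(key.flavour, key.method, key.vendor, len(records))
```

Because a namedtuple is still a tuple, a plain tuple with the same 
values looks up the same entry, so 
transactions[('WITHDRAWAL', 'MOBILE', None)] also works.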

Cheers,
Cameron Simpson <cs at cskk.id.au>