Good use for itertools.dropwhile and itertools.takewhile

Vlastimil Brom vlastimil.brom at gmail.com
Wed Dec 5 22:36:36 CET 2012


2012/12/5 Nick Mellor <thebalancepro at gmail.com>:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.
>
> N

Hi,
Just for completeness: it can (likely) be done with a regex, given
the current specification, but if the data turn out to be more complex
and varied, tools like pyparsing or dedicated parsing functions might
be more appropriate.

hth,
   vbr:


>>> import re
>>> test_product_data = """BEANS hand picked
... BEETROOT certified organic
... BOK CHOY (bunch)
... BROCCOLI Mornington Peninsula
... BRUSSEL  SPROUTS
... CABBAGE green
... CABBAGE Red
... CAPSICUM RED
... CARROTS
... CARROTS loose
... CARROTS juicing, certified organic
... CARROTS Trentham, large seconds, certified organic
... CARROTS Trentham, firsts, certified organic
... CAULIFLOWER
... CELERY Mornington Peninsula IPM grower
... CELERY Mornington Peninsula IPM grower
... CUCUMBER
... EGGPLANT
... FENNEL
... GARLIC (from Argentina)
... GINGER fresh uncured
... KALE (bunch)
... KOHL RABI certified organic
... LEEKS
...  LETTUCE iceberg
... MUSHROOM cup or flat
... MUSHROOM Swiss brown
... ONION brown
... ONION red
... ONION spring (bunch)
... PARSNIP, certified organic
... POTATOES certified organic
... POTATOES Sebago
... POTATOES Desiree
... POTATOES Bullarto chemical free
... POTATOES Dutch Cream
... POTATOES Nicola
... POTATOES Pontiac
... POTATOES Otway Red
... POTATOES teardrop
... PUMPKIN certified organic
... SCHALLOTS brown
... SNOW PEAS
... SPINACH I'll try to get certified organic (bunch)
... SWEET POTATO gold certified organic
... SWEET POTATO red small
... SWEDE certified organic
... TOMATOES  Qld
... TURMERIC fresh certified organic
... ZUCCHINI
... APPLES Harcourt  Pink Lady, Fuji, Granny Smith
... APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
... AVOCADOS
... AVOCADOS certified organic, seconds
... BANANAS Qld, organic
... GRAPEFRUIT
... GRAPES crimson seedless
... KIWI FRUIT Qld certified organic
... LEMONS
... LIMES
... MANDARINS
... ORANGES Navel
... PEARS Beurre Bosc Harcourt new season
... PEARS Packham, Harcourt new season
... SULTANAS 350g pre-packed bags
... EGGS Melita free range, Barker's Creek
... BASIL (bunch)
... CORIANDER (bunch)
... DILL (bunch)
... MINT (bunch)
... PARSLEY (bunch)
... Spring ONION from QLD"""
>>>
>>> len(test_product_data.splitlines())
72
>>>
>>> for prod_item in re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data): print(prod_item)
...
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL  SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP', ', certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt  Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
>>> len(re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data))
72
>>>
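
For readability, the one-line pattern used in the session above can also be
spelled out with re.VERBOSE. This is only a sketch under assumptions: the
names PRODUCT_RE and parse_products are made up for illustration and are not
part of the original post, and literal spaces are written as [ ] because
verbose mode ignores bare whitespace in the pattern. It is meant to match the
same way as the one-line pattern, not to replace it.

```python
import re

# Hypothetical name; a commented spelling of the one-line pattern.
PRODUCT_RE = re.compile(r"""
    (?=^.+$)             # only consider non-empty lines
    ^[ ]*                # skip leading spaces
    (?:
        (                # group 1: product name (optional)
            [A-Z ]+      #   capital letters and spaces
            \b           #   must stop at a word boundary
            (?<![ ])     #   must not end with a space
            (?=[\s,]|$)  #   followed by whitespace, a comma, or end of line
        )
    )?
    (?:
        [ ]*             # spaces between name and description
        (.*)             # group 2: description (may be empty)
    )?
    $
""", re.MULTILINE | re.VERBOSE)


def parse_products(text):
    """Return a list of (name, description) tuples, one per non-empty line."""
    return PRODUCT_RE.findall(text)


if __name__ == "__main__":
    sample = "BEANS hand picked\nCARROTS\nCABBAGE Red\nSpring ONION from QLD"
    for name, description in parse_products(sample):
        print((name, description))
```

A line that starts with a word that is not all caps, such as
"Spring ONION from QLD", ends up with an empty name and the whole line as
the description, which is what the spec quoted above asks for.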


