split string into multi-character "letters"
Tim Chase
python.list at tim.thechases.com
Wed Aug 25 16:17:54 EDT 2010
On 08/25/10 14:46, Jed wrote:
> Hi, I'm seeking help with a fairly simple string processing task.
> I've simplified what I'm actually doing into a hypothetical
> equivalent.
> Suppose I want to take a word in Spanish, and divide it into
> individual letters. The problem is that there are a few 2-character
> combinations that are considered single letters in Spanish - for
> example 'ch', 'll', 'rr'.
> Suppose I have:
>
> # this would include the whole alphabet, but I shortened it here
> alphabet = ['a','b','c','ch','d','u','r','rr','o']
> theword = 'churro'
>
> I would like to split the string 'churro' into a list containing:
>
> 'ch','u','rr','o'
>
> So at each letter I want to look ahead and see if it can be combined
> with the next letter to make a single 'letter' of the Spanish
> alphabet. I think this could be done with a regular expression
> passing the list called "alphabet" to re.match() for example, but I'm
> not sure how to use the contents of a whole list as a search string in
> a regular expression, or if it's even possible.
My first attempt at the problem:
>>> import re
>>> special = ['ch', 'rr', 'll']
>>> r = re.compile(r'(?:%s)|[a-z]' % ('|'.join(re.escape(c)
...     for c in special)), re.I)
>>> r.findall('churro')
['ch', 'u', 'rr', 'o']
>>> [r.findall(word) for word in 'churro lorenzo caballo'.split()]
[['ch', 'u', 'rr', 'o'], ['l', 'o', 'r', 'e', 'n', 'z', 'o'],
['c', 'a', 'b', 'a', 'll', 'o']]
This escapes each of your special multi-character letters and joins
them with "|" into a single alternation. Python's re module tries the
branches of an alternation from left to right and takes the first one
that matches, so the paired versions get tested (and found) before
the pattern falls back to the single letters matched by [a-z].
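
The same idea can be driven directly from the full "alphabet" list in
the question instead of a separate list of specials. What follows is
only a minimal sketch: build_splitter is a hypothetical helper name
(not part of the re module), and it assumes that sorting the letters
longest-first is enough to put 'ch' and 'rr' ahead of 'c' and 'r' in
the alternation.

```python
import re

def build_splitter(alphabet):
    """Compile a regex that matches any one letter of the alphabet.

    Hypothetical helper for illustration. Letters are sorted
    longest-first so that multi-character letters such as 'ch' are
    tried before their single-character prefixes such as 'c'.
    """
    ordered = sorted(alphabet, key=len, reverse=True)
    pattern = '|'.join(re.escape(letter) for letter in ordered)
    return re.compile(pattern, re.I)

# Shortened alphabet from the original question
alphabet = ['a', 'b', 'c', 'ch', 'd', 'u', 'r', 'rr', 'o']
splitter = build_splitter(alphabet)

result = splitter.findall('churro')
print(result)
```

Note that findall() silently skips any character that is not in the
alphabet, so with the shortened list above a word like 'caballo'
would lose its 'l' characters; the full alphabet would be needed in
practice.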
-tkc