Code improvement question
MRAB
python at mrabarnett.plus.com
Tue Nov 14 18:25:10 EST 2023
On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
> I'd like to improve the code below, which works. It feels clunky to me.
>
> I need to clean up user-uploaded files the size of which I don't know in
> advance.
>
> After cleaning they might be as big as 1 MB but that would be super rare.
> Perhaps only for testing.
>
> I'm extracting CAS numbers, which follow the pattern xx-xx-x up to
> xxxxxxx-xx-x, e.g. 1012300-77-4
>
> import re
>
> def remove_alpha(txt):
>     r""" r'[^0-9\- ]':
>     [^...]: Match any character that is not in the specified set.
>     0-9: Match any digit.
>     \: Escape character.
>     -: Match a hyphen.
>     Space: Match a space.
>     """
>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>     bits = cleaned_txt.split()
>     pieces = []
>     for bit in bits:
>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>         pieces.append(bit if len(bit) > 6 else "")
>     return " ".join(pieces)
>
>
> Many thanks for any hints
>
Why don't you use re.findall?
re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b', txt)
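
A minimal sketch of how that could look as a function, assuming the final group is a single digit so that it matches the xx-xx-x shape given in the question. The names CAS_PATTERN and find_cas_numbers and the sample text are made up for illustration. It only checks the shape of each token, not whether the number is a valid CAS number.

```python
import re

# Assumed shape, taken from the question: 2 to 7 digits, 2 digits, 1 digit.
# The \b anchors stop the pattern from matching inside a longer run of digits.
CAS_PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')


def find_cas_numbers(txt):
    """Return every CAS-shaped token in txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)


# Hypothetical sample text: two CAS-shaped tokens, a date and a short clump.
sample = "Water (7732-18-5), dated 2023-11-14, plus 1012300-77-4 and 12-3."
print(find_cas_numbers(sample))  # ['7732-18-5', '1012300-77-4']
```

Because findall returns the matches directly, there is no need to strip the other characters first or to filter by length afterwards. " ".join(find_cas_numbers(txt)) gives a space-separated string if that is what the caller expects.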