Code improvement question
MRAB
python at mrabarnett.plus.com
Fri Nov 17 13:56:54 EST 2023
On 2023-11-17 09:38, jak via Python-list wrote:
> Mike Dewhirst wrote:
>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>>
>>>> I need to clean up user-uploaded files the size of which I don't know in
>>>> advance.
>>>>
>>>> After cleaning they might be as big as 1 MB but that would be super rare.
>>>> Perhaps only for testing.
>>>>
>>>> I'm extracting CAS numbers, which follow the pattern xx-xx-x up to
>>>> xxxxxxx-xx-x, e.g. 1012300-77-4
>>>>
>>>> def remove_alpha(txt):
>>>>     """ r'[^0-9\- ]':
>>>>     [^...]: Match any character that is not in the specified set.
>>>>     0-9: Match any digit.
>>>>     \: Escape character.
>>>>     -: Match a hyphen.
>>>>     Space: Match a space.
>>>>     """
>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>     bits = cleaned_txt.split()
>>>>     pieces = []
>>>>     for bit in bits:
>>>>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>     return " ".join(pieces)
>>>>
>>>>
>>>> Many thanks for any hints
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b', txt)
>>
>> I think I can see what you did there but it won't make sense to me - or
>> whoever looks at the code - in future.
>>
>> That answers your specific question. However, I am in awe of people who
>> can just "do" regular expressions and I thank you very much for what
>> would have been a monumental effort had I tried it.
>>
>> That little re.sub() came from ChatGPT and I can understand it without
>> too much effort because it came documented.
>>
>> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>>
>> Thanks
>>
>> Mike
>
> I respect your opinion, but from the point of view of many Usenet users,
> asking ChatGPT to solve your problem is truly overkill.
> The computer world overflows with people who know regex. If you had not
> already received an answer using 're', I would have sent you my
> suggestion, which as you can see is practically identical. I am quite
> sure that the same solution came to the mind of many people in this
> newsgroup.
>
> with open(file) as fp:
>     try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
>     except: ret = []
>
> The only difference is '\d' instead of '[0-9]' but they are equivalent.
>
Bare excepts are a very bad idea.
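
A bare except also swallows things like KeyboardInterrupt and plain typos
(NameError), so real bugs get hidden. As a rough sketch only, a narrower
version might look something like the following. The names CAS_PATTERN,
extract_cas_numbers and read_cas_numbers are made up for illustration, and
the UTF-8 encoding is an assumption about the uploaded files. The pattern is
written with re.VERBOSE so that it carries its own documentation, and it uses
[0-9] rather than \d because, for str patterns, \d matches any Unicode
decimal digit unless re.ASCII is given.

```python
import re

# Commented pattern for CAS numbers: xx-xx-x up to xxxxxxx-xx-x.
CAS_PATTERN = re.compile(
    r"""
    \b            # start on a word boundary
    [0-9]{2,7}    # first group: two to seven digits
    -
    [0-9]{2}      # second group: exactly two digits
    -
    [0-9]         # final single digit
    \b            # end on a word boundary
    """,
    re.VERBOSE,
)


def extract_cas_numbers(text):
    """Return every CAS-shaped number found in text, in order."""
    return CAS_PATTERN.findall(text)


def read_cas_numbers(path):
    """Read a text file and return the CAS-shaped numbers in it.

    Only OSError (missing file, permissions, ...) is caught; anything
    else, including decoding errors, propagates so that it gets noticed.
    """
    try:
        with open(path, encoding="utf-8") as fp:  # encoding is an assumption
            return extract_cas_numbers(fp.read())
    except OSError:
        return []
```

Whether returning an empty list on a read failure is the right behaviour
depends on the application; logging the error or letting it propagate may be
preferable.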