Code improvement question

Fri Nov 17 13:56:54 EST 2023

On 2023-11-17 09:38, jak via Python-list wrote:
> Mike Dewhirst wrote:
>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>>
>>>> I need to clean up user-uploaded files the size of which I don't know in
>>>> advance.
>>>>
>>>> After cleaning they might be as big as 1 MB but that would be super rare.
>>>> Perhaps only for testing.
>>>>
>>>> I'm extracting CAS numbers, whose pattern runs from xx-xx-x up to
>>>> xxxxxxx-xx-x, e.g. 1012300-77-4
>>>>
>>>> import re
>>>>
>>>> def remove_alpha(txt):
>>>>
>>>>       r"""  r'[^0-9\- ]':
>>>>
>>>>       [^...]: Match any character that is not in the specified set.
>>>>
>>>>       0-9: Match any digit.
>>>>
>>>>       \: Escape character.
>>>>
>>>>       -: Match a hyphen.
>>>>
>>>>       Space: Match a space.
>>>>
>>>>       """
>>>>
>>>>       cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>
>>>>       bits = cleaned_txt.split()
>>>>
>>>>       pieces = []
>>>>
>>>>       for bit in bits:
>>>>
>>>>           # minimum size of a CAS number is 7 so drop smaller clumps of digits
>>>>
>>>>           pieces.append(bit if len(bit) > 6 else "")
>>>>
>>>>       return " ".join(pieces)
>>>>
>>>>
>>>> Many thanks for any hints
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b', txt)
>> 
>> I think I can see what you did there but it won't make sense to me - or 
>> whoever looks at the code - in future.
>> 
>> That answers your specific question. However, I am in awe of people who 
>> can just "do" regular expressions and I thank you very much for what 
>> would have been a monumental effort had I tried it.
>> 
>> That little re.sub() came from ChatGPT and I can understand it without 
>> too much effort because it came documented.
>> 
>> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>> 
>> Thanks
>> 
>> Mike
> 
> I respect your opinion, but from the point of view of many Usenet users,
> asking ChatGPT to solve your problem is truly overkill.
> The computing world is full of people who know regex. If you had not
> already received an answer using 're', I would have sent you my
> suggestion which, as you can see, is practically identical. I am quite
> sure that the same solution came to the mind of many people in this
> newsgroup.
> 
> with open(file) as fp:
>       try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
>       except: ret = []
> 
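> The only real difference is '\d' instead of '[0-9]'. For ASCII digits they
> are equivalent, although on str patterns '\d' also matches other Unicode
> decimal digits unless re.ASCII is set.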
> 
Bare excepts are a very bad idea: they also catch KeyboardInterrupt and
SystemExit, and they hide genuine bugs.
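
As a rough sketch of the alternative, you could catch only the exception you
actually expect from opening and reading the file. The function name
extract_cas_numbers, its path parameter, and the choice of UTF-8 with
errors="replace" are my assumptions for illustration, not anything from the
code posted in this thread:

```python
import re

# CAS number shape from the original post: 2-7 digits, 2 digits, 1 digit.
CAS_PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')


def extract_cas_numbers(path):
    """Return the CAS-number-shaped strings found in the file at `path`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        # Assumption: uploads are roughly UTF-8; undecodable bytes are
        # replaced rather than raising UnicodeDecodeError.
        with open(path, encoding="utf-8", errors="replace") as fp:
            return CAS_PATTERN.findall(fp.read())
    except OSError:
        # Only file-access problems (missing file, permissions, ...) are
        # handled here; anything else is a bug and should propagate.
        return []
```

Whether returning an empty list on a file error is the right policy depends on
the application; logging the error or letting it propagate may be better.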