how to avoid leading white spaces

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jun 3 22:05:12 EDT 2011


On Fri, 03 Jun 2011 12:29:52 -0700, rurpy at yahoo.com wrote:

>>> I often find myself changing, for example, a startswith() to a RE when
>>> I realize that the input can contain mixed case
>>
>> Why wouldn't you just normalise the case?
> 
> Because some of the text may be case-sensitive.

Perhaps you misunderstood me. You don't have to throw away the 
unnormalised text, merely use the normalized text in the expression you 
need.
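
To make that concrete, here is a minimal sketch. The function name and the
sample data are made up for illustration. The lowercased copy exists only
for the comparison, and the original, case-preserved line is what gets
returned.

```python
def lines_with_prefix(lines, prefix):
    """Return the original lines that start with prefix, ignoring case."""
    wanted = prefix.lower()
    # Normalise a temporary copy for the test only; keep the original line.
    return [line for line in lines if line.lower().startswith(wanted)]


sample = ["Error: Disk Full", "error: retry", "Warning: low battery"]
matches = lines_with_prefix(sample, "ERROR")
print(matches)
```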

Of course, if you include both case-sensitive and insensitive tests in 
the same calculation, that's a good candidate for a regex... or at least 
it would be if regexes supported that :)


>>[...]
>>> or that I have
>>> to treat commas as well as spaces as delimiters.
>>
>> source.replace(",", " ").split(" ")
> 
> Uhgg. create a whole new string just so you can split it on one rather
> than two characters?

You say that like it's expensive.

And how do you know what the regex engine is doing under the hood? For all you 
know, it could be making hundreds of temporary copies and throwing them 
away. Or something. It's a black box.

The fact that creating a whole new string to split on is faster than 
*running* the regex (never mind compiling it, loading the regex engine, 
and anything else that needs to be done) should tell you which does more 
work. Copying is cheap. Parsing is expensive.
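
Anyone who wants to check the timing on their own machine could use
something like the rough sketch below. The sample string and the helper
names are assumptions chosen for illustration, and the numbers will vary
with the Python version and the input, so no figures are quoted here.

```python
import re
import timeit

# Made-up input: a few thousand characters of comma/space separated words.
source = "alpha,beta gamma,delta epsilon " * 100


def split_with_replace(text):
    # Copy the string with commas turned into spaces, then split on space.
    return text.replace(",", " ").split(" ")


def split_with_regex(text):
    # Split on either delimiter in a single regex pass.
    return re.split("[ ,]", text)


# Both approaches should produce the same list of fields.
same_result = split_with_replace(source) == split_with_regex(source)

t_replace = timeit.timeit(lambda: split_with_replace(source), number=1000)
t_regex = timeit.timeit(lambda: split_with_regex(source), number=1000)
print("replace+split: %.4fs   re.split: %.4fs" % (t_replace, t_regex))
```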


> Sorry, but I find
> 
>     re.split ('[ ,]', source)
> 
> states much more clearly exactly what is being done with no obfuscation.

That's because you know regex syntax. And I'd hardly call the version 
with replace obfuscated.

Certainly the regex is shorter, and I suppose it's reasonable to expect 
any reader to know at least enough regex to read that, so I'll grant you 
that this is a small win for clarity. A micro-optimization for 
readability, at the expense of performance.


> Obviously this is a simple enough case that the difference is minor but
> when the pattern gets only a little more complex, the clarity difference
> becomes greater.

Perhaps. But complicated tasks require complicated regexes, which are 
anything but clear.



[...]
>>> After doing this a
>>> number of times, one starts to use an RE right from the get go unless
>>> one is VERY sure that there will be no requirements creep.
>>
>> YAGNI.
> 
> IAHNI. (I actually have needed it.)

I'm sure you have, and when you need it, it's entirely appropriate to use 
a regex solution. But you stated that you used regexes as insurance *just 
in case* the requirements changed. Why, is your text editor broken? You 
can't change a call to str.startswith(prefix) to re.match(prefix, str) if 
and when you need to? That's what I mean by YAGNI -- don't solve the 
problem you think you might have tomorrow.
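
As a small sketch of how little that change involves (the helper names are
hypothetical): the one thing to watch when switching is that re.match
treats the prefix as a pattern, so a literal prefix should go through
re.escape first.

```python
import re


def has_prefix_str(source, prefix):
    return source.startswith(prefix)


def has_prefix_re(source, prefix):
    # re.escape makes the prefix literal; without it, characters such as
    # "." or "(" in the prefix would be read as regex syntax.
    return re.match(re.escape(prefix), source) is not None


# The two versions agree on literal prefixes, including awkward ones.
cases = [("spam and eggs", "spam"), ("a.b", "a."), ("axb", "a.")]
agree = all(has_prefix_str(s, p) == has_prefix_re(s, p) for s, p in cases)

# Without escaping, "." matches any character, so this is a false positive.
unescaped_hit = re.match("a.", "axb") is not None
print(agree, unescaped_hit)
```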


>> There's no need to use a regex just because you think that you *might*,
>> someday, possibly need a regex. That's just silly. If and when
>> requirements change, then use a regex. Until then, write the simplest
>> code that will solve the problem you have to solve now, not the problem
>> you think you might have to solve later.
> 
> I would not recommend you use a regex instead of a string method solely
> because you might need a regex later.  But when you have to spend 10
> minutes writing a half-dozen lines of python versus 1 minute writing a
> regex, your evaluation of the possibility of requirements changing
> should factor into your decision.

Ah, but if your requirements are complicated enough that it takes you ten 
minutes and six lines of string method calls, that sounds to me like a 
situation that probably calls for a regex!

Of course it depends on what the code actually does... if it counts the 
number of nested ( ) pairs, and you're trying to do that with a regex, 
you're sacked! *wink*
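
The reason is that the patterns Python's re module accepts have no way to
keep count of arbitrarily deep nesting, while a plain loop with a counter
handles it trivially. A minimal sketch, with a made-up function name and
the assumption that "counting" means finding the deepest nesting level:

```python
def max_paren_depth(text):
    """Return the deepest level of ( ) nesting; raise if unbalanced."""
    depth = deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')'")
    if depth != 0:
        raise ValueError("unmatched '('")
    return deepest


print(max_paren_depth("f(a, g(b, (c)))"))  # prints 3
```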



[...]
>> There are a few problems with regexes:
>>
>> - they are another language to learn, a very cryptic and terse language;
> 
> Chinese is cryptic too but there are a few billion people who don't seem
> to be bothered by that.

Chinese isn't cryptic to the Chinese, because they've learned it from 
childhood. 

But has anyone done any studies comparing reading comprehension speed 
between native Chinese readers and native European readers? For all I 
know, Europeans learn to read twice as quickly as Chinese, and once 
learned, read text twice as fast. Or possibly the other way around. Who 
knows? Not me.

But I do know that English typists typing 26 letters of the alphabet 
leave Asian typists and their thousands of ideograms in the dust. There's 
no comparison -- it's like quicksort vs bubblesort *wink*.


[...]
>> - debugging regexes is a nightmare;
> 
> Very complex ones, perhaps.  "Nightmare" seems an overstatement.

You *can't* debug regexes in Python, since there are no tools for (e.g.) 
single-stepping through the regex, displaying intermediate calculations, 
or anything other than making changes to the regex and running it again, 
hoping that it will do the right thing this time.
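
For reference, the closest thing in the standard library is the re.DEBUG
flag, which prints the parsed form of the pattern when it is compiled. It
is a static dump, not a stepper: it shows how the pattern was understood,
not what happens while matching. A minimal sketch with a made-up pattern
and input:

```python
import re

# re.DEBUG prints the parsed pattern to stdout at compile time.
# It says nothing about how a particular match attempt proceeds.
pattern = re.compile(r"[ ,]+", re.DEBUG)

fields = pattern.split("a, b,c")
print(fields)
```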

I suppose you can use external tools, like Regex Buddy, if you're on a 
supported platform and if they support your language's regex engine.


[...]
>> Regarding their syntax, I'd like to point out that even Larry Wall is
>> dissatisfied with regex culture in the Perl community:
>>
>> http://www.perl.com/pub/2002/06/04/apo5.html
> 
> You did see the very first sentence in this, right?
> 
>   "Editor's Note: this Apocalypse is out of date and remains here for
>   historic reasons. See Synopsis 05 for the latest information."

Yes. And did you click through to see the Synopsis? It is a bare 
technical document with all the motivation removed. Since I was pointing 
to Larry Wall's motivation, it was appropriate to link to the Apocalypse 
document, not the Synopsis.


> (Note that "Apocalypse" is referring to a series of Perl design
> documents and has nothing to do with regexes in particular.)

But Apocalypse 5 specifically has everything to do with regexes. That's 
why I linked to that, and not (say) Apocalypse 2.


> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
> syntax.  I didn't see anything about de-emphasizing them in Perl.  (But
> I have no idea what is going on for Perl 6 so I could be wrong about
> that.)

I never said anything about de-emphasizing them. I said that Larry Wall 
was dissatisfied with Perl's culture of regexes -- his own words were:

"regular expression culture is a mess"

and he is also extremely critical of current (i.e. Perl 5) regex syntax. 
Since Python's regex syntax borrows heavily from Perl 5, that's extremely 
pertinent to the issue. When even the champion of regex culture says 
there is much broken about regex culture, we should all listen.



> As for the original reference, Wall points out a number of problems with
> regexes, mostly details of their syntax.  For example that more
> frequently used non-capturing groups require more characters than
> less-frequently used capturing groups. Most of these criticisms seem
> irrelevant to the question of whether hard-wired string manipulation
> code or regexes should be preferred in a Python program.

It is only relevant in so far as the readability and relative obfuscation 
of regex syntax is relevant. No further.

You keep throwing out the term "hard-wired string manipulation", but I 
don't understand what point you're making. I don't understand what you 
see as "hard-wired", or why you think

source.startswith(prefix)

is more hard-wired than

re.match(prefix, source)


[...]
> Perhaps you stopped reading after seeing his "regular expression culture
> is a mess" comment without trying to see what he meant by "culture" or
> "mess"?

Perhaps you are being over-sensitive and reading *far* too much into what 
I said. If regexes were more readable, as proposed by Wall, that would go 
a long way to reducing my suspicion of them.



-- 
Steven


