how to avoid leading white spaces

Mon Jun 6 02:03:39 EDT 2011

On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 12:29:52 -0700, rurpy at yahoo.com wrote:
>
>>>> I often find myself changing, for example, a startwith() to a RE when
>>>> I realize that the input can contain mixed case
>>>
>>> Why wouldn't you just normalise the case?
>>
>> Because some of the text may be case-sensitive.
>
> Perhaps you misunderstood me. You don't have to throw away the
> unnormalised text, merely use the normalized text in the expression you
> need.
>
> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)

I did not choose a good example to illustrate what I find often
motivates my use of regexes.

You are right that for a simple .startswith() using a regex "just
in case" is not a good choice, and in fact I would not do that.

The process that I find often occurs is that I write (or am about
to write) a string method solution and, when I think more about the
input data (which is seldom well-specified), I realize that using
a regex I can get better error checking, do more of the "parsing"
in one place, and adapt to changes in input format better than I
could with a .startswith and a couple other such methods.

Thus what starts as
  if line.startswith ('CUSTOMER '):
    try: kw, first_initial, last_name, code, rest = line.split(None, 4)
    ...
often turns into (sometimes before it is written) something like
  m = re.match (r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line)
  if m: first_initial, last_name, code = m.group(...)
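
To make the comparison concrete, here is a minimal sketch of the two
approaches side by side.  The input format, the function names and the
sample lines are all made up for illustration; only the pattern and the
split call come from the snippets above.

```python
import re

# Hypothetical format: CUSTOMER <first initial> <last name> <code> <rest>
CUSTOMER_RE = re.compile(r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})')

def parse_with_methods(line):
    """String-method version: checks the keyword but not the field shapes."""
    if not line.startswith('CUSTOMER '):
        return None
    try:
        kw, first_initial, last_name, code, rest = line.split(None, 4)
    except ValueError:  # the line did not have exactly five fields
        return None
    return first_initial, last_name, code

def parse_with_regex(line):
    """Regex version: also verifies that the code looks like A123."""
    m = CUSTOMER_RE.match(line)
    return m.groups() if m else None

good = 'CUSTOMER J Smith A123 some trailing text'
bad = 'CUSTOMER J Smith 12 some trailing text'   # malformed code field
print(parse_with_methods(good), parse_with_regex(good))
print(parse_with_methods(bad), parse_with_regex(bad))
```

The string-method version accepts the malformed code field without
complaint, while the regex version rejects the whole line.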

>>>[...]
>>>> or that I have
>>>> to treat commas as well as spaces as delimiters.
>>>
>>> source.replace(",", " ").split(" ")
>>
>> Uhgg. create a whole new string just so you can split it on one rather
>> than two characters?
>
> You say that like it's expensive.

No, I said it like it was ugly.  Doing things unrelated to the
task at hand is ugly.  And not very adaptable -- see my reply
to Chris Torek's post.  I understand it is a common idiom and
I use it myself, but in this case there is a cleaner alternative
with re.split that expresses exactly what one is doing.
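
A short sketch of the two idioms on a made-up sample string (the
variable names are only for illustration):

```python
import re

source = 'a,b c, d'   # hypothetical input mixing commas and spaces

# Replace-then-split: builds an intermediate string first.
by_replace = source.replace(',', ' ').split(' ')

# re.split: names both delimiters directly in the pattern.
by_regex = re.split('[ ,]', source)

print(by_replace)
print(by_regex)

# Treating a run of delimiters as one separator is a one-character change.
by_regex_runs = re.split('[ ,]+', source)
print(by_regex_runs)
```

Both of the first two calls give the same list, including the empty
string produced by the ", " pair; the last pattern shows how little has
to change when the requirement shifts.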

> And how do you know what the regex engine is doing under the hood? For all you
> know, it could be making hundreds of temporary copies and throwing them
> away. Or something. It's a black box.

That's a silly argument.
And how do you know what replace is doing under the hood?
I would expect any regex processor to compile the regex into
an FSM.  As usual, I would expect to pay a small performance
price for the generality, but that is a reasonable tradeoff in
many cases.  If it were a potential problem, I would test it.
What I wouldn't do is throw away a useful tool because, "golly,
I don't know, maybe it'll be slow" -- that's just a form of
cargo cult programming.
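
Testing it takes only a few lines with the standard timeit module.
This is a rough sketch, assuming an invented sample string and an
arbitrary repeat count; the numbers it prints will vary by machine and
Python version, so it shows no expected result.

```python
import re
import timeit

source = 'alpha,beta gamma,delta ' * 100   # made-up sample data

def with_replace():
    return source.replace(',', ' ').split(' ')

def with_regex():
    return re.split('[ ,]', source)

# Check that both approaches produce the same fields before timing them.
assert with_replace() == with_regex()

t_replace = timeit.timeit(with_replace, number=1000)
t_regex = timeit.timeit(with_regex, number=1000)
print(f'replace+split: {t_replace:.4f}s   re.split: {t_regex:.4f}s')
```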

> The fact that creating a whole new string to split on is faster than
> *running* the regex (never mind compiling it, loading the regex engine,
> and anything else that needs to be done) should tell you which does more
> work. Copying is cheap. Parsing is expensive.

In addition to being wrong (loading is done once, compilation is
typically done once or a few times, while the regex is used many
times inside a loop so the overhead cost is usually trivial compared
with the cost of starting Python or reading a file), this is another
micro-optimization argument.
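
The usual pattern looks something like the sketch below: compile once,
then reuse the compiled object inside the loop.  The function name and
the sample lines are hypothetical.  (The re module also keeps an
internal cache of recently compiled patterns, so even the module-level
functions do not recompile on every call.)

```python
import re

# Compiled once, outside the loop; the pattern is the one from this thread.
DELIMS = re.compile('[ ,]')

def split_lines(lines):
    """Split every line on spaces or commas, reusing one compiled pattern."""
    return [DELIMS.split(line) for line in lines]

fields = split_lines(['a,b c', 'd e,f'])
print(fields)
```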

I'm not sure why you've suddenly developed this obsession with
wringing every last nanosecond out of your code.  Usually it
is not necessary.  Have you thought of buying a faster computer?
Or using C?  *wink*

>> Sorry, but I find
>>
>>     re.split ('[ ,]', source)
>>
>> states much more clearly exactly what is being done with no obfuscation.
>
> That's because you know regex syntax. And I'd hardly call the version
> with replace obfuscated.
>
> Certainly the regex is shorter, and I suppose it's reasonable to expect
> any reader to know at least enough regex to read that, so I'll grant you
> that this is a small win for clarity. A micro-optimization for
> readability, at the expense of performance.
>
>
>> Obviously this is a simple enough case that the difference is minor but
>> when the pattern gets only a little more complex, the clarity difference
>> becomes greater.
>
> Perhaps. But complicated tasks require complicated regexes, which are
> anything but clear.

Complicated tasks require complicated code as well.

As another post pointed out, there are ways to improve the
clarity of a regex such as the re.VERBOSE flag.
There is no doubt that a regex encapsulates information much more
densely than python string manipulation code.  One should not
be surprised that it might take as much time and effort to understand
a one-line regex as a dozen (or whatever) lines of Python code that
do the same thing.  In most cases I'll bet, given equal fluency
in regexes and Python, the regex will take less.
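
As an illustration of the re.VERBOSE point, here is the CUSTOMER
pattern from earlier in this post spread over several lines.  The
comments and the sample line are my own assumptions about what the
fields mean.

```python
import re

# In verbose mode unescaped whitespace in the pattern is ignored, so the
# literal spaces between fields are written as "[ ]".
CUSTOMER_RE = re.compile(r'''
    CUSTOMER [ ]
    (\w+)    [ ]     # first initial
    (\w+)    [ ]     # last name
    ([A-Z]\d{3})     # code: one capital letter, then three digits
''', re.VERBOSE)

m = CUSTOMER_RE.match('CUSTOMER J Smith A123 rest of line')
print(m.groups() if m else None)
```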

> [...]
>>>> After doing this a
>>>> number of times, one starts to use an RE right from the get go unless
>>>> one is VERY sure that there will be no requirements creep.
>>>
>>> YAGNI.
>>
>> IAHNI. (I actually have needed it.)
>
> I'm sure you have, and when you need it, it's entirely appropriate to use
> a regex solution. But you stated that you used regexes as insurance *just
> in case* the requirements changed. Why, is your text editor broken? You
> can't change a call to str.startswith(prefix) to re.match(prefix, str) if
> and when you need to? That's what I mean by YAGNI -- don't solve the
> problem you think you might have tomorrow.

Retracted above.

>>> There's no need to use a regex just because you think that you *might*,
>>> someday, possibly need a regex. That's just silly. If and when
>>> requirements change, then use a regex. Until then, write the simplest
>>> code that will solve the problem you have to solve now, not the problem
>>> you think you might have to solve later.
>>
>> I would not recommend you use a regex instead of a string method solely
>> because you might need a regex later.  But when you have to spend 10
>> minutes writing a half-dozen lines of python versus 1 minute writing a
>> regex, your evaluation of the possibility of requirements changing
>> should factor into your decision.
>
> Ah, but if your requirements are complicated enough that it takes you ten
> minutes and six lines of string method calls, that sounds to me like a
> situation that probably calls for a regex!

Recall that the post that started this discussion presented
a problem that took me six lines of code (actually spread out
over a few more for readability) to do without regexes versus
one line with.

So you do agree that a regex was a better solution in
that case?  I ask because we both seem to agree that
regexes are useful tools and preferable when the corresponding
Python code is "too" complex.  We also agree that when the
need can be handled by very simple python code, python may be
preferable.  So I'm trying to calibrate your switch-over point
a little better.

> Of course it depends on what the code actually does... if it counts the
> number of nested ( ) pairs, and you're trying to do that with a regex,
> you're sacked! *wink*

Right.  And again repeating what I said before, regexes
aren't a universal solution to every problem.  *wink*

> [...]
>>> There are a few problems with regexes:
>>>
>>> - they are another language to learn, a very cryptic and terse language;
>>
>> Chinese is cryptic too but there are a few billion people who don't seem
>> to be bothered by that.
>
> Chinese isn't cryptic to the Chinese, because they've learned it from
> childhood.
>
> But has anyone done any studies comparing reading comprehension speed
> between native Chinese readers and native European readers? For all I
> know, Europeans learn to read twice as quickly as Chinese, and once
> learned, read text twice as fast. Or possibly the other way around. Who
> knows? Not me.
>
> But I do know that English typists typing 26 letters of the alphabet
> leave Asian typists and their thousands of ideograms in the dust. There's
> no comparison -- it's like quicksort vs bubblesort *wink*.

70 years ago there was all sorts of scientific evidence
that showed white, Western-European culture did lots of
things better than everyone else, especially non-whites,
in the world.  Let's not go there.  *wink*

> [...]
>>> - debugging regexes is a nightmare;
>>
>> Very complex ones, perhaps.  "Nightmare" seems an overstatement.
>
> You *can't* debug regexes in Python, since there are no tools for (e.g.)
> single-stepping through the regex, displaying intermediate calculations,
> or anything other than making changes to the regex and running it again,
> hoping that it will do the right thing this time.

Thinking in addition to hoping will help quite a bit.

There are two factors that mitigate the lack of debuggers.

1) REs are not a Turing complete language so in some sense
are simpler than Python.

2) The vast majority of REs that I have had to fix or write
are not complex enough to require a debugger.  Often they simply
look complex due to all the parens and backslashes -- once you
reformat them (permanently with the re.VERBOSE flag, or
temporarily in a text editor), they don't look so bad.

> I suppose you can use external tools, like Regex Buddy, if you're on a
> supported platform and if they support your language's regex engine.
>
> [...]
>>> Regarding their syntax, I'd like to point out that even Larry Wall is
>>> dissatisfied with regex culture in the Perl community:
>>>
>>> http://www.perl.com/pub/2002/06/04/apo5.html
>>
>> You did see the very first sentence in this, right?
>>
>>   "Editor's Note: this Apocalypse is out of date and remains here for
>>   historic reasons. See Synopsis 05 for the latest information."
>
> Yes. And did you click through to see the Synopsis? It is a bare
> technical document with all the motivation removed. Since I was pointing
> to Larry Wall's motivation, it was appropriate to link to the Apocalypse
> document, not the Synopsis.

OK, fair enough.

>> (Note that "Apocalypse" is referring to a series of Perl design
>> documents and has nothing to do with regexes in particular.)
>
> But Apocalypse 5 specifically has everything to do with regexes. That's
> why I linked to that, and not (say) Apocalypse 2.

Where did I suggest that you should have linked to Apocalypse 2?
I wrote what I wrote to point out that the "Apocalypse" title was
not a pejorative comment on regexes.  I don't see how I could have
been clearer.

>> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
>> syntax.  I didn't see anything about de-emphasizing them in Perl.  (But
>> I have no idea what is going on for Perl 6 so I could be wrong about
>> that.)
>
> I never said anything about de-emphasizing them. I said that Larry Wall
> was dissatisfied with Perl's culture of regexes -- his own words were:
>
> "regular expression culture is a mess"

Right, and I quoted that.  But I don't know what he meant
by "culture of regexes".  Their syntax?  Their extensive use
in Perl?  Something else?  If you don't care about their
de-emphasis in Perl, then presumably their extensive use
there is not part of what you consider "culture of regexes",
yes?  So to you, "culture of regexes" refers only to the
syntax of Perl regexes?

I pointed out that regexes in Perl 6 (AFAICT from the
Synopsis 05 document) are still as widely used as in
Perl 5.  However the document also describes changes in *how*
they are used within Perl (e.g., the production of Match objects).
So I conclude the *use* of regexes is part of Larry Wall's concept
of "regex culture".

Further, my guess is that the term means something else again
to many Python programmers -- something more akin to the
LW concept but with a much greater negative valuation.

> and he is also extremely critical of current (i.e. Perl 5) regex syntax.
> Since Python's regex syntax borrows heavily from Perl 5, that's extremely
> pertinent to the issue. When even the champion of regex culture says
> there is much broken about regex culture, we should all listen.

I'll just note that "extremely" is a description you have chosen
to apply.  He identified problems (some of which have developed
since regexes started being widely used) and changes to improve
them.  One could say GvR was "extremely" critical of the
str/unicode situation in Python 2.  It would be a bit much to use
that to say that one should avoid the use of text in Python 2
programs.

The Larry Wall who you claim is "extremely critical of current
regex syntax" proposed the following in the new "fixed" regex
syntax (from the Synopsis 05 doc):

    Unchanged syntactic features
      The following regex features use the same syntax as in Perl 5:
      Capturing: (...)
      Repetition quantifiers: *, +, and ?
      Alternatives: |
      Backslash escape: \
      Minimal matching suffix: ??, *?, +?

Those, with character classes (including "\"-named ones) and
non-capturing ()'s, constitute about 99+% of my regex uses and the
overwhelming majority of regexes I have had to work with.

Nobody here has claimed that regexes are perfect.  No doubt the
Perl 6 changes are an improvement but I doubt that they change
the nature of regexes anywhere near enough to overcome the complaints
against them voiced in this group.  Further, those changes will
likely take years or decades to make their way into the Python
standard library if at all.  (Perl is no longer the thought-leader
it once was, and the new syntax is competing against innumerable
established uses of the old syntax outside of Perl.)  Thus, although
I look forward to the new syntax, I don't see it as any kind of
justification not to use the existing syntax in the meantime.

>> As for the original reference, Wall points out a number of problems with
>> regexes, mostly details of their syntax.  For example that more
>> frequently used non-capturing groups require more characters than
>> less-frequently used capturing groups. Most of these criticisms seem
>> irrelevant to the question of whether hard-wired string manipulation
>> code or regexes should be preferred in a Python program.
>
> It is only relevant in so far as the readability and relative obfuscation
> of regex syntax is relevant. No further.

OK, again you are confirming it is only the syntax of regexes
that bothers you?

> You keep throwing out the term "hard-wired string manipulation", but I
> don't understand what point you're making. I don't understand what you
> see as "hard-wired", or why you think
>
> source.startswith(prefix)
>
> is more hard-wired than
>
> re.match(prefix, source)

What I mean is that I see regexes as being an extremely small,
highly restricted, domain specific language targeted specifically
at describing text patterns.  Thus they do that job better than
trying to describe patterns implicitly with Python code.

> [...]
>> Perhaps you stopped reading after seeing his "regular expression culture
>> is a mess" comment without trying to see what he meant by "culture" or
>> "mess"?
>
> Perhaps you are being over-sensitive and reading *far* too much into what
> I said.

Not sensitive at all.  I expressed an opinion that I think
is under-represented here and could help some get over their
regex-phobia.  Since it doesn't have a provably right or wrong
answer, I expected it would be contested and have no problem
with that.

As for reading too much into what you said, possibly.  I look
forward to your clarifications.

> If regexes were more readable, as proposed by Wall, that would go
> a long way to reducing my suspicion of them.

I am delighted to read that you find the new syntax more
acceptable.  I guess that means that although you would
object to the Perl 5 regex

  /(?mi)^(?:[a-z]|\d){1,2}(?=\s)/

you find its Perl 6 form

  / :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /

a big improvement?

And I presume, based on your lack of comment, the size of the
document required to describe the new syntax does not raise
any concerns for you?  Or the many additional new "line-noise"
meta-characters ("too few metacharacters" was one of the
problems LW described in the Apocalypse document you referred
us to)?  Again, I wonder if you and Larry Wall are really on
the same page with the faults you find in the Perl 5 syntax.

And again with the qualifier that I have not spent much time
reading about the changes, and further my regex-fu is at
a low enough level that I am probably unable to fully
appreciate many of the improvements, the syntax doesn't
really look different enough that I see it overcoming the
objections that I often read here.  Consequently I don't
find the argument "avoid using what is currently available"
very convincing.