[Tutor] regex advice

Tue Jan 6 15:17:53 CET 2015

Norman Khine wrote:

> i have a blade template file, as
> 
> replace page
>   .row
>     .large-8.columns
>       form( method="POST", action="/product/saveall/#{style._id}" )
>         input( type="hidden" name="_csrf" value=csrf_token )
>         h3 #{t("Generate Product for")} #{tt(style.name)}
>         .row
>           .large-6.columns
>             h4=t("Available Attributes")
>             - for(var i = 0; i < attributes.length; i++)
>               - var attr = attributes[i]
>               - console.log(attr)
>                 ul.attribute-block.no-bullet
>                   li
>                     b= tt(attr.name)
>                   - for(var j = 0; j < attr.values.length; j++)
>                     - var val = attr.values[j]
>                       li
>                         label
>                           input( type="checkbox" title="#{tt(attr.name)}:
> #{tt(val.name)}" name="#{attr.id}" value="#{val.id}")
>                           |
>                           =tt(val.name)
>                           = " [Code: " + (val.code || val._id) + "]"
>                           !=val.htmlSuffix()
>           .large-6.columns
>             h4 Generated Products
>             ul#products
>         button.button.small
>           i.icon-save
>           |=t("Save")
>         =" "
>         a.button.small.secondary( href="/product/list/#{style.id}" )
>           i.icon-cancel
>           |t=("Cancel")
> 
> when i run the above code, i get
> 
> - file add.blade (full path:
> ../node-blade-boiler-template/views/product/add.blade)
>  type="hidden" name="_csrf" value=csrf_token
> "Generate product for")} #{tt(style.name
> "Available Attributes"
> attr.name
>  type="checkbox" title="#{tt(attr.name)}: #{tt(val.name)}"
>  name="#{attr.id}"
> value="#{val.id}"
> val.name
> "Generated products"
> "Save"
> 
> 
> 
> so, gettext_re = re.compile(r"""[t]\((.*)\)""").findall is not correct as
> it includes
> 
> results such as input( type="hidden" name="_csrf" value=csrf_token )
> 
> what is the correct way to pull all values that are within t(" ") but
> exclude any tt( ) and input( )
> 
> any advice much appreciated

You can require a word boundary before the 't'. Quoting 
<https://docs.python.org/dev/library/re.html#regular-expression-syntax>:

"""
\b
Matches the empty string, but only at the beginning or end of a word. A word 
is defined as a sequence of Unicode alphanumeric or underscore characters, 
so the end of a word is indicated by whitespace or a non-alphanumeric, non-
underscore Unicode character. Note that formally, \b is defined as the 
boundary between a \w and a \W character (or vice versa), or between \w and 
the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 
'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used, but this can be changed 
by using the ASCII flag. Inside a character range, \b represents the 
backspace character, for compatibility with Python’s string literals.
"""
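
To see what that means for your case, here is a minimal sketch. The candidate strings are made up for illustration, loosely modelled on your template, and the names word_t and boundary_hits are just mine:

```python
import re

# \b requires that the 't' is not glued to a preceding word character,
# so the 't(' inside 'tt(' and 'input(' is rejected.
word_t = re.compile(r"\bt\(")

# Illustrative snippets only, not taken verbatim from the template.
candidates = [
    't("Save")',
    '=t("Save")',
    '#{t("x")}',
    'tt(val.name)',
    'input( type="hidden" )',
]

# Keep only the snippets that contain a standalone t( call.
boundary_hits = [s for s in candidates if word_t.search(s)]
print(boundary_hits)
```

Only the first three snippets should survive, because in the last two the 't' before the '(' is preceded by another word character.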

Also, you are probably better off with a non-greedy match, because the greedy 
(.*) runs on to the last ')' on the line. So

>>> import re
>>> sample = 'yadda t("foo") [t("bar")] input("baz")'
>>> re.findall(r"t\((.*)\)", sample)
['"foo") [t("bar")] input("baz"']
>>> re.findall(r"t\((.*?)\)", sample)
['"foo"', '"bar"', '"baz"']
>>> re.findall(r"\bt\((.*?)\)", sample)
['"foo"', '"bar"']
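
If you want the message text without the surrounding quotes, one possible refinement is to require a quoted string inside the parentheses and capture only its contents. This is a sketch under my own assumptions, not something taken from your script: the pattern, the helper extract_messages() and the shortened template below are hypothetical, and the pattern does not handle escaped quotes inside a message.

```python
import re

# Assumed pattern: a standalone t( followed by a single- or double-quoted
# string. Group 1 is the quote character, group 2 the message text; \1
# makes sure the closing quote is the same kind as the opening one.
gettext_re = re.compile(r"""\bt\(\s*(["'])(.*?)\1\s*\)""")


def extract_messages(text):
    """Return the text of every t("...") call found in `text`."""
    return [m.group(2) for m in gettext_re.finditer(text)]


# A few lines modelled on the template, shortened for illustration.
template = '''\
input( type="hidden" name="_csrf" value=csrf_token )
h3 #{t("Generate Product for")} #{tt(style.name)}
h4=t("Available Attributes")
b= tt(attr.name)
|=t("Save")
|t=("Cancel")
'''

messages = extract_messages(template)
print(messages)
```

Note that the last template line, |t=("Cancel"), is not picked up because the '=' sits between the 't' and the '('. That looks like a typo in the template rather than something the regex should work around.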