Help to find a regular expression to parse po file

Hallvard B Furuseth h.b.furuseth at usit.uio.no
Mon Jul 6 11:04:06 EDT 2009


gialloporpora writes:
> I would like to extract string from a PO file. To do this I have created
> a little python function to parse po file and extract string:
>
> import re
> regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
> m=regex.findall(s)

I don't know the syntax of a po file, but this works for the
snippet you posted:

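import re

# One double-quoted string; a backslash escapes the character after it.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
# One or more such strings separated by whitespace (continuation lines).
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)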
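
To try it out, here is a small self-contained run. The sample text is
made up for illustration and is not taken from a real po file; the
patterns are the same as above, with every piece written as a raw
string.

```python
import re

# Same patterns as above, every piece written as a raw string.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)

# Made-up sample: one plain entry, one with escaped quotes and a
# msgstr continued on a second line.
sample = (
    'msgid "Hello"\n'
    'msgstr "Ciao"\n'
    '\n'
    'msgid "Say \\"hi\\""\n'
    'msgstr ""\n'
    '"Di \\"ciao\\""\n'
)

pairs = find_re.findall(sample)
for msgid, msgstr in pairs:
    print(repr(msgid), '->', repr(msgstr))
```

Note that each captured group still includes the surrounding double
quotes, and a continued value comes back as one string with the
newline between its parts, so some post-processing is needed.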

However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
something.
Can there be other keywords between msgid and msgstr?  If so,
add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
Can msgstr come before msgid? If so, forget using a single regexp.
Anything else to the syntax to look out for?  Single quotes, maybe?
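
If other keywords can appear between msgid and msgstr, the variant
might look roughly like the sketch below. The keyword "msgfoo" in the
sample is invented for the test, and other_re is just a name I picked;
whether a real po file has such lines is exactly what I don't know.

```python
import re

arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
# Zero or more "keyword value" lines, matched lazily so that the
# regexp prefers to find msgstr as early as possible.
other_re = r'(?:\w+\s+' + arg_re + r'\s*\n)*?'
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\n'
    + other_re
    + r'msgstr\s+(' + arg_re + r')\s*\n', re.M)

# "msgfoo" is a hypothetical keyword, used only to exercise other_re.
sample = (
    'msgid "one"\n'
    'msgfoo "a keyword invented for this test"\n'
    'msgstr "uno"\n'
)

pairs = find_re.findall(sample)
print(pairs)
```

One catch: \w+ also matches "msgid", so if an entry lacks its msgstr,
this version quietly skips over the next msgid line and pairs the
first msgid with the following entry's msgstr.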

Is it a problem if the regexp isn't quite right and doesn't match all
cases, yet doesn't report an error when that happens?
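
If silent misses are a problem, one cheap sanity check is to count the
lines that start with msgid and compare that with the number of
matches. This is only a sketch: find_pairs_checked is a hypothetical
helper name, and the check assumes every msgid line should produce
exactly one match.

```python
import re

arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)


def find_pairs_checked(text):
    """Return (msgid, msgstr) pairs; raise if a msgid line went unmatched."""
    pairs = find_re.findall(text)
    # Count lines whose first word is exactly "msgid".
    expected = len(re.findall(r'^msgid\b', text, re.M))
    if len(pairs) != expected:
        raise ValueError(
            'matched %d of %d msgid lines' % (len(pairs), expected))
    return pairs


good = 'msgid "Hello"\nmsgstr "Ciao"\n'
# Single quotes are not handled by arg_re, so this entry is missed.
bad = "msgid 'Hello'\nmsgstr \"Ciao\"\n"

print(find_pairs_checked(good))
try:
    find_pairs_checked(bad)
except ValueError as exc:
    print('problem:', exc)
```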

All in all, it may be a bad idea to squeeze this into a single regexp.
It gets ugly real fast.  Might be better to parse the file in a more
regular way, maybe using regexps just to extract each (keyword, "value")
pair.
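
A rough sketch of that line-by-line approach follows. It rests on my
assumptions, not on the po spec: lines starting with # are comments,
each value is one double-quoted string per line, and a bare quoted
line continues the previous keyword. The names line_re and
parse_pairs are made up for the example.

```python
import re

# Optional keyword, then one double-quoted string with backslash escapes.
line_re = re.compile(r'^(?:(\w+)\s+)?("[^\\"]*(?:\\.[^\\"]*)*")\s*$')


def parse_pairs(text):
    """Return a list of (keyword, [quoted strings]) in file order.

    Raises ValueError on any line that is not blank, not a comment,
    and not a keyword or continuation line.
    """
    items = []
    for lineno, line in enumerate(text.splitlines(), 1):
        # Assumption: blank lines and '#' lines carry no data.
        if not line.strip() or line.startswith('#'):
            continue
        m = line_re.match(line)
        if m is None:
            raise ValueError('line %d: cannot parse %r' % (lineno, line))
        keyword, value = m.groups()
        if keyword is not None:
            items.append((keyword, [value]))
        elif items:
            # Bare quoted string: continuation of the previous keyword.
            items[-1][1].append(value)
        else:
            raise ValueError(
                'line %d: continuation without keyword' % lineno)
    return items


sample = (
    '# a comment\n'
    'msgid "Hello"\n'
    'msgstr ""\n'
    '"Ciao"\n'
)
items = parse_pairs(sample)
print(items)
```

Unlike the single regexp, this complains about any line it does not
understand, and it does not care in which order the keywords arrive.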

-- 
Hallvard


