Using Groups inside Braces with Regular Expressions
John Machin
sjmachin at lexicon.net
Sun Jul 13 21:13:11 EDT 2008
On Jul 14, 9:05 am, Chris <chriss... at gmail.com> wrote:
Misleading subject.
[] brackets or "square brackets"
{} braces or "curly brackets"
() parentheses or "round brackets"
> I'm trying to delimit sentences in a block of text by defining the
> end-of-sentence marker as a period followed by a space followed by an
> uppercase letter or end-of-string.
... which has at least two problems:
(1) You are insisting on at least one space between the period and the
end-of-string (this can be overcome, see later).
(2) Periods are often dropped in after abbreviations and contractions
e.g. "Mr. Geo. Smith". You will get three "sentences" out of that.
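Both problems can be seen with a rough sketch of the rule as stated. The pattern below is only an assumed reading of that rule (a period, at least one whitespace character, then a lookahead for a capital letter or end-of-string); it is not the pattern from the original post.

```python
import re

# Assumed reading of the stated rule: period, one or more whitespace
# characters, then (without consuming it) a capital letter or end-of-string.
naive = re.compile(r'\.\s+(?=[A-Z]|$)')

# Problem (1): the final period has no whitespace after it, so it is not
# treated as a separator and stays attached to the last piece.
no_trailing_space = naive.split('One. Two.')

# Problem (2): abbreviations followed by a capitalised word are split too.
abbreviations = naive.split('Mr. Geo. Smith')

print(no_trailing_space)   # ['One', 'Two.']
print(abbreviations)       # ['Mr', 'Geo', 'Smith']
```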
>
> I'd imagine the regex for that would look something like:
> [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
>
> However, Python keeps giving me an "unbalanced parenthesis" error for
> the [^] part.
It's nice to know that Python is consistent with its error messages.
> If this isn't valid regex syntax,
If? It definitely isn't valid syntax. The brackets should delimit a
character class. Either you are trying to cram a somewhat complicated
expression into a character class, or you should be using parentheses.
However it's a bit hard to determine what you really meant that part
of the pattern to achieve.
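To see why the message mentions parentheses, it may help to note that the character class here ends at the first `]`. Under that reading (a sketch of how the parser sees the text, not a guess at what was intended), the class is `[^(?:[A-Z]`, and the `)` that follows `|$` has no matching `(`.

```python
import re

pattern = r'[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)'

# Compiling the pattern from the post fails; keep the exception to inspect it.
error = None
try:
    re.compile(pattern)
except re.error as exc:
    error = exc

print(error)

# The part up to the first ']' is a valid class on its own: it matches any
# single character other than '(', '?', ':', '[' or an uppercase letter.
# (The '[' is escaped here purely for clarity.)
char_class = re.compile(r'[^(?:\[A-Z]')
print(char_class.findall('a(B?c'))   # ['a', 'c']
```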
> how else would I match
> a block of text that doesn't match the delimiter pattern?
Start from the top down:
A sentence is:
anything (with some qualifications)
followed by (but not including):
a period
followed by
either
1 or more whitespaces then a capital letter
or
0 or more whitespaces then end-of-string
So something like this might do the trick:
>>> import re
>>> sep = re.compile(r'\.(?:\s+(?=[A-Z])|\s*(?=\Z))')
>>> sep.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4', '']
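For readability, the same separator can be written in verbose mode so that each piece lines up with the top-down description. This is only a sketch: the helper name `split_sentences` is made up for illustration, and dropping the empty trailing string is an added convenience, not part of the pattern above.

```python
import re

SEP = re.compile(r"""
    \.                  # a period
    (?:
        \s+ (?=[A-Z])   # 1 or more whitespaces, then a capital letter (not consumed)
      |
        \s* (?=\Z)      # 0 or more whitespaces, then end-of-string
    )
""", re.VERBOSE)


def split_sentences(text):
    """Split text on SEP and drop empty pieces (hypothetical helper)."""
    return [piece for piece in SEP.split(text) if piece]


result = split_sentences('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
print(result)
```

Note that "Mr." is still treated as a sentence end; the abbreviation problem is not addressed by this pattern.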