Nlp, Python and period
Paul Boddie
paul at boddie.org.uk
Mon Aug 4 09:32:12 EDT 2008
On 4 Aug, 12:34, Fred Mangusta <a... at bbb.it> wrote:
>
> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, seems like access to the
> sourceforge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind to cut and paste it here for instance?
I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:
import re

sentence_pattern = re.compile(
    r'(' +
        r'[\(\"\[]*' +      # Quoting or bracketing (optional)
        r'[A-Za-z0-9]' +    # Match sentence with specific start character
        r'.+?' +            # Match sentence content - "?" means non-greedy
        r'[\.\!\?]' +       # End of sentence
        r'[\)\"\]]*' +      # End quoting or bracketing
    r')' +
    r'(\s+)' +              # Spaces
    r'[\(\"\[]*' +          # Quoting or bracketing (optional)
    r'[A-Z0-9]'             # Match next sentence's specific start character
    )
This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations, and I maintain a list of these
against which I test the supposed ends of sentences that the regular
expression provides. I also try to detect initials (e.g.
G. van Rossum) which the regular expression may regard as the end of a
sentence.
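To make the postprocessing step concrete, here is a rough sketch of how such a check might look. It is not the code from the SourceForge posting: the abbreviation list is a small hypothetical sample, the names BOUNDARY, ABBREVIATIONS, INITIAL and split_sentences are invented for illustration, and the boundary pattern uses a lookahead (rather than consuming the start of the next sentence, as the pattern above does) so that re.finditer can walk the text in one pass.

```python
import re

# Candidate boundary: end punctuation, optional closing quotes/brackets,
# whitespace, then something that looks like the start of a sentence.
# The lookahead (?=...) is an assumption of this sketch, not the original pattern.
BOUNDARY = re.compile(r'[.!?][)"\]]*(\s+)(?=[("\[]*[A-Z0-9])')

# Hypothetical sample list; a real list would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs.", "prof.", "vs."}

# A single capital letter followed by a period, as in "G. van Rossum".
INITIAL = re.compile(r'^[A-Z]\.$')

def split_sentences(text):
    """Split text at candidate boundaries, skipping abbreviations and initials."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start(1)]
        # Last whitespace-separated token, without surrounding quotes/brackets.
        last_word = candidate.split()[-1].strip('()[]"')
        if last_word.lower() in ABBREVIATIONS or INITIAL.match(last_word):
            continue  # supposed end of sentence rejected; keep scanning
        sentences.append(candidate)
        start = match.end(1)
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    sample = ('I met Dr. J. Smith today. He asked (politely) "Is it fast?" '
              'Yes, e.g. on short texts.')
    for sentence in split_sentences(sample):
        print(sentence)
```

With the sample text, "Dr." and "J." are rejected as sentence ends, so the first sentence is kept whole. An abbreviation that really does end a sentence (for example "etc." followed by a capitalised word) would be wrongly joined to the next sentence, which is the usual weakness of a pure list-based check.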
As I noted, I'd be interested to hear of any better solutions which
don't involve training.
Paul