[Tutor] Picking up citations

Sun Feb 8 23:53:39 CET 2009

Dinesh B Vadhia <dineshbvadhia at hotmail.com> wrote:
> Hi!  I want to process text that contains citations, in this case in legal
> documents, and pull-out each individual citation.

Here is my stab at it, using regular expressions. Any comments welcome.

I had to use two regexes, one to find all citations, and the other one to
split each citation up into its components. They are basically the same
pattern: the former has no capturing groups, so findall() returns each whole
citation as a string, and the latter uses named groups.

***
import re

text = "¤some common-law legal comments¤"

# Same pattern as whole_cit below, but with named groups, so that one
# citation can be split up into its components.
split_up_cit = re.compile(r'(?P<name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?)'  # name
    + r'\sv\.\s'  # versus
    + r'(?P<other_name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?),'  # other name
    + r'(?P<refs>[^\(]+)'  # references
    + r'(?P<year>\(.*?\d\d\d\d\))[,;.]')  # year

# No capturing groups here, so findall() returns whole citations.
whole_cit = re.compile(r'[A-Z]\w+(?:\s[A-Za-z]\w+)*?'  # name
    + r'\sv\.\s'  # versus
    + r'[A-Z]\w+(?:\s[A-Za-z]\w+)*?,'  # other name
    + r'[^\(]+'  # references
    + r'\(.*?\d\d\d\d\)[,;.]')  # year

for cit in whole_cit.findall(text):
    parts = split_up_cit.search(cit)  # search once per citation
    # One output line per reference, repeating the names and the year.
    for ref in parts.group('refs').split(','):
        print('%s v. %s %s %s' % (parts.group('name'),
                                  parts.group('other_name'),
                                  ref.strip(),
                                  parts.group('year')))
***
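
As a possible variation, the two patterns could probably be folded into one:
finditer() yields match objects, so the named groups are available directly
and only one pattern has to be kept up to date. This is only a sketch; the
sample text and the citations() helper are made up for the illustration.

```python
import re

# Hypothetical sample text, made up for this sketch.
text = ("The rule is settled: Smith v. Jones, 123 U.S. 456, "
        "78 S.Ct. 910 (1999).")

# Single pattern with named groups (adjacent string literals are joined).
cit = re.compile(
    r'(?P<name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?)'         # name
    r'\sv\.\s'                                       # versus
    r'(?P<other_name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?),'  # other name
    r'(?P<refs>[^\(]+)'                              # references
    r'(?P<year>\(.*?\d\d\d\d\))[,;.]'                # year
)

def citations(text):
    """Return one (name, other_name, ref, year) tuple per reference."""
    found = []
    for m in cit.finditer(text):
        for ref in m.group('refs').split(','):
            found.append((m.group('name'), m.group('other_name'),
                          ref.strip(), m.group('year')))
    return found

for row in citations(text):
    print('%s v. %s, %s %s' % row)
```

With the sample text above, this gives one row for "123 U.S. 456" and one for
"78 S.Ct. 910", both with the names Smith and Jones and the year "(1999)".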

The results look like what is expected, with the exception of "In John
Doggone Williams" rather than just "John Doggone Williams". As Kent remarked,
it is difficult to leave out of names the parts that should be left out.
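
One crude workaround, offered only as a heuristic sketch, is to trim a known
lead-in word from the captured name after matching. The LEAD_INS list and the
strip_lead_in() helper below are hypothetical; a usable list would have to be
built from the actual documents, and a party whose name really starts with
such a word would be trimmed wrongly.

```python
# Hypothetical list of words that may precede a citation in running text.
LEAD_INS = ('In', 'See')

def strip_lead_in(name):
    """Drop one leading lead-in word from a captured name, if present."""
    for word in LEAD_INS:
        if name.startswith(word + ' '):
            return name[len(word) + 1:]
    return name

print(strip_lead_in('In John Doggone Williams'))
```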

"Page 500" is easier to deal with, however. I make it mandatory that the
first word of the name starts with an uppercase letter ([A-Z]), and that all
other words of the name start with a letter ([A-Za-z]). Yes, I include
lowercase letters so that names like 'Pierre Choderlos de Laclos' or 'Guido
van Rossum' are dealt with correctly. Note that the [A-Za-z] range only
covers unaccented letters, so a word starting with an accented letter will
not be matched correctly.
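
To illustrate the accented-letter caveat, here is a small sketch. It assumes
Python 3, where str patterns are Unicode-aware by default; under Python 2,
unicode strings and the re.UNICODE flag would be needed instead. The class
[^\W\d_] is one way of saying "any letter", and the sample name is only an
example.

```python
import re

# [A-Za-z] only covers unaccented ASCII letters.
ascii_word = re.compile(r'[A-Za-z]\w+')
# [^\W\d_]: a word character that is neither a digit nor an underscore,
# that is, any letter, accented or not.
letter_word = re.compile(r'[^\W\d_]\w+')

name = '\u00c9mile Zola'  # "Emile" with an acute accent on the E

print(ascii_word.findall(name))   # the accented first letter is skipped
print(letter_word.findall(name))  # both words are found whole
```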

Emmanuel