[Tutor] Picking up citations

Kent Johnson kent37 at tds.net
Sat Feb 7 15:21:24 CET 2009


On Sat, Feb 7, 2009 at 1:11 AM, Dinesh B Vadhia
<dineshbvadhia at hotmail.com> wrote:
> Hi!  I want to process text that contains citations, in this case in legal
> documents, and pull-out each individual citation.  Here is a sample text:

<snip>

> The results required are:
>
> Carter v. Jury Commission of Greene County, 396 U.S. 320 (1970)
> Carter v. Jury Commission of Greene County, 90 S.Ct. 518 (1970)
> Carter v. Jury Commission of Greene County, 24 L.Ed.2d 549 (1970)
>
> Lathe Turner v. Fouche, 396 U.S. 346 (1970)
> Lathe Turner v. Fouche, 90 S.Ct. 532 (1970)
> Lathe Turner v. Fouche, 24 L.Ed.2d 567 (1970)
>
> White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)
>
> John Doggone Williams v. Florida, 399 U.S. 78 (1970)
> John Doggone Williams v. Florida, 90 S.Ct. 1893, 234 (1970)
> John Doggone Williams v. Florida, 26 L.Ed.2d 446 (1970)

Here is a close solution using pyparsing. It only gets the last word
of the first name (the party before "v."), and it doesn't handle
multiple page numbers, so it misses J. D. Williams entirely. The name
is hard: how do you know that "Page 500" is not part of "Carter" and
"In" is not part of "John Doggone Williams"? The page numbers seem
possible in theory, but I don't know how to get pyparsing to do it.

from pprint import pprint as pp
from pyparsing import *

text = "" # your text

Name1 = Word(alphas).setResultsName('name1')
Name2 = Combine(OneOrMore(Word(alphas)), joinString=' ',
                adjacent=False).setResultsName('name2')

Volume = Word(nums).setResultsName('volume')
Reporter = Word(alphas, alphanums+".").setResultsName('reporter')
Page = Word(nums).setResultsName('page')

VolumeCitation = (Volume + Reporter + Page).setResultsName(
    'volume_citation', listAllMatches=True)
VolumeCitations = delimitedList(VolumeCitation)

Date = (Suppress('(') +
        Combine(CharsNotIn(')')).setResultsName('date') +
        Suppress(')'))

# Parenthesized so the expression can continue onto a second line
FullCitation = (Name1 + Suppress('v.') + Name2 + Suppress(',') +
                VolumeCitations + Date)

for item in FullCitation.scanString(text):
    fc = item[0]
    # Uncomment the following to see the raw parse results
    # pp(fc)
    # print
    # print fc.name1
    # print fc.name2
    # for vc in fc.volume_citation:
    #     pp(vc)
    for vc in fc.volume_citation:
        print '%s v. %s, %s %s %s (%s)' % (
            fc.name1, fc.name2, vc.volume, vc.reporter, vc.page, fc.date)
    print


The output is:
Carter v. Jury Commission of Greene County, 396 U.S. 320 (1970)
Carter v. Jury Commission of Greene County, 90 S.Ct. 518 (1970)
Carter v. Jury Commission of Greene County, 24 L.Ed.2d 549 (1970)

Turner v. Fouche, 396 U.S. 346 (1970)
Turner v. Fouche, 90 S.Ct. 532 (1970)
Turner v. Fouche, 24 L.Ed.2d 567 (1970)

White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)
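
For the multiple-page case, one possible workaround is sketched below.
It uses the standard-library re module rather than pyparsing, so it is
a different technique from the grammar above, not a fix to it. The
sample string is an assumption (the original text was snipped), and
the names SAMPLE, FULL, VOLUME and expand are made up for illustration.
A trailing ", 234" is treated as an extra page only when no reporter
abbreviation follows it. This does nothing for the name problem: the
lazy ".+?" just takes everything from where the search starts up to
" v. ".

```python
import re

# Assumed sample text; the real input was snipped from the original message.
SAMPLE = ("John Doggone Williams v. Florida, 399 U.S. 78, "
          "90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970)")

# Whole citation: "<names> v. <name>, <volume citations> (<date>)"
FULL = re.compile(
    r"(?P<names>.+?\sv\.\s[^,]+),\s*"
    r"(?P<cites>.+?)\s*"
    r"\((?P<date>[^)]*)\)"
)

# One "volume reporter page[, page...]" group. The negative lookahead
# rejects a number that is followed by a reporter (i.e. the next volume),
# and \b stops the regex from backtracking into the middle of a number.
VOLUME = re.compile(
    r"(?P<volume>\d+)\s+"
    r"(?P<reporter>[A-Za-z][A-Za-z0-9.]*)\s+"
    r"(?P<pages>\d+\b(?:,\s*\d+\b(?!\s+[A-Za-z]))*)"
)


def expand(text):
    """Return one 'names, volume reporter pages (date)' string per reporter."""
    out = []
    for full in FULL.finditer(text):
        for vc in VOLUME.finditer(full.group("cites")):
            out.append("%s, %s %s %s (%s)" % (
                full.group("names"), vc.group("volume"),
                vc.group("reporter"), vc.group("pages"),
                full.group("date")))
    return out
```

Calling expand(SAMPLE) on the assumed sample gives three strings, one
per reporter, with "1893, 234" kept together as the S.Ct. pages.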

Kent

