intro to regular expressions
Here's a module people could easily expand upon, staying with
Jabberwocky as the target text.

I'm by no means a master of the regexp. For example, I wanted to
pick out all sentences with Jabberwock, including those beginning
and ending with quote marks (if present), i.e. keeping the quotes
in the match. My current attempt loses the quote marks, keeping
only the enclosed sentence.

One could imagine 20-30 more regexps, if not hundreds, populating
this file. The doctest version could display expected output
(except it gets kinda verbose (appended) -- maybe selected
examples only...).

Kirby

===

"""
Playing with regular expressions...

GPL 2010
4D Solutions

This small suite of tests could easily be augmented
with more elaborate ones, or simply variations on the theme.
Consider this a workbench for testing out your regexps.
"""

import re

poem = """
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought-
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
"""

def show_all(title, regexp, match_list):
    # Underlined title, then the pattern, then each match and a separator.
    print("%s\n%s" % (title, len(title) * "="))
    print("regexp: %s\n" % regexp)
    if match_list:
        for list_item in match_list:
            print(list_item, '\n----')
    else:
        print("No Matches")
    print("\n\n")

def test0():
    r"""
    Show the lines of text in which Jabberwock appears, with ^
    matching after \n, not just at the start of the string
    (the purpose of the MULTILINE flag). Given the word
    Jabberwock is not in the first line, there is no match
    without MULTILINE.
    """
    regexp = r"^.*Jabberwock.*$"
    p = re.compile(regexp, re.MULTILINE)
    m = p.findall(poem)
    show_all("Lines with Jabberwock", regexp, m)

def test1():
    r"""
    Sentences in which Jabberwock appears, starting with a
    capital letter and ending with punctuation. The non-greedy
    .*? matches across \n because of DOTALL. After the first
    capitalized word, the matcher goes through any character
    that's not a terminating punctuation mark or a double quote,
    through the string Jabberwock, and on to the terminus.
    """
    regexp = r'[A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]'  # how to include outside quotes if present?
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Sentences with Jabberwock", regexp, m)

def test2():
    """
    Find all strings enclosed in quotes (") that also end with
    an exclamation point. The *? makes * behave in a "non-greedy"
    manner, so the first satisfying stretch of text is considered
    a match.
    """
    regexp = r'".*?!"'
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Exclamations", regexp, m)

def test3():
    """
    Here we're looking for words starting with capital letters,
    then we're grabbing up to 3 characters on either side,
    including newlines if need be. The DOTALL is what picks up
    newlines.
    """
    regexp = r'.{0,3}[A-Z]\w+\b.{0,3}'
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Capitals", regexp, m)

def test4():
    """
    Here we're looking for words starting with capital letters,
    then we're grabbing up to 3 characters on either side,
    including newlines if need be. The DOTALL is what picks up
    newlines.
    """
    # Currently identical to test3 and not called by alltests().
    regexp = r'.{0,3}[A-Z]\w+\b.{0,3}'
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Capitals", regexp, m)

def alltests():
    test0()
    test1()
    test2()
    test3()

if __name__ == "__main__":
    alltests()

===

Lines with Jabberwock
=====================
regexp: ^.*Jabberwock.*$

"Beware the Jabberwock, my son!
----
The Jabberwock, with eyes of flame,
----
"And hast thou slain the Jabberwock?
----



Sentences with Jabberwock
=========================
regexp: [A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]

Beware the Jabberwock, my son!
----
And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!
----
And hast thou slain the Jabberwock?
----



Exclamations
============
regexp: ".*?!"

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
----
"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
----



Capitals
========
regexp: .{0,3}[A-Z]\w+\b.{0,3}


'Twas br
----
es
Did gy
----
e;
All mi
----
s,
And th
----


"Beware th
----
e Jabberwock, m
----
n!
The ja
----
h!
Beware th
----
e Jubjub bi
----
un
The fr
----
us Bandersnatch!"

----

He to
----
d:
Long ti
----
t-
So re
----
he Tumtum tr
----
e,
And st
----
.

And as
----
d,
The Ja
----
e,
Came wh
----
d,
And bu
----
!

One, t
----
o! One, t
----
gh
The vo
----
k!
He le
----
ad
He we
----


"And ha
----
he Jabberwock?
C
----
y! Callooh! C
----
!"
He ch
----


'Twas br
----
es
Did gy
----
e;
All mi
----
s,
And th
----
participants (1)
-
kirby urner
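
On the question left in test1's comment (how to include outside quotes if present): one possible approach, offered as a sketch rather than the definitive answer, is to allow an optional double quote at each end of the same regexp. The names QUOTED_SENTENCE and jabberwock_sentences are made up for this example, and the second doctest sentence is invented to show a closing quote; it is not from the poem. The docstring also illustrates the doctest idea mentioned in the post, on short inputs so the expected output stays readable.

```python
import re

# test1's regexp with an optional double quote added at each end.
# QUOTED_SENTENCE is a hypothetical name used only in this sketch.
QUOTED_SENTENCE = re.compile(
    r'"?[A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]"?',
    re.DOTALL,  # let .*? run across line breaks, as in test1
)


def jabberwock_sentences(text):
    """Return sentences mentioning Jabberwock, keeping adjacent quotes.

    >>> jabberwock_sentences('"Beware the Jabberwock, my son!')
    ['"Beware the Jabberwock, my son!']
    >>> jabberwock_sentences('He said "The Jabberwock is slain!" and left.')
    ['"The Jabberwock is slain!"']
    """
    return QUOTED_SENTENCE.findall(text)


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # silent when every example passes
```

A quote is kept only when it sits directly against the sentence, so the first sentence of a longer quotation keeps its opening quote but has no closing one to pick up. Sentences with no quotes at all match exactly as they do with the original regexp.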
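
The docstrings in the module lean on two flags, MULTILINE and DOTALL. A small side-by-side run may make their effect easier to see. This is only an illustrative sketch: the variable names are invented here, and the sample is a single stanza of the poem with a leading newline, mirroring how the module's poem string begins.

```python
import re

# One stanza standing in for the whole poem; it starts with a newline,
# just as the module's poem string does.
sample = (
    '\n"And hast thou slain the Jabberwock?\n'
    'Come to my arms, my beamish boy!\n'
    'O frabjous day! Callooh! Callay!"\n'
)

line_pat = r"^.*Jabberwock.*$"   # the regexp from test0
quote_pat = r'".*?!"'            # the regexp from test2

# Without MULTILINE, ^ matches only at the very start of the string, and
# . cannot cross the leading newline, so nothing is found.
lines_plain = re.findall(line_pat, sample)

# With MULTILINE, ^ and $ also match at every line boundary.
lines_multi = re.findall(line_pat, sample, re.MULTILINE)

# Without DOTALL, . stops at a newline, so a quotation that spans
# several lines cannot be matched.
quotes_plain = re.findall(quote_pat, sample)

# With DOTALL, . matches newlines too, and the non-greedy .*? stops at
# the first !" it can reach.
quotes_dotall = re.findall(quote_pat, sample, re.DOTALL)

print(lines_plain)
print(lines_multi)
print(quotes_plain)
print(quotes_dotall)
```

Both unflagged searches come back empty on this sample, while each flagged search returns one match: the line containing Jabberwock, and the full three-line quotation.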