[Tutor] How to match strange characters
J. Van Brimmer
jerry.vb at gmail.com
Mon Sep 8 17:59:20 CEST 2008
Thanks Paul, this looks like just what I need to reformat PRECESS's
output into what I need.
Paul McGuire wrote:
> Instead of trying to match on the weird characters in order to remove them,
> you can use a pyparsing program that ignores those header lines and just
> extracts the interesting data for each section.
> In a pyparsing program, you start by defining what patterns you want to look
> for. This is similar to the re module, but uses friendlier names like
> OneOrMore, Group, and Combine instead of special characters that require
> backslashes and so on. By default, pyparsing skips over whitespace between
> expressions, so we use Combine to override this (as in realnum, in which we
> want to match "3.1415", but not "3 . 1415").
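A minimal sketch of the whitespace behaviour described above, assuming pyparsing is installed and using the classic camelCase API names from the post. The name loose_real is made up here for contrast and is not part of Paul's program.

```python
from pyparsing import Combine, Word, nums, ParseException

# Without Combine: whitespace between the pieces is skipped, and the
# pieces come back as three separate tokens.
loose_real = Word(nums) + "." + Word(nums)

# With Combine: the pieces must be adjacent and are joined into one token.
realnum = Combine(Word(nums) + "." + Word(nums))

loose_tokens = loose_real.parseString("3 . 1415").asList()
tight_tokens = realnum.parseString("3.1415").asList()

# The Combine'd version should refuse the spaced-out form.
try:
    realnum.parseString("3 . 1415")
    spaced_ok = True
except ParseException:
    spaced_ok = False

print(loose_tokens)
print(tight_tokens)
print(spaced_ok)
```

The loose expression accepts "3 . 1415" as three tokens, while the Combine'd one only accepts the contiguous form and returns it as a single string.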
> Here is the opening part of the program, which defines the basic bits in your
> data file and the input parameter prompts:
> from pyparsing import Combine, Word, nums, Literal, Group, oneOf, OneOrMore
> # define basic expressions
> realnum = Combine(Word(nums) + "." + Word(nums))
> two_digit_num = Word(nums,exact=2)
> four_digit_num = Word(nums,exact=4)
> date = Combine(two_digit_num + '-' + two_digit_num + '-' + four_digit_num)
> timestamp = Combine(two_digit_num + ':' + two_digit_num + ':' +
> two_digit_num + '.' + two_digit_num)
> # literal prompt strings
> enter_date = Literal("Enter Date for Precession as (MM-DD-YYYY) or C/R for ")
> enter_catalog = Literal("Enter the Catalog Name or C/R for CATALOG.SRC >")
> the_julian_date_is = Literal("The Julian Date is =")
> # build up the header definition
> enter_date_line = enter_date + date + ">"
> julian_date_line = the_julian_date_is + realnum("julian_date")
> header = Group(enter_date_line + date("date") +
> enter_catalog + julian_date_line)
> This next part uses a similar style to define the format of the lines of data.
> # build up the definition for a line of data
> field_1 = Word(nums,exact=4) + "+" + Word(nums,exact=3)
> field_2 = realnum
> field_3 = Combine(oneOf("+ -") + realnum)
> field_4 = timestamp
> field_5 = timestamp
> # change the results names as appropriate - I just made these up
> data_line = Group( field_1("fld1") + field_2("magnitude") +
> field_3("phase") + field_4("start_time") + field_5("end_time") )
> I guessed at/made up names for the fields in the data_line ("fld1",
> "magnitude", etc.). You should change these to names that make sense in
> your application.
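As a small illustration of how results names behave, here is a sketch that names two fields and reads them back both as attributes and as dict keys. It assumes pyparsing is installed; the names signed_real and reading are invented for this example, and "magnitude" and "phase" are the placeholder field names from the post.

```python
from pyparsing import Combine, Group, Word, nums, oneOf

realnum = Combine(Word(nums) + "." + Word(nums))
signed_real = Combine(oneOf("+ -") + realnum)

# Calling an expression with a string attaches a results name to it.
reading = Group(realnum("magnitude") + signed_real("phase"))

result = reading.parseString("5.6564 +0.2713")
row = result[0]                   # the Group's own results

magnitude = row.magnitude         # attribute-style access
phase = row["phase"]              # dict-style access
names = sorted(row.keys())

print(magnitude, phase, names)
```

Renaming a field only means changing the string passed in the call, so the grammar itself stays untouched.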
> Now a final definition that puts everything together:
> # put everything together into a PRECESS run header+data section
> section = header("header") + OneOrMore(data_line)("data")
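To show how the header and the repeated data lines nest in the results, here is a cut-down sketch of the same header + OneOrMore(data_line) shape. The two-column data line and the sample text are simplifications assumed for illustration, not the real PRECESS layout.

```python
from pyparsing import Combine, Group, Literal, OneOrMore, Word, nums

realnum = Combine(Word(nums) + "." + Word(nums))

# Simplified header and data line (assumed shapes, for illustration only)
header = Group(Literal("The Julian Date is =") + realnum("julian_date"))
data_line = Group(Word(nums, exact=4)("fld1") + realnum("magnitude"))

# One header followed by one or more data lines
section = header("header") + OneOrMore(data_line)("data")

sample = """The Julian Date is = 2453153.5
0022 5.6564
0106 17.2117
"""

result = section.parseString(sample)
julian = result.header.julian_date
magnitudes = [d.magnitude for d in result.data]

print(julian)
print(magnitudes)
```

Because each data_line is wrapped in Group, every row keeps its own set of named fields, and the "data" name refers to the whole list of rows.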
> And now use section.scanString to locate all the matching data in your input:
> test = """
> ? Radio Source Precession Program ?
> ? by John B. Doe ?
> ? 31 August 1992 ?
> Enter Date for Precession as (MM-DD-YYYY) or C/R for 05-28-2004 >
> Enter the Catalog Name or C/R for CATALOG.SRC >
> The Julian Date is = 2453153.5
> 0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65
> 0106+013 17.2117 +1.6052 01:08:50.80 01:36:18.58
> """
> # use scanString to read through the input data - this will ignore the
> # parts of the header with the weird characters
> for data_section, start, end in section.scanString(test):
>     # each data_section returns the parsed results, which can be treated
>     # like an object or a dict, using the results names for attribute names
>     # or dict keys - the dump() method shows a structured output; keys(),
>     # values(), and items() work just like in a dict
>     print(data_section.dump())
>     print(data_section.header.julian_date)
>     # note the use of results name to access the "data" part
>     for d in data_section.data:
>         print(d.dump())
>         print(" ", d.start_time, d.end_time, d.phase)
> Note how the results names are used to access the matched fields.
> This creates the following output:
> [['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', '05-28-2004', ...
> - data: [['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', ...
> - header: ['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', ...
> - date: 05-28-2004
> - julian_date: 2453153.5
> ['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', '00:16:16.65']
> - end_time: 00:16:16.65
> - fld1: ['0022', '+', '002']
> - magnitude: 5.6564
> - phase: +0.2713
> - start_time: 00:22:37.54
> 00:22:37.54 00:16:16.65 +0.2713
> ['0106', '+', '013', '17.2117', '+1.6052', '01:08:50.80', '01:36:18.58']
> - end_time: 01:36:18.58
> - fld1: ['0106', '+', '013']
> - magnitude: 17.2117
> - phase: +1.6052
> - start_time: 01:08:50.80
> 01:08:50.80 01:36:18.58 +1.6052
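The way scanString skips over text it cannot match can be seen in isolation with a smaller sketch. The noisy input string below is invented for illustration; only the timestamp expression comes from the program above.

```python
from pyparsing import Combine, Word, nums

two_digit_num = Word(nums, exact=2)
timestamp = Combine(two_digit_num + ":" + two_digit_num + ":" +
                    two_digit_num + "." + two_digit_num)

# Made-up input: timestamps buried in text that the grammar does not describe
noisy = "?? junk ?? 00:22:37.54 more junk 01:08:50.80 ??"

# scanString yields (tokens, start, end) for each match and silently
# steps past everything else.
found = [(tokens[0], start, end)
         for tokens, start, end in timestamp.scanString(noisy)]

for text, start, end in found:
    print(text, noisy[start:end])
```

The start and end offsets index into the original string, which is handy if the surrounding text needs to be cut out or replaced.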
> You can get the complete program at this pastebin URL:
> If you still want to use re's, then this program might still help you, at
> least in laying out what your re's should match at different places.
> -- Paul
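For readers who do stay with the re module, one possible translation of the data_line definition into a verbose regular expression might look like the sketch below. This is an assumption about how the pieces map across, not something from Paul's program; the group names reuse the placeholder results names from the post.

```python
import re

# One data line: NNNN+NNN, a real number, a signed real number,
# and two HH:MM:SS.ss timestamps, separated by whitespace.
data_line_re = re.compile(r"""
    (?P<fld1>\d{4}\+\d{3})\s+
    (?P<magnitude>\d+\.\d+)\s+
    (?P<phase>[+-]\d+\.\d+)\s+
    (?P<start_time>\d{2}:\d{2}:\d{2}\.\d{2})\s+
    (?P<end_time>\d{2}:\d{2}:\d{2}\.\d{2})
""", re.VERBOSE)

line = "0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65"
match = data_line_re.search(line)
fields = match.groupdict()

print(fields)
```

Using search rather than match lets the pattern find a data line even when other characters precede it, similar in spirit to scanString.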