[Tutor] How to match strange characters

J. Van Brimmer jerry.vb at gmail.com
Mon Sep 8 17:59:20 CEST 2008


Thanks Paul, this looks like just what I need to reformat PRECESS's
output.

Thanks,
Jerry

Paul McGuire wrote:
> Instead of trying to match on the weird characters, in order to remove them,
> here is a pyparsing program that ignores those header lines and just
> extracts the interesting data for each section.
>
> In a pyparsing program, you start by defining what patterns you want to look
> for.  This is similar to the re module, but uses friendlier names like
> OneOrMore, Group, and Combine instead of special characters that require
> backslashes and so on.  By default, pyparsing skips over whitespace between
> expressions, so we use Combine to override this (as in realnum, in which we
> want to match "3.1415", but not "3 . 1415").
>
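A minimal sketch of the whitespace behaviour described above, assuming pyparsing is installed. The names loose, tight, and matches are made up for this illustration and are not part of Paul's program.

```python
from pyparsing import Combine, Word, nums, ParseException

# Without Combine, pyparsing skips whitespace between the three pieces
loose = Word(nums) + "." + Word(nums)
# With Combine, the pieces must be adjacent and come back as one token
tight = Combine(Word(nums) + "." + Word(nums))


def matches(expr, text):
    """Return True if expr parses the start of text (hypothetical helper)."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


print(loose.parseString("3 . 1415").asList())  # three separate tokens
print(tight.parseString("3.1415").asList())    # a single combined token
print(matches(tight, "3 . 1415"))              # Combine rejects the spaces
```

The uncombined version accepts "3 . 1415" and returns the pieces separately, which is why realnum in the program is wrapped in Combine.
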
> Here is the opening part of the program, which defines the basic bits in
> your data file and the input parameter prompts:
>
> from pyparsing import Combine, Word, nums, Literal, Group, oneOf, OneOrMore
>
> # define basic expressions
> realnum = Combine(Word(nums) + "." + Word(nums))
> two_digit_num = Word(nums,exact=2)
> four_digit_num = Word(nums,exact=4)
> date = Combine(two_digit_num + '-' + two_digit_num + '-' + four_digit_num)
> timestamp = Combine(two_digit_num + ':' + two_digit_num + ':' + 
>                     two_digit_num + '.' + two_digit_num)
>
> # literal prompt strings
> enter_date = Literal("Enter Date for Precession as (MM-DD-YYYY) or C/R for ")
> enter_catalog = Literal("Enter the Catalog Name or C/R for CATALOG.SRC >")
> the_julian_date_is = Literal("The Julian Date is =")
>
> # build up the header definition
> enter_date_line = enter_date + date + ">"
> julian_date_line = the_julian_date_is + realnum("julian_date")
> header = Group(enter_date_line + date("date") + 
>                 enter_catalog + julian_date_line)
>
>
> This next part uses a similar style to define the format of the lines of data.
>
> # build up the definition for a line of data
> field_1 = Word(nums,exact=4) + "+" + Word(nums,exact=3)
> field_2 = realnum
> field_3 = Combine(oneOf("+ -") + realnum)
> field_4 = timestamp
> field_5 = timestamp
> # change the results names as appropriate - I just made these up
> data_line = Group( field_1("fld1") + field_2("magnitude") + 
>         field_3("phase") + field_4("start_time") + field_5("end_time") )
>
> I guessed at/made up names for the fields in the data_line ("fld1",
> "magnitude", etc.).  You should change these to names that make sense in
> your application.
>
> Now a final definition that puts everything together:
>
> # put everything together into a PRECESS run header+data section
> section = header("header") + OneOrMore(data_line)("data")
>
>
> And now use section.scanString to locate all the matching data in your input
> file:
> test = """
> ??????????????????????????????????????????????
> ? Radio Source Precession Program ?
> ? by John B. Doe ?
> ? 31 August 1992 ?
> ??????????????????????????????????????????????
> Enter Date for Precession as (MM-DD-YYYY) or C/R for 05-28-2004 > 
> 05-28-2004
> Enter the Catalog Name or C/R for CATALOG.SRC >
> The Julian Date is = 2453153.5
> 0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65
> 0106+013 17.2117 +1.6052 01:08:50.80 01:36:18.58
> """
>
> # use scanString to read through the input data - this will ignore the 
> # parts of the header with the weird characters
> for data_section, start, end in section.scanString(test):
>     # each data_section holds the parsed results, which can be treated
>     # like an object or a dict, using the results names for attribute names
>     # or dict keys - the dump() method shows a structured output; keys(),
>     # values(), and items() work just like in a dict
>     print(data_section.dump())
>     print(data_section.header.julian_date)
>     # note the use of results name to access the "data" part
>     for d in data_section.data:
>         print(d.dump())
>         print("   %s %s %s" % (d.start_time, d.end_time, d.phase))
>
>
> Note how the results names are used to access the matched fields in the
> input.
>
> This creates the following output:
> [['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', '05-28-2004', ...
> - data: [['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', ...
> - header: ['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', ...
>   - date: 05-28-2004
>   - julian_date: 2453153.5
> 2453153.5
> ['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', '00:16:16.65']
> - end_time: 00:16:16.65
> - fld1: ['0022', '+', '002']
> - magnitude: 5.6564
> - phase: +0.2713
> - start_time: 00:22:37.54
>    00:22:37.54 00:16:16.65 +0.2713
> ['0106', '+', '013', '17.2117', '+1.6052', '01:08:50.80', '01:36:18.58']
> - end_time: 01:36:18.58
> - fld1: ['0106', '+', '013']
> - magnitude: 17.2117
> - phase: +1.6052
> - start_time: 01:08:50.80
>    01:08:50.80 01:36:18.58 +1.6052
>
> You can get the complete program at this pastebin URL:
> http://pyparsing.pastebin.com/m6f0ae6bc
>
> If you still want to use re's, then this program might still help you, at
> least in laying out what your re's should match at different places in the
> data.
>
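For comparison, here is a rough re-based sketch of the same data-line layout, using only the standard library. The pattern name DATA_LINE and the sample string are assumptions for illustration; the group names simply reuse the made-up results names from the pyparsing version, and the pattern only covers the two sample rows shown in this thread.

```python
import re

# One named group per field, mirroring field_1 .. field_5 in the pyparsing
# program. Assumption: fields are separated by spaces or tabs on one line.
DATA_LINE = re.compile(r"""
    ^[ \t]*
    (?P<fld1>\d{4}\+\d{3})[ \t]+                        # e.g. 0022+002
    (?P<magnitude>\d+\.\d+)[ \t]+                       # e.g. 5.6564
    (?P<phase>[+-]\d+\.\d+)[ \t]+                       # e.g. +0.2713
    (?P<start_time>\d{2}:\d{2}:\d{2}\.\d{2})[ \t]+      # e.g. 00:22:37.54
    (?P<end_time>\d{2}:\d{2}:\d{2}\.\d{2})              # e.g. 00:16:16.65
    [ \t]*$
""", re.VERBOSE | re.MULTILINE)

sample = """\
Enter the Catalog Name or C/R for CATALOG.SRC >
The Julian Date is = 2453153.5
0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65
0106+013 17.2117 +1.6052 01:08:50.80 01:36:18.58
"""

# Lines that do not look like data (prompts, banner) are simply not matched
rows = [m.groupdict() for m in DATA_LINE.finditer(sample)]
for row in rows:
    print(row["fld1"], row["start_time"], row["end_time"], row["phase"])
```

As with scanString, finditer skips over anything that does not fit the pattern, so the banner lines with the odd characters never need to be matched explicitly.
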
> -- Paul
>