[Tutor] lil help please - updated (fwd)
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Thu Nov 24 11:35:24 CET 2005
[Slightly busy at the moment; can someone else help?
Alan, in the future, don't send replies directly to me: send them to the
Tutor list. It's an ad-hoc way to load-balance your questions across all
the tutors.]
---------- Forwarded message ----------
Date: Thu, 24 Nov 2005 03:55:00 -0600
From: Alan <ldapguru at yahoo.com>
To: ldapguru at yahoo.com, dyoo at hkn.eecs.berkeley.edu
Subject: RE: lil help please - updated
Sorry, here it is again in slightly better English.
I have about 150 lines of Python that extract text from a large file.
The problem is that I need a few more lines to clean the file first, to
avoid the errors the script is running into.
Overview
There is a large text file that I am trying to organize so the Python
script can process it. It is badly organized, and I have tried to put it
into the following form, which the master script understands.
Keywords:
##### is a number from 1 through 99999
|H paragraphs
|F reFerence
|R Rating
BEFORE (before I reorganized the text with a global search and replace)
each set of tokens looked like this:
##### paragraph
F reference
R rating
NOW (the form the master script understands)
|H##### paragraph
|F reference
|R rating
Notice there is no ##### in |F and |R.
PROBLEMS
Phase I
PROBLEM 1
The |H paragraph (which can span multiple lines) has some words in
parentheses, such as (xyz blah words), and these may also span multiple
lines:
( blah blah
blah blah)
We need to move the parenthesized text, e.g. (xyz blah words), to the
end of the |F reference.
Example
BEFORE
|H 00100 a friend in need is a friend indeed (author means both young \
and old) so select the best friend as soon as you can blah
|F Old London book
|R Cool
AFTER processing
|H 00100 "a friend in need is a friend indeed so select the best friend
as soon as you can blah"
|F Old London book
|R Cool
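One possible way to approach Problem 1 is sketched below. It is only a
sketch under some assumptions: the names PAREN and move_parens are made
up for illustration, parentheses are assumed not to be nested, and a
trailing backslash is treated as a line-continuation mark to drop. It
follows the written requirement (append the parenthesized text to the
end of |F); the AFTER example above leaves |F unchanged and adds quotes
around the paragraph, which this sketch does not do.

```python
import re

# Hypothetical pattern: optional leading whitespace, then one
# non-nested "(...)" group. [^()] also matches newlines, so a
# parenthetical that spans several lines is caught too.
PAREN = re.compile(r"\s*\(([^()]*)\)")


def move_parens(h_text, f_text):
    """Strip (...) asides from the |H text and append them to |F.

    h_text and f_text are the full text of one |H field and one |F
    field (h_text may contain embedded newlines).
    """
    asides = []
    for inner in PAREN.findall(h_text):
        # Assumption: a backslash is only a line-continuation mark.
        # Collapse the aside onto a single line.
        asides.append(" ".join(inner.replace("\\", " ").split()))
    cleaned = PAREN.sub("", h_text)
    if asides:
        f_text = f_text.rstrip() + " " + " ".join("(%s)" % a for a in asides)
    return cleaned, f_text


h = ("|H 00100 a friend in need is a friend indeed (author means both young \\\n"
     "and old) so select the best friend as soon as you can blah")
f = "|F Old London book"
new_h, new_f = move_parens(h, f)
print(new_h)
print(new_f)
```

The field texts still have to be cut out of the big file first, for
example by collecting lines from one |H/|F/|R marker up to the next.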
PROBLEM 2
I need to find out whether the order is broken so I can go and fix it by
hand. That is, any set whose order is anything other than
|H##### |F |R should be written to ErrorOrderLogFile, for example:
|H##### paragraph
|H paragraph
|R rating
or any other wrong order. For example, I run the new cleaning script on
this input and then cat ErrorOrderLogFile:
|H00299 paragraph
|F Reference
|R Rating
|H00300 paragraph
|H paragraph
|H rating
cat ErrorOrderLogFile:
bad set orders
|H00300 paragraph
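The order check in Problem 2 could be sketched roughly as follows. The
names here (SET_START, TAG, split_sets, bad_sets, write_error_log) are
hypothetical, and the sketch assumes that a new set begins at every |H
line that carries a number, that lines not starting with | continue the
previous field, and that the |H00299 set in the example is meant to be a
good one (|H, |F, |R).

```python
import re

# Assumption: a set starts at "|H" followed by its number. Note that a
# paragraph like "|H 3 friends ..." would also look like a set start.
SET_START = re.compile(r"^\|H\s*\d+")
TAG = re.compile(r"^\|([HFR])")


def split_sets(lines):
    """Group lines into sets, one per numbered |H line."""
    sets = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if SET_START.match(line) or not sets:
            sets.append([])
        sets[-1].append(line)
    return sets


def bad_sets(lines):
    """Return the first line of every set whose tags are not H, F, R."""
    bad = []
    for group in split_sets(lines):
        tags = [m.group(1) for m in map(TAG.match, group) if m]
        if tags != ["H", "F", "R"]:
            bad.append(group[0])
    return bad


def write_error_log(bad, path="ErrorOrderLogFile"):
    """Write the offending sets to the log for fixing by hand."""
    with open(path, "w") as log:
        log.write("bad set orders\n")
        for first_line in bad:
            log.write(first_line + "\n")


sample = [
    "|H00299 paragraph", "|F Reference", "|R Rating",
    "|H00300 paragraph", "|H paragraph", "|H rating",
]
print(bad_sets(sample))
```

Only the first line of each bad set is logged, which matches the sample
log above; logging the whole set would be an easy change.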
Phase II
PROBLEM 3
Once I have fixed the order by hand, I need to renumber all the sets,
from 00001 up to 99999, in this format:
|H00001 paragraph
|F00001 reference
|R00001 rating
|H99999 paragraph
|F99999 reference
|R99999 rating
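The renumbering in Problem 3 might look something like this minimal
sketch. The names H_LINE, FR_LINE and renumber are invented for
illustration. It assumes the order has already been fixed, so every |H
line starts a new set, and that lines have had their trailing newline
stripped. An old number is removed from |H lines; on |F and |R lines
only digits glued directly to the tag are removed, so a reference such
as "|F 1984 edition" keeps its year.

```python
import re

H_LINE = re.compile(r"^\|H\s*\d*\s*")    # tag plus any old number
FR_LINE = re.compile(r"^\|([FR])\d*\s*")  # tag plus digits glued to it


def renumber(lines, start=1):
    """Give each set a sequential 5-digit number on |H, |F and |R."""
    out = []
    n = start - 1
    for line in lines:
        h = H_LINE.match(line)
        fr = FR_LINE.match(line)
        if h:
            n += 1  # every |H line opens the next set
            line = "|H%05d %s" % (n, line[h.end():])
        elif fr:
            line = "|%s%05d %s" % (fr.group(1), n, line[fr.end():])
        # Continuation lines (no leading tag) pass through unchanged.
        out.append(line)
    return out


sample = [
    "|H00100 paragraph", "continued text", "|F reference", "|R rating",
    "|H 00250 other paragraph", "|F ref 2", "|R good",
]
for line in renumber(sample):
    print(line)
```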
---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.778 / Virus Database: 525 - Release Date: 10/15/2004