[Tutor] lil help please - updated (fwd)

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Nov 24 11:35:24 CET 2005


[Slightly busy at the moment; can someone else help?

Alan, in the future, don't send replies directly to me: send them to the
Tutor list.  It's an ad-hoc way to load-balance your questions across all
the tutors.]


---------- Forwarded message ----------
Date: Thu, 24 Nov 2005 03:55:00 -0600
From: Alan <ldapguru at yahoo.com>
To: ldapguru at yahoo.com, dyoo at hkn.eecs.berkeley.edu
Subject: RE: lil help please - updated

Sorry, here it is again in slightly better English.

I have about 150 lines of Python that extract text from a large file.
Before that script runs, I need a few lines that clean the file first, to
avoid the problems the script is running into.

Overview
There is a large text file that I am trying to organize so the Python
script can process it. It is badly organized, and I have tried to put it
into the layout below, which the master script understands.

Keywords:
##### is a number from 1 through 99999
|H paragraphs
|F reFerence
|R Rating

BEFORE I reorganized it with a global search and replace,
each set of tokens looked like this:

#####  paragraph
F reference
R rating

NOW (the layout the master script understands)

|H##### paragraph
|F reference
|R rating

Notice there is no ##### in |F and |R
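
The layout above can be read back as one chunk of text per set. This is only a minimal sketch: the name split_records is made up for illustration, and it assumes every set starts with a line beginning "|H" followed by a number (with or without a space in between).

```python
import re

# Assumption: a set begins at a line like "|H00100 ..." or "|H 00100 ...".
RECORD_START = re.compile(r"\|H\s*\d")

def split_records(text):
    """Split the whole file text into one string per |H/|F/|R set."""
    records = []
    for line in text.splitlines(keepends=True):
        if RECORD_START.match(line) or not records:
            records.append(line)      # a new set (or leading text) starts here
        else:
            records[-1] += line       # continuation, |F, |R or blank line
    return records
```

Any text that appears before the first numbered |H line ends up in a chunk of its own, so nothing is silently dropped.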

PROBLEMS
Phase 1
PROBLEM 1
The |H paragraph (which can span several lines) has some words in
parentheses, such as (xyz blah words), and these may span several lines too:
….( blah blah
blah blah) …

We need to move that text to the end of the |F line, giving
|F reference (xyz blah words)


Example
BEFORE

|H 00100 a friend in need is a friend indeed (author means both young \
and old) so select the best friend as soon as you can blah
|F Old London book
|R Cool

AFTER your process
|H 00100 "a friend in need is a friend indeed so select the best friend
as soon as you can blah"
|F Old London book (author means both young and old)
|R Cool
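
One possible way to do Problem 1, as a sketch rather than a finished solution. It follows the description (the text in parentheses moves to the end of the |F reference) and works on the text of a single set. The name move_asides is hypothetical. It assumes the set already contains |H, |F and |R in that order, that parentheses are not nested, and that a trailing backslash inside the parentheses is just a line-continuation mark that can be dropped. It does not add the quotation marks shown around the paragraph in the AFTER sample.

```python
import re

# One parenthesised aside plus any whitespace in front of it.
# [^()] also matches newlines, so an aside may span several lines.
ASIDE = re.compile(r"\s*\(([^()]*)\)")

def move_asides(record):
    """Move every (...) from the |H paragraph to the end of the |F reference."""
    head, f_sep, rest = record.partition("\n|F")
    ref, r_sep, tail = rest.partition("\n|R")
    if not f_sep or not r_sep:
        return record                 # not a complete |H/|F/|R set: leave it alone
    # Collapse line breaks (and backslashes) inside each aside to single spaces.
    asides = [" ".join(a.replace("\\", " ").split()) for a in ASIDE.findall(head)]
    if not asides:
        return record
    head = ASIDE.sub("", head)        # remove the asides from the paragraph
    ref = ref.rstrip() + "".join(" ({})".format(a) for a in asides)
    return head + f_sep + ref + r_sep + tail
```

Because the line break inside the parentheses is removed together with the aside, the paragraph lines on either side of it end up joined into one line.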

PROBLEM 2
I need to find out where the order is broken so I can go and fix it by
hand, i.e. any set that is not in the order |H##### |F |R should be
written to ErrorOrderLogFile, for example

|H##### paragraph
|H paragraph
|R rating

or any other wrong order.

For example, run the new cleaning script on this input:
|H00299 paragraph
|F Reference
|H Rating

|H00300 paragraph
|H paragraph
|H rating

cat ErrorOrderLogFile:
bad set orders
|H00300 paragraph
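
A rough sketch of the order check in Problem 2. The names find_bad_sets and write_error_log are made up for illustration. It assumes a set starts at a numbered |H line, that lines not starting with "|" are continuation text, and that a set is good only when its markers are exactly H, F, R. Under that reading a set such as |H, |F, |H is reported too.

```python
import re

SET_START = re.compile(r"^\|H\s*\d+")   # numbered |H line opens a set
MARKER = re.compile(r"^\|([A-Z])")      # any other |X line inside a set

def find_bad_sets(lines):
    """Return the first line of every set whose markers are not H, F, R."""
    sets = []                           # list of (first_line, [markers])
    for line in lines:
        line = line.rstrip("\n")
        if SET_START.match(line):
            sets.append((line, ["H"]))
        elif sets:
            m = MARKER.match(line)
            if m:                       # continuation and blank lines are skipped
                sets[-1][1].append(m.group(1))
    return [first for first, markers in sets if markers != ["H", "F", "R"]]

def write_error_log(lines, log_path="ErrorOrderLogFile"):
    """Write the bad sets to the log file, one first line per set."""
    bad = find_bad_sets(lines)
    with open(log_path, "w") as log:
        log.write("bad set orders\n")
        for first in bad:
            log.write(first + "\n")
    return bad
```

Marker lines that come before the first numbered |H line are ignored by this sketch, so they would need a separate check.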


Phase II
PROBLEM 3
Once I have fixed the order by hand, I need to renumber all the sets, say
from 00001 to 99999, in this format:

|H00001 paragraph
|F00001 reference
|R00001 rating

|H99999 paragraph
|F99999 reference
|R99999 rating
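
The renumbering in Problem 3 could be sketched like this. The name renumber is hypothetical. It assumes the order has already been fixed, so each set has exactly one |H line, and that every |H line carries an old number. On |F and |R lines a number is treated as an old set number only when it touches the marker (as in |F00001), so a reference that starts with digits after a space is left alone.

```python
import re

FIELD = re.compile(r"^\|([HFR])(.*)$")
OLD_NUMBER = re.compile(r"^\d+")

def renumber(lines, start=1):
    """Give every set a fresh 5-digit number on its |H, |F and |R lines."""
    out = []
    n = start - 1
    for line in lines:
        line = line.rstrip("\n")
        m = FIELD.match(line)
        if m is None:
            out.append(line)          # continuation or blank line: keep as is
            continue
        marker, rest = m.groups()
        if marker == "H":
            n += 1                    # each |H line opens the next set
            rest = rest.lstrip()      # accept "|H 00100" as well as "|H00100"
        rest = OLD_NUMBER.sub("", rest).lstrip()
        out.append("|%s%05d %s" % (marker, n, rest))
    return out
```

Running it a second time on its own output gives the same result, since the numbers it wrote are stripped and written again.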





---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.778 / Virus Database: 525 - Release Date: 10/15/2004





