[Tutor] lil help please - updated (fwd)

Thu Nov 24 18:52:15 CET 2005

| I have about 150 lines of python extracting text from large file, the
| problem I need a few lines to clean first to avoid the problem the
| script is facing

Hello,

This seems like a well laid out task. If you post what you are trying and the problems you are encountering, that would be helpful.

One suggestion that I have is that you switch problems 1 and 2. If the ordering is broken (e.g. HHFR instead of HFRH) then knowing where to put the parenthetical comment is going to be a problem. Also, you said that you wanted it put after the "F" reference; did you mean that it should look like this:

| AFTER your process
|| H 00100 "a friend in need is a friend indeed so select the best
|| friend 
| as soon as you can blah"
|| F Old London book (xyz blah words)  <=== parenthetical here?
|| R Cool

It's a little hard to tell from what you've said, but it looks like the "|" was an unnecessary addition. If your record markers were always a single character at the beginning of a line, those are easy enough to find--provided there is never an H, F, or R that is a NON-record marker at the beginning of a line as a single character.

######
>>> text='''H This is the start.
... F here is a reference. 
... Right here is a non-reference R but it's not a single character starting the line
... so it won't be matched; and the single one in the middle isn't at the start.
... R cool'''
>>> import re
>>> text = '\n'+text     #make the first one like all the others: preceded by newline character
>>> re.findall(r'\n([HFR])\b', text)
['H', 'F', 'R']
>>> re.split(r'\n([HFR])\b', text)
['', 'H', ' This is the start.', 'F', " here is a reference. \nRight here is a non-reference R but it's not a single character starting the line\nso it won't be matched; and the single one in the middle isn't at the start.", 'R', ' cool']

######

That last list has all the groups with the identifier preceding the corresponding data.
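If it helps, here is a rough sketch of how that list could be turned into (marker, data) pairs. The shortened sample text and the names `parts`, `markers`, `bodies` and `pairs` are only for illustration; I'm assuming the data may be stripped of surrounding whitespace, which may not be what you want for your file.

```python
import re

text = '''H This is the start.
F here is a reference.
R cool'''

# Prepend a newline so the first marker is preceded by one like all the others.
parts = re.split(r'\n([HFR])\b', '\n' + text)

# parts[0] is whatever came before the first marker (empty here);
# after that, markers and data alternate.
markers = parts[1::2]
bodies = [body.strip() for body in parts[2::2]]
pairs = list(zip(markers, bodies))
print(pairs)
```

The slicing works because the capturing group in the pattern makes re.split keep each marker in the result, right before the text that follows it.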

Finally, I'm not sure how you are checking the correctness of the HFR sequence, but the findall used above suggests a way to do it:

-do the findall
-join the results together
-replace 'HFR' with '.'
-if the whole string isn't dots then there was a problem, and the number of dots before the first non-dot tells you how many correct records there were.

######
>>> bad='''
... H
... F
... R
... R
... '''
>>> re.findall(r'\n([HFR])\b', bad)
['H', 'F', 'R', 'R']
>>> ''.join(_)            # the _ refers to the last output
'HFRR'
>>> _.replace('HFR', '.')
'.R'
>>> len(_),_.count('.')
(2, 1)

######

Notice that since not all the HFR records were complete, not all the characters are periods (and so the count of periods is not the same as the length of the string). In this case there was one correct record (thus one leading dot) before the problem occurred.
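The same check could be wrapped up in a small function, something like the sketch below. The name `check_hfr` and its return value are just my invention, and it assumes (as above) that H, F and R only ever start a line as a single character when they are record markers.

```python
import re

def check_hfr(text):
    """Return (ok, good): ok is True if the markers are all complete
    HFR groups; good is the number of complete groups found before
    the first problem."""
    # Prepend a newline so a marker on the very first line is matched too.
    markers = ''.join(re.findall(r'\n([HFR])\b', '\n' + text))
    collapsed = markers.replace('HFR', '.')
    # Count the leading dots: each one stands for a complete HFR group.
    good = len(collapsed) - len(collapsed.lstrip('.'))
    ok = good == len(collapsed)
    return ok, good

print(check_hfr('H\nF\nR\nR\n'))        # the broken example from above
print(check_hfr('H\nF\nR\nH\nF\nR\n'))  # two complete records
```

Counting only the leading dots matters: a complete HFR group that comes after the first bad marker still turns into a dot, but it shouldn't be counted among the records before the problem.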

/c