how to split this kind of text into sections
Tim Chase
python.list at tim.thechases.com
Sat Apr 26 12:59:56 EDT 2014
On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my
> native language, I don't know whether I can make it clear at last.
Your follow-up reply made much more sense and your written English is
far better than many native speakers'. :-)
> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called
> the KEY for this section. For every 2 neighbor sections, they have
> different KEYs.
I suspect you have a minimum number of characters (or words) to
consider, otherwise a single character duplicated at the beginning of
the line would delimit a section, such as
    abcd
    afgh
because they share the commonality of an "a". The code I provided
earlier should give you what you describe. I've tweaked and tested,
and provided it below. Note that I require a minimum overlap of 6
characters (MIN_LEN). It also gathers the initial stuff (that you
want to discard) under the empty key, so you can either delete that,
or ignore it.
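
As a rough illustration of the minimum-overlap idea (not part of the code posted with this message), the check can be sketched with the standard library's os.path.commonprefix, which compares strings character by character. The helper names shared_prefix and starts_section are made up for this example, and the sample strings are invented.

```python
import os.path

MIN_LEN = 6  # same threshold as described above; tune it to the data


def shared_prefix(a, b):
    """Return the leading characters that a and b have in common."""
    # commonprefix works character by character, not per path component
    return os.path.commonprefix([a, b])


def starts_section(prev, cur, min_len=MIN_LEN):
    """True if two consecutive lines share at least min_len leading characters."""
    return len(shared_prefix(prev, cur)) >= min_len


# "abcd" and "afgh" only share "a", so they do not delimit a section
print(starts_section("abcd", "afgh"))
# two invented header lines that share a longer prefix
print(repr(shared_prefix("chapter one: intro", "chapter one: scope")))
```

With a threshold of 6, the one-character overlap in the abcd/afgh pair is rejected, while the longer shared prefix in the second pair would count as a KEY.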
> I need a method to split the whole text into SECTIONs and to know
> all the KEYs
>
> I have tried to solve this problem via re module
I don't think the re module will be as much help here.
-tkc
from collections import defaultdict
import itertools as it

MIN_LEN = 6  # minimum shared prefix length that marks a new section

def overlap(s1, s2):
    "Given 2 strings, return the initial overlap between them"
    return ''.join(
        c1
        for c1, c2
        in it.takewhile(
            lambda pair: pair[0] == pair[1],
            zip(s1, s2)  # zip rather than izip, so this runs on Python 2 and 3
            )
        )

prevline = ""
output = defaultdict(list)
key = ""  # the initial (empty) key under which preamble gets stored
with open("data.txt") as f:
    for line in f:
        # two consecutive lines sharing MIN_LEN leading characters start a section
        if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
            key = overlap(prevline, line)
        output[key].append(line)
        prevline = line

for k, v in output.items():
    print(str(k).center(60, '='))
    print(''.join(v))
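
One thing to be aware of with the loop as posted: the first of the two special lines is appended before the match with the second line is detected, so it ends up under the previous key. The following is a hedged sketch of one possible variant, not the code from the message. It assumes both special lines should belong to the new section, works on any iterable of lines, and keeps sections in order of first appearance. The function name split_sections and the sample data are invented for illustration.

```python
from collections import OrderedDict
from itertools import takewhile

MIN_LEN = 6  # assumed minimum KEY length, as in the message


def overlap(s1, s2):
    """Return the initial overlap between two strings."""
    return ''.join(
        c1 for c1, c2 in takewhile(lambda pair: pair[0] == pair[1], zip(s1, s2))
    )


def split_sections(lines, min_len=MIN_LEN):
    """Group lines by KEY; preamble lines are stored under the empty key."""
    sections = OrderedDict()
    key = ""
    prev = None
    for raw in lines:
        line = raw.rstrip("\n")
        if prev is not None:
            common = overlap(prev, line)
            if len(common) >= min_len and common != key:
                # prev was filed under the old key; move it to the new section
                sections[key].pop()
                if not sections[key]:
                    del sections[key]
                key = common
                sections.setdefault(key, []).append(prev)
        sections.setdefault(key, []).append(line)
        prev = line
    return sections


sample = [
    "junk\n",
    "alpha-1 first\n", "alpha-1 second\n", "body a\n",
    "beta-22 x\n", "beta-22 y\n", "body b\n",
]
result = split_sections(sample)
for k, v in result.items():
    print(repr(k), v)
```

If the same KEY shows up again later in the file, this sketch appends to the existing section rather than starting a new one; whether that is desirable depends on the data.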