Discussion on some Code Issues

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jul 5 02:02:44 CEST 2012


On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

[...]
> I got to code a bunch of documents  which are combined together.
[...]
> The task is to separate the documents on the fly and to parse each of
> the documents with a definite set of rules.
> 
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
[...]
> But they are separated by a tag set
[...] 
> To detect the document boundaries,

Let me see if I understand your problem.

You have a bunch of documents. You stick them all together into one 
enormous lump. And then you try to detect the boundaries between one file 
and the next within the enormous lump.

Why not just process each file separately? A simple for loop over the 
list of files, before consolidating them into one giant file, will avoid 
all the difficulty of trying to detect boundaries within files.

Instead of:

merge(output_filename, list_of_files)
for word in parse(output_filename):
    if boundary_detected: do_something()
    process(word)

Do this instead:

for filename in list_of_files:
    do_something()
    for word in parse(filename):
        process(word)
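
To make that concrete, here is a minimal sketch of the per-file loop. The 
names parse, process and process_files are placeholders of mine, not 
anything from your code, and I am assuming that "parse" just means 
splitting on whitespace and that "process" is some per-word step (here, 
lowercasing). Substitute your real parse rules.

```python
def parse(filename):
    """Yield whitespace-separated words from one file (placeholder parser)."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            yield from line.split()


def process(word):
    """Placeholder for the real per-word rule; here it only lowercases."""
    return word.lower()


def process_files(list_of_files):
    """Handle each document on its own and return {filename: [words]}."""
    results = {}
    for filename in list_of_files:
        # Starting a new file *is* the document boundary, so nothing
        # needs to be detected inside the text itself.
        results[filename] = [process(word) for word in parse(filename)]
    return results
```

Because each file is opened and finished before the next one starts, the 
contents of one document can never be mistaken for a separator.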


> I am splitting them into a bag of
> words and using a simple for loop as, 
> for i in range(len(bag_words)):
>         if bag_words[i]=="$":
>             print (bag_words[i],i)


What happens if a file already has a $ in it?
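
Here is a small sketch of what I mean. The function names and the sample 
word list are made up for illustration, and I am assuming "$" is the only 
separator you insert. The first function is your loop written with 
enumerate; the second splits the bag of words at each marker.

```python
def boundary_indices(bag_words, marker="$"):
    """Return the positions of every word equal to the marker."""
    return [i for i, word in enumerate(bag_words) if word == marker]


def split_on_marker(bag_words, marker="$"):
    """Split the word list into documents at each marker word."""
    docs, current = [], []
    for word in bag_words:
        if word == marker:
            docs.append(current)
            current = []
        else:
            current.append(word)
    docs.append(current)
    return docs


# Two documents joined by "$", but the first one already contains a "$".
bag_words = ["price", "is", "$", "5", "$", "second", "doc"]

print(boundary_indices(bag_words))      # [2, 4]
print(len(split_on_marker(bag_words)))  # 3, although there are only 2 documents
```

The marker inside the first document is indistinguishable from the real 
separator, so you get three pieces back instead of two.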


> There is no issue. I am segmenting it nicely. I am using annotated
> corpus so applying parse rules.
> 
> The confusion comes next,
> 
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length,
> and using slice. Works perfect.

I don't understand this. What sort of combinations do you think you need 
to support? What are "I" values, and why are they important?



-- 
Steven


