Newbie: text filtering.

Alex Martelli alex at magenta.com
Mon Aug 7 08:52:32 EDT 2000


"Aesop" <me at my.own.computer> wrote in message
news:398EA691.65CAE8D0 at my.own.computer...
> Heya peoples.
>         Just a quick question. Writing a script that takes a line of
> text, searches for a  particular part of a word, then turns that word
> into an entry in a resulting HTML output. For example, a collection of
> pdf files is catalogued by a text file containing filename and
> description. I am searching for the filename on the basis of its
> extension ".pdf" and assuming the rest of the line is the description.
> The output is a line of HTML that comprises a table, complete with link
> to the pdf, and description.

OK, pretty clear problem description... except that I'm not clear on
where the 'filename' _begins_ on each line.  It ends right before the
'.pdf', OK.  But where does it start?  At the start of the line?  That
would make it easiest.  Or else, where?


> Everything is working fine. But on the basis of fixed length columns.
> Which they aren't.
>
> How could I go about say searching for the ".pdf", then working
> backwards to find the start of the word?

Suppose we have the line text in a variable called line.  Then, if the
filename starts at the beginning of the line, and everything after the
'.pdf' is description, the easiest approach is probably:

    import string

    filename, description = string.split(line,'.pdf',1)

which works in Python 1.5.2 (and older and newer ones too!-);
or if you're using Python 1.6 or better, then more simply

    filename, description = line.split('.pdf',1)

Note that with this approach the filename will NOT include the
known '.pdf' part; 'description' keeps any whitespace that follows
the '.pdf', and if the line includes the line-end \n character
at the end, that will also be part of 'description'.
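
As a minimal sketch of the split approach (the sample line is made up
for illustration, and I'm assuming a Python version that has string
methods), here is one way to put the extension back and clean up the
description:

```python
# Hypothetical catalogue line: filename at the start, description after '.pdf'.
line = "report2000.pdf   Annual report for 2000\n"

# Split once on the first '.pdf'; the separator itself is dropped.
stem, rest = line.split('.pdf', 1)

filename = stem + '.pdf'        # put the known extension back
description = rest.strip()      # drop the separating spaces and the '\n'

print(filename)
print(description)
```

If a line contains no '.pdf' at all, the split yields a single piece and
the two-name unpacking raises ValueError, so a bad line will not pass
silently.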


For more complex specs (not 'filename starts at the start of the
line', but more complicated stuff) regular expressions may be
worth the trouble.

Suppose, for example, that 'filename' must be a non-empty
sequence of non-space characters, and that everything before
that in the line (if anything) must be ignored, as must the one
or more spaces separating the .pdf from the start of the
description.  Then, a re-based approach might be:

    # once, at the start:
    import re
    rema = re.compile(r'(\S+\.pdf)\s+(.*)')

    # then in the loop, when you have 'line':
    filename, description = rema.search(line).groups()


Here, I'm ignoring the possibility that invalid lines (ones not
matching the pattern) may be present among those you are
looping on.  You can take several approaches to those, such
as explicit tests or exception-handling; if you "know" there
aren't going to be any, don't worry -- if the "knowledge" turns
out to be false, your program will terminate with an exception
identifying the point of failure, and you can correct your
assumptions at that point.
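
If you would rather deal with invalid lines up front, the explicit-test
approach might look roughly like this. It is only a sketch: the helper
name parse_line and both sample lines are my own inventions for
illustration, not anything from your data.

```python
import re

# Same pattern as above: non-space run ending in '.pdf', spaces, then the rest.
rema = re.compile(r'(\S+\.pdf)\s+(.*)')

def parse_line(line):
    """Return (filename, description), or None if the line does not match."""
    match = rema.search(line)
    if match is None:          # explicit test instead of letting .groups() fail
        return None
    return match.groups()

# Hypothetical sample lines, one valid and one not.
good = parse_line("item 3: manual.pdf   User manual\n")
bad = parse_line("no document mentioned here\n")

print(good)
print(bad)
```

Note that here the filename DOES include the '.pdf', since the pattern
captures it, and the text before the filename is skipped. The
exception-handling alternative is to keep the one-liner and wrap it in
try/except AttributeError, since search returns None on a non-matching
line and None has no groups method.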


Alex





