[Tutor] how to extract data only after a certain condition is met

Josep M. Fontana josep.m.fontana at gmail.com
Sun Oct 10 21:35:01 CEST 2010


Hi,

First let me apologize for taking so long to acknowledge your answers and to
thank you (Eduardo, Peter, Greg, Emile, Joel and Alan, sorry if I left
anyone out) for your help and your time.

One of the reasons I took so long in responding (besides having gotten busy
with some urgent matters related to my work) is that I was a bit embarrassed
at realizing how poorly I had defined my problem.
As Alan said, I should at least have told you which operations were giving
me a headache. So I went back to my Python reference books to try to write
some code and thus be able to define my problems more precisely. Only after
I did that, I said to myself, I would come back to the list with more
specific questions.

The only problem is that doing this made me painfully aware of how little
Python I know. Well, actually my problem is not so much that I don't know
Python as that I have very little experience programming in general. Some
years ago I learned a little Perl and basically I used it to do some text
manipulation using regular expressions but that's all my experience. In
order to learn Python, I read a book called "Beginning Python: From Novice
to Professional", and I was hoping I would learn simply by applying what I
had supposedly picked up from that book to real problems in my project.
But this turned out to be much more
difficult than I had expected. Perhaps if I had worked through the excellent
book/tutorial Alan has written (of which I was not aware when I started), I
would be better prepared to confront this problem.

Anyway (sorry for the long intro), since Emile laid out the problem very
clearly, I will use his outline to point out the problems I'm having:

Emile says:
--------------
Conceptually, you'll need to:

  -a- get the list of file names to change then for each
  -b- determine the new name
  -c- rename the file

For -a- you'll need glob. For -c- use os.rename.  -b- is a bit more
involved.  To break -b- down:

  -b1- break out the x-xx portion of the file name
  -b2- look up the corresponding year in the other file
  -b3- convert the year to the century-half structure
  -b4- put the pieces together to form the new file name

For -b2- I'd suggest building a dictionary from your second files
contents as a first step to facilitate the subsequent lookups.

---------------------

OK. Let's start with -b- . My first problem is that I don't really know how
to go about building a dictionary from the file with the comma separated
values. I've discovered that if I use a file method called 'readlines' I can
create a list whose elements are the lines of the document, each one
consisting of a code followed by a comma followed by the year. Thus if
I do:

fileNameCentury = open(
    r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'
).readlines()

Where 'FileNamesYears.txt' is the document with the following info:

A-01, 1278
A-02, 1501
...
N-09, 1384

I get a list of the form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09,
...]

Would this be a good first step to creating a dictionary? It seems to me
that I should be able to iterate over this list in some way and make the
substring before the comma the key and the substring after the comma its
value. The problem is that I don't know how. The book I read has not
prepared me for this. I have the feeling that all the pieces of knowledge I
need to solve the problem were there, but I don't know how to put them
together. Greg mentioned the csv module. I checked the references but I
could not see any way in which I could create a dictionary using that
module, either.
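
For reference, here is a minimal sketch of one way the dictionary could be
built, assuming every non-blank line has the form "code, year". The function
name build_year_lookup and the sample string are made up for illustration;
they do not come from the book or from any reply on the list. splitlines()
is used because the list shown above contains '\r' separators, and
splitlines() treats '\r', '\n' and '\r\n' alike.

```python
def build_year_lookup(text):
    """Map each file code (e.g. 'A-01') to its year as an integer."""
    years = {}
    # splitlines() splits on '\r', '\n' and '\r\n', so old Mac line
    # endings do not leave everything in one long string.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Substring before the comma is the key, the rest is the value.
        code, year = line.split(',', 1)
        years[code.strip()] = int(year.strip())
    return years


# Hypothetical sample standing in for the contents of FileNamesYears.txt.
sample = 'A-01, 1278\rA-02, 1501\rN-09, 1384'
print(build_year_lookup(sample))
```

With a real file, the text argument would be whatever open(...).read()
returns for the file with the codes and years.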

Once I have the dictionary built, what I would have to do is use the os
module (or would it be the glob module?) to get a list of the file names I
want to change and build another loop that would iterate over those file
names and, if the first part of the name (possibly represented by a regular
expression of the form r'[A-Z]-[0-9]+') matches one of the keys in the
dictionary, then a) it would get the value for that key, b) would do the
numerical calculation to determine whether it is the first part of the
century or the second part and c) would insert the string representing this
result right before the extension .txt.
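
The lookup and the calculation described above could be sketched roughly as
follows. The names CODE_RE, century_half and suffix_for are hypothetical, and
the rule is the one from the original message quoted below (first two digits
+ 1 = century; last two digits > 50 means second half). That message does not
say what happens when the last two digits are exactly 50, so counting 50 as
first half is an assumption.

```python
import re

# Pattern from the message: a capital letter, a hyphen, then digits.
CODE_RE = re.compile(r'[A-Z]-[0-9]+')


def century_half(year):
    """Turn a year such as 1278 into a string such as '13-2'."""
    century = year // 100 + 1
    # Assumption: a year ending in exactly 50 counts as first half.
    half = 2 if year % 100 > 50 else 1
    return '%d-%d' % (century, half)


def suffix_for(filename, years):
    """Return the century-half string for filename, or None if unknown."""
    match = CODE_RE.match(filename)  # match() only looks at the start
    if match is None:
        return None
    year = years.get(match.group(0))
    if year is None:
        return None  # code not present in the dictionary
    return century_half(year)


print(suffix_for('A-01-namex.txt', {'A-01': 1278}))
```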

In the abstract it sounds easy, but I don't even know how to start.  Doing
some testing with glob I see that it returns a list of strings representing
the whole paths to all the files whose names I want to manipulate. But in
the reference documents that I have consulted, I see no way to change those
names. How do I go about inserting the information about the century right
before the substring '.txt'?
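
One possible way to do the insertion is to split each path into the part
before the extension and the extension itself with os.path.splitext, build
the new path, and hand both to os.rename. The sketch below assumes the
century string (e.g. '13-2') has already been worked out; add_suffix and
rename_all are made-up names, and the dry_run flag is only there so that
nothing is renamed until the printed output has been checked.

```python
import glob
import os


def add_suffix(path, suffix):
    """Return path with '_' + suffix inserted right before the extension."""
    root, ext = os.path.splitext(path)  # ('.../A-01-namex', '.txt')
    return root + '_' + suffix + ext


def rename_all(folder, get_suffix, dry_run=True):
    """Rename the .txt files in folder.

    get_suffix is any function that takes a bare file name and returns
    the string to insert, or None to leave that file alone.
    """
    for old_path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        suffix = get_suffix(os.path.basename(old_path))
        if suffix is None:
            continue
        new_path = add_suffix(old_path, suffix)
        print(old_path, '->', new_path)
        if not dry_run:
            os.rename(old_path, new_path)


print(add_suffix('A-01-namex.txt', '13-2'))
```

Running rename_all with dry_run=True first only prints the planned changes,
which seems a sensible precaution before touching a whole corpus.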

As you see, I am very green. My embarrassment at realizing how basic my
problems were made me delay writing another message but I decided that if I
don't do it, I will never learn.

Again, thanks so much for all your help.

Josep M.




> Message: 2
> Date: Sat, 2 Oct 2010 17:56:53 +0200
> From: "Josep M. Fontana" <josep.m.fontana at gmail.com>
> To: tutor at python.org
> Subject: [Tutor] Using contents of a document to change file names
> Message-ID:
>        <AANLkTikjOFYhieL70E=-BaE_PEdc0nG+iGY3j+qO+FMZ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> This is my first posting to this list. Perhaps this has a very easy answer
> but before deciding to post this message I consulted a bunch of Python
> manuals and on-line reference documents to no avail. I would be very
> grateful if someone could lend me a hand with this.
>
> Here's the problem I want to solve. I have a lot of files with the
> following name structure:
>
> A-01-namex.txt
> A-02-namey.txt
> ...
> N-09-namez.txt
>
> These are different text documents that I want to process for an NLP
> project I'm starting. Each one of the texts belongs to a different century
> and it is important to be able to include the information about the
> century in the name of the file as well as inside the text.
>
> Then I have another text file containing information about the century each
> one of the texts was written. This document has the following structure:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> What I would like to do is to write a little script that would do the
> following:
>
> . Read each row of the text containing information about the centuries each
> one of the texts was written
> . Change the name of the file whose name starts with the code in the first
> column in the following way
>
>        A-01-namex.txt --> A-01-namex_13-2.txt
>
>    Where 13-2 means: 13th century, 2nd half. Obviously this information
> would come from the second column in the text: 1278 (the first two digits
> + 1 = century; if the 3rd and 4th digits > 50, then 2; if < 50 then 1)
>
> Then in the same script or in a new one, I would need to open each one of
> the texts and add information about the century they were written on the
> first line preceded by some symbol (e.g @13-2)
>
> I've found a lot of information about changing file names (so I know that I
> should be importing the os module), but none of the examples that were
> cited involved getting the information for the file changing operation
> from the contents of a document.
>
> As you can imagine, I'm pretty green in Python programming and I was hoping
> the learn by doing method would work.  I need to get on with this project,
> though, and I'm kind of stuck. Any help you guys can give me will be very
> helpful.
>
> Josep M.
>
>
>