[Tutor] Using contents of a document to change file names, (was Re: how to extract data only after a certain ...)

Josep M. Fontana josep.m.fontana at gmail.com
Mon Oct 11 13:46:53 CEST 2010


 Sorry about the confusion with the subject line. I was receiving messages
in digest mode and I copied and pasted the wrong heading in my previous
message. Now I have written the heading corresponding to my initial message.
I have also changed the settings for this list from the digest mode to the
default mode because it is easier to manage if you are participating in
threads.

OK, first thanks Emile and Bob for your help.

Both of you noticed that the following line of code returned a single string
(wrapped in a one-element list) instead of a list of lines, as would be
expected from using .readlines():

open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt').readlines()

returns --> ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...']

Yes, I had not noticed it but this is what I get. You guessed correctly that
I am using a Mac. Just in case it might be useful, I'm also using PyDev in
Eclipse (I figured since I'm learning to program, I can start using an IDE
that will grow with my programming skills).

I tried your suggestion of using .split() to get around the problem but I
still cannot move forward. I don't know if my implementation of your
suggestion is the correct one but here's the problem I'm having. When I do
the following:

-----------------

fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'.split('\r'))
dct = {}
for pair in fileNameCentury:
    key,value = pair.split(',')
    dct[key] = value
print dct
--------------

I get the following long error message:

-------------------

pydev debugger: warning: psyco not available for speedups (the debugger will still work correctly, but a bit slower)
pydev debugger: starting
Traceback (most recent call last):
  File "/Applications/eclipse/plugins/org.python.pydev.debug_1.6.3.2010100513/pysrc/pydevd.py", line 1145, in <module>
    debugger.run(setup['file'], None, None)
  File "/Applications/eclipse/plugins/org.python.pydev.debug_1.6.3.2010100513/pysrc/pydevd.py", line 916, in run
    execfile(file, globals, locals) #execute the script
  File "/Volumes/DATA/Documents/workspace/GCA/src/file_name_change.py", line 2, in <module>
    fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'.split('\n'))
TypeError: coercing to Unicode: need string or buffer, list found
------------
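
A note on the traceback above: the .split() call is attached to the path
string, so open() receives a list rather than a file name, which is what the
TypeError reports. A minimal sketch of reading the contents first and
splitting afterwards could look like this. The temporary file and its sample
contents are only stand-ins for FileNamesYears.txt (an assumption for
illustration), and io.open with newline='' is used so that the '\r'
terminators are left untouched:

```python
import io
import os
import tempfile

# Hypothetical stand-in for FileNamesYears.txt, using bare '\r' line endings.
sample = u"A-01,1374\rA-02,1499\rA-05,1449"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", newline="") as handle:
    handle.write(sample)

# Open the file by its path (a string), read the contents, then split them.
with io.open(path, newline="") as handle:
    text = handle.read()
os.remove(path)

lines = text.split(u"\r")
print(lines)
```

The point is only the order of operations: open(path), then .read(), then
.split('\r') on the resulting string.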

Before reporting this problem, I did some research on the newline problems
and I saw that you can set the mode in open() to 'U' to handle similar
problems. So I tried the following:

   >>> fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt', "U")
   >>> output = fileNameCentury.readlines()
   >>> print output


Interestingly I get closer to the solution but with a little twist:

['A-01,1374\n', 'A-02,1499\n', 'A-05,1449\n', 'A-06,1374\n', 'A-09,1449\n',
'B-01,1299\n', 'B-02,1299\n', 'B-06,1349\n'...]

That is, now I do get a list but as you can see I get the newline character
as part of each one of the strings in the list. This is pretty weird. Is
this a general problem with Macs?
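
On the trailing '\n': readlines() keeps the line terminator on every line on
all platforms, so this looks like expected behaviour rather than something
specific to Macs. A minimal sketch of removing it before building the
dictionary, using a few of the values from the output above (the variable
names are only illustrative):

```python
# A few entries copied from the readlines() output shown above.
output = ['A-01,1374\n', 'A-02,1499\n', 'A-05,1449\n']

dct = {}
for line in output:
    # rstrip removes the trailing line terminator kept by readlines().
    key, value = line.rstrip('\r\n').split(',')
    dct[key] = value

print(dct)
```

Without the rstrip() call, each value would still end in '\n'.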


Josep M.



From: Emile van Sebille <emile at fenx.com>
To: tutor at python.org

On 10/10/2010 12:35 PM Josep M. Fontana said...
<snip>
>
> fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt').readlines()
>
> Where 'FileNamesYears.txt' is the document with the following info:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> I get a list of the form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...]
>
> Would this be a good first step to creating a dictionary?

Hmmm... It looks like you got a single string -- is that the output from
read and not readlines?  I also see you're just getting \r which is the
Mac line terminator.  Are you on a Mac, or was 'FileNamesYears.txt'
created on a Mac?  Python's readlines tries to be smart about which
line terminator to expect, so if there's a mismatch you could have
issues related to that.  I would have expected you'd get something more
like: ['A-01,1374\r','A-02,1499\r','A-05,1449\r','A-06,1374\r','A-09, ...]

In any case, as you're getting a single string, you can split a string
into pieces, for example, print "1\r2\r3\r4\r5".split("\r").  That way
you can force creation of a list of strings following the format
"X-NN,YYYY" each of which can be further split with xxx.split(",").
Note as well that you can assign the results of split to variable names.
For example, ky,val = "A-01, 1278".split(",") sets ky to "A-01" and val
to " 1278" (with the leading space, which val.strip() removes).  So, you
should be able to create an empty dict, and for each
line in your file set the dict entry for that line.
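
The approach described above can be sketched roughly as follows. The sample
string is made up from the entries quoted earlier, and the strip() calls are
an added assumption to drop the space that follows the comma in lines such as
"A-01, 1278":

```python
# Made-up sample in the shape of the single string returned earlier.
raw = "A-01, 1278\rA-02, 1501\rN-09, 1384"

dct = {}
for line in raw.split("\r"):
    if not line:
        continue  # skip an empty piece left by a trailing '\r'
    ky, val = line.split(",")
    dct[ky.strip()] = val.strip()

print(dct)
```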

Why don't you start there and show us what you get.

HTH,

Emile