Python text file fetch specific part of line
cs at zip.com.au
Thu Jul 28 03:04:02 EDT 2016
On 27Jul2016 22:12, Arshpreet Singh <arsh840 at gmail.com> wrote:
>I am writing Imdb scrapper, and getting available list of titles from IMDB
>website which provide txt file in very raw format, Here is the one part of
>file(http://pastebin.com/fpMgBAjc) as the file provides tags like Distribution
>Votes,Rank,Title I want to parse title names, I tried with readlines() method
>but it returns only list which is quite heterogeneous, is it possible that I
>can parse each value comes under title section?
Just for etiquette: please just post text snippets like that inline in your
text. Some people don't like fetching random URLs, and some of us are not
always online when reading and replying to email. Either way, having the text
in the message, especially when it is small, is preferable.
To your question:
Your sample text looks like this:
New Distribution Votes Rank Title
0000000125 1680661 9.2 The Shawshank Redemption (1994)
0000000125 1149871 9.2 The Godfather (1972)
0000000124 786433 9.0 The Godfather: Part II (1974)
0000000124 1665643 8.9 The Dark Knight (2008)
0000000133 860145 8.9 Schindler's List (1993)
0000000133 444718 8.9 12 Angry Men (1957)
0000000123 1317267 8.9 Pulp Fiction (1994)
0000000124 1209275 8.9 The Lord of the Rings: The Return of the King (2003)
0000000123 500803 8.9 Il buono, il brutto, il cattivo (1966)
0000000133 1339500 8.8 Fight Club (1999)
0000000123 1232468 8.8 The Lord of the Rings: The Fellowship of the Ring (2001)
0000000223 832726 8.7 Star Wars: Episode V - The Empire Strikes Back (1980)
0000000233 1243066 8.7 Forrest Gump (1994)
0000000123 1459168 8.7 Inception (2010)
0000000223 1094504 8.7 The Lord of the Rings: The Two Towers (2002)
0000000232 676479 8.7 One Flew Over the Cuckoo's Nest (1975)
0000000232 724590 8.7 Goodfellas (1990)
0000000233 1211152 8.7 The Matrix (1999)
Firstly, I would suggest you not use readlines(): it pulls all the text into
memory at once. For a small text like this it is OK, but some files can be
arbitrarily large, so it is something to avoid when convenient. Normally you
can just iterate over the file object and get the lines one at a time.
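As a small sketch of what iterating over a file looks like: the function name `count_lines` and the file name `ratings.txt` in the comment are made up for illustration, and `io.StringIO` merely stands in for a real open file.

```python
import io

def count_lines(fp):
    """Count lines by iterating; only one line is held in memory at a time."""
    count = 0
    for line in fp:  # a file object yields one line per iteration
        count += 1
    return count

# io.StringIO stands in for a real file here. With a real file you would
# write something like:  with open("ratings.txt") as fp: ...
# ("ratings.txt" is a hypothetical name.)
sample = io.StringIO("first line\nsecond line\nthird line\n")
print(count_lines(sample))
```

The same loop works unchanged on a file of any size, which is the point of preferring it to readlines().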
You want "text under the Title." Looking at it, I would be inclined to say that
the first line is a header and the rest consist of 4 columns: a number
(distribution?), a vote count, a rank and the rest (title plus year).
You can parse data of that shape like this (untested):
    # presumes `fp` is reading from the text
    for n, line in enumerate(fp):
        if n == 0:
            # heading, skip it
            continue
        # split on whitespace at most 3 times; `etc` is the title plus year
        distnum, nvotes, rank, etc = line.rstrip().split(None, 3)
        # ... do stuff with the various fields ...
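To make that more concrete, here is a sketch that also separates the title from the year. It assumes every data line has the four whitespace-separated columns described above and that the title ends with a four-digit year in parentheses. The names `Rating`, `TITLE_YEAR_RE` and `parse_ratings` are invented for this example, and I have only tried it against lines like the sample.

```python
import io
import re
from collections import namedtuple

# Hypothetical record type for one parsed line.
Rating = namedtuple("Rating", "distribution votes rank title year")

# A title followed by a 4-digit year in parentheses at the end of the field.
TITLE_YEAR_RE = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)$")

def parse_ratings(fp):
    """Yield a Rating for each data line of `fp`, skipping the header line."""
    for n, line in enumerate(fp):
        if n == 0:
            # heading, skip it
            continue
        fields = line.strip().split(None, 3)
        if len(fields) != 4:
            # blank or malformed line, skip it
            continue
        distnum, nvotes, rank, etc = fields
        m = TITLE_YEAR_RE.match(etc)
        if m:
            title, year = m.group("title"), int(m.group("year"))
        else:
            # no trailing "(YYYY)": keep the whole field as the title
            title, year = etc, None
        yield Rating(distnum, int(nvotes), float(rank), title, year)

# io.StringIO stands in for the real file.
sample = io.StringIO(
    "New Distribution Votes Rank Title\n"
    "0000000125 1680661 9.2 The Shawshank Redemption (1994)\n"
    "0000000133 444718 8.9 12 Angry Men (1957)\n"
)
for rating in parse_ratings(sample):
    print(rating.title, rating.year)
```

Because `parse_ratings` is a generator, it still reads the file one line at a time. If you only want the titles, collect `rating.title` from each record.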
I hope that gets you going. If not, return with what code you have, what
happened, and what you actually wanted to happen and we may help further.
Cheers,
Cameron Simpson <cs at zip.com.au>