Python text file fetch specific part of line
honeygne at gmail.com
Tue Aug 2 02:55:18 EDT 2016
On Thursday, July 28, 2016 at 1:00:17 PM UTC+5:30, c... at zip.com.au wrote:
> On 27Jul2016 22:12, Arshpreet Singh <arsh840 at gmail.com> wrote:
> >I am writing an IMDb scraper and getting the available list of titles from
> >the IMDb website, which provides a txt file in a very raw format. Here is one
> >part of the file (http://pastebin.com/fpMgBAjc). The file provides columns
> >like Distribution, Votes, Rank, Title, and I want to parse the title names. I
> >tried the readlines() method, but it returns only a list, which is quite
> >heterogeneous. Is it possible to parse each value that comes under the Title
> >section?
>
> Just for etiquette: please just post text snippets like that inline in your
> text. Some people don't like fetching random URLs, and some of us are not
> always online when reading and replying to email. Either way, having the text
> in the message, especially when it is small, is preferable.
>
> To your question:
>
> Your sample text looks like this:
>
> New Distribution Votes Rank Title
> 0000000125 1680661 9.2 The Shawshank Redemption (1994)
> 0000000125 1149871 9.2 The Godfather (1972)
> 0000000124 786433 9.0 The Godfather: Part II (1974)
> 0000000124 1665643 8.9 The Dark Knight (2008)
> 0000000133 860145 8.9 Schindler's List (1993)
> 0000000133 444718 8.9 12 Angry Men (1957)
> 0000000123 1317267 8.9 Pulp Fiction (1994)
> 0000000124 1209275 8.9 The Lord of the Rings: The Return of the King (2003)
> 0000000123 500803 8.9 Il buono, il brutto, il cattivo (1966)
> 0000000133 1339500 8.8 Fight Club (1999)
> 0000000123 1232468 8.8 The Lord of the Rings: The Fellowship of the Ring (2001)
> 0000000223 832726 8.7 Star Wars: Episode V - The Empire Strikes Back (1980)
> 0000000233 1243066 8.7 Forrest Gump (1994)
> 0000000123 1459168 8.7 Inception (2010)
> 0000000223 1094504 8.7 The Lord of the Rings: The Two Towers (2002)
> 0000000232 676479 8.7 One Flew Over the Cuckoo's Nest (1975)
> 0000000232 724590 8.7 Goodfellas (1990)
> 0000000233 1211152 8.7 The Matrix (1999)
>
> Firstly, I would suggest you not use readlines(): it pulls all the text into
> memory. For small text like this it is OK, but some inputs can be arbitrarily
> large, so it is something to avoid if convenient. Normally you can just iterate
> over a file and get lines.
>
> You want "text under the Title." Looking at it, I would be inclined to say that
> the first line is a header and the rest consist of 4 columns: a number
> (distribution?), a vote count, a rank and the rest (title plus year).
>
> You can parse data like that like this (untested):
>
> # presumes `fp` is reading from the text
> for n, line in enumerate(fp):
>     if n == 0:
>         # heading, skip it
>         continue
>     # split on whitespace at most 3 times; the 4th field keeps the full title
>     distnum, nvotes, rank, etc = line.split(None, 3)
>     # ... do stuff with the various fields ...
>
> I hope that gets you going. If not, return with what code you have, what
> happened, and what you actually wanted to happen and we may help further.
Thanks, I was able to do it with the following script, which was very helpful:
https://github.com/alberanid/imdbpy/blob/master/bin/imdbpy2sql.py
python imdbpy2sql.py -d <.txt files downloaded from IMDB> -u sqlite:/where/to/save/db --sqlite-transactions
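
For reference, the line-by-line approach suggested in the quoted reply can be sketched roughly as below. This is only a minimal sketch and is not taken from imdbpy2sql.py. It assumes every data row sits on a single line with four whitespace-separated fields, and that the title ends with a four-digit year in parentheses. The names parse_ratings, TITLE_YEAR and SAMPLE are made up for illustration, and SAMPLE is just a few rows copied from the quoted text.

```python
import io
import re

# Illustrative sample: a few rows copied from the quoted text.
SAMPLE = """\
New Distribution Votes Rank Title
0000000125 1680661 9.2 The Shawshank Redemption (1994)
0000000124 786433 9.0 The Godfather: Part II (1974)
0000000133 444718 8.9 12 Angry Men (1957)
"""

# Assumption: the title is followed by a space and a 4-digit year in parentheses.
TITLE_YEAR = re.compile(r"^(?P<title>.*) \((?P<year>\d{4})\)$")


def parse_ratings(fp):
    """Yield (distribution, votes, rank, title, year) for each data row."""
    for n, line in enumerate(fp):
        if n == 0:
            continue  # heading, skip it
        # Split on whitespace at most 3 times so the title stays in one piece.
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # blank or malformed line
        dist, votes, rank, rest = fields
        rest = rest.strip()  # the last field keeps its trailing newline
        m = TITLE_YEAR.match(rest)
        if m:
            title, year = m.group("title"), int(m.group("year"))
        else:
            title, year = rest, None  # leave unusual titles untouched
        yield dist, int(votes), float(rank), title, year


# Iterating over the file object reads one line at a time, unlike readlines().
rows = list(parse_ratings(io.StringIO(SAMPLE)))
titles = [row[3] for row in rows]
```

With a real file, `open(path)` would take the place of `io.StringIO(SAMPLE)`. Rows that do not split into four fields are skipped rather than guessed at.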