[Tutor] Web scraping using selenium and navigating nested dictionaries / lists.

mhysnm1964 at gmail.com mhysnm1964 at gmail.com
Sun Jan 27 06:24:30 EST 2019


Peter,

I am aware that I am avoiding functions that could make my life easier, but I
want to learn some of these data-structure navigation concepts to improve my
programming skills. I will review what you have provided in depth and have a
play with it.

A big thanks.


-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Peter Otten
Sent: Sunday, 27 January 2019 10:13 PM
To: tutor at python.org
Subject: Re: [Tutor] Web scraping using selenium and navigating nested
dictionaries / lists.

mhysnm1964 at gmail.com wrote:

> All,
> 
>  
> 
> Goal of new project.
> 
> I want to scrape all the books I have purchased from Audible.com.
> Eventually I want to export this as a CSV file or maybe JSON; I have
> not got that far yet. The reasoning behind this is to learn Selenium
> for my work and to get the list of books I have purchased, killing two
> birds with one stone. The work focus is to see if Selenium can automate
> some of the testing I have to do and collect useful information from
> the web page for my reports. That part of the goal is in the future, as
> I need to build my Python skills up first.
> 
>  
> 
> Thus far, I have been successful in logging into Audible and showing
> the library of books. I am able to store the table of books and want
> to use BeautifulSoup to extract the relevant information. The
> information I want from the table is:
> 
> *	Author
> *	Title
> *	Date purchased
> *	Length
> *	Whether the book is in a series (there is a link for this)
> *	Link to the page storing the publishing details
> *	Download link
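
For the eventual CSV export mentioned above, the standard library's
csv.DictWriter is a natural fit once each book is a dict keyed by those
fields. A minimal sketch, with made-up sample data and only a few of the
fields for brevity:

```python
import csv

# Hypothetical rows -- one dict per book, keyed by the fields listed above.
books = [
    {"author": "A. Writer", "title": "Some Book", "length": "10h 5m"},
]

with open("audible_library.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "title", "length"])
    writer.writeheader()   # first row: the column names
    writer.writerows(books)
```
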
> 
> Hopefully this has given you enough information on what I am trying to
> achieve at this stage. As I learn more about what I am doing, I am
> adding possible extra tasks, such as verifying whether I have already
> downloaded the book via iTunes.
> 
>  
> 
> Learning goals:
> 
> Using the BeautifulSoup structure that I have extracted from the page
> source for the table, I want to navigate the tree structure.
> BeautifulSoup provides children, siblings and parents methods. This is
> where I get stuck with programming logic. BeautifulSoup does provide
> the find_all method plus selectors, which I do not want to use for
> this exercise, as I want to learn how to walk a tree starting at the
> root and visiting each node.

I think you make your life harder than necessary if you avoid the tools
provided by the library you are using.

> Then I can look at the attributes for each tag as I go. I believe I
> have to set up a recursive loop or function call, but I am not sure
> how to do this. Pseudo code:
> 
>  
> 
> Build table structure
> 
> Start at the root node.
> 
> Check to see if there are any children.
> 
> Pass first child to function.
> 
> Print attributes for tag at this level
> 
> In function, check for any sibling nodes.
> 
> If one exists, call the function again
> 
> If there are no siblings, then start at the first sibling and get its
> child.
> 
>  
> 
> This is where I get stuck. Each sibling can have children, and they
> can have siblings. So how do I ensure I visit each node in the tree?

The problem with your description is that siblings do not matter. Just

- process root
- iterate over its children and call the function recursively with every
  child as the new root.

To make the function more useful you can pass a function instead of hard-
coding what you want to do with the elements. Given

def process_elements(elem, do_stuff):
    do_stuff(elem)
    for child in elem.children:
        process_elements(child, do_stuff)

you can print all elements with

soup = BeautifulSoup(...)
process_elements(soup, print)

and

process_elements(soup, lambda elem: print(elem.name))

will print only the names.

You need a bit of error checking to make it work, though.
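
For example, leaf nodes (BeautifulSoup's NavigableString objects) have no
.children attribute, so the recursion has to stop at them. One way to
sketch that check is with getattr; the snippet below uses a stand-in tree
instead of a real soup so it is self-contained:

```python
from types import SimpleNamespace

def process_elements(elem, do_stuff):
    do_stuff(elem)
    # Leaf nodes (NavigableString in BeautifulSoup) have no .children
    # attribute; fall back to an empty tuple so the recursion stops there.
    for child in getattr(elem, "children", ()):
        process_elements(child, do_stuff)

# Stand-in tree: namespaces play the role of tags, plain strings are leaves.
row = SimpleNamespace(name="tr", children=["Author", "Title"])
table = SimpleNamespace(name="table", children=[row])

names = []
process_elements(table, lambda e: names.append(getattr(e, "name", "#text")))
print(names)  # ['table', 'tr', '#text', '#text']
```
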

But wait -- Python's generators let you rewrite process_elements so that you
can use it without a callback:

def gen_elements(elem):
    yield elem
    for child in elem.children:
        yield from gen_elements(child)

for elem in gen_elements(soup):
    print(elem.name)

Note that 'yield from iterable' is a shortcut for 'for x in iterable: yield
x', so there are actually two loops in gen_elements().
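
The same traversal pattern works on any tree, not just a soup. A
self-contained toy version over nested lists shows the depth-first order:

```python
def gen_nodes(node):
    yield node
    # Only lists have children here; strings are leaves.
    if isinstance(node, list):
        for child in node:
            yield from gen_nodes(child)

tree = ["a", ["b", "c"], "d"]
# Filter out the list nodes to see just the leaves, in visiting order.
print([n for n in gen_nodes(tree) if not isinstance(n, list)])
# ['a', 'b', 'c', 'd']
```
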

> I would be grateful for any tips or tricks for this, as I could use
> them in other situations.


_______________________________________________
Tutor maillist  -  Tutor at python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


