[Tutor] XML Programs

Thu Apr 19 12:57:02 EDT 2018

Glen wrote:

> Hey guys,
> 
> I have the following code:
> 
> https://repl.it/@glendog/HurtfulPunctualInterface

from lxml import etree
catalog = etree.parse("example.xml")

def getbooks(xmldata):
    books = xmldata.xpath("//catalog")[0]
    for item in books:
        print(item.findtext("title"))

getbooks(catalog)

> Using the function I have define I can print to screen a list of books.
> However, how can I search for records within the xml using an ID or the
> title of the book etc? I've tried reading the tutorial but the penny is
> not dropping.

As a rule of thumb do not print() anything in functions that process your 
data. Have them return something instead, like a list of titles that may be 
searched, sorted, or filtered later. The final result of this can then be 
printed or displayed in a GUI or webpage, written to a file, or saved in a 
database.

In that spirit I would rewrite your getbooks() function as

def get_book_titles(xmldata):
    books = xmldata.xpath("//catalog")[0]
    return [book.findtext("title") for book in books]

Now you have something to work with. You provide an xml tree and get a list 
of book titles. As long as you don't change that you can rewrite the 
implementation without breaking the rest of your script.
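
The same return-something approach also covers the lookup by ID you asked 
about. Below is a minimal sketch, not a definitive solution: it assumes each 
<book> element carries an "id" attribute, the name find_book_by_id() and the 
sample data are made up, and it uses the standard library's 
xml.etree.ElementTree so that it runs without example.xml. The iter(), get() 
and findtext() calls it relies on are available in lxml.etree as well.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the real example.xml may be laid out differently.
SAMPLE = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""


def find_book_by_id(xmldata, book_id):
    """Return the first <book> element whose id attribute equals book_id.

    Returns None if there is no such element.
    """
    for book in xmldata.iter("book"):
        if book.get("id") == book_id:
            return book
    return None


catalog = ET.fromstring(SAMPLE)
book = find_book_by_id(catalog, "bk102")
# Compare with None explicitly; an element without children is falsy.
print(book.findtext("title") if book is not None else "not found")
```

The function returns the element rather than printing it, so the caller 
decides what to do with the match.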

Now let's suppose we want to find all titles containing some words the user 
can provide. We break the task into tiny subtasks:

- break a string into words
- search a string for words
- filter a list of titles

def get_words(title):
    return title.split()

def has_words(title, words):
    title_words = get_words(title)
    return all(word in title_words for word in words)

def find_matching_titles(titles, words):
    return [title for title in titles if has_words(title, words)]

We can check if the above works with a few lines of code:

catalog = etree.parse("example.xml")
titles = get_book_titles(catalog)
print(find_matching_titles(titles, ["Guide"]))

Seems to work. What we really should be doing is writing unit tests for 
every function. If it turns out that our program sometimes doesn't work as 
we would like it to, we can identify the faulty function and improve only 
that.

Let's say you want to make the search case-insensitive. That should be 
possible by having get_words() return case-folded strings.
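
Here is a rough sketch of both ideas together: a case-folding get_words() 
and a few unit tests for the three helpers. The test class name and the 
sample titles are invented for illustration, and has_words() takes the title 
first, which is the order used when it is called from 
find_matching_titles().

```python
import unittest


def get_words(title):
    # casefold() is a more aggressive lower(), meant for caseless matching
    return title.casefold().split()


def has_words(title, words):
    title_words = get_words(title)
    return all(word in title_words for word in words)


def find_matching_titles(titles, words):
    return [title for title in titles if has_words(title, words)]


class SearchTests(unittest.TestCase):
    # Made-up titles, only used by the tests.
    titles = ["XML Developer's Guide", "Midnight Rain"]

    def test_get_words_folds_case(self):
        self.assertEqual(get_words("Midnight RAIN"), ["midnight", "rain"])

    def test_has_words_needs_all_words(self):
        self.assertTrue(has_words("Midnight Rain", get_words("rain MIDNIGHT")))
        self.assertFalse(has_words("Midnight Rain", get_words("rain guide")))

    def test_find_matching_titles(self):
        self.assertEqual(
            find_matching_titles(self.titles, get_words("GUIDE")),
            ["XML Developer's Guide"],
        )
```

Note that the search words have to go through get_words() too, otherwise an 
upper-case search word will never match a case-folded title. The tests can 
be run with "python -m unittest".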

Finally we can add a simple user interface:

def lookup_title(titles):
    while True:
        try:
            words = get_words(input("Print titles containing all words: "))
        except EOFError:
            break
        matching_titles = find_matching_titles(titles, words)
        if matching_titles:
            for i, title in enumerate(matching_titles):
                print(i, title)
        else:
            print("no matches")
    print()
    print("That's all, folks")

if __name__ == "__main__":
    catalog = etree.parse("example.xml")
    titles = get_book_titles(catalog)
    lookup_title(titles)

If instead of just titles you want to process book objects

class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author

you can only reuse some of the functions, but you can keep the structure of 
the script. For example get_book_titles() could be replaced with

def get_books(xmldata):
    books = xmldata.xpath("//catalog")[0]
    return [
        Book(
            book.findtext("title"),
            book.findtext("author")
        )
        for book in books
    ]

and the filter function could be modified

def find_matching_titles(books, words):
    return [book for book in books if has_words(book.title, words)]

You may even change the above to search words in an arbitrary attribute of 
the book instance

def find_matching_books(books, words, attribute):
    return [
        book for book in books
        if has_words(getattr(book, attribute), words)
    ]
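
To see the attribute-based filter in action without any XML, here is a small 
self-contained sketch. The two Book instances are made-up data, the 
__repr__() method is my addition to make the output readable, and 
has_words() again takes the text to search first.

```python
class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __repr__(self):
        return "Book({!r}, {!r})".format(self.title, self.author)


def get_words(text):
    return text.split()


def has_words(text, words):
    text_words = get_words(text)
    return all(word in text_words for word in words)


def find_matching_books(books, words, attribute):
    return [
        book for book in books
        if has_words(getattr(book, attribute), words)
    ]


# Made-up books, for illustration only.
books = [
    Book("XML Developer's Guide", "Ann Example"),
    Book("Midnight Rain", "Kim Sample"),
]

print(find_matching_books(books, ["Rain"], "title"))
# [Book('Midnight Rain', 'Kim Sample')]
print(find_matching_books(books, ["Ann"], "author"))
# [Book("XML Developer's Guide", 'Ann Example')]
```

Passing the attribute as a string keeps the filter generic, at the price of 
an AttributeError if you misspell the name.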