[Tutor] code review

Wed Jun 11 21:37:55 CEST 2014

Ok, not so bad, hoewer there are some parts of the code that could be 
done a bit cleaner.
I'll write them below in the response.

>>>> Thanks for the reply Steven.  It's no more than 100 lines at a
>>>> guess
>>> In that case just copy and paste it into a message and send it to
>>> the group. Anyone with time available can then take a peek.
>> One way noobs anywhere can learn is by listening in to other people's
>> conversations - it's called lurking, I believe.
>>
>> So I would say, please do this on the list, and many more people than
>> Adam may benefit. Others can ignore the thread if they wish.
>>
>> Bob
> Oke doke, here it is below.  Just for convenience's sake, I'm going to
> repeat what the basic steps are.  It's a backup script for certain xen
> virtual machines ("VM") running on my server.  Each VM runs on its own
> logical volume (as opposed to a file-based loop device).  From my own
> (bitter) experience, the absolutely best way to back up a VM running on
> a logical volume is to clone it to an image file using dd.  I'm aware
> that a separate discussion could be had around this (on a different
> mailing list) but, unless someone thinks this is a horribly flawed
> approach, it may be best to assume this approach is 'fine' so as not to
> distract from the code review!!
>
> Here are the steps:
> 1) create snapshots of the xen logical volumes using the built in
> snapshot feature of LVM2 (this way I can backup each logical volume
> without having to shut down the VM)
> 2) dd and bzip2 (using a pipe) the snapshots to .img.bz2 files for
> storage on the same server
> 3) gpg encrypt the same files and upload them to Amazon s3
> 4) remove the logical volume snapshots (because they accumulate disk
> space and I'm doing this daily) and the .gpg files
> 5) deletes files in the s3 directory which are older than X days
>
> As I've mentioned, I'm a real noob, so I'm still mastering some basic
> stuff.  The script works fine for my purposes, I'm keen to understand
> where it could be improved from a python pov.  Finally, yes I could have
> written this in bash but I prefer python!
>
> P.S. I think some of the comments have been wrapped onto more than one
> line by my email client, I hope this doesn't cause too much inconvenience.
> ====================================
>
> #!/usr/bin/python3
>
> ############################################
> ## XEN VIRTUAL MACHINE BACKUP SCRIPT
> ##
> ## Copyright (C) 2014 Adam Gold
> ##
>
>
> ## This program is free software: you can redistribute it and/or modify
> ## it under the terms of the GNU General Public License as published by
> ## the Free Software Foundation, either version 3 of the License, or (at
> ## your option) any later version.
> ##
> ## This program is distributed in the hope that it will be useful, but
> ## WITHOUT ANY WARRANTY; without even the implied warranty of
> ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
> ## the GNU General Public License for more details.
> ##
> ## You should have received a copy of the GNU General Public License
> ## along with this program. If not see <http://gnu.org/licenses/>
> ##
>
> ## Version: 0.4
> ## 2014-06-10
>
> ############################################
>
> import datetime, time, subprocess, shlex, os, gnupg, glob, shutil
imports should be one module per line and alphabetically- more readable:

import datetime
import glob
import gnupg
import os
import shlex
import shutil
import subprocess
import time

>
> # logical volumes exist in two different volume groups, vgxen and vg_data
> # hence two lists of vms
> vgxenList = ['vm1', 'vm2', 'vm3', 'vm4', 'vm5', 'vm6' ]
> vg_dataList = ['vm1', 'vm2']
> backupList = [ ]
> snapNameList = [ ]
>
>
> # create snapshot names like the following: 2014-06-10T01-00-01.vm1.img.bz2
> for i in vgxenList:
>      DATE = datetime.datetime.now().strftime("%Y-%m-%d" + "T" + "%H-%M-%S")
>      vgxenName = "/dev/vgxen/"
>      lvName = i
>      origName = vgxenName + lvName
>      snapName= DATE + "." + lvName
>      snapNameList.append(snapName)
>      backupList.append(vgxenName + snapName)
>      subprocess.call(['lvcreate', '-s', '-L1G', origName, '-n', snapName])
>
>
> for h in vg_dataList:
>      DATE = datetime.datetime.now().strftime("%Y-%m-%d" + "T" +  "%H-%M-%S")
>      vg_dataName = "/dev/vg_data/"
>      lvName = h
>      origName = vg_dataName + lvName
>      snapName = DATE + "." + lvName
>      snapNameList.append(snapName)
>      backupList.append(vg_dataName + snapName)
>      subprocess.call(['lvcreate', '-s', '-L1G', origName, '-n', snapName])
Here in the two loops, you can notice they're almost the same ...
NOTE: I thing you're dealing with file paths, you should use os.path and 
os.path.join instead of string conactenation
             and you should also check if the file exists first before 
calling subprocess ... you can figure out how yourself
Basic DRY (don't repeat yourself) principle - distill their essence into 
a function or more:

def snapshot_name(item, name):
     """
     :param item: string (one of 'vm1', 'vm2' ..)
     :param name: string ('/dev/vgxen/')
     """
     isodate = datetime.datetime.now().isoformat()  # notice datetime 
already has this time format
     orig_name = '{}{}'.format(name, item)  # i'd rather use format or % 
string formatting than +
     snapshot_name = '{}.{}'.format(isodate, item)
     backup_path = '{}{}'.format(name, snapshot_name)

     return (orig_name, snapshot_name, backup_path)

for h in vg_dataList:  # note: vg_dataList - use either _ or CamelCase, 
not both!
     orig_name, snapshot_name, backup_path = snapshot_name(h, 
'/dev/vg_data/')
     backup_list.append(backup_path)  # only add path if exists ...
     subprocess.call(....)
>
> # backupPath is list of full paths of each snapshot
> # the string is extacted from backupList using 'join'
> backupPath = ' '.join(backupList)
> for j, k in zip(backupList, snapNameList):
>      backupPath = j
>      backupSnapshot = k
> # run dd and pipe to bz2 file using subprocess module
>      ddIf = shlex.split("dd if=%s bs=4k conv=noerror,notrunc,sync" %
> (backupPath))
>      compress = "pbzip2"
>      filename = "/home/files/temp/%s.img.bz2" % (backupSnapshot)
>      p1 = subprocess.Popen(ddIf, stdout=subprocess.PIPE)
>      with p1.stdout as fin, open(filename, "w") as fout:
>          p2 = subprocess.Popen(compress, stdin=fin, stdout=fout)
>      ret1 = p1.wait()
>      ret2 = p2.wait()
This part with dd ... you can read the file with open(filename, 'rb')
and then bzip it with python - it is up to you
example:
with open(filename, 'rb') as f:
     bzip_file = bz2.BZ2File(outfile, 'w')  # you specify compress level 
here - see docs
     bzip_file.write(f.read())  # you can do sequential read of the file 
and sequential compression - better for snapshots, this might not 
actually work for large files
     bzip_file.close()
>
>
> # create list of files to be encrypted with full path names
> # start with list of unencrypted files
> cryptDir = '/home/files/temp/'
> unencrypted = [u for u in os.listdir(cryptDir)]
> # join absolute path to file names to create new list (list comprehension)
> cryptDir_unencrypted = [ os.path.join(cryptDir, s) for s in unencrypted ]
>
>
> # encrypt files
> for G in cryptDir_unencrypted:
>      gpg = gnupg.GPG(gnupghome='/root/.gnupg')
>      phrase = "passphrase"  # HORRIBLE SECURITY, I KNOW!  The script is
> running as a cronjob so I can't interactively enter the passphrase.
> Suggestions are welcome.
>      cipher = "AES256"
>      with open(G, 'rb') as f:
>          status = gpg.encrypt_file(f, None, armor=False,
> passphrase=phrase, symmetric=cipher.upper(), output=G + '.gpg')
For the encryption part, you can create a keypair, encrypt the snapshot 
with public key
and decrypt the snapshot with private key on the amazon S3 directly 
without passwords - pycrypto module
>
>
> # move unencypted files out of temp directory
> for data in glob.glob(cryptDir + '*.bz2'):
>      shutil.move(data,'/home/files/')
>
>
> # delete snapshots
> for r in snapNameList:
>      removeSnapshots1 = 'lvremove -f ' + vgxenName + r
>      subprocess.call(shlex.split(removeSnapshots1))
>      removeSnapshots2 = 'lvremove -f ' + vg_dataName + r
>      subprocess.call(shlex.split(removeSnapshots2))
Also DRY - make a function that does this and takes parameters.
>
>
> # create list of file names to be uploaded (list comprehension)
> uploads = [y for y in os.listdir(cryptDir)]
> # join absolute path to file names to create new list (list comprehension)
> cryptDir_uploads = [ os.path.join(cryptDir, t) for t in uploads ]
>
> # upload to Amazon s3
> for d in cryptDir_uploads:
>      s3Upload = 's3cmd put ' + d + ' s3://bucket/dir/'
>      subprocess.call(shlex.split(s3Upload))
>      subprocess.call(shlex.split('rm ' + d))
>
>
> # move working path to list of unencrypted vm backup files
> path = '/home/files/'
> os.chdir(path)
>
> # build list of unencrypted vm backup files (list comprehension)
> fileNames = [ u for u in os.listdir(path) if os.path.isfile(u) ]
>
> # build list of	unencrypted vm backup files with .gpg
> # this will mirror the list of files in s3 (list comprehension)
> fileNames_gpg = [ p + '.gpg' for p in fileNames ]
>
> # NOTE: I tried to collapse the previous two list comprehensions into one
> # as it would seem to be possible - the contents of fileNames_gpg should be
> # an os.listdir() of the directory they are in.  For some reason I couldn't
> # do this so I did the above hack by appending the .gpg extension to a
> different list
>
> # calculate age of unencrypted vm backup files and
> # hence encrypted files on s3 (list comprehension)
> # NOTE: I have to use the unencrypted files on the server
> # as the gpg files get deleted after each upload to s3
> ageList = [ round((time.time() - os.stat(n).st_mtime)/60/60/24) for n in
> fileNames ]
>
> # delete files older than 'age' variable; age of file in ageList compared
> # against name in fileNames_gpg
> age = 7
> for x, y in zip(fileNames_gpg, ageList):
>      if y > age:
>          subprocess.call(shlex.split('s3cmd del s3://bucket/dir/' + x))
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor

Also a few hints:
I'm not so sure it's a good idea to delete the files from this script - 
make another that deletes old scripts (only when there are more than one 
for "partition") and call that with cron.
Would it not be better to make one loop that prepares all the files? - 
make snapshots, move them bzip them, encrypt them. And another that just 
uploads them all?

I hope this helps!
Lukas