Monday, May 23, 2011

"mydata" : Python script for rsync shorthand

This script is for the messier among us who keep copies of their personal files on several machines or storage devices, possibly with a different folder structure at each location, and want to synchronize them using rsync. It's quick, it's simple and it makes rsync way easier to use.

The long story: When I first learned how to use rsync I was amazed. Such a vast alphabet soup of option flags to choose from, to tightly control the behavior of the program as it moved bits around and brought two copies of a file or folder into agreement. But one thing was always annoying: I had to type in paths every time I wanted to synchronize my documents, images, music (et cetera) between my desktop and my notebook/netbook, backup server, etc., so I set out to write a script to call rsync for me. Ideally, it would read a configuration file and compose the appropriate rsync command, by parsing my sloppy shorthand keystrokes for locations and categories of data corresponding to different directories on different computers - e.g. on one computer, music might be in "~/Music/" whereas on another it might be in "~/Audio/". Originally I wrote it in Perl and the project took me dozens of hours to complete, not to mention the after-the-fact debugging and improvement that was necessary to make it not suck so much, and how whenever you write something in Perl it's a headache to understand later. Also, it was really slow. So, one Saturday afternoon last November, I completely re-wrote it in Python, and now it's blissfully fast and elegant.

Anyhow, the basic setup: store your locations and data categories in a configuration file (~/.mydata/conf), put the script in a userscripts folder referenced in your $PATH variable, and invoke it with four arguments: category alias, location alias, direction and behavior, which have the following roles:

  1. category alias: Any combination of whitespace-free characters that is a substring of the string defined in the configuration file, which refers to a category of data (whose paths on each machine/location are also defined there)

  2. location alias: Similar to category alias in that whatever is entered here is matched against a string in the configuration file for a location on which the category of data might be located. It could be another host, or even a location on a backup hard drive.

  3. direction: Push or pull; refers to whether the local copy (the copy on the machine where the script is being run) is the source and the "remote" copy is the target (push), in which case the changes to the local copy will be copied to the remote, or vice versa (pull).

  4. behavior: I hard coded a few aliases that this argument gets matched against such that each corresponds to a different set of option flags. For example, "copynew" ignores preexisting files and only copies files that don't exist at the target but do at the source (the local copy is the target if pulling, remote if pushing), while "clean" deletes files at the target that don't exist at the source.
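
The substring matching used for the category, location and behavior arguments can be sketched roughly as follows. This is a simplified illustration written for Python 3, not the script itself; the function name match_alias and the sample pools are made up for the example.

```python
def match_alias(fragment, pools):
    """Return the alias whose string pool contains `fragment`, or None."""
    for alias, pool in pools.items():
        if fragment in pool:
            return alias
    return None

# Sample pools in the style of the configuration file (illustrative values).
category_pools = {"aud": "musicaudioAudioMusic", "doc": "documentsDocuments"}
# Behaviors are matched against their own names.
behaviors = {name: name for name in ("simple", "clean", "copynew", "verbatim")}

print(match_alias("mus", category_pools))    # aud
print(match_alias("Doc", category_pools))    # doc
print(match_alias("ver", behaviors))         # verbatim
print(match_alias("video", category_pools))  # None
```

The first pool that contains the fragment wins, so very short fragments can be ambiguous; typing a few more characters avoids surprises.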



The configuration file, in my case, contains the following:
###########################
# # 1: Location Aliases # #
###########################
# Define aliases and the strings against which command line arguments will be matched here as
# "alias = string" pairs; for example:
# sbs = sshrbuSSH
# bhd = hddHDD
# the alias "sbs" can then be used to identify other information, such as the location, base path
# to data, etc., while any substring of the big string set as the value can be used to identify
# and use the particular location for rsync at the command line.
[location_aliases]
pol = PolarispolarisPOLARISncp
cor = corvusCorvus

###########################
# # 2: Category Aliases # #
###########################
# Define aliases and strings against which command line arguments will be matched, similar to the
# location aliases;
[category_aliases]
aud = musicaudioAudioMusic
arc = archviesArchves
doc = documentsDocuments
img = imagesImagesimg
usc = userscripts
ffx = firefoxFireFoxbrowser
eml = thunderbirdThunderbirdemailEmail
key = keyvaultcayskeys
tby = tomboyTomboynotes

#####################
# # 3: Base Paths # #
#####################
# For each data category and location, enter its path as follows:
# FOR LOCATIONS: use
# alias = path
# for the path, use the path to which all category base paths will be relative, without a trailing
# slash; for example, if one is using a backup hard drive of location alias "bhd" mounted at
# /media/BackupHDD, one could write:
# bhd = /media/BackupHDD
# and for an SSH backup server with alias "sbs" and backup files stored in ~/personal/backup/, write:
# sbs = username@remotehost:~/personal/backup
# FOR CATEGORIES: use
# alias = localias1:path1/
#         localias2:path2/
#         etc.
# (include trailing slashes, and indent the continuation lines)
# For example, if you have a category with alias "mus", for Music, and on the backup hard drive it
# is at /media/BackupHDD/Music, while on the server it is at ~/personal/backup/media/Audio, the
# entry would be as follows:
# mus = bhd:/Music/
#       sbs:/media/Audio/
[basepath]
pol = Polaris:~
cor = Corvus:~
arc = cor:/Archives/
      pol:/bak/Archives/
aud = cor:/Audio/Library/
      pol:/aud/
doc = cor:/Documents/
      pol:/bak/Documents/
ffx = cor:/.mozilla/firefox/981pmhrz.default/
      pol:/bak/.mozilla/firefox/981pmhrz.default/
eml = cor:/.mozilla-thunderbird/wm2glu61.default/
      pol:/bak/.mozilla-thunderbird/wm2glu61.default/
img = cor:/Images/
      pol:/bak/Images/
tby = cor:/.local/share/tomboy/
      pol:/bak/.local/share/tomboy/
usc = cor:/.alt/
      pol:/bak/.alt/

####################################
# # 4: Category-Specific Options # #
####################################
# If you wish to add specific options for categories of data, you can define them here; for
# example, to show how much of a large file has been transferred when synchronizing a video files
# folder with alias "vid":
# vid = -h --progress
[category_options]
key = -c
aud = -h --progress

####################
# # 5: Hostnames # #
####################
# These are used to identify where the script is being run. You MUST define at least one, for the
# host name of the computer on which you will most often run the script. This functionality was
# added so that the script may be invoked on any of the machines where the data is stored.
[hostnames]
cor = Corvus
pol = Polaris
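
As a rough illustration of how the [basepath] section can be read, here is a sketch in Python 3 using the standard configparser module (the script itself uses Python 2's ConfigParser). The helper name parse_basepaths is made up for this example, and the sample text is taken from the comments in section 3 of the configuration. Note that the continuation line has to be indented for the parser to attach it to the preceding entry.

```python
import configparser

# Sample entries from the comments in section 3 of the configuration.
SAMPLE = """\
[basepath]
bhd = /media/BackupHDD
sbs = username@remotehost:~/personal/backup
mus = bhd:/Music/
      sbs:/media/Audio/
"""

def parse_basepaths(text, categories):
    """Return {alias: base path} for locations and
    {alias: {location alias: sub-path}} for categories."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    paths = {}
    for alias, raw in parser.items("basepath"):
        if alias in categories:
            # One "location:path" pair per line of the multi-line value.
            paths[alias] = dict(line.split(":", 1) for line in raw.splitlines())
        else:
            paths[alias] = raw
    return paths

paths = parse_basepaths(SAMPLE, categories={"mus"})
print(paths["mus"])  # {'bhd': '/Music/', 'sbs': '/media/Audio/'}
print(paths["sbs"])  # username@remotehost:~/personal/backup
```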


With the examples detailed in the comments of the configuration here, you could make the music folder on the SSH backup server identical to the one on the computer where you run the script with the following command (assuming the script is saved as "mydata.sync" in a directory included in your $PATH variable):

$ mydata.sync mu ssh push ver

This command selects the "music" category (assuming 'mu' is uniquely a substring of whatever string pool the 'mus' alias was tied to in the configuration), selects the ssh server (alias "sbs"), uses the "push" action, which will bring the copy on the server to match the local copy, and "ver" matches/selects the "verbatim" option set, which makes the target a carbon copy of the source, more or less.*
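
To make the result concrete, here is a sketch (Python 3) of how such a command could be assembled from the pieces above. It is an approximation, not the script's own code: the function name compose_rsync is invented, and the local base path "/home/me" and local music folder "/Music/" are hypothetical, since the configuration comments only define the remote side.

```python
def compose_rsync(local_base, remote_base, local_sub, remote_sub,
                  pull, behavior_flags, default_flags="-av --modify-window=2"):
    """Build the rsync argument list; the source comes first, then the target."""
    local_path = local_base + local_sub
    remote_path = remote_base + remote_sub
    if pull:
        source, target = remote_path, local_path
    else:
        source, target = local_path, remote_path
    return ["rsync"] + default_flags.split() + behavior_flags.split() + [source, target]

cmd = compose_rsync(
    local_base="/home/me",            # hypothetical local base path
    remote_base="username@remotehost:~/personal/backup",
    local_sub="/Music/",              # hypothetical local music folder
    remote_sub="/media/Audio/",
    pull=False,                       # "push": local is the source
    behavior_flags="--delete-after",  # the "verbatim" option set
)
print(" ".join(cmd))
# rsync -av --modify-window=2 --delete-after /home/me/Music/ username@remotehost:~/personal/backup/media/Audio/
```

With pull=True the two paths simply swap places, so the local copy becomes the target.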

So, now for the code! Note that the option flag sets are defined in the dictionary "opts", and that you can control the default flags passed to rsync by changing the "opts_default" variable. I use -a for archive mode, -v for verbose output (show what's being copied and when) and --modify-window=2, for machines where the time may have drifted apart by a few seconds. Come to think of it I don't need that last one and you may not either; I needed it when I owned a PowerBook G4 and its system time would vary from the time on my Linux machines by a second or two for some odd reason.

#!/usr/bin/env python
# Written for Python 2 (ConfigParser module, print statements, raw_input).
import ConfigParser,pickle,socket,sys
from os import path,stat,system,environ

# Additional hard-coded configuration:
opts_default = '-av --modify-window=2'
# Each behavior maps to [extra rsync flags, description shown before running].
opts = {'simple':['',"Use only the default options."],
        'clean':["--delete-after --ignore-existing --existing",
                 'Delete files in target not in source, and do nothing else.'],
        'copynew':['--ignore-existing',
                   'Skip files that exist already in target, copy new files.'],
        'update':["-u --existing",
                  'Update files that exist in both target and source such that '
                  +'the source copy is newer, but don\'t copy new files from the source.'],
        'fullupdate':['-u',
                      'Update all files on the target, skip files that have been touched '
                      +'more recently on the target, and copy new files.'],
        'verbatim':['--delete-after',
                    'Make the target a 100% carbon copy of the source, using difference in '
                    +'file size or timestamp as the criterion for transferring (WARNING: '
                    +'this will overwrite recent changes on the target and delete files not '
                    +'in the source).'],
        'carboncopycheck':['-c --delete-after',
                           'Makes the target a 100% carbon copy of the source, using checksum '
                           +'difference as the criterion for transferring (warning: this will '
                           +'be considerably slower for larger files and larger numbers of files).']}

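def matchitem(inputstr,confdict,match_dir=1):
    # Return the key of the first entry whose value matches inputstr.
    # match_dir=1: inputstr must be a substring of the value;
    # match_dir=0: the value must be a substring of inputstr.
    for item in confdict.items():
        match = (item[1] in inputstr,inputstr in item[1])[match_dir]
        if match:
            return item[0]
    return None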

def parseconf():
    global aliases,categories,catopt,hostnames,locations,paths
    confn = environ['HOME']+'/.mydata/conf'
    confc = environ['HOME']+'/.mydata/conf_cache'
    if not path.exists(confn):
        print 'Could not find configuration file: ~/.mydata/conf'
        sys.exit(1)
    # Re-parse only if the configuration is newer than the pickled cache.
    try:
        refresh = stat(confc).st_mtime < stat(confn).st_mtime
    except OSError: # no cache file yet
        refresh = True

    if refresh:
        aliases = {}
        conf = ConfigParser.ConfigParser()
        conff = open(confn,'r')
        conf.readfp(conff)
        conff.close()
        localiases = dict(conf.items('location_aliases'))
        cataliases = dict(conf.items('category_aliases'))
        pathraw = dict(conf.items('basepath'))
        catopt = dict(conf.items('category_options'))
        hostnames = dict(conf.items('hostnames'))
        aliases.update(localiases)
        aliases.update(cataliases)
        locations = localiases.keys()
        categories = cataliases.keys()
        paths = {}
        for alias in pathraw.keys():
            if alias in categories:
                # Multi-line value: one "location:path" pair per line
                paths[alias] = dict([item.split(':') for item in pathraw[alias].split('\n')])
            else:
                paths[alias] = pathraw[alias]
        # Cache the parsed configuration for the next run
        confsav = open(confc,'wb')
        savlist = [aliases,categories,catopt,hostnames,locations,paths]
        pickle.dump(savlist,confsav,pickle.HIGHEST_PROTOCOL)
        confsav.close()
    else:
        # The cache was written in binary mode, so read it the same way
        cachef = open(confc,'rb')
        conf = pickle.load(cachef)
        cachef.close()
        aliases,categories,catopt,hostnames,locations,paths = conf

def main():
    # Step 1: initialize
    args = sys.argv
    if len(args) < 5:
        print 'Insufficient number of arguments.'
        sys.exit(1)
    parseconf()
    cmd = 'rsync ' + opts_default
    # Any extra command line arguments are passed straight through to rsync
    cmd += ''.join([' ' + arg for arg in args[5:]])

    # Step 2: orientation
    localhost = matchitem(socket.gethostname(),hostnames,match_dir=0)
    category = matchitem(args[1],aliases)
    location = matchitem(args[2],aliases)
    pull = 'pul' in args[3]
    action = matchitem(args[4],dict([[key,key] for key in opts.keys()]))

    # Step 3: compose the rsync command
    if not location in locations or not category in categories:
        print 'Did not find a match for location or category. Check the configuration and try again.'
        sys.exit(1)
    if localhost is None or action is None:
        print 'Did not find a match for the local hostname or the behavior.'
        sys.exit(1)
    localpath = paths[localhost].replace(hostnames[localhost]+':','')+paths[category][localhost]
    remotepath = paths[location]+paths[category][location]
    cmd += ' '+opts[action][0] # Command line options based on type of transfer
    cmd += ' '+catopt.get(category,'') # Category-specific options from the configuration
    # rsync takes the source first, then the target
    cmd += ' '+(localpath + ' ' + remotepath,remotepath+' '+localpath)[pull]

    # Step 4: execute
    try:
        raw_input('Command: "%s"\n%s\nProceed? (hit enter if ready)'%(cmd,opts[action][1]))
        system(cmd)
        print 'Done.'
    except KeyboardInterrupt:
        print '\nCancelled.'
        sys.exit(0)

if __name__ == '__main__':
    main()


Make a directory ~/.mydata, create a configuration file ~/.mydata/conf, save the script in a userscripts folder whose path is included in your $PATH shell environment variable, and you're all set!

* This is assuming that changes in a file's size and timestamp will be sufficient to determine that there has been a change in the file; if you're paranoid and don't mind waiting longer for the synchronization to complete, you can put -c in the category flags for the data category so that rsync performs checksumming to determine if any files have changed. In my case, I used it for the "key" category, which contained TrueCrypt file containers - and from experience, they can change in content but not in size or timestamp, so they're a slippery lot.
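
To illustrate why a size-and-timestamp comparison can miss a change that a checksum catches, here is a small Python 3 sketch. It only mimics the idea and says nothing about how rsync is implemented internally; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import os
import tempfile

def quick_check_same(path_a, path_b):
    """Compare two files by size and modification time only."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

def checksum_same(path_a, path_b):
    """Compare two files by a digest of their full contents."""
    def digest(p):
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.digest()
    return digest(path_a) == digest(path_b)

def demo():
    """Two files with equal size and mtime but different content."""
    with tempfile.TemporaryDirectory() as d:
        path_a = os.path.join(d, "a.bin")
        path_b = os.path.join(d, "b.bin")
        for name, data in ((path_a, b"AAAA"), (path_b, b"BBBB")):
            with open(name, "wb") as f:
                f.write(data)
            # Force identical access and modification times on both files
            os.utime(name, (1000000000, 1000000000))
        return quick_check_same(path_a, path_b), checksum_same(path_a, path_b)

print(demo())  # (True, False)
```

The quick check considers the files identical while the checksum comparison does not, which is the situation described above for the container files.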
