Dear Salonnières,

thank you so much for attending compu-salon episode 6—controlling your versions. Here's a summary of our discussion.

This week (Mar 9) we'll talk about building a basic "CGI" web application (i.e., a webpage that takes input from the user, runs on a server, and returns some useful output). I'm thinking of taking a simple GW SNR calculation as an example, but if you have ideas for something directly useful to you, please let me know.

I'll send out a reminder on Friday morning. Until then!

Michele

Compu-salon ep. 6 (2012/03/02) in summary

Also at www.vallis.org/salon.

Why you should be using version control

Many good reasons:

To provide an "undo" command for your work. Your code used to work fine, until you implemented that one extra feature... Your prose used to read beautifully, until you worked in suggestions from your coauthor... And now they're a mess. With version control you can revert to an older, better version.
To collaborate with colleagues. You find yourself endlessly e-mailing files back and forth (probably with creative filenames denoting the date and authorship of changes); making complicated plans of who's going to edit which chapter; running different versions of your code without realizing it. With version control you can always get the latest version, and work serially or even in parallel on the same files, while maintaining an official record of who's done what.
To document the history of your project as it evolves. It's good for proper credit, for the authoritativeness of your computational results, for your overall sanity.
To manage multiple versions, usually of code. Say you want to make a public release of v1.0 of your wonderful package, so the community can use it, and you want to keep fixing bugs in v1.0 while adding new features for the even more wonderful v1.1. Say you want to tag the specific version of all files used to make the figures in your paper, in case you have to do it again. You can do all of this easily in version control using branches and tags.
To make your code available in open-source fashion, encouraging re-use and enlisting the help of remote contributors. Indeed, version control is a basic component of more comprehensive platforms that enable such collaboration (Google Code, GitHub, Trac).

About Subversion

Subversion (SVN) is a classic (but evolved) version control system. It's classic in that it's based on a centralized, authoritative repository, which usually sits on a server and is accessed using an internet protocol. It's improved with respect to older standards such as CVS, in that, among other desirable properties, it has a global version counter with "atomic" multiple-file commits (i.e., a single revision number describes the state of all the files in the repository, and multiple files can be updated at once generating only one new revision); it allows the movement of files and directories while keeping proper history; it has convenient options for remote operation, and strong client tools. The one drawback that's often cited for SVN is that branches and tags are not first-class citizens: they are implemented by convention, by making copies of subdirectories.

Some good resources on SVN:

The homepage
The free online book "Version control with Subversion"
The SVN cheatsheet
Two very good commercial clients on OS X are Versions and Cornerstone.

My own cheatsheet

We need to distinguish between operations performed on the central repository itself (creating it, adding or copying a directory, importing non-version-controlled files) and operations performed by individual users on a working copy of the repository.

The repository always holds the authoritative version of all files; there is never any conflict in the repository. All editing work happens in working copies, and is then committed to the central repository; commits are only successful (establishing new authoritative versions) if there are no conflicts between the files in the working copy and those in the repository; if there are conflicts, they must be resolved in the working copy.

Operations on the repository

# create repository (general form, followed by example)
$ svnadmin create PATH
$ svnadmin create /home/user/repository

# make directories within repository
$ svn mkdir -m MESSAGE URL              
$ svn mkdir -m "New project" svn+ssh://server/home/user/paper
$ svn mkdir -m "New project" svn+ssh://server/home/user/paper/trunk

# add all files from a directory into repository
$ svn import -m MESSAGE PATH URL
# (the directory itself is stripped out, so in this example files end up in "trunk")
$ svn import -m "First import" /home/user/paperfiles svn+ssh://server/home/user/paper/trunk
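
To try the repository commands without a server, the same sequence can be run against a throwaway local repository. This is only a sketch: it assumes svn and svnadmin are installed, it uses a file:// URL (local file access, one of the options listed under "Protocols"), and the sandbox directory and the paper.tex file are made up for illustration.

```shell
#!/bin/sh
# Skip politely if Subversion is not installed
if ! command -v svnadmin >/dev/null 2>&1 || ! command -v svn >/dev/null 2>&1; then
    echo "svn/svnadmin not found; skipping" >&2
    exit 0
fi
set -e

# Everything lives in a temporary directory (hypothetical layout)
sandbox=$(mktemp -d)
svnadmin create "$sandbox/repository"
url="file://$sandbox/repository"

# make directories within the repository
svn mkdir -q -m "New project" "$url/paper"
svn mkdir -q -m "New project" "$url/paper/trunk"

# import a directory of non-version-controlled files into trunk
mkdir "$sandbox/paperfiles"
echo "First draft" > "$sandbox/paperfiles/paper.tex"
svn import -q -m "First import" "$sandbox/paperfiles" "$url/paper/trunk"

# check out a working copy and look at what arrived
svn checkout -q "$url/paper/trunk" "$sandbox/workingcopy"
ls "$sandbox/workingcopy"
```

Once you are done experimenting, the whole sandbox can simply be deleted; nothing outside the temporary directory is touched.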

Operations on a working copy

# check out a working copy of the repository
$ svn checkout URL DIR
# the path is stripped out, so in this example the files end up in the new directory "workingcopy"
$ svn checkout svn+ssh://server/home/user/paper/trunk workingcopy

# display modified (M), added (A), deleted (D), or unknown (?) local files
# the command does not query the repository, but checks against the local cache of the last update
$ svn status
$ svn status -u         # also checks for updates
$ svn info              # even more information
$ svn log [FILE]        # see the commit history of a directory or file

# schedule new files for addition, deletion, copy
$ svn add FILE1 FILE2 ...
$ svn delete FILE1 FILE2 ...
$ svn copy FILE1 FILE2

# commit edits (as well as additions/deletions/copies) to the repository
# commit will fail if the working copy has not been _updated_ to the latest repository revision  
$ svn commit -m MESSAGE FILE1 FILE2 ...
# or commit the entire directory:
$ svn commit -m MESSAGE .

# undo local changes by reverting to the latest revision updated from the repository (_not_ the latest version in the repository)
$ svn revert FILE

# compare local changes to the latest revision updated from the repository
# in the output, the ranges between @@/@@ indicate the blocks of lines that have changed
$ svn diff FILE
# compare two arbitrary revisions (note also BASE = cached revision; HEAD = latest in repository; PREV = previous revision)
$ svn diff -rREVNUM1:REVNUM2 FILE

# update the working copy to the latest state of the repository
# will print A, D, U, G, C for files added, deleted, updated, successfully merged, in conflict
$ svn update [FILE]

# IMPORTANT: after you have updated, be careful not to save an older version of a file that you may have in your editor's buffer

# if there are conflicts in FILE, four files appear (FILE, FILE.mine, FILE.HEADREVISION, FILE.BASEREVISION)
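# conflicts are resolved by editing FILE, removing the <<<<<<< ======= >>>>>>> marker blocks, and issuing
$ svn resolved FILE
# (in SVN 1.5 and later, the preferred form is "svn resolve --accept working FILE")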
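
Before marking a conflict as resolved, it can help to confirm that no conflict markers are left in the file. Here is a minimal sketch, assuming the usual seven-character markers at the start of a line; the helper name has_conflict_markers and the sample file contents are made up for illustration, and the sed step only stands in for the editing you would normally do by hand.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if FILE still contains
# lines starting with the conflict markers <<<<<<<, =======, or >>>>>>>
has_conflict_markers() {
    grep -q -E '^(<<<<<<<|=======|>>>>>>>)' "$1"
}

# Build a sample file that looks like a conflicted refs.bib (made-up content)
demo=$(mktemp)
cat > "$demo" <<'EOF'
@article{key2012,
<<<<<<< .mine
  year = 2012,
=======
  year = 2011,
>>>>>>> .r303
}
EOF

if has_conflict_markers "$demo"; then
    before="conflicted"
else
    before="clean"
fi

# "Resolve" by keeping our side: drop the other side's lines and all markers
sed -e '/^=======/,/^>>>>>>>/d' -e '/^<<<<<<</d' "$demo" > "$demo.fixed"

if has_conflict_markers "$demo.fixed"; then
    after="conflicted"
else
    after="clean"
fi
echo "before: $before, after: $after"
```

In everyday use the check could guard the final step, as in has_conflict_markers FILE || svn resolved FILE. Note that a file that legitimately contains a line of seven equals signs would trip this simple test.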

Two useful tricks

# undo changes by specifying a reverse version range
$ svn merge -rWRONGREV:RIGHTREV URL
$ svn merge -r303:302 svn+ssh://server/home/user/paper/trunk/refs.bib
# then commit...

# resurrecting a deleted item
$ svn copy URL/FILE@REVNUM ./filename
$ svn copy svn+ssh://server/home/user/paper/trunk/figure.pdf@807 .
# then commit...

Branching and tagging

By convention, for each project an SVN repository should include a trunk directory (the place for stable code and text), as well as branches and tags directories. There are two main approaches to dealing with branches.

In the "never branch" approach (probably appropriate for papers, sometimes for code), development happens on the trunk, releases are branched off, and tags are made as appropriate.

In the "always branch" approach (useful especially for code, see Jean-Michel Feurprier):

No work is done on the trunk, except for easy fixes, and data files that carry no "logic".
Branches are created to develop new features, and destroyed after the features are reintegrated into the trunk.
Tags are read-only snapshots of the trunk or branches: one should never commit to a tag, although it's appropriate to move a tag to a different revision where that makes sense (as for "stable", "latest", "production").

Note that SVN has no internal notion of branching or tagging: users implement these by making copies of directories to a different location in the repository, usually within the branches and tags directories. However, the copies are "cheap": internally, a copy is just a new entry that points to the existing data (much like a Unix hard link), and files are stored anew only when they're modified.

So in practice:

# create a branch (on the server!)
$ svn copy -m MESSAGE URL/trunk URL/branches/BRANCHNAME
$ svn copy -m "Create branch" svn+ssh://server/home/user/paper/trunk svn+ssh://server/home/user/paper/branches/newbranch

# check out a branch
$ svn checkout URL/branches/BRANCHNAME DIR
$ svn checkout svn+ssh://server/home/user/paper/branches/newbranch branchcopy

After which, you continue your work in the working copy, occasionally merging the changes that have happened on the trunk (this is a sync merge):

# _sync merge_: bring a branch up to date with changes made to ancestral parent branch
$ svn merge URL/trunk       # (while in the branch working copy) 
$ svn merge svn+ssh://server/home/user/paper/trunk

# then (possibly resolve conflicts) and commit the new state of the branch
$ svn commit -m "Merged trunk changes to branch"

Note that revision numbers are unique throughout the repository, so if commits are made both on the trunk and on the branch, the log history of each will skip some revision numbers. Also, the history of the branch won't be visible on the trunk, and vice versa (to see both, you'd have to check out the project repository directory that contains both trunk and branches).

Once the work on the branch is complete, it is time to port the results of your development back to the trunk (a branch reintegration merge)—but first:

do one last sync merge
verify that the code works correctly
tell your collaborators about the imminent reintegration, and solicit their feedback on your code
commit your last changes to the branch

Then:

# while in a working copy of the _trunk_
$ svn merge --reintegrate URL/branches/BRANCHNAME
$ svn merge --reintegrate svn+ssh://server/home/user/paper/branches/newbranch
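# the --reintegrate option is important for svn to keep the history right
# (from SVN 1.8 on, a plain "svn merge" detects reintegration automatically, and --reintegrate is deprecated)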

# then (possibly resolve conflicts) and commit the new state of the trunk
$ svn commit -m "Merged branch back into trunk"

It is good practice to delete a branch (svn delete -m MESSAGE URL/branches/BRANCHNAME) after it's been reintegrated (it will remain in the history in any case), or at least rename it (svn move -m MESSAGE URL/branches/BRANCHNAME URL/branches/OBSNAME) so that it is clearly marked as obsolete.

One last thing: tagging is the same as creating a branch:

# create a tag
$ svn copy -m MESSAGE trunk-URL tag-URL
$ svn copy -m "Tagged reintegrated trunk" svn+ssh://server/home/user/paper/trunk svn+ssh://server/home/user/paper/tags/reintegrate-newbranch

Protocols

SVN can operate across several remote internet protocols. You may have noticed that whenever I specified a repository URL in the tutorial above, it began with svn+ssh://server. That's one example of a protocol, which runs SVN remotely after accessing the server over ssh. Here are the options:

Host the repository in a shared directory accessible through local file access: svn commands will include URLs that look like file:///location/project/trunk. The file permissions of the repository need to be such that all users can edit the files.
Run a dedicated SVN server on a host: you will need to maintain a database of users, and URLs will look like svn://server.com/location/project/trunk. See the SVN book on this.
Configure the Apache httpd server to run SVN: URLs will look like http://server.com/location/project/trunk or https://server.com/location/project/trunk. See the SVN book on this.
The best option IMHO is to run svn over ssh, which can be done if all SVN users have accounts on the server, or even better by having all of them connect as the same user using private-key cryptography. See the SVN book on this, but here are some details:

Setting up single-user svn over ssh

Each user generates a private/public pair of ssh keys:

# on user1's account
$ ssh-keygen -t rsa -f user1_svn_rsa

Then user1 sends the public key user1_svn_rsa.pub to the server administrator. Let's say SVN will be run under the account "svn". The public key needs to be added in a new line to the file ~svn/.ssh/authorized_keys, as follows:

command="svnserve -t --tunnel-user=user1 --root=SVNDIR",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty KEYTYPE KEYBODY user1
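
Assembling that line by hand for every user is error-prone, so the administrator may prefer to generate it. Here is a minimal sketch of such a helper; the function name svn_key_entry, the /srv/svn directory, and the sample key (which is not a valid key) are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the authorized_keys entry for one SVN user.
# Usage: svn_key_entry USERNAME SVNDIR PUBKEYFILE
svn_key_entry() {
    user=$1
    svndir=$2
    # A public key file holds one line: KEYTYPE KEYBODY [COMMENT]
    read -r keytype keybody _comment < "$3"
    printf 'command="svnserve -t --tunnel-user=%s --root=%s",' "$user" "$svndir"
    printf 'no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty '
    printf '%s %s %s\n' "$keytype" "$keybody" "$user"
}

# Demonstration with a made-up (not valid) public key
pub=$(mktemp)
echo "ssh-rsa AAAAB3NzaC1yc2EXAMPLE user1@laptop" > "$pub"
entry=$(svn_key_entry user1 /srv/svn "$pub")
echo "$entry"
```

The administrator would append the printed line to ~svn/.ssh/authorized_keys, one line per user, each with its own --tunnel-user name so that commits are attributed correctly.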

The repository also needs to be created on the server (svnadmin create PROJECT inside SVNDIR), and will be accessed with the URL svn+ssh://svn@server.com/PROJECT/TRUNK (note the account name svn); however user1 needs to tell SVN to use his private key, which he can do, for instance, by defining

$ export SVN_SSH="ssh -q -i PATH/user1_svn_rsa"

svn and properties

svn does support keywords (e.g., $Id$, $Revision$, $Date$, $Author$), which are expanded in the file upon committing, much as in CVS. However, they need to be enabled for each file:

$ svn propset svn:keywords "Id Revision Date Author" /path/to/filename

which needs to be followed by a commit. There's a way to set keywords automatically for new files, by adding the following to your ~/.subversion/config

[miscellany]
enable-auto-props = true

[auto-props]
*.m = svn:keywords=Id Revision Date Author
(ANY OTHER FILE TYPES THAT NEED KEYWORDS WOULD GO HERE...)

About distributed version-control systems

To avoid confusing you too much, I won't say much about them, other than it's good stuff, and you should have a look when you feel advanced enough, or if your colleagues prompt you.

The idea is that there is no centralized repository, but the repositorIES reside in the accounts or workstations of individual users. So commits are local operations performed on the local repository, which holds the user's authoritative version of the files. There can be (and there usually is) a central reference repository, which can be cloned (rather than checked out), and from which and to which users pull and push changes. The main advantages are that development can proceed even if the central repository is not available, and that experimental or development branches can be dealt with locally, without burdening all users.

The distributed version-control system du jour is Git. I especially like how I can version-control a directory in my account just by doing git init.
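
To make that concrete, here is a minimal sketch of version-controlling a directory with Git, assuming git is installed. The temporary directory, the notes.txt file, and the demo identity are made up for illustration; in your own account you would set your name and e-mail once with git config rather than passing them inline.

```shell
#!/bin/sh
set -e
# Work in a throwaway directory so nothing in your account is touched
project=$(mktemp -d)
cd "$project"
echo "First draft" > notes.txt

git init -q                       # creates the local repository in .git/
git add notes.txt                 # schedule the file, much like "svn add"
# identity given inline so the sketch runs without a global git configuration
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "First commit"   # the commit is local; no server is involved

ncommits=$(git log --oneline | wc -l | tr -d ' ')
echo "commits so far: $ncommits"
```

Unlike with SVN, there is no separate step to create a repository on a server: the repository sits inside the directory itself, in .git/.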