For months I've been telling myself I should push portions of my blog to Fedora Planet. Today I'm doing so as part of Fedora Summer Coding 2010. I have the privilege of being a project mentor in the program this summer, helping Matt Mooney with the CHASM project. In this post I'll describe the project's goals and introduce myself.
Goals of CHASM
I came up with the idea for CHASM in the summer of 2009. At that time I was the sole systems administrator for the Rensselaer Center for Open Source at RPI. Due to some events in my personal life I was not able to check up on the mirror for several days to a week. In that time the script that I had borrowed from the community documentation for the distribution went haywire. rsync has a tendency to accumulate temporary files of the form .~tmp~. I assume the user account that creates the repo is different from the user running rsyncd, because I did not have permission to mirror these files. As a result, the check for whether rsync succeeded (exit code 0) never evaluated to true, and the mirror got stuck in a loop in which it continually tried to synchronize.
Looking back on it, I should have checked the script to make sure it was correct.
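A sync wrapper can avoid that kind of endless loop by capping the number of attempts and by treating "some files could not be transferred" differently from a hard failure. The following is a minimal sketch, not the script I was running. It assumes the failure showed up as one of rsync's partial-transfer exit codes (23 or 24); I did not record which code the script actually saw. `run_rsync` is a hypothetical callable that performs one rsync run and returns its exit code.

```python
from typing import Callable

# rsync exit codes that mean "some files could not be transferred":
# 23 = partial transfer due to error, 24 = source files vanished.
PARTIAL_TRANSFER_CODES = {23, 24}


def sync_with_limit(run_rsync: Callable[[], int], max_attempts: int = 3) -> str:
    """Run a sync at most max_attempts times and report how it ended.

    run_rsync is a hypothetical callable that performs one rsync run and
    returns its exit code (for example subprocess.run(...).returncode).
    """
    for _ in range(max_attempts):
        code = run_rsync()
        if code == 0:
            return "ok"
        if code in PARTIAL_TRANSFER_CODES:
            # Retrying will not make unreadable files readable, so stop
            # here and let a human look at the mirror.
            return "partial"
    # Every attempt failed with some other error; give up instead of looping.
    return "failed"
```

With a wrapper like this, a cron job can alert on "partial" or "failed" rather than retrying forever.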
This experience led me to the following observations about mirroring large volumes of software:
Correctness: The tools used should be more than a random collection of scripts that manipulate rsync or ftp. From my research it appears that Fedora is the best in this regard with MirrorManager. Debian has the ftpsync script set (although it appears that Ubuntu has not taken advantage of this and simply has an rsync script).
Efficiency: rsync is great for point-to-point transfers, but with many nodes arranged in a tree, it does much more work than is necessary. Each pair of nodes in the tree must re-establish what needs to be exchanged to complete the transfer.
I spoke with Peter Poeml of the MirrorBrain project and he indicated that within the mirror infrastructure of SUSE, they have each node maintain a list of things that need to be pushed to its children. On the next sync, the accumulated list is passed to rsync.
Integrity: A systems administrator should be able to verify the integrity of a mirror should hardware fail or a malicious user break in. rsync provides for this using the --checksum option, but with a big caveat: both ends of the connection hash all files considered for transfer.
This matters little when bandwidth and I/O are plentiful, but it becomes a problem when mirrors are already near capacity even without checksumming. As a result, Fedora recommends prohibiting the use of the --checksum option.
A systems administrator should be able to verify that the content they are providing is genuine. In the ideal case this should be possible without a network connection.
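To make the per-node pending list from the Efficiency point more concrete, here is a rough sketch of how a node might accumulate changed paths for each child and hand them to rsync on the next sync. This is an illustration only, not SUSE's actual tooling: the class, function, and path names are hypothetical. The one real piece is rsync's `--files-from` option, which restricts a transfer to the paths listed in a file.

```python
from collections import defaultdict


class PendingLists:
    """Track, per child mirror, the paths that still need to be pushed."""

    def __init__(self, children):
        self.children = list(children)
        self.pending = defaultdict(set)

    def record_change(self, path):
        # A changed file must eventually reach every child.
        for child in self.children:
            self.pending[child].add(path)

    def drain(self, child):
        # Hand over the accumulated paths and start a fresh list.
        paths = sorted(self.pending[child])
        self.pending[child].clear()
        return paths


def rsync_command(list_file, source, dest):
    # --files-from limits the transfer to the paths named in list_file,
    # so rsync does not have to walk and compare the whole tree.
    return ["rsync", "-a", f"--files-from={list_file}", source, dest]
```

A node would write the result of `drain(child)` to a file, one path per line, and run the command returned by `rsync_command` against that child.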
Ben, Joe, and I designed CHASM to meet each of these goals. It is designed to let one machine share files with many machines that are geographically distributed (from a network perspective). It is optimized for collections with a large number of files, of which only a small number change with each update.
A manifest is produced that identifies all metadata of the collection. A nice side effect is that all files are identified by a cryptographic hash. Two parties with the same manifest can easily communicate which files are available for transfer by relying upon this manifest. In experiments, a 500 GB mirror of Fedora (excluding certain architectures and rawhide) took ~100 MB of disk space to hold the uncompressed manifest and ~64 kB to describe the availability of files to peers that hold the manifest. Compression reduces the manifest to ~30 MB.
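The sketch below illustrates the general idea of a hash-based manifest. It is not CHASM's actual format or code: the choice of SHA-256, the path-to-digest dictionary, and the one-bit-per-entry availability encoding are all assumptions made for illustration. It shows three things: building a manifest, verifying a local tree against it without any network access, and describing which files a peer holds in a compact form.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    # Hash in chunks so large files do not have to fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    # Map each file's path (relative to root) to its digest.
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    # Offline check: list paths that are missing or no longer match.
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return bad


def availability_bitmap(manifest, have):
    # One bit per manifest entry, in sorted path order; 1 means "I have it".
    bits = bytearray((len(manifest) + 7) // 8)
    for i, rel in enumerate(sorted(manifest)):
        if rel in have:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)
```

Because both peers sort the same manifest the same way, the bitmap alone is enough to say which files are on offer; no path names need to cross the wire.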
More information about CHASM can be found in its technical design document. I will be posting its URL on the CHASM Blog in the near future.
A Little About Me
I'm a 2010 graduate of Rensselaer Polytechnic Institute, where I received a B.S. in Computer Science. At RPI much of my work in the open source community was sponsored by the Rensselaer Center for Open Source (RCOS). CHASM is the second project I worked on for RCOS. The first was Firmant, which is a static web framework. It is easily usable as a blog out of the box. In fact, this blog and both the Firmant and CHASM blogs are powered by Firmant.
My interests focus primarily on improving the infrastructure for open source development. Both Firmant and CHASM evolved from my experiences with the open source program at RPI and are designed to fit a niche that I felt was not adequately filled. Some other projects I've been working on recently are also designed to meet needs I personally have, and I hope they'll help others as well.
I'm happy to be offered the chance to work as a mentor in the Fedora Summer Coding program (do I get a shirt as well?). I hope to interact with the community more in the future as I've done a poor job of it so far.
Thank you for taking the time to read my post. Feedback from the community is appreciated via email (not published here, but a creative individual could find it) or on IRC (rescrv on irc.freenode.net #rcos #chasmd #firmant).