Robert Escriva

Things I wish I learned earlier

Updated Comment Policy

Just a post to make note that I made a sideways change in my comment policy in order to make things easier for everyone.

Before I was moderating all comments in an attempt to kill off blog spam. That worked fine when no one actually read my blog, but it seems that Google has been sending people my way. In an effort to make it easier to leave comments, I removed moderation, but enabled Akismet to curtail spam. Additionally I'm holding comments coming from certain IPs or containing certain keywords in moderation.

Thanks for your patience. If your comment doesn't appear immediately, I will get to it.

Daemon Challenge 3 Dissection

So I guess I am efficient. I've been chosen as the winner of Dustin Kirkland's Daemon Challenge 3 I'm not going to provide a full writeup as to how the challenge can be completed; Dustin did that thoroughly here. Instead I'm going to discuss the reason the approaches outlined by Dustin and Dave Walker are superior to the less efficient methods. In the process you'll probably learn how SHA512, and hashing functions in general, work.

My Approach

This is the main approach I used to tackle this problem.

Cracking the MD5

So as mentioned before the solution was relatively straightforward. For cracking the hash, I took the same approach as Dave and turned to Google. Unfortunately, Google did not know the original hash (whose input included a trailing newline). If Google had not provided me with a solution, I had a contingency plan: Andrew Zonenberg's distributed hash cracker This project has massive potential. It's a distributed hash cracking system that will eventually support pluggable hashes. Using his system on 2 cores, he managed to obtain 12 million hashes per second and was able to find the answer in just 69.15 seconds; this is the worst case scenario as he was scanning the entire 8-bit domain. It took only 2 seconds flat to scan alphanumeric permutations. A big thanks to him for being there as a backup in case Google failed me. (update: my original numbers for Andrew's cracker were way off)

On another side-note it appears that this is a common trend in these contests. So much so that it might be advantageous to add a feature to his cracker that tries all combinations with a newline at the end. During one contest RPISEC was a participant in, we had to reverse a hash of 'potaton' and had trouble brute-forcing it until Google came to the rescue.

Solving the riddle

To solve the riddle inside the ecryptfs, I wrote a Python script that exploited the round structure of SHA512. I also wrote a bash script (for kicks) that has been running since Wednesday and has yet to reach 750,000 lines in the file. I will cede that there are improvements to be made in the bash script, but they still pale in comparison to the Python script. Those interested can grab my script: sha512.sh

My Python script takes a more efficient approach which I'll describe in a following section. This script takes about 4.1 s real time on my Q6600 desktop. Much better than several days worth of time.

Another (non-trivial) savings is that the Python script is a single process that runs until completion. My bash script will fork millions of processes over the course of its lifetime.

The answer to the questions inside the riddle can be found on Dustin's blog.

SHA512 Structure

SHA512 uses what's called a compression function to map an arbitrarily long string to a constant-length (128 chars) string. To do this it operates in chunks of a fixed length, calculating an initialization vector for the next chunk. Python's implementation of SHA512 will not rehash data from previous chunks. Each time the update function is called, it calculates the next initialization vector, thus minimizing the time necessary to perform the steps of the riddle. The bash implementation recalculates these chunks each and every time. That's a lot of excess and wasteful work being done by a CPU assigned to process these tasks.

Time Analysis

In short, the difference between the two algorithms comes down to exploiting the round structure and compression function of SHA512. But just how many resources do we waste? The short answer is it is the difference between O(n) and O(n^2). The long answer is below:

Each line in the file is 129 characters; that is, 128 characters for the hash, and 1 for the newline. The linear time algorithm will hash 999,999 * 129 characters, and it will write 1,000,000 * 129 characters to memory (I'm ignoring the initial calculation of 'Daemonn' as it doesn't matter in the long run).

The shell script, in comparison, will do much worse, even if we assume equal read and write performance compared to the in-memory Python implementation. In short, it reads in the following pattern:

129 * 1 + 129 * 2 + 129 * 3 + 129 * 4 + ... + 129 * 999,997 + 129 * 999,998 + 129 * 999,999

Which can be more concisely expressed as:

/files/2009-01-23-daemon-challenge-3-dissection/ay.gif

Note that these numbers are in bytes which equates to 58 TB of data run through the SHA512 sum program. Compare that to the 123 MB of data processed by the Python script. It is also good to notice that I run wc -l to calculate the number of iterations that remain. The bash script is reading the file multiple times at each iteration, which implies that 58 TB is scaled by a constant factor. Talk about inefficiency.

Conclusion

In conclusion I'd like to express my gratitude to many people:

  • Dustin Kirkland for all the time and effort he put into this contest. It was something unique and gave me that "hacker high" that we all strive to achieve.
  • Daniel Suarez and his publishers. I look forward to enjoying my copy of Daemon when it arrives.
  • Google for being such a quality hash cracking utility.
  • Andrew Zonenberg for his extremely efficient MD5 implementation. If you're looking for a good cracker, I recommend following his work.
  • Yonatan Naamad for the improved latex/gif.

One last thought: Computing time is generally less expensive than a programmer's time; but a poor implementation cannot be improved through more computing power.

Richard Stallman @ RPI

Last night was an amazing opportunity for me and several others at RPI. Thanks to the Free Software Foundation and several people from the RPI community we were able to have Richard Stallman as a guest lecturer at RPI.

I went out of my way to arrive early for a good seat, and I can say that it was definitely worth being there early. For those who missed it, the presentation was on the motivation for free software, and what makes software free (and why such a distinction is important).

I don't think I need to repeat the content of the evening here as it is much more thoroughly expressed here. This post is mainly one of appreciation and thanks to all those who made this lecture possible. A big thanks for all your effort; you've made at least one student's experience this semester a worthwhile one. I'd also like to thank the Free Software Foundation for their part in bringing Stallman to RPI.

Moved to Slicehost

I've recently moved my blog to a slice on Slicehost VPS hosting. I've been using Slicehost for awhile now to host my pet project, Firmant and have been impressed with it. This post discusses some of the changes I made to the Firmant stack as well as the changes to my blog.

Over the past few days I migrated the Trac setup for Firmant to Redmine as I felt Redmine better met my needs (no hard feelings toward Trac). Part of this transition included changing the software on the server.

A brief overview of the changes to Firmant's stack:

  • Move from Apache 2.2 to LigHTTPD 1.4. This change made the largest difference. Firmant was hosted using the MPM-Prefork version of Apache 2.2. Moving to LigHTTPD has decreased the memory consumption of my http server, decreased the number of processes, and increased the throughput of my server for static files.
  • Move from PostgreSQL to MySQL. This change is largely because Trac recommends PostgreSQL while Redmine recommends MySQL. Also it is easier to get MySQL to behave in low memory situations.
  • Removed selinux from my slice. I had selinux enabled for the extra security, but the slight overhead involved made the difference between swapping or not.

After I finished migrating Firmant, I chose to migrate my blog from the previous host to my slice as an experiment as an experiment. This is made possible by the fact that Wordpress relies upon MySQL. When Firmant goes stable I will convert and restructure my slice; until then, I will stick with MySQL.

The host on which my blog used to reside was roughly equivalent to my slice on Slicehost in terms of CPU and memory, but had a slower disk and network connection.

Overall I highly recommend LigHTTPD for both Redmine and Wordpress. If anyone needs help configuring it for either, feel free to email 'me at this domain dot com' or leave a question in the comments.

Update

As of December 3, 2009, I have switched away from LigHTTPD to nginx. This move was prompted by the superior FastCGI configuration options that nginx provides. Performance is comparable to what it was under LigHTTPD.

git-daemon Init Scripts on CentOS 5.x

I operate a CentOS 5.2 using Slice Host and use it for many personal projects. Rather than manually start git-daemon in the event of a server reboot, I wrote this init script to bring up git-daemon on boot:

#!/bin/sh
#
#   Startup/shutdown script for Git Daemon
#
#   Linux chkconfig stuff:
#
#   chkconfig: 345 56 10
#   description: Startup/shutdown script for Git Daemon
#
. /etc/init.d/functions

DAEMON=git-daemon
ARGS='--base-path=/home/git/ --detach --user=git --group=git'

prog=git-daemon

start () {
    echo -n $"Starting $prog: "

    # start daemon
    daemon $DAEMON $ARGS
        RETVAL=$?
    echo
    [ $RETVAL = 0 ] && touch /var/lock/git-daemon
    return $RETVAL
}

stop () {
    # stop daemon
    echo -n $"Stopping $prog: "
    killproc $DAEMON
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f /var/lock/git-daemon
}

restart() {
    stop
    start
}

case $1 in
    start)
        start
    ;;
    stop)
        stop
    ;;
    restart)
        restart
    ;;
    status)
        status $DAEMON
        RETVAL=$?
    ;;
    *)

    echo $"Usage: $prog {start|stop|restart|status}"
    exit 3
esac

exit $RETVAL

Some quick notes:

  • It serves all git repos with the file git-daemon-export-ok in /home/git/. To mark a repo as ok for export, create git-daemon-export-ok in the .git folder.
  • It requires the binary git-daemon to be present on the system. Many third-party CentOS repositories provide packages for this.
  • The user and group git must exist for the daemon to drop its privileges when it runs.
  • I make no warranty regarding use of this script. Don't blame me if it crashes or eats your first born son. You have been warned.
  • The structure of the script was extracted from several other init scripts on a CentOS 5 box. I do not recall which ones.

Manage Your $HOME With Git

At the end of last semester I became serious about managing my home directory with some form of version control so that I could keep files somewhat synchronized across multiple computers.

Versioning $HOME provides several benefits from my perspective:

  • Making a backup is as easy as a new checkout of your $HOME.
  • Distributed version control systems make it easy to eliminate single points of failure.
  • A distributed VCS will also provide a built in backup mechanism that is temporal and spacial (that is as far back as you've been using the VCS and as distributed as your computers are).

The only con i have encountered so far is that it takes quite a bit of thought to plan how you want to setup your repositor(y|ies).

I ended up going with the dVCS git as it is very robust and fast. I also have more experience using it than any other DVCS. Most of the concepts I show should easily apply to any distributed system with a little tweaking. Some things I've discovered git does well that I had not anticipated:

  • Using SHA1 sums for identifying content works well to ensure integrity of files.
  • An efficient packing structure ensures that each file is only stored once.
  • Bash completion (under Ubuntu) is very convenient and can be adapted to improve the workflow I now use.

When I started playing with versioning $HOME, I put all files into one big repository. I quickly realized this was not the best solution as it made managing of config files harder than I wanted; for example, I have three computers I use regularly: one is a dual-head desktop, one a T60, and one is an EEE 900. Because of the different configurations of hardware, my .conkyrc is slightly different for each. The logical solution is to have a separate branch for each machine, each with a different .conkyrc. The problem comes in when I also manage my coursework in the same repo. The coursework is agnostic as to what hardware I'm running, and thus must be the same across all branches. If I were to merge changes from one branch into another, it would also merge the .conkyrc. An alternative to merging would be cherry-picking changes or using stgit to maintain the patches. I chose a different, more flexible solution.

Each logical set of files has its own git repo. My dotfiles have their own repo. As these are not going to change often, the hassle of cherry-picking changes is reduced, and the expense of a separate branch for each computer is reduced to something I will tolerate. The set of coursework can remain on a single branch for all machines (or one branch for each semester if I wish to hide older coursework). My resume has a branch for each employer I submit a custom resume to. Overall using multiple repositories to manage my $HOME works well with few hiccups.

To do this I created a few different repos, and then added a few bash aliases to streamline the process of maintaining the repos.

I'll go through the steps I performed to create the repo for my dotfiles. All commands were run in ~/

  • Initialize an empty repository with: git --git-dir=.config.git/init
  • Add some dotfiles: git --git-dir=.config.git/ --work-tree=. add .bash_logout .bashrc .profile
  • Commit the changes: git --git-dir=.config.git/ --work-tree=. -m "Bash dotfiles"
  • Edit .bashrc and add the following alias line: alias config='git --git-dir=$HOME/.config.git/ --work-tree=$HOME'

This will create a git repo for your home directory and allow you to use all of the normal git commands without --git-dir or --work-tree by substituting 'config' for 'git'. For example I can see a change log by typing config log.

The only thing left to do locally is to add bash completion to your bashrc. In the default Ubuntu bashrc there is a conditional:

if [ -f /etc/bash_completion ]; then

that loads the bash completion file for the system. After this line all routines defined by the git bash completion scripts will be available. Inside the conditional I added the line:

complete -o default -o nospace -F _git config

which enables bash completion for the config command. Save your changes and open a new bash instance to reload the changes.

Where to go from here:

  • Setup a remote git repository to share your local repo with your other computers.
  • Setup SSH Keys to save yourself from entering the password each time you push/pull from your remote(s).
  • Setup a cron script to use git-archive to create snapshot tarballs, and encrypt/email them to a gmail account.
  • Let me know if you find this useful or improve upon any techniques I've developed.

Update: Silas Sewell has written a great article expanding upon what I wrote here. He's taken what I have done a step further to use GitHub as a remote repository.

Copyright © 2010 Robert Escriva ¦ Powered by Firmant