Robert Escriva

Thoughts on technology, security, and computer science.

RCOS Bittorrent Mirror

So I’ve finally found some time to write about the RCOS mirror I mentioned Sunday. We’ve managed to create a mirror that has provided 60 GB to the RPI campus while only consuming 20 GB from the off-campus links.

In this article I’ll be discussing the configuration of the tracker, the seed box, and the public www manager in hopes that others can recreate this system to spread (legal) content and save bandwidth.

The Tracker

The tracker we use is PyTracker. I posted about PyTracker in February when I first created it, and have been testing it since. It holds up under high load; its performance depends more on lighttpd and PostgreSQL than on the tracker itself.

I’ve got the PyTracker software running as a FastCGI process under lighttpd. The lighttpd user is given permission to operate on the ‘peers’ and ‘whitelist’ databases.

It should be fairly straightforward to set up the tracker on any host with FastCGI or WSGI interfaces.

The Seed Box

The ‘official’ seed boxes are all configured the same. Each has a user dedicated to the torrents (the user ‘torrent’ in these examples).

The user runs rtorrent in a detached screen session. The rtorrent instance watches a particular directory for new torrents, and downloads all ISO files to a separate directory. There is no bandwidth limit set in rtorrent’s config.

To prevent the seed box from running wild, iptables is configured to reject all connections initiated by the user torrent to destinations off RPI’s campus.
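The filtering itself is done by iptables, not by any code of mine, but the policy is easy to state as a predicate. Here is a minimal Python sketch of that policy, assuming the two campus ranges quoted in the mirror announcement (128.113.0.0/16 and 128.213.0.0/16). The function names are made up for illustration.

```python
import ipaddress

# Campus ranges as quoted in the mirror announcement post (assumed complete).
CAMPUS_NETWORKS = [
    ipaddress.ip_network("128.113.0.0/16"),
    ipaddress.ip_network("128.213.0.0/16"),
]


def is_on_campus(address: str) -> bool:
    """Return True if the address falls inside one of the campus ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CAMPUS_NETWORKS)


def should_reject(destination: str) -> bool:
    """Mirror the firewall policy: reject anything the seed box sends off campus."""
    return not is_on_campus(destination)
```

With this predicate, a destination such as 128.113.1.5 is allowed and anything outside the two ranges is rejected, which is what keeps the seed box from using off-campus bandwidth.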

New torrents are pulled to the seed box using rsync for the torrent files, and bittorrent for data transfer.

Providing Users with Torrents

To provide users with torrents, I opted to make them available as flat HTML files. All the torrents have been modified to use the tracker described in ‘The Tracker’ section above in addition to the trackers provided in the original torrents. The torrents provided via rsync to the seed boxes are modified to use only our tracker.

To accomplish this, I wrote a small script (available here) that will take a directory hierarchy and transform all .txt files to .html, copy all images, and perform the modifications that the torrents require.
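The two kinds of tracker modification can be sketched as follows. This is not the script linked above, just a minimal illustration that assumes the torrent has already been bdecoded into a plain dict with string keys. The ‘announce’ and ‘announce-list’ keys come from the BitTorrent metainfo format; the function names and example URLs are hypothetical.

```python
import copy


def add_tracker(metainfo: dict, tracker_url: str) -> dict:
    """Return a copy that lists tracker_url first, keeping the original trackers."""
    result = copy.deepcopy(metainfo)
    tiers = result.get("announce-list")
    if tiers is None:
        # No multitracker list yet: start one from the single 'announce' URL.
        tiers = [[result["announce"]]] if "announce" in result else []
    if not any(tracker_url in tier for tier in tiers):
        tiers.insert(0, [tracker_url])
    result["announce-list"] = tiers
    result.setdefault("announce", tracker_url)
    return result


def only_tracker(metainfo: dict, tracker_url: str) -> dict:
    """Return a copy that announces to tracker_url and nothing else."""
    result = copy.deepcopy(metainfo)
    result["announce"] = tracker_url
    result.pop("announce-list", None)
    return result
```

Neither function touches the ‘info’ dictionary. The info hash is computed from that dictionary alone, so the public torrents and the seed box torrents still refer to the same swarm.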

I’m storing all torrents in a git repository for easy modification. When I push to the server, the pycms code is run to update the webroot (for users), and the rsync directory (for seed boxes).

Summary

I have given most of the details of our setup (without delving into boring configuration details). It performs rather well on our hardware (a dual core Intel with 1GB of RAM) and has already provided a significant service to the RPI community.

If you’re looking to set up your own software mirror, I’d be happy to provide help with setup and configuration of the above components. Please don’t solicit my help for sharing of files which you have not obtained permission to share.

No Comments

RCOS Linux Mirror

It’s been a busy semester so I haven’t been blogging as much as I did last semester, but I’ve got several topics to write about in the coming weeks as classes wind down.

Last week I made public the on-campus free/open source software mirror sponsored by the Rensselaer Center for Open Source Software (RCOS). This mirror provides two very important services to the RPI community. First, it provides a complete mirror of the Ubuntu archive. Second, it provides RPI with a large collection of Linux and *BSD ISO torrents to download.

If you’re on RPI’s campus (128.113.0.0/16 or 128.213.0.0/16) you can access our mirror here. The mirror is restricted to on-campus users because of limitations on RPI’s upstream bandwidth.

In a future article I’ll describe the architecture of the torrent network and provide the software necessary for other users to create their own local free software repositories.

No Comments

DNS Transfer

I’ll be changing my registrar for this domain in the next few days. Please let me know if you have any issues sending mail and/or reaching any services provided under this domain.

No Comments

Introducing PyTracker

As part of an initiative to set up an on-campus large file mirror, I wrote PyTracker.

Get the code with git clone git://robescriva.com/pytracker.git

The thought behind PyTracker is that at RPISEC’s next contest, we can share the contest image over bittorrent instead of relying on thumb drives (Alex uses netcat). The tracker is very simple, but works well (it’s been running for about a week without errors as far as I can tell). We also hope to set up seed boxes for Linux/*BSD ISO files.

It is written in Python and runs as a FastCGI process. It uses PostgreSQL for the backend. It currently only supports announces and blacklisting or whitelisting of torrents.
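For readers who have not looked at the tracker side of the protocol: an announce is an HTTP GET, and the reply is a bencoded dictionary. The sketch below is not PyTracker’s code. It is a minimal bencoder plus a helper that builds the non-compact reply described in the BitTorrent specification (‘interval’ and ‘peers’ keys). The function names and the default interval are my own choices for illustration.

```python
def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists and dicts in bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def announce_response(peers, interval=1800) -> bytes:
    """Build a non-compact announce reply from (ip, port) pairs."""
    return bencode({
        "interval": interval,  # seconds the client should wait before re-announcing
        "peers": [{"ip": ip, "port": port} for ip, port in peers],
    })
```

A real tracker would pull the peer list from its database (the ‘peers’ store in PyTracker’s case) and would also check the requested info hash against the whitelist or blacklist before answering.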

It’s not well documented at the moment, but that may change in the future (or not).

If you need help setting it up, leave a comment or email me. My email address is in the commit log.

No Comments

Updated Comment Policy

Just a post to make note that I made a sideways change in my comment policy in order to make things easier for everyone.

Previously I moderated all comments in an attempt to kill off blog spam. That worked fine when no one actually read my blog, but it seems that Google has been sending people my way. In an effort to make it easier to leave comments, I removed moderation and enabled Akismet to curtail spam. Additionally, I’m holding comments that come from certain IPs or contain certain keywords in moderation.

Thanks for your patience. If your comment doesn’t appear immediately, I will get to it.

No Comments

Daemon Challenge 3 Dissection

So I guess I am efficient. I’ve been chosen as the winner of Dustin Kirkland’s Daemon Challenge 3. I’m not going to provide a full writeup as to how the challenge can be completed; Dustin did that thoroughly here. Instead I’m going to discuss the reason the approaches outlined by Dustin and Dave Walker are superior to the less efficient methods. In the process you’ll probably learn how SHA512, and hashing functions in general, work.

My approach

This is the main approach I used to tackle this problem.

Cracking the MD5

As mentioned before, the solution was relatively straightforward. To crack the hash, I took the same approach as Dave and turned to Google. Unfortunately, the original shadow file had a trailing newline, although I was in the process of reversing that hash.

On a side note, if Google had not provided me with a solution, I had a contingency plan: Andrew Zonenberg’s distributed hash cracker. This project has massive potential. It’s a distributed hash cracking system that will eventually support pluggable hashes. Using his system on 2 cores, he reached 12 million hashes per second and found the answer in just 69.15 seconds; that is the worst case, as he was scanning the entire 8-bit domain. Scanning only alphanumeric permutations took 2 seconds flat. A big thanks to him for being there as a backup in case Google failed me. (update: my original numbers for Andrew’s cracker were way off)

On another side note, trailing newlines appear to be a common trend in these contests. So much so that it might be advantageous to add a feature to his cracker that tries every candidate with a newline at the end. During one contest RPISEC participated in, we had to reverse a hash of ‘potato\n’ and had trouble brute-forcing it until Google came to the rescue.
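The newline idea is simple enough to sketch. This has nothing to do with how Andrew’s cracker is implemented; it is a toy dictionary search in Python that tries each candidate both as-is and with a trailing newline. The function name and the suffix list are assumptions for illustration.

```python
import hashlib


def crack_md5(target_hex, candidates, suffixes=(b"", b"\n")):
    """Return the first candidate (plus suffix) whose MD5 matches, else None."""
    for word in candidates:
        for suffix in suffixes:
            attempt = word + suffix
            if hashlib.md5(attempt).hexdigest() == target_hex:
                return attempt
    return None


# Example: the hash was taken over 'potato\n', not 'potato'.
target = hashlib.md5(b"potato\n").hexdigest()
print(crack_md5(target, [b"carrot", b"potato"]))
```

Trying one extra suffix only doubles the work per candidate, which is cheap compared to missing the answer entirely.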

Solving the riddle

To solve the riddle inside the ecryptfs, I wrote a Python script that exploited the round structure of SHA512. I also wrote a bash script (for kicks) that has been running since Wednesday and has yet to reach 750,000 lines in the file. I will concede that there are improvements to be made in the bash script, but even an improved version would pale in comparison to the Python script. Those interested can grab my script: sha512.sh.

My Python script takes a more efficient approach, which I’ll describe in a following section. It takes about 4.1 s of real time on my Q6600 desktop, much better than several days’ worth.

Another (non-trivial) savings is that the Python script is a single process that runs until completion. My bash script will fork millions of processes over the course of its lifetime.

The answer to the questions inside can be found on Dustin’s blog.

SHA512 Structure

SHA512 uses what’s called a compression function to map an arbitrarily long string to a constant-length digest (128 hexadecimal characters). To do this it operates on chunks of a fixed length, with each chunk producing the initialization vector for the next. Python’s implementation of SHA512 will not rehash data from previous chunks: each time the update function is called, it calculates the next initialization vector from the current one, which minimizes the time needed to perform the steps of the riddle. The bash implementation recalculates every chunk from the start of the file each and every time. That’s a lot of wasted work for the CPU.
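To make the difference concrete, here is a minimal sketch of both strategies using Python’s hashlib. It is not my contest script, and the riddle’s exact rules are an assumption here: the file starts as a seed such as ‘Daemon\n’ and each new line is the hex SHA512 of everything written so far. The function names are made up for illustration.

```python
import hashlib


def build_incremental(seed: bytes, extra_lines: int) -> bytes:
    """Linear approach: keep one hasher and feed it only the newest line."""
    hasher = hashlib.sha512(seed)
    chunks = [seed]
    for _ in range(extra_lines):
        # hexdigest() reports the hash so far without resetting the hasher.
        line = hasher.hexdigest().encode("ascii") + b"\n"
        chunks.append(line)
        hasher.update(line)
    return b"".join(chunks)


def build_naive(seed: bytes, extra_lines: int) -> bytes:
    """Quadratic approach: rehash the whole file for every new line."""
    data = seed
    for _ in range(extra_lines):
        data += hashlib.sha512(data).hexdigest().encode("ascii") + b"\n"
    return data


# Both strategies produce identical output; only the amount of hashing differs.
assert build_incremental(b"Daemon\n", 100) == build_naive(b"Daemon\n", 100)
```

The incremental version works because hashlib lets you read the digest and then keep calling update() on the same object, so the internal state carries over from one line to the next.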

Time Analysis

So in short, the difference between the two algorithms comes down to exploiting the round structure and compression function of SHA512. But just how many resources do we waste? The short answer is it is the difference between O(n) and O(n^2). The long answer is below:

Each line in the file is 129 characters; that is, 128 characters for the hash, and 1 for the newline. The linear time algorithm will hash 999,999 * 129 characters, and it will write 1,000,000 * 129 characters to memory (I’m ignoring for now the initial calculation of ‘Daemon\n’ as it doesn’t matter in the long run).

The shell script, in comparison, will do much worse, even if we assume equal read and write performance compared to the in-memory Python implementation. In short, it reads in the following pattern:

129 * 1 + 129 * 2 + 129 * 3 + 129 * 4 + … + 129 * 999,997 + 129 * 999,998 + 129 * 999,999

Which can be more concisely expressed as:

$$129 \sum_{i=1}^{999{,}999} i = 129 \cdot \frac{(999{,}999)(1{,}000{,}000)}{2} = 64{,}499{,}935{,}500{,}000$$

Note that these numbers are in bytes, which equates to roughly 58 TB of data run through the SHA512 sum program. Compare that to the 123 MB of data processed by the Python script. Note also that the bash script runs wc -l to calculate the number of iterations that remain, so it reads the file more than once per iteration; 58 TB is therefore a lower bound on the amount of data read. Talk about inefficiency.
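The arithmetic is easy to double-check. A short Python sketch using the figures above, where the unit conversions assume binary units (1 MB = 2^20 bytes, 1 TB = 2^40 bytes):

```python
LINE_BYTES = 129      # 128 hex characters plus the newline
APPENDS = 999_999     # lines hashed after the initial 'Daemon\n'

# Linear algorithm: every line is hashed exactly once.
linear_bytes = LINE_BYTES * APPENDS

# Quadratic algorithm: iteration i rehashes all i lines written so far,
# so the total is 129 * (1 + 2 + ... + 999,999).
quadratic_bytes = LINE_BYTES * APPENDS * (APPENDS + 1) // 2

print(linear_bytes, "bytes, about", round(linear_bytes / 2**20), "MB")
print(quadratic_bytes, "bytes, about", int(quadratic_bytes / 2**40), "TB")
print("ratio:", quadratic_bytes // linear_bytes)
```

The ratio comes out to 500,000, which is the O(n) versus O(n^2) gap for n of about a million.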

Conclusion

In conclusion I’d like to express my gratitude to many people:

  • Dustin Kirkland for all the time and effort he put into this contest. It was something unique and something that gave me that “hacker high” that we all strive to achieve.
  • Daniel Suarez and his publishers. I look forward to enjoying my copy of Daemon when it arrives.
  • Google for being such a quality hash cracking utility.
  • Andrew Zonenberg for his extremely efficient MD5 implementation. If you’re looking for a good cracker, I recommend following his work.
  • Yonatan Naamad for the improved latex/gif

One last thought: Computing time is generally less expensive than a programmer’s time; but a poor implementation cannot be improved through more computing power.

1 Comment

Locked out of Ubuntu

Ever lock yourself out of your Ubuntu desktop or workstation (I’d imagine this trick works on other GNU/Linux distros out there)?

If you have SSH access to the machine, simply log in with your locked account and run:
killall gnome-screensaver

On my machine this unlocks the session, but your mileage may vary.

No Comments

Richard Stallman @ RPI

Last night was an amazing opportunity for me and several others at RPI. Thanks to the Free Software Foundation and several people from the RPI community we were able to have Richard Stallman as a guest lecturer at RPI.

I went out of my way to arrive early for a good seat, and I can say that it was definitely worth being there early. For those who missed it, the presentation was on the motivation for free software, not just open source software.

I don’t think I need to repeat the content of the evening here as it is much more thoroughly expressed here. This post is mainly one of appreciation and thanks to all those who made this lecture possible. A big thanks for all your effort; you’ve made at least one student’s experience this semester a worthwhile one. I’d also like to thank the Free Software Foundation for their part in bringing Stallman to RPI.

No Comments

Moved to Slicehost

I’ve recently moved my blog to a slice on Slicehost VPS hosting. I’ve been using Slicehost for a while now to host my pet project, Firmant, and have been impressed with it. This post discusses some of the changes I made to the Firmant stack as well as the changes to my blog.

Over the past few days I migrated the Trac setup for Firmant to Redmine as I felt Redmine better met my needs (no hard feelings toward Trac). Part of this transition included changing the software on the server.

A brief overview of the changes to Firmant’s stack:

  • Move from Apache 2.2 to LigHTTPD 1.4. This change made the largest difference. Firmant was hosted using the MPM-Prefork version of Apache 2.2. Moving to LigHTTPD has decreased the memory consumption of my http server, decreased the number of processes, and increased the throughput of my server for static files.
  • Move from PostgreSQL to MySQL. This change is largely because Trac recommends PostgreSQL while Redmine recommends MySQL. Also it is easier to get MySQL to behave in low memory situations.
  • Removed SELinux from my slice. I had SELinux enabled for the extra security, but the slight overhead involved made the difference between swapping or not.

After I finished migrating Firmant, I chose to migrate my blog as an experiment. This was made possible by the fact that WordPress relies upon MySQL (which I had now switched to). When Firmant goes stable I will convert and restructure my slice; until then, I will stick with MySQL.

The host on which my blog used to reside was roughly equivalent to my slice on Slicehost, except that it had a slower disk, was a physical server, and resided on a T-1 Internet connection.

Overall I highly recommend LigHTTPD for both Redmine and WordPress. If anyone needs help configuring it for either, feel free to email ‘me at this domain dot com’ or leave a question in the comments.

No Comments

git-daemon init Scripts on CentOS 5.2

I operate a CentOS 5.2 slice on Slicehost and use it for many personal projects. Rather than manually start git-daemon in the event of a server reboot, I wrote this init script to bring up git-daemon on boot.


#!/bin/sh
#
#   Startup/shutdown script for Git Daemon
#
#   Linux chkconfig stuff:
#
#   chkconfig: 345 56 10
#   description: Startup/shutdown script for Git Daemon
#
. /etc/init.d/functions

DAEMON=git-daemon
ARGS='--base-path=/home/git/ --detach --user=git --group=git'

prog=git-daemon

start () {
	echo -n $"Starting $prog: "

	# start daemon
	daemon $DAEMON $ARGS
	RETVAL=$?
	echo
	[ $RETVAL = 0 ] && touch /var/lock/git-daemon
	return $RETVAL
}

stop () {
	# stop daemon
	echo -n $"Stopping $prog: "
	killproc $DAEMON
	RETVAL=$?
	echo
	[ $RETVAL = 0 ] && rm -f /var/lock/git-daemon
	# report killproc's result to the caller, as start() does
	return $RETVAL
}

restart() {
	stop
	start
}

case $1 in
	start)
		start
	;;
	stop)
		stop
	;;
	restart)
		restart
	;;
	status)
		status $DAEMON
		RETVAL=$?
	;;
	*)
		echo $"Usage: $prog {start|stop|restart|status}"
		exit 3
	;;
esac

exit $RETVAL

Some quick notes:

  • It serves all git repos in /home/git/ that contain ‘git-daemon-export-ok’. To mark a repo as ok for export, create an empty file named ‘git-daemon-export-ok’ in its .git folder.
  • It requires the binary git-daemon to be present on the system. Many third-party CentOS repositories provide packages for this.
  • The user and group ‘git’ must exist for the daemon to drop its privileges when it runs.
  • I make no warranty regarding use of this script. Don’t blame me if it crashes or eats your first born son. You have been warned.
  • The structure of the script was extracted from several other init scripts on a CentOS 5 box. I do not recall which ones.
No Comments

Copyright © 2008 · Robert Escriva
Unless otherwise specified, all posts and articles are licensed under a Creative Commons Attribution 3.0 U.S. License.