This website is no longer maintained. My new homepage is rescrv.net.

Robert Escriva

Things I wish I learned earlier

Post withdrawn

I originally wrote a letter to Barnes and Noble trying to get clarification on the status of the developer mode for Nook Color. Looking back on the letter, it was much more charged with emotion than a rational post should be. I've decided not to post any more rants on my blog, and stick to what I do best: write about technical problems, or small hurdles I overcome in my day-to-day work. Many of the writings will be related to my research, and may include things such as undocumented (or poorly documented) features in libraries I find myself using.

All apologies to those who had to read the previous post.

Stirimango: Moving databases forward

Stirimango is my vision for database migrations done in Python.

I'm working on a new project that relies heavily on the database. For convenience, I've created a package for Python 3 that automates database migration.

As far as I know, the only similar package is South, and South is targeted at Django applications.

Keeping It Simple

Stirimango is written in Python and weighs in at 999 lines with 360 Python statements. Migrations are stored in Python packages and accessed with pkg_resources. A migration is a single Python file that contains three variables:

DESCRIPTION = '''
Create a table to hold products for sale.
'''

FORWARD = '''
CREATE TABLE products
(
    id INTEGER NOT NULL,
    title VARCHAR(16)
);
'''

BACKWARD = '''
DROP TABLE products;
'''

There are five Stirimango commands: init, backward, forward, status, and list.

list

List the migrations available in a package. This is inspired by the log functions of most VCS systems:

$ bin/stirimango list testdata.sample
2010-07-04T00:00:00_create_products_table

    Create a table to hold products for sale.

2010-07-05T00:05:00_products_add_description_column

    Add a column for describing the products to customers.

2010-07-05T00:05:50_no_description

2010-07-06T00:00:00_products_remove_description_column

    Remove the `description` column in the `products` table.

    The product descriptions will be stored somewhere else.

init

Initialization creates the table necessary to track which migrations have been applied, and the order in which they were applied:

$ bin/stirimango init -W -U rescriva -d rescriva
Password:

The -W, -U, and -d parameters specify database connection options, and are borrowed from the psql command-line utility.

status

Show the status of the migration package in the specified database:

$ bin/stirimango status -W -U rescriva -d rescriva testdata.sample
Password:
 Migrations Package
  |  Database
  V  V
 [✓][ ] 2010-07-04T00:00:00_create_products_table
 [✓][ ] 2010-07-05T00:05:00_products_add_description_column
 [✓][ ] 2010-07-05T00:05:50_no_description
 [✓][ ] 2010-07-06T00:00:00_products_remove_description_column

Rollback order:

The rollback order is specified so that if migrations are applied out-of-order (e.g. after a VCS merge), the user knows which migrations will be rolled back first.

forward

Moving migrations forwards applies unapplied diffs to the database:

$ bin/stirimango forward -W -U rescriva -d rescriva testdata.sample
Password:
$ bin/stirimango status -W -U rescriva -d rescriva testdata.sample
Password:
 Migrations Package
 |  Database
 V  V
[✓][✓] 2010-07-04T00:00:00_create_products_table
[✓][✓] 2010-07-05T00:05:00_products_add_description_column
[✓][✓] 2010-07-05T00:05:50_no_description
[✓][✓] 2010-07-06T00:00:00_products_remove_description_column

Rollback order:
0. 2010-07-06T00:00:00_products_remove_description_column
1. 2010-07-05T00:05:50_no_description
2. 2010-07-05T00:05:00_products_add_description_column
3. 2010-07-04T00:00:00_create_products_table

backward

Moving migrations backwards applies the reverse diffs to the database:

$ bin/stirimango backward --number 1 -W -U rescriva -d rescriva testdata.sample
Password:
$ bin/stirimango status -W -U rescriva -d rescriva testdata.sample
Password:
 Migrations Package
 |  Database
 V  V
[✓][✓] 2010-07-04T00:00:00_create_products_table
[✓][✓] 2010-07-05T00:05:00_products_add_description_column
[✓][✓] 2010-07-05T00:05:50_no_description
[✓][ ] 2010-07-06T00:00:00_products_remove_description_column

Rollback order:
0. 2010-07-05T00:05:50_no_description
1. 2010-07-05T00:05:00_products_add_description_column
2. 2010-07-04T00:00:00_create_products_table

Concluding Thoughts

Stirimango is most certainly an evolving tool. All migrations happen in serializable transactions, and database-level exceptions are propagated to the user. The py-postgresql package provides verbose exceptions, and it would be wasteful to shield this information from developers.
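A rough sketch of what applying a single migration looks like under that model follows. This is my own illustration, not Stirimango's actual code, and it assumes py-postgresql's postgresql.open/xact/execute API:

import postgresql

def apply_forward(dsn, migration):
    # e.g. dsn = 'pq://rescriva@localhost/rescriva'
    db = postgresql.open(dsn)
    # run the migration in a serializable transaction; any database error
    # propagates, verbose exception and all, to the caller
    with db.xact(isolation='SERIALIZABLE'):
        db.execute(migration.FORWARD)
        # a real tool would also record the migration in its tracking table here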

In the future I'd like to add the following features:

  • Colorful output. The status list should display applied migrations in green, unapplied migrations in yellow, and migrations that are only in the database in red.
  • Migration generator. It would be nice to have pre-built templates for migrations.
  • Check-pointing on migrations. Log each time the program is run, and which migrations were applied. It would be useful to have the backward command's default behavior be to undo the previous transaction, not all migrations.

Using Firmant to Publish a Wiki

In this post I'll show how to use Firmant to create a wiki in under 248 lines including four copies of the BSD license, a short README, and a sample wiki page.

I will try to cover all of my assumptions in this tutorial. If I leave any out, please don't hesitate to email 'me' at this domain for questions. You can also catch me as 'rescrv' on irc.freenode.net.

Overview of Firmant

Firmant is a static web framework written in Python. It is static, meaning no code is run to generate the response to a client's request; instead, all code is run at compile time (the time at which the site is "compiled" to a static website). It is a web framework as it facilitates the construction of websites/applications. It is written in Python because I ♥ Python (yes, Firmant supports unicode).

I've explained my rationale for creating such a project on the front page of the Firmant documentation.

Installing Firmant

As of this writing, Firmant 0.2.2 is available via PyPi. This is the latest version and was used in this tutorial. It is installable with easy_install and pip. I've only tested it using virtualenv, so feedback on other installation methods is appreciated (but not required).

The Thousand-Foot Picture

Firmant applications revolve around creating parsers, objects, and writers. A parser transforms some source (in all of my code the source is a filesystem hierarchy) into a set of objects. Objects are discrete units of information. The only requirement for an object is that it has a permalink (this requirement may be removed in the future). For instance, I have post, feed, and tag objects for my blog. Writers take a combination of all parsed objects and create the resulting output (typically in the form of html files).

Bootstrapping the Tutorial

The code referenced in the tutorial is available in usable form from my public git server.

To follow the tutorial you will need to install Firmant and its dependencies. I used Fedora 12 (but also use Firmant from Fedora 13). The dependencies include:

Other dependencies may be necessary (e.g. if you wish to run the doctests or build the documentation).

Creating the new project

I create a new project and include the complete license text of the BSD license. Whenever creating a new repository to be exposed to the public, my first commit always expresses my desired license or copyright so that others know that they may use the code I publish. You can see this in commit debbde7a815b1a89861748ada7ff435e4270b594.

The next step I do whenever building a Firmant-based application is to add a suitable Makefile and empty settings file. For those following along, this happens in commit 963010723e0e3373948ce434be34a3546c84ae86.

The key thing to note is the way in which the firmant script interacts with the settings and the environment. In the Makefile, my rules have the structure:

env FIRMANT_OUTPUT_DIR=preview \
    FIRMANT_PERMALINK_ROOT=file://`readlink -f preview` \
    firmant settings

Environment variables of the form FIRMANT_x=string will be added to the settings object as x = string. I do this so that I may have one config for the site, and override the output directories and permalink roots. The output directory is a local filesystem directory in which all output files will reside. The permalink root is the base url where the files will be published. In the above example, they correspond to a preview directory and the URL to its absolute path on the filesystem.
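Under the hood the mapping is simple. A minimal sketch of the idea (my own illustration, not Firmant's actual code; settings stands in for whatever settings object is in play) looks like:

import os

# Fold FIRMANT_* environment variables into the settings object, so that
# FIRMANT_OUTPUT_DIR=preview becomes settings.OUTPUT_DIR = 'preview'.
for key, value in os.environ.items():
    if key.startswith('FIRMANT_'):
        setattr(settings, key[len('FIRMANT_'):], value)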

I also take the opportunity to create a blank module for the application (seen in commit eb6a01c0467dfb4b9f63caa43a17f266b70d9029).

Creating Objects and Parsers

All objects must inherit from firmant.parsers.ParsedObject, which provides a constructor that accepts keywords corresponding to the object's slots. Object instances must provide an _attributes property: a dictionary that will be used as the attributes for URL mapping in Firmant. Because of this, objects automatically have a permalink property set to the URL derived from the object's attributes (don't worry too much about this for now; I'll elaborate on URL routing later. For now, we just need to determine the permanent URL for our object).

I've decided to use the reStructuredText syntax for our wiki object. Firmant provides special support for this using firmant.parsers.RstParsedObject. The RstParsedObject accepts an additional _pub attribute which corresponds to the docutils publisher object. The user is expected to declare _pubparts as a list of two-tuples. The first value of each tuple is the object's attribute. The second value of each tuple is the part provided by the docutils HTML writer.

Bringing this all together gives us (found in cfbf2707165c78a108ff9dd029c597f34997c2cf):

class WikiObject(parsers.RstParsedObject):
    __slots__ = ['path']

    _pubparts = [('content', 'fragment')
                ,('title', 'title')
                ]

    def __repr__(self):
        return 'WikiObject(%s)' % getattr(self, 'path', None)

    @property
    def _attributes(self):
        return {'path': self.path}

The representation is just for debugging purposes and not necessary. Notice how we have declared the explicit attributes (the path of the wiki object, e.g., CreatingAFirmantWiki), and we have implicitly declared content and title as being derived from the written docutils code. Our wiki objects are unique to the path at which they reside.

The parser for creating a WikiObject is not much longer (still in the same commit):

class WikiParser(parsers.RstParser):
    type = 'wiki'
    paths = '.*\.rst'
    cls = WikiObject

    @decorators.in_environment('settings')
    def root(self, environment):
        settings = environment['settings']
        return os.path.join(settings.CONTENT_ROOT, settings.WIKI_SUBDIR)

    def rstparse(self, environment, objects, path, pieces):
        attrs = {}
        attrs['path'] = unicode(path[:-4])
        attrs['_pub'] = pieces['pub']
        objects[self.type].append(self.cls(**attrs))

Here I have used the special firmant.parsers.RstParser. Starting at the top of the class declaration:

  1. type is declared to be wiki. This should be a human-readable string and will be displayed to the user when parsing.

  2. paths is a regular expression that declares which objects will be parsed. The path of every file under the directory returned by root (relative to root) will be tested against this regex, and only those that match will be parsed.

  3. cls is defined to be WikiObject. This is just a good practice as it makes it easy to copy a parser and change it to create objects of a different type.

  4. root is a function that returns the path to the root on the filesystem where all objects of this type reside. I've made this configurable using the settings object. Only objects under root that match the paths regular expression will be parsed.

  5. rstparse is a function specific to the RstParser. It takes an environment (e.g. where settings are defined), dictionary of objects (possibly empty), the path to the file from which this object was derived, and the pieces parsed from the object. For now we're only concerned with the docutils publisher object ("pub").

    The rstparse function is expected to append the parsed object to lists in the dictionary. It does this instead of simply returning the object as this will allow multiple objects to be created from a single parser in the future (e.g., LaTeX embedded in a post is parsed into an object when the post is parsed, and then this is written to an image file).

With the new object we must declare several settings:

# The wiki's source files reside in the current directory (assuming the make
# file is invoked from this directory).
CONTENT_ROOT = '.'

# This enables our custom wiki parser.
PARSERS = ['firmantwiki.WikiParser']

# The directory (under CONTENT_ROOT) used for storing wiki documents.
WIKI_SUBDIR = 'wiki'

# The URL mapping.
from firmant.routing import components as c
URLS = [c.TYPE('wiki') /c.PATH]

# Permalinks for our wiki objects must be to the html rendering.
PERMALINK_EXTENSIONS = {'wiki': 'html'
                       }

The only two settings that really need explanation are URLS, and PERMALINK_EXTENSIONS.

URLS
This is used for URL routing. This will be explained more in a later section.
PERMALINK_EXTENSIONS
A dictionary mapping types to the extension that is used for the permalink. For example, declaring "html" asks the URLMapper (explained later) to map the permalink for this object to an HTML document. A value of None implies that the constructed URL will contain an extension. This will become clearer in the URL routing section.

Creating a Sample Wiki Page

I added the following wiki page as wiki/index.rst (commit 2c825b17eab148bfdf18cde7c805cf1b27adf3e8 for those with a score card):

Firmant
=======

Firmant is a framework for developing static web applications.

Much of today's web development focuses on developing dynamic applications
that regenerate the page for each view.  Firmant takes a different approach
that allows for publishing of static content that can be served by most http
servers.

Some of the benefits of this approach include:

 * Build locally, deploy anywhere.  Many notable server distributions
   (including CentOS 5, and Debian Lenny) still ship old (pre-2.6) versions
   of Python.  With Firmant, this is not an issue as static output may be
   published anywhere independent of the system where it was built.
 * Quicker page load times.  Search engines and viewers expect near-instant
   page load times and static content can meet these expectations.  Dynamic
   content can as well; however, it often requires more than simple hardware
   to do so.
 * Offline publishing capability.  Previewing changes to a website does not
   require Internet access, as the changes are all made locally.  Changes do
   not need to be pushed to a remote server.
 * Store content in revision control.  This is not strictly a feature granted
   by generating static pages.  Firmant is designed to make storing all
   content in a repository a trivial task -- something that web application
   frameworks that are powered by relational databases do not consider.

This wiki page will have a path of index, and a title of Firmant. The content will be an HTML version of the body of the page. If we were to run make right now we would see:

env FIRMANT_OUTPUT_DIR=preview \
    FIRMANT_PERMALINK_ROOT=file://`readlink -f preview` \
    firmant settings
INFO:firmant.application.Firmant:firmantwiki.WikiParser parsing 'index.rst'
xdg-open preview/index.html

It's clear that the parser is actually parsing the index wiki object, but our web browser does not show any output. To actually see the parsed page, we need to create a writer.

Creating a Writer

Firmant makes rendering HTML using Jinja2 as easy as pie. Writers must inherit from firmant.writers.Writer. For convenience, I've also created firmant.writers.j2.Jinja2Base which enables easy rendering of Jinja2 templates to the filesystem.

The entire writer code is:

class WikiWriter(j2.Jinja2Base, writers.Writer):
    extension = 'html'
    template = 'wiki.html'

    def render(self, environment, path, obj):
        context = dict()
        context['path'] = obj.path
        context['page'] = obj
        self.render_to_file(environment, path, self.template, context)

    def key(self, wiki):
        return {'type': u'wiki', 'path': wiki.path}

    def obj_list(self, environment, objects):
        return objects.get('wiki', [])

From top-to-bottom of the class declaration:

  1. extension is the extension the writer will use when writing the wiki objects. This will make more sense when I introduce URL routing.
  2. template is the name of the Jinja2 template that will be used when rendering the wiki objects.
  3. render populates the Jinja2 context and calls the render_to_file helper function. Typically this is as short as creating a dictionary and calling render_to_file.
  4. key is a function that produces a URL attribute dictionary from a single object. (Once again, more on this in the URL Routing section.)
  5. obj_list is a function that returns a list of Python objects. In this case it is simply a list of wiki objects. Other writers I've written (e.g., for blogging) return lists of objects that have been grouped (e.g., by date for an archive view). The only requirement is that render and key are able to make sense of each item in the list returned.

Just as with parsers, the writers rely heavily upon the template method pattern to drive the whole process.

Some additional settings are needed as well:

# This enables our custom wiki writer.
WRITERS = ['firmantwiki.WikiWriter']

# The directory (under CONTENT_ROOT) used for storing wiki documents.
WIKI_SUBDIR = 'wiki'

# Load Jinja2 templates from the filesystem.
import jinja2
TEMPLATE_LOADER = jinja2.FileSystemLoader('templates')

Running make shows:

env FIRMANT_OUTPUT_DIR=preview \
    FIRMANT_PERMALINK_ROOT=file://`readlink -f preview` \
    firmant settings
INFO:firmant.application.Firmant:firmantwiki.WikiParser parsing 'index.rst'
INFO:firmant.application.Firmant:firmantwiki.WikiWriter declared 'file:///home/rescriva/projects/firmantwiki/preview/index/'
INFO:firmant.application.Firmant:firmantwiki.WikiWriter rendered 'preview/index/index.html'
xdg-open preview/index.html

Notice how the writer declares the URL '.../preview/index/' for the index document while it writes the file to preview/index/index.html. This logic is powered by the URL routing backend. I'll be elaborating more on this in the next section (as I have promised throughout this post).

URL Routing

The firmant.routing module provides a mapping between a dictionary of attributes and a URL. Those familiar with the lambda calculus or logic-based languages such as Prolog will recognize the behavior to be similar to unification.

Routing revolves around the concept of a path. A path has a set of attributes split between bound and free attributes. Naturally, the bound and free attributes for a path are disjoint, while their union is the set of all attributes for the path. If a path matches a set of attributes, it is possible to construct a path from the attributes.

This lends itself to several possibilities:

Single Path Component
An attribute is converted to its string representation, and matches if and only if the attribute names are the same.
Bound Null Path Component
Similar to the SinglePathComponent, but the path is always constructed to be empty.
Static Path Component
The opposite of a bound null path component. This matches when the set of attributes is empty, and always is constructed to the empty string.
Compound Path Component
Several path components joined together. Each component will be constructed using the appropriate set of attributes, and the constructed strings will be joined with '/'. See the documentation for more details. The '/' operator is overloaded to join path components.
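To make the matching idea concrete, here is a toy sketch of the unification check described above. It is purely conceptual; Firmant's routing classes are richer than this:

# a path matches a set of attributes when every bound attribute is present
# with the same value and the remaining attributes exactly cover the free ones
def matches(bound, free, attributes):
    return (all(attributes.get(k) == v for k, v in bound.items())
            and set(attributes) - set(bound) == set(free))

# the compound path c.TYPE('wiki') /c.PATH binds type and leaves path free
print matches({'type': 'wiki'}, set(['path']), {'type': 'wiki', 'path': 'index'})  # True
print matches({'type': 'wiki'}, set(['path']), {'type': 'post', 'path': 'index'})  # False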

The firmant.routing.components package contains several pre-built path components.

If you haven't guessed already, the URLS setting contains a list of components (typically compound components). In our case we just have:

URLS = [c.TYPE('wiki') /c.PATH]

We can see that this declares a URL that has the attributes type=wiki and path. Internally, the firmant.routing.URLMapper object will turn a set of attributes that unifies with this (leaving no free attributes) into a path that is equal to the wiki object's path attribute. Our writer's key method returns just the right set of attributes to unify with these attributes. This is how the writer knows which URL to use for each written object. The URLMapper is able to return both a local filesystem path (relative to the output directory) and a full URL for any set of attributes (with the PERMALINK_ROOT as the base).

This URL mapping system allows for reconfiguration of the output paths and URLs for each set of objects a writer will write. For example, if I wish for the index.rst wiki document to be a special document that doesn't reside at '/index/', but rather at '/', it is as simple as creating a rule with higher priority that only matches this document:

URLS = [c.TYPE('wiki') /r.BoundNullPathComponent('path', 'index')
       ,c.TYPE('wiki') /c.PATH
       ]

The first URL entry will construct to the empty path '/'. Firmant will append 'index.html' to the local filesystem path so that when an HTTP client requests '/', it will be served '/index.html'. The result is so-called "clean" URLs for all pages.
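To make this concrete, here is the output layout that results from the two-rule URLS above, assuming the wiki also contained a hypothetical second page named OtherPage.rst:

preview/index.html              served when a client requests '/'
preview/OtherPage/index.html    served when a client requests '/OtherPage/'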

Creating A Blog

Firmant started as a blogging platform and slowly evolved to support much more. Similar steps to those taken here can be used to create a blog. The Makefile is the same as for a wiki. Firmant conveniently provides a basic configuration suitable for a blogging platform as firmant.settings.

If you're interested in building a blog, consider cloning either the Firmant blog or the CHASM blog. The common configuration requires a bare settings file, and a small template configuration file. I'll be posting more about creating a blog with Firmant in a future post.

Conclusion (AKA Where is this going from here)

Firmant is still a very young platform for development. It's only a year and a half old, but the current iteration (with static files) is less than six months old.

I'm hoping to add polish to both the code and documentation in future releases to make it more easily usable by those who are not inside my head (including myself). Users or developers interested in following Firmant can join the Firmant mailing lists to be alerted to new releases, or participate in the development of new releases.

As usual, you can find 'me' at this domain using SMTP, or 'rescrv' on irc.freenode.net #firmant.

CHASM vs. BitTorrent

I received an email as a result of my last blog post about CHASM suggesting we look into using BitTorrent to distribute files with CHASM, thus bypassing much of what we plan to implement.

While I will certainly be the first to admit that CHASM is heavily inspired by BitTorrent, I do not feel as though BitTorrent is the best solution for this problem.

I'm going to give an overview of how BitTorrent works, followed by the parts of it that would be problematic for mirroring, followed by the parts of it worth borrowing.

How BitTorrent Works

BitTorrent is based around the concept of a swarm of users. A third party known as the tracker is responsible for receiving announce requests from users, and putting users in touch with each other.

Each torrent file has a unique identity known as an info hash generated from its contents. This hash depends upon the info dictionary of the torrent. When packaging multiple files for distribution, the torrent file is created by considering the concatenation of all files in the archive and breaking the data into fixed-size pieces (typically 256 KB), each of which is hashed. Thus, when two users are exchanging data they can refer to pieces by their index in the list of pieces.

For a 650 GB collection of files, there would be approximately 166,400 pieces, assuming a 4 MB piece size.

When peers announce themselves to the tracker, they announce using the info hash, which is derived from the info dictionary which in turn is derived from the contents to be distributed.

In this manner, peers sharing exactly the same files can talk to the tracker and find each other without the tracker having any clue as to what is contained within the torrent.

While this works great for static files, it is not a good fit for CHASM.

BitTorrent doesn't provide a good base for mirroring

BitTorrent was designed for immutable data. Collections such as Fedora are changing on a daily basis (or, ideally, more often than that). Additional tools would need to be built on top of BitTorrent to account for this. There are three methods I've seen suggested:

  1. The complete archive is packaged as a torrent and distributed on each update.
  2. Only the changed files are distributed on each update.
  3. A hybrid approach in which both methods are employed using some algorithm for determining which is more efficient.

Clearly the first option is not feasible as the size of the archive grows. Hashing each file again to match pieces after each update is simply impractical, especially when each update only changes a few files. The SHA-1 hash is used within BitTorrent, so we cannot re-use the results of previous piece hashes to calculate new hashes after the point at which files change (unless we use an append-only method of changing the archive, and thus do not shift the location of any bytes in the pieces array).

The second option is infeasible unless all update torrents necessary to reconstruct the collection are well seeded. In cases where an update is not well seeded (e.g. it is a week old), the mirror must fall back to traditional means to get back into a state where it may use the torrent-based updates to catch up with the master.

The third option is ruled out simply because it does not offer anything to correct for deficiencies in the first two options.

Where does that leave CHASM?

CHASM is borrowing from BitTorrent's design in that it is identifying transferable chunks by cryptographic hash, and offset in the list of hashes.

CHASM hashes full files instead of fracturing files into pieces. This has two benefits. First, a file's hash is determined solely by the contents of the file. There is no cascade of hash changes if a byte is inserted into the first file in the manifest. Second, traditional hard disks are much more efficient when seeking is minimized. John "warthog9" Hawley explains in more detail in Issues in Linux Mirroring: Or, BitTorrent Considered Harmful.
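A toy example makes the difference plain. The sketch below is purely illustrative (artificially small pieces, not real BitTorrent or CHASM code): inserting one byte into the first file changes every piece hash from that point onward, while the per-file hashes of the untouched files stay the same.

import hashlib

PIECE = 4                                    # artificially small piece size
files = ['alpha', 'bravo', 'charlie']        # file contents, concatenated by BitTorrent
changed = ['Xalpha', 'bravo', 'charlie']     # one byte inserted into the first file

def piece_hashes(contents):
    data = ''.join(contents)
    return [hashlib.sha1(data[i:i+PIECE]).hexdigest()
            for i in range(0, len(data), PIECE)]

def file_hashes(contents):
    return [hashlib.sha1(c).hexdigest() for c in contents]

before, after = piece_hashes(files), piece_hashes(changed)
print sum(1 for a, b in zip(before, after) if a != b)      # every piece hash changes
print file_hashes(files)[1:] == file_hashes(changed)[1:]   # untouched files keep their hashes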

Hashing whole files also allows several versions of the manifest (the CHASM equivalent of a torrent) to be analyzed simultaneously, and thus the overhead of maintaining additional manifests scales linearly with the size of the delta.

We've taken everything a step further and are actually building cache-awareness into the peer-to-peer protocol. An organization with enough Tier-1 mirrors will be able to serve all of the updates pushed out of the filesystem cache without ever touching disk. While this is a small detail, it certainly will prolong the life of mirrors' disks as well as offer potential performance improvements.

Other approaches considered

We store all files of the collection in a pool wherein they are identified by their checksum. For a short while we considered simply syncing this pool over rsync, and avoiding the hassle of implementing our own protocol. This idea was tentatively rejected as it didn't offer the cache-awareness we were looking for, and required that rsync be used to perform the transfer. While using rsync is not inherently bad, it doesn't have the ability to pull from multiple upstreams without explicitly coding support for running parallel instances of rsync. If we were going to go so far as to write a protocol to negotiate what to transfer, we figured we might as well go the extra step and transfer the files too (file transfer is trivial compared to the handshaking).

Contacting the author

As usual, I welcome any and all feedback or criticisms (I like constructive criticism, but if you don't feel you can be constructive, I'll take what I can get).

IRC
You can find me on irc.freenode.net in the #chasmd channel.
Email
Email 'me' at this domain if you want to get in touch.

Bootstrapping Python Projects

I've spent much of my free time over the last year or two working on projects written in Python. In that time I've accumulated several practices I use when bootstrapping a new project. I wouldn't call them "best practices" as they haven't been vetted by the community, or really anyone other than myself. I'm hoping that in sharing these practices I will help others to set up a good framework for their projects as well as receive some feedback on my own projects' structure.

I'll be sharing how I use the standard library's doctest module, Georg Brandl's sphinx package, Logilab's PyLint program and Ned Batchelder's coverage module to keep the number of inconsistencies and defects in both my code and documentation low.

Writing Doctests

All tests within my code are done using the standard library's doctest module. I use doctests because they allow me to describe the expected behavior of my code in the code so that when I am reviewing my work I may easily see what behaviors I explicitly declare to be expected.

Introduction to Doctests

For those not familiar with docstrings or doctests: a docstring is a literal that is the first statement in a module, class, or function. A doctest is a string, embedded in the docstring, that mimics the form of the Python interactive interpreter.

When the doctest is run, the values returned by statements are compared with the values specified by the doctest. If a mismatch occurs, the test fails. For instance, this test passes:

>>> None
>>> True is True
True
>>> False is True
False
>>> 'hello world'
'hello world'

while this test fails:

>>> 'hello world'
'not hello world'

We have stated that the output of the statement 'hello world' when run in the interactive interpreter must be 'not hello world', but this cannot be the case, as the literal 'hello world' evaluates to itself.

Keep in mind that any output printed to stdout by the statement will be considered for the doctest. Another example:

>>> def say_hello(to_whom):
...     print 'Hello', to_whom
...     return 'I said hello to %s' % to_whom
...
>>> say_hello('Shirley')
Hello Shirley
'I said hello to Shirley'

Notice that the output to standard out is described just as it would appear if the function say_hello were called in a normal program. The return value is described using the same syntax used by repr.

More details of the doctest format can be found in the documentation.

Doctest Gotchas

Doctests can have some unexpected behavior with regard to scope. Consider the following:

def write_something(what, where=sys.stdout):
    '''
    >>> write_something('Hello World')
    Hello World
    '''
    where.write(what)

and compare it to this:

def write_something(what, where=None):
    '''
    >>> write_something('Hello World')
    Hello World
    '''
    where = where or sys.stdout
    where.write(what)

While it appears at first glance that these two functions are the same, there is a subtle error with the scope of sys.stdout. When doctest runs, it temporarily replaces sys.stdout with its own stream in order to capture output. In the first example, where is bound to the real sys.stdout at the time the function is defined, so the output bypasses doctest's capture and the test fails. The second example falls back to whatever sys.stdout is at call time, so the output is captured as expected. This is a key thing to keep in mind as you write code to be used with doctests. While I wouldn't call the first example wrong, I would say that the second example is certainly better (especially when you are developing a library-like module).

Tests are no Substitute for Good Code

I've seen a lot of debate recently on the merits of unit testing. Some claim that testing is overrated. I've seen some that claim that test driven development is the holy grail of software. Others claim that testing is near useless when the individual writing the tests is the individual who wrote the original code.

I take a different stance: my tests exist to verify behavior that I document in my program. I can easily make changes to the program and see when things no longer work as documented. It's comforting to finish refactoring code over the weekend and know that it still works as my users expect (or at least to know that they can find the new behavior by looking at the tests).

Running Tests (with coverage)

I've constructed a small script to run the doctests found in a given list of modules. It enables per-module setup and teardown functions as well as the declaration of extra globals that will appear in the test namespace:

#!/usr/bin/env python

import unittest
import doctest
import sys

from minimock import Mock
from pprint import pprint
from pysettings.modules import get_module

def main(modules):
    suite = unittest.TestSuite()

    for module in modules:
        mod = get_module(module)
        args = {}
        extraglobs = {'Mock': Mock
                     ,'pprint': pprint
                     }
        for arg, attr in [('globs', '_globs')
                         ,('extraglobs', '_extraglobs')
                         ,('setUp', '_setup')
                         ,('tearDown', '_teardown')
                         ]:
            if hasattr(mod, attr):
                args[arg] = getattr(mod, attr)
        extraglobs.update(args.get('extraglobs', dict()))
        args['extraglobs'] = extraglobs
        suite.addTest(doctest.DocTestSuite(mod, **args))

    results = unittest.TextTestRunner(verbosity=2).run(suite)

    if not results.wasSuccessful():
        sys.exit(1)

if __name__ == '__main__':
    main(['module'
         ,'module.foo'
         ,'module.bar'
         ,'module.baz.quux'
         ])

In each module, the _setup and _teardown functions perform setup and tear-down before and after each docstring's tests are run. It's important to note that the functions will not be run between each test block within a docstring.

The _globs and _extraglobs dictionaries will be copied into the namespace in which the doctests execute before _setup is called. It is important to note that _globs will replace all globals, while _extraglobs will only override the keys it defines.
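For reference, here is the sort of per-module hook I mean. The names follow the convention used by the runner above (doctest passes the DocTest instance to the setUp and tearDown callables), and the module itself is hypothetical:

# module/foo.py
import shutil
import tempfile

_extraglobs = {'ANSWER': 42}        # available in every doctest of this module

def _setup(test):
    # runs before the tests in each docstring; stash state in the test globals
    test.globs['tmpdir'] = tempfile.mkdtemp()

def _teardown(test):
    shutil.rmtree(test.globs['tmpdir'])

As for the coverage figures the section title promises, I run the script under Ned Batchelder's coverage tool. With a recent coverage and the runner saved as tests.py (the filename is up to you), the invocation is roughly:

$ coverage run tests.py
$ coverage report -m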

Writing Documentation

Writing quality, tested code is great, but expecting users of the code to dig into the code to determine how it works is generally not a path to happy users. This is where the excellent Sphinx documentation tool comes in. Sphinx uses the reStructuredText format (the same format I use for Firmant) for writing documentation.

Automatically Generating Documentation from Docstrings

As of version 0.6.6, Sphinx does not automatically construct a documentation tree that mimics the structure of a Python package. It appears this will be possible in version 1.0 of Sphinx. As a temporary measure I developed the technique shown in this section for having Sphinx generate documentation for all modules in a package.

Sphinx includes an autosummary extension. Using this extension, it is possible to read the contents of a rst file and generate documentation for all modules listed in the autosummary directives contained in the file. At first, I just had one document that listed all modules; however, this quickly became unclean as it turned a hierarchy of modules into a flat list of modules. While this is a matter of taste, it feels more intuitive to me to have module foo.bar.baz referred to from module foo.bar instead of module foo.

To accomplish this, I have an rst doc (I call it doc/modules.rst) that refers to every module for which I wish to generate documentation. An example looks like:

All Shipped Modules
===================

.. autosummary::
   :toctree: generated

   firmant.application
   firmant.chunks
   firmant.decorators
   firmant.du

Each of the modules listed (firmant.*) will be auto-generated and stored in the SPHINXSOURCEDIR/generated/module.rst. For firmant.application this translates to doc/generated/firmant.application.rst.

In order to take advantage of the autosummary generation it is necessary to add the following to the Sphinx conf.py:

unused_docs = ['modules']
autosummary_generate = ['modules']

The unused_docs entry prevents Sphinx from generating a warning if modules is not linked from other documents, and autosummary_generate tells the autosummary extension which documents to scan for autosummary directives.

I added a rule to my Makefile to perform autosummary generation using sphinx-autogen:

autogen:
    env PYTHONPATH=. $(SPHINXAUTOGEN) -o doc/generated doc/modules.rst

In the future, I'd like to have this file be auto-generated as well, so that all tested code is automatically included in the documentation.
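As a stopgap, a short throwaway script can already do the job. This sketch is my own (it assumes the package lives in ./firmant and writes doc/modules.rst; adjust to taste):

import os

HEADER = '''All Shipped Modules
===================

.. autosummary::
   :toctree: generated

'''

def modules(package):
    # walk the package directory and yield dotted module names
    for dirpath, dirnames, filenames in os.walk(package):
        if '__init__.py' not in filenames:
            continue
        base = dirpath.replace(os.sep, '.')
        yield base
        for name in filenames:
            if name.endswith('.py') and name != '__init__.py':
                yield '%s.%s' % (base, name[:-3])

out = open('doc/modules.rst', 'w')
out.write(HEADER)
for mod in sorted(modules('firmant')):
    out.write('   %s\n' % mod)
out.close()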

Tested Examples

Earlier I mentioned that I use the doctest module to test all of my code. Using doctests allows me to test every example in my documentation. Earlier I showed that the format of the doctests was just like a copy-and-pasted snippet from the interactive interpreter. If, instead, we use the following syntax, then we can benefit from some additional features present in Sphinx:

.. doctest::

   >>> print 'Hello world'
   Hello world

.. doctest::
   :hide:

   >>> print "I'm hidden!"
   I'm hidden!

If this snippet is present in a docstring, then the doctest module will run each of the examples; however, Sphinx will only display the first one in the generated html. For this reason I use this format for all of my doctests and put setup/teardown code into hidden tests. Details such as mock objects should not leak into the documentation, but are necessary for many tests to run.

Avoid Errors, Warnings, and Violations of Convention

I use the pylint command-line tool to check for potential problems within my code. Most of the points raised will simply be violations of convention; however, pylint will catch many of the common errors that lead to hard-to-debug code. For instance, having a default value for a function be [] (or another mutable value) instead of None is an error caught by pylint. The actual hazard of mutable default arguments is explained elsewhere; a quick illustration follows.
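Here is the hazard in miniature (plain Python behavior, nothing pylint-specific): the default list is created once, when the function is defined, and shared across calls.

def append_item(item, items=[]):
    items.append(item)
    return items

print append_item(1)    # [1]
print append_item(2)    # [1, 2] -- the "empty" default remembers the first call

# the idiom pylint steers you toward: use None and build the list per call
def append_item_safe(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items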

I invoke pylint with the following rule in my Makefile:

pylint:
    pylint --rcfile=pylintrc firmant

This assumes the file pylintrc is present in the current working directory from which the Makefile will be invoked.

I use a potentially different pylintrc for each project, as some projects require disabling warnings or enabling additional variable patterns.

For example, I amend the good-names and additional-builtins settings to include the _ character so that I may use it for gettext support throughout my applications.

Take the time to read through the pylintrc and tweak it to reflect the conventions and preferences you wish to see in your application. With rare exception, I prefer to disable messages with an embedded # pylint: disable-msg=XNNNN rather than disabling the messages globally. This allows for fine-grained control over the messages emitted by pylint as it is rarely appropriate for a message to be universally disabled.

Publishing Your Package

Making your package easily accessible to users is the best way to ensure that people will take the time to consider it. Software which requires extra effort from the user will generally be disregarded. Give your project every opportunity to be adopted on a wider scale.

Creating a setup.py

Python setup files are currently at a critical transition point. For more information on this, see the discussion on Distutils2 vs. Pip. Until the plethora of options unifies, I'm sticking with distutils as it works well and provides everything I need.

Here is an annotated version of my setup.py from Firmant's repository:

from distutils.core import setup


# Categories to allow others to find the package.
classifiers = [ 'Development Status :: 4 - Beta'
              , 'Intended Audience :: Developers'
              , 'License :: OSI Approved :: BSD License'
              , 'Operating System :: MacOS :: MacOS X'
              , 'Operating System :: POSIX :: Linux'
              , 'Operating System :: Unix'
              , 'Programming Language :: Python :: 2.6'
              , 'Topic :: Internet :: WWW/HTTP :: Site Management'
              ]

setup(name='Firmant',
      # More on version numbers below.
      version='0.2dev',
      # I include my IRC nickname (on irc.freenode.net) so others may
      # contact me.
      author='Robert Escriva (rescrv)',
      author_email='firmant@mail.robescriva.com',
      # Each directory including an `__init__.py` should be listed here.
      packages=['firmant'
               ,'firmant.parsers'
               ,'firmant.routing'
               ,'firmant.templates'
               ,'firmant.utils'
               ,'firmant.writers'
               ],
      # This pulls in my bundled templates and exposes them via
      # pkg_resources.  I'm not entirely sure on this part.
      package_dir={'firmant': 'firmant'},
      package_data={'firmant': ['templates/*.html',
                                'templates/*/*.html']},
      # Scripts to be installed into the PATH.
      scripts=['bin/firmant'],
      # The URL for the project.  Some people just list the PyPi page.
      url='http://firmant.org/',
      # The license for your project.  You do have a license, don't you?
      # This should match the trove classifier you use above.
      license='3-clause BSD',
      # A one-line description of your project.
      description='A framework for static web applications.',
      # More on this below.
      long_description=open('doc/README.rst').read(),
      # Include the classifiers from above.
      classifiers=classifiers,
      )

PEP 386 provides a good rundown of version numbers. After each release, I bump the version number and append a dev suffix. When it comes time to release, I make a branch releases/X.Y. Each individual release is a tag releases/X.Y.Z. Before each tag, I update setup.py and doc/conf.py to reflect the new version number. In the future I think it would be useful to merge setup.py and doc/conf.py to keep them consistent.

The packages listing is not recursive, so be sure to include each nested package. In Python, a package is a directory containing an __init__.py and zero or more Python modules.

Turning a setup.py into a page on PyPi is as easy as:

$ python setup.py register
$ python setup.py sdist upload

Just don't forget to register your PyPi account, and to change your version to something other than *dev.

Consistent Project Description

I want the index page of my documentation, and my PyPi page to be consistent. The long_description line reads the contents of the doc/README.rst file. This same file is included in the index page of my documentation using .. include:: README.rst.

Overall, it seems to work well to eliminate any skew that could occur between the two descriptions of the project.

Concluding Thoughts

The contents of this post are the result of several late-night research sessions into Python best practices. I am by no means a Python expert; I just observed what practices go on within the community and tried to best encapsulate them in a repeatable process that I could use for my many one-off projects.

There will most likely be slight errors, violations of Python convention, and downright incomprehensible statements in the above, as I wrote it over the course of two weeks and do not have an editor to review it (and when I review it, I know what I meant, so I gloss over its flaws).

You can email 'me' at this domain with your questions/comments/criticism on the techniques I described. You can also get me on irc.freenode.net #firmant.

Hello Fedora Planet

For months I've been telling myself I should push portions of my blog to Fedora Planet. Today I'm doing so as part of Fedora Summer Coding 2010. I have the privilege of being a project mentor in the program this summer helping Matt Mooney with the CHASM project. In this post I'll tell you a little more about the project's goals as well as introduce myself a little more.

Goals of CHASM

I came up with the idea for CHASM in the summer of 2009. At that time I was the sole systems administrator for the Rensselaer Center for Open Source at RPI. Due to some events in my personal life I was not able to check up on the mirror for several days to a week. In that time the script that I had borrowed from the community documentation for the distribution went haywire. rsync has a tendency to accumulate temporary files of the form .~tmp~. I would assume the user account creating the repo is different from the user running rsyncd, as I did not have permission to mirror these files. As a result, the check to see if rsync succeeds (exit code 0) would never evaluate to true. The overall result was that the mirror got stuck in a loop wherein it was continually trying to synchronize.

Looking back on it, I should have checked the script to make sure it was correct.

This experience led me to the following observations about mirroring large volumes of software:

  • Correctness The tools used should be more than a random collection of scripts that manipulate rsync or ftp. From my research it appears that Fedora is the best in this regard with Mirror Manager. Debian has the ftpsync scriptset (although it appears that Ubuntu has not taken advantage of this and simply has an rsync script).

  • Efficiency rsync is great for point-to-point transfers, but with many nodes arranged in a tree, it does much more than is necessary. Each pair of nodes in the tree must re-establish what needs to be exchanged to complete the transfer.

    I spoke with Peter Poeml of the MirrorBrain project and he indicated that within the mirror infrastructure of SUSE, they have each node maintain a list of things that need to be pushed to its children. On the next sync, the accumulated list is passed to rsync.

  • Integrity A systems administrator should be able to verify the integrity of a mirror should hardware fail or a malicious user break in. rsync provides for this using the --checksum option, but with a big caveat: both ends of the connection hash all files considered for transfer.

    This is less of an issue when bandwidth and I/O are plentiful, but becomes an issue when mirrors are already near capacity without performing checksumming operations. As a result, Fedora recommends prohibiting the use of the --checksum option.

    A systems administrator should be able to verify the content they are providing is genuine. In the ideal case this should be possible without a network connection.

We (Ben, Joe, and myself) designed CHASM to meet each of these goals. It's designed to enable one machine to share files with many geographically distributed (from a network perspective) machines. It is optimized for when there are a large number of files in the collection, but only a small number of files change with each update.

A manifest is produced which identifies all metadata of the collection. A nice side effect is that all files are identified by a cryptographic hash. Two parties with the same manifest can easily communicate which files are available for transfer by relying upon this manifest. In experiments, a 500G mirror of Fedora (excluding certain architectures and rawhide) took ~100MB of disk space to hold the uncompressed manifest and ~64kB to describe availability of files to peers with the manifest. Compression reduces the manifest to ~30MB.

More information about CHASM can be found in its technical design document. I will be posting its URL on the CHASM Blog in the near future.

A Little About Me

I'm a 2010 graduate of Rensselaer Polytechnic Institute where I received a B.S. in Computer Science. At RPI much of my work in the open source community was sponsored by the Rensselaer Center for Open Source (RCOS). CHASM is the second project I worked on for RCOS. The first project was Firmant, which is a static web framework. It is easily usable as a blog out of the box. In fact, this blog and both the Firmant and CHASM blogs are powered by Firmant.

My interests focus primarily on improving the infrastructure for open source development. Both Firmant and CHASM evolved from my experiences with the open source program at RPI and are designed to fit a niche that I felt was not adequately filled. Some other projects I've been working on recently are also designed to meet needs I personally have, and I hope they'll help others as well.

Wrapping up

I'm happy to be offered the chance to work as a mentor in the Fedora Summer Coding program (do I get a shirt as well?). I hope to interact with the community more in the future as I've done a poor job of it so far.

Thank you for taking your time to read my post. Feedback from the community is appreciated via email (not published here, but a creative individual could find it) or on irc (rescrv on irc.freenode.net #rcos #chasmd #firmant).

Using Git

In September 2009, and then again in February 2010, I gave a lecture on Using Git. We recorded video both times. The first recording did not turn out well, as my computer could not transcode on the fly (from the webcam).

The second time around I was grateful to have Peter Hajas on hand to help record the video. The original source was 1080p. I've scaled it to 720p and posted it in vorbis/theora format.

Due to bandwidth limitations I'm only serving this via torrent. Without further ado, here is Using Git (torrent).

RCOS: Thriving in the Open Source World

I'm posting my slides from my presentation to RCOS last Friday. The slides focus on many of the basic points you need to take into account to start/join a successful open source project.

I like making my titles more extravagant than the presentations, so don't expect it to be too useful if you already have experience with starting open source projects.

I also encourage people to think contrary to my advice. I only offer one approach, and I do recognize the value of dissenting opinions. Feel free to discuss yours with me.

Here are the slides.

The Final Sprint at RPI

I'm enjoying my last few days at RPI and thought that it would be appropriate to blog about what I've been up to lately.

SCNARC Research

I've been working with the Social and Cognitive Networks Academic Research Center at RPI researching trust in social networks. I'm going to be continuing this research into the summer.

Additionally, I'm researching the stabilities of communities in social networks (mainly Twitter and Live Journal) in an attempt to predict long-term stability of a community using static snapshots of the network.

CHASM

I've not worked on CHASM nearly as much as I want to. Part of this is due to an unfortunate hard disk failure in which I lost much of what we're researching.

This summer I should be devoting 15-20 hours per week to CHASM (realistically it will be more, but only 15-20 hours will be in front of a keyboard).

Firmant

My beloved blog engine is growing up. This site as well as several others run the latest version of Firmant without issue. Stay tuned for a 0.2.0 release in the next few weeks.

Commencement

Saturday I'm graduating Summa Cum Laude from RPI with my B.S. in Computer Science.

RPM Bot

I've been working on a project to automate the rebuilding of packages. This came about because I wanted to be able to automatically rebuild/backport RPM packages with custom fixes. For instance, Fedora 12 does not ship a version of rxvt-unicode that supports 256 colors. I wish to have the 256-color patch enabled. Until my request in the Fedora bugtracker is answered, I'd still like to be able to have 256 colors. RPM Bot will give me this.

Gitswitch

I'm a little fed up with current git hosting plans. I'm going to be putting a couple hours a week into gitswitch this summer and hopefully finish it by the end of August.

There's probably many other things I've forgotten to mention above. As I remember them I'll blog about them.

ISTS 7 @ RIT

RPISEC sent two teams to compete in the ISTS competition held by SPARSA at RIT.

To quote the official press release:

RPISEC sent two teams to Rochester, NY this weekend to participate in the 7th Information Security Talent Search held by the Security Practices and Research Student Association (SPARSA), taking home first and second places.

Team 2
  • Alex Radocea '11
  • Joe Werther '10
  • Rob Escriva '10
  • Ben Boeckel '11
Team 3
  • Ryan Govostes '11
  • Adam Comella '11
  • Shawn Denbow '13
  • Andrew Zonenberg '12
  • Jay Smith '10

RPISEC's teams excelled at both offense and defense, keeping their services online despite the onslaught of malicious attacks as well as knocking other teams' servers into oblivion. The team members also found 5 previously unreported vulnerabilities in the latest release of an e-commerce application which is widely used on the Web.

The main event had teams defending a network of five servers running various operating systems, including Windows 2000, Ubuntu, and FreeBSD, and several services, such as DNS, HTTP, and FTP. Throughout the competition, team members were issued "injects," time-sensitive assignments such as reconfiguring software or adding new services.

In addition, the teams were also tasked with attacking the other teams, with the ultimate goal of knocking their services offline. A "red team" of security professionals simultaneously attacked all of the teams.

At the end of the competition, Team 2---the smallest team in the entire competition---took home first place, winning 5 Eee PC laptops and 5 books from information security publisher Syngress. Team 3 came in second place, and won 5 Western Digital media players.

We would like to thank Lockheed Martin Corporation for funding our participation, and SPARSA and the sponsors of ISTS for holding the competition and inviting us.

I'd like to give a big thanks to the organizers and their sponsors. They all made this event enjoyable.

Copyright © 2010 Robert Escriva ¦ Powered by Firmant