Robert Escriva

Things I wish I learned earlier

RCOS Dashboard Algorithm

In short, the algorithm boils down to:

s_i = The number of days since the last update to the source of project i.
b_i = The number of days since the last update to the blog of project i.

a_i = The age of project i
a_i = s_i * s_i + b_i * s_i + b_i * b_i + s_i
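As a quick illustration, the formula can be written as a small Python function. This is only a sketch: the name `project_age` and its parameter names are mine, not taken from the dashboard code, and each two-factor term is treated as a product.

```python
def project_age(source_days: int, blog_days: int) -> int:
    """Combined age a_i computed from s_i (source) and b_i (blog), both in days.

    Hypothetical helper for illustration; a smaller value means the project
    has been updated more recently overall.
    """
    s, b = source_days, blog_days
    return s * s + b * s + b * b + s


print(project_age(0, 0))  # 0
print(project_age(3, 5))  # 9 + 15 + 25 + 3 = 52
print(project_age(5, 3))  # 25 + 15 + 9 + 5 = 54
```

Note that swapping the two inputs changes the result: stale source costs slightly more than an equally stale blog.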

I arrived at this algorithm by first establishing the expected relative age of any two projects i and j, and then testing candidate algorithms against those expectations.

Problem expectations

The first expectation I established is that if s_i = s_j (resp. b_i = b_j), and b_i < b_j (resp. s_i < s_j), then a_i < a_j. I established this as a sanity check. Any good weighting algorithm should take this into account, and it should hold for all values of the source and blog ages.

The second expectation I established is that if s_i < s_j and b_i < b_j, then a_i < a_j. This follows from applying the first expectation twice (compare both projects against a hypothetical project with source age s_i and blog age b_j), but it is tested as well. This is also a sanity check. Any algorithm I would select should never violate this constraint either.
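A minimal sketch of how the two sanity checks might be verified by brute force. The function names and the small default bound are assumptions chosen for illustration; this is not the script referred to in this post.

```python
from itertools import product


def age(s, b):
    """Age formula from this post: s*s + b*s + b*b + s."""
    return s * s + b * s + b * b + s


def sanity_checks_hold(age_fn, bound=10):
    """Return True if age_fn satisfies both sanity checks for all ages 0..bound."""
    days = range(bound + 1)
    for s_i, b_i, s_j, b_j in product(days, repeat=4):
        # First expectation: one component ties, the other is strictly newer.
        first = (s_i == s_j and b_i < b_j) or (b_i == b_j and s_i < s_j)
        # Second expectation: both components are strictly newer.
        second = s_i < s_j and b_i < b_j
        if (first or second) and not age_fn(s_i, b_i) < age_fn(s_j, b_j):
            return False
    return True


print(sanity_checks_hold(age))                      # True
print(sanity_checks_hold(lambda s, b: abs(s - b)))  # False
```

The second call shows a candidate that would be disqualified: ranking by the gap between the two ages ignores how old either component actually is.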

I then laid down some further expectations: the algorithm must slightly favor projects producing source code in the event of a tie, and the algorithm must favor projects that produce balanced updates. These two expectations are at odds with one another, and an algorithm that can satisfy both will be a good candidate for project ranking.

The algorithm should favor source code updates slightly (but not too much).

This case is best described by an example. If s_i = x and b_i = x + 1, and s_j = x + 1 and b_j = x, then a_i < a_j. Open source software is about the source code. Updating a blog can be used to convey information; however, a project without source code is simply an idea.
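For the formula given at the top of this post, the tie-break can be checked directly. The sketch below (helper names are mine) relies on the fact that the quadratic terms are symmetric in s and b, so swapping the two ages changes the result only through the trailing s term.

```python
def age(s, b):
    """Age formula from this post: s*s + b*s + b*b + s."""
    return s * s + b * s + b * b + s


def swap_gap(s, b):
    """Change in age when the source and blog ages are swapped."""
    return age(s, b) - age(b, s)


# Project j (s = x + 1, b = x) is always exactly one unit older than
# project i (s = x, b = x + 1), so the source bias is slight but consistent.
for x in range(31):
    assert swap_gap(x + 1, x) == 1

print(swap_gap(10, 3))  # 149 - 142 = 7
```

In general the gap equals s - b, which is small next to the quadratic terms.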

On the other hand, a large discrepancy between s_i and b_i should not be encouraged. To test this constraint, I propose two test cases. The first case: if s_i = x and b_i > x + 14, and x < s_j <= x + 7 and x < b_j <= x + 7, then a_j < a_i. This constraint dictates that an algorithm must favor projects that update both components regularly over projects that neglect one responsibility. The second case is symmetric to the first, except it considers projects where the source is delinquent with respect to the blog by ten days or more.
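The first of these test cases could be scored as a pass rate along the following lines. This is a sketch under my own assumptions: the function name is hypothetical, and b_i is capped at `bound` days so the enumeration is finite.

```python
def age(s, b):
    """Age formula from this post: s*s + b*s + b*b + s."""
    return s * s + b * s + b * b + s


def balanced_case_pass_rate(age_fn, bound=30):
    """Fraction of enumerated cases in which the balanced project j ranks younger."""
    passed = total = 0
    for x in range(bound + 1):
        for b_i in range(x + 15, bound + 1):       # b_i > x + 14
            a_i = age_fn(x, b_i)                   # s_i = x
            for s_j in range(x + 1, x + 8):        # x < s_j <= x + 7
                for b_j in range(x + 1, x + 8):    # x < b_j <= x + 7
                    total += 1
                    passed += age_fn(s_j, b_j) < a_i
    # No cases exist when bound < 15; treat that as vacuously passing.
    return passed / total if total else 1.0


print(balanced_case_pass_rate(age))
print(balanced_case_pass_rate(lambda s, b: s))  # source-only ranking
```

A source-only ranking fails every one of these cases, since it never penalizes the neglected blog.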

Testing the constraints

To test these constraints, I created a simple Python script to check dashboard weights and verify that a given algorithm fits the above constraints. The first two were considered absolute, and any algorithm that did not pass them would be disqualified. The latter two were scored by the percentage of test cases that passed.

To analyze the algorithms, I exhaustively tested every case of each constraint up to a given bound (30 days by default). For more details, check the source code.

Room for improvement

The algorithm I proposed met the latter two constraints 90% of the time or more. Additionally, both sanity checks pass 100% of the time. That is not to say that there is no room for improvement. There most certainly is, as this was only the result of a couple of hours of experimentation. For instance, right now it is too heavily biased against projects that have updated one component recently but neglected the other for weeks.

I made a patch to the dashboard for the algorithm I outlined above, just as a stop-gap measure until someone else gets the ball rolling on finding an even better algorithm. Feel free to take the script above and improve upon it. Add more constraints. Develop a better algorithm. When you have something, write a patch and send it to Eric.

Best of luck!

Copyright © 2010 Robert Escriva ¦ Powered by Firmant