A few days ago, the 2009 GitHub contest ended. I’m currently going through all of the top submissions to verify the winners and look through some of the code used. I’ll post the official winners in a day or two, but in the meantime I’ve replaced the contest home page with a table of the submitted entries that have their source code online, along with the language and license used in each project. If you haven’t pushed your source code yet, please do so and let me know so I can add it to the final table and people can find it.
In the month that the contest was live, we processed over 8,000 submissions resulting in over 7,000 scores for 227 contestants. Several different languages were used to tackle the problem - C++, C, Perl, Python, Ruby, Objective-C, C#, Java, Clojure, Lisp and even Vala (Vala?).
Because of the way that I created the dataset, it was pretty much necessary to use some sort of blended model to get over 50% in the guesses, so most of the top entries did that. Several of the entries ran a number of different algorithms over the data and then applied some sort of custom weighting to those choices to come up with the final 10. I’ll write up a full post on some of the algorithms used and the different combination techniques employed.
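To make the idea concrete, here's a minimal sketch (not any contestant's actual code) of blending: each algorithm produces a ranked candidate list, each list gets a hand-tuned weight, and a simple position-based vote picks the final recommendations. The function and repo names are hypothetical.

```ruby
# Hypothetical sketch: blend ranked candidate lists from several
# algorithms using hand-tuned weights, then take the top N.
def blend(ranked_lists, weights, n = 10)
  scores = Hash.new(0.0)
  ranked_lists.zip(weights).each do |repos, weight|
    repos.each_with_index do |repo, i|
      # Borda-style vote: earlier positions contribute more
      scores[repo] += weight * (repos.length - i)
    end
  end
  scores.sort_by { |_, s| -s }.first(n).map(&:first)
end

# Two toy algorithm outputs: one popularity-based, one neighbor-based
popularity = ["rails", "git", "jquery"]
neighbors  = ["git", "sinatra", "rails"]
blend([popularity, neighbors], [0.6, 0.4], 3)  # => ["git", "rails", "sinatra"]
```

The interesting (and fiddly) part in the real entries was tuning those weights, which several contestants did against a held-out slice of the data.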
Some of our contestants did some really great writeups of their own projects. Jeremy Barnes has a fantastic README with even more documentation there than the last time I posted about him.
Jason Brownlee has also done an amazing amount of work in writing up his methods as well as linking to a number of academic papers referenced, such as the Analysis of Recommender Systems’ Algorithms (2003), The Effects of Singular Value Decomposition on Collaborative Filtering (1998), Using Linear Algebra for Intelligent Information Retrieval (1995) and more. He implemented his solution in Objective-C, which is pretty cool too.
Some people got clever and took advantage of the fact that the results files were public. They wrote scripts that pulled down all the top results and blended them together. This actually worked out pretty well in most cases, but I felt I should disqualify them since they were breaking the licensing terms of some of the contestants who specifically stated that their results files were not public domain and could not be used by others. It is also not a practical or generally useful approach, even though it’s fun and clever. You can read a bit more about it at igvita.com, though it seems to me that the article blurs what for me is a very clear difference between taking everyone’s results from a public leaderboard and blending them versus implementing several algorithms internally and blending them. An example of this “result crowdsourcing” style can be seen in John Rowell’s entry.
Tom Alison also did a very nice writeup of his approach, based on conditional probability and blending and implemented in Python. He also used Tokyo Cabinet as a backend data store for his calculations, which is interesting.
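The conditional-probability idea is simple enough to sketch in a few lines. This is my own illustration in Ruby rather than Tom's Python code: estimate P(candidate repo | user already watches repo X) from how often the two repos are co-watched, then recommend the highest-probability candidates. All names below are made up.

```ruby
# Hypothetical sketch: estimate P(other_repo | repo) from
# co-occurrence in users' watch lists.
def conditional_probs(watch_lists, repo)
  with_repo = watch_lists.select { |repos| repos.include?(repo) }
  return {} if with_repo.empty?

  counts = Hash.new(0)
  with_repo.each do |repos|
    (repos - [repo]).each { |other| counts[other] += 1 }
  end
  # Fraction of repo-watchers who also watch each other repo
  counts.transform_values { |c| c.to_f / with_repo.length }
end

watches = [%w[rails git], %w[rails git jquery], %w[rails sinatra]]
conditional_probs(watches, "rails")
# "git" co-occurs with "rails" in 2 of the 3 lists
```

A real entry would compute these probabilities over all watched repos for a user and then blend the resulting candidate scores, which is where it ties back into the weighting discussion above.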
Other users decided to write up their findings and experiences on their own personal blogs.
Dan DeLeo submitted an entry with probably my favorite name, acts_as_bourbon, which is actually just a pointer to a solution using his open source Decider Ruby machine learning library. Dan goes into great detail on his approach in an article on his blog.
I tried something a little different for contest registration and submission, having contestants add our web app URL as a post-receive hook to register. The actual contest web application that did the scoring and ran the leaderboard is now open sourced on GitHub, if anyone is interested in how to do something like that.
Thanks to Heroku (and Sinatra) it was amazingly easy. I made a small, simple Sinatra application that would take a post-receive POST, check the repo to see if the results.txt file was there and had changed, then pull it down, compare it to the local answer key, and record the score.
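The scoring step at the heart of that app can be sketched in plain Ruby. To be clear, this is an illustration, not the contest app's actual code, and the results.txt format here ("user_id:repo1,repo2,...") is an assumption about the submission format, as is the simple hit-rate metric.

```ruby
# Hypothetical sketch of the scoring step: parse a submitted
# results.txt and compare each user's guesses to the answer key.
# The "user_id:repo1,repo2,..." line format is assumed.
def parse_results(text)
  text.each_line.with_object({}) do |line, guesses|
    user, repos = line.strip.split(":")
    guesses[user] = (repos || "").split(",")
  end
end

def score(guesses, answer_key)
  # Count how many of the hidden watches each user's guesses hit
  hits  = answer_key.sum { |user, actual| ((guesses[user] || []) & actual).length }
  total = answer_key.sum { |_, actual| actual.length }
  hits.to_f / total
end

answer_key = { "1" => ["rails"], "2" => ["git", "jquery"] }
guesses = parse_results("1:rails,sinatra\n2:git\n")
score(guesses, answer_key)  # 2 of the 3 hidden watches guessed
```

In the actual app, this logic sat behind a Sinatra POST route that received GitHub's post-receive payload, fetched the changed results.txt from the repo, and recorded the score for the leaderboard.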
This was my first real deployment on Heroku and it was amazing. I didn’t have to deal with Capistrano or database connection settings or configuration files or anything. If I made a change, I just ran ‘git push heroku’ and the fix was live. If you’ve never tried Heroku for Rack-based web app deployment, I would highly recommend it - it’s the least administration overhead I’ve ever seen for a successfully deployed web application.
Overall, I’m very happy with how this contest turned out. I am a bit disappointed at my own design of the contest and preparation of the data, which really encouraged overfitting. However, due to the contest, a lot of people found other developers that were interested in this space, a lot of developers used this as a chance to learn or try out new languages, and dozens of open source recommender implementations in multiple languages are now available on GitHub for everyone to use or study. And of course we can’t forget - at the end of the day, someone gets some Pappy. We’ll see who that is in a few days.