Editor’s note: Kaggle is a web-based platform for hosting data-mining competitions. As described on their website, “Kaggle facilitates better predictions by providing a platform for data mining, forecasting and bioinformatics competitions. The platform allows organizations to have their data scrutinized by the world’s best statisticians.” Several days ago, I was contacted by Kaggle’s founder, Anthony Goldbloom. We began an email exchange discussing the possible application of Kaggle-hosted data-mining competitions in Astronomy. I pointed out that the data-mining challenges in Astronomy are not only algorithmic (how do I build a system that can distinguish a galaxy from a star?) but also technological. Current and future astronomical surveys will create resources too large to download, and thus require technologies and platforms such as the Virtual Observatory that facilitate joining heterogeneous distributed datasets into a manageable subset, or that enable real-time analysis of a high-throughput data stream capable of detecting unusual events and generating and distributing alert notifications for follow-up observations. Nevertheless, I do believe that the Kaggle approach is worth knowing about and will have applications in Astronomy, particularly in pushing the state of the art in detection and classification where the training and test data are of a manageable size. And with that in mind, I present below an informal Q&A with Mr. Goldbloom. We would love to hear comments on the potential application of Kaggle in Astronomy.
What inspired you to create Kaggle?

Kaggle was inspired by a journalism internship I did at The Economist magazine in 2008. While there, I wrote an article about the use of data in modern organizations. I interviewed fascinating companies including a consultancy that was identifying swing voters for Barack Obama’s presidential campaign and the number-crunching outfit (called dunnhumby) that is said to have contributed to the transformation of Tesco from a down-market British supermarket to the world’s third largest retailer. I became excited about the power of data and eventually left my day-job as an econometrician to found Kaggle.
Tell me a little about your background. I see from your LinkedIn profile that you studied economics and that you used to work in the banking industry. How did you become interested in the scientific applications of data mining?
I am an econometrician by training. I have worked in the macroeconomic modeling areas of Australia’s treasury and Australia’s central bank.
Economic modeling is extremely difficult. First of all, the data is noisy and there isn’t very much of it. Secondly, people are not always rational (and their irrationalities don’t seem to cancel out in aggregate), making economic activity particularly difficult to forecast. What’s more, behavior tends to adapt to circumstances, meaning that historical behavior isn’t necessarily a good guide to future behavior. For example, the global financial crisis prompted investors to choose safer assets, meaning that pre-crisis investment decisions were a poor guide to predicting decisions during the GFC.
In many scientific applications, the problems that impair attempts to model the economy don’t exist. There’s often lots of data (sometimes too much) and the subject matter often isn’t inherently unpredictable.
Another company that comes to mind is Innocentive, where companies post challenges and seek solutions from the general public.
Innocentive hosts a broad range of challenges and so they can’t cater to data challenges as well as a specialized platform. It’s challenging to simulate real-life conditions in a data competition – and this is a service we can better provide as a focused platform. Moreover, as a focused platform we can better cater to our clients’ demands (such as data anonymization, model averaging…).
Those who compete on Kaggle will be able to store their reputations on Kaggle. We are currently working on a league table, which Kaggle competitors can use to demonstrate their abilities to potential employers and consulting clients. Moreover, researchers can use Kaggle’s leaderboards and league table to demonstrate the efficacy of new techniques before publication. Such a reputation store is more difficult on a broader platform because somebody who performs well on a chemistry challenge won’t necessarily perform well at a data challenge. Moreover, data challenges let us rank people on a leaderboard – so we have feedback on all entrants (not just the winner).
Tell us about some of the current or past competitions that have been hosted on Kaggle. How successful were they? Did they attract much interest? What kinds of rewards are groups offering?
We’ve hosted six competitions so far, including one on our demo site that’s no longer accessible; two are currently live. These competitions have been very successful:
1. On our demo site we hosted a competition to predict the winner of Australian rules football games given 22 variables. It wasn’t a particularly rich dataset (I just assembled whatever I could find easily), but the best submission correctly picked 74 per cent of winners. I’m told that the betting markets get 65-67 per cent of games correct – so the winning model was amazingly accurate.
2. We hosted a competition to pick the winner of the Eurovision Song Contest. The Kaggle consensus (the average forecast) correctly selected seven of the top 10, while the betting markets only picked five.
3. We’re currently hosting a bioinformatics contest, which requires participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection). Within a week and a half, the best submission had already outdone the best methods in the scientific literature. This result neatly illustrates the strength of data modeling competitions – whereas the scientific literature tends to evolve slowly (somebody writes a paper, somebody else tweaks that paper and so on), a competition inspires rapid innovation by introducing the problem to a wide audience.
The HIV competition has attracted entries from 97 teams so far. The other live competition (which has only been up for a few weeks) has attracted 42 teams so far.
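The “Kaggle consensus” mentioned in the Eurovision example is simply an average of the entrants’ forecasts. A minimal sketch of the idea, using entirely made-up numbers (these are not actual competition data):

```python
# Illustrative sketch, not Kaggle's actual code: a consensus forecast
# formed by averaging several entrants' predicted probabilities.

def consensus(forecasts):
    """Average a list of {contestant: probability} dicts."""
    contestants = forecasts[0].keys()
    return {c: sum(f[c] for f in forecasts) / len(forecasts)
            for c in contestants}

# Two hypothetical entrants' forecasts (invented for illustration).
entrant_a = {"Germany": 0.5, "Turkey": 0.3, "Denmark": 0.2}
entrant_b = {"Germany": 0.4, "Turkey": 0.4, "Denmark": 0.2}

avg = consensus([entrant_a, entrant_b])
# Rank contestants by the averaged probability, highest first.
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking)  # ['Germany', 'Turkey', 'Denmark']
```

Averaging tends to cancel out the idiosyncratic errors of individual entrants, which is one reason a consensus can beat any single forecast.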
What is the Kaggle business model?
Kaggle charges a consulting fee (for time spent preparing competitions) and a combination of listing fee and success fee. We don’t charge a listing/success fee on public research problems where the winning methodology will be open sourced.
It seems to me that, if nothing else, a competition like this might help to establish a baseline in classifier performance that can be the jumping-off point for the investigation of advanced techniques. In other words, the scientist or computer programmer looking to develop a truly advanced data-mining solution could quickly determine the minimal performance requirements of such a solution by first using Kaggle to see what the collective intelligence of the wider public can come up with.
We actually think Kaggle can facilitate advanced solutions to data modeling problems. For the data mining competitions we’ve run so far, we find that top entries tend to bump up against an upper bound, which we believe to be the best that can be done given the inherent noise and richness of a dataset.
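The baseline idea raised in the question above can be made concrete: before investing in advanced techniques, measure what a trivial predictor achieves. A minimal sketch using toy star/galaxy labels (all data invented for illustration):

```python
# Illustrative sketch with hypothetical data: a majority-class baseline
# sets the performance floor that any "advanced" classifier must beat.
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Toy star/galaxy labels -- entirely made up for illustration.
train = ["star", "star", "galaxy", "star", "galaxy", "star"]
test = ["star", "galaxy", "star", "star"]

print(majority_baseline(train, test))  # 0.75
```

A crowd-sourced competition leaderboard plays the same role at a higher level: the best public submission, rather than a trivial predictor, becomes the bar a new method must clear.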
So let’s turn to the scientific subject that is the primary interest of our readers: Astronomy. How might Kaggle be used for addressing data-mining problems in Astronomy? One project that comes to mind is the Great08 Challenge. There they were looking to improve methods for inferring the structure of dark matter in the universe by analyzing images of gravitational lensing.
I’m not an astronomer so I can’t give a good answer to this question. However, I’m sure your readers will have ideas – would be great to hear any thoughts in the comments thread.
For further reading: