Interview | Data-Mining with (Collective) Intelligence

July 29, 2010

Editor’s note: Kaggle is a web-based platform for hosting data-mining competitions. As described on their website, “Kaggle facilitates better predictions by providing a platform for data mining, forecasting and bioinformatics competitions. The platform allows organizations to have their data scrutinized by the world’s best statisticians.” Several days ago, I was contacted by Kaggle’s founder, Anthony Goldbloom. We began an email exchange discussing the possible application of Kaggle-hosted data-mining competitions in areas of Astronomy. I pointed out that the data-mining challenges in Astronomy are not only algorithmic (how do I build a system that can distinguish a galaxy from a star?) but also technological. Current and future astronomical surveys will create resources too large to download, and thus require technologies and platforms such as the Virtual Observatory that facilitate the joining of heterogeneous distributed data-sets into a manageable subset, or that enable real-time analysis of a high-throughput data-stream capable of detecting unusual events and generating and distributing alert notifications for follow-up observations. Nevertheless, I do believe that the Kaggle approach is worth knowing about and will have applications for the science of Astronomy, particularly in trying to push the state-of-the-art in the development of detection and classification methods where the training and test set data are of a manageable size. And with that in mind, I present below an informal Q&A with Mr. Goldbloom. We would love to hear comments as to the potential application of Kaggle in Astronomy.



So tell us a little about Kaggle.   What led you to start the company?

Kaggle was inspired by a journalism internship I did at The Economist magazine in 2008. While there, I wrote an article about the use of data in modern organizations. I interviewed fascinating companies including a consultancy that was identifying swing voters for Barack Obama’s presidential campaign and the number-crunching outfit (called dunnhumby) that is said to have contributed to the transformation of Tesco from a down-market British supermarket to the world’s third largest retailer. I became excited about the power of data and eventually left my day-job as an econometrician to found Kaggle.

Tell me a little about your background.    I see from your LinkedIn profile that you studied economics and that you used to work in the banking industry.   How did you become interested in the scientific applications of data mining?

I am an econometrician by training. I have worked in the macroeconomic modeling areas of Australia’s treasury and Australia’s central bank.

Economic modeling is extremely difficult. First of all, the data is noisy and there isn’t very much of it. Secondly, people are not always rational (and their irrationalities don’t seem to cancel out in aggregate) making economic activity particularly difficult to forecast. What’s more, behavior tends to adapt to circumstances, meaning that historical behavior isn’t necessarily a good guide to future behavior. For example the global financial crisis prompted investors to choose safer assets, meaning that past investment decisions won’t have been relevant to predicting future decisions during the GFC.

In many scientific applications, the problems that impair attempts to model the economy don’t exist. There’s often lots of data (sometimes too much) and the subject matter often isn’t inherently unpredictable.

Another company that comes to mind is Innocentive where companies post challenges and seek solutions from the general public.

Innocentive hosts a broad range of challenges and so they can’t cater to data challenges as well as a specialized platform. It’s challenging to simulate real-life conditions in a data competition – and this is a service we can better provide as a focused platform. Moreover, as a focused platform we can better cater to our clients’ demands (such as data anonymization, model averaging…).

Those who compete on Kaggle will be able to store their reputations on Kaggle. We are currently working on a league table, which Kaggle competitors can use to demonstrate their abilities to potential employers and consulting clients. Moreover, researchers can use Kaggle’s leaderboards and league table to demonstrate the veracity of new techniques before publication. Such a reputation store is more difficult on a broader platform because somebody who performs well on a chemistry challenge won’t necessarily perform well at a data challenge. Moreover, data challenges let us rank people on a leaderboard – so we have feedback on all entrants (not just the winner).

Tell us about some of the current or past competitions that have been hosted on Kaggle. How successful were they? Did they attract much interest? What kinds of rewards are groups offering?

We’ve hosted six competitions so far (one on our demo site that’s no longer accessible) and two that are currently live. These competitions have been very successful:

1. On our demo site we hosted a competition to predict the winner of Australian rules football games given 22 variables. It wasn’t a particularly rich dataset (I just assembled whatever I could find easily), but the best submission correctly picked 74 per cent of winners. I’m told that the betting markets get 65-67 per cent of games correct – so the winning model was amazingly accurate.

2. We hosted a competition to pick the winner of the Eurovision Song Contest. The Kaggle consensus (the average forecast) correctly selected seven of the top 10, while the betting markets only picked five.

3. We’re currently hosting a bioinformatics contest, which requires participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection).  Within a week and a half, the best submission had already outdone the best methods in the scientific literature. This result neatly illustrates the strength of data modeling competitions – whereas the scientific literature tends to evolve slowly (somebody writes a paper, somebody else tweaks that paper and so on), a competition inspires rapid innovation by introducing the problem to a wide audience.

The HIV competition has attracted entries from 97 teams so far. The other live competition (which has only been up for a few weeks) has attracted 42 teams so far.

What is the Kaggle business model?

Kaggle charges a consulting fee (for time spent preparing competitions) and a combination of listing fee and success fee. We don’t charge a listing/success fee on public research problems where the winning methodology will be open sourced.

It seems to me that, if nothing else, a competition like this might help to establish a baseline in classifier performance that can be the jumping off point for the investigation of advanced techniques.   In other words, the scientist or computer programmer looking to develop a truly advanced data-mining solution could determine rather quickly what the minimal performance requirements of an advanced solution might be by having first determined, using Kaggle, what the collective intelligence of the wider public can come up with.

We actually think Kaggle can facilitate advanced solutions to data modeling problems. For the data mining competitions we’ve run so far, we find that top entries tend to bump up against an upper bound, which we believe to be the best that can be done given the inherent noise and richness of a dataset.

So let’s turn to the scientific subject that is the primary interest of our readers:  Astronomy.   How might Kaggle be used for addressing data-mining problems in Astronomy?   One project that comes to mind is the Great08 Challenge.  There they were looking to improve methods for inferring  the structure of dark matter in the universe by analyzing images of gravitational lensing.

I’m not an astronomer so I can’t give a good answer to this question. However, I’m sure your readers will have ideas – would be great to hear any thoughts in the comments thread.

For further reading:

Unleashing Your Inner World Cup Geek | Wired

Your chance to beat the banks | The Independent

Crowdsourcing HIV Research | Slashdot


Perspectives | Learning from SETI: Overcoming the roadblocks to discovery

July 23, 2010

Imagine you’re a scientist looking to make a discovery – not merely an insight, a profound earth shattering once-in-a-lifetime kind of discovery; a discovery so significant, it will change the course of history, and man’s perceived place in the universe.    You believe it’s out there waiting to be revealed.   Logic alone tells you it must be so.   You start collecting data.   And you collect and analyze, collect, and analyze.  And you do this for fifty years, and still you find nothing!    Unbelievable!

What do you do? Well, you have several options. First, you can try to increase the amount of data you are collecting. Perhaps your signal is very weak and merely hiding amidst the cosmic noise. Secondly, you can change your data. Maybe you’ve been collecting the wrong type of data. Maybe you’ve been looking in the wrong places, or at the wrong time. Perhaps you simply need to be a bit more clever about where and when and how you gather your raw observations. Your third and final option is to try to look at your data with a fresh perspective – to change your analysis. Maybe the signal has been there all along, but you just aren’t sifting through the data in the right way. You’re looking for the wrong patterns. Maybe the pattern you’re looking for is really quite alien.

By now, you’ve probably guessed that what I’m talking about, of course, is that most profound and potentially history-making career-risking data-mining effort of all time: The Search for Extraterrestrial Intelligence (SETI).    And which of the above strategies is the SETI Institute currently pursuing to address the fact that after all these years, it has yet to detect a signal from an alien intelligence?   Answer:  All of the Above!

Increasing SETI’s data receiving capacity.       SETI is pursuing a major technological upgrade to its receivers via the development of the Allen Telescope Array.      Amir Alexander offers a brief history of the SETI project in which he describes the Allen Array as “one of the best funded and most promising projects for the future of SETI.”    He goes on to write:

The Allen Array represents a true breakthrough for radio SETI. As a dedicated observatory, SETI researchers will be using it year-round to search for alien signals, as compared to the several weeks every year, which are allotted to Project Phoenix at Arecibo. In addition, since it is composed of hundreds of separate dishes, the array can be pointed at several points in the sky at the same time, and therefore listen to signals from several stars simultaneously. The latest technology will enable the Array to cover a frequency band 9 gigahertz wide, more than 3 times wider than project Phoenix, which scans the widest band of any of today’s searches. All of this represents a qualitative leap in the capacity of SETI searches, and increases the chances of detecting a “real” signal several-fold.

New sources of data.   Dr. Seth Shostak gave a talk earlier this year at Foothill College as part of the Silicon Valley Lecture Series entitled: “The Search for Intelligent Life Among the Stars: New Strategies.”   In his talk, he presents a wonderful range of novel and clever ideas aimed at trying to use the detection resources available in new and smarter ways.    One of my favorite ideas:   Theoretical modeling suggests that planets can in fact form in binary systems, and some such planets have already been discovered.   An intelligent alien race in such a system would likely try to colonize planets in the companion star system.   If the orbital plane of the binary system is in the line of sight of our own solar system, we will observe the star system as an eclipsing variable star.   Many such stars are known.   Now imagine this alien civilization communicating back and forth with its colony.   At times when we would observe the eclipse,  the communications beam will be focused right in our direction!   So why not point our receivers at eclipsing variables specifically when they are undergoing the eclipse!    As with all good strategies, this approach tells you “when to look and where.”

credit: Cosmos – The SAO Encyclopedia of Astronomy

New analytical methods. Here, the SETI Institute has done something truly interesting. Jill Tarter, director of the Center for SETI Research, recently announced a new initiative by the SETI Institute to enlist the help of researchers and programmers to see if the signal processing and pattern detection algorithms can be improved.

We’d like to take the next step and invite all of the smart people in the world who don’t work for Berkeley or for the SETI Institute to use the new Allen Telescope. To look for signals that nobody’s been able to look for before because we haven’t had our own telescope; because we haven’t had the computing power…For people who don’t have black belts in digital signal processing, we want to take regions of the spectrum that are overloaded with signals and get those out and have them visualized in different ways against different basis vectors. We’d like to see if people can use their pattern recognition capabilities to look or maybe listen; to tease out patterns in the noise that we don’t know about (Source: O’Reilly Radar).

So remember: the next time you’re stuck in your own efforts at scientific enlightenment and discovery, think about the challenge of SETI and its  strategy: more data, more data sources, and better analysis.


Research Synopsis | Finding cataclysmic variables with a click of your mouse

July 14, 2010

Contributed by Denis Denisenko, Space Research Institute of Russian Academy of Sciences, Moscow, Russia.

Astronomical data mining has three things in common with traditional mining: 1) you work in the dark; 2) it’s very hard work; and 3) you never know in advance what you will dig up in the end.  Sometimes you finish with a load of coal, but every now and then you find a real treasure! In other words, if you know how, where, and what to search for, you will find many hidden gems that everybody was passing by before!  This was the case with our work on discovering new cataclysmic variables from ROSAT X-ray and USNO-B astrophotometric catalogs (Denisenko and Sokolovsky, arXiv:1007.1798).

Cataclysmic Variables (CVs) are a special class of variable stars that continue to surprise astronomers. Some CVs brighten by 100x or even 1000x during periods of outburst, while others can fade by 100x. These amazing objects can change their brightness by 7-8 magnitudes within a day and fade by 5 magnitudes in a matter of 30 seconds! Nothing else in the sky can vary in almost real time. This is why CVs are perhaps the favorite objects among many fans of variable stars. One would think they are easy to discover because of their huge amplitude of variability. However, cataclysmic variables are relatively rare, constituting only about 1% of all known variables.
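
For readers less used to the magnitude scale, the brightness factors and magnitude changes quoted above can be converted into one another with Pogson's relation, in which a magnitude difference of Δm corresponds to a brightness ratio of 10^(0.4 Δm). The following is a minimal Python sketch; the function names are illustrative and do not come from the paper.

```python
import math

def flux_ratio_from_magnitudes(delta_mag):
    """Brightness (flux) ratio corresponding to a magnitude difference."""
    return 10 ** (0.4 * delta_mag)

def magnitudes_from_flux_ratio(ratio):
    """Magnitude difference corresponding to a brightness (flux) ratio."""
    return 2.5 * math.log10(ratio)

# A 100x change is 5 magnitudes; a 1000x change is 7.5 magnitudes.
print(magnitudes_from_flux_ratio(100))
print(magnitudes_from_flux_ratio(1000))

# A 7-8 magnitude change is roughly a 630x to 1600x change in brightness.
print(flux_ratio_from_magnitudes(7), flux_ratio_from_magnitudes(8))
```

So the 100x and 1000x factors and the 5 and 7-8 magnitude figures describe changes of the same order.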

No two cataclysmic variables are alike.  Each has its special quirks and surprises.    Despite decades of research, their behavior is not entirely understood. All CVs have one thing in common – they are all compact binary systems with a white dwarf (or sometimes two white dwarfs).  There the similarities end.  Some have quite heavy components and long orbital periods (up to several days).  Others have small red dwarf “satellites” with a mass and radius 1/6th that of the Sun.  These latter types can complete a revolution in 90 minutes at an orbital distance of perhaps 400 thousand kilometers.  Imagine the Moon making 16 revolutions per day!
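
As a rough sanity check on these numbers, Kepler's third law relates the orbital period and separation of a binary to its total mass. The sketch below assumes a circular orbit and reads the "orbital distance" of 400 thousand kilometers as the separation between the two stars; both are assumptions made for illustration, and the constants are rounded.

```python
import math

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
SOLAR_MASS = 1.989e30  # kg

def total_mass_solar(period_s, separation_m):
    """Total binary mass in solar masses from Kepler's third law,
    assuming a circular orbit: M = 4 pi^2 a^3 / (G P^2)."""
    mass_kg = 4 * math.pi ** 2 * separation_m ** 3 / (G * period_s ** 2)
    return mass_kg / SOLAR_MASS

period_s = 90 * 60    # 90-minute orbit
separation_m = 4.0e8  # 400 thousand km, assumed to be the stellar separation

revolutions_per_day = 24 * 3600 / period_s
mass = total_mass_solar(period_s, separation_m)
print(f"{revolutions_per_day:.0f} revolutions per day")
print(f"implied total mass: {mass:.2f} solar masses")
```

Under these assumptions the implied total mass is roughly 0.65 solar masses, which is plausible for a white dwarf paired with a small red dwarf.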

Amazingly, many new cataclysmic variables can be discovered by mining publicly available catalogs and sky surveys that are freely accessible online. You just need to use your imagination and have a good eye for pattern recognition in order to notice some common features of these unusual objects. To discover new CVs, we checked the vicinities of approximately 50 thousand X-ray sources in the Northern hemisphere. We identified 1,400 suspicious stars in the USNO catalog that were changing their brightness by 2 or more magnitudes between different epochs in Red or Blue light. After some brain work we were left with just 200 candidates. (Remember: the main tool of the astronomer is his brain, the second most important one is the computer, and only afterward does one go to the telescope!) Actually, we didn’t have to use a telescope even with those 200 objects – the necessary images were already obtained by the Palomar Schmidt camera, by the 2MASS infrared survey telescope, and in some cases by the NEAT asteroid hunter. Using images dating back to the early 1950s, we were able to discover the variability of 10 objects, eight of them being new cataclysmic variables in our Milky Way and two probable active quasars with a large amplitude of variability. Surely, we have detected only a fraction of the variable objects among our candidates! Now there is a special need for a detailed follow-up examination of these newly discovered objects.
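
The first selection step described above (keeping stars whose brightness changed by 2 or more magnitudes between epochs in red or blue light) can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' actual pipeline: the field names `r1`, `r2`, `b1`, `b2` and the sample stars are hypothetical, and the cross-match against X-ray positions is assumed to have been done already.

```python
def magnitude_change(first, second):
    """Absolute change between two epochs, or None if either is missing."""
    if first is None or second is None:
        return None
    return abs(first - second)

def is_variable_candidate(star, threshold=2.0):
    """True if the star changed by `threshold` magnitudes or more
    between epochs in either the red or the blue band."""
    changes = [
        magnitude_change(star.get("r1"), star.get("r2")),
        magnitude_change(star.get("b1"), star.get("b2")),
    ]
    return any(c is not None and c >= threshold for c in changes)

# Made-up stars near hypothetical X-ray positions (not real catalog entries).
stars = [
    {"name": "A", "r1": 17.2, "r2": 19.6, "b1": 17.9, "b2": 18.3},
    {"name": "B", "r1": 15.1, "r2": 15.3, "b1": 16.0, "b2": 16.1},
    {"name": "C", "r1": None, "r2": 18.0, "b1": 16.5, "b2": 19.0},
]
candidates = [s["name"] for s in stars if is_variable_candidate(s)]
print(candidates)
```

A cut like this only produces a list of suspects; as the text stresses, narrowing 1,400 suspicious stars down to 200 candidates still took human judgment.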

The most amazing thing in this whole story for me was perhaps the fact that we discovered an object with a 5 mag amplitude of variability! At the moment one can only guess what it is. Maybe a gravitational lensing event, or a supernova explosion in a distant galaxy, or maybe yet another previously unknown type of cataclysmic variable. As I said before, you never know what you will discover today! And this is, in my humble opinion, the main beauty and motivation of Astronomy.


Twitter, AstroTwitter, and SkyAlert

September 24, 2009

There has been a great deal of interest lately in astronomy and the new media. Internet technologies and modes of communication, including blogging, Twitter, Facebook, and even virtual worlds such as Second Life, are becoming avenues for interactions between professional astronomers and space scientists and the larger public. (NASA’s Martian rovers, Spirit and Opportunity, have over 16,000 followers on Twitter!) This topic was a focus of a recent episode of Astronomy Cast, and there are even conferences dedicated to the topic, such as .Astronomy | Networked Astronomy and the New Media, where a number of the talks from last year’s conference are available for online viewing via streaming video. These efforts take on deeper significance as we enter the coming age of survey astronomy with open access to data archives and opportunities for citizen scientists of varied backgrounds to become part of the process of scientific discovery via direct collaborations. GalaxyZoo is certainly one of the most successful recent endeavors along these lines. It leverages survey data from the Sloan Digital Sky Survey by enlisting the help of thousands of individuals in an effort to classify galaxies to better understand galaxy evolution, discover new types of objects, and address deeper cosmological questions about the large-scale structure of the Universe.

New modes of communication are also in the works. AstroTwitter is one such proposal, under development by Dr. Stuart Lowe at the Jodrell Bank Centre for Astrophysics. As Lowe describes it, AstroTwitter aims “to make it easy for both professional and amateur telescopes to let the world know what they are observing in real-time.” By being a Twitter-like service dedicated to Astronomy, AstroTwitter will overcome some of the inherent limitations of Twitter by providing specialized output formats (for example, webpages, XML, Google Sky overlays, and VOEvents).

With new synoptic survey telescopes due to come online in the coming years, the problem of disseminating information about significant events to the public, scientists, or other robotic instruments becomes a growing concern. SkyAlert aims to address this growing need. One of the key complaints about Twitter is that there is no discrimination. So yes, I might be interested in the doings of a particular person (or robotic spacecraft!), but there is no other way to filter the content received. The goal of SkyAlert is to create a general subscription-based service that allows users to define constraints on the events of interest. An example from the paper “Skyalert: Real-time Astronomy for You and Your Robots”: I want Catalina transient events where the Catalina measurement is at least 2 magnitudes brighter than that from the Sloan survey. Each event would have a wiki-based web-page with additional supporting data and allow users to add comments or other annotations.

From Williams, et al. 2008. SkyAlert: Real-time Astronomy for You and Your Robots
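
The example constraint quoted above can be pictured as a simple predicate applied to each incoming event. The Python sketch below is purely illustrative and is not SkyAlert's actual interface: the event fields `catalina_mag` and `sloan_mag` and the sample events are hypothetical. Note that on the magnitude scale "brighter" means a numerically smaller value.

```python
def catalina_brighter_than_sloan(event, min_difference=2.0):
    """True if the Catalina magnitude is at least `min_difference`
    magnitudes brighter (numerically smaller) than the Sloan magnitude."""
    return event["sloan_mag"] - event["catalina_mag"] >= min_difference

# Made-up events for illustration only.
events = [
    {"id": "evt-1", "catalina_mag": 16.0, "sloan_mag": 19.5},
    {"id": "evt-2", "catalina_mag": 18.0, "sloan_mag": 18.6},
    {"id": "evt-3", "catalina_mag": 20.5, "sloan_mag": 18.0},
]
matching = [e["id"] for e in events if catalina_brighter_than_sloan(e)]
print(matching)
```

Only the first event passes: the second brightened by less than 2 magnitudes, and the third is fainter in the Catalina measurement than in Sloan.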

Welcome to the New Media…get your telescopes and iPhones ready!

