Papers | Galaxy Zoo morphology improves photometric redshifts in the Sloan Digital Sky Survey

May 5, 2011

Submitted by Dr. Michael J. Way, NASA Goddard Institute for Space Studies.

Given the recent release of Galaxy Zoo Data Release 1, researchers can begin to explore the myriad ways to use the most accurate and numerous database of galaxy morphology ever compiled. To that end, we have used galaxy photometry and redshift information from the Sloan Digital Sky Survey (SDSS), in combination with precise knowledge of galaxy morphology from the Galaxy Zoo project, to calculate photometric redshifts using Gaussian process regression.

We are primarily interested in obtaining accurate photometric redshifts for a subset of SDSS galaxies called the luminous red galaxies. These galaxies are normally found in denser regions of the local universe. They are interesting because they tend to be accurate tracers of the large-scale structure in the universe and have been used for measuring the Baryonic Acoustic Oscillation signal, thus putting better constraints on present-day cosmological models.

The Galaxy Zoo database is used to segregate the elliptical galaxies from the spirals (we focus on the former). Then we obtain a variety of derived primary and secondary isophotal shape estimates from the Sloan Digital Sky Survey imaging catalog (e.g. the amount of light within the 50% Petrosian radius). Using these shape estimates in combination with the five-bandpass photometry of elliptical galaxies with redshifts from the SDSS, we use a non-linear regression method based on a training set (Gaussian process regression) to estimate their photometric redshifts. The root mean square error for luminous red galaxies classified as ellipticals is as low as 0.0118, which is nearly a factor of 2 lower than typical estimates for galaxies in the SDSS (see Figure).
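To make the regression step concrete, here is a minimal sketch of Gaussian process regression and the root mean square error it is scored with. It is not the pipeline used in the paper: the squared-exponential kernel, the `length_scale` and `noise` values, and the single made-up "colour" feature with its toy colour-redshift relation are all assumptions chosen for illustration. A real run would use the five SDSS magnitudes plus the shape estimates as inputs and spectroscopic redshifts as targets.

```python
import math

def rbf_kernel(a, b, length_scale):
    """Squared-exponential covariance between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def cholesky_solve(lower, rhs):
    """Solve (L L^T) x = rhs by forward, then backward, substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x

def gp_predict(train_x, train_z, test_x, length_scale=0.3, noise=1e-6):
    """Posterior mean of a GP fitted to mean-subtracted training redshifts."""
    z_mean = sum(train_z) / len(train_z)
    n = len(train_x)
    # Covariance of the training set, with the noise term on the diagonal.
    cov = [[rbf_kernel(train_x[i], train_x[j], length_scale) + (noise if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    alpha = cholesky_solve(cholesky(cov), [z - z_mean for z in train_z])
    return [z_mean + sum(rbf_kernel(x, xi, length_scale) * a
                         for xi, a in zip(train_x, alpha))
            for x in test_x]

def rms_error(predicted, truth):
    """Root mean square difference between predicted and true redshifts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))

def toy_redshift(colour):
    """Made-up smooth colour-redshift relation, for illustration only."""
    return 0.1 + 0.2 * colour + 0.05 * math.sin(3.0 * colour)

# Hypothetical training set (galaxies with known redshift) and held-out test set.
train_x = [[i / 11] for i in range(12)]
train_z = [toy_redshift(x[0]) for x in train_x]
test_x = [[(i + 0.5) / 11] for i in range(11)]
test_z = [toy_redshift(x[0]) for x in test_x]

predicted = gp_predict(train_x, train_z, test_x)
rmse = rms_error(predicted, test_z)
print(f"toy RMS redshift error: {rmse:.5f}")
```

The toy error printed here says nothing about real galaxies. The point is only the shape of the procedure: train on objects with known redshifts, predict for the rest, and summarize the residuals as a single RMS number.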

One can see in the lower left panel that the errors in the photometric redshift estimates are lowest for the luminous red galaxies classified as ellipticals. The best results are obtained when using their 5-band photometry and a variety of isophotal shape estimates, denoted as B. See the paper on arXiv for more details.

The next step will be to use classification techniques from the machine learning literature to classify all of the elliptical galaxies in the ~350 million object database of the SDSS. This has already been attempted by one group using the ~900,000 Galaxy Zoo morphologies and isophotal shape estimates as training samples. One would expect to be able to classify approximately 50-100 million luminous red galaxies as ellipticals. These in turn can be used as the most accurate probes thus far for measuring Baryonic Acoustic Oscillations at unprecedented depth.


Papers | QSO Selection Algorithm Using Time Variability and Machine Learning: Selection of 1,620 QSO Candidates from MACHO LMC Database

May 1, 2011

This research synopsis was submitted by Dae-Won Kim, Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA. A preprint of this paper, which is to appear in the Astrophysical Journal, is available on the arXiv preprint server.

Modern astronomy is entering a completely new era driven by immense amounts of observational data. For instance, ongoing and future large-scale surveys such as Pan-STARRS and LSST will produce more than several terabytes of data per night. Wide-field mapping of the sky will open a new paradigm of astronomy in both scientific and data-handling aspects. In particular, it will be practically impossible to manually examine all the data in order to discover scientifically meaningful information. In other words, innovative algorithms that can automatically analyze the data with minimal human intervention, and that deliver only the meaningful information to astronomers, are becoming more and more important.

This paper introduces such an algorithm to select QSOs (Quasi-Stellar Objects), which typically show strong non-periodic or pseudo-periodic variability. In the absence of spectroscopic data, such an algorithm will be a very powerful tool for selecting QSOs. This is especially true for the future large-scale surveys (Pan-STARRS and LSST), where spectroscopic observation will be very expensive due to their wide fields of view and limiting magnitudes.

We introduced 11 time series features that quantify different variability characteristics of light curves, and confirmed that they are practical for separating QSOs from other types of variable sources (e.g. Cepheids, RR Lyraes, eclipsing binaries, Be stars, micro-lensing, long-period variables, etc.) and from non-varying stars. Figure 1 shows an example scatter plot of two time series features. As the figure shows, the two features are useful for separating the different types of variable stars. We then employed a supervised machine learning technique called a ‘Support Vector Machine’, which can train a classification model in a feature space of any dimension. We claim that hyperplane cuts derived in the 11-D space (i.e. the space of the 11 time series features) separate QSOs much better than conventional 2-D hard cuts.
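As a rough illustration of what a "time series feature" and a "hyperplane cut" look like in practice, the sketch below computes two simple variability features from a light curve and scores them with a linear decision function. The two features (`relative_scatter` and `lag1_autocorrelation`) are illustrative stand-ins and are not claimed to be among the 11 features of the paper. The toy light curves, weights, and bias are hypothetical; in a real analysis the weights would come from training a Support Vector Machine on labeled light curves, which this sketch does not do.

```python
import math
from statistics import mean, pstdev

def relative_scatter(mags):
    """Standard deviation divided by the mean magnitude (illustrative feature)."""
    return pstdev(mags) / mean(mags)

def lag1_autocorrelation(mags):
    """Correlation between each measurement and the next one (illustrative feature)."""
    mu = mean(mags)
    numerator = sum((a - mu) * (b - mu) for a, b in zip(mags, mags[1:]))
    denominator = sum((a - mu) ** 2 for a in mags)
    return numerator / denominator

def linear_decision(features, weights, bias):
    """Signed score w.x + b; a positive score falls on the 'candidate' side."""
    return sum(w * f for w, f in zip(weights, features)) + bias

# Toy light curves: one varies smoothly, one flips between two values.
smooth_curve = [18.0 + 0.3 * math.sin(2 * math.pi * i / 50) for i in range(200)]
jittery_curve = [18.0 + (0.3 if i % 2 == 0 else -0.3) for i in range(200)]

smooth_features = [relative_scatter(smooth_curve), lag1_autocorrelation(smooth_curve)]
jittery_features = [relative_scatter(jittery_curve), lag1_autocorrelation(jittery_curve)]

# Hypothetical hyperplane: a trained classifier would supply these numbers.
weights, bias = [0.0, 1.0], -0.5

smooth_score = linear_decision(smooth_features, weights, bias)
jittery_score = linear_decision(jittery_features, weights, bias)
print("smooth curve score:", smooth_score)
print("jittery curve score:", jittery_score)
```

With 11 features the same `linear_decision` call takes 11 weights. That is what distinguishes a hyperplane cut from a 2-D hard cut: every feature contributes to a single score instead of two features being thresholded separately.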

We applied the algorithm to the MACHO database, which consists of 40 million light curves, and found 1,620 QSO candidates. Analyzing the whole dataset on the Harvard Odyssey cluster took about two days. We cross-matched the identified candidates with mid-IR and X-ray catalogs, and confirmed that the majority of them are very strong QSO candidates.

Figure 1. Scatter plot of two time series features. Each axis is a different time series feature. Different symbols and colors denote different variable sources (gray dots: non-variables, black x’s: eclipsing binaries, magenta crosses: micro-lensing, yellow x’s: RR Lyraes, green x’s: Cepheids, cyan crosses: long-period variables, blue crosses: Be stars, red squares: QSOs). Most of the variable types are grouped in different regions.


Interview | Data-Mining with (Collective) Intelligence

July 29, 2010

Editor’s note: Kaggle is a web-based platform for hosting data-mining competitions. As described on their website, “Kaggle facilitates better predictions by providing a platform for data mining, forecasting and bioinformatics competitions. The platform allows organizations to have their data scrutinized by the world’s best statisticians.” Several days ago, I was contacted by Kaggle’s founder, Anthony Goldbloom. We began an email exchange discussing the possible application of Kaggle-hosted data-mining competitions in areas of Astronomy. I pointed out that the data-mining challenges in Astronomy are not only algorithmic (how do I build a system that can distinguish a galaxy from a star?) but also technological. Current and future astronomical surveys will create resources too large to download, and thus require technologies and platforms such as the Virtual Observatory that facilitate the joining of heterogeneous distributed data-sets into a manageable subset, or that enable real-time analysis of a high-throughput data-stream capable of detecting unusual events and generating and distributing alert notifications for follow-up observations. Nevertheless, I do believe that the Kaggle approach is worth knowing about and will have applications for the science of Astronomy, particularly in trying to push the state-of-the-art in the development of detection and classification where the training and test set data are of a manageable size. And with that in mind, I present below an informal Q&A with Mr. Goldbloom. We would love to hear comments as to the potential application of Kaggle in Astronomy.



So tell us a little about Kaggle. What led you to start the company?

Kaggle was inspired by a journalism internship I did at The Economist magazine in 2008. While there, I wrote an article about the use of data in modern organizations. I interviewed fascinating companies including a consultancy that was identifying swing voters for Barack Obama’s presidential campaign and the number-crunching outfit (called dunnhumby) that is said to have contributed to the transformation of Tesco from a down-market British supermarket to the world’s third largest retailer. I became excited about the power of data and eventually left my day-job as an econometrician to found Kaggle.

Tell me a little about your background. I see from your LinkedIn profile that you studied economics and that you used to work in the banking industry. How did you become interested in the scientific applications of data mining?

I am an econometrician by training. I have worked in the macroeconomic modeling areas of Australia’s treasury and Australia’s central bank.

Economic modeling is extremely difficult. First of all, the data is noisy and there isn’t very much of it. Secondly, people are not always rational (and their irrationalities don’t seem to cancel out in aggregate) making economic activity particularly difficult to forecast. What’s more, behavior tends to adapt to circumstances, meaning that historical behavior isn’t necessarily a good guide to future behavior. For example the global financial crisis prompted investors to choose safer assets, meaning that past investment decisions won’t have been relevant to predicting future decisions during the GFC.

In many scientific applications, the problems that impair attempts to model the economy don’t exist. There’s often lots of data (sometimes too much) and the subject matter often isn’t inherently unpredictable.

Another company that comes to mind is Innocentive where companies post challenges and seek solutions from the general public.

Innocentive hosts a broad range of challenges and so they can’t cater to data challenges as well as a specialized platform. It’s challenging to simulate real-life conditions in a data competition – and this is a service we can better provide as a focused platform. Moreover, as a focused platform we can better cater to our clients’ demands (such as data anonymization, model averaging…).

Those who compete on Kaggle will be able to store their reputations on Kaggle. We are currently working on a league table, which Kaggle competitors can use to demonstrate their abilities to potential employers and consulting clients. Moreover, researchers can use Kaggle’s leaderboards and league table to demonstrate the veracity of new techniques before publication. Such a reputation store is more difficult on a broader platform because somebody who performs well on a chemistry challenge won’t necessarily perform well at a data challenge. Moreover, data challenges let us rank people on a leaderboard – so we have feedback on all entrants (not just the winner).

Tell us about some of the current or past competitions that have been hosted on Kaggle. How successful were they? Did they attract much interest? What kinds of rewards are groups offering?

We’ve hosted six competitions so far (one on our demo site that’s no longer accessible) and two that are currently live. These competitions have been very successful:

1. On our demo site we hosted a competition to predict the winner of Australian rules football games given 22 variables. It wasn’t a particularly rich dataset (I just assembled whatever I could find easily), but the best submission correctly picked 74 per cent of winners. I’m told that the betting markets get 65-67 per cent of games correct – so the winning model was amazingly accurate.

2. We hosted a competition to pick the winner of the Eurovision Song Contest. The Kaggle consensus (the average forecast) correctly selected seven of the top 10, while the betting markets only picked five.

3. We’re currently hosting a bioinformatics contest, which requires participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection).  Within a week and a half, the best submission had already outdone the best methods in the scientific literature. This result neatly illustrates the strength of data modeling competitions – whereas the scientific literature tends to evolve slowly (somebody writes a paper, somebody else tweaks that paper and so on), a competition inspires rapid innovation by introducing the problem to a wide audience.

The HIV competition has attracted entries from 97 teams so far. The other live competition (which has only been up for a few weeks) has attracted 42 teams so far.
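The "Kaggle consensus" mentioned in the Eurovision item is described only as the average forecast. One plausible reading, sketched below, is to average every entrant's predicted score for each contestant, rank contestants by that average, and count how many of the consensus top N also appear in the actual top N. The entrant forecasts, contestant labels, and actual result here are invented for illustration, and the real competition's scoring may have differed.

```python
from statistics import mean

def consensus_ranking(forecasts):
    """Average each contestant's predicted score across entrants, rank high to low."""
    contestants = forecasts[0].keys()
    averaged = {c: mean(f[c] for f in forecasts) for c in contestants}
    return sorted(averaged, key=averaged.get, reverse=True)

def top_n_overlap(ranking, actual_ranking, n):
    """Number of contestants shared by the predicted and actual top n."""
    return len(set(ranking[:n]) & set(actual_ranking[:n]))

# Hypothetical forecasts from three entrants for five contestants.
forecasts = [
    {"A": 90, "B": 70, "C": 50, "D": 30, "E": 10},
    {"A": 60, "B": 80, "C": 20, "D": 40, "E": 30},
    {"A": 75, "B": 60, "C": 55, "D": 20, "E": 40},
]
actual_ranking = ["B", "A", "E", "C", "D"]  # invented outcome

ranking = consensus_ranking(forecasts)
hits = top_n_overlap(ranking, actual_ranking, 3)
print("consensus ranking:", ranking)
print("correct in top 3:", hits)
```

Averaging tends to cancel the idiosyncratic errors of individual entrants, which is one reason a consensus can do better than most of the forecasts it is built from.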

What is the Kaggle business model?

Kaggle charges a consulting fee (for time spent preparing competitions) and a combination of listing fee and success fee. We don’t charge a listing/success fee on public research problems where the winning methodology will be open sourced.

It seems to me that, if nothing else, a competition like this might help to establish a baseline in classifier performance that can be the jumping off point for the investigation of advanced techniques. In other words, the scientist or computer programmer looking to develop a truly advanced data-mining solution could determine rather quickly what the minimal performance requirements of an advanced solution might be by having first determined, using Kaggle, what the collective intelligence of the wider public can come up with.

We actually think Kaggle can facilitate advanced solutions to data modeling problems. For the data mining competitions we’ve run so far, we find that top entries tend to bump up against an upper bound, which we believe to be the best that can be done given the inherent noise and richness of a dataset.

So let’s turn to the scientific subject that is the primary interest of our readers: Astronomy. How might Kaggle be used for addressing data-mining problems in Astronomy? One project that comes to mind is the GREAT08 Challenge. There they were looking to improve methods for inferring the structure of dark matter in the universe by analyzing images of gravitational lensing.

I’m not an astronomer so I can’t give a good answer to this question. However, I’m sure your readers will have ideas – would be great to hear any thoughts in the comments thread.

For further reading:

Unleashing Your Inner World Cup Geek | Wired

Your chance to beat the banks | The Independent

Crowdsourcing HIV Research | Slashdot


Perspectives | Learning from SETI: Overcoming the roadblocks to discovery

July 23, 2010

Imagine you’re a scientist looking to make a discovery – not merely an insight, but a profound, earth-shattering, once-in-a-lifetime kind of discovery; a discovery so significant, it will change the course of history, and man’s perceived place in the universe. You believe it’s out there waiting to be revealed. Logic alone tells you it must be so. You start collecting data. And you collect and analyze, collect and analyze. And you do this for fifty years, and still you find nothing! Unbelievable!

What do you do? Well, you have several options. First, you can try to increase the amount of data you are collecting. Perhaps your signal is very weak and merely hiding amidst the cosmic noise. Secondly, you can change your data. Maybe you’ve been collecting the wrong type of data. Maybe you’ve been looking in the wrong places, or at the wrong time. Perhaps you simply need to be a bit more clever about where and when and how you gather your raw observations. Your third and final option is to try to look at your data with a fresh perspective – to change your analysis. Maybe the signal has been there all along, but you just aren’t sifting through the data in the right way. You’re looking for the wrong patterns. Maybe the pattern you’re looking for is really quite alien.

By now, you’ve probably guessed that what I’m talking about, of course, is that most profound and potentially history-making, career-risking data-mining effort of all time: The Search for Extraterrestrial Intelligence (SETI). And which of the above strategies is the SETI Institute currently pursuing to address the fact that after all these years, it has yet to detect a signal from an alien intelligence? Answer: All of the Above!

Increasing SETI’s data receiving capacity. SETI is pursuing a major technological upgrade to its receivers via the development of the Allen Telescope Array. Amir Alexander offers a brief history of the SETI project in which he describes the Allen Array as “one of the best funded and most promising projects for the future of SETI.” He goes on to write:

The Allen Array represents a true breakthrough for radio SETI. As a dedicated observatory, SETI researchers will be using it year-round to search for alien signals, as compared to the several weeks every year, which are allotted to Project Phoenix at Arecibo. In addition, since it is composed of hundreds of separate dishes, the array can be pointed at several points in the sky at the same time, and therefore listen to signals from several stars simultaneously. The latest technology will enable the Array to cover a frequency band 9 gigahertz wide, more than 3 times wider than project Phoenix, which scans the widest band of any of today’s searches. All of this represents a qualitative leap in the capacity of SETI searches, and increases the chances of detecting a “real” signal several-fold.

New sources of data. Dr. Seth Shostak gave a talk earlier this year at Foothill College as part of the Silicon Valley Lecture Series entitled: “The Search for Intelligent Life Among the Stars: New Strategies.” In his talk, he presents a wonderful range of novel and clever ideas aimed at trying to use the detection resources available in new and smarter ways. One of my favorite ideas: Theoretical modeling suggests that planets can in fact form in binary systems, and some such planets have already been discovered. An intelligent alien race in such a system would likely try to colonize planets in the companion star system. If the orbital plane of the binary system is in the line of sight of our own solar system, we will observe the star system as an eclipsing variable star. Many such stars are known. Now imagine this alien civilization communicating back and forth with its colony. At times when we would observe the eclipse, the communications beam will be focused right in our direction! So why not point our receivers at eclipsing variables specifically when they are undergoing the eclipse? As with all good strategies, this approach tells you “when to look and where.”

credit: Cosmos – The SAO Encyclopedia of Astronomy
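The "when to look" half of the eclipsing-variable idea above comes down to simple arithmetic once a star's eclipse timing is known. A minimal sketch, assuming a strictly periodic binary described by a linear ephemeris (eclipse n occurs at T0 + n × P): the function name, the epoch, and the period below are hypothetical, and only primary eclipses are listed.

```python
import math

def upcoming_eclipses(epoch, period, after, count):
    """Mid-eclipse times T0 + n * P for the first `count` eclipses at or after `after`.

    All times are in days on the same time scale (for example, Julian Date).
    """
    first_cycle = math.ceil((after - epoch) / period)
    return [epoch + (first_cycle + k) * period for k in range(count)]

# Hypothetical binary: reference eclipse at JD 2455000.0, period 2.5 days.
times = upcoming_eclipses(epoch=2455000.0, period=2.5, after=2455006.0, count=3)
for t in times:
    print(f"point the receivers at JD {t:.1f}")
```

Real binaries can show period changes and light-travel-time effects, so a practical observing schedule would add a margin around each predicted time.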

New analytical methods. Here, the SETI Institute has done something truly interesting. Jill Tarter, director of the Center for SETI Research, recently announced a new initiative by the SETI Institute to enlist the help of researchers and programmers to see if the signal processing and pattern detection algorithms can be improved.

We’d like to take the next step and invite all of the smart people in the world who don’t work for Berkeley or for the SETI Institute to use the new Allen Telescope. To look for signals that nobody’s been able to look for before because we haven’t had our own telescope; because we haven’t had the computing power…For people who don’t have black belts in digital signal processing, we want to take regions of the spectrum that are overloaded with signals and get those out and have them visualized in different ways against different basis vectors. We’d like to see if people can use their pattern recognition capabilities to look or maybe listen; to tease out patterns in the noise that we don’t know about (Source: O’Reilly Radar).

So remember: the next time you’re stuck in your own efforts at scientific enlightenment and discovery, think about the challenge of SETI and its strategy: more data, more data sources, and better analysis.

