I admit it. I’m not on Facebook. And I’ve been holding off on tweeting as well. Sometimes I think the association with all things Facebook and Twitter is like a virus. Do I really want to follow brand X on Facebook? Really? But I’ve decided Twitter in particular can serve a very useful role in bringing attention to news, events, research papers, and websites dedicated to astronomical data mining. I can’t tell you how many times I’ve come across a story that I wanted to share on this site but didn’t really have the time to write about. So I’m going to continue to try to make AstroDataMining.net a useful resource for those interested in learning about this emerging field. I think there is a significant opportunity and need for more fellow computer scientists to become involved in the field. But at the same time, I’ve started a Twitter account for those frequent occasions when some new development occurs in astronomical data mining. So… follow Data-Mining for Astronomy on Twitter!
Perspectives | The growing importance of data curation
August 29, 2011
Excerpt:
With the world awash in information, curating all the scientifically relevant bits and bytes is an important task, especially given digital data’s increasing importance as the raw materials for new scientific discoveries, an expert in information science at the University of Illinois says. Carole L. Palmer, a professor of library and information science, says that data curation — the active and ongoing management of data through their lifecycle of interest to science — is now understood to be an important part of supporting and advancing research….The Center for Informatics Research in Science and Scholarship at Illinois will receive about $2.9 million as a partner on the Data Conservancy project, a $20 million initiative led by Sayeed Choudhury at the Johns Hopkins University Sheridan Libraries. The five-year award, one of the first two in the National Science Foundation’s DataNet program, will fund developing infrastructure for the management of the ever-increasing amounts of digital research data.
Ref: University of Illinois at Urbana-Champaign. “Deluge of scientific data needs to be curated for long-term use.” ScienceDaily, 25 Feb. 2010. Web. 29 Aug. 2011.
Papers | Galaxy Zoo morphology improves photometric redshifts in the Sloan Digital Sky Survey
May 5, 2011
Submitted by Dr. Michael J. Way, NASA Goddard Institute for Space Studies.
Given the recent release of Galaxy Zoo Data Release 1, researchers can begin to explore the myriad ways that one can use the most accurate and numerous database of galaxy morphology ever compiled. To that end, we have used galaxy photometry and redshift information from the Sloan Digital Sky Survey in combination with precise knowledge of galaxy morphology via the Galaxy Zoo project to calculate photometric redshifts using Gaussian process regression.
We are primarily interested in obtaining accurate photometric redshifts for a subset of SDSS galaxies called the luminous red galaxies. These galaxies are normally found in denser regions of the local universe. They are interesting because they tend to be accurate tracers of the large scale structure in the universe and have been used for measuring the Baryonic Acoustic Oscillation signal thus putting better constraints on present day cosmological models.
The Galaxy Zoo database is used to segregate the elliptical galaxies from the spirals (we focus on the former). Then we obtain a variety of derived primary and secondary isophotal shape estimates from the Sloan Digital Sky Survey imaging catalog (e.g. the amount of light within the 50% Petrosian radius). Using these shape estimates in combination with the five-bandpass photometry of elliptical galaxies with redshifts from the SDSS, we use a non-linear regression training set method (Gaussian process regression) to estimate their photometric redshifts. The root mean square error for luminous red galaxies classified as ellipticals is as low as 0.0118, which is nearly a factor of 2 lower than typical estimates for galaxies in the SDSS (see figure).

One can see in the lower left panel that the photometric redshift errors are lowest for the luminous red galaxies classified as ellipticals. The best results are obtained when using their 5-band photometry and a variety of isophotal shape estimates, denoted as B. See the paper on arXiv for more details.
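For readers new to Gaussian process regression, here is a minimal sketch of the idea in plain Python. It is not the code used in the paper: the two input "colours", the smooth colour-redshift relation, the squared-exponential kernel and its hyperparameters are all assumptions made up for illustration. The point is only to show the training-set workflow: fit on objects with known redshifts, predict for held-out objects, and report the root mean square error.

```python
import math
import random

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return variance * math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_fit(train_x, train_y, noise=1e-4, **kernel_args):
    """Precompute the weights (K + noise * I)^-1 (y - mean) for later predictions."""
    mean_y = sum(train_y) / len(train_y)
    gram = [[rbf_kernel(a, b, **kernel_args) + (noise if i == j else 0.0)
             for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    weights = solve(gram, [y - mean_y for y in train_y])
    return train_x, weights, mean_y, kernel_args

def gp_predict(model, x):
    """Posterior mean of the Gaussian process at a new point x."""
    train_x, weights, mean_y, kernel_args = model
    return mean_y + sum(w * rbf_kernel(x, t, **kernel_args)
                        for w, t in zip(weights, train_x))

def rms_error(predicted, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))

def toy_redshift(colours):
    # Hypothetical smooth relation, used only to generate toy data.
    return 0.1 + 0.2 * colours[0] + 0.05 * math.sin(3.0 * colours[1])

rng = random.Random(42)

def make_sample(n):
    xs = [[rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)] for _ in range(n)]
    zs = [toy_redshift(x) + rng.gauss(0.0, 0.005) for x in xs]
    return xs, zs

train_x, train_z = make_sample(60)   # objects with "spectroscopic" redshifts
test_x, test_z = make_sample(30)     # held-out objects
model = gp_fit(train_x, train_z, noise=1e-4, length_scale=0.5)
predicted = [gp_predict(model, x) for x in test_x]
rms = rms_error(predicted, test_z)
baseline_rms = rms_error([model[2]] * len(test_z), test_z)  # always predict the mean
print(f"GP rms error: {rms:.4f}  (mean-only baseline: {baseline_rms:.4f})")
```

In a real application the inputs would be the five magnitudes plus shape estimates, and the kernel hyperparameters would be fitted to the training data rather than fixed by hand.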
The next step will be to use classification techniques from the Machine Learning literature to classify all of the elliptical galaxies in the ~350 million object database of the SDSS. This has already been attempted by one group using the ~900,000 Galaxy Zoo morphologies and isophotal shape estimates as training samples. One would expect to be able to classify approximately 50-100 million luminous red galaxies as ellipticals. These in turn can be used as the most accurate probes thus far in estimating Baryonic Acoustic Oscillations at unprecedented depth.
Surveys | RAVE: The Radial Velocity Experiment
May 5, 2011
Editor’s note: The following was kindly submitted by Dr. Arnaud Siebert of the Centre de Données de Strasbourg (CDS) and the Observatoire Astronomique de Strasbourg. RAVE has just made public its third data release, as described in greater detail in a paper available on the arXiv, to appear in the Astrophysical Journal.

Heliocentric radial velocity of stars measured by RAVE projected on to the night sky. The smooth change in color (radial velocity) is due to the motion of the Sun around our galaxy.
The RAVE (RAdial Velocity Experiment) project is a multi-fiber spectroscopic survey of stars in the Milky Way using the 1.2-m UK Schmidt Telescope of the Anglo-Australian Observatory (AAO). The RAVE collaboration consists of researchers from over 20 institutions around the world and is coordinated by the Astrophysical Institute Potsdam.
As a southern hemisphere survey covering 20,000 square degrees of the sky, RAVE’s primary aim is to derive the radial velocity of stars from the observed spectra. Additional information is also derived such as effective temperature, surface gravity, metallicity, photometric parallax and elemental abundance data for the stars. The survey represents a giant leap forward in our understanding of our own Milky Way galaxy; with RAVE’s vast stellar kinematic database the structure, formation and evolution of our Galaxy can be studied.
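As a rough illustration of how a radial velocity follows from a spectrum, the sketch below applies the non-relativistic Doppler formula to a single spectral line. The wavelengths are made-up illustrative values, not RAVE measurements, and a real pipeline would fit the whole spectrum against templates and then correct for the Earth's motion to obtain a heliocentric velocity.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458

def radial_velocity_km_s(observed_wavelength, rest_wavelength):
    """Non-relativistic Doppler shift: v = c * (lambda_obs - lambda_rest) / lambda_rest.

    Positive values mean the star is receding, negative values approaching.
    Both wavelengths must be in the same unit.
    """
    return SPEED_OF_LIGHT_KM_S * (observed_wavelength - rest_wavelength) / rest_wavelength

# Hypothetical example: a line with a rest wavelength of 8542.09 Angstrom
# observed at 8543.00 Angstrom.
rest = 8542.09
observed = 8543.00
velocity = radial_velocity_km_s(observed, rest)
print(f"radial velocity: {velocity:.1f} km/s")
```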
RAVE began observing in 2003 and had obtained 465,000 observations of stars by the end of 2010. It is expected to run to the end of 2012. In April 2011 RAVE released its third catalog, containing more than 80,000 radial velocity measurements and atmospheric parameters for nearly 40,000 stars. A full description of the project can be found on the RAVE project website.
Papers | QSO Selection Algorithm Using Time Variability and Machine Learning: Selection of 1,620 QSO Candidates from MACHO LMC Database
May 1, 2011
This research synopsis was submitted by Dae-Won Kim, Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA. A preprint of this paper, which is to appear in the Astrophysical Journal, is available on the arXiv preprint server.
Modern astronomy is entering a completely new era driven by immense amounts of observational data. For instance, ongoing and future large-scale surveys such as Pan-STARRS and LSST will produce several terabytes of data or more per night. Wide-field data mapping of the sky will open a new paradigm of astronomy in both scientific and data-handling aspects. In particular, it will be practically impossible to manually examine all the data in order to discover scientifically meaningful information. In other words, innovative algorithms that can automatically analyze the data with minimal human intervention, and that can deliver only the meaningful information to astronomers, are becoming more and more important.
This paper introduces such an algorithm to select QSOs (Quasi-Stellar Objects), which typically show strong non-periodic or pseudo-periodic variability. In the absence of spectroscopic data, such an algorithm will be a very powerful tool to select QSOs. In particular, for future large-scale surveys (Pan-STARRS and LSST), spectroscopic observation will be very expensive due to their wide fields of view and limiting magnitudes.
We introduced 11 time series features that quantify different variability characteristics of light curves, which were confirmed to be practical for separating QSOs from other types of variable stars (e.g. Cepheids, RR Lyraes, eclipsing binaries, Be stars, micro-lensing, long-period variables, etc.) and non-varying stars. Figure 1 shows an example of the scatter plot of two time series features. As the figure shows, the two features are useful to separate each of the different types of variable stars. We then employed a supervised machine learning technique called a ‘Support Vector Machine’ that can train a classification model in a feature space of any dimension. We claim that using hyperplane cuts derived in the 11-D space (i.e. the space of the 11 time series features) separates QSOs much better than conventional 2-D hard cuts.
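To make the feature-plus-classifier idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the two features (scatter and lag-1 autocorrelation) are not the paper's 11 features, the light curves are simulated (white noise for non-variables, a random walk standing in for aperiodic QSO-like variability), and the classifier is a simple linear SVM trained by batch subgradient descent rather than whatever SVM implementation the authors used.

```python
import math
import random

def variability_features(mags):
    """Two toy features: scatter of the light curve and its lag-1 autocorrelation."""
    n = len(mags)
    mean = sum(mags) / n
    var = sum((m - mean) ** 2 for m in mags) / n
    lag1 = sum((mags[i] - mean) * (mags[i + 1] - mean) for i in range(n - 1)) / (n * var)
    return [math.sqrt(var), lag1]

def simulate_light_curve(rng, variable, n_epochs=100):
    """Constant star plus noise, or a random walk plus noise if `variable` is True."""
    drift, mags = 0.0, []
    for _ in range(n_epochs):
        if variable:
            drift += rng.gauss(0.0, 0.05)
        mags.append(18.0 + drift + rng.gauss(0.0, 0.02))
    return mags

def make_dataset(rng, n_per_class):
    xs, ys = [], []
    for label, variable in ((1, True), (-1, False)):
        for _ in range(n_per_class):
            xs.append(variability_features(simulate_light_curve(rng, variable)))
            ys.append(label)
    return xs, ys

def fit_scaler(xs):
    """Per-feature mean and standard deviation of the training set."""
    cols = list(zip(*xs))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return means, stds

def scale(x, scaler):
    means, stds = scaler
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

def train_linear_svm(xs, ys, lam=0.01, learning_rate=0.1, n_iter=300):
    """Minimise lam/2 * |w|^2 + mean hinge loss by batch subgradient descent."""
    n, dim = len(xs), len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1.0:
                grad_w = [g - y * xj / n for g, xj in zip(grad_w, x)]
                grad_b -= y / n
        w = [wj - learning_rate * g for wj, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def predict(x, w, b):
    """Which side of the separating hyperplane the scaled feature vector falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1

rng = random.Random(0)
train_x, train_y = make_dataset(rng, 40)
test_x, test_y = make_dataset(rng, 20)
scaler = fit_scaler(train_x)
w, b = train_linear_svm([scale(x, scaler) for x in train_x], train_y)
predictions = [predict(scale(x, scaler), w, b) for x in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same code works unchanged if `variability_features` returns 11 numbers instead of 2, which is the sense in which the hyperplane cut generalizes a hand-drawn 2-D cut. The real problem is of course harder, since QSOs must also be separated from other kinds of variable sources, not just from non-variables.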
We applied the algorithm to the MACHO database, consisting of 40 million light curves, and found 1,620 QSO candidates. We used the Harvard Odyssey cluster to analyze the whole dataset, which took about two days. The identified candidates were cross-matched with mid-IR and X-ray catalogs, confirming that the majority of them are very strong QSO candidates.
Figure 1. Scatter plot of two time series features. Each axis is a different time series feature. Different symbols and colors denote different variable sources (gray dots: non-variables, black x’s: eclipsing binaries, magenta crosses: micro-lensing, yellow x’s: RR Lyraes, green x’s: Cepheids, cyan crosses: long-period variables, blue crosses: Be stars, red squares: QSOs). Most of the different variable types are grouped in different regions.
News | Kepler satellite and citizen planet hunters
April 15, 2011
Excerpt: While computers are terrific at high-volume data-processing, nothing beats the human eye for pattern-recognition – which is why a project dreamed up by Yale University astronomer Debra Fischer, a veteran planet hunter and Kepler project scientist, has turned out to be so extraordinarily useful. Called Planethunters.org, it lets ordinary folks with no scientific training at all help find planets the Kepler software has missed. It works so well that in just a few short months of operation, the more than 22,000 visitors to the website have found nearly 50 potential planets, which are being sent on to Kepler headquarters at the NASA Ames Research Center in California for followup.
Source: Yahoo / Time
Papers | Synthetic Milky Way Galaxies
January 24, 2011
Future surveys such as the LSST and GAIA will create object catalogs of staggering size. But beyond such qualitative statements, how do astronomers anticipate the scientific return on these projects? To some extent, you can extrapolate from past surveys, taking into account anticipated advances in instrumentation. “Survey X imaged to magnitude M over P percent of the sky. Survey X’ will image to magnitude M’ over P’ percent of the sky, therefore….” But there are really a great number of variables to consider. In order to anticipate the productivity and capabilities of future surveys, one approach is to generate synthetic astrometric catalogs based upon models for the density and distribution of stars throughout the galaxy. As explained in a recent paper by Bland-Hawthorn, Johnston, and Binney (“Galaxia: A Code to Generate a Synthetic Survey of the Milky Way”), such synthetic catalogs are useful for:
a. Interpreting observational data
b. Testing theories on which the models are based, and
c. Testing the capabilities of different instruments and for defining strategies to reduce measurement errors (BJB, 2011).
Using a program they developed, known as Galaxia, the authors implemented a complex model of the “stellar content” of the Galaxy as a function of position, velocity, age, metallicity, and mass. Different components of the Milky Way (thin/thick disc, stellar halo, galactic bulge) are modeled separately.
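To give a flavor of what "generating a synthetic catalog" means, here is a deliberately tiny sketch in plain Python. It is not Galaxia and shares none of its code: it models a single exponential disc, ignores velocity, age, metallicity and dust, and every number in it (scale length, scale height, the Sun's distance from the Galactic centre, the flat luminosity function) is an assumed placeholder. It does show the basic loop, though: draw stars from a density model, compute what an observer at the Sun would see, and ask how many stars a survey with a given limiting magnitude would pick up.

```python
import math
import random

# All values below are illustrative assumptions, not fitted Galactic parameters.
SUN_RADIUS_KPC = 8.0
DISC_SCALE_LENGTH_KPC = 2.5
DISC_SCALE_HEIGHT_KPC = 0.3

def sample_disc_star(rng):
    """Draw one star from a double-exponential disc; returns galactocentric (x, y, z) in kpc."""
    # Surface density ~ exp(-R / Rd) implies p(R) ~ R * exp(-R / Rd),
    # which is a gamma distribution with shape 2 and scale Rd.
    radius = rng.gammavariate(2.0, DISC_SCALE_LENGTH_KPC)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    height = rng.expovariate(1.0 / DISC_SCALE_HEIGHT_KPC) * rng.choice((-1.0, 1.0))
    return radius * math.cos(azimuth), radius * math.sin(azimuth), height

def apparent_magnitude(absolute_mag, distance_kpc):
    """Distance modulus with no extinction: m = M + 5 * log10(d / 10 pc)."""
    return absolute_mag + 5.0 * math.log10(distance_kpc * 1000.0 / 10.0)

def synthetic_catalog(n_stars, seed=0):
    """List of (distance in kpc, apparent magnitude) as seen from the Sun."""
    rng = random.Random(seed)
    catalog = []
    for _ in range(n_stars):
        x, y, z = sample_disc_star(rng)
        distance = math.sqrt((x - SUN_RADIUS_KPC) ** 2 + y ** 2 + z ** 2)
        absolute_mag = rng.uniform(0.0, 10.0)  # flat toy luminosity function
        catalog.append((distance, apparent_magnitude(absolute_mag, distance)))
    return catalog

def detected_fraction(catalog, limiting_mag):
    """Fraction of simulated stars brighter than the survey's limiting magnitude."""
    return sum(mag <= limiting_mag for _, mag in catalog) / len(catalog)

catalog = synthetic_catalog(20000)
shallow = detected_fraction(catalog, 12.0)
deep = detected_fraction(catalog, 20.0)
print(f"fraction detected at m < 12: {shallow:.3f}, at m < 20: {deep:.3f}")
```

Comparing the two fractions is the toy version of the "Survey X versus Survey X’" question above, except that the answer comes from the model rather than from extrapolation.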
In order to consider how future surveys might perform, you also have to take into account factors such as extinction due to interstellar dust, which itself requires a 3D model for the distribution of dust in the galaxy. The figure below shows an impressive correlation between Hipparcos observations and those that would be anticipated based upon the models encoded within Galaxia.
Posted by John Rachlin 


