T. Jarrett, IPAC
(970722)
Most analysis work has focused on the Coma cluster, located near the galactic pole and thus in a region of very low stellar density. Since the level-1 specifications for extended sources apply as well to higher stellar densities (e.g., those found at galactic latitudes as low as 10 degrees), it is high time for a close examination of GALWORKS output from a typical field near this limit. New 3-channel data acquired in May of 1997 contain several scans through a low galactic latitude field, ranging in glat from 7 to 10 degrees and in longitude from 42 to 48 degrees. Correspondingly, the stellar number density is very high, ranging from 6,900 srcs per sq. degree (brighter than K = 14) to 10,500 srcs per sq. degree.
We have chosen six scans from the night of 970521 for study. The scans are 111 to 116. The total coverage is about 6 sq. degrees. We don't expect to tally a large number of galaxies (since it is a random field, probably devoid of a nearby galaxy cluster), but we do expect to find many double and triple stars -- the primary contaminant of the extended source catalog in high stellar density regions. Thus, this exercise is intended to evaluate the performance of the double and triple star parameter "killers", and to apply whatever "tuning" is necessary to optimize these star-galaxy discriminants.
For this data set, the relevant level-1 specification is the differential completeness. The spec requires that we achieve at least 80% completeness for fields with glat between 10 and 20 degrees. Since the field under study here has slightly higher stellar density (with glat between 7 and 10 degrees), if we achieve this level of completeness then we easily achieve the spec for high glat fields.
In order to build a large "library" of double and triple stars, we have lowered the scoring thresholds to pick up more non-extended objects for analysis. Thus, the sample of galaxy candidates will include both sets of false sources: those that would normally have failed the standard (score) thresholds, and those that have --some-- parametric values closely mimicking real galaxies.
The first column is the J band image, the second column the H band image, the third column the K band image, and the fourth column the DSS optical image. The dark blue elliptical contour represents the 20 mag per sq. arcsec isophotal area, and the light blue contour the "flux growth" elliptical area. Sources that had been "subtracted" from the object fields are circled in red, with the size of the circle given by the subtraction radius. Sources marked with a green circle or ellipse were previously processed and subsequently blanked from the object field (blanked pixels are then substituted with the corresponding isophotal values given by the object of interest, thereby recovering pixel information).
The following images give a flavor of the wide variation in multiple star objects that, at least for some parameters, closely mimic the look of real galaxies.
Double Stars
Nasty Triple Stars
The following plots show the 2-D projection (in the flux plane) of the parametric scores, including the general star-galaxy discriminants, "mxdn" and "sh", as well as the double and triple star discriminants, "wsh", "vint", "r23" and "trip". The integrated flux is represented by the 20 mag per arcsec**2 isophotal mag.
For more information on the various star-galaxy discrimination parameters, see Star - Galaxy Discrimination Parameters.
Symbols:
Since double and triple stars are "extended" relative to isolated stars (with which the ridgefile is determined), their parametric value (i.e., score) is degenerate with that of galaxies. Other parametric measures, such as the FWHM or image moments, will also be inadequate for discriminating real galaxies from non-extended objects.
Additional parametric measures are currently in development. More on this stuff later.
It is possible that there is an optimum combination of the various scoring parameters that improves the current state of star-galaxy separation. We will be testing various linear combinations of the scores to find those that minimize stellar contamination. In addition, more sophisticated decision tree methods, e.g., oblique decision trees, will be implemented to see if any improvement in separating stars from galaxies can be obtained with the current set of parametric measures. A report on this work is forthcoming.
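As a rough illustration of the linear-combination idea, the sketch below scans the weights of a two-score combination and keeps the one that admits the fewest non-galaxies while still passing a required fraction of the galaxies. The toy scores, the function names and the brute-force scan are all assumptions made for illustration (as is the convention that a higher score is more galaxy-like); none of this is GALWORKS code or data.

```python
import math

def combined(scores, weights):
    """Weighted sum of one source's scores."""
    return sum(w * s for w, s in zip(weights, scores))

def contamination_at(galaxies, stars, weights, min_completeness=0.8):
    """Return (n_contaminants, threshold) for the loosest threshold that
    still keeps at least `min_completeness` of the galaxies."""
    gal = sorted((combined(g, weights) for g in galaxies), reverse=True)
    n_keep = math.ceil(min_completeness * len(gal))
    threshold = gal[n_keep - 1]          # lowest-scoring galaxy that is kept
    n_bad = sum(combined(s, weights) >= threshold for s in stars)
    return n_bad, threshold

def best_weights(galaxies, stars, steps=20):
    """Scan unit-length weight vectors (cos t, sin t) over a quarter circle."""
    best = None
    for i in range(steps + 1):
        t = 0.5 * math.pi * i / steps
        w = (math.cos(t), math.sin(t))
        n_bad, thr = contamination_at(galaxies, stars, w)
        if best is None or n_bad < best[0]:
            best = (n_bad, w, thr)
    return best

# Hypothetical (score_a, score_b) pairs, e.g. a double-star score and a
# triple-star score. Illustrative values only.
toy_galaxies = [(6.0, 7.0), (5.0, 8.0), (7.0, 5.5), (4.0, 9.0), (6.5, 6.5)]
toy_stars = [(6.0, 1.0), (1.0, 6.5), (2.0, 2.0), (5.5, 2.5), (3.0, 5.0)]

n_bad, weights, threshold = best_weights(toy_galaxies, toy_stars)
print(n_bad, weights, threshold)
```

In this toy set either score alone lets at least one non-galaxy through at 80% completeness, while a mixed weighting lets none through; whether real GALWORKS scores behave this way is exactly what the testing would have to show.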
A New Visualization Method (**updated 970808**)
Recent discussion with Eric Feigelson (PSU), an astrostatistics expert, brought to light a new (but simple) way to visualize the n-dimensional parameter (score) space. The idea is to "connect the dots" for each score parameter per source, coloring the lines according to the class of object. The N-space cluster is then projected into the viewing plane: parameter vs score value. The plots below demonstrate this method.
Five gifs are given below, sorted according to integrated brightness: from the brightest to the faintest samples. Real galaxies are colored green, double stars red and triple stars blue. The N-space scores include: mxdn, sh, 2moment, wsh, msh, vint, r23, R(3sig) and darea/R. The latter two parameters respectively refer to the 3-sigma isophotal radius (semi-major axis) and the differential area between the 5-sigma and 3-sigma isophotes normalized to R(3sig).
It can be seen from the plots that faint galaxies and faint double/triple stars only "separate" with the triple-killer scores, vint and r23, whereas for brighter objects the additional double-killer scores, wsh and msh, also separate galaxies from non-extended sources. These results underscore the difficulty of star-galaxy separation at the sensitivity limit (or confusion limit, as the case may be) of the survey.
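The "connect the dots" construction described above can be sketched in a few lines. The example only builds the polylines (one per source, coloured by class) and leaves the drawing to whatever plotting package is at hand; the score values are hypothetical, and the dictionary layout and function name are assumptions made for illustration.

```python
# Colour convention taken from the text: galaxies green, doubles red, triples blue.
CLASS_COLOURS = {"galaxy": "green", "double": "red", "triple": "blue"}

# Parameter axes in the order listed in the text.
PARAMS = ["mxdn", "sh", "2moment", "wsh", "msh", "vint", "r23", "R(3sig)", "darea/R"]

def polylines(sources, params=PARAMS):
    """Turn each source into one coloured polyline.

    x is the position of the parameter along the horizontal axis and y is
    the source's value for that parameter, so every source becomes a single
    "connect the dots" line in the parameter-vs-score plane.
    """
    lines = []
    for src in sources:
        points = [(x, src["scores"][name]) for x, name in enumerate(params)]
        lines.append((CLASS_COLOURS[src["class"]], points))
    return lines

# Hypothetical values, for illustration only.
toy_sources = [
    {"class": "galaxy", "scores": dict(zip(PARAMS, [6, 7, 5, 8, 7, 9, 9, 6, 5]))},
    {"class": "double", "scores": dict(zip(PARAMS, [6, 6, 5, 2, 3, 8, 7, 5, 4]))},
    {"class": "triple", "scores": dict(zip(PARAMS, [7, 6, 6, 7, 6, 1, 2, 6, 5]))},
]

lines = polylines(toy_sources)
for colour, points in lines:
    print(colour, points)
```

Each polyline can then be handed to a plotting routine, for example by unzipping `points` into x and y lists and calling matplotlib's `plot(xs, ys, color=colour)`.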
We generate an oblique decision tree (ODT) using an algorithm developed by Murthy, Kasif & Salzberg (1994). The ODT is generated from the sources classified with this data set (sometimes referred to as the "training set"), consisting of real galaxies, stars, double stars and triple stars. There are roughly 670 objects (only 33 of which are galaxies -- far too small a number for building DTs -- but sufficient for this exercise). Once the ODT is built, we then turn it loose on the training set itself. The results are tabulated below. In practice, we will use the training set to build an ODT, then apply the tree to a new data set to classify it. Since this separate data set does not yet exist, we can only classify the training set itself -- consequently we learn little (since the ODT is optimized to the training set, it will naturally do well at classifying the training set).
note: the decision tree is generated based on the following parameters (& scores): isophotal mag, mxdn, sh, wsh, vint and r23. Each band is treated separately.
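For readers unfamiliar with oblique trees: where an ordinary (axis-parallel) decision tree tests one parameter per node, an oblique tree tests a linear combination of several. The sketch below shows only how such a tree classifies a source once it has been built. The coefficients, node layout and class labels are hypothetical, and nothing here reproduces the Murthy, Kasif & Salzberg training algorithm.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Parameters named in the note above, in a fixed order.
FEATURES = ("isophotal_mag", "mxdn", "sh", "wsh", "vint", "r23")

@dataclass
class Leaf:
    label: str

@dataclass
class Node:
    weights: Sequence[float]     # one coefficient per feature
    offset: float
    above: Union["Node", Leaf]   # branch taken when the test is positive
    below: Union["Node", Leaf]   # branch taken otherwise

def classify(tree, x):
    """Walk the tree, testing sum(w_i * x_i) + offset > 0 at each node."""
    while isinstance(tree, Node):
        value = sum(w * xi for w, xi in zip(tree.weights, x)) + tree.offset
        tree = tree.above if value > 0 else tree.below
    return tree.label

# Hypothetical two-node tree: the root mixes wsh, vint and r23; the second
# node mixes mxdn and sh. A real tree would be fitted to the training set.
toy_tree = Node(
    weights=(0.0, 0.0, 0.0, 0.5, 1.0, 1.0), offset=-12.0,
    above=Node(weights=(0.0, 1.0, 1.0, 0.0, 0.0, 0.0), offset=-8.0,
               above=Leaf("galaxy"), below=Leaf("star")),
    below=Leaf("multiple star"),
)

galaxy_like = (13.5, 6.0, 5.0, 6.0, 7.0, 8.0)    # high vint and r23
multiple_like = (13.5, 6.0, 5.0, 2.0, 1.0, 2.0)  # low vint and r23
print(classify(toy_tree, galaxy_like))
print(classify(toy_tree, multiple_like))
```

Since each band is treated separately, one such tree per band would be needed.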
Notice that all of the galaxies are correctly classified, including the faint end. Furthermore, both J and K have very few false classifications (i.e., a non-galaxy classified as a galaxy). Since we are using an unpruned decision tree, these results are not surprising -- the ODT is well optimized to the training set, and in fact we are probably overfitting the data (see below). Ominously, using a "pruned" decision tree, the results are much worse, both in completeness and reliability. It remains to be seen whether "unpruned" ODTs perform better than pruned ODTs with an independent training set.
Unfortunately we do not know how well the ODT would do with a completely independent training set -- work in progress. Nevertheless, application of an ODT to the GALWORKS data set looks promising.
** UPDATED **
Using the ODT derived from the training set corresponding to the glat = 8 to 10 degrees fields, the COMA training set is classified and compared with the truth table. In practice this is not an optimum procedure, since the COMA sources are located in a low source density region and thus suffer minimal confusion from double stars and the like. Nevertheless it is a constructive exercise.
As expected, the completeness suffers at the faint end for the Coma galaxies since the ODT is derived from a high source density field (thus the thresholds are higher to minimize contamination from double and triple stars). On the other hand, the reliability is very good, only one false source in the entire set.
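The faint-end loss described above is easiest to see when completeness is tabulated per magnitude bin against the truth table. The sketch below assumes a simple list-based layout (one class label and one magnitude per source); the function name, bin edges and toy values are hypothetical and are not the Coma results.

```python
def completeness_by_bin(truth, predicted, mags, edges):
    """Fraction of true galaxies recovered in each magnitude bin.

    truth, predicted: class labels, one per source.
    mags: one magnitude per source; edges: ascending bin edges.
    Bins containing no true galaxies report None.
    """
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [i for i, m in enumerate(mags)
                  if lo <= m < hi and truth[i] == "galaxy"]
        if in_bin:
            found = sum(predicted[i] == "galaxy" for i in in_bin)
            result.append(((lo, hi), found / len(in_bin)))
        else:
            result.append(((lo, hi), None))
    return result

# Hypothetical truth table and classifier output, for illustration only.
toy_mags = [12.5, 13.0, 13.2, 13.8, 14.3, 14.6, 14.9]
toy_truth = ["galaxy", "star", "galaxy", "galaxy", "galaxy", "galaxy", "galaxy"]
toy_pred = ["galaxy", "star", "galaxy", "galaxy", "galaxy", "star", "star"]

table = completeness_by_bin(toy_truth, toy_pred, toy_mags, [12.0, 13.0, 14.0, 15.0])
for (lo, hi), frac in table:
    print(f"{lo:.1f}-{hi:.1f} mag:", frac)
```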
It is interesting to note that the unpruned ODT performs much better with the Coma galaxies than the pruned ODT. We may presume that the pruned ODT is not well matched to the data (but I do not understand why this is the case).
Based on these results, building a training set from high glat fields should result in an ODT that is well matched to Coma and other fields outside of the galactic plane. Again, we conclude that ODTs look very promising toward building a reliable and complete catalog from the extended source database.
See Application of Oblique Decision Trees for additional information regarding ODTs and their application to 2MASS data.
It is not possible to estimate the completeness of the galaxy detections directly, since this requires either (1) deep, higher-resolution CCD images to act as a truth data set, or (2) many repeat scans of the same field. However, we can get some idea of what the completeness probably is by looking at the scores. For example, if we apply a threshold on the "r23" score to eliminate most of the contaminating objects (doubles and triples), say between 5 and 6 in score, 15 of the galaxies (all but one or two at the faint end) remain, while only 6 or 7 non-galaxies remain. Our "completeness" is probably well above 80% for J(20e) < 15, with a reliability of about 70% or so. The reliability can be further refined by raising the threshold to about 8 in score, increasing the reliability to >80%, with some loss of completeness -- perhaps approaching the level-1 specification for completeness. Any further refinement is sure to adversely affect the completeness.
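The reliability figure quoted above follows directly from the surviving counts. A minimal check, taking reliability to mean the fraction of surviving sources that are real galaxies (the function name is an assumption; the counts are those given in the text):

```python
def reliability(n_galaxies, n_contaminants):
    """Fraction of surviving sources that are real galaxies."""
    return n_galaxies / (n_galaxies + n_contaminants)

# Counts quoted in the text for an r23 threshold between 5 and 6:
# 15 galaxies remain, along with 6 or 7 non-galaxies.
for n_bad in (6, 7):
    print(f"{n_bad} non-galaxies: reliability = {reliability(15, n_bad):.2f}")
```

With 6 non-galaxies this gives 0.71 and with 7 it gives 0.68, consistent with "about 70% or so".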