Below are some emails between myself and John Fowler on philosophical issues regarding statistics and uncertainties. Last modified: July 8, 2009 F. Masci ---------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: WSDC Frame Coaddition Subsystem Design Peer Review Report Date: January 27, 2008 9:38:10 PM PST To: fmasci@ipac.caltech.edu Well, I agree and disagree. First of all, inverse-variance weighting is appropriate for Gaussian errors and no others that I know of. This follows from the Fundamental Theorem of Parameter Refinement (as I have come to call it), namely that if the probability densities p1 and p2 describing two independent measurements are Gaussian, then the rigorous density function for the combined measurements is p1*p2/Integral(p1*p2) and is Gaussian, with a mean given by the inverse-variance-weighted average and a variance given by the usual Gaussian formula. If p1 and p2 are Poisson density functions, the refined estimate given by the formula above is not the inverse-variance-weighted average (and the density function is neither Poisson nor Gaussian). So using Gaussian formulas on significantly non-Gaussian errors is indeed a gaffe. But if the Poisson distributions are in their Gaussian asymptotic limits, e.g., means greater than 100 photons or photo-electrons or whatever is counted, then approximating them as Gaussian (with variances equal to the means, since these are the variances) is an excellent approximation for the applications we encounter, hence inverse-variance-weighted averaging is a perfectly fine thing to do. In the Gaussian limit the bias is negligible (otherwise it wouldn't be the Gaussian limit); by bias, I mean due to Poisson skew, not the fact that the variance is equal to the mean; that's not a bias, it's just a fact. It is also true that algebraically it is probably simpler to use the fact that the variances are equal to the measured valued to simplify the formula, e.g., f1 f2 f1 f2 -- + -- -- + -- v1 v2 f1 f2 2 2*f1*f2 = ------- = ------- = -------- = ------- 1 1 1 1 f1 + f2 f1 + f2 -- + -- -- + -- ------- v1 v2 f1 f2 f1*f2 suppose f1 = 2 and f2 = 3; then = 12/5 = 2.4, whereas the straight average is 2.5. The higher flux is properly weighted less because it has a higher variance (which it really does have). However: 2 and 3 are hardly in the Gaussian limit! Consider 200 and 300 instead (still rather faint for IRAC photo-electrons, for which bright sources are in the hundreds of thousands). Then the ratio is still the same, = 240 instead of 250, but compared to 200, 300 is a 7.07-sigma discrepancy! These two numbers are statistically so different that it's probably completely unjustified to average them at all by any method. For a 3-sigma discrepancy, for example, 200 and 242.3, = 219.13, compared to a straight average of 221.15, less than 1% difference. All else equal, I'd believe that smaller average that has the measurement with the higher uncertainty weighted a bit less. But usually one also has read noise, flat-field error, etc., diluting the Poisson character a bit. I still think the Poisson contribution to the variance is plausible, but you do have add the other variances, which probably are not larger for the larger flux, so they tend not only to normal-ize the distribution but also reduce the ratio of the variances. So as long as the Gaussian approximation is used only when it's good, it's fine with me, and the variance being equal to the mean for the Poisson part of it doesn't weaken the approximation at all. Regards, John ------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Re: WSDC Frame Coaddition Subsystem Design Peer Review Report Date: January 28, 2008 11:41:11 AM PST To: fmasci@ipac.caltech.edu Suppose your independent measurements are described by the _same_ probability density function. E.g., repeated counts from a source with a perfect detector and we are in the Gaussian limit. Does it make sense to weight these measurements using the inverse of their values (variances). I would say no. I would agree that for estimating the average, knowing that the measurements come from a single (sufficiently) Gaussian population removes the need for any weighting. But in practice there are three problems: (1.) one generally does not know that a single population is involved, because while the noise variance may be from the same noise population (zero mean), the source could be significantly variable, creating a different population because of the different mean, and this changes the Poisson part of the noise, Gaussian limit or not, invalidating the same-population assumption; (2.) presumably one is going to compute the uncertainty of the average; one could divide the known noise variance by the number of measurements averaged, but this really doesn't save much CPU time compared to the usual Gaussian formula that accompanies inverse-variance weighting; (3.) for automated processing as it tends to be done for survey missions, the measurements are taken some time apart, and the noise generally cannot be assumed not to have changed (e.g., different pixels involved with different responsivities, hence different photo-electron counts for the same flux, hence dfifferent Poisson noise, etc.). Point 3 is typical of the difference between large-scale survey work and one-night observatory work on a particular object of interest; in the latter case, one typically observes several times (actually or in effect) with the same instrument positioned the same way for a relatively short period of time during which noise variations are often much less than orbit-to-orbit and day-to-day. So I'd expect a classical astronomical observer to tend toward special approximations, while mass production of survey data should be done with more general methods. I would only use variance = counts only when I have 1 to 3 repeated measurements to get a crude estimate of the uncertainty. A good astronomer would take many independent measurements and then perform their mean, uncertainty estimates on that ensemble. I will also assume that these measurements (of the same spot on the sky) come from the same population. That's what I called the "classical astronomical observer" situation. One thing that obscures this discussion is that the fact tends to be ignored that Poisson-noise-dominated measurements in their Gaussian limits simply cannot disagree with each other enough to matter significantly enough for the choice of variance to be very important. If the noise is all Poisson noise, and assuming roughly constant responsivities (so that the actual photo-electron counts are similar in the measurements), then unless one is willing to average values many sigma apart from each other, the sigmas are all pretty nearly equal anyway. For example, 1000 photo-electrons is not very bright in IRAC, but it is well into the Gaussian limit. The sigma is 31.6, so another measurement at 1100 is more than 3 sigma different from the first. One has to be concerned about averaging these, since the difference is enough to cause one to suspect variability. But if one insists on averaging them, then the fact that the Poisson distribution is positively skewed really does imply that errors tend to be larger on the positive side of the mean, so the larger value really is a little less guaranteed. The ratio of variances only de-weights the larger value by 10/11, not a shocking amount, and I claim in the right direction. So my bottom line is: don't use Gaussian formulas where they don't apply, and don't average numbers that are many sigma different; if one follows those rules, one can and should treat Poisson noise as Gaussian when the counts are large enough, and one should keep the variance = count relationship. I would be happy to discuss all this in detail, especially if you find any flaws in my argument. Regards, John ------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Poisson vs. Inverse-Variance Date: January 30, 2008 11:34:19 PM PST To: fmasci@ipac.caltech.edu Frank, I have reviewed my old notes on this subject, and my conclusions remain as they were regarding the preferability of using inverse-variance weighted averaging of fluxes when the Poisson noise is safely into its Gaussian limit. However: I do find that it is true that a "bias" persists when the pixels involved have identical responsivity and pure Poisson noise. I don't care about this bias because it is even smaller than I remembered for realistic cases (typically less than 0.1%, as seen below) and is completely washed out by the effects of responsivity differences. But I do concede that technically it does exist in this artificial case of pure Poisson noise (no read noise, etc.) and exactly equal responsivities, which is not what I recalled today in our discussion, so I want to correct that error. The fact that a Poisson distribution is skewed has a peculiar effect on the magnitudes of positive and negative fluctuations of equal probability (or almost equal; generally a given positive fluctuation has no corresponding negative fluctuation of exactly equal probability, but we can consider the negative fluctuation with the closest probability to a given positive fluctuation). Consider a Poisson distribution with a mean of 10,000. Then positive and negative 1-sigma fluctuations have approximately equal probabilities, so 9900 and 10100 are equally probable. Inverse-variance-weighted averaging yields 9999 instead of the true population mean of 10,000; this is a 0.01% bias. Closer to 3-sigma fluctuations: 9700 and 10302 are approximately equally likely (not 10300; this is due to the skew); note that these are about 6 sigma different from each other, or if one considers the sigma of the difference between two random draws, namely sqrt(2)*100, the difference is 4.2 sigma, i.e., probably not advisable for averaging (one would have to strongly suspect source variability). But averaging anyway, the inverse-variance result is 9991.94, which is off by 0.0806% (i.e., less than 0.1% for a highly discrepant pair of measurements). This is about as big as I can picture a bias ever getting for a photo-electron count that is realistic, and even this still has the artificial restrictions of equal responsivities and only Poisson noise. The Poisson skew makes equally probable large-magnitude fluctuations larger for positive fluctuations than for negative ones, whereas it makes small-magnitude fluctuations smaller for positive fluctuations than for negative ones. All this is wiped out by the fact that responsivities do vary by typically tens of percent. I'm attaching the paper I wrote almost four years ago that explores this. It considers (among other things) all combinations of +1-sigma and -1-sigma fluctuations on two pixels, one with responsivity of 0.75 and the other with 1.25. For all four combinations of fluctuations, the inverse-variance-weighted average was closer to the correct value than the unweighted average. On the other hand, you and Sean both mentioned taking more measurements and getting an idea of the parent population (i.e., not merely using an unweighted average, although I did understand Sean to be recommending that in preference to the inverse-variance-weighted average of those were the only two choices). I'd be curious to see how many measurements you would need in order to characterize the parent population to sufficient accuracy for removing those biases of < 0.1% above. I doubt that it is practical; I'm guessing on the order of a hundred measurements, maybe more. Regards, John ------------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Re: Poisson vs. Inverse-Variance Date: January 31, 2008 10:09:45 AM PST To: fmasci@ipac.caltech.edu PS. Today I looked at the question: how close must the two pixel responsivities be in order to preserve the tiny bias that exists for pure Poisson noise? The previous case I did for measurement values of 9700 and 10302 (equally probable plus and minus 3-sigma fluctuations drawn from a Poisson population with a mean of 10000) had a bias of 0.0806%. This bias would be removed if the lower measurement's responsivity were 0.97 relative to 1.0 for the larger, or if the larger's responsivity were 1.03 relative to 1.0 for the smaller. So in this (I think worst typical) case, a 3% responsivity difference would wipe out the bias (we may have gotten that much variation in MIPS24 just from the scan-mirror smudges; do you recall the magnitude of that effect?). Even if the responsivities varied in the wrong direction, this additional non-Poisson effect would have the same magnitude as the Poisson bias, diminishing the role of that bias. --> fmasci proves this is not right. See fmasci email below. This 3% is basically just the difference in sigma needed to cancel (or double) the bias. In principle, you could eliminate the bias by estimating the parent population to a similar accuracy. Estimating the parent population might be done by histogramming measurements and fitting a Poisson shape to them, but that would surely require a lot of measurements. So I just looked at how many measurements you would need to estimate the parent population variance from the sample variance. Sample variances are asymptotically distributed as chi-square with N-1 degrees of freedom, so a 1-sigma uncertainty in the sample variance as an estimate of the population variance is sqrt(2*(N-1)), and the relative uncertainty is therefore sqrt(2*(N-1))/(N-1). Solving for a value of 0.03, we get N = 2223. So you need over 2000 measurements to beat the bias with this method of estimating the parent population, and even then you are getting only a 1-sigma correction. Since the bias is not a random variable, you might want a 2 or 3 sigma correction certainty, i.e., 4 or 9 times as many samples. This doesn't sound like a practical way to beat the bias, but a much smaller sample size is recommended for a coarse check on whether the variance is approximately what one would expect, which I suspect is really what your intuition was going for. For example, to see if the variance is like a Poisson variance to an accuracy of 25%, which is probably plenty good enough for a sanity check, you'd need N = 33. Suppose you were constrained by having only N = 10 measurements: then the relative uncertainty in the population variance would be 47.1%, which might still be good enough for a sanity check. So my feeling is: assume the negligible bias is wiped out anyway by responsivity variations and other non-Poisson noise sources, and therefore use the inverse-variance weighting because it works so well in general for effects that are vastly more important than the negligible bias and is also part of the process of estimating the final uncertainty; do perform whatever sanity checks are available, including at the very least chi-square testing, which itself depends on prior estimates of uncertainties. Case closed! (I think.) But further discussion still potentially interesting... Regards, John ------------------------------------------------------------------ From: fmasci@ipac.caltech.edu Subject: Re: Poisson vs. Inverse-Variance Date: January 31, 2008 2:21:43 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu >PS. Today I looked at the question: how close must the two pixel responsivities be in order to preserve the tiny bias that exists for pure Poisson noise? The previous case I did for measurement values of 9700 and 10302 (equally probable plus and minus 3-sigma fluctuations drawn from a Poisson population with a mean of 10000) had a bias of 0.0806%. This bias would be removed if the lower measurement's responsivity were 0.97 relative to 1.0 for the larger, or if the larger's responsivity were 1.03 relative to 1.0 for the smaller. I don't follow this, but please correct me if I've gone astray. I agree that your bias will be 0.0806% for pure Poisson noise and assuming a constant pixel responsivity of 1 everywhere, i.e: = 2*N1*N2 / (N1 + N2) ~ 9991.94, but in processing, we rarely combine _raw_ electron counts. If we propagate this to after flat-fielding, then applying the same inverse-variance weighting (using new rescaled variances according to 1/f**2 where f=responsivity correction) will give: = N1*N2*(f1 + f2) / [(N1*f2**2) + (N2*f1**2)], where N1=9700, N2=10302 are the same measured raw electron counts from above. So, for the pure Poisson bias to persist after flat-fielding (i.e., after we have calibrated out responsivity variations and are ready for co-addition), we need precisely f1 = f2 = 1. I.e., the 3-sigma fluctuations were purely Poisson in nature. No other values of f1 and f2 will preserve the Poisson bias, except however for heuristic cases. Instead, if we found responsivities of f1=0.97 and f2=1.0302 for the pixels with observed counts of N1=9700 and N2=10302 respectively, i.e., the 3 sigma difference in counts is really due to a responsivity difference between the pixels, then = 10,000 and we have no bias at all. But, I know this doesn't happen in practice. In conclusion, it makes more sense to examine the consequences of the pure Poisson bias after artificial responsivity differences have been accounted for. This is because these are the pixel signals that go into co-addition, and where the pseudo-population estimation is done. >So in this (I think worst typical) case, a 3% responsivity difference would wipe out the bias (we may have gotten that much variation in MIPS24 just from the scan-mirror smudges; do you recall the magnitude of that effect?). I don't remember the exact figures, but there were indeed responsivity variations >10% over the MIPS24 array. Regards, Frank ------------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Poisson Bias Date: January 31, 2008 1:42:33 PM PST To: carey@ipac.caltech.edu, fmasci@ipac.caltech.edu Sean, Now that I have had a chance to do a few real calculations, I need to amend a false impression that I think I gave you about the tendency of the Poisson skew to produce fluctuations that work against the inverse-variance weighting bias. This skew tendency does exist for large fluctuations but it is generally not sufficient to reduce the bias significantly for cases typical of astronomical applications. To clarify: let A and B be negative and positive fluctuations respectively with equal probability (or nearest equal; exactly equally likely positive and negative fluctuations generally don't exist for Poisson distributions). I think fairness demands considering (approximately) equally likely fluctuations. Then for example, given a population mean of 10,000, if A is a 3-sigma fluctuation of -300, then the (closest to) equally probable B fluctuation is +302, not 300. This is far too small an effect to offset the bias. So technically that skew effect does occur (and we are discussing technicalities, after all!), but I was wrong to suggest that it has any significant effect on reducing the bias. Incidentally, I consider positive and negative fluctuations because clearly if we have two positive or two negative fluctuations of any significant magnitude, our answer is going to be wrong by more than the bias effect. On the other hand, the bias in that same case (averaging measurements of 9700 and 10302, with perfectly equal responsivity and pure Poisson noise) is just 0.0806%, i.e., less than 0.1%. And this is for two measurements that differ from each other by 6 sigma, or if we consider the distribution of differences between random draws, a factor of sqrt(2) reduces this to about 4.2 sigma. As Frank pointed out, anyone who averages values with this significant a difference has no right to expect accuracy at the 0.1% level. For smaller means, like 100, which Frank tells me may actually enter the picture for WISE channels 1 and 2 (I hadn't realized this could be an important range), the bias for plus and minus 1-sigma fluctuations is 1%, which is starting to approach the threshold of significance, but I believe is still completely wiped out in the real world by responsivity variations and other ~Gaussian noise sources. After all, a Poisson sigma of 10 electrons is unlikely to render insignificant the read noise, flat fielding error (we'll be doing well to get that below 1%), etc. I do agree, however, that for such small means, we need to look more closely at all the effects we can expect to encounter and possibly derive proper Poisson estimation methods. That is, in a rigorous sense, inverse-variance weighting is peculiar to the Gaussian distribution and no other distributions with which I'm familiar, even though a remarkably simple theorem does exist that shows that inverse-variance weighting is appropriate for unbiased noise when a minimum variance estimate is desired and the estimator can be written as a linear combination of the two measurement values alone (with more measurements handled by recursive pairwise computation). Lest you think "what estimator could not be written as a linear combination of the two measurement values alone", I refer you to the very common case of uniformly distributed errors (e.g., cross-scan position errors for IRAS point sources), for which the linear combination must be augmented with another term that depends on the widths of the uniform distributions. But I digress... One other digression I'm compelled to mention: that skew effect mentioned above operates in the wings of the distribution; near the center, the skew causes the opposite effect, i.e., A will have a larger magnitude than B if both have relatively small magnitudes. This adds to the bias, rather than working against it, but for small-magnitude fluctuations, the bias itself becomes smaller, hence even less significant. And again, this is for the artificial case of exactly equal responsivities and pure Poisson noise. I computed that the 0.0806% bias for the 9700 & 10302 case above would be washed out by a properly placed 3% responsivity difference (e.g., 0.97 responsivity for the 9700 measurement vs. 1.0 for the 10302 measurement). For means of 100 electrons, of course, the somewhat larger responsivity variation of 10% would be needed to wash out the 1% bias for the plus or minus 1-sigma fluctuation case. --> fmasci proves this is not right. See fmasci email above. A final digression: to use multiple measurements to estimate the parent population variance to an accuracy sufficient to eliminate the Poisson bias would require a burdensomely large number of measurements. Say the variance must be estimated with a 1-sigma accuracy of 2%, which seems reasonable for all cases, noting that for the case of a mean equal to 10,000 electrons, a 6% difference in the variances led to the 0.0806% bias. This then requires knowing the population sigma to 1%. For the mean of 100, the 1% bias due to plus or minus one sigma fluctuations could be reduced to less than 0.1% (say 0.01%) by a knowledge of the true variance of 1% accuracy, so the 2% variance goal is in the ballpark here too. The sample variance is (asymptotically) chi-square with N-1 degrees of freedom (N being the number of measurements, not the number of electrons in a measurement), hence has a variance of 2*(N-1), a sigma of sqrt(2*(N-1)), and an expectation value of N-1, leading to a relative 1-sigma uncertainty as an estimator of the population variance of sqrt(2*(N-1))/(N-1). Estimating the population sigma to 1% accuracy therefore implies that sqrt(2*(N-1))/(N-1) = 0.01, which gives N = 20001. Since this is impossible, one may fall back to unweighted averaging with the sample variance divided by N-1 as the uncertainty. But then one has jettisoned the prior uncertainties and cannot perform chi-square testing to see whether the resulting average is statistically plausible (although I suppose one could use the prior uncertainties for chi-square testing but not averaging; seems a bit ad hoc, however). Furthermore, one has eliminate a tiny bias in an artificial scenario at the cost of losing near-optimal handling of the real-world responsivity variations and non-pure-Poisson noises. This is why I favor keeping the prior uncertainty estimates in the game. Regards, John ------------------------------------------------------------------ From: fmasci@ipac.caltech.edu Subject: Poisson investigations.. Date: February 1, 2008 12:02:03 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Hi John, Yesterday you used an example of a Poisson distributed quantity with expectation value 10,000 and said that the +/- 3-sigma fluctuations: 9700 and 10302 are approximately equally likely. In fact for 1-sigma fluctuations, I found that the values 9900 and 10101 are equally likely, with ~34.25% equal probability on the left and right of the mean. Note that the interval 9900 --> 10100 corresponds to a confidence interval of ~68.268% as expected since we're way into the Gaussian limit. But at the 3-sigma level, you cannot obtain an equal probability on the left and right of the mean with your construction. The probability between 9700 --> 10000 is 50.1346% and that between 10000 --> 10300 is 49.5953%. You cannot make the +3sigma value larger to equal the -3sigma probability interval. Obvious right? Instead, to obtain a symmetric probability interval about the mean for 3-sigma Poisson fluctuations, you can do it with 9753 --> 10300. This gives ~49.59% equal probability on either side of 10,000 (the mean). Regards, Frank Hi John, I just realized that you were using the value of the Poisson density function at the +/- 3-sigma limits, and NOT the integrated values (i.e., confidence intervals). So, there's no error on your part. Just my incorrect interpretation. I'm used to quoting things in terms of confidence intervals. Frank ------------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Re: Poisson vs. Inverse-Variance Date: February 1, 2008 11:03:10 AM PST To: fmasci@ipac.caltech.edu PS. One effect we forgot to mention is that the Poisson bias, which is larger for fainter sources, acts in the opposite direction to another well known bias, the tendency to overestimate flux for faint sources near the detection threshold. This was observed and studied in the IRAS project, and I think there's a section on it in the Explanatory Supplement. It results from the fact that near the detection threshold, positive noise fluctuations increase the probability of detecting the source, biasing the sample toward brighter estimates. As I recall, near the threshold (say S/N of 4 or 5 with a threshold of 3), this was a substantially larger bias (in magnitude) than the Poisson bias. I think there was actually an average correction for this that was applied to the catalog. I'll check the Explanatory Supplement at the next opportunity. I'm curious to see how the magnitudes of the two biases compare for low S/N sources. I'm also curious about the read noise in W1 and W2; do you have any estimates? I think the study you proposed should include the effects of realistic non-Poisson noise and responsivity variation, and probably include at least passing mention of the flux-overestimation bias for faint sources. Regards, John ------------------------------------------------------------------ From: fmasci@ipac.caltech.edu Subject: Re: Poisson vs. Inverse-Variance Date: February 1, 2008 12:47:36 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu But 100 electrons is just the background. On top of a source, it will be Poisson dominated. I'm not following; what I understand at the moment is that read noise is 19 electrons, and we worry about getting a 100-electron source right, whose Poisson noise is 10 electrons; are you saying that there's also a 100 electron background, so the noise on a 100-electron (above background) source would have a variance of 100 + 100 + 19^2 Yes! The counts I had in that table refer to the mean (expected) zodiacal background level. That's why I had "background" in the column name of that table. A typical source will contribute >~ 100 electrons on top of this background. So, background + source >~ 200 electrons. This means Poisson sigma >~ 14 electrons at the location of that source with SNR ~ 7. Sorry! I didn't read-on - you already did the calculation. If you recall, that's why I mentioned during my talk that W1 and W2 are read-noise limited and Poisson biases are additionally washed out because of this. Sean didn't buy that. This is a nice thing since on the lower part of ramp, the "Gaussian" read noise will dominate even more. I haven't done the formal calculation. electrons, total variance = 561, sigma = 23.7? If so, that's 200 Poisson electrons (sigma = 14 electrons) and 19 1-sigma read-noise electrons; the noise is still less than half Poisson, which seriously eats into any Poisson bias. So a measurement consists of a random draw from a superposed population consisting of a Poisson with a mean of 200 and sigma of 14 and a Gaussian with a mean of zero and a sigma of 19. So two measurements, each with 1-sigma Poisson fluctuations on opposite sides of the mean, yields 186 and 214 from the source+background, plus we get some random read noise for which we could do a Monte Carlo, or just say, let's take 1-sigma fluctuations of the total population, hence 200 plus or minus 24, or 176 and 224. Assume responsivity = 1; then we would assign Poisson variances of 176 and 224 to these measurements, and the read-noise variance of 361 to each, getting corresponding variances of 537 and 585; the inverse-variance-weighted average is then 198.97, a 0.51% bias; the reduced uncertainty is 16.7 electrons 1-sigma; the bias of one electron is therefore 6% of the sigma. Seems a small price to pay. Later on, if this source is detected, background subtraction is going to pump up the uncertainty further. Regards, John -------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Poisson investigations.. Date: February 1, 2008 3:21:44 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Hi John, I've computed some electron counts (per pixel) for the Poisson component given our target SNRs, the expected background and read noise as tabulated in a previous email. This is to facilitate comparison with the uncorrelated Gaussian read noise in the ramp. You may want to do your own independently when time permits. Note: I have ignored correlations in the Poisson component between ramp samples for now. Over 8.8 sec = full ramp frame exposure: ----------------------------------------------------- W1, SNR = 5: sigma(Poiss) ~ 14 e/pix W1, SNR = 20: sigma(Poiss) ~ 27 e/pix W2, SNR = 5: sigma(Poiss) ~ 18.5 e/pix W2, SNR = 20: sigma(Poiss) ~ 30 e/pix After 2.2 sec = second sample in the ramp: ---------------------------------------------------------- W1, SNR = 5: sigma(Poiss) ~ 11 e/pix W1, SNR = 20: sigma(Poiss) ~ 25 e/pix W2, SNR = 5: sigma(Poiss) ~ 13 e/pix W2, SNR = 20: sigma(Poiss) ~ 26 e/pix Conclusion: These Poisson contributions will certainly be diluted by the 19 e/pix read noise. We need to ensure our sample up the ramp correlated Poisson error contribution takes this into account, and I think it does already. Also, we can safely say that even close to the background (~70/211 electrons for W1/W2) and with 19 electrons read noise, the fluctuations will be close to Gaussian at all samples in the ramp. Regards, Frank ---------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Poisson investigations.. Date: February 1, 2008 10:26:24 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Actually there's a flaw in my example below. With N = 16, an interval of approximately 16 - 1.2*sigma to 16 + Infinity where sigma = 4 would give qual probability mass about the mean, NOT 16 - 2*sigma to 16 + 2.3*sigma! So, it turns out that as we get closer and closer to the Poisson regime, the probability mass at the high end (right of the mean) becomes smaller, and this means we need a smaller | sigma | at the low end (left of the mean) to balance the probability mass. So, this new density function is defined on the interval: mean - a*sigma to mean + b*sigma with a < b. If we then renormalize such that the total probability in this new interval is 1, the only way to preserve the original mean is if we use the median as the new estimator. Re-estimating a new mean will surely lead to a bias. So the burning question is: will this bias be significant? Frank Okay, now you've got me confused on what the real issue is and whether it's worth pursuing. Suppose we are in the Poisson regime with an expected count of 16 electrons. We decide (inadvertently) to retain only samples within +/- 2 sigma of the mean, i.e., values 8 to 24. Everything outside this range is declared an outlier and thrown away. Due to skewness, there will be more values between 8 and 16 than between 17 and 24, i.e., the confidence interval is not symmetric. When it comes time to compute an inverse weighted average [or an optimal value via p1.p2/int(p1.p2)], there will be a bias in our answer because of our ignorance of not properly balancing out the probability mass when we selected the lower and upper thresholds for outlier rejection. Is this the bias that worries you? On the other hand, if we were clever to pick our lower and upper thresholds such that the confidence interval were symmetric, i.e., maybe 2 sigma on the low side and 2.3 sigma on the high side, this estimation bias _may_ be reduced. On the practical (WISE) side, I believe that skewness can be safely ignored down to the lowest expected electron count level of ~100 given the dilution from read-noise. Although I would feel more comfortable if this can be shown formally without hand-waving arguments. Frank About the positive bias: do you feel it is significant? I didn't find it based on my ~equal-probability fluctuation approach, but that choice of comparison was subjective. Your suggestion to consider equal confidence intervals led to it, and it indicates that my feeling that skewed-positive fluctuations might be a good reason to respect the Poisson bias was not without some foundation. However, I didn't have the strength of conviction to maintain this argument after the ~equal-probability fluctuation argument failed to support it, so if sleeping on it results in the decision that it is valid, you get the credit and should be the one to inform Sean that I punted prematurely. There is one more calculation we can do that might guide us: compute an average bias, where the average is based on weighting the bias over equal increments in confidence interval limits outward from the mean. For example, we compute the bias for plus and minus fluctuations of say 0.01 confidence, 0.02, 0.03, etc., until he hit the limit we have now found on the high side. One thing about that seems unfair (in our favor) is that it leaves some probability mass on the low side unaccounted for. I suppose we could average in equal intervals of available probability mass or something. I haven't seen this problem solved, so it's probably worth doing. Regards, John ---------------------------------------------------------------------- jwf@ipac.caltech.edu Subject: Evaluating the Poisson Bias Date: February 2, 2008 9:54:18 AM PST To: fmasci@ipac.caltech.edu Frank, After a night's sleep, I think I see the right method for evaluating the Poisson bias. You are right, we have to work in terms of central tiles, despite the lack of symmetry about the mean (or mode). Central tiles provide the best representation of the most frequent random draws for a given confidence interval. So define the M% central tile as that set of samples of the random variable extending from (100% - M%)/2 to (100% + M%)/2. Thus the central 30%-tile would extend from location where the cumulative distribution is 35% to where it is 65%, for example. The cumulative distribution of a Poisson is C(k) = Gamma(k+1,mu)/k!, where mu is the mean, and Gamma is the incomplete gamma function. Gamma is in Numerical Recipes, and in fact I'm already using it to evaluate chi-square cumulatives, wherein it also play a role. So I suggest we use a set of central tiles, probably 10%, 20%, 30%, ..., 80%, 90%, 95%, 99% (11 tiles in all). For a given tile, we locate the first and last closest-sample values that define the range, k1 and k2. Since the exact percentage values will fall between integer sample values in general, we define k1a and k1b as the two samples that bracket the lower tile limit, and k2a and k2b bracket the upper limit. We examine the set of central tiles for the Poisson populations with means 10, 100, 1000, 10000, and 100000. For a given population, we compute the inverse-variance-weighted average over all integer samples in the range from k1a to k2b (taking the variance for each sample to be the sample value, not the population value), except in addition to the inverse-variance weights we also weight by the Poisson probability of that sample (thus accounting for the Poisson distribution within the tile), and for the outer endpoints, an additional weight accounting for the interpolation to the exact tile limit. For example, say we're doing the 50% central tile, so we work with the samples between cumulative values 25% and 75%. The 25% point lies between k1a and k1b. Say C(k1a) = 0.2490 and C(k1b) = 0.2505; then since 0.25 lies 2/3 of the way from k1a to k1b, its interpolation weight is 1/3. For each central tile, we get an average estimate for the mean of that population, hence an absolute and relative error. We tabulate the relative error for all 5 populations and 11 central tiles. Then we see what we have in the way of Poisson bias. This table addresses your points 1 and 2: I was going to create a table showing the following columns: 1. raw expected electron count. 2. bias from inadvertent use of inverse variance weighting assuming pure Poisson statistics. I use the word inadvertent because for low electron counts, one must resort to using the optimal method: p1*p2/integral(p1*p2) - see item 3. How does this sound? If you agree that this is a good way to proceed, then I'll volunteer to generate this table; then we'll move on to your other points. I don't think we should rush too much, since we each have other tasks, and we are pretty sure that Poisson bias is not going to be a problem ,and this exercise is primarily a proof of that assertion and a deep familiarizing with Poisson peculiarities. Regards, John ---------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Evaluating the Poisson Bias Date: February 4, 2008 9:09:02 PM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Hi John, First, could you please check that those figures show up (instead of icons): http://web.ipac.caltech.edu/staff/fmasci/home/wise/poisson.html The reason why I limited myself to 2 and 3-sigma fluctuations is because these are the typical thresholds people use for outlier rejection and I wanted to simulated the average bias if all measurements within these intervals were retained. I ran a case for 1-sigma fluctuations (both symmetric and asymmetric probability mass about the mean) and results are below. As expected, the magnitude of the bias is smaller than in the 2 and 3 sigma cases. ======== Prob. Asymmetric ======== mu weighted avg. % bias #sigma_lo #sigma_hi --------------------------------------------------------------- 10 9.511758 -4.882422 1.000 1.000 30 29.586844 -1.377188 1.000 1.000 50 49.551012 -0.897977 1.000 1.000 100 99.558521 -0.441479 1.000 1.000 300 299.587162 -0.137613 1.000 1.000 500 499.590681 -0.081864 1.000 1.000 1000 999.586464 -0.041354 1.000 1.000 3000 2999.588606 -0.013713 1.000 1.000 5000 4999.610343 -0.007793 1.000 1.000 10000 9999.580760 -0.004192 1.000 1.000 15000 14999.567292 -0.002885 1.000 1.000 20000 19999.564723 -0.002176 1.000 1.000 30000 29999.603916 -0.001320 1.000 1.000 ======== Prob. Symmetric ======== mu weighted avg. % bias #sigma_lo #sigma_hi --------------------------------------------------------------- 10 10.014701 0.147008 0.632 1.000 30 30.030114 0.100381 0.730 1.000 50 49.960102 -0.079796 0.849 1.000 100 100.358780 0.358780 0.800 1.000 300 300.355388 0.118463 0.866 1.000 500 500.341250 0.068250 0.894 1.000 1000 1000.335913 0.033591 0.917 1.000 3000 3000.314736 0.010491 0.949 1.000 5000 5000.309934 0.006199 0.962 1.000 10000 10000.280121 0.002801 0.980 1.000 15000 15000.315346 0.002102 0.980 1.000 20000 20000.313221 0.001566 0.983 1.000 30000 30000.309296 0.001031 0.987 1.000 Regarding your earlier question about using symmetric tiles. I'm not sure I understand what you want to do. My definition of a symmetric confidence interval relates to balancing probability mass about the population mean, i.e., within an interval mu - a*sigma to mu + b*sigma where a < b. So you're right, the original population mean becomes the median for the new sample defined on this interval. Here's another interesting observation: a simple (unweighted) arithmetic mean of the measurements for the asymmetric confidence interval case (i.e., where the interval is mu +/- n*sigma) leads to a less biased average in magnitude than inverse variance weighted averaging. When confidence intervals are made symmetric however, the inverse-variance weighted averaging leads to a less biased result than a straight arithmetic mean. This is obvious if you think about it in terms of how probability mass is distributed about a skewed distribution. I have not documented this and I'm not sure if I should. So, the end result is that if one blindly assumes unbalanced probability mass (i.e., on an interval mu +/- n*sigma) then it's relatively safe to perform a simple arithmetic average of the measurements instead of an inverse Poisson-variance weighted average. Regards, Frank On Feb 4, 2008, at 4:09 PM, John Fowler wrote: PS. Can your analysis be conveniently done for 1-sigma fluctuations? If so, it would be nice, because 1-sigma is more like what you would expect in most cases with only 2 or 3 measurements. Regards, John ---------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Closed-Form Solution Date: February 5, 2008 11:43:59 AM PST To: fmasci@ipac.caltech.edu Frank, I think I have a solution to the Poisson Bias problem in closed form. The idea is to compute the average value of the bias over the entire distribution by Poisson-averaging every possible value (that has significant Poisson probability). So we're averaging n wieghted by p(n)/n, i.e., the Poisson probability times the inverse-variance weight. The average is: = Sum_n(p(n)*n/n) / Sum_n(p(n)/n) = Sum_n(p(n)) / Sum_n(p(n)/n) = 1 / Sum_n(p(n)/n) The argument of the sum in the denominator is exp(-M)*M^n/(n*n!), where M is the parent-population mean. There is one interesting catch, however; n = 0 is legitimate for a Poisson random variable, and for any finite parent-population mean M , p(0) > 0; but n = 0 diverges in the denominator above, so it must be excluded based on the argument that measurements of 0 never occur (or are not considered measurements). If one considers 0 a legitimate measurement, then the Poisson bias becomes -100%, because the zero value is infinitely weighted, wiping out all finite-weighted measurements that are nonzero. So we exclude n = 0, and therefore the normalization is technically off, so the "1" in the last line above has to be replaced with 1-p(0). It takes Maple a long time to evaluate the sums for M >= 10000, and it's still running at the moment. For the first three M values, I'm getting close to what I got with the equal-probability ~3-sigma fluctuation approach: Mean Estimate Bias 10 8.85 -11.52% 100 98.99 -1.01% 1000 999.00 -0.10% This is what happens with no data editing; your results apply better if some kind of outlier rejection is in effect. I think I'll look at restricting the sums to sigma and/or interval ranges to see what happens. Since the trend is clearly toward lower relative bias for larger M, the conclusion remains that the bias is negligible in th real world. Regards, John ---------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: small discovery.. Date: February 5, 2008 11:39:40 AM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Hi John, I realize I should be working on my WISE tasks but this stuff is too much fun. I noticed that my results for the weighted average for the asymmetric confidence interval case in Table 1 of my write-up (i.e., symmetric in sigma units) are biased low by an average count of ~1. I found that if I take the standard inverse (Poisson) variance weighting formula: avg = N / sum[1/x_i] and replace it with avg = N / sum[1/(x_i + 1)], then the bias practically disappears on average. This is because the addition of 1 in the denominator makes the weights smaller and lifts the overall weighted average by ~1. So, maybe astronomers with perfect detectors (no read-noise and no instrumental signatures to correct) and who like to combine raw detector counts should be using this new formula. More food for thought. Regards, Frank ---------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: small discovery.. Date: February 5, 2008 1:01:34 PM PST To: fmasci@ipac.caltech.edu I found that if I take the standard inverse (Poisson) variance weighting formula: avg = N / sum[1/x_i] and replace it with avg = N / sum[1/(x_i + 1)], then the bias practically disappears on average. This is because the addition of 1 in the denominator makes the weights smaller and lifts the overall weighted average by ~1. Something sounds familiar about that, but I may be thinking of something else. I'll try to reconstruct it, but if you get a chance to research corrections to Poisson estimation approximations, see if you find anything about this already being known. If it's not, it should be published, but if it is, it shouldn't! Meanwhile, I have a program going that computes the p1*p2/integral(p1*p2) refinement for Poisson measurements. It does reduce the bias of interest a bit, but only a bit; the origin of the bias is mostly in the assignment of measured value to variance. For example, for parent mean = 10, if I take measurements of 7 and 13, the normal Gaussian bias is -9.1% (estimate = 9.1), whereas the rigorous formula gives an estimate of 9.28593, bias = -7.14%. Interestingly, the rigorous distribution has a sigma of 2.1844, which is a bit more than the Gaussian formula, sqrt(1/(1/7 + 1/13)) = 2.1331. The skewness is 0.2287 (still positively skewed but less than the parent population, 0.316, and the measurement distributions, 0.378 and 0.277) and excess kurtosis of 0.0524 (parent 0.1, measurements 0.143 and 0.077). So it has moved toward being Gaussian, which is not promised in general (e.g., two uniforms refine to a uniform, without any move toward Gaussian form). So the rigorous formula makes things better but doesn't cancel out the bias caused by taking the measured value to be the variance. I'm attaching a quick plot of the distributions; black is mean = 7, red is mean = 13, and blue is the refined distribution. I'll toss in a table file with values. Regards, John ---------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: Amazingly Simple Pattern in Poisson Bias Date: February 9, 2008 10:52:54 PM PST To: fmasci@ipac.caltech.edu Frank, My relatively precise computation of the bias averaged over the entire distribution for mean = 100,000 is now approaching 4 days of Maple CPU time, so I ran a slightly less precise variation on another computer. What I called "precise" is actually averaging over the distribution from -10 sigma to +10 sigma, which we now know are not equal confidence intervals on each side of the mean, but I figure the density is so tiny outside of 10 sigma that it shouldn't matter. To test this, I tried using 4 sigma for the smaller means, which execute fairly rapidly, to see how different the results were from the 10-sigma computations. So far I have the following results, which seem to indicate a consistently more positive bias by ~0.005% for the 4-sigma case for all means: Mean Estimate Bias Bias(+/-4sig) 10 8.848 -11.521% -11.517% 100 98.990 -1.010% -1.005% 1000 998.999 -0.100% -0.094% 10000 9999.00 -0.010% -0.004% 100000 100005.285 +0.005% The last line's "100005.285" is from the 4-sigma computation (values above it are from the 10-sigma computations), since the 10-sigma case is what is still running; however, there's every reason to expect that when the 10-sigma case finishes for mean = 100,000, the answer is going to be 99999 with a bias of 0.001%, i.e., if the 4-sigma computation is about 0.005% too high for the last line, as it is for the others (0.004$ to 0.006%), then we'll get close to zero, or I suspect the right answer is -0.001%. The upshot of this is that the essentially exact computation is giving a relative bias of very close to -1/mean. This may be related to what you found: computing the average Gaussian-refined value for a population mean of N yields a result of very close to N-1, hence a negative bias with a magnitude of about 1/N. Then I saw that this could be derived as a very good approximation for population means that are not too small, and the equations were easier to do in WordPerfect than force into email text, so the result is attached as a PDF. Regards, John The +/- 10 sigma calculation in Maple finally ended for the mean = 100,000 case! The results are amazingly accurate: average Gaussian estimate = 99999.00000, bias = -0.00100000%, where those trailing zeros are significant. Maple computed the numerator to 478596 decimal digits and the denominator to 435071 digits in slightly more than 4 days. So the table is: Mean Estimate Bias Bias(+/-4sig) 10 8.848 -11.521% -11.517% 100 98.990 -1.010% -1.005% 1000 998.999 -0.100% -0.094% 10000 9999.000 -0.010% -0.004% 100000 99999.000 -0.001% +0.005% So it looks like your approximation takes the form of adding 1 electron to the Gaussian average to compensate for the bias, if someone wants to compensate a bias that is lost in the noise. Regards, John ---------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Estimating Poisson Means Date: February 11, 2008 10:21:24 AM PST To: jwf@ipac.caltech.edu Cc: fmasci@ipac.caltech.edu Hi John, Also, if you did a similar exercise using a Gaussian, it will be equivalent to minimizing the chi-square ( actually -2*Log[likelihood] ), and the optimum value of the population mean from which a sample of N values x_i with variances v_i were drawn turns out to be the classic inverse variance formula! However, a similar exercise of maximizing the likelihood involving Poisson distributed quantities (like below) leads to a simple _unweighted_ arithmetic mean as an unbiased estimator of the population mean. The reason why there are no inverse variance weights is because there is no explicit variance in the Poisson density function. Therefore, one must resort to the p1*p2*../Sum[p1*p2*..] method for optimally combining measurements with the a priori assumption that each was drawn from a parent distribution with it's own mean and hence implicitly: variance = mean. In essence, this method can also be tackled as a maximum likelihood problem. This may provide a way to solve it analytically (i.e., achieve a closed form solution). I'll give it a crack when time permits. This begs the question: when one measures a collection of counts (e.g., electrons) and they are far into the Gaussian regime, which estimator is the most optimal (in terms of minimum variance) and most unbiased: (i) a simple arithmetic mean, or (ii) an inverse variance weighting scheme with variance = N_i + 1 ? Also, you mentioned below about mixing Poisson and Gaussian noise. This is what I suggested as a future project in my web write-up: Other contributions to the measurement variance such as detector read-noise will dilute these biases. Read-noise is an independent stochastic component and likely to be Gaussian in nature. A nice exercise would be to first convolve a Gaussian representative of read-noise with a Poisson distribution and then simulate the bias. Regards, Frank ---------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: Estimating Poisson Means Date: February 11, 2008 11:26:41 AM PST To: fmasci@ipac.caltech.edu What about if one is worried about a 1 electron in 100000000000 electron bias? I.e., From a theoretical perspective. Would you _add one_ to all measurements and the weight with inverse variance_i = measurement_i, or, just perform an unweighted average of your measurements? I would say these estimates would be close. ---- Taking you to mean pure Poisson noise: I think I would take the sample (unweighted) mean, since that is the maximum-likelihood estimate of the Poisson population mean given samples drawn from a single Poisson population. HOWEVER: "maximum likelihood" is not generally the same as "minimum variance" if the distribution is skewed. So it isn't clear that the sample mean is also the minimum-variance estimator of the population mean, given that the problem is defined such that you know that the population is Poisson (you just don't know the population mean), so you know that the population has a skewed distribution and that its mean is not the maximum-likelihood value of a sample drawn from the population. However, they say (in Wikipedia) that the sample mean is also an unbiased estimator of the population mean. Since the mean of any distribution minimizes the variance, it seems to imply that the sample mean is also a minimum-variance estimator. But something seems wrong about the same estimator being both minimum-variance and maximum-likelihood for a skewed population. More sleeping is needed, but meanwhile, taking the sample mean as the thing to use, what is the uncertainty? The uncertainty variance would be the usual unbiased estimate of population variance divided by N (here N = #samples, not electrons) , i.e., Var_pop/N, but since our unweighted sample mean, say , is the unbiased maximum-likelihood (and possibly also minimum-variance) estimate of the population mean, then its uncertainty variance should be /N. On the other hand, the sample variance multiplied by N/(N-1) is the unbiased estimate of the population variance, which is also the population mean, so isn't N*Var_sample/(N-1) also an unbiased estimator of the population mean and variance? So can't we just average these two estimates and get an even better estimate? Regards, John ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: Estimating Poisson Means Date: February 11, 2008 11:58:46 AM PST To: fmasci@ipac.caltech.edu PS. Re: On the other hand, the sample variance multiplied by N/(N-1) is the unbiased estimate of the population variance, which is also the population mean, so isn't N*Var_sample/(N-1) also an unbiased estimator of the population mean and variance? So can't we just average these two estimates and get an even better estimate? When I wrote that, it seemed like a joke, but I couldn't immediately see the fallacy. Now, however, I think it's legitimate. Specifically, the usual problem is: here are some samples, estimate the population mean and variance. One does the usual things, sample mean and sample variance*N/(N-1). But then you are told: oh by the way, I forgot to mention: the population mean and variance happen to be equal. That is additional information and should permit a better estimate of both. The thing is, for samples drawn from non-Gaussian populations, the sample mean and sample variance are correlated in general, so one cannot simply average them, certainly not with inverse-variance weights, without accounting for the correlation, and I suspect that numerical studies will show that the uncertainty in the sample variance is usually large compared to that of the sample mean, so unweighted averaging seems wrong (and for the same reason, weighted averaging means that the additional estimate based on the sample variance will not contribute a lot, i.e., you won't get a sqrt(2) reduction in the uncertainty sigma). On the other hand, correlation usually has small impact on the average and a noticeable effect on the uncertainty of the average. More sleeping and Monte-Carlo-ing... Regards, John ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: Estimating Poisson Means (and Variances!) Date: February 18, 2008 2:44:40 PM PST To: fmasci@ipac.caltech.edu Frank, I finally got around to checking Cramér regarding correlation of sample means and variances. Bottom line: you were right, being Gaussian is sufficient but not necessary for the sample means and variances to be uncorrelated. The necessary condition is: the skewness must be zero (so I was also right about Poissons and skewness being a problem). I knew Cramér would come in handy for things like this. Page 348, equation 27.4.4 shows that the covariance of the sample mean and sample variance for any parent population is (N-1)*mu3/(N^2), where N = no. of samples, and mu3 is the 3rd central moment (i.e., the skewness times sigma^3). So if and only if the distribution is symmetric (or N =1!), the sample mean and variance are uncorrelated. This was in the context of averaging the sample mean and sample variance to get an estimate of the Poisson population mean (hence variance). In our case, the Poisson should be sufficiently Gaussian to ignore the skewness, hence the covariance, and inverse-variance weighting of the sample mean (unweighted, the prescription when one suspects that the population is Poisson, asymptotically Gaussian or not) and the sample variance should be a better estimate than the (unweighted) sample mean alone. The next question is whether the sample variance has such a high uncertainty that it contributes essentially nothing to the estimate. I'll check that and get back to you. There is also the question of taking the covariance into account; I'll check that too. One other thought: although the prescription for Poisson populations is unweighted averaging to get the sample mean, it isn't immediately obvious that this beats the inverse-variance-weighted sample mean with the "+1 electron" correction; I suspect that the latter is better only if the measurements are so different that one has no business averaging them in the first place, but I'll check some numbers. Hope you have been having a nice holiday weekend. Regards, John ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Update on Chi-square Minimization Paper Date: February 22, 2008 11:31:00 PM PST To: fmasci@ipac.caltech.edu Frank, While working on the problem of estimating the Poisson population mean by averaging the sample mean and sample-variance-based estimate of the population variance, I realized that equation 24 in the paper I gave you on chi-square minimization needs to be adjusted by scaling each summation by 1/2. I mentioned in the next paragraph that if the correlation terms go to zero, equation 24 reduces to equation 5 except for factors of 2 which cancel out in equation 25, BUT canceling out there doesn't preserve the desired relationship between the coefficient matrix and the error covariance matrix for the fit parameters (i.e., the latter being the inverse of the former). It seems clear that the factors of 1/2 are needed so that the equation reduces properly to the equations for uncorrelated errors. So I made that change, and I also added a passing remark at the top of page 2 about approximating non-Gaussian errors as Gaussian when the error distribution is skewed. I kept it brief, but I'm thinking of adding a section covering the whole Poisson-skew problem that we have been studying in order to elaborate the topic. As you pointed out, I don't want to steer anyone into biased solutions without their realizing it; thanks for that suggestion! An updated PDF is attached. Meanwhile, unless I have been continuing to generate mistakes, I think I have the problem mentioned in the first sentence above worked out to the point where I can say that this method does have a significant advantage over simply using the sample mean to estimate the population mean for Poisson distributions. For example, say the population mean is Mp = 10 and you have 5 measurements: assuming the sample variance is more or less typical, then your uncertainty in the sample mean will be about 1.77, whereas if you average the sample mean and unbiased estimate of the population variance, the uncertainty drops to 1.24 (these are both 1-sigma uncertainties). Another interesting case is Mp = 10000, for which we know that the average Gaussian approximation (inverse-variance-weighted average) has a bias of -0.01%: say you have 100 measurements (which seems like a lot): the sample mean has an uncertainty of 10.10, but the combined estimate has an uncertainty of 7.14, noticeably better. But both an order of magnitude larger than the average Gaussian bias of -0.01% (i.e., -1). And that's with 100 measurements! Of course, AWAIC near the pole may have 10000 measurements, but that's unusual. Just looking at the numbers anyway, with 10000 measurements, the sample uncertainty is actually down to 1 (actually, 1.0001), and the combined uncertainty is 0.707. So at the pole, you actually could be sensitive to the Gaussian bias! I want to do some Monte Carlo runs to verify my equations. If I confirm them, I'll send along a PDF write-up of the algorithm. Have a good weekend! Regards, John From: jwf@ipac.caltech.edu Subject: Back to Earth Date: February 23, 2008 3:11:21 PM PST To: fmasci@ipac.caltech.edu Frank, My early optimistic expectations for some gain in averaging the sample mean and sample-based estimate of the population variance to estimate the population mean were contaminated by not having fixed the very error I reported to you yesterday, that factor of 2 in equation 24. I found it and fixed it in the documentation as I was writing up the algorithm, but forgot that it was still present in my quick-check code, where it amplified the gain of the algorithm. It turns out that there is a tiny bit of gain, but not anything like what I mentioned in that email. For example, for a population mean of 10 and 5 samples, the uncertainty in the sample mean is 1.5811388, and the new algorithm's uncertainty is 1.5759346, i.e., less but hardly any less. This is more like what I honestly expected at first, based on the expectation that the uncertainty in the sample-based estimate of the population variance would be very large compared to that of the sample mean. For the case above, compare the sample mean uncertainty of 1.5811388 to that of the sample-based estimate of the population variance, 7.0710678. Combining measurements with uncertainties of about 1.6 and 7.1 doesn't yield much over the better measurement alone. Incidentally, if the correlation were ignored, the gain would be overestimated, since the errors are positively correlated. A simple inverse of summed inverse variances yields a sigma of 1.543033 for the case above, which almost looks like a worthwhile gain over 1.5811388, but it's an illusion, since the covariance produces a result of 1.5759346. But the exercise was worthwhile! Regards, John ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Asymptotic Gaussian Approx Date: February 26, 2008 1:29:01 PM PST To: fmasci@ipac.caltech.edu Frank, Another interesting exercise! It looks like the approximation is rather good, being worst for small population means (as expected) but hardly depending on the number of samples (not as I expected), although possibly showing a bit of an extra weakness for very small samples sizes. The lowest mean I computed was 3, for which the Monte Carlo sigma of the sample variance ran about 10% higher than the asymptotic-Gaussian expression (i.e., about 20% in the variance of the sample variance). For a population mean of 10, the sigma underestimation is down to about 5%. I ran a solid-Gaussian case (population mean = 1000) as a check. Here are some numbers; let me know if you have any questions, would like other cases run, etc. Incidentally, the Monte Carlo involved one million trials for each combination of mean and number of samples. Regards, John PS. I ran the cases for means of 5 and 10 first, which is why there are more sample sizes for those; once I saw that the dependence on sample size was weak, I used fewer for 3 and 1000. MonteCarlo GaussAsympt PopMean Nsamp Sig(Vs) Sig(Vs) MC/GA ------- ----- ---------- ----------- ------ 1000 2 707.370 707.107 1.0004 3 665.264 666.667 0.9979 4 612.951 612.372 1.0009 5 566.155 565.685 1.0008 10 424.069 424.264 0.9995 15 352.997 352.767 1.0007 10 2 7.462 7.071 1.055 3 6.930 6.667 1.039 4 6.363 6.124 1.039 5 5.917 5.657 1.046 6 5.506 5.270 1.045 7 5.179 4.949 1.046 8 4.884 4.677 1.044 9 4.651 4.444 1.047 10 4.438 4.243 1.046 15 3.694 3.528 1.047 5 2 3.865 3.536 1.093 3 3.574 3.333 1.072 4 3.283 3.062 1.072 5 3.043 2.828 1.076 6 2.836 2.635 1.076 7 2.666 2.474 1.077 8 2.521 2.339 1.078 9 2.394 2.222 1.077 10 2.288 2.121 1.079 15 1.905 1.764 1.080 3 2 2.356 2.121 1.111 3 2.162 2.000 1.081 4 1.988 1.837 1.082 5 1.847 1.697 1.088 10 1.395 1.058 1.096 15 1.162 1.058 1.098 More interesting results. There does indeed seem to be a pattern: the asymptotic-Gaussian approximation (sample variance approximately chi-square) is worst for 2 samples, gets significantly better for 3, almost as good for 4, and then rises slowly to equal its worst value (the one it had at 2 samples). I looked at means of 1 and 2, and added cases of 500 samples to the previous to be sure things had stabilized. The updated table is attached. Unless some idea emerges for why this dependence on the number of samples exists, I think this completes the study of Poisson random variables, about which I think you and I now know more than anyone since Poisson (and probably including him). Regards, John ------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Frank away from office.. Date: April 29, 2008 12:02:48 PM PDT To: jwf@ipac.caltech.edu Yes, for ~50-150 pixel stacks, MADTUT is overkill in that it's best suited to "heavy" hi-tails. I would classify the input distributions to tempcal as "light" hi-tailed. I understand RMDR is easy for you to implement because you already have the code. But, I'm a little against it in that it's difficult to generalize/tune the input parameters for all cases, and it's hard to know how much to trim without going overboard. I simulated RMDR about a month ago. I like to satisfy some metric when doing these kinds of calculations (e.g., the underlying population is ~Gaussian with median ~ mean ~ mode). If you can show it consistently (and robustly) recovers a measure of location close to the mode regardless of how asymmetric the input distributions are, then I'm all for it. I'm sure RMDR will work, but I'm a skeptic (as you know), and I would prefer a more parameter-free mode/centrality estimator (e.g., as I discussed with you recently). The difficulty is computing the associated uncertainty. I'm sure we can come up with a formula for the uncertainty from a simulation. I have more thoughts on this thread, and I'd like to discuss them with you when I get back. As you said, we can add more methods later (in addition to RMDR). Regards, Frank Okay, so you're ready to deliver! Almost; there's a small matter of the output sky-offset image and its uncertainty, but probably no one will notice that they are missing... Incidentally, I am leaning toward the 2MASS asymmetric trimming method (which we called RMDR) unless you have a preference. I can see the possibility for a straight median to be biased by the prevalence of high outliers. In 2MASS, it was possible to set the trim factors so that you got anything from asymmetric trimming with no symmetric trimming to purely symmetric trimming to pure median, etc. This will be totally modularized, so that computation should be very easy to change. I need to review and recall MADTUT, but I think it might be overkill for a single 50-150 pixel stack, let me know your thoughts (you can wait until you get back from vacation, of course). Have a good time! Regards, John ------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Frank away from office.. Date: April 29, 2008 12:41:31 PM PDT To: jwf@ipac.caltech.edu I gather that you agree that some accounting for the asymmetry is in order. I don't recall other specific algorithms that we recently discussed. E.g., the mode estimator based on finding the tile where the quantile interval is smallest (hence density maximized). Also, the recursive MINMAX method where the data is sorted in ascending order, the absolute pairwise differences computed, and then the data points giving maximum difference values are progressively removed until the sample variance changes by a small amount, or, which is consistent with that expected for a Gaussian. The resulting distribution will be "clean" of outliers by the "right" amount from which a median can be computed. I'm sure I mentioned these to you. If not, I can go over them when I get back. These are new methods. I suspect that some dynamic adjustment of RMDR would be possible if needed, but just percentage limits on how much can be tossed shouldn't be that bad, and I think I can derive a decent uncertainty to attach (we just didn't do that for 2MASS because we couldn't afford to store the result anyway). There's only one person in the world I would trust when it comes to this stuff: you! As you said, it can wait until you get back. Meanwhile I'll just plug in a simple straight median for now, since it requires very little coding or user-specified support. Sounds good for now. As you probably already know, a nice robust estimate for the uncertainty in the median for a distribution with outliers can be computed from the sample MAD: sigma ~ 1.4826 * sqrt[(Pi/2)*median{ abs(p_i - median) } ] / sqrt(N), where the 1.4826 ensures asymptotic normality. Regards, Frank ------------------------------------------------------------------------- Following regards: ChiSqDataMinusMean.pdf From: jwf@ipac.caltech.edu Subject: Re: Chi-Square Degrees of Freedom Date: August 7, 2008 10:45:18 AM PDT To: fmasci@ipac.caltech.edu Frank, As I said: I'll feel better after sleeping on them (so to speak). Having now done so, I see that I sent you some misinformation which I want to repair as quickly as possible. Regarding my statement: the usual association between the expression Sum_i(x_i-xmean)2/sigma_i2, i=1,N, xmean = sample mean with a chi-square random variable with N-1 degrees of freedom is valid only when the x_i are uncorrelated. This much is true; but what followed is a red herring and in fact simply incorrect: the case when the x_i are correlated, in which case a sample of N correlated random variables does not implicitly contain N degrees of freedom; it is something less, and how much less cannot be known without knowing the correlation exactly. The number of degrees of freedom is not affected by the presence of correlation in the data. Given that an error covariance matrix is symmetric and has a determinant > 0, it can always be diagonalized, which means that there is always a coordinate transformation that will render the data errors uncorrelated, and it will do so without reducing the number of data points, hence without reducing the number of terms in the chi-square summation, hence without changing the number of degrees of freedom. And since chi-square is invariant under coordinate rotations, the same number of degrees of freedom applies in all rotated systems, including the original one wherein the data errors were correlated. But it is not necessary to rotate coordinates to achieve the same end, since the more general expression provides the same result in the original coordinates: U_i = x_i - xmean W_ij = inverse(Omega_ij), where Omega_ij = error covariance matrix chi-square = Sum_ij(U_i*Wij*U_j) This second expression can be used in all coordinate systems; it simply wastes time multiplying by zero and adding zero terms if the error covariance matrix is already diagonal, but it gives the correct answer. If the error covariance matrix has non-zero off-diagonal elements, then clearly those change the value of the summation, but not the number of degrees of freedom. Thus taking the non-zero correlations into account changes the value of the summation from what the first expression above would give; however, that expression is still some function of the random variables, but it is not distributed as chi-square with any number of degrees of freedom in general. It is true that correlations among the data represent additional constraints on the data that would not exist in the same set of data values without correlation, but I was wrong to go from that fact to the conclusion that the number of degrees of freedom is affected. I think all the statements in the PDF paper are correct, however. But now I am not quite sure what my motivation was for deriving the expression in that paper [note added in proof: thanks to your question, I do rediscover the motivation below]. I know that it had something to do with accounting for the fact that the data and the sample mean are correlated, and this must be taken into account if we want a figure of merit that approximates a chi-square random variable as accurately as possible, but I don't know why I didn't just say that we lose one degree of freedom and end it there. This came up in the context of the sfprex 4-parameter fit, averaging over many bar-separation solutions for the four parameters: X and Y offsets, scale factor, and rotation; these form a 4-vector. For any one bar, this 4-vector can be computed, but since all four parameters depend on the same 8 numbers (the X and Y coordinates of two WISE sources and two 2MASS sources), the uncertainties in the four parameters are significantly correlated; the 4x4 error covariance matrix is straightforward and is also computed for each 4-vector. So when I average the 4-vectors over all accepted bar matches, I use chi-square minimization including the full error covariance matrices. After averaging, I compute chi-square of each "measured" 4-vector relative to the average 4-vector to see if its an outlier. It was to compute this chi-square that I derived that expression. So the problem was: compute a chi-square for a single "measured" 4-vector based on the differences between its components and those of the average 4-vector; include the fact the the errors in all components of both vectors are correlated. How many degrees of freedom does this chi-square have? If the true 4-vector were available, then this chi-square would obviously have 4 degrees of freedom; but we don't have the true 4-vector, we have only an estimate of it derived from the data; so does that reduce the number of degrees of freedom in this single-4-vector chi-square? Somehow, reducing the number of degrees of freedom from 4 to 3 seemed to overshoot the mark. If an infinite number of 4-vectors had been averaged, the uncertainties of the averaged components would have become vanishingly small, and the single-4-vector chi-square would appear to have just about exactly 4 degrees of freedom. But the question of whether the use of the estimate of the true vector in the chi-square should reduce the number of degrees of freedom of that chi-square by 1 should not depend on how many samples were averaged to get that estimate. So I abandoned the idea that my 4-vector chi-square should be treated as chi-square with 3 degrees of freedom; but 4 was also clearly wrong, because the measurement variance in the denominator of each term corresponds to the measured value in the numerator before another uncertain number is subtracted off. So what I tried to do was to get the variance in the denominator to be exactly right for the squared difference in the numerator. To do that, I defined z (in the paper) to be the function of the two random variables, observed value and sample mean, and derived the variance of that random variable. As long as the variance in the denominator is correct for the expression in the numerator, each term should be the square of a zero-mean unit-variance random variable, and since all input errors were already approximated as Gaussian, this should be the square of a zero-mean unit-variance Gaussian random variable, and N such terms should produce N degrees of freedom. And this is consistent with what the sfprex processing summary shows. So I think I do need this expression, that there is no way to get the answer I want by reducing the number of degrees of freedom. I certainly do need to know the correct number of degrees of freedom, because I want to threshold each 4-vector against something understood in order to declare bad measurements to be outliers. At this point, I don't see any other way to do it. But I apologize for wrongly reconstructing my motivation as something involving the number of degrees of freedom depending on correlations in the data. As shown above that is false, so please erase it from your information repository. Regards, John ------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: A quick check.. Date: 8 August 2008 2:39:41 PM To: jwf@ipac.caltech.edu I'm not sure I understand the below. A perfectly "flat truth" will be observed as a flat image with pixel-scale fluctuations attributed to photon (Poisson) noise and detector read noise. There'll also be responsivity variations, but I'm assuming these are all 1 (i.e.,pixels equally responsive). It's only when you have a point source standing above the constant background that you get correlated photon noise, and these correlations occur exclusively within the PRF's domain. Is this line of thinking wrong? Regards, Frank I suspect that the puzzling result that you mention is rooted in (or at least influenced by) the fact that your input "observation" had no PRF correlation in it, unlike a real observation. This allows the adjacent "observed" pixels to disagree with each other more than would be possible if they were PRF-correlated. ------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: A quick check.. Date: 8 August 2008 3:24:58 PM To: jwf@ipac.caltech.edu I was about to say "Touché! You're right! My intuition was wrongly affected by lack of playing with flat truth images." But now my (possibly mis-)conception returns and suggests the following argument: Suppose the optical PSF is a delta function; then the light from the sky falling on a given pixel has some photon noise that is completely decoupled from the photon noise of neighboring pixels (as you are suggesting); now make the optical PSF broader than a delta function; this same photon noise now spills over into neighboring pixels, and the photon noise of neighboring pixels now spills into this pixel. That seems to correlate the photon noise pixel-to-pixel. Not sure I bye this immediately, although I need to sleep on it. A "flat truth" remains in its "prestine" flat noiseless state even after having gone through the telescope optics, and up to 1 Planck time before before hitting the detector. The net effect of the optical PSF on a flat image made up of an infinite number of delta functions is "zero". When this truth hits the detector, it is subject to the photo-electron counting process and Poisson fluctuations ensue. But, with a flat image, my claim is that the PSF does not correlate the noise in neighboring pixels. This is because the superposition of lots of overlapping PRF tails randomizes the photo-electron counts across pixels. Again, I could be entirely wrong here. Regards, Frank ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: A quick check.. Date: 8 August 2008 5:14:23 PM To: fmasci@ipac.caltech.edu Not sure I bye this immediately, although I need to sleep on it. A "flat truth" remains in its "prestine" flat noiseless state even after having gone through the telescope optics, and up to 1 Planck time before before hitting the detector. The net effect of the optical PSF on a flat image made up of an infinite number of delta functions is "zero". Right; my next email (after the one you replied to here) agrees with this. BUT: there's more to the concept, although it doesn't affect the bottom line, as I'll explain below. But, with a flat image, my claim is that the PSF does not correlate the noise in neighboring pixels. This is because the superposition of lots of overlapping PRF tails randomizes the photo-electron counts across pixels. Again, I could be entirely wrong here. Right again for a flat truth image, I say, except that technically, even with a flat image, there is correlation, but it cancels out. Consider the situation with every photon labeled: then with a delta-function PSF, each pixel gets its due photons, complete with photon noise. Say the labels are the pixel coordinates that the photon belongs to and goes to if the PSF is a delta function. Then with a non-delta-function PSF, some photons get to the wrong pixel, flat sky or not. But because the sky is flat, losses equal gains for every pixel, and you can ignore the correlations. This is like when you asked whether a flat image would have to converge via multiple iterations; you were intuitively aware that the PRF was spreading photons around, even with a flat image. My response was that the image would be converged except for low frequency noise fluctuations immediately. In that context, I figured that losses would equal gains, so it wouldn't matter. But then I apparently forgot that same effect when it came time to consider noise. So again it's like the bull not getting angry because you waved a red flag; it gets angry because it doesn't like being mistaken for a cow. And the image isn't converged because there's no PRF correlation; it's converged because the correlation cancels out with a flat image. So if one asks how can there be correlation but that correlation doesn't matter, I think it's because the correlation is actually preserving the uncorrelated random photon noise that originates on the flat sky. The "random" photon noise on the pixels is actually pseudorandom; it's deterministically preserved random noise that remains effectively random after being shuffled by a PSF and canceling out because the spatial distribution of the original random rates is flat. So I think the bottom line is: your uncorrelated noise image is a fair "observed" image because of the flatness, and the difference between the RMS noise in the flux image and the mean CFV value is due to the PRF smoothing in the former while the latter doesn't forget the original irreducible input noise. I think it's a matter of the flux image being subject to the PRF smoothing, whereas the CFV never forgets the irreducible noise, even though it's ability to track it is also smoothed by the PRF. In other words, the spatial frequency of the CFV is PRF-smoothed, but its amplitude is not. We'll be much smarter on Monday! Have a good weekend. Regards, John ---------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: HiRes examples.. Date: September 16, 2008 1:26:03 AM PDT To: fmasci@ipac.caltech.edu Frank: I wouldn't call the smoothest solution the best solution when you have prior information about how the sky should look like (e.g., point sources are actually point-like). I'm not arguing that the smoothest solution is the best; for one thing the word "best" demands a context. Most people seem to think that a less smooth image (i.e., one with more power in higher spatial frequencies) is better, because it looks like what they are accustomed to seeing. But I have seen some of these same people start to get nervous when it is explained to them that this "better" image requires inserting features that have no foundation in the data and observational model. The idea of "prior knowledge" is a basic ingredient in the controversial quality of Bayesian estimation. Gene Kopan once referred to "prior knowledge" as really "prior opinion". Since the "knowledge" is not contained in the data and observational model, its provenance is debatable in many cases. I agree however, that "knowledge" of the sky having point sources on extended background without troughs and crests is pretty firmly established, so I go along with using prior knowledge to replace the smoothest image with one that has no ringing around point sources. The most effective application of such prior knowledge that I'm aware of is my own processing of M31 with IRAS data. The MCM algorithm makes it easy to inject your prior knowledge to the fullest possible extent: you can start the iteration with an exact image of what you think you ought to see; the reason why I call this the "MCM algorithm" rather than simply "MCM" is that it is no longer literally MCM, i.e., the initial non-flat image jettisons the idea of maximally correlated pixels. What I do take exception to is the idea that the natural ringing is an artifact. An artifact results from a bad model, bad data, or a bad algorithm. An artifact is a failure to fit the data in a proper scientific manner. That is not the case with ringing in MCM (or most other methods); the result is accurate and scientifically useful. I think it can be improved somewhat by the sort of thing you are doing with background removal and subsequent replacement, although this procedure is very definitely an application of prior knowledge, namely your belief that the extended emission is physically distinct from the point sources and so the two can be separately processed and then recombined. But that belief is not unassailable; it does constitute "tinkering" with the information. I happen to believe it is valid and desirable, especially because it can resolve point sources that would otherwise have intersecting ringing patterns. As to the Gibbs phenomenon: I think the paper you referenced is rather good overall. Their discussion in section 2 is excellent; note however that the words "Gibbs" and "discontinuity" do not occur anywhere in this section. In fact, each word occurs only once in the entire paper, neither with any elaboration. The phrase "Gibbs oscillations" occurs in the closing section, where it receives no discussion at all; it is a passing remark that I take to be very loose and not at all useful. I regard it as sloppy, in fact. It is similar to remarks about "smoothing" an image by "convolving" it with a matched filter to improve point-source detection. That phrase has been used so many times that it has a life of its own and cannot be stamped out at this point, even though every one with whom I've had a rigorous discussion of this topic has agreed that in fact cross-covariance, not convolution, is involved, and the result is not a "smoothed image" of the sky, it is an image wherein each pixel is a measure of the extent to which the observation of the sky resembles a point source at that location. The fact that matched filters have almost always been symmetrical in astronomical applications results in the numerical equivalence between convolution and cross-covariance that has allowed this bit of sloppiness to take root. Loose reference to Gibbs oscillations appears to me to be similar. For one thing, when real Gibbs oscillations occur, they prevent an exact fit. Since they do not stem from noise, one can consider a noiseless case. If MCM is given a noiseless observation of a point source on top of extended background, along with an exactly correct PRF, the image it produces, with ringing, can exactly fit the the data. The Gibbs phenomenon prevents the data at a discontinuity from ever being fit exactly by the Fourier expansion, no matter how many terms are kept. For another thing, the Gibbs oscillations accumulate at the discontinuity; as the Fourier summation begins, they start out distributed over part of the domain, and as more and more terms are kept, the oscillations squeeze into the small space where the discontinuity exists. The ringing in MCM starts out at the same place it ends up after convergence. That's two ways in which the MCM ringing exhibits behavior that is drastically different from Gibbs oscillations. Of course there is also the fact that nowhere in MCM is there a Fourier expansion or transformation; it was formulated specifically to be independent of convolution, since convolution was basically just irrelevant to IRAS, and the Fourier methods enter only because of convolution. To argue that MCM "effectively deconvolves" the image because it yields a similar answer in the limiting case of isoplanatic PRFs is really only hand-waving. I can solve the equation x^2 - 2 = 0 by Newton-Raphson or successive substitution, and they both give the same answer, but they travel drastically different paths algorithmically, and their numerical stability is entirely different. Also, the nature of an image array containing finite pixel values precludes any possibility of a discontinuity in the observed data; one might argue that a dead pixel is a discontinuity, but we both know that the ringing isn't caused by dead pixels. So there's no discontinuity to cause a Gibbs oscillation in the first place. But there's a reason for the ringing and you haven't convinced me, nor the world that it's not related to the Gibbs phenomenon. I hope the remarks above make some progress in that direction. I suspect that in the extant sloppy terminology, some people who use that phrase are really thinking about the sort of oscillations observed in a truncated Fourier expansion relative to the function that was intended to be fit. That doesn't require any discontinuity, and typically the fitting error looks like a ringing pattern. But that is not the Gibbs phenomenon, that is just a bad fit, and as pointed out above, with noiseless data and the right PRF, MCM can produce an arbitrarily accurate fit and still have ringing. I will admit that I have not proved that the ringing is the smoothest possible solution. No one ever asked me to prove that, and it has seemed pretty obvious over the years, because it can be proved that it is a solution (it fits the data to within the noise), and no one has ever fit the data with a smoother one. In fact, people have spent their time trying to fit the data with less smooth solutions (e.g., fractal pixon), no one has attempted to construct a smoother one, to my knowledge. I think it is very easy to prove that all other solutions so far submitted are less smooth; one merely computes the power spectrum. I have done that in a few cases, and just as you would expect, the tighter point source shape with no ringing does come with more power in the higher frequencies. Incidentally, I also showed that ringing is not caused by the image boundary. Your result in which you did observe a very low-level ringing at the boundary has to be due to some numerical roundoff problem, I think. Because I did a case in which there was no boundary. It was a one-dimensional domain formed into a ring, i.e., a single pixel row wrapped around onto itself, like an image that went all the way around the equator. At the time I thought maybe ringing was caused by the presence of a boundary, and I was pretty sure the point source I put in there on top of background would not ring, but it did! I think what you really mean is that the ringing solution is the smoothest possible at all frequencies, it is not the smoothest solution at the ringing frequency. There are smoother ones (at the ringing frequency) that also classify as solutions. I have to admit that I'm not sure what this means. I'm guessing that by "at the ringing frequency" you mean "at the place in the image where the ringing is happening" and that with ringing suppressed, that location is essentially just DC level, which is in fact smoother than the ringing. If that interpretation is right, I guess I'd agree that if the area surrounding a point source is fairly flat instead of having troughs and crests, then that part of the image is indeed smoother. But I take "smoothest image" to mean the entire image, hence the power spectrum of the entire image has to be taken into account, and the image with ringing has so far always had lower power in the higher spatial frequencies than any other image that fit the data, hence it is the smoothest image until someone shows me an image that fits the data with even less power in the higher frequencies. Regards, John ------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Latent Identification Date: July 7, 2009 12:38:47 AM PDT To: fmasci@ipac.caltech.edu I think the suggestion for testing the number of times p_i < p_i-1 would work, but thresholding against a factor times sigma seems less direct than another possibility I'd like to run past you. The null hypothesis is that there is no latent, hence just noise fluctuations in the run of "bad" behavior. So I think that we can label the condition p_i < p_i-1 as "heads" and otherwise we have "tails". The null hypothesis is that we have a fair coin, so in a run of K "bad" samples, we have N = K-1 coin flips. The probability of M heads out of N flips is given by the binomial distribution; the probability of at least M heads out of N flips is Q=1-P, P = cumulative binomial distribution. Suppose -tlat specifies a lower limit on Q, say 0.05. Then during initialization, tempcal allocates an integer array Mmin(NFrames-1), where NFrames is the longest possible run of "bad" behavior, so NFrames-1 is the maximum number of coin flips. Then the array is populated with Mmin(N) = minimum number of heads to tag a latent. For example, for N=10, the possibilities are: M p(M) P(M) 0 0.000976562500 0.000976562500 1 0.009765625000 0.010742187500 2 0.043945312500 0.054687500000 3 0.117187500000 0.171875000000 4 0.205078125000 0.376953125000 5 0.246093750000 0.623046875000 6 0.205078125000 0.828125000000 7 0.117187500000 0.945312500000 8 0.043945312500 0.989257812500 9 0.009765625000 0.999023437500 10 0.000976562500 1.000000000000 To compute Mmin(10) we start at 1.0000 and subtract p(M), M = 10, 9, 8... until P(M) becomes <=0.95, then take one step back. In this case we would get Mmin(10) = 8. In a similar way, we populate the entire Mmin array. Then while counting consecutive "bad" values, K, we also the count the number of "heads", M; when the run ends, we test whether M>= Mmin(N); if so, it's a latent. For a fair coin, the binomial density becomes just (N take M)/2^N, which can be computed using logarithms in very little CPU time. This method applies a constant probability threshold to all sizes of "bad" pixel runs. How does this sound? Regards, John ------------------------------------------------------------------------------ From: jwf@ipac.caltech.edu Subject: Re: Latent Identification Date: July 8, 2009 11:06:06 AM PDT To: fmasci@ipac.caltech.edu Frank: how would you modify this test if you required that the M out of N samples had to be consequentive (i.e., a straight run of M heads with no gaps)? Here's the answer for M>N/2, which is our case of interest and vastly simpler than the general case, which could have two or more disjoint consecutive runs of M heads in N tosses. Since M>N/2, the only possibilities are those where: the run starts on the first toss, leaving N-M tails at the end of the sequence; the run starts on the second toss, leaving N-M-1 tails at the end and 1 at the beginning; ...; the run ends at the end of the sequence, with N-M tails at the beginning. For example, N=10, M=8; the only sequences are: HHHHHHHHTT THHHHHHHHT TTHHHHHHHH Three possible sequences of exactly 8 heads, N-M+1=10-8+1=3. Since there are 2^10 possible sequences, the probability (with a fair coin) of getting exactly 8 consecutive heads is 3/1024. Considering that 9 consecutive heads also count as 8, the number of ways to get M or more consecutive heads in N coin flips, still requiring M>N/2, turns out to be (N-M+2)(N-M+1)/2. This results from summing N-M+1 from M to N. For example, here's the binomial probability table I sent before for N=10, including the probabilities of consecutive heads on lines where M>N/2. M p(M) P(M) p_consec(M) p_consec(M+) 0 0.000976562500 0.000976562500 1 0.009765625000 0.010742187500 2 0.043945312500 0.054687500000 3 0.117187500000 0.171875000000 4 0.205078125000 0.376953125000 5 0.246093750000 0.623046875000 6 0.205078125000 0.828125000000 0.004882812500 0.014648437500 7 0.117187500000 0.945312500000 0.003906250000 0.009765625000 8 0.043945312500 0.989257812500 0.002929687500 0.005859375000 9 0.009765625000 0.999023437500 0.001953125000 0.002929687500 10 0.000976562500 1.000000000000 0.000976562500 0.000976562500 Note that p_consec(10)=p_consec(10+)=p(10). All the formulas give the same answer for "all heads", so I think they're right (and the p_consec(M+) expression was obtained from Maple; I wouldn't trust myself to get it right). Another example, this one for N=20: M p(M) P(M) p_consec(M) p_consec(M+) 0 0.000000953674 0.000000953674 1 0.000019073486 0.000020027161 2 0.000181198120 0.000201225281 3 0.001087188721 0.001288414001 4 0.004620552063 0.005908966064 5 0.014785766602 0.020694732666 6 0.036964416504 0.057659149170 7 0.073928833008 0.131587982178 8 0.120134353638 0.251722335815 9 0.160179138184 0.411901473999 10 0.176197052002 0.588098526001 11 0.160179138184 0.748277664185 0.000009536743 0.000052452087 12 0.120134353638 0.868412017822 0.000008583069 0.000042915344 13 0.073928833008 0.942340850830 0.000007629395 0.000034332275 14 0.036964416504 0.979305267334 0.000006675720 0.000026702881 15 0.014785766602 0.994091033936 0.000005722046 0.000020027161 16 0.004620552063 0.998711585999 0.000004768372 0.000014305115 17 0.001087188721 0.999798774719 0.000003814697 0.000009536743 18 0.000181198120 0.999979972839 0.000002861023 0.000005722046 19 0.000019073486 0.999999046326 0.000001907349 0.000002861023 20 0.000000953674 1.000000000000 0.000000953674 0.000000953674 So we see that requiring the heads to be consecutive drastically lowers the probabilities, as expected. For Q=0.05 and N=20, we need only 14 heads if they don't need to be consecutive, and otherwise we'd never diagnose a latent unless latents typically cause all heads. Regards, John ----------------------------------------------------------------------------- From: fmasci@ipac.caltech.edu Subject: Re: Latent Identification Date: July 8, 2009 11:59:39 AM PDT To: jwf@ipac.caltech.edu Frank: Would you say that the binomial distribution (or the process that generates it) is the most fundamental in statistics, that is, from which all other distributions can be derived given appropriate limits, or at least in a chain (e.g., binomial -> Poisson -> Gaussian)? But then a uniform distribution seems fundamental too. I wonder what asymptotic limit would get you a uniform (discrete) distribution from a binomial process, if such exists. Maybe the binomial is not fundamental after all. Regards, Frank Frank: Now suppose the events were correlated (e.g., using an unfair coin)... Just kidding. John: In case you haven't used the binomial distribution in a while, I'll mention that without the requirement of being consecutive, a biased coin presents no problem, that dependence is built in. The 2^N in the denominator goes away, since the sequences are no longer equally probable, and the probability of a given sequence is no longer 1/2^N; it is replaced with (p^M)*(1-p)^(N-M), where the lower-case p here is the probability of heads on a single flip. This is fresh in my mind because the binomial distribution gets considerable discussion in the book I'm writing. I think modifying the consecutive-heads probability would just involve the same change for a given value of M, but for M+, you couldn't just sum N-M+1 from M to N anymore, you'd have to include the (p^M)*(1-p)^(N-M) in each term summed. It might actually be interesting to do that, if you would like to guess a bias value when a latent is present, say p=0.95.... Regards, John ----------------------------------------------------------------------------- From: jwf@ipac.caltech.edu Subject: Re: Latent Identification Date: July 8, 2009 12:25:50 PM PDT To: fmasci@ipac.caltech.edu Frank: Would you say that the binomial distribution (or the process that generates it) is the most fundamental in statistics, that is, from which all other distributions can be derived given appropriate limits, or at least in a chain (e.g., binomial -> Poisson -> Gaussian)? John: Pretty close, but not all, since as you mention, the Uniform seems distinct. But if you'll accept the case of two flips of a fair coin as containing the essence of the Uniform distribution, then the binomial does underlie that one too. It's just that (like Poisson) the binomial distribution applies to discrete random variables, and technically, the uniform and Gaussian are for continuous random variables, but in fact both are used for discrete random variables all the time, with the understanding that when we refer to the Gaussian "probability" of (e.g.) 317 heads in 539 flips, we are really referring to the integral of the Gaussian density from 316.5 to 317.5. So the fair-coin two-flip case can be described as a Uniform distribution, but we have to implicitly divide the domain into two equal regions. And if the coin is not fair, we just divide the domain accordingly. For N flips of a fair coin, the binomial distribution is consistent with a Uniform distribution equally divided into 2^N segments. It's just that many of those segments correspond to sequences containing the same number of heads (& hence tails). What the binomial distribution brings in is "Pascal's Rule" of "N take M" to account for the multiplicity of sequences containing the same number of heads. I use this in my book to introduce "entropy". Using "1" for heads and "0" for tails, the sequences 0101010101 and 1010101010 both contain 5 heads, so they are both "micro-states" of the macro-state "5 heads". This macro-state has the highest entropy because it has the most micro-states, 252 out of 1024. The "all heads" and "no heads" macro-states have the minimum entropy, because each has only one micro-state. So maybe the uniform is more fundamental, since we use it to understand the binomial distribution for p=0.5. Certainly it is uniquely the maximum-ignorance distribution, since the only information it carries is the range of allowed values, and for something like a wheel of fortune, the range is 0-360 degrees, so the information content is zero. At least a Gaussian tells you something about what is more likely, and also that what you are dealing with is probably the sum of lots of little random things. Regards, John -----------------------------------------------------------------------------