Robust Verification Statistics
Posted by Jeff Condon on June 1, 2009
As you know, Dr. Steig put up a post as a rebuttal to RyanO’s recent post on a new method of RegEM, which basically fixed some problems we found with the original paper. His basic point is that Ryan may have used too many PCs in the reconstruction, something I don’t agree with. The conclusion of the Antarctic paper is verified through r, CE and RE values comparing different forms of the reconstruction.
One of my own main complaints is that the verification statistics are almost completely insensitive to trend. The magnitude of the temperature anomaly is much higher than trend. Here’s a plot of the south pole.
The temperature fluctuations are +/-6C in anomaly, while the trend is less than 0.1C/decade.
Values such as r, CE and RE have no way to determine that the trend is what we’re looking for. They simply wiggle-match the individual peaks. So when Ryan adds these corrections to the satellite data, it has no discernible effect on the comparison of reconstruction statistics.
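To make the point concrete, here is a minimal synthetic sketch in Python. It is not the Antarctic data and not Ryan's code: the noise levels, the 300-month record, the half-and-half calibration/verification split and all function names are assumptions chosen for illustration. It builds an "observed" series with a 0.1C/decade trend buried in noise, then scores two reconstructions that share the same wiggles but have opposite trends. RE and CE are computed the usual way, as 1 minus the squared error divided by the sum of squares about the calibration-period mean (RE) or the verification-period mean (CE).

```python
import math
import random


def pearson_r(obs, est):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    ss_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(ss_o * ss_e)


def skill_score(obs, est, ref_mean):
    """1 - SSE / (sum of squares of obs about ref_mean).

    RE uses the calibration-period mean as ref_mean,
    CE uses the verification-period mean.
    """
    sse = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_ref = sum((o - ref_mean) ** 2 for o in obs)
    return 1.0 - sse / ss_ref


def trend_per_decade(series):
    """Ordinary least-squares slope of a monthly series, in C/decade."""
    n = len(series)
    t_bar = (n - 1) / 2
    y_bar = sum(series) / n
    num = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return 120.0 * num / den


random.seed(0)
N_MONTHS = 300                 # assumed record length (25 years, monthly)
SLOPE = 0.1 / 120.0            # 0.1 C/decade expressed per month
t_mid = (N_MONTHS - 1) / 2

# "Observed" anomalies: small trend plus large month-to-month noise.
obs = [SLOPE * (t - t_mid) + random.gauss(0.0, 2.0) for t in range(N_MONTHS)]
# Reconstruction A: same trend as obs, with some reconstruction error.
recon_a = [o + random.gauss(0.0, 1.0) for o in obs]
# Reconstruction B: identical wiggles, but the trend has the opposite sign.
recon_b = [a - 2.0 * SLOPE * (t - t_mid) for t, a in enumerate(recon_a)]

half = N_MONTHS // 2           # first half calibration, second half verification
cal_mean = sum(obs[:half]) / half
ver_obs = obs[half:]
ver_mean = sum(ver_obs) / len(ver_obs)

stats = {}
for name, recon in (("A", recon_a), ("B", recon_b)):
    ver = recon[half:]
    stats[name] = {
        "r": pearson_r(ver_obs, ver),
        "RE": skill_score(ver_obs, ver, cal_mean),
        "CE": skill_score(ver_obs, ver, ver_mean),
        "trend": trend_per_decade(recon),
    }
    print(name, {k: round(v, 3) for k, v in stats[name].items()})
```

In this setup the two fitted trends differ by 0.2C/decade by construction, yet r, RE and CE should come out nearly the same for both reconstructions, because the squared error is dominated by the month-to-month noise.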
First, from Ryan’s comment on the previous thread, the Steig et al. verification statistics.

Figure 3
Then, Ryan’s high-PC reconstruction compared with corrected/calibrated satellite data.

Figure 4
Finally, the uncorrected satellite data.

Figure 5
No difference. The thing is, RegEM doesn’t see any difference either, as it iterates based on a truncated covariance (basically r) matrix of the data.
So when Dr. Steig makes this point:
*While I was working on this post, someone called “Ryan O” posted a long discussion claiming that he gets better verification skill than in our paper using 13 PCs. This is curious, since it contradicts my finding that using so many PCs substantially degrades reconstruction skill. It appears that what has been done is first to adjust the satellite data so that it better matches the ground data, and then to do the reconstruction calculations. This doesn’t make any sense: it amounts to pre-optimizing the validation data (which are supposed to be independent), violating the very point of ‘verification’ altogether. This is not to say that some adjustment of the satellite data is unwarranted, but it can’t be done this way if one wants to use the data for verification. (And the verification results one gets this way certainly cannot be compared against the verification results based on untuned satellite data.)
I was a little surprised, because I thought he would understand that short-term covariance is dominating the verification. Either way, this is not a dead issue. This simply points out that the applied statistics are insufficient for verifying the quality of the result.




Ryan O said
Thanks for putting these up, Jeff. One correction: Fig. 5 was done using uncorrected data for both the reconstruction and the verification. I also did it by using corrected data for the reconstruction and comparing it to uncorrected satellite data, and vice versa. The statistics are virtually identical for all of those combinations.
This explicitly speaks to the point that you’ve been making for a while: unlike what is claimed by Wahl and Ammann, the commonly used statistics are insensitive to underlying trends if the signal-to-noise ratio is sufficiently low – including RE. Therefore there is no justification for claiming skill simply by examining RE. It is essentially the same statistic as the correlation coefficient and CE, and it behaves no differently with respect to the wiggles. In my opinion, if you fail one of the three, you fail them all. Even if you pass all three, I would argue that they are not sufficient.
One thing I’d like to post about later is the benefit of using a running Wilcoxon test (or t-test, if the residuals are uncorrelated and Gaussian) for verification. This type of running means test penalizes missing on the low-frequency signal rather than the high-frequency wiggles. In fact, this is the test I initially used to convince myself whether there really was a statistically significant difference between the reconstruction and actual temperatures.
It’s also the same test I used to determine if there was a statistically significant difference between the satellite data and ground data.
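A rough sketch of the running-means idea, using the t-test variant on synthetic residuals. This is only an illustration of the general technique, not Ryan's implementation: the 60-month window, the critical value of about 2.0 (roughly the two-sided 5% level for 59 degrees of freedom) and the function names are assumptions, and it presumes uncorrelated Gaussian residuals, as the comment itself cautions. A Wilcoxon signed-rank test could be substituted for the t statistic in each window.

```python
import math
import random
import statistics


def running_t(residuals, window):
    """One-sample t statistic of the residual mean in each sliding window."""
    out = []
    for start in range(len(residuals) - window + 1):
        chunk = residuals[start:start + window]
        std_err = statistics.stdev(chunk) / math.sqrt(window)
        out.append(statistics.mean(chunk) / std_err)
    return out


def flagged_fraction(t_values, critical=2.0):
    """Share of windows whose mean residual differs from zero at ~5%."""
    return sum(abs(t) > critical for t in t_values) / len(t_values)


random.seed(1)
N_MONTHS, WINDOW = 300, 60     # assumed record length and window width

# Residuals = reconstruction minus observations (synthetic here).
noise = [random.gauss(0.0, 1.0) for _ in range(N_MONTHS)]
# A slow drift from -1 to +1 stands in for a missed low-frequency signal.
drift = [2.0 * t / (N_MONTHS - 1) - 1.0 for t in range(N_MONTHS)]

resid_wiggles_only = noise
resid_with_drift = [n + d for n, d in zip(noise, drift)]

t_wiggles = running_t(resid_wiggles_only, WINDOW)
t_drift = running_t(resid_with_drift, WINDOW)

print("windows flagged, wiggles only:", round(flagged_fraction(t_wiggles), 3))
print("windows flagged, with drift:  ", round(flagged_fraction(t_drift), 3))
```

Because each window averages out the month-to-month noise, a residual series that only misses the wiggles is rarely flagged, while one that drifts away from the observations is flagged in most windows far from the crossover point. Overlapping windows are not independent, so the flagged fraction is a diagnostic rather than a formal significance level.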
Fluffy Clouds (Tim L) said
Well I saw in the post that cloud cover was 80% ?
that means that the MOST sat. data could adjust would be 20% and 20% of .1C is
.1 x .2 ya .02 which is out side of precision numbers lol
he said he could have use pc that gave a .3C answer but did not! lol
Ya ok and the moon is made of cheese!
Layman Lurker said
Jeff (and Ryan), I have copied and re-posted a comment by Ryan O from CA. Perhaps this is worthy of a more prominent post here. I found it to be an eloquent description of how Ryan’s efforts have shored up a lot of the deficiencies in Steig’s (or perhaps Mann’s?) methods.
from Ryan O. via CA:
“One other thing that deserves mention about Steig’s use of RegEM.
.
Steig does a single RegEM step. He puts the incomplete PCs alongside the incomplete ground records and imputes the whole thing together.
.
To me, this does not make sense. The PCs are not temperatures; they are essentially coefficients for a map. What physical reason is there to allow this abstract quantity – which does not represent a temperature at a point – to affect the estimation of the ground temperatures? TTLS assumes errors in both Y and x and minimizes the combined error. However, when you impute everything together, an error in Y does not mean the same thing as an error in x even though they are scaled to unit variance. Y is a temperature. x is a coefficient that is later interpreted by the eigenvector. RegEM, however, can’t tell the difference – it assumes they are the same thing.
.
Additionally, the PCs are satellite derived, and in several posts across the blogosphere, multiple people have discovered properties of the AVHRR data that prevent it from being a drop-in replacement for ground temperatures. In practice, imputing everything together this way leads to obvious artifacting even at low regpars (just look at the Antarctic tiles). This means that some type of calibration needs to be performed between the two. Steig’s method of mushing everything together at once violates the integrity of a calibration because it allows information from outside the 1982-2006 calibration period to affect the fit during the calibration period.
.
To address these issues, the reconstructions I did use a different method than Steig. Rather than mash everything together at once, I first impute the ground stations without the PCs. This eliminates the concern about the difference in meaning between ground station errors and PC errors, because the PCs are simply excluded.
.
I then take the fully populated ground matrix, place the PCs next to it, and run RegEM again. Because the ground stations are fully populated, the integrity of the calibration is better preserved. The estimation of reconstruction period values can no longer affect calibration period values because all of the calibration period values are fixed.
.
The last difference is that I use the full solution for the PCs – i.e., the best fit between the PCs and the ground stations – in place of any original values. Otherwise, you have the situation where the post-1982 portion is entirely satellite derived and the pre-1982 portion is entirely ground station derived, which presents a visually obvious artifact at the splice.
.
This (combined with his assumption that # of PCs and the regpar setting must be held the same) is why Steig gets worse validation statistics as he increases the number of retained PCs. His method allows all kinds of strange artifacting – as the tiles show.
.
If, however, you separate the ground station imputation and the calibration steps, the artifacting largely disappears.
.
A related question, Steve, is whether Mann does a similar thing to Steig in his reconstructions. If Mann mashes everything together and does it all at once, that might explain the “64 flavors” that B&C noted. The mashing-method solution is much more unstable than the separated imputation/calibration solution.”
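The separation of steps Ryan describes in the quoted comment can be sketched in a few lines of Python with NumPy. This is a toy illustration of the structure only, on synthetic data. It does not use RegEM or TTLS: the infilling step is a plain iterated truncated-SVD fill, the calibration step is ordinary least squares, and every name and size here (`lowrank_infill`, `fit_pcs_to_ground`, 120 months, 6 stations, 2 PCs) is an assumption made for the example.

```python
import numpy as np


def lowrank_infill(data, rank, n_iter=100):
    """Fill NaNs by iterating a rank-truncated SVD fit (stand-in for RegEM)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank] + mu
        filled = np.where(missing, approx, data)  # observed values stay fixed
    return filled


def fit_pcs_to_ground(ground_full, pcs, cal_rows):
    """Fit each PC to the infilled stations over the calibration rows only,
    then return the fitted PCs for every row (replacing the original values)."""
    design = np.column_stack([np.ones(len(ground_full)), ground_full])
    coef, *_ = np.linalg.lstsq(design[cal_rows], pcs[cal_rows], rcond=None)
    return design @ coef


rng = np.random.default_rng(0)
n_time, n_stations, n_pcs, n_cal = 120, 6, 2, 60

# Synthetic data driven by two shared factors plus a little noise.
factors = rng.normal(size=(n_time, 2))
ground_true = (factors @ rng.normal(size=(2, n_stations))
               + 0.1 * rng.normal(size=(n_time, n_stations)))
pcs = (factors @ rng.normal(size=(2, n_pcs))
       + 0.1 * rng.normal(size=(n_time, n_pcs)))

# The PCs exist only in the calibration period (the last n_cal rows).
cal_rows = slice(n_time - n_cal, n_time)
pcs[: n_time - n_cal] = np.nan

# About 20% of the station values are missing at random.
ground_obs = ground_true.copy()
ground_obs[rng.random(ground_obs.shape) < 0.2] = np.nan

# Step 1: infill the ground stations on their own, without the PCs.
ground_full = lowrank_infill(ground_obs, rank=2)
# Step 2: calibrate the PCs against the now-complete ground matrix.
pcs_full = fit_pcs_to_ground(ground_full, pcs, cal_rows)

print("infilled ground matrix:", ground_full.shape)
print("fitted PCs for all rows:", pcs_full.shape)
```

The structural point is that the PCs never enter the station infilling, and that the calibration fit uses only calibration-period rows against values that are already fixed. Using the fitted PCs for the whole record, rather than splicing original values onto estimated ones, mirrors the last difference Ryan lists.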