by Ryan O’Donnell
kgnd & Cross-Validation: PART II (TTLS vs. Ridge Regression and Table S3)
In his O’Donnellgate post, Steig makes the following observation:
It’s perhaps also worth pointing out that the *main* criticism I had of O’Donnell’s paper was never addressed. If you’re interested in this detail, it has to do with the choice of the parameter ‘k_gnd’, which I wrote about in my last post. In my very first review, I pointed out that as shown in their Table S3, using k_gnd = 7, “results in estimates of the missing data in West Antarctica stations that is further from climatology (which would result, for example, from an artificial negative trend) than using lower values of k_gnd.”
Mysteriously, this table is now absent in the final paper (which I was not given a chance to review).
This is not complicated folks. O’Donnell and gang, not liking my criticisms of the way they used TTLS, and in particular the fact that the truncation parameter they wanted to use, suddently started using IRIDGE. This has the advantage of having a build in verification function, which means you can’t see what the verification statistics are, which means that it is much easier to NOT SHOW THE BAD VERIFICATION STATISTICS I was criticizing them for. Maybe that is not why they used iridge. I don’t know WHY they used IRIDGE but I did not suggest it to them nor endorse it.
[at J N-G’s only:] P.P.P.S So if anyone wants to speculate that hiding table S3 is O’Donnell lying again, go for it, since speculation is all most people seem to be doing these days.
A few of these observations are fairly easy to deal with. Steig claims that the disappearance of Table S3 (which we will discuss in a moment) was “mysterious”, states that its disappearance was unknown to him as a reviewer, and implies (and condones speculation) that this was done for a nefarious purpose.
The problem is . . . none of this is true. Steig did, in fact, see the revised Supporting Information with Table S3 removed. The removal was far from “mysterious”, as Steig himself acknowledges in his third review:
An unfortunate aspect to this new manuscript is that, being much shorter, it now provides less information on the details of the various tests that O’Donnell et al. have done. This is not the authors fault, but rather is a response to reviewers’ requests for a shorter supplementary section. The main thing is that the ‘iridge’ procedure is a bit of a black box, and yet this is now what is emphasized in the manuscript. That’s too bad because it is probably less useful as a ‘teaching’ manuscript than earlier versions. I would love to see O’Donnell et al. discuss in a bit more details (perhaps just a few sentences) how the iridget caclculations actually work, since this is not very well described in the original work of Schneider. This is just a suggestion to the authors, and I do not feel strongly that they should be held to it.
Apparently, Steig’s post-review change-of-heart includes forgetting that he saw the revised SI with the table removed, forgetting that he knew why it was removed, and forgetting that the only additional request he had with respect to the SI was that we add a few words on iRidge (which we declined to do for reasons stated at the end of this post here).
These issues are, of course, incidental to the primary point of concern: Did Steig’s criticisms based on Table S3 have any scientific validity? With the administrative details out of the way, let us examine this in some depth.
Table S3 can be found here. As noted in both the SI and the main text, it was never intended to represent valid cross-validation statistics for the reconstruction. Instead, it was the result of a screening test we used to limit the total number of combinations we had to test in order to determine the optimal parameters.
Regardless of how we intended to use the screening test, the results show one particularly interesting feature. At our optimal choice of kgnd = 7 (demonstrated in Part I), the Byrd AWS station shows a negative CE. A negative CE indicates that the simple mean of the withheld data matches those values better than the infilled values do – or, in other words, that the infilled values are a poor representation of the actual values. Steig noted this, and – during the review process as well as following publication – has used it to claim that our West Antarctic results are no good.
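For readers who want the definition spelled out, the CE statistic we have in mind is the usual coefficient of efficiency; a minimal sketch in Python (our notation, not the paper's code):

```python
import numpy as np

def ce(obs, infilled):
    """Coefficient of efficiency over a verification segment: 1 minus the
    squared error of the infilled values relative to the squared error of
    simply using the mean of the withheld observations.  CE < 0 means the
    plain mean beats the infilled values."""
    obs = np.asarray(obs, dtype=float)
    infilled = np.asarray(infilled, dtype=float)
    return 1.0 - np.sum((obs - infilled) ** 2) / np.sum((obs - obs.mean()) ** 2)
```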
The problem with Steig’s logic is that this test is a very poor indicator of the actual reconstruction skill for an individual station (and an almost equally poor way of finding the ideal truncation parameter). To see why, we must first understand how the screening test was conducted.
For the screening test, we co-opted a popular method for measuring reconstruction skill for the purpose of performing an approximate cross-validation test. This method is an early / late withholding test. How it works is that a certain number of stations are designated as verification targets and alternately have ½ of their data withheld:
The first step is to withhold half of the data for the verification targets (red in the left graphic), infill at various settings of kgnd, and compare the infilled values to the original, withheld values. We then repeat this with the other half of the data withheld (red in the right graphic). Lastly, we extract the worst results from the early and late tests and compare them across the various settings of kgnd.
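A minimal sketch of the test, in pseudo-working form (the `infill` call is a hypothetical stand-in for the TTLS infilling routine, not the actual script):

```python
import numpy as np

def early_late_worst_ce(data, verif_cols, infill, k):
    """Early/late withholding test: for each verification station, withhold
    first the early half and then the late half of its record, infill at
    truncation parameter k, and keep the worse of the two CE scores."""
    data = np.asarray(data, dtype=float)        # time x stations, complete
    n = data.shape[0]
    halves = (slice(0, n // 2), slice(n // 2, n))
    worst = {}
    for rows in halves:                          # early half, then late half
        masked = data.copy()
        masked[rows, verif_cols] = np.nan        # withhold this half
        filled = infill(masked, k)               # hypothetical infilling call
        for j in verif_cols:
            obs, pred = data[rows, j], filled[rows, j]
            score = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
            worst[j] = min(worst.get(j, np.inf), score)   # keep the worse CE
    return worst
```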
Some problems with this type of test should be immediately apparent. First, only a portion of the predictors can be analyzed. If we were to withhold the long record length stations, for example, the test would necessarily yield ideal truncation values that are too small. This is because the maximum allowable truncation parameter is equal to the minimum number of predictors available for any time step. If one attempts to use a larger parameter, the regression coefficients are undefined. This limits us to withholding the short record length stations – which already have the most sampling error due to the short overlap with the predictors (and, as a result, are likely to give more uncertain cross validation results, especially when one considers the ability to match the low-frequency response).
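The constraint on the truncation parameter is easy to make concrete (a one-line sketch; `available` is our name for the missingness mask, not anything in the paper's code):

```python
import numpy as np

def max_usable_truncation(available):
    """`available` is a boolean time x predictor matrix (True = value present).
    The regression coefficients are undefined if the truncation parameter
    exceeds the number of predictors available at any time step, so the
    ceiling is the minimum per-row count of available predictors."""
    return int(available.sum(axis=1).min())
```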
The second major issue is that withholding such large quantities of data affects the order and shape of the eigenvectors. We discuss this at length in our response to Review A here. This means that eigenvectors 1 – 7 (say) when performing the early/late test are different than eigenvectors 1 – 7 (say) when performing the actual reconstruction. Because the eigenvectors are different, there is no reason to suspect that the peak performance would occur at the same number of retained eigenvectors.
In other words, this type of cross-validation testing is simply not very good at determining the ideal truncation parameter, either for the full set or for an individual station. So . . . one might ask, why on earth did we do this?
The reason for performing this test was to get a gross idea of what the ideal truncation parameter might be. We had no a priori knowledge of what it might be, and testing every possible value of kgnd (1 through 11) with the full cross-validation procedure was computationally prohibitive. So this test was performed to determine the neighborhood in which we would expect the ideal kgnd to occur when we performed our more extensive and effective cross-validation tests.
To this point, however, all we have done is present plausible reasons why early/late cross-validation might be a poor indicator of the ideal truncation parameters. We have not demonstrated this to be true. It is now time to change that.
The test we will perform is simple. We will take known, complete instrumental temperature data, mask out values to duplicate the pattern of missingness in Antarctica, and then determine the actual ideal truncation parameter by comparing the infilling error to the masked values. We will then perform the early/late cross-validation test and compare how well that test identifies the actual ideal truncation parameter. In addition, we will test the method used by S09 for determining the truncation parameter (Mann et al. 2007). As a bonus, we will also compare the performance of TTLS to ridge regression. For speed, we will use multiple ridge regression (mRidge) instead of iRidge, but this choice does not affect any of the following results.
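In outline, for a single realization the procedure looks like this (a sketch only; `infill` is again a hypothetical stand-in for the TTLS infilling routine, and the mask encodes the Antarctic pattern of missingness):

```python
import numpy as np

def true_ideal_k(complete, antarctic_mask, infill, k_values):
    """Impose the Antarctic pattern of missingness on complete data, infill
    at each candidate k, and score each fit against the values we hid.  The
    k with the lowest RMS error on the withheld values is the 'true' ideal
    truncation parameter for this realization."""
    masked = np.where(antarctic_mask, complete, np.nan)   # True = value kept
    hidden = ~antarctic_mask
    rms = np.array([np.sqrt(np.mean((infill(masked, k)[hidden]
                                     - complete[hidden]) ** 2))
                    for k in k_values])
    return k_values[int(rms.argmin())], rms
```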
The data sets we will use are the AVHRR data corresponding to the manned station locations for one test and long record GHCN stations for another. In order to obtain a true Monte Carlo estimate for the performance of these methods, we will use the phase-randomization approach from Christiansen et al. 2008 to obtain random realizations of the source data. This approach involves taking the Fourier transform of the data, randomizing the phases, and then performing an inverse Fourier transform. This preserves the covariance, noise, and autocorrelation structure of the data while yielding random temporal realizations. The test scripts and data used are available here, as are the raw cross-validation statistics used to produce the plots from Part I. We will perform 50 replicates for each data set, which take approximately 2.5 days apiece of computational time.
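The phase-randomization step itself is simple; a minimal sketch, assuming (as one common multivariate variant does) that the same random phases are applied to every series so that the cross-covariance structure is preserved:

```python
import numpy as np

def phase_randomize(data, seed=None):
    """Phase-randomized surrogate of `data` (time x stations): Fourier
    transform, randomize the phases, inverse transform.  Using the same
    random phases for every station preserves the covariance, noise, and
    autocorrelation structure while giving a new temporal realization."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    spec = np.fft.rfft(data, axis=0)
    phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=spec.shape[0]))
    phases[0] = 1.0                      # leave the mean (DC bin) untouched
    if n % 2 == 0:
        phases[-1] = 1.0                 # Nyquist bin must remain real
    return np.fft.irfft(spec * phases[:, None], n=n, axis=0)
```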
To summarize, we will test:
- How effective is the Mann et al. 2007 rule (used by S09) at finding the correct truncation parameter?
- How effective is early/late cross validation at finding the correct truncation parameter?
- Do the individual station results from early/late cross validation mean anything?
- Is ridge regression or TTLS more accurate for infilling instrumental temperatures?
*** MANN et al. 2007 ***
The truncation rule espoused by Mann et al. 2007 – which was the rule used by S09 – performed particularly poorly. For the AVHRR set, the rule correctly identified the ideal truncation parameter only 7 times out of 50 attempts, with an average miss of 2.08, and missed the truncation parameter by 4 or more 11 times out of 50. The performance with the GHCN set was nearly identical: 8 correct identifications, an average miss of 2.08, and 11 misses by 4 or more.
To put this in perspective, simply generating random numbers from a uniform distribution between the minimum possible truncation parameter (1) and the maximum observed (8) achieved a correct identification 5.2 times out of 50, an average miss of 3.03, and a miss of 4 or greater 13.2 times out of 50 (averaged over 100 trials).
In other words, the Mann et al. 2007 rule is not much better than plucking random numbers out of the air.
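For reference, the random-number baseline in the comparison above can be reproduced along these lines (a sketch in our notation; the array of true ideal parameters comes from the Monte Carlo runs):

```python
import numpy as np

def random_baseline(true_k, k_max=8, n_trials=100, seed=0):
    """Score a 'method' that guesses the truncation parameter uniformly at
    random between 1 and k_max against the known ideal values (one entry
    per replicate), averaged over n_trials.  Returns correct hits, mean
    miss, and number of misses of 4 or more."""
    rng = np.random.default_rng(seed)
    true_k = np.asarray(true_k)
    hits = miss = big = 0.0
    for _ in range(n_trials):
        d = np.abs(rng.integers(1, k_max + 1, size=true_k.size) - true_k)
        hits += np.sum(d == 0)
        miss += d.mean()
        big += np.sum(d >= 4)
    return hits / n_trials, miss / n_trials, big / n_trials
```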
*** EARLY/LATE CROSS VALIDATION ***
Early/late cross validation performed better than Mann et al. 2007. (Of course, this is almost a given since it is difficult to conceive of a truncation rule more inept than Mann et al. 2007.) For the AVHRR set, this method correctly identified the ideal truncation parameter 25 times out of 50, with an average miss of 0.64, and only 1 miss of 4 or greater. For the GHCN set, the performance was marginally worse. The method correctly identified the ideal truncation parameter 18 times, with an average miss of 0.76 and no misses of 4 or greater.
At this point, it is clear that neither Mann et al. 2007 nor early/late cross validation is particularly effective. We can expect to get the truncation parameter correct only 15% of the time using Mann (compared to 10.5% of the time using random numbers), and only 43% of the time using early/late cross validation. Given the large spread in reconstruction results from changing the truncation parameter by only 1, this is not very encouraging.
Even more discouraging is how poorly the early/late cross validation testing captures the correct parameter for a given station. Remember that one of Steig’s criticisms is that in the early/late test, the Byrd value at kgnd = 7 was negative. Did that negative CE really mean anything? Let’s take a look.
Since we calculated not only the ideal parameter for the whole set but also the ideal parameter for each individual station, we can compare how accurately the early/late cross validation results capture the actual ideal parameter for each station. To do this, we simply extract the known ideal parameters from the full set and compare them to the estimated ideal parameters from the cross-validation testing on a station-by-station basis. If we plot those results, we get:
In case any of us were wondering, that’s not very good. The early/late cross validation test correctly identified the ideal truncation parameter for a given station only 751 times out of 3,500 chances – or a dismal 20.7% success rate. In fact, the true ideal truncation parameter differed from the cross validation estimate by 2 or more over 50% of the time.
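The comparison behind those numbers is straightforward (a sketch in our notation; the per-station parameter arrays come from the Monte Carlo output):

```python
import numpy as np

def station_agreement(ideal_k_by_station, estimated_k_by_station):
    """Compare the known ideal parameter for each station (from the full
    surrogate sets) with the early/late estimate, station by station."""
    d = np.abs(np.asarray(ideal_k_by_station) - np.asarray(estimated_k_by_station))
    return {"exact_hit_rate": float(np.mean(d == 0)),
            "off_by_2_or_more": float(np.mean(d >= 2))}
```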
In terms of CE, Steig’s case is even weaker. If one examines the CE value for the early/late cross validation test at the ideal parameter for each station, the early/late cross validation estimate of CE is negative over 36% of the time:
To illustrate just how bad the correlation is between the early/late cross validation CE estimate and the true CE at the ideal truncation parameter, we can look at the scatterplot of the values:
The early/late cross validation results tell us virtually nothing about how well the full data set will fit the missing data at the ideal truncation parameter. The idea that the early/late cross validation results – Table S3 – give any indication of full reconstruction performance for a given station is simply baseless. All early/late cross validation can do is give us an indication of what the full set ideal truncation parameter might be. It gives us a neighborhood for the full set – nothing more. It tells us next to nothing about individual stations.
This is why we ignored Steig’s criticisms of kgnd. His criticisms, quite simply, have no basis in fact.
*** RIDGE REGRESSION ***
The only remaining criticism we have yet to prove baseless is the claim that ridge regression is known to cause problems. Of all of the criticisms Steig leveled in that post, this one is the easiest to prove false.
Along with performing a full-set ideal truncation parameter determination and an early-late cross validation test, we also infilled each of the random realizations using ridge regression and compared the results to the best possible TTLS results. How did ridge stack up? Let’s look:
Note that ridge regression beat TTLS EVERY SINGLE TIME. Using ridge regression instead of TTLS yielded, on average, an 11.9% reduction in RMS error. In terms of capturing the actual linear trend of the data (measured in units of standard error of the actual trend), ridge demonstrated a 15.9% improvement over TTLS.
The charge that ridge regression is somehow less accurate than TTLS when infilling instrumental temperatures is completely without merit.
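For readers who want to see what the two estimators actually do, here is a bare-bones sketch of the core regression step in each (a toy illustration in our own notation; the reconstructions themselves embed these regressions inside the RegEM infilling iteration, which this sketch does not reproduce):

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Ridge regression coefficients: (X'X + lam*I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ttls_coef(X, y, k):
    """Truncated total least squares: SVD of the augmented matrix [X y],
    retain the first k directions, and form the solution from the
    discarded right singular vectors."""
    p = X.shape[1]
    V = np.linalg.svd(np.column_stack([X, y]), full_matrices=False)[2].T
    V12, V22 = V[:p, k:], V[p:, k:]      # predictor / response rows of the
    return (-V12 @ np.linalg.pinv(V22)).ravel()   # discarded subspace

# toy comparison on synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 8))
beta = rng.standard_normal(8)
y = X @ beta + 0.5 * rng.standard_normal(300)
print(ridge_coef(X, y, 1.0))
print(ttls_coef(X, y, 6))
```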
Given that ridge regression outperformed TTLS at the ideal truncation parameter every single time, one might wonder what would happen if – instead of using the near-random Mann et al. 2007 rule or the marginally better (but still poor) early/late cross validation tool – we simply picked the truncation parameter based on how well the TTLS solution matched the ridge regression solution. In other words, if we choose the TTLS truncation parameter whose solution best matches the ridge regression solution, how would we do?
The answer is . . . we would do remarkably well.
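In outline, the selection rule is just this (a sketch; both infilling functions are hypothetical stand-ins for the actual routines):

```python
import numpy as np

def k_from_ridge_match(masked, infill_ttls, infill_ridge, k_values):
    """Pick the TTLS truncation parameter whose infilled values sit closest
    (in RMS) to the ridge regression infill of the same masked data."""
    ridge_fill = infill_ridge(masked)
    missing = np.isnan(masked)
    rms = [np.sqrt(np.mean((infill_ttls(masked, k)[missing]
                            - ridge_fill[missing]) ** 2))
           for k in k_values]
    return k_values[int(np.argmin(rms))]
```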
Had we done this, we would have chosen the correct truncation parameter for the AVHRR data 34 times, with an average miss of 0.46 and no cases of missing by 4 or more. For the GHCN data, the performance is even better: we would have chosen the correct truncation parameter 43 times, with an average miss of 0.25 and no cases of missing by 4 or more. We can plot these results to give an idea of how much more effective this method is:
Even this understates how much better a comparison with ridge regression is at picking the right truncation parameter, because when the ridge regression comparison misses, it misses by a much smaller amount than early/late cross validation or Mann et al. 2007. Quantifying this is simple: we sum the difference in RMS error between the truncation parameter chosen by a particular method and the true ideal truncation parameter, and divide by the number of times that method chose the wrong truncation parameter.
Average increase in RMS error per miss:
- Ridge regression: 19 misses, RMS error per miss of 0.011
- Early / late cross validation: 57 misses, RMS error per miss of 0.095
- Mann et al. 2007: 85 misses, RMS error per miss of 2.887 (0.655 with the removal of one large outlier)
In other words, when early/late cross validation chooses the wrong parameter, the resulting increase in RMS error is, on average, 9 times that incurred by using the ridge regression comparison and choosing incorrectly. In fact, just 2 misses using early/late cross validation produce the same additional error as the sum of all 19 misses using ridge regression. And choosing the parameter based on Mann et al. 2007 incurs, on average, 65 times the additional error if the benefit of the doubt is given for one outlier, and 288 times the additional error if a strict average is used.
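For concreteness, the per-miss penalty quoted above is computed along these lines (a sketch; the array names are ours):

```python
import numpy as np

def penalty_per_miss(rms_at_chosen_k, rms_at_ideal_k, missed):
    """Summed extra RMS error over the replicates where a method picked a
    non-ideal truncation parameter, divided by the number of such misses."""
    extra = np.asarray(rms_at_chosen_k) - np.asarray(rms_at_ideal_k)
    missed = np.asarray(missed, dtype=bool)
    return float(extra[missed].sum() / missed.sum())
```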
At this point, we may be wondering . . . at what value of kgnd does the ridge solution best match the TTLS solution for our reconstruction? Well . . .
If we use correlation coefficient, then we get kgnd = 7.
If we use RMS error, then we get kgnd = 7.
If we use CE, then we get kgnd = 7.
If we use RE, then we get kgnd = 7.
If we use our cross validation method from our first submission, then we get kgnd = 7.
If we use our published cross validation method, then we get kgnd = 7.
*** FINAL SCORECARD ***
We had 3 unanswered questions from the previous post:
- The lower West Antarctic trends in O10 are due to the use of iRidge. Nope, as the TTLS trends at the ideal parameter are even lower.
- Early/late cross-validation is a better way to determine the truncation parameter. Definitely not.
- The kgnd issues were not addressed by O10, who “mysteriously” removed a table of offending verification statistics. The removal of the table was neither mysterious nor unexpected, and the kgnd issues were thoroughly addressed in our responses.
With respect to Steig’s 11 criticisms of our paper in his post, only one of them has any potential merit, and that one relates to unpublished, presently unverifiable, results of uncertain accuracy at one location, which are also at variance with his reconstruction.
His final batting average: 0.091.
*** PARTING THOUGHTS ***
Regardless of whether Steig honestly attempted to improve our manuscript during the review, and regardless of whether he honestly believed his criticisms to be true, the act of simply claiming things without performing even the simplest check to determine if the claim could possibly be true cannot be condoned. Just as every author has the responsibility to verify results and calculations to the best of his or her ability, every reviewer should have the responsibility to verify claims prior to making them. This verification is even more critical when communicating things as fact to the public.
Regardless of how “gentle” Steig’s critique was, it was nothing more than a collection of false statements posing as fact.
If climate scientists wish to be taken seriously by those who have the time and ability to independently verify their results, they ought to be careful about claiming things to be true that are easily proven to be false.