kgnd & Cross-Validation – By Ryan ODonnell

PART I (West Antarctica)

IMPORTANT NOTE:  The code has been modified to extract station-by-station cross-validation results for TTLS.  If you are using the code, please download the new copy RO10 Code.txt and use that.  Data is here internal.cv.Rdata. [I have added a .doc extension to each link for wordpress compatibility,  just delete the .doc and the files are ready to use. – Jeff]

One of the issues brought up by Steig concerning our reconstructions concerns a parameter known as “kgnd”.  Steig’s argument is that this parameter was chosen improperly for our TTLS reconstructions in the SI (it is not a parameter for the main iRidge reconstructions), and, if a different value is chosen, our SI reconstructions match S09 better.  Specifically, Steig claims here:

The choice of kgnd that yields the best agreement with the iridge calculations (which, remember, is already known to create problems) happens to be kgnd = 7, and it just so happens that this yields the minimum trends. In fact, O’Donnell et al. show in a table in their Supplementary Material that the mean trend for West Antarctica for smaller values of kgnd is more than twice (~0.2 °C/decade) what it is for their ‘optimum’ estimate of kgnd = 7 (~0.07°C/decade).  {emphasis added}

Steig then goes on to claim that the closer concordance between the kgnd = 5 / 6 reconstructions and the S09 reconstruction means that we chose the wrong parameter value.  His justification is nothing more than the other values agree with his reconstruction better.  It is not based on any cross-validation results or other calculations (excepting, perhaps, the erroneous calculation of the Byrd station trend, which we evaluated here).  Steig – as per usual – doesn’t do calculations to evaluate his own claim.  He simply states things as true.  And – as per usual – these things turn out to be somewhat less than true.

Before we begin discussing why this claim is not accurate, let’s make sure we understand what “kgnd” is.  First, let’s talk about RegEM, which is where “kgnd” appears.  RegEM belongs to a class of algorithms called Expectation-Maximization algorithms.  In the case of RegEM, the “Reg” refers to “regularized”.  We will defer the discussion of regularization for now, as we first need to understand a bit about EM.

EM has been around for quite a while (since the 60’s or thereabouts) and was formalized by Dempster et al. in 1977.  The idea behind it is quite simple:  use the available information to take a best guess at the missing information.  Then, assuming that the previous best guess is better than having no information at all, use those results to make a better guess.  This process continues until each subsequent guess is nearly equal to the previous guess . . . in other words, until the guesses stop changing.  In the case of linear regressions, as long as the regression model is appropriate for the data and the proper correction is applied for the bias generated by the predicted variances, this general framework was mathematically proven to converge to the maximum likelihood estimate of the missing data.

Rather than use ordinary least squares for the regression function, RegEM utilizes regularization.  The reason for this is because many times – due to large gaps in coverage – the data matrix is ill-conditioned.  This means that there are too many unknowns and not enough overlaps in the data to provide a unique answer.  Such problems come up frequently in non-laboratory analyses like analyzing climate data.  Regularization provides a way to obtain a unique solution by adding some additional constraints.

The RegEM algorithm provides two types of regularization:  ridge regression and truncated total least squares (TTLS).  For this post, we will only talk about TTLS.  Ridge regression will be discussed in a follow-on post.

TTLS relies on a mathematical technique called the eigenvalue decomposition (and can alternatively be calculated using the singular value decomposition, or SVD).  It is easier to visualize using the SVD method, so that is where we will start.

The SVD is a way of restructuring a data set based on the amount of shared information between the variables.  Qualitatively, how it works is to find a spatial eigenvector and principal component (PC) pair that best represents the data.  It then finds the next best representation for the remainder . . . and so forth . . . until the data set has been completely decomposed into eigenvector-principal component pairs:

Each eigenvector (the colored map) contains a set of numbers.  Each principal component (the line plots) contains a time series of numbers.  If we multiply a PC by each number in its associated eigenvector, we obtain a set of estimates for all points and all times in the data.  We can then do this with the next eigenvector / PC pair . . . and so on . . . and when we add them all together after multiplying, we get back exactly the original data.

The reason this is useful as a regularization tool is that we may not have enough information to simultaneously solve for all of the original variables.  With the SVD, however, we can represent the data set by a smaller number of eigenvector/PC pairs.  To do this, we retain a certain number of eigenvector/PC pairs and discard the rest.  This reduces the number of variables we need to solve for, which allows a unique solution.  Hence the “truncated” part of TTLS – reducing the number of variables in the data set.

S09 use two truncation steps.  The first step – ksat – is to represent the entire satellite data set by 3 eigenvector/PC pairs (depicted above).  The second step – kgnd – is to mix the ground data and 3 satellite PCs in RegEM and truncate again.  The truncation choice is again 3.

So – in summary – S09 first reduce a 5,509 variable data set to 3 variables (ksat).  They then put these 3 variables (PCs) next to 42 ground stations in RegEM . . . where they then reduce the 45 variables again to 3 (kgnd), calculate regression coefficients, and estimate the missing data:

Unlike ordinary least squares, which uses the original data to find the regression coefficients, TTLS uses the truncated data to find the regression coefficients.  The reason for doing this is that the first few PCs have more “shared information” than the last PCs, which means the first few contain more common “signal” than the last ones.  In other words, we assume that the smaller PCs (in terms of eigenvalue magnitude) are mostly noise.  Noise is not useful for prediction.  To maximize the effectiveness of our prediction, then, we throw away the “noisy” paris and retain only kgnd number of eigenvector/PC pairs for determining the regression coefficients.

In the S09 implementation, the number of pairs retained is 3.  The criterion for retaining 3 comes from Mann et al. 2007, and it is a heuristic tool that has not been validated for general effectiveness.  Additionally, as we explain in our paper, the Antarctic data used by S09 violates some of the criteria established by Mann et al. 2007 for using the tool.

Another problem with the S09 approach is that RegEM has no idea where the stations are relative to the satellite eigenvectors (the pretty maps), as the maps are not provided to RegEM.  All RegEM can do is calculate the correlation coefficients between the rank-reduced (truncated) station data and PCs, and use that for prediction.  However, because some of the station data is short and because the satellite data is not a perfect match to the ground data, RegEM ends up calculating regression coefficients that are incompatible with the maps.  This results in the wrong stations being used to predict the principal components, with the necessary result that the reconstruction incorrectly locates the trends.

To reduce these problems, we decided that the missing station data had to be estimated independently of the satellite PCs.  So in our implementation, we first use RegEM to infill the ground stations.  Then, we use the maps – not the PCs – to generate the regression coefficients.  This ensures that the trend information is properly located geographically.  With these regression coefficients and the ground station data, we then calculate what the PCs should be:

These estimated PCs are then recombined with the maps to obtain the reconstruction.

Once we settled on this method to maintain the proper geographic location of the information, the question we then had to answer is:  what is the right value for kgnd?

Rather than use a thumbrule with unknown statistical properties (which may or may not work), we decided to find kgnd by finding the parameter that results in the most faithful representation of the original ground station data.  To do this, we first infill the ground station data at various settings of kgnd.  We then withhold one station at a time from the RLS reconstruction, perform the entire reconstruction, and then measure the error back to the withheld station.  We do this for every station in the ground set, and we also do this for the 24 stations we did not use as predictors.  The value of kgnd that best reproduces the withheld station data is selected for use.

The reason for withholding one station at a time is rather simple:  the more data we withhold, the less likely it is that the ideal truncation parameter for the withheld configuration is representative of the ideal truncation parameter for the full data set.  For the moment, we will take this as a given.  We will prove this to be true in a follow-on post (where we will also demonstrate that Steig’s proposed method from the reviews is a poor method of cross-validation and incorrectly estimates the ideal truncation parameter).

Taking the above as a given for the moment, let us examine cross-validation error by truncation parameter:

Note how the minimum cross-validation error occurs at kgnd = 7.  Also note the significant difference in the kgnd = 3 case, which was the parameter chosen by S09.

We can also look to see if Steig’s charge that kgnd = 7 may result in a better fit for the Peninsula stations but “overfits” in West Antarctica has any truth.  Specifically, Steig stated:

In both cases, the optimization is done on the basis of the entire data set; that is, the ‘best’ parameter depends on what works best on average both in data poor regions (e.g. West Antarctica) and data rich regions (e.g. East Antarctica and the Peninsula). The obvious risk here is that too high a truncation value will be used for West Antarctica. There is rather good evidence to be found in the Supplementary Material in O’Donnell that this is exactly what has happened. The choice of kgnd that yields the best agreement with the iridge calculations (which, remember, is already known to create problems) happens to be kgnd = 7, and it just so happens that this yields the minimum trends. In fact, O’Donnell et al. show in a table in their Supplementary Material that the mean trend for West Antarctica for smaller values of kgnd is more than twice (~0.2 °C/decade) what it is for their ‘optimum’ estimate of kgnd = 7 (~0.07°C/decade). Indeed, using any value lower than the one they choose to rely on largely erases any difference between their results and Steig et al., 2009. This simple fact — illustrated in the figure above (trends in °C/decade for 1957-2006) — has been notably absent in the commentaries that O’Donnell and coauthors have made about their paper.  {emphasis added}

There are some plausible concerns expressed by Steig.  After all, we do choose the truncation parameter based on the lowest RMS error for every withheld point, and it is possible that what is optimal for the Peninsula or East Antarctica happens to be suboptimal in West Antarctica.  However, one would expect that Steig might actually attempt to verify whether this is true prior to claiming it to be true in a public post.  If one expects such a thing, one would be wrong.  Again, Steig – as per usual – makes a series of claims without doing any calculations to justify them.  He instead continues in the same pattern demonstrated in his reviews:  make a claim via unsubstantiated arm-waving arguments with no supporting calculations or evidence, and leave us to sort through the mess.

Also as per usual, unlike Steig, we will actually do the calculations to see if his claim is true.  In order to determine this, we will extract only the verification statistics associated with the mainland West Antarctic stations:  Byrd, Doug, Elizabeth, Erin, Harry, Mount Siple, Russkaya, Siple, and Theresa.  If we plot the verification statistics associated with these stations by kgnd, we get:

So we can see that, not only is kgnd = 7 optimal for the continent as a whole, it turns out it is also optimal for West Antarctica – Steig’s suppositions notwithstanding.  This is why it is necessary to actually calculate results rather than imagine them.  The argument that our choice of kgnd results in a poorer representation of West Antarctica was plausible – but plausible is not the same thing as true.

At this point, we may be curious how the actual reconstructions stack up – i.e., our optimal TTLS reconstruction vs. S09 – instead of just the verification statistics.  After all, Steig has claimed that his reconstruction is a better representation of West Antarctica than ours.  Unlike his normal habit, he did actually do a calculation for this.  He calculated the Byrd trend . . . incorrectly . . . and then compared it to the S09 reconstruction and ours.  Take note of the fact that his criticism resolves to a single station among a total of 84 on-grid stations.  We will take a more comprehensive look, using the same set of stations (Byrd, Doug, Elizabeth, Erin, Harry, Mount Siple, Russkaya, Siple, and Theresa) as above.

As a bonus, we will also look at how well the reconstructions match the spliced Byrd station data.  No offset will be applied when we splice . . . we will simply splice them exactly as Steig did in his RC post.  As we noted here, this is not valid . . . but we will do it anyway to make a point.  Remember, Steig used the spliced Byrd station to try to imply that our reconstructions were “less accurate” than the S09 reconstruction:

The evident failure of O’Donnell et al. to correctly capture what is going on at Byrd (and presumably elsewhere in West Antarctica) is quite surprising, given that one of key differences in their methodology is to use the weather station data — not the satellite data as we did — as the verification target. That is, O’Donnell et al. use weather stations, withheld one at a time from the reconstruction for verification purposes to optimize their calibration. How then, can they be so far off for the location of the most important weather station?

We note, however, that the only statistic Steig examined was linear trend . . . not how well the reconstructions matched the actual data on an RMS basis (or using an alternate metric CE).  If we examine the RMS match (which also happens to produce the same result as using CE), a very different picture emerges:

Please note that the kgnd = 7 reconstruction fits the spliced Byrd station data better than the S09 reconstruction. How can this be?  The answer is quite simple.  The S09 reconstruction trend at Byrd station is a blend of Scott Base, McMurdo, and the Peninsula information.  It has almost no dependence on Byrd (see here).  It just so happens that this blend gives a linear trend that is almost equal to the [invalid] spliced Byrd trend.

In other words, the S09 trend match at Byrd is accidental.

Toward the end of his ODonnellgate post, Steig makes the following observation:

Sadly, attacking climate scientists by mis-quoting and mis-representing private correspondences or confidential materials appears now to be the primary modus operandi of climate change deniers. To those that still don’t get this — and who continue to believe that these people can be trusted to present their scientific results honestly, and who continue to speculate that their may be truth in the allegations made over the years against Mike Mann, Ben Santer, Phil Jones, Stephen Schneider, Andrew Weaver, Kevin Trenberth, Keith Briffa, Gavin Schmidt, Darrell Kaufmann, and many many others, just because they ‘read it on a blog somewhere’ — I’d be happy to share with you some of the more, err, ‘colorful’, emails I’ve gotten from O’Donnell and his coauthors.

This observation seems to imply, among other things, that we have misrepresented our own results and that the acute Reader would be better served by taking the word of a real climate scientist – such as Steig himself – over ours.  Well, let’s see how Steig’s post compares to reality.  Based on the above, one would expect that most of Steig’s criticisms of our paper would be true, right?
Steig’s Claims:

  1. Borehole measurements at WAIS are closer to S09 than O10. Probably true, but not easily verifiable (the scientists Steig quotes declined to confirm the 1 deg C number via email, saying they were concerned about use of the numbers prior to the results being in publishable form), and makes no statement on the methodological problems with S09.  Despite this, we’ll go ahead and score one for Steig.
  2. Borehole measurements at the Rutford Ice Stream (78oS, 83.1oW; grid cell 510), which yield warming since ~1930 of 0.17 +/- 0.07 deg C / decade match S09 better than O10.  Nope.  The S09 warming for this grid cell is 0.19 +/- 0.07.  Our warming for this grid cell is 0.18 +/- 0.07.  Both reconstructions match equally as well.
  3. The 50-year trend at Byrd is 0.38 deg C / decade.  Not true; furthermore, the infilled ground station trend at Byrd in S09 is only 0.13 deg C / decade (or closer to our result than their own).
  4. Since Steig et al. retained the satellite data, they didn’t need to worry about [Byrd AWS].  Nope.  The reason S09 doesn’t need to worry about Byrd AWS is that their reconstruction doesn’t care what the actual data at Byrd says.
  5. Mann 2007a, 2007b, and Smerdon & Kaplan 2007 says ridge underestimates trends compared to TTLS.  Nope, that’s not what they say.
  6. Kgnd = 5 or 6 gives a more accurate representation of the West Antarctic station data.  Nope (this post).
  7. S09 does a better job of representing the West Antarctic station data.  Nope (this post) and nope.
  8. O10 “overfit” in the satellite era and “underfit” in the pre-satellite era.  Nope – we do not ever (repeat: ever) fit the satellite data to the ground data for RLS.
  9. The reason for the lower West Antarctic trends in O10 is due to the use of iRidge.  (TBD)
  10. Early/late cross-validation is a better way to determine the truncation parameter.  (TBD)
  11. The kgnd issues were not addressed by O10, who “mysteriously” removed a table of offending verification statistics.  (TBD)

So of the 11 criticisms of our reconstructions, we’ve analyzed 8 of them.  In 7 cases, we find Steig’s criticisms to be provably without merit (and the other – borehole measurements – are not part of our paper to begin with).  We will look at the remaining ones in a subsequent post.
At the moment, Steig’s batting a paltry 0.125, with only 3 plate appearances remaining.

12 thoughts on “kgnd & Cross-Validation – By Ryan ODonnell

  1. Hehe. Of course, these are the TTLS recons – which are only in the SI. The main recon beats S09 significantly on all counts.

  2. More puzzling arm waving from ES at Nick’s blog:

    I would be very interested, since you’ve already got RO10s code running, to know what you get it you ‘optimize’ it — using their methods — for West Antarctica alone instead of all of Antarctica (which in my view is bound to mess things up in West Antarctica).

    I’m not sure how he can square a statement like this without conceding that S09 is in shambles.

  3. RyanO, thanks much for your continued efforts to tutor those of us who enjoy learning the nuances of these methods. I would guess that you have experience and/or training in teaching at some capacity. Expositions such as you present in this thread are what internet communications should be all about.

    What I need to do, and have not to date, is run methodically through your R code and learn from that experience. I have found that by applying methods from R at my own methodical pace, I begin to understand better what is being done and better still – why it is being done. Over the past several days I have been playing with ARFIMA in R and was surprised that some of the intuitive explanations I developed/guessed from those exercises were later confirmed by reading the literature.

    I look forward to future installments on your replies to Steig’s critiques of O(10) and while I like the presentations to concentrate on the methods I do have some curiosities about Steig’s approach to critiquing and method applications in S(09). The first looming question is why would Steig continue to conjecture without doing any calculations? Is he not aware that he is setting himself up for future embarrassment? Or does he simply not have the capacity – as in time and technical wherewithal – to do a proper analysis? I wonder (and do not expect an answer) whether S(09) was a team effort and the methods used were applied more or less as black boxes without sufficient understanding of how the boxes functioned? As a layperson in these matters I can certainly understand that not having a full comprehension of a method can lead to improper interpretations and particularly so with the more nuanced aspects.

    I also think that I have sufficient understanding of the analyses that the authors of O(10) did on S(09) and providing alternative methods to appreciate what they have accomplished. It was not simply a matter of recognizing a flawed method and setting it straight. It took a great deal of effort and understanding that I sometimes think those of us without the involvement and methods comprehensive do not fully appreciate. I say the Devil remains in the details.

  4. Ryan,
    Yes, there are posts on what TempLS is doing. Version 2, which is the extension to spatial models, is here. There is a more extensive explanation of the underlying LS model of V1 in this pdf, with an associated post here. There’s an overview of development and applications here, now a little out of date. And there’s the category index.

    I agree with the point about the difference of trends between ground and AVHRR. That’s why I varied the weighting ratio, which graduates from one extreme to the other. But even the ground stations give high values. It’s always possible that I have just made an error with the data or some such. The verification process should shake that out, if so.

    Just one query – you say that “RegEM has no idea where the stations are relative to the satellite eigenvectors”. But I thought the 4th number in the ground idx table was the satellite cell in which the ground station was located? I’ve been using it that way.

  5. Nick,

    RegEM locates information according to correlation only, so if you oversample the peninsula you get more opportunities for correlation of spurious info.

  6. Ryan,

    Once again, excellent post. I assume you will soon declare victory and move on. (Or are you waiting to see if the S(09) authors send a comment to JOC?).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s