## Steig et al. a Successful Reproduction

Posted by Jeff Id on May 1, 2009

Ryan O has done a brilliant job hunting down and now recreating the verification statistics in the Steig et al Antarctic reconstruction. This article details not only the verification statistics, it clearly lays out the limitations of the verification methods used.

Absolutely top notch.

Also, my thanks to Ryan for recognizing Dr. Steig’s cooperation.

======================================================

REPLICATING STEIG III: REGEM VS. PCA

I had mentioned in the Replicating I thread that after I finished with both the satellite portion and the ground portion, I intended to email Steig and ask about any discrepancies. Yesterday, I emailed him at about 2:45 pm with a description of what I found (including the relevant images). By 3:45 pm he had emailed me back with the explanation. After I read the explanation, the reason doing the verification statistics as described in the SI became abundantly clear.

While my actual calculations had been correct, my methodology was missing a step. Simply comparing the reconstruction to the AVHRR data (my method) does not evaluate the performance of RegEM. It only evaluates the performance of the 3 PCs.

In order to evaluate the performance of both the PCs and RegEM, Steig performed two additional reconstructions. For one, he withheld the 1982 – 1994.5 AVHRR data and performed the reconstruction using only the 1994.5 – 2006 data. For the second, he withheld the 1994.5 – 2006 data and performed the reconstruction using only the 1982 – 1994.5 data. The period where actual AVHRR data was used was the calibration period; the period where the AVHRR data was withheld was the verification period.

In other words, he deletes half of the AVHRR data, performs the reconstruction, and then compares the result to the deleted AVHRR data. Because this is a harder test to pass, some areas do actually end up with negative RE and CE values.

Within about 10 minutes of Steig’s reply, I was able to produce the following:

Fig. 1 -Replicated split calibration/verification experiments.

Compare to:

Fig. 2 -Original (Steig) split calibration/verification experiments.

Except for color scale issues, a perfect visual match.

(I should note that I’ve contacted Steig several times on various aspects of the paper and his replies have been immediate, cordial, and very helpful. This occasion was no different – he even offered up his verification data set. I have had no negative interactions with him at all – in fact, all of my interactions with Steig have been very positive.)

This is very satisfying, for a number of reasons:

1. We now have a script that we can use to benchmark future reconstructions against the Steig reconstruction.

2. This is confirmation that calculation of the TIR vs. ground data verification statistics was done accurately.

3. The R implementation (by Steve McIntyre) of the RegEM TTLS algorithm produces statistically identical results as Tapio Schneider’s MatLab version.

4. We now have two additional reconstructions which can be used to help quantify what features are due to PCA and what features are due to RegEM/temporally unstable covariance.

5. This one is a little more subtle, but we now potentially have a way to independently confirm that there are real offsets between the different satellites that produce false trends.

With that in mind, let’s take a look at some of the differences between the three reconstructions (main TIR, early calibration, and late calibration). The plots below show the continent-wide monthly averages, with the AVHRR data actually used highlighted in red. The 1957 – 2006 linear trend is shown in blue, with 95% confidence intervals.

Figs. 3 – 5 Top: Main TIR reconstruction. Middle: Early calibration reconstruction. Bottom: Late calibration reconstruction.

While all of them look similar, take note of the middle figure, which is the early calibration reconstruction. The 1957-2006 linear trend is only 0.052 degrees C per decade – significantly lower than the main reconstruction (0.118) or the late calibration reconstruction (0.113). This is very interesting.

Previously, at CA, I had investigated whether there were unaccounted for offsets between the satellites, as they are not calibrated to each other via overlaps. The results of that initial investigation are here:

http://www.climateaudit.org/?p=5678

What I found was a significant positive offset for NOAA-14 (~ 1995 – 2000). How does this relate to the above plots?

Remember that RegEM computes a covariance matrix by minimizing the residuals of the existing data points in order to predict missing the missing ones. Theoretically, if the true covariance between the ground data and the PCs is temporally stable, then RegEM should be able to accurately predict withheld AVHRR data. If one of the satellites has a noticeable offset compared to the others, then when that satellite’s data is withheld, the offset will show as a difference between the imputed and actual values. This will require further investigation.

I would also like to present two additional plots, which are the difference between the main reconstruction and the calibration reconstructions. I have not yet decided what I think these mean. There is a strange “warpage” of the difference plot for the late reconstruction/main reconstruction differences, while the early reconstruction/main reconstruction differences look more “normal”. I do not know if this is another way a satellite offset might manifest itself, and even if this is the case, I do not know if the warpage indicates there is a problem with the late data, early data, or both.

Fig. 6 – 7 Top: Early calibration minus main reconstruction. Bottom: Late calibration minus main reconstruction.

This oddity manifests itself in trend plots as well. This first panel is of the time period 1982 – 1994.5. Please note the difference in color scales between the AVHRR data and the reconstructions.

During this time period, the lower left plot (the early reconstruction) is entirely formed from the 3 PCs from just the 1982-1994.5 data. The lower middle plot is the main reconstruction, which is entirely formed from the 3 PCs using all of the AVHRR data. The lower right plot, on the other hand, is the output of RegEM using PCs generated from the late AVHRR data (1994.5 – 2006). The dichotomy cannot be due solely to RegEM, as we will see in other panels that RegEM does a reasonable job.

Fig. 8 Comparison of linear trends from 1982 – 1994.5.

The next panel compares late data trends. Note that the job RegEM does (lower left) when the late data is withheld is fairly accurate. This leads me to believe that the problem above is more than just RegEM.

Fig. 9 Comparison of linear trends from 1994.5 – 2006.

The next panel shows trends for the entire period of 1982 – 2006. Note that the early calibration reconstruction appears to do the most faithful job of replicating the AVHRR data. The main reconstruction shows warming concentrated in West Antarctica. The late calibration reconstruction does not match well at all.

I would also note anecdotally that most of the negative RE/CE values came from the late calibration reconstruction, confirming the visual comparison.

Fig. 10 Comparison of linear trends from 1982 – 2006.

The final panel shows a comparison of linear trends for the entire reconstruction period of 1957 – 2006.

Fig. 11 Comparison of linear trends from 1957 – 2006.

While these are by no means rock-solid conclusions, my first impressions after having gone through this exercise are:

1. RegEM is definitely sensitive to changes in covariance and input data. The imputed values for the PCs when data is withheld differ significantly. This would seem to be both a blessing and a curse. If sufficient original data exists, testing with RegEM to determine instrumental offsets/errors looks promising. On the other hand, seemingly insignificant changes to the input data may produce drastically different results.

2. The split calibration/verification experiments show that the covariance between the AVHRR data and the ground stations is not temporally stable. There is a significant change from the 1982-1994.5 timeframe to the 1994.5 – 2006 timeframe. This may be because there are unaccounted-for offsets, or the actual covariance really does change with time, or both. Based on the experiments, if the covariance changes, RegEM will “smear” trends spatially and temporally (Fig. 8 has the best example of this).

3. The choice of 3 PCs does not appear to properly resolve trends spatially. The 25 km gridcells in the Steig reconstruction give a false impression of spatial accuracy by making the plots appear more detailed than they really are. When directly compared to the AVHRR data, it does not appear if the reconstruction has a resolution better than roughly 500 km (calculated via the supremely accurate “eyeball” method).

4. The r2, RE, and CE statistics do not appear sufficiently sensitive to give any confidence in the ability to determine trends on the order of 0.1 degrees C per decade in this case. The trends between the AVHRR data and the reconstruction can differ by almost 1.0 degrees C per decade (over a 25-year regression) and have a sign change (such as the South Pole) without driving any of the three verification statistics negative. In the early and late period trends, vast regions of Antarctica have the incorrect sign for the trend when compared to the AVHRR data. In the whole-satellite period trends (1982 – 2006), the reconstruction warming is overall too high and inappropriately concentrated in West Antarctica when compared to the AVHRR data.

5. Unanswered question 1: If removing half the AVHRR data (12 years worth) results in a reconstruction with a trend less than one half of the original reconstruction, what does that say about the ability of the main reconstruction to accurately predict 25 years of missing data?

6. Unanswered question 2: Is the covariance between the ground stations and the AVHRR data temporally unstable, or is the apparent instability due to instrumental offsets?

Below are the functions that I used to perform the early/late/all AVHRR data reconstructions:

## cogito said

I’m by far no expert in statistics, so sorry for asking this question: Does this confirm or falsify Steig’s findings? What is the final conclusion: is the larger part of Antarctica warming or cooling? Thanks.

## Jeff Id said

It replicates the verification statistics and shows that the verification statistics weren’t adequate for verification. – How’s that for an answer 😉

Ryan would be better to answer, it doesn’t falsify to the extent of chuck the paper and go home but it does show the verification didn’t confirm the quality of the reconstruction.

As far as antarctica warming or cooling, I think the best result is the simple one.

https://noconsensus.wordpress.com/2009/04/27/maximum-triviality-reconstruction/

In general there is a cooling trend over the last 40 years, this contradicts the Steig et al result and I believe it is correct.

## Kenneth Fritsch said

Ryan O, good stuff that I need to digest — and remember.

Ryan O, can you compare the two reconstructions that you obtained using the late and early calibrations to determine if a statistically significant difference exists between the two versions.

I thought that the Steig et al paper was rather clear on how they did their calibration and verification statistics.

Are you going to test your cordial relations with Dr. Steig by pointing out to him the differences you found between using the late and early calibration? Based on the previous analyses suggesting possible satellite differences and the question of covariance stability over time I think you are owed an explanation or an attempt at one.

Could you confirm with Dr. Steig, the evidence that has been posted here indicating that the TIR reconstruction post 1982 is instrumental based?

Would you consider helping the current administration with talks with N Korea? Seriously, I think a working relationship with the primary Steig paper author is an ideal situation for getting answers to the questions posed from all the analyses performed.

## TCO said

Simply comparing the reconstruction to the AVHRR data (my method) does not evaluate the performance of RegEM. It only evaluates the performance of the [b]3 PCs[/b].

Hey Mike-Foxtrot: WHO HAS BEEN SOUNDING THIS CONCERN! Come to TCO papa, you shallow diver! All good subs are named after fish. Give me the last ’37 headed North. (I might be a poseur.)

## TCO said

Helloooooooooo!

## Ryan O said

Hi, TCO! Just so you know, I actually sort of have a life, which means I’m not bound to the computer all day long.

In response to your statement, I think you have a misunderstanding on how RegEM works. It does essentially the same thing as the PCA portion of the reconstruction. I tried teleconnecting this information to you, but it was apparently not a valid teleconnection, so I’ve decided to type it instead.

The first thing RegEM does is approximate the actual data by performing a singular value decomposition. It then throws away everything but the first regpar (or k) eigenvectors and tries to compute a covariance matrix that minimizes the residuals.

So in the case of Steig et al., they represent the actual data with 3 PCs and they represent the missing data with – guess what – 3 PCs. So if PCA smears trends, so will RegEM. The only reason it appears to do a better job than the PCs in this case is because the entering argument into RegEM

is already decomposed into 3 PCs.Playing with the number of PCs is easy – simply use more or less. Takes about 5 minutes in R. And you’re right – using only 3 PCs produces a poor representation of Antarctica. But to say that the PCA is the major problem with the paper is overlooking the fact that it is RegEM that infills back to 1957 and

we don’t knowhow accurate it is.I know you’re concerned about publication, but remember that if we don’t have a good handle on RegEM, we don’t have anything worth publishing because we cannot say

howwrong the representation is back to 1957.Right now, RegEM is a black box. That is why it is important to focus on RegEM. We need to understand what it is that RegEM does and how it does it, because it is RegEM – not PCA – that fills in the data back to 1957. The only purpose of the PCA portion is to produce a smaller representation of Antarctica to make the problem computable.

## Kenneth Fritsch said

Ryan O, in my mind there is much meat in your early and late calibration reconstructions, but if you were to publish I think the critical part would be to show how statistically significant the differences are between those two reconstructions. I note that the early calibration reconstruction has CIs showing no significant trend.

As I am sure you are aware temperature reconstruction papers put great stock in the showing of the statistical significance of their r, CE and RE statistics for calibration and verification. You may have demonstrated here with the two reconstruction sensitivity test a reason to put less stock in those statistics. I could never see how some of the low but highly significant r, CE and RE statistics found in temperature reconstructions could ever have much explanatory power.

I think I understand the essentials of RegEM and PCA processes, but what I have been very frustrated about is understanding how each were applied in the TIR reconstruction in the pre- and post-1982 periods. I suspect I have not been asking the proper questions. Anyway my primary concern here is whether the 1957-2006 reconstruction uses RegEM over the entire period for infilling data using the AVHRR results related to the 42 surface stations during the calibration period post 1982.

But I need to take my question a step or two further by stating that in my view a valid reconstruction would calibrate the RegEM algorithm (in the 1982-2006 period) and then remove all the AVHRR data for the 1982-2006 period and infill

allthe data from 1957-2006 by using RegEM algorithm from the 42 station relationships from the calibration period. Infilling only missing AVHRR results in the 1982-2006 period would be akin to having that part of the reconstruction based primarily on instrumental results.I also suspect that RegEM algorithms were initially used for infilling small percentages of missing data while(if properly performed in the Steig reconstruction) in Steig almost all of the total data is infilled. Do we have published examples of the extensive infilling that Steig et al. uses?

## Jeff Id said

“and then remove all the AVHRR data for the 1982-2006 period and infill all the data from 1957-2006 by using RegEM algorithm from the 42 station relationships from the calibration period.”

They didn’t do this step, it’s actually difficult because the whole thing is done in 3pc land. I tried to do it at this post (non RegEM).

https://noconsensus.wordpress.com/2009/04/09/the-antarctic-an-engineers-reconstruction/

I got too much information spreading so stations which should have applied to a small area were overweighted, but it’s all from one type of source (surface stations).

I’m having a hard time saying the RegEM half sat half thermometer reconstruction is ‘bad’ simply for mixing instruments. If both instruments were good, it might work. In this case it’s clear now that it didn’t work and is the result of both bad math and sensors. BTW, the only other infilling I know of is Mann08 which chopped off 60 years of briffa and RegEM’d in the rest of the data from a small number of proxies.

## Ryan O said

Kenneth:

I will answer your questions somewhat out of order. 🙂

Nope. The main TIR reconstruction is entirely PCA post-1982. It truly is simply the 3 PCs. No RegEM or anything fancy. You can exactly replicate the post-1982 portion by performing the SVD on the AVHRR data, extracting the first 3 PCs, and then multiplying the left/eigenvalues/right matrices through.

Pre-1982 is different. For pre-1982, the 3 PCs are represented as a time series by multiplying the eigenvalues and the right eigenvector. This produces a single series for each PC extending from 1982 to 2006. This series essentially describes how the eigenvalues change with time.

The PCs are then placed alongside the 42 surface stations. In the post-1982 period, there is overlap. In the pre-1982 period, the PCs are empty. RegEM then iteratively finds a “best fit” approximation that minimizes the residuals over the entire 1957-2006 period. Missing points are assigned a value corresponding to the appropriate point on this best fit surface. Non-missing points are left unchanged, as they set the boundary conditions for the imputation. If they were allowed to change, RegEM would likely converge to a trivial solution of a plane.

In this way, RegEM infills the pre-1982 portion of the PCs – along with all of the surface stations, too. So RegEM uses the mutual covariance of all of the stations and the PCs to predict how the PC eigenvalues will change with time from 1982 back to 1957.

The infilled PCs are then extracted and the left eigenvectors multiplied through. The left eigenvectors contain the spatial information. They distribute the eigenvalue at each point in time over the whole continent.

Note that RegEM left the post-1982 information intact. So when the left eigenvector is multiplied through, the post-1982 result is simply the original 3 PCs. Because the eigenvalues have been extended backward in time via RegEM, you now

alsohave a pre-1982 result.This is correct. Tapio Schneider’s paper on RegEM describes a test in which 3.3% of the data points were removed at random and then imputed. This is a far cry from how RegEM is being used in this case. I am unaware of

testsusing RegEM onknowndata sets with large volumes of data removed. I am sure that RegEM has beenusedfor similar purposes, but I do not believe it has ever been tested for accuracy in such a situation.However, the sparsity of data is not the only question, and it may not even be the biggest question.

RegEM is normally used to infill

actual data– not representations of the data. This is a subtle but important distinction. If RegEM is being used to infill actual data, each one of the input series contains the spatial information (by virtue of the series representing a specific point) and the temporal information. Each data point represents avalueat apointat a specifictime.If the spatial relationship between locations changes, RegEM will be able to adjust the best-fit approximation to compensate. This is not 100% compensation – as the representation of the data is limited to the number of eigenvectors specified by the

korregparsetting – but it can compensate. As long as the firstkeigenvectors explain a great deal of the variance in the data, this representation can be very good. The less variance explained, the poorer the representation.In the Steig paper, RegEM

cannot explicitly adjust the PCs for spatial relationship changes. The PCs contain no spatial information. All RegEM can do is describe how the PCweightingchanges with time. In other words, for each point in time, RegEM is actually imputinghow importanteach PC is.At t=x, the imputed value for PC 1 may be 140.4. At t=x+1, the imputed value for PC 1 may be 70.2. When the left eigenvector (which contains the weights by locations) is multiplied through, every point at t=x is multiplied by 140.4 and every point at t=x+1 is multiplied by 70.2. This is done for each PC and the results summed at each point for each time.

So unlike normal implementations of RegEM, in this case, the spatial information is recovered after RegEM. This is important because the left eigenvector

is based entirely on the spatial weights determined from the SVD of the AVHRR data. The left eigenvector does not containanyinformation from pre-1982 data.This means that, for RegEM to output values that correspond to reality,

the real covariance between points on the gridWith this method, there is no way to detect changes in covariance.mustremain constant in time.I deferred this question until the end because the answer is a necessary outcome of the previous explanation. Because no spatial information is input into RegEM for the PCs, RegEM

cannotbe used to directly infill all values from 1957-2006. To recover the spatial information, you must have a covariance matrix that describes the entire grid, and the only way to obtain this is by using the AVHRR data.Even if this was not the case, I am not sure that imputing everything would yield more accurate results. Imputation is really a sophisticated form of interpolation. Replacement of real data with “what would have been interpolated if real data didn’t exist at this point in time” would not necessarily yield more accurate results. You can certainly compare the “what would have been interpolated” data with the real data to assess the accuracy of the interpolation, but I presently cannot think of a reason whyreplacementwould be better.In cases where there can be calibration or other offsets due to splicing, yes, you have a valid concern. But this is not an analogous situation. The output of RegEM is quite necessarily scaled to

each individual input seriesand does not suffer from this problem.## Ryan O said

Sorry about the excessive italics at the end. Formatting error. 🙂

## Kenneth Fritsch said

Firstly, Ryan, thanks very much for the detailed explanation of the RegEM process pre- and post 1982, and, for that matter, an easily understood summary of the whole process. I was beginning to see myself as my grandfather (actually I am my grandfather these days) when I would talk real slow and loud to him in explanation and then have him ask the same question again.

I think my excerpt from your post gets to the crux of the matter of an instrumentally derived post 1982 part and a differently derived pre 1982 part of the overall reconstruction. The calibration and verification processes, that are the topic of your thread here, are the very test, based on smaller periods of time, which would derive from a reconstruction of interpolation in its entirety. You make a good point that splicing of “instrumental” and “reconstructed” part of the reconstruction is apparently not an issue with the Steig reconstruction based on the methods of the RegEM algorithm and its application.

My objection for not presenting a reconstruction version which was interpolated in its entirety goes back to my views of published temperature reconstructions, primarily from tree rings and densities, which contain the famous period of divergence at the end of the reconstruction. As Steve M has noted, cases exist where the divergence of the reconstruction is lopped off and in its place is the instrumental data. Whether that instrumental data can be properly spliced or not is really not the major issue but instead the fact that graphical presentation hides the divergence problem of the reconstruction.

The comment above takes me full circle to your analysis at this thread and the differences between the early and late calibrations reconstructions and why I think that difference could be important and that you should test the statistical significance of the difference.

I apologize for selfishly taking bandwidth from your thread, Ryan, as I think what you have shown in your analysis here is more critical to our collective understanding of the Steig reconstruction and its usefulness. I think that you, Jeff ID and Jeff C (the Jeffs) have exhausted my questions on the Steig methods with regards to pre- and post- 1982.

By the way, as an aside, I always get a kick out of Steve M’s references to the Jeffs. I have identical twin granddaughters that do not like being referred to as the “twins” as they are very much individuals. In fact, last summer while on a walk with them they announced that they were no longer going to be twins. We discussed the advantages of being twins and then I asked them what they thought about it after our talk. Answer: “we still do not want to be twins”.

## Ryan O said

Kenneth: The twins thing – haha. 🙂

Don’t apologize for taking bandwidth because your questions often force me to think a little harder about what I’ve done. Writing a response to you helps me clarify things in my own head.

I agree entirely that the differences between the early and late calibrations are very important and deserve more attention than they received in the paper.

Incidentally, I believe Steve’s R implementation of RegEM returns the full approximation matrix (which has only imputed values, not original values) as well as the matrix that has the original values. I need to look into this a bit. But if it does, then we do have a way of seeing how faithful the RegEM imputation is compared to the original data. I haven’t gotten that far yet, however.

(For that matter, TCO’s questions (when he’s not ranting about McI or Watts) do the same, as well.)