# Steig’s Verification Statistics – Part 2 (A little more majic)

A guest post from Ryan O.  The second part of Ryan’s Steig et al. verification statistics.  It’s amazing what you can find by replicating a paper.  The title above is my own.

————————–

REPLICATING STEIG’S VERIFICATION STATISTICS – PART TWO

Fig. 1-Geographical assignments for ground data comparison.

.
This is kind of a misnomer since I won’t actually be doing any replication, but as there was a Part One, there needed to be a Part Two.  Originally I had intended to replicate the r, RE, and CE statistics in Table S2 of the Supplemental Information, but doing so would require that I first replicate the restricted 15-predictor TIR reconstruction, an activity that I consider has little value at this point.

Instead, this post will examine how well the main TIR reconstruction matches the ground data.  Note that this is something not done anywhere in Steig’s paper.  To recap, the verification statistics that Steig presents are:

.
1. r2, RE, and CE for the main TIR reconstruction (though the displayed images were of the PCA reconstruction) where the AVHRR data was compared to the reconstructed values.
2. r, RE, and CE for the AWS reconstruction (Table S1).
3. r, RE, and CE for the restricted 15-predictor TIR reconstruction.

.

There is a noticeable absence of verification statistics for the PCA reconstruction (except by accident) and pre-satellite verification statistics for the main TIR reconstruction.  Steig does not tell us how well the ground data matches the reconstruction.

.

Why is this important?

.

Remember that the methodology for the main TIR reconstruction is the following:

.
1. Perform cloudmasking/removal of outliers for the raw AVHRR data.
2. Perform single value decomposition of the cloudmasked data and extract the first 3 principle components.
3. Place the first 3 principle components alongside the data from 42 surface stations and use RegEM to impute the 1957-1982 values for the 3 principle components.
4. Generate the 1957-2006 reconstruction from the 3 principle components, which are now populated back to 1957.

.

I placed step 3 in italics because it is a step that requires some thought.  Aside from problems with RegEM, a critical prerequisite for this step to work properly is for the AVHRR data and the ground data to be measuring the same quantity.  If they are not measuring the same quantity – i.e., not properly calibrated to each other – the output of this step might very well be gibberish.

.
A check one could use to determine if the output is gibberish is to calculate verification statistics of the ground data vs. the reconstruction.  By itself this will not be able to distinguish between calibration problems, imputation problems, and resolution problems (insufficient number of PCs), but it will at least provide some insight as to whether there are problems that should be concerning.  Since Steig did not include such an analysis with the paper, we shall take it upon ourselves to provide them.

.
The procedure for doing this is a bit different than Part One because the ground stations are incomplete during different timeframes.  To handle this, I take the first 50% of data that actually exists and designate that as the calibration period.  The last 50% of data that actually exists is then the verification period.  I do this for each station, individually, and then calculate r, RE, and CE.  I then reverse the order of the calibration and verification periods, recalculate, and store the minimum values obtained.  This makes the procedure exactly analogous to the method used in the SI for the AVHRR data.  Additionally, to ensure that there are sufficient degrees of freedom for the comparison to be meaningful, I exclude any station without at least 4 years of history.  I apply the above procedure to all AWS stations and all manned stations except Campbell, Macquarie, Grytviken, Orcadas, and Signy.  The reason for excluding them is because the nearest gridcell is 100+ km away, so there are no equivalent reconstruction values to which I can compare them.

.
The plots are color-coded with the geographical assignments of Fig. 1.

.
First up, correlation coefficient:

Fig. 2 – Correlation coefficient (ground station to reconstruction).

.
Take special note of the relatively high correlation coefficients for the red group, which is West Antarctica, and the low coefficients for the blue group, which is East Antarctica.  This may be unexpected, and it will become important.

Now coefficient of determination:

Fig. 3 – Coefficient of determination (ground station to reconstruction).

.
From the above 2 plots, we might be tempted to draw the following two conclusions:  1) the correlation coefficients and coefficients of determination are fairly low; and, 2) the reconstruction performs the best in West Antarctica and the Ross Ice Shelf.  This would seem to be counter-intuitive, since the plots of the eigenvectors and the AVHRR-to-reconstruction verification statistics would have us assume that the best performance would be in East Antarctica.  However, before we draw too many conclusions, let’s look at RE and CE:

Fig. 4 – RE (ground station to reconstruction).

.
Hm.   In this one, the Peninsula and West Antarctica that look good.  But isn’t the Peninsula supposed to be the least well-reconstructed region?  Maybe CE will tell us something:

Fig. 5 – CE (ground station to reconstruction).

.
Ah.  On the CE plot, the Peninsula stations look suitably poor.  However, the poor performance of East Antarctica seems as unexpected as the better performance of West Antarctica and the Ross Ice Shelf.

Now let’s step back and think about this for a moment.  After a review of all of these plots, we would conclude:

.

1. The reconstruction does the worst job in East Antarctica on all plots – r, r2, RE, and CE.
2. The reconstruction does the best job in West Antarctica on all plots, with the Ross Ice Shelf running a close second.
3. The overall performance of the reconstruction with respect to the ground data is poor, and in the Peninsula and East Antarctica, simply taking the mean of the station data explains more variance than fitting the reconstruction to the ground data.
4. The trends from the reconstruction should therefore be the most accurate in West Antarctica and the least accurate in East Antarctica.

.
So let’s see if our conclusion #4 is on the mark.  The minimum number of data points required for the following plot is 48 months.

Fig. 6 – Difference in slope for common points (reconstruction minus ground data).

.
Funny.  The most faithful replication of slope is in East Antarctica!

.

Here are the region means and standard deviations:

.

1. Peninsula:  -0.4282 (mean);  0.5059 (standard deviation) – 20 stations
2. West Antarctica:  -0.3576 (mean);  1.1581 (standard deviation) – 9 stations
3. Ross Ice Shelf:  0.1782 (mean);  0.5595 (standard deviation) – 26 stations
4. East Antarctica:  -0.0358 (mean);  0.5215 (standard deviation) – 40 stations

.
Why is this so?

First, the sample sizes for the mean/sd calculations are low.  In all cases, zero is well within the 95% confidence intervals.  So we cannot say that any of the means are different from zero at any reasonable significance levels – except for perhaps the Peninsula.  It could be that East Antarctica performs the best simply because we have the largest number of samples in that area.  Indeed, the distribution of peaks as we go from left to right almost looks random.

.
Second, the signal-to-noise ratio is very low.  In a 10-year sample, the signal we are looking for is on the order of 0.5 deg C or less, while the noise exceeds 10 deg C peak-to-peak.  Because of this, r, r2, RE, and CE penalize missing the noise to a much greater extent than they penalize missing the signal.  This leads to preferentially high scores for matching the high-frequency noise over the low-frequency signal.  The problem is made worse by the potential for spurious correlations between noise of physically unrelated stations because of the short record lengths and spotty coverage.

.
In short, I believe had the authors presented plots of the relevant verification statistics for the ground stations – which show negative REs for about half of them and negative CEs for about 3/5ths – they might have had a more difficult time with publishing.  The reconstructed values simply do not match the ground records.

.
With that being said, I personally do not think the situation is intractable.  I think that a reasonable reconstruction can be done using both the satellite and ground data.  Hopefully over the next few weeks I will be able to present a method that does a fairer job of representing actual Antarctic temperatures.  It will be based in a large part on the work that Jeff Id has already done.

.
Additionally, I intend to run some experiments using manufactured temperature fields to perform sensitivity studies with RegEM and this particular implementation of PCA.  There should be some fairly generic rules that result.  There are also a few different statistical measures I intend to test to determine a set of tests that provide more statistical power than r, r2, RE, and CE alone.

.
For the interested, here are the station numbers as used in the above plots:

Fig. 7 – Ground station numbers and associated names for the plots.

.

Lastly, the R script for calculating the above quantities:

ssq=function(x) {sum(x^2)}
get.stats.matrix2=function(orig, est) {
###  Set up the placeholder matrix for intermediate calculations
stats.matrix.early=matrix(ncol=ncol(orig), nrow=14)
stats.matrix.late=matrix(ncol=ncol(orig), nrow=14)
###  Loop through each column
for(i in 1:ncol(orig)) {
###  Extract station i
orig.all=orig[, i]
est.all=est[, i]
###  Fill in NAs for all points that are not shared
map=is.na(est.all); orig.all[map]=NA
map=is.na(orig.all); est.all[map]=NA
###  Find linear trends for all common points
tr.orig=lm(orig.all~I(1:(length(orig.all))/120))
tr.est=lm(est.all~I(1:(length(est.all))/120))
###  Remove NAs for r, RE, and CE calculations
orig.all=na.omit(as.vector(orig.all))
est.all=na.omit(as.vector(est.all))
###  Split vectors into calibration and verification periods
orig.c=orig.all[1:trunc(length(orig.all)/2)]
orig.v=orig.all[(1+trunc(length(orig.all)/2)):length(orig.all)]
est.c=est.all[1:trunc(length(orig.all)/2)]
est.v=est.all[(1+trunc(length(orig.all)/2)):length(orig.all)]
###  Set up temporary storage for calculation results
stats.matrix.early[, i]=rep(9999, 14)
stats.matrix.late[, i]=rep(9999, 14)
###  Ensure at least 24 months of data in each period before calculating r, RE, and CE
if(length(orig.c)>23 & length(orig.v)>23 & length(est.c)>23 & length(est.v)>23) {
###  Get stats for early calibration (first 50%)
stats.matrix.early[, i]=c(get.stats(orig.c, orig.v, est.c, est.v), tr.orig\$coef[2], tr.est\$coef[2])
###  Get stats for late calibration (last 50%)
stats.matrix.late[, i]=c(get.stats(orig.v, orig.c, est.v, est.c), tr.orig\$coef[2], tr.est\$coef[2])
}
}
###  Find the minimums between early and late
stats.matrix.min=stats.matrix.early
map=stats.matrix.early>stats.matrix.late
stats.matrix.min[map]=stats.matrix.late[map]
###  Replace 9999’s with NAs
map=stats.matrix.early==9999
stats.matrix.early[map]=NA
map=stats.matrix.late==9999
stats.matrix.late[map]=NA
map=stats.matrix.min==9999
stats.matrix.min[map]=NA
###  Return a list with minimums, early cal stats, and late cal stats
stats.list=list(stats.matrix.min, stats.matrix.early, stats.matrix.late)
}
get.stats=function(orig.cal, orig.ver, est.cal, est.ver) {
###  Get calibration/verification period means
orig.mean.c=mean(orig.cal)
orig.mean.v=mean(orig.ver)
est.mean.c=mean(est.cal)
est.mean.v=mean(est.ver)
###  Get residuals
cal.r=orig.cal-est.cal
ver.r=orig.ver-est.ver
###  Calculate Hurst parameter
hurst=HurstK(c(orig.cal, orig.ver))
###  Calculate average explained variance [Cook et al (1999)]
R2c=1-ssq(cal.r)/ssq(orig.cal-est.mean.c)
###  Calculate Pearson correlation for verification period
###  [Cook et al (1999)]
pearson=cor.test(orig.ver, est.ver, method=”pearson”)
r.ver=pearson\$est
r.ver.p=pearson\$p.value
###  Calculate Pearson correlation for all available data
###  [Cook et al (1999)]
pearson=cor.test(c(orig.ver, orig.cal), c(est.ver, est.cal), method=”pearson”)
r.all=pearson\$est
r.all.p=pearson\$p.value
###  Calculate Kendall’s Tau for the verification period
tau=cor.test(orig.ver, est.ver, method=”kendall”)
tau.ver=tau\$est
tau.ver.p=tau\$p.value
###  Calculate Kendall’s Tau for all available data
tau=cor.test(c(orig.ver, orig.cal), c(est.ver, est.cal), method=”kendall”)
tau.all=tau\$est
tau.all.p=tau\$p.value
###  Calculate RE [Cook et al (1999)]
RE=1-ssq(orig.ver-est.ver)/ssq(orig.ver-orig.mean.c)
###  Calculate CE [Cook et al (1999)]
CE=1-ssq(orig.ver-est.ver)/ssq(orig.ver-orig.mean.v)
###  Return vector
stats=c(hurst, R2c, r.ver, r.ver.p, r.all, r.all.p, tau.ver, tau.ver.p, tau.all, tau.all.p, RE, CE)
stats
}
Note:  Ground data was taken from the MET READER site on 4/4/2009.  Recon data taken from Steig’s site (ant_recon.txt)

## 40 thoughts on “Steig’s Verification Statistics – Part 2 (A little more majic)”

1. Matt Y. says:

Impressive stuff, Ryan. Your efforts are appreciated. I look forward to seeing your reconstruction. Herculean task you are taking on there.

2. Kenneth Fritsch says:

I second what Matt Y said in the above post.

In short, I believe had the authors presented plots of the relevant verification statistics for the ground stations – which show negative REs for about half of them and negative CEs for about 3/5ths – they might have had a more difficult time with publishing. The reconstructed values simply do not match the ground records.

Ot at least buried it in the SI and had that dramtic cover for Nature retained. I agree that overall (for the Antarctica regions) the statistics do not bode well for the validity of the reconstruction.

Hopefully over the next few weeks I will be able to present a method that does a fairer job of representing actual Antarctic temperatures.

Additionally, I intend to run some experiments using manufactured temperature fields to perform sensitivity studies with RegEM and this particular implementation of PCA.

Definitely something to look forward to. I like the approach you are taking here.

Also good review of the Steig paper and articulation of what you are attempting to do. I must admit that I sometimes let my guard down and slide over points in a well written analysis.

3. Layman Lurker says:

Great stuff Ryan. Other than Steve M on MBH, I have never seen anything like the work that you guys are putting in on this.

I just thought of another correlation…both you and Jeff are irish! Maybe correlation is causation. 8)

Some thoughts (feel free to consider, discard, cull out, laugh at, as you see fit):

I was thinking about Jeff’s re-trended scenario with constant -2C applied. If one re-trended or simply de-trended while maintaining the signal to noise ratio of the data within individual grids (ie: not constant trend) then the sensitivity of the verification stats to trend changes could be properly checked as well. I realize this may be easier said than done.

My humble offering of potential candidates of manufactured scenarios for sensitivity:
1. Re-locating the peninsula surface station cluster to another region where there is cooling – leaving only one station on the peninsula representative of the average.
2. As Jeff mused about, leave peninsula data alone and re-trend the greater continent.
3. Permutations / combinations including more PC’s.

4. Ryan O says:

Layman, all good things to try . . . and, TBH, #2 and #3 are, in my opinion, a necessity for any kind of reasonable sensitivity analysis.

#1 would only work if we explicitly included distance weighting for correlations, which RegEM does not do. Jeff or I could relocate the Peninsula stations to the moon and RegEM wouldn’t care. With that being said, if we did a distance weighting for correlations, we would have to do sensitivity analysis for the weighting factors, which is mathematically equivalent to your proposal.

5. Ryan O says:

Kenneth:

I must admit that I sometimes let my guard down and slide over points in a well written analysis.

I would implore you not to let your guard down. 🙂 I’ve often used your insights to get as far as I have. It benefits all of us to have you be as critical of our blogosphere ramblings as you are of the papers that inspire them.

6. TCO says:

I have a hard time following the explication. It seems obvous that PC resolution would casue issues with getting good nearest neigbor matches here. Also, that you a bit conflate the idea of matching AVHRR to station (measuring same thing) and PC TIR matching to AWS.

7. Ryan O says:

A check one could use to determine if the output is gibberish is to calculate verification statistics of the ground data vs. the reconstruction. By itself this will not be able to distinguish between calibration problems, imputation problems, and resolution problems (insufficient number of PCs), but it will at least provide some insight as to whether there are problems that should be concerning. Since Steig did not include such an analysis with the paper, we shall take it upon ourselves to provide them.

#6 I get points for saying this first.

Also, AVHRR to station is not measuring the same thing:

http://www.climateaudit.org/?p=5678

And lastly, AWS stations are fundamentally no different than manned stations. The temperature is still measured with a thermometer on a stick. Methinks you are confusing “AWS recon” with “AWS stations”.

8. Kenneth Fritsch says:

Ryan O, from your Part 1 you stated that:

The TIR reconstruction produces decent r2, RE, and CE statistics when compared to the AVHRR data in the 1982-2006 timeframe (the comparison to the ground data – including comparisons back to 1957 – will be Part Two). Though the authors used AR(1) noise instead of the more realistic AR(8) noise, this conclusion is unaltered.

I agree with the importance of your Part 2 analysis in that the 43 surface stations are apparently the centerpiece of the reconstruction. I think I am correct when I say that the 1982-2006 TIR reconstruction does not use the AVHRR data directly but uses the developed 3 PCs after “relating” the 43 surface stations to the AVHRR grids.

If I am correct up to this point then my question is: for the 1982-2006 period, are the 43 surface stations the driver for the AVHRR grids even though the grids could stand alone during this period? I would guess from the authors of Steig et al. statements about the uncertainties of the AVHRR measurements and particularly with reference to the cloud masking that they have more confidence in the 43 surface station measurements. I have the general inclination that the AVHRR measurements are used to spread the 43 surface station data around the Antarctica through correlations for the 1982-2006 period and the 1957-1981 period.

I also have the feeling that I could be missing something here and have this wrong. Please put me on the right path before I ask a question about the r, RE and CE values for the 1982-2006 period for AVHRR versus TIR reconstruction compared to those same values for the surface stations versus TIR reconstruction for the same period.

9. TCO says:

Yeah, I read your caveat. I just think that measuring a composite with 3 PCs versus station is different than measuring local satt versus station. Such that doing the former does not tell you “if they are measuring the same thing.”

10. Layman Lurker says:

#4

Ryan, in case I didn’t make myself clear, I was suggesting a “what if” scenario. “What if” there was only one peninsula station and a cluster of (psuedo) stations in tight proximity to an existing station with a cooling trend. The cluster would be closely correlated in both trend and noise like the peninsula – only cooling instead of warming. In essence a mirror image of the peninsula. I understand that RegEM would not know the location of the psuedo-stations, but if the hf and trend correlations mirrored the peninsula (only cooling) perhaps the method is sensitive and would smear this cooling out into the continent. Particularly pre 1982. The method should not be sensitive to this because the underlying temperatures would be assumed to be unchanged.

11. #8

The post 1982 reconstruction is entirely AVHRR data in 3 pc format. RegEM Matlab version doesn’t do any modification of the existing values.

I have the general inclination that the AVHRR measurements are used to spread the 43 surface station data around the Antarctica through correlations for the 1982-2006 period and the 1957-1981 period.

I also thought this would be the case initially, after all it makes sense. We need to stop thinking though and loose our minds to the force…. The post 82 data is uncorrected and noisy ‘trend’ satellite TIR info in 3PC format unaffected by surf temps. The older half is entirely surf data mushed together by high freq correlation.

12. TCO says:

“The older half is entirely surf data mushed together by high freq correlation.”

Of course it is. The question is how well it was mushed together. I actually SHARE a lot of the concerns of people here. I just don’t like the blanket statements, with a lack of chasing things down to ground…and the “tests” where an assumption already exists that patterns, teleconnections, high freq etc are insignificant. I mean maybe they are…but they are not automatically insignificant.

This is why I dislike 3 PCs more than I dislike RegEM

13. Kenneth Fritsch says:

The post 82 data is uncorrected and noisy ‘trend’ satellite TIR info in 3PC format unaffected by surf temps. The older half is entirely surf data mushed together by high freq correlation.

Jeff ID thanks for setting me straight – again. I think I have gotten the subtle point here wrong before. What Ryan O found in his Part 1 was r, RE, and CE statistics on the 3 PC format (unaffected by the surface temperatures) versus the AVHRR data. That comparison contrasted greatly (in Antarctica regional terms) from the comparison Ryan O made in Part 2 where it was the 3 PC format (unaffected by the surface stations) versus the surface stations.

Let me know if I have the comparisons correct. Jeff ID, I also want to attempt to ask some questions about the high/low frequency correlations as soon as I get my head around them better and perhaps look at how GISS handles this issue.

14. Kenneth Fritsch says:

Below, I have linked and excerpted some comments from GISS on using rural (or rural by GISS’s definition) as correlation and adjustment for proximate urban and suburban stations. My intent is to show that GISS is interested in using the correlation based on long term trends and not high frequency correlations. Missing data within the GISS limits of 9 missing per month are in-filled by extrapolation. Off topic to this thought is the comment from GISS where they admit to a problem with micro climates with non climate influences that can exist even in rural and non-urban stations. They wave its importance away by saying they have it covered, but the Watts team findings would seem to put some doubt into that waiver.

You may not agree with GISS methods but those methods would not apparently lead to the treatment that Steig et al use in correlating distant stations.

The basic GISS temperature analysis scheme was defined in the late 1970s by James Hansen when a method of estimating global temperature change was needed for comparison with one-dimensional global climate models. Prior temperature analyses, most notably those of Murray Mitchell, covered only 20-90°N latitudes. Our rationale was that the number of Southern Hemisphere stations was sufficient for a meaningful estimate of global temperature change, because temperature anomalies and trends are highly correlated over substantial geographical distances.

The analysis method was documented in Hansen and Lebedeff (1987), showing that the correlation of temperature change was reasonably strong for stations separated by up to 1200 km, especially at middle and high latitudes. They obtained quantitative estimates of the error in annual and 5-year mean temperature change by sampling at station locations a spatially complete data set of a long run of a global climate model, which was shown to have realistic spatial and temporal variability.

The GHCN/USHCN/SCAR data are modified in two steps to obtain station data from which our tables, graphs, and maps are constructed. In step 1, if there are multiple records at a given location, these are combined into one record; in step 2, the urban and peri-urban (i.e., other than rural) stations are adjusted so that their long-term trend matches that of the mean of neighboring rural stations. Urban stations without nearby rural stations are dropped.

The excerpt below is off topic for the subject of my post but relevant to what the Watts team has been attempting to determine.

We find evidence of local human effects (“urban warming”) even in suburban and small-town surface air temperature records, but the effect is modest in magnitude and conceivably could be an artifact of inhomogeneities in the station records. We suggest further studies, including more complete satellite night light analyses.

Click to access 2001_Hansen_etal.pdf

The urban adjustment of Hansen et al. [1999] consisted of a two-legged linear adjustment such that the linear trend of temperature before and after 1950 was the same as the mean trend of rural neighboring stations.

The USHCN analysis [Karl et al., 1990; Easterling et al., 1996a] contains another small adjustment in which missing data, mainly in the period 1900-1910, are filled in by interpolation. The effect is much less than the time of observation and station history adjustments, as illustrated. This adjustment is not included in the GISS analysis, which was designed to minimize the effect of data gaps.

By the way, the most recent version of USHCN (Version 2) uses break or change point analysis to make corrections for all non homogeneity effects in the temperature data including UHI effects.

15. Fluffy Clouds (Tim L) says:

Just throwing this out here.

if there is a Y axis crossover error it might be detected by adding 100 to the data.

example -1 + -1= -2 , 1+1=2 , -1 + 1 =0

we can see that there is a change of 2 for all three cases, but the last one records a “zero”

As always just ignore This crazy person.

good work ryan

16. Layman Lurker says:

#14 KF

Kenneth, thanks for bringing this to light. Maybe the Chosen One will yet come round.

17. Kenneth Fritsch says:

I think that this thread is as good as any to admit to some grasshopperish inclinations on my part and what I am attempting to do to curtail them.

My problem arises from recollections of the contents of the Steig et al. paper and my recollections of what the analyses here and at CA have shown about the processes used in that paper. I have no excuses for missing even the more subtle points of the Steig et al paper as I have read it a couple of times. I might have some excuse for not recalling all the revelations about the Steig paper analyses here and at CA because a summary of all of them does not exist in a single document like it does in the Steig paper. I do, however, think the separate summaries have been very good.

I plan to go back and carefully read the Steig paper again in order to determine which of these more subtle points on the processes used in their paper can be derived there and how many of them are only and exclusively revealed through the Air Vent and CA analyses. I then want to attempt to answer my questions through reviewing Air Vent and CA posts and finally by asking a few direct questions.

Having said all that I do want to ask a preliminary question here:

The TIR reconstruction for the period 1982-2006 does not use the surface station data, but the 1957-1981 time period must of necessity use it. We then essentially have a period where instead of using the processed AVHRR data it is first submitted to PCA and then put into 3 PCs with no surface stations inputs. The 1982-2006 TIR reconstruction then is essentially an exercise in getting the processed AVHRR data into the 3 PC format.

The period 1957-1981 TIR reconstruction then becomes one using the relationships of the 1982-2006 AVHRR processed data with the surface stations for the period 1982-2006 and processing those relationships in a different and separate 3 PC format for reconstruction of the 1957-1981 period using the surface station data for the 1957-1981 period.

This process, in my mind, would consist of an interpreted instrumental record (3 PC format) for 1982-2006 period attached to the end of the truly reconstructed 1957-2006 period. For that attachment to have continuity and meaning would require a rather exacting splicing of the two periods. How is that splicing facilitated in the Steig paper processes?

18. “For that attachment to have continuity and meaning would require a rather exacting splicing of the two periods. How is that splicing facilitated in the Steig paper processes?”

RegEM magic does the splicing. The surface data is placed in a matrix with the 3pc’s. RegEM does the rest. Nobody worried too much about continuity only appearance. There were several posts here early on which looked at the two periods separately, the match wasn’t very good.

19. TCO says:

Ken:

I don’t think you are wrong for having a hard time remembering and keeping track of all the critical analyses. They have been disjointed and evolving (even correcting itself). Which is fine…this is the McI defense that blogs are notebooks and scratch pads.

It’s just that actual scientists or even observors should not be considered required to monitor this stuff or to make judgements on Stieg et al. This is why I jump on JEff when he gets all breathy and wants to say how bad Steig is. Wrap it up first…then it will be like a knife through the heart. Don’t celebrate when the knife fight is far from over. Also why I criticize the silliness of McI giving himself pats on the back as some sort of super publisher when has one real paper (GRL 2005) or expects other people writing papers to Google Climate Audit and cite it for ideas.

20. #19, When you find a big problem which has obvious implications, isn’t that worth a blog post? It makes me think you’re missing the point. I would have been just as happy to report that I was able to replicate everything exactly and there was no oddness in the math but this math ain’t good and it ain’t gonna be later this summer or next year.

And that ain’t my fault, I wouldn’t have published it without checks. RegEM is a good idea on the surface but it requires more QC after the fact to prove it did what was expected. In this case, IT DID NOT.

What’s worse is that they didn’t verify their results against a baseline such as average surface station trends or a nice area weighted recon. Or what if they did some work to confirm the AVHRR post 1982 matched surface station trend post 1982 instead of presenting correlation (actually demonstrated here in tAV to be disassociated with trend) as the proof of quality. From my perspective, this is a horrible bit of stat mashing.

21. Kenneth Fritsch says:

I have reread the Steig et al. paper and most of SI and now have some questions that I will present one at a time.

RegEM uses the surface station measurements and its relationships with the grid based AVHRR processed data for calibration and verification in the 1982-2006 time period. Verification and calibration statistics reported in Steig uses a comparison of the RegEM (with the 3 PC format) derived reconstructions in the 1982-2006 period and the processed AVHRR data. Figure 1 in Steig reports r, CE, RE statistics for the processed AVHRR grid data versus the reconstructed grid results derived as noted above. Figure 2 in Steig reports r, CE and RE statistics for the Antarctica-wide monthly mean processed AVHRR data versus the monthly means derived the reconstructed grid results derived as noted above.

What I have seen as typical of temperature reconstructions is that the reconstruction process and reporting (normally in a time series graph) is carried through the instrumental period of calibration and verification. Mann et al. did that in their famous HS paper by showing the reconstruction to the point where the data ran out. Problematic for Mann (amongst many problems) was attaching the instrumental data on the end reconstruction (in the early 1980s as I recall) for perspective without actually bothering to formulaically splicing it on.

If the TIR reconstruction as presented in the Steig paper does not use the same reconstruction in the 1982-2006 period as that used in the 1957-1981 period, and instead uses the instrumental data put into 3 PC format, this, in my view, is against common practice and becomes problematic.

Careful reading of the Steig et al. paper and the SI has not allowed me to determine how the actual 1982-2006 reconstruction was done. That information has to come from those at Air Vent and CA who have gained sufficient analysis insights to make a reasonably certain call on it.

Are we quite certain that the 1982-2006 part of the reconstruction was different than that used for the 1957-1981 period and that it was independent of the surface stations?

22. Kenneth,

RegEM (matlab version) doesn’t replace known values in the matrix. The AVHRR reconstruction has the 1982 – 2007 portion of the matrix already covered so these values don’t change so the short answer is yes. SteveM’s algorithm apparently doesn’t mask the existing values out so they get re-calculated each time, something Ryan pointed out but I haven’t confirmed.

23. TCO says:

Steve fu…messed up?

24. Kenneth Fritsch says:
25. Kenneth Fritsch says:

Sorry something happened to my posted comments and I’ll try again.

RegEM (matlab version) doesn’t replace known values in the matrix. The AVHRR reconstruction has the 1982 – 2007 portion of the matrix already covered so these values don’t change so the short answer is yes.

Now all we need to know is how the known values got into the matrix. Were the values the processed AVHRR data unaffected by the surface station relationships or were they values from the reconstruction for the 1982-2006 period as required for the calibration and verification testing?

Do you know the answer directly or would it be determined by a test.

I recall that the TIR reconstruction for pre and post 1982 had different appearances (noise levels?).

26. #23,

Steve’s algorithm is close. Five decimal places or something.

27. Ryan O says:

#23 No, Steve’s version is more straightforward. It allows you to see what RegEM thinks the real data should be, so calculating verification statistics is far easier. Unlike the MatLab version, you can calculate verification statistics without withholding data – which could be a critical feature given that RegEM results can change unpredictably when the input data is changed.

28. Kenneth Fritsch says:

Jeff ID you say in the thread introduction:

This is done by calculating the principal components of the satellite data and using RegEM to allocate station trends according to the high frequency covariance (a subject which needs more time).

If I understand you correctly the process that you describe above is for your attempt to replicate the TIR 1957-1981 part of the reconstruction.

At post 21 Hu McCulloch said:

Great job, Jeff! Whether or not Steig09’s ant_recon.txt is meaningful or robust, at least you have shown that it can be replicated (closely enough for practical purposes) from cloudmaskedAVHRR.txt + 42 surface stations.

At post 30 Nic L said:

I am certain that Steig simply used RegEM on a combination of the data from the 42 surface stations and the 3 PCs for 1982-2006 that he derived from processing the satellite data.

I am again confused by Hu M’s post, while Nic L’s post leads me to believe that the 1982-2006 part of your attempt at replication was derived in 3 PC format by using the 42 station and processed AVHRR instrumental data.

Jeff ID have I assumed correctly here? If I have then the evidence says that Steig has essentially used instrumental data for the 1982-2006 part and reconstructed results for the 1957-1981 part. That would surely not be a valid temperature reconstruction as I know them.

In the Steig et al. (2009) SI the authors state: “We apply the same method as for the TIR-based reconstruction to the AWS-data, using RegEM with k=3 and using the Reader occupied weather station temperature data to produce a reconstruction of temperature at each AWS site for the period 1957-2006.” Was not the AWS reconstruction produced using the 3 PC format and relating the 42 surface stations data to the 63 AWS stations over the entire 1957-2006 period?

If I assume correctly from your evidence, Jeff ID, would not the sSteig SI comment be a bit misleading?

29. #29, Hu is simply recognizing that the post took the Comiso satellite data and processed it into 3PC’s Nic L is recognizing the fact that 3PC’s are used rather than the whole dataset. Prior to that we simply took 3pc’s from the output data and made the reconstructions.

The output data only consisted of linear combination’s of 3 PC’s.

“Steig has essentially used instrumental data for the 1982-2006 part and reconstructed results for the 1957-1981 part.”

The instrumental data post 1982 is satellite AVHRR so that highly processed and very noisy data was collected from orbit. The pre-1982 is ground station instruments with linear weighting according to correlation with 3 pc’s of sat data. The ugly part comes when the covariance matrix is calculated and truncated in RegEM retaining a fractional portion of the information. Theoretically it could work but there is a lack of verification, something that Ryan just nicely demonstrated in his latest post.

In the AWS reconstruction the matrices were small so reduction to 3PC’s wasn’t necessary. They just placed 42 surface stations next to 63 AWS surface stations and let RegEM rip.

30. Kenneth Fritsch says:

The instrumental data post 1982 is satellite AVHRR so that highly processed and very noisy data was collected from orbit. The pre-1982 is ground station instruments with linear weighting according to correlation with 3 pc’s of sat data.

Jeff ID, I would prefer a yes or no answer to my assumption that instrumental data was used without RegEM relating it to the 42 surface stations in the TIR reconstruction post 1982.

In the AWS reconstruction the matrices were small so reduction to 3PC’s wasn’t necessary. They just placed 42 surface stations next to 63 AWS surface stations and let RegEM rip.

Ok then Steig et al. used RegEM without PC reduction for AWS. The authors did a calibration and verification using RegEM for the 1980-2006 period and using that calibration for RegEM applied it to the entire 1957-2006 period – using RegEM. My point is that, if my assumptions for AWS are correct, the Steig reconstruction for AWS is in line with my experience with other temperature reconstructions. My assumptions for the TIR reconstruction would indicate that that reconstruction is flawed.

I’ll look at your code in the link above and see if I can determine what I am looking for.

31. 31 Sorry, Yes post 1982 was only instrumental. It was satellite TIR instrumental in 3PC format post 1982. The question was a bit complicated to me because the early part was also surface station instrumental processed through RegEM.

“My assumptions for the TIR reconstruction would indicate that that reconstruction is flawed.”

The authors assume RegEM ‘calibrates’ the surface station data appropriately to make it a good reconstruction. From everything we’ve looked at it does not.

32. Ryan O says:

Kenneth,

Let me give this a shot. It may be a bit longer of an explanation than you wanted, but I think it will help clear up what Steig did. Theoretically, the methodology used for the TIR reconstruction could be valid (with some caveats). While it may seem different on the surface, it is not dissimilar to other PCA-type reconstructions. It is more subtle, and, to be honest, rather clever in my opinion – because it is the methodology that allows them to legitimately not perform the type of calibration you are referring to.

First, the PCA portion:

The purpose of the PCA portion in this case is not really to perform the reconstruction, like it is in the hockey stick cases. Here, the only purpose of the PCA portion is to decrease the number of variables needed to represent the AVHRR data. They do this in order to make the problem computable for RegEM. Fancy terminology aside, it’s simply a data reduction step.

The RegEM portion:

They then take the 3 PCs and put them in a matrix alongside the station data. This is fed into RegEM, which scales and centers the data.

(Note: This is the reason they don’t have to do the calibration steps. RegEM itself does the scaling and centering. They can’t change the scaling and centering to be the wrong period, like in the hockey sticks. RegEM has sort of a built-in calibration.)

RegEM then looks at the matrix, fills in all of the missing data with zeros, and performs an SVD on the matrix. It retains k (or regpar) eigenvectors to represent the input matrix. It then tries to compute a covariance matrix that minimizes the residuals.

Think of the covariance matrix as a plane that attempts to fit through the rank-k data. Any points that are missing take on values corresponding to points on the surface of the plane.

Now RegEM has a completely filled-in dataset. But since it had formed the first covariance plane from a data set where all the missing values were zero – and now they are all non-zero – the eigenvectors will be different. So it goes through another iteration: SVD, calculate covariance, update the values. It continues to do this until the difference between iterations is less than a preset tolerance.

So RegEM doesn’t care whether you pre-calibrate the data – it does it as a matter of course. I tried recentering the data on arbitrary timeframes prior to input into RegEM and the output is always the same. I added constants, subtracted constants, multiplied by constants, deliberately offset the baseline for anomalies for some of the ground stations, used different offsets, all kinds o’stuff. None of it mattered. Regardless of the initial centering/scaling scheme, RegEM always output the same answer.

In other words, no interpretation or pre-calibration of the ground data to the satellite data is required. RegEM automatically does this on its own. It is fundamental to the algorithm.

When it comes to the AWS reconstruction, just think of the AWS reconstruction as a sparser version of the AVHRR data. There was no need to reduce it to PCs because it was already a small enough data set to be manageable for RegEM.

I hope that helps answer your questions. 🙂

33. Ryan,

Another great summary, de-centering the data was one of the first thing’s I tried. After running Matlab it didn’t make a difference.

I also completely agree that the whole thing is clever, it’s actually insidiously clever. The checks for it are the problem and they are not so clever, yet they are accepted by our alleged brightest. I’m worried that the exposition of the problems will not be understood by those who are in climate science. I doubt think they teach math in climatology the same way as in engineering or physics.

34. Ryan O says:

Evil geniuses they are. Muahahahaha.

Seriously, it is really clever. Light on the verification side, but damned clever. Even though I enjoy trying to rip it apart, I must admit that I think what they did is pretty cool.

35. #35 Agreed, let’s not rip it apart unless it’s deserved.

I say it that way because the casual reader won’t understand the serious defects and it will read differently to most. Your posts are clean and truthful so it’s important to let the math lay where it is.

IMO, The problem is not this paper but rather that RegEM will be used indiscriminately without proper verification on so many others. Mann08 is an example.

If I publish anything down the road, this is the most important point to me.

36. Ryan O says:

Yep, without a doubt.

37. Fluffy Clouds (Tim L) says:

3. Place the first 3 principle components alongside the data from 42 surface stations and use RegEM to impute the 1957-1982 values for the 3 principle components.

so this could force the early temps cooler?

IF the stations and sats. don’t match for the pca or tir? not properly calibrated to each other?

this is diabolically clever lol!!!

38. Kenneth Fritsch says:

Seriously, it is really clever. Light on the verification side, but damned clever. Even though I enjoy trying to rip it apart, I must admit that I think what they did is pretty cool.

It has been my impression since gaining a layman’s perspective of all these temperature reconstructions, and particularly the original Mann et al. HS, that it is that cleverness that got these papers wide recognition and in my opinion a pass on the strict adherence to statistical standards for the methodologies used. I think that when you combine that cleverness with the apparent effort that goes into the reconstructions you have an unbeatable combination. In view of the consensus amongst climate scientists one can see that a clever paper, like Steig et al. (2009), that gives evidence for the consensus is going to be difficult to not publish, acclaim in the popular media and get a pass from criticism by climate scientists in general.

The cleverer the paper appears, the more we should enjoy attempting to rip it apart – or as I prefer: doing sensitivity analyses. By the way, I think that Steig et al. is almost as iconic as Mann’s HS with that Nature cover picture of warming spreading to the West Antarctica. If I were marketing immediate AGW mitigation, I would have to say that Mann’s HS and Steig’s Nature cover have excellent emblematic content.

39. #39, “If I were marketing immediate AGW mitigation, I would have to say that Mann’s HS and Steig’s Nature cover have excellent emblematic content.”

Well now we can’t let that happen, can we? 😉