## Roman M – PCA Deconstruction – an unauthorized biography.

Posted by Jeff Id on February 7, 2009

Roman M did an interesting analysis over at CA today, deconstructing the reconstructed temperatures in the Antarctic paper titled Warming of the Antarctic ice-sheet surface since the 1957 International Geophysical Year:

## Deconstructing the Steig AWS Reconstruction

The paper demonstrates extraordinary warming in the Antarctic. The entire exercise came about because Dr. Eric Steig has not released his code, despite his unreasonable claims. Roman's post back-calculated the PCs used in the Automatic Weather Station version of the reconstructed temperatures. What we need to understand is that weather stations in the Antarctic are frequently buried and cease working.

Here is a graph of the available temperature series when they were actually reporting data. This is a plot of the peninsula stations but it is similar to the automatic stations in data availability.

The PCA analysis by Roman allowed him to determine the shape of the interpolated curves which were used to infill the gaps in the data to give continuous series for their final result.

Step 1.

We begin by using a principal component analysis (and good old R) on the truncated sequences. For our analysis, we will need three variables previously defined by Steve Mc.: Data, Info, and recon_aws. In the scripts, I also include some optional plots which are not run automatically. Simply remove the # sign to run them.

There are 63 eigenvalues in the PCA. The fourth largest one is virtually zero. This makes it very clear that the reconstructed values are a simple linear combination of only three sequences, presumably calculated by the RegEM machine. The sequences are not unique (which does not matter for our purposes).
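Roman's actual script is in R over at CA; for readers who want to see the fingerprint he describes, here is a minimal Python/NumPy sketch on synthetic data (all numbers invented for illustration): 63 series built from exactly 3 underlying curves produce a PCA whose fourth eigenvalue is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the reconstruction: 600 months x 63 stations,
# where every station series is a linear combination of just 3 "PCs".
n_months, n_stations, k = 600, 63, 3
pcs = rng.standard_normal((n_months, k))        # the 3 underlying curves
coefs = rng.standard_normal((k, n_stations))    # per-station combining weights
recon = pcs @ coefs                             # 63 "reconstructed" series

# PCA via SVD of the centered data matrix.
centered = recon - recon.mean(axis=0)
eigvals = np.linalg.svd(centered, compute_uv=False) ** 2 / (n_months - 1)

# The first three eigenvalues carry all the variance; the fourth is ~0,
# which is exactly the fingerprint Roman saw in the AWS reconstruction.
print(eigvals[:4])
```

The fourth eigenvalue is zero only up to floating-point noise, which is why "virtually zero" is the right reading of Roman's result.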

PCA = Principal Components Analysis, a method for determining the primary trends which comprise a set of curves.

Roman looked at the reconstructed curves (part of the data we are allowed to see from Dr. Steig), starting with the 1957 – 1979 period where there was no actual temperature data. This data was never measured; it is imputed or interpolated from other information. It isn't real, but it may be related to real data in a reasonable fashion.

He found that there were only 3 actual curves representing Antarctica in the automatic weather station reconstruction. Although many were surprised, we shouldn't have been. The paper actually states that's what should be expected. My suspicion is that Roman had read this from the beginning.

Principal component analysis of the weather station data produces results similar to those of the satellite data analysis, yielding three separable principal components. We therefore used the RegEM algorithm with a cut-off parameter k=3. A disadvantage of excluding higher-order terms (k>3) is that this fails to fully capture the variance in the Antarctic Peninsula region. We accept this tradeoff because the Peninsula is already the best-observed region of the Antarctic.

Well, his analysis revealed exactly what the paper said. (SM: this gives coefficients for all 63 reconstructions in terms of the 3 PCs.)

Three curves, each multiplied by a constant and added together, create ALL of the trends in the 63 series.

Roman then continued the analysis, since the first part covered only the period with no actual data at all.

We will assume that exactly three “PCs” were used with the same combining coefficients as in the early period. The solution is then to find intervals in the 1980 to 2006 time range where we have three (or more) sites which do not have any actual measurements during that particular interval.

Again he does something pretty sharp. Realizing that his first analysis showed 3 trends make up the total, but that he doesn't yet have the artificial data trend outside his 1957 – 1979 year range, he looks for at least 3 series which have no real data in the remaining interval. This is basic algebra – 3 equations and 3 unknowns – applied at a complex level for us mere humans.
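The 3-equations-3-unknowns step can be sketched like this (a hypothetical Python/NumPy illustration with made-up numbers, not Roman's R code): if three stations have no real measurements over an interval, each station's reconstructed value at each time step is a known linear combination of the three unknown PC values, so a single linear solve recovers the PCs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: in some 1980-2006 interval, three stations have no
# real measurements, so their reconstructed values are pure PC mixtures.
true_pcs = rng.standard_normal((120, 3))        # unknown PC values, 120 months
C = rng.standard_normal((3, 3))                 # known per-station coefficients
recon3 = true_pcs @ C                           # the 3 observed reconstructions

# Three equations in three unknowns at each time step: recover the PCs.
recovered = np.linalg.solve(C.T, recon3.T).T
print(np.allclose(recovered, true_pcs))
```

The recovery is exact (to machine precision) precisely because the reconstructed series in that interval contain nothing but the three PCs.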

His PCA analysis again reveals 3 series of data which, when multiplied by their coefficients and added together, match the artificial data perfectly. What's more, the same multipliers for each series were found. The first analysis revealed a multiplier for the 1957 – 1979 range in all 63 series, and the second analysis, on a different range of artificial data, revealed the same multiplier for 3 series. These 3 series gave him the shapes of the three curves from 1980 – 1995. Repeating the analysis again gave him the shape of the PCs for the years 1996 – 2006.

He then verified all of the curves against Steig et al.'s for the artificial portion of the data to about a millionth of a degree, so let's say it matched pretty closely.

Well, if you are like me you ask: Jeff, show me the artificial data. I want to know what it looks like.

Don't be too perturbed by the uptrend. It is actually equally likely to be a downtrend in the reconstructed temperatures, and each set of real data is infilled by these curves according to this:

GAP IN TREND = PC1 × C1 + PC2 × C2 + PC3 × C3
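As a toy illustration of the infill rule above (all PC shapes and coefficients invented for the example, nothing from the actual reconstruction):

```python
import numpy as np

# Hypothetical example of the infill rule: a 12-month gap in a station's
# record is filled with a fixed linear combination of the three PC curves.
months = np.arange(12)
pc1 = 0.01 * months           # made-up PC shapes, not the real ones
pc2 = np.sin(months / 2.0)
pc3 = np.cos(months / 3.0)
c1, c2, c3 = 0.8, -0.3, 0.5   # made-up station coefficients

gap_infill = pc1 * c1 + pc2 * c2 + pc3 * c3
print(gap_infill)
```

The point is simply that every infilled gap, at every station, is built from the same three curves; only the three constants change from station to station.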

Steve M then figured out this interesting tidbit.

The meaning of this graph is a little difficult for those who aren't mathematically inclined. It means that the PC2 curve separates the two halves of the Antarctic. The red stuff gets a positive PC2 multiplier added into the curve, while the blue stuff gets a negative PC2 multiplier. A single trend separates the halves of the Antarctic.

Conclusions for the automatic weather station reconstruction (only one of the reconstructions in the paper):

1. For those who don't like math: the new assumed data has a trend up or down depending on the multiplier.

2. The PC2 trend separates the two halves of the Antarctic.

3. Three curves are used to define the entire trend of the Antarctic.

4. Roman is smart.

Three series of assumed data to reconstruct the Antarctic seems far too small in my opinion. The Antarctic is huge; how can three trends cut it? Well, I got a bit more skeptical of this paper this morning while reading through the SI. I'm trying to keep a level head about it. After all, I was told by a colleague today that if we don't react now the ice cap will vaporize, the earth's axis will tilt, and Washington DC will flood. We wouldn't want that, would we?

Maybe I’ll sleep in.

## Kondealer said

Jeff, can you help me out here?

According to Roman, over at CA, the 3 PCs are 1.334132e+01, 6.740089e+00, and 5.184448e+00, or to all intents and purposes 13.3, 6.74, and 5.18 – all positive values, with no one of them considerably outweighing the others, hence C1, C2 and C3 become important in determining the overall trend. So what are the values of C1, C2 and C3?

Also, looking at the map: if PC2 separates the 2 halves, then clearly the “blues” of East Antarctica outweigh the “reds” of West Antarctica. This will give PC2 a negative multiplier overall. What do PC1 and PC3 represent, and will they end up positive or negative? Again, eyeballing the map suggests the negatives outweigh the positives, so how come Steig's analysis shows an overall warming trend?

## Jeff Id said

The best I can do is copy a section from the methods for you. PCA is a funny thing. Normally it wouldn’t result in 3 clean series but in this case it was a result of the method for creation of the reconstruction.

The first three principal components are statistically separable and can be meaningfully related to important dynamical features of high-latitude Southern Hemisphere atmospheric circulation, as defined independently by extrapolar instrumental data. The first principal component is significantly correlated with the SAM index (the first principal component of sea-level-pressure or 500-hPa geopotential heights for 20° S–90° S), and the second principal component reflects the zonal wave-3 pattern, which contributes to the Antarctic dipole pattern of sea-ice anomalies in the Ross Sea and Weddell Sea sectors4,8. The first two principal components of TIR alone explain >50% of the monthly and annual temperature variabilities4. Monthly anomalies from microwave data (not affected by clouds) yield virtually identical results4.

Principal component analysis of the weather station data produces results similar to those of the satellite data analysis, yielding three separable principal components. We therefore used the RegEM algorithm with a cut-off parameter k=3. A disadvantage of excluding higher-order terms (k>3) is that this fails to fully capture the variance in the Antarctic Peninsula region. We accept this tradeoff because the Peninsula is already the best-observed region of the Antarctic.

## Kondealer said

Thanks Jeff. Looks like the usual convoluted “Mann-speak”.

Now I'm not a whiz with statistics (as you've probably guessed), but I know enough about principal component analysis to know that PC1>PC2>PC3. But Steig et al say there are only 3 significant PCs – which is backed up by Roman's analysis at CA, which showed that PC4 is minute compared with PC1 to PC3. But Steig et al then go on to say that “the first two principal components of TIR alone explain .50% of the monthly and annual temperature variabilities”.

Now are they really saying that PC1 + PC2 only account for 0.5% of the variance? (unlikely as this means that >99% of the variance is unexplained), or that PC1 + PC2 account for 50% of the variance?

If so why not use plain English?

## Jeff Id said

I’m sorry about that, when I copied the post several symbols got mixed up – WordPress does things like that. I corrected the comment.

## Stevo said

Kondealer,

I think those numbers are only the first value for the 3 PCs, their value for the starting date in 1957. Roman has only put in a few entries from the first line to allow you to check for bugs.

The anomaly series has been calculated at each of 63 points, each being a different linear sum of 3 series. It would be like 63 people standing in different parts of a room with three noise sources in the corners – what each hears is a sum of the three, but weighted differently by their different distances to each source. (I’m ignoring propagation delays.) PCA allows you to take the 63 audio recordings and extract the three noise sources (the principal components). You can then do linear regression trying to fit any of the 63 recordings as a sum of the three PCs, to get the set of coefficients for that recording.
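Stevo's room analogy is easy to simulate. Here is a Python/NumPy sketch (synthetic sources and mixing weights, nothing from the actual reconstruction): PCA pulls a 3-component basis out of the 63 “recordings”, and regressing any one recording on those components fits it essentially exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stevo's room: 3 noise sources, 63 listeners, each recording a different
# weighted sum of the sources (ignoring propagation delays).
n, k, listeners = 1000, 3, 63
sources = rng.standard_normal((n, k))
weights = rng.standard_normal((k, listeners))   # distance-dependent mixing
recordings = sources @ weights

# PCA (via SVD) finds the 3-dimensional subspace the recordings live in.
centered = recordings - recordings.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :k] * s[:k]                          # the 3 principal components

# Linear regression of one recording on the PCs recovers its coefficients,
# and the fit is essentially exact (residual ~ 0).
target = centered[:, 0]
coef, *_ = np.linalg.lstsq(pcs, target, rcond=None)
max_err = float(np.abs(pcs @ coef - target).max())
print(max_err)
```

Note that the recovered components span the same space as the sources but are not the sources themselves; as noted in the post, the sequences are not unique, and that doesn't matter for back-calculating the reconstruction.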

The numbers quoted are from the three reconstructed noise sequences – the PCA1, PCA2, PCA3 values, not the C1, C2, C3 coefficients by which they are combined.

## Jeff Id said

Stevo,

I like the analogy.

## Tony Hansen said

Jeff, Is it known how much data is real and how much in-filled in this work?

## Jeff Id said

Yeah, it’s pretty sparse back in history. I should plot it.

## Kondealer said

Stevo, thanks for the analogy too. Looks like about 60+ weather stations, each making their own tune. To continue the analogy, I get the distinct impression that more are playing the “blues” – some pretty loudly.

So how come the overall regression is anything but?

## Stevo said

Kondealer,

That’s the bit we don’t know. The weather station records were first processed by Steig using an algorithm called RegEM (Regularized Expectation Maximization – what mathematicians call Tikhonov regularization or sometimes ridge regression) to estimate the missing data. And then those reconstructions were analysed using PCA to extract the three ‘tunes’. It is stated that they found only three strong ones and the rest were too small to matter, but since they’ve been truncated by Steig I don’t think we can tell. What Steig did then is to reconstruct the 63 records using just the three PCs and only publish those as the final result. And what Roman has done is to reverse this last step to go from Steig’s 63 reconstructions to the 3 PCs that made them up, and the weightings.
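For the curious, the regularized-regression core that RegEM iterates can be sketched as a plain ridge (Tikhonov) solve. This is only the single-step idea on synthetic data with a made-up penalty, not the full EM iteration Steig used:

```python
import numpy as np

rng = np.random.default_rng(3)

# RegEM's core step is a regularized (ridge / Tikhonov) regression of the
# missing variables on the observed ones. A minimal ridge solve, for
# illustration only -- the real RegEM repeats this inside an EM loop,
# re-estimating the missing values and the covariance until convergence.
X = rng.standard_normal((200, 10))              # observed predictors
beta_true = rng.standard_normal(10)
y = X @ beta_true + 0.1 * rng.standard_normal(200)

lam = 1.0                                       # made-up ridge penalty
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
max_dev = float(np.abs(beta_ridge - beta_true).max())
print(max_dev)
```

The penalty `lam` shrinks the coefficients toward zero, which stabilizes the solve when predictors are collinear or data are sparse; the choice of penalty is one of the knobs in RegEM that the published paper doesn't let us audit.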

The suspicion is that a few stations that happen to trend upwards might be being weighted more heavily than they should because of the lengths and times for which they happened to have been reporting data, or because of some other oddity. (For example, if a short-lived station shows a steep rise when all of the PCs show a shallow rise, the PCs will have to be multiplied by large coefficients to match it. This multiplies the rise in the rest of that station’s record too.) That was the sort of thing that happened with Mann’s earlier methods. But we don’t know. It may have been compensated for and such problem cases thrown out, but Steig probably wasn’t looking very hard for that sort of thing and they haven’t provided the intermediate data.

Since there's virtually no data for that area before the satellites, it's even possible that they're right and that it rose initially before levelling off for the last thirty years. We can't presuppose that it didn't rise any more than they should presuppose that it did. ‘No data’ means we don't know.

I haven’t got deeply into this one, I haven’t read Steig’s paper, and I’m only going by what I’ve read on ClimateAudit, and some intelligent (?) guesswork. That’s just my understanding of the situation, and my apologies if I’ve got anything wrong.

I suspect SteveM and friends are a bit too busy at the moment for providing patient explanations for the uninitiated, but I expect we’ll be getting some easier post-event analysis/summary after all the current fuss has died down.

## Kondealer said

Thanks Stevo- this is basically what I suspected (what weightings were attached to the various reconstructions) and what I asked over at RC (and was cut).

I'm afraid it is looking more like a smoke and mirrors job, in a similar vein to the “Hockey Stick”, where a few reconstructions, heavily weighted, determined the outcome of the final signal :-(