## How Many PC’s Does it Take to …

Posted by Jeff Id on February 17, 2009

I had an idea earlier today that running RegEM on the massive satellite dataset in the Antarctic reconstruction paper may actually be equivalent to running RegEM on only 3 series. This oddity was revealed by Roman M's brilliant analysis at Climate Audit, where he discovered that the satellite reconstruction data is entirely represented by 3 PCs for over 5000 gridcells: a very small number of series for so massive an amount of data. Each gridcell series is created by multiplying the three trend curves by that cell's three multipliers (3 x 5000-plus weights in all) and adding the results together.
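To make that structure concrete, here is a minimal numpy sketch on synthetic, hypothetical data (not Steig's actual values): however many gridcell series you build from 3 curves, the field carries only 3 independent series.

```python
import numpy as np

# Synthetic illustration (hypothetical data, not Steig's): build a field of
# 5509 gridcell series from 3 PC curves and 3 weights per cell.
rng = np.random.default_rng(0)
n_months, n_cells = 600, 5509

pcs = rng.standard_normal((n_months, 3))      # the 3 PC "curves"
weights = rng.standard_normal((3, n_cells))   # 3 multipliers per gridcell

# Every gridcell series is a weighted sum of the same 3 curves.
field = pcs @ weights                         # shape (600, 5509)

# Despite 5509 series, the field contains only 3 independent ones.
print(np.linalg.matrix_rank(field))           # 3
```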

The way the paper seems to work is to use all the complexity of the 1982-2006 satellite data, with its wide variety of covariances relative to the 42 surface measurements, in a sophisticated bounded imputation algorithm (RegEM) to reproduce data back to 1956. Well, it turns out that doesn't seem to be the case.

I used the 3 back-calculated PCs that RomanM derived, which extend back to 1957, and deleted all values prior to 1982. These three temperature series were then placed next to the 42 surface stations in a matrix, and RegEM was used to reproduce the 3 series.
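The shape of this experiment can be sketched in a few lines of Python on toy data. Note the heavy assumptions: the data are synthetic, and plain least squares over the overlap period stands in for RegEM's regularized EM imputation, which this is not.

```python
import numpy as np

# Toy version of the experiment (synthetic data; ordinary least squares
# stands in for RegEM's ridge-regularized EM step).
rng = np.random.default_rng(1)
n_months, split = 600, 300                           # "split" ~ start of satellite era

stations = rng.standard_normal((n_months, 42))       # 42 surface station series
true_pcs = stations @ rng.standard_normal((42, 3))   # toy PCs, linear in the stations

# Pretend the pre-split PC values were deleted; fit PCs to stations using
# only the overlap period, then infill the missing early values.
coef, *_ = np.linalg.lstsq(stations[split:], true_pcs[split:], rcond=None)
infilled = stations[:split] @ coef

err = np.max(np.abs(infilled - true_pcs[:split]))
print(f"max infill error: {err:.1e}")                # tiny here, since the toy relation is exact
```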

The series on the left are the Steig PC’s, the ones on the right represent the recreated PC’s (the data prior to 1982 in the right side graphs were calculated using RegEM).

Below is the difference between the plots; note how SMALL it is.

Below is a plot of the number of temperature stations available in each year.

What this shows me is that the reconstruction, with all its complex covariances, actually comes down to RegEM infilling of 3 PCs based on the data from the thirty-odd pre-1980 temperature stations. To be very clear: this is a reconstruction of the AWS data for an entire continent based on RegEM of three curves!

What makes this significant is that, from reading the paper, one assumes the AWS data covariance is used to determine station weighting relative to individual ground positions. Instead, there is little information here to separate the locations of the ground stations, as would be required for a proper RegEM reconstruction. We can also now assume that the AWS reconstruction was done on 3 PCs rather than the entire set of data.

*———-*

I think that, judging by the number of reads and the lack of comments, this may be a bit too confusing the way it's written, so I have added a bit more explanation.

First, a PC, or principal component, is basically a curve. The curves in this case were calculated to be a best match to the two reconstructions (the satellite and AWS data) for the Antarctic. Each of the 3 curves gets a multiplier, which can be positive or negative, and the weighted curves are added together to recreate the entire field of satellite measurements. This process can do a good job of representing a field, but limiting the number of PCs limits the level of detail in the resulting field. Still, if done right, the trend should be accurate.
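For readers who want to see where such curves come from, here is a hypothetical sketch using numpy's SVD on a purely synthetic field (not the actual reconstruction data): the leading components are the "curves", and their spatial weights are the multipliers.

```python
import numpy as np

# Hypothetical sketch of PC extraction via SVD on a synthetic field.
rng = np.random.default_rng(2)
field = rng.standard_normal((600, 20)) @ rng.standard_normal((20, 5509))

U, s, Vt = np.linalg.svd(field, full_matrices=False)
pcs = U[:, :3] * s[:3]   # the 3 "curves" (time series)
eofs = Vt[:3]            # their spatial multipliers, one triple per gridcell

# Keeping only 3 PCs loses detail: the truncated field captures just part
# of the total variance.
approx = pcs @ eofs
explained = np.sum(s[:3] ** 2) / np.sum(s ** 2)
print(f"variance captured by 3 PCs: {explained:.0%}")
```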

From Jeff C, the trend in the field from the 3 PCs looks like this.

The above graph comes from the data presented by Dr. Steig at his website; Roman discovered this data was actually PC data and did not include the real satellite data. Jeff C simply plotted all the points on a map of the Antarctic. Jeff's graph shows a positive trend at nearly every point in the Antarctic, but you can still see some level of detail in the trends, which lets us know that the 3 PCs can make a nice field. Note that the title of the graph goes back to 1957, yet the AWS data only exists from the 1980s onward.

In RegEM, a temperature station would be expected to have a high correlation with the real satellite data at the same point, so RegEM would assign a high weighting from that nearby surface station's historic data to that individual point, creating a reasonable spatial trend.
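The localizing effect of correlation is easy to demonstrate on synthetic data (hypothetical numbers, not the actual satellite record): a station correlates most strongly with the series at its own gridcell, which is what would let an imputer weight it locally.

```python
import numpy as np

# Synthetic illustration: a station sited at gridcell 37 correlates most
# strongly with that cell's own series. (Hypothetical data.)
rng = np.random.default_rng(4)
grid = rng.standard_normal((600, 100))                   # 100 toy gridcell series
station = grid[:, 37] + 0.1 * rng.standard_normal(600)   # station near cell 37

corr = np.array([np.corrcoef(station, grid[:, j])[0, 1] for j in range(100)])
print(corr.argmax())   # 37: the station's own gridcell wins
```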

By using a low number of PCs in the reconstruction rather than the actual satellite data, the likelihood of an individual station receiving proper weighting is reduced; by my own guess, it seems pretty minimal. That means the peninsula stations would have their already exaggerated weighting (see Jeff C's post) applied to the entire reconstruction.

The first ten times I read the paper, I assumed real data was used to properly match surface station data to the satellite data. This result clearly shows that processed data from 3 PCs was used to create the historic trends for the entire satellite record. My result above has a very small reconstruction error, which shows that the pre-satellite data can be recreated with minimal concern for the weighting of individual stations.

This isn’t final proof or anything but it is strong evidence to me that the peninsula stations are likely exaggerating the trend of the entire reconstruction.

## Stephen McIntyre said

Jeff, I don’t have a problem per se with the number of curves being three. In fact, it’s not obvious to me that you want a lot of PCs. The more salient issue IMO is that the RegEM algorithm is somewhat complicated, but not necessarily “sophisticated”. Buried beneath the algorithm is a form of regression relationship. Use of this method distracts from usual regression diagnostics, replacing forms of analysis that are well understood with ones that do not generate useful diagnostics. After these analyses, we know nothing about the coefficients, for example. To get at them, the algorithm needs to be rewritten a bit. I think that I’m getting close to accomplishing that.

## Jeff Id said

Steve,

We now know that the reconstruction of satellite data used PCs instead of the actual data.

I wasn’t very detailed in my complaint about 3 PCs for the continent. Certainly 3 PCs can represent a trend with substantial detail over the surface area; it’s quite similar to some optics problems I work on. What happens with fewer PCs is that real boundaries become fuzzy. The ‘real’ problem to me, though, is in the surface station weighting for imputation of missing values in the reconstruction.

The implied positional information of each surface station is, from my admittedly limited understanding, determined by its covariance with the PC-weighted series at each grid location. Since we only have 3 PCs weighted for the various gridcells, it seems to me that it would be pretty hard to achieve proper association of the satellite trends with individual stations in RegEM. A station in the peninsula can therefore lock onto a trend on the other side of the continent which has too much weight in one PC or another. All kinds of mixed-up effects can occur, completely without the knowledge of the scientist making the calculation.
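That loss of positional information can be demonstrated on a synthetic rank-3 field (hypothetical data, not the actual reconstruction): two gridcells whose weight triples are proportional have perfectly correlated series, so correlation cannot tell a station's own cell from a distant one.

```python
import numpy as np

# Synthetic illustration: in a rank-3 field, a "distant" cell with weights
# proportional to the station's cell is statistically indistinguishable from it.
rng = np.random.default_rng(5)
pcs = rng.standard_normal((600, 3))
w = rng.standard_normal((3, 100))
w[:, 80] = 2.5 * w[:, 10]         # a distant cell with proportional weights
field = pcs @ w

station = field[:, 10] + 0.05 * rng.standard_normal(600)  # station at cell 10
corr = np.array([np.corrcoef(station, field[:, j])[0, 1] for j in range(100)])
print(round(corr[10], 3), round(corr[80], 3))   # both near 1: indistinguishable
```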

If actual data had been used, perhaps a hypothetical temperature spike occurring on the northeast edge of the Antarctic could help weight RegEM toward the correct series. Without it, mush.

## Jeff C. said

I think this might be relevant as to whether a small number of stations and 3 PCs are adequate to describe the temperature trends over the entire continent. Here is a plot of the trends for the 5509 gridcells from Steig’s satellite reconstruction from 1957 to 2006.

Of the 5509 gridcells, 5508 show warming and one shows cooling (yes, there actually is one blue dot in there). If this really were a good representation of high-resolution satellite measurements, shouldn’t there be cooling over more than 0.02% of the gridcells? Even if the overall trend were toward warming, shouldn’t at least some small portions of a diverse continent show cooling?
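A quick synthetic sanity check supports the intuition (hypothetical data, not Steig's reconstruction): even when every gridcell shares a genuine warming trend, measurement noise alone should leave well more than one cell in 5509 with a negative fitted slope.

```python
import numpy as np

# Synthetic check: shared warming trend plus noise in every gridcell, then
# count how many cells still come out with a negative fitted slope.
rng = np.random.default_rng(3)
n_months, n_cells = 600, 5509
t = np.arange(n_months)

field = 0.0003 * t[:, None] + rng.standard_normal((n_months, n_cells))

# Least-squares slope for every gridcell at once.
A = np.column_stack([t, np.ones(n_months)])
slopes = np.linalg.lstsq(A, field, rcond=None)[0][0]

cooling = int((slopes < 0).sum())
print(f"{cooling} of {n_cells} gridcells show cooling")   # far more than 1
```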

## Peter D. Tillman said

For newcomers to PCA, there’s a tutorial at the CA wiki that may help:

http://climateaudit101.wikispot.org/Principal_Component_Analysis

Keep up the good work! And publish it!

Cheers — Pete Tillman