## PCA, SAMPLING ERROR, AND TELECONNECTIONS

Posted by Jeff Id on March 30, 2010

This is a guest post from Ryan O. Ryan was principle author of our submission on Antarctic temperature reconstructions which used PCA based methods for combination of satellite and surface data. This post has some very loose basis in that work. When I first learned of ‘teleconnections’ it was at Climate Audit while looking at some paleo papers. I thought it was a joke but quickly figured out that it was real and based on the potential for weather patterns in distant regions to be related/correlated. For instance, if you have stable prevailing winds over long distances and time. (i.e. E-W over the US). You might find some kind of connection with the weather in Chicago to the farther west- Iowa but less to to the south or north. This is about as simple as it can be stated, b/c some suggest negative teleconnections, polar ones, and all kinds of wild stuff.

PCA analysis in data with spatial correlation will bring out patterns similar to vibrations on a drum head. Because the patterns in the drum head aren’t perfectly symmetric from climate data, it’s easy for the human mind to attach spurious climate meaning to them. In this post Ryan used VERY general properties of satellite data to create an otherwise random synthetic data set and then created nearly identical PC patterns to the actual climate data as measured by UAH.

————

### Ryan O–

One of the favorite concepts popularized by climate science is that of the “teleconnection”, where large-scale dynamics are inferred from inspection of patterns resulting (primarily) from EOF or PCA analysis. These patterns are assumed to have physical meaning, either directly or indirectly. The concept is used in certain paleoclimate reconstructions (such as MBH 98) where proxies are assumed to “teleconnect” to the average earth temperature. It also appears in some climate-related EOF/PCA texts and in analyses of atmospheric pressure, precipitation, ocean current, and a myriad of other related topics. Given how pervasive “teleconnections” are used to study our planet, one might assume that they have a solid theoretical foundation.

As with other things in climate science, one would be wrong.

The issue of teleconnections and physical meaningfulness has come up here before – most notably with the work on the Steig Antarctic temperature reconstruction. The primary point of contention there was that Steig limited his analysis to the first 3 eigenvector pairs by using the “physical meaning” argument and a sampling error argument from North (1982). Steve McIntyre showed that very similar (and physically non-meaningful) patterns result from EOF analysis on an object shaped like Antarctica with exponential correlation functions and precisely zero physical dynamics.

So the question becomes . . . does a strong argument exist for the existence of *real* teleconnections as inferred from EOF/PCA analysis?

SYNTHETIC DATA

Testing for real teleconnections using real-world data is a futile endeavor. Due to sampling error, EOF/PCA analysis mixes information between eigenvectors. For global temperature data, it takes somewhere close to 10,000 individual values (or about 830 years of history) per grid cell to overcome sampling error for all but the first one or two eigenvectors. Since we only have about 30 years of truly global information (satellites), it is not possible to uncover the “true” eigenvectors no matter how the information is sliced, diced, and hashed. Land temperature indices such as GISS and CRU have longer histories, but since these infill uninstrumented grid cells, the eigenvectors are hopelessly contaminated by the assumptions used to perform the infill.

However, it is possible to generate synthetic data where the true eigenvectors are known. In fact, we can generate synthetic data where *no actual teleconnections exist* – points are related to each other solely by a correlation-with-distance function. We can then take samples from the data and test whether we observe “teleconnections” that are similar to what we observe from real-world data. If we do, then we have shown that it is possible for “teleconnections” to arise entirely from sampling error and/or be a result of assigning physical meaning to a quantity that has none.

I will defer a discussion about the method of generating the data to the end of this post. For the moment, let us assume that our synthetic data has the following characteristics:

- Correlation distributions similar to the real earth
- Variance distributions similar to the real earth
- Randomly generated temperature values using weakly autocorrelated noise (lag-1 value of ~0.2, which is similar to that observed in linearly detrended UAH/RSS series)
- Points related entirely by a correlation-with-distance function

We will then take a 372-month sample from our synthetic data (equivalent to the 31-year history of UAH/RSS) and attempt to find evidence of teleconnections via EOF/PCA analysis.

THE RESULTS

Fig. 1 shows the results of the following (in a covariance setting):

- Left column: The TRUE, perfectly known, eigenvectors
- Middle column: The eigenvectors determined from PCA on a 372-month sample
- Right column: The eigenvectors determined from the UAH data set

Now examine the middle (SAMPLE) column. Note how eigenvectors #1 and #2 are mixed in the sample. There are other clear examples of mixing, too: SAMPLE eigenvector #7 contains elements of TRUE eigenvectors #3 and #4. This mixing occurs despite fairly large separation in the eigenvalues (Fig. 2):

Eigenvector #1 is well-separated from #2 in the sample, and eigenvector #7 is also well-separated from #3 and #4. Despite this separation, clear mixing has occurred. This phenomenon of mixing even when the North (1982) criterion is met has been noted by other authors (such as Compagnucci and Richman, 2007), yet seems to be largely ignored. This causes authors to assume incorrectly that well-separated eigenvectors represent individual physical dynamics.

Now what about teleconnections?

The first one that we will look at is eigenvector #3 in the sample panel. Note how both the Arctic and Antarctic regions show a strong blue color. If we were climate scientists, we might call this evidence of a “teleconnection” and postulate a dynamical means of coupling the poles. We might even give it a name – like “Coupled Annual Mode” or something similar.

Unlike climate scientists, however, we have access to the *true* data, and we can see if the poles have any real physical coupling. So let’s look at the true covariance-with-distance plot for the south pole (Fig. 3):

No coupling; no teleconnection. No physical process to name.

Okay . . . so we have fake teleconnections that appear in the SAMPLE data. But what about the TRUE eigenvectors? Is it possible that even the TRUE eigenvectors show fake teleconnections?

Absolutely!

Examine #3, for example. #3 shows a strong inverse relationship between the north pole and latitude ~75. Does this strong inverse relationship actually exist? Let’s look at correlation-with-distance for the north pole:

No inverse relationship there.

How about the inverse relationship between Canada and Siberia in TRUE eigenvector #4? Again, not real. What about the inverse relationship between the Bering Strait and North Atlantic in #5? The inverse relationship between Antarctica and the Southern Ocean in #6? The bands of #7? Or the coupling of Canada and China in #8? Sorry . . . not a real teleconnection among them.

So if even the TRUE eigenvectors – which have zero sampling error – show teleconnection-like patterns that have *no physical meaning*, how do we interpret “teleconnections” in real-world data?

The simple answer is that we don’t.

Individually, the eigenvectors in our synthetic data are meaningless – except for perhaps eigenvector #1 in the presence of a global trend in the data. If there is a global trend, it will be carried by eigenvector #1 (along with sampling error). Other than that, it is only the *sum* of a large number of eigenvectors that has any meaning . . . because the sum of a large number of eigenvectors approximately returns the original data.

For those of you familiar with Fourier and wavelet analysis, there are some obvious similarities to EOF/PCA analysis. The individual frequencies/wavelets have no physical meaning. It is only the sum of many that reproduce the signal or feature of interest. As feature complexity increases, the number of modes necessary to reproduce it increases. This is why it should be no surprise that for our Antarctic analysis, we found that a very large number of eigenvectors (about 80) needed to be retained to maximize the verification statistics. For our synthetic set, with standardized data (i.e., correlation rather than covariance), the Fourier/wavelet analogy is particularly apt, as plotting the correlation spatial eigenvectors vs. latitude yields wavelet-like shapes (Fig. 5 and Fig. 6):

North (1984) also notes that decompositions can result in standing wave patterns. These standing waves are mathematical artifacts of the decomposition and result entirely from the conditions imposed during the decomposition; namely, that each successive mode explains the maximum possible variance in the original data and is mathematically orthogonal to all other modes. If we were to use different conditions, we would get different patterns.

Once we begin assigning physical meaning – and thus begin interpreting features as real “teleconnections” – we are treading on very shaky ground. While we could certainly insert real teleconnections into our TRUE data and see that these are reflected in one or more eigenvectors, we have seen that, completely in the *absence* of real teleconnections, we can also generate eigenvector patterns that look very, very similar to those found in real data sets – like UAH. Since we do not know the TRUE eigenvectors for any observational data set, we have no way to distinguish teleconnections that are real from those that are purely mathematical artifacts.

So . . . we should ask ourselves . . . what do we think would happen if we extrapolate these “teleconnections” back in time for 1,000 years or more?

Hm.

And at this point, those of you who are not interested in how the synthetic data was developed may stop reading!

GENERATING THE SYNTHETIC DATA

Given that:

SVD(X) = U D t(V), and,

SVD(C) = V D^2 t(V),

Where X is the original data, C is the covariance matrix, U is the temporal eigenvectors, and V is the spatial eigenvectors, we can generate perfectly known spatial eigenvectors by building a covariance matrix and taking the SVD.

To prevent inappropriate weighting of higher latitudes, rather than use the typical 2.5 x 2.5 degree grid, we use a grid with latitudinal divisions of 2.5 degrees and longitudinal divisions that change with latitude, such that all grid cell areas are approximately equal. This yields a 6596-point grid.

To build an earth-like covariance matrix, we examine correlation-by-distance of the UAH or RSS data in longitudinal and latitudinal slices. Examples look like (where blue indicates correlation between ocean points, green between ocean and land, and black between land):

We must then determine what type of function-with-distance best approximates this behavior.

Such a function must capture how correlation length changes with latitude, as well as the difference in correlation length when moving longitudinally vs. moving along latitude lines. For global data, neither exponential nor Gaussian functions accurately approximate this behavior. Instead, a combination of exponential, Gaussian, and floor must be used:

The equation that relates the exponential, Gaussian, and floor is the following:

r = c + (1 – c) * {e^(-d^2 / a^2) + (1 – b) * e^(-d / b)}

In this relationship, c represents the floor (or minimum value of correlation coefficient), d is the distance between points, a is the scaling parameter for Gaussian decay, and b is the scaling parameter for exponential decay. Note that no physical mechanism is proposed. We are simply fitting functions based on the observed behavior of correlation-with-distance on earth.

Because these functions, change with latitude, we must also define functions to describe a, b, and c with latitude:

From here, we define an *effective* distance between points, which is a coordinate system change that maps our complex relationship with distance to a sphere with a simpler, purely exponential relationship:

Effective delta-lat = ln(r_lat_max)

Effective delta-lon=ln(r_lon)

We then take our effective delta-lat and delta-lons, and, using the Vincenty formula, calculate the effective great-circle distance on the sphere (d_eff). We can then re-transform this result back to the real earth with the following:

r = e^(-d.eff)

This yields a 6596 x 6596 correlation matrix that approximates the real-world correlation-with-distance for the entire globe. We can transform this to a covariance matrix by applying smooths of the observed covariance-with-latitude to this matrix. We define this to be the TRUE covariance matrix of our synthetic planet. Note that since it was developed in the absence of temporal data, *there is no sampling error.*

We then uncover our TRUE eigenvectors by performing a eigendecomposition of the covariance matrix.

Generating the SAMPLE data set is now quite easy. Since:

SVD(X) = U D t(V),

And since we have determined D and V from our eigendecomposition of the covariance matrix, we can randomly generate U. In this case, I used the actual lag-1 through lag-5 autocorrelation coefficients from the UAH data to generate 6,596 individual “eigenvectors” with length 372 points. This defines matrix **U**. We can then obtain our synthetic temperature data via matrix multiplication:

U D t(V) = X

Voila! A synthetic data set with a perfectly known spatial structure and randomly generated temporal data. If we want, we can generate thousands of Us and perform all kinds of Monte Carlo analysis. We can also add signals and see if whatever regression method we choose can accurately extract them. We can truncate the data and measure our ability to reproduce the missing data.

Or, if we want, we can try all different sorts of Vs and do the same thing.

Hm.

What will we test first? Haha!

SCRIPT

Here are the functions used for generating the above. These are only the functions, not the commands. Anyone interested in actually running this, let me know. Some explanation is required.

http://docs.google.com/Doc?id=dcjxztvr_230cnhdx5d7&btr

## Mike Dubrasich said

Excellent discussion. Key points: PCA (principal component analysis) involves creation of “synthetic” data. The synthetic vectors are made of artificially weighted bits and pieces of real data. The weighted and fragmenting is done to force high correlations between the vectors. It is often said that correlation is not causation, but in the case of PCA, the correlations are not even real correlations.

PCA is used in a variety of disciplines. My familiarity with it comes from wildlife studies, where PCA is used to “find” associations between animal (or plant) species. The apparent associations (correlations) are forced by the statistical methodology. The vectors are not real world phenomena. The results are often ridiculous.

A hypothetical example, Habitat A is found to be 0.5 robins, 0.2 hawks, and 0.7 chickadees, and Habitat B is 0.3 robins, 0.6 hawks, and 0.4 chickadees.

The eigenvectors are significantly different according to the p-values. Therefore a scientific discovery has been made! Habitat A and Habitat B are different!

Except they aren’t. The whole analysis process is designed toward finding differences that have no correspondence to anything in the real world. The “results” from PCA analyses are completely unreal and without any inferential value at all. Science is based on logical inference. PCA is something else entirely.

## AMac said

From Wikipedia, the acronym “EOF”:

## Kenneth Fritsch said

Ryan O, great post and much appreciated.

I have been looking at differences between station temperature anomalies within 5 x 5 degree grids and what might be associated with these differences. I am finding some interesting relations but more analyses are required to say anything concrete about my findings to date. It is posts like yours that encourage the creative juices to flow and to look at things out of the box.

## Layman Lurker said

Great post Ryan. It brings me back to the discussions wading through these issues during the Steig deconstruction. I am looking forward to reading the paper when it comes out to see how the issues were worked through.

Your post begs the obvious question: What are the implications for the many climate reconstructions using these methods? In the case of Steig (limited data and only 3 PC’s for reconstruction) the problems are obvious.

## Layman Lurker said

When you relate your post here back to Steig, it really demonstrates how a cluster of points in a unique geographic area like the Antarctic peninsula can bring all kinds of havoc with spurious correlations into the reconstruction recipe used by Steig.

## Layman Lurker said

#1

The artifacts displayed by Ryan in his synthetic data might be accounted for by autocorrelated data within the geometry of a sphere – similar to the chladni patterns found by Steve McIntyre when looking at Steig et al. I don’t think it is fair to toss all PCA analysis out based on Ryan’s post.

## Peter Dunford said

Thanks Ryan. So PCA is really just another tool for picking the bit of noise that most looks like the signal you want to see, and calling it “evidence”. And getting funding.

If the true meaning of CPS is Cherry Pick and Scale, what is PCA? Pick your Correlation and Assert? Phony Concocted Argument? Poppycock Confounds Apprehension? Plucked Crap, [from] Arse?

## Jeff Id said

#7, Ryan I hope can answer, but PCA is not the problem. The problem is that it’s a little complex and it provides a lot of opportunities to make mistakes.

## Mark T said

I can guarantee PCA has plenty of applications in which the results are robust and the data generated are very real. However, in such applications, what is being searched for has distinct distributions with noise characteristics that are known a priori.

Keep in mind that one of the “easy” ways to being PCA is to implement singular value decomposition (SVD), which is used quite regularly to find orthogonal axes with which to represent data. In my field (signal processing, esp. communication theory), PCA-like algorithms are regularly used to isolate multiple spatially/temporally distinct signals from a single data stream. Your cell phone would not work very well without this ability.

Mark

## Mark T said

Uh, that would be “ways of

beginningPCA” in the 1st sentence of the 2nd paragraph.Mark

## Ryan O said

#1 & #7 PCA is an extremely useful tool. Not only is it used for signal processing as Mark explained, it is also extensively used for image compression, image analysis, data compression, and a whole host of other similar applications. Many things you take for granted would not be possible without this generic class of data decompositions, which includes SVD, PCA, ICA, and QR to name a few.

As Jeff and Mark mentioned, PCA is NOT the problem. The problem comes in when people begin assigning physical meaning to the eigenvectors. The distribution of points within an individual eigenvector are determined by the mathematical constraints used for the decomposition. Unless the physical processes are

known ahead of time, it is likely that the resulting eigenvectors will not INDIVIDUALLY reflect physical processes.This doesn’t mean that the eigenvectors are nonsense . . . it just means that you can’t assign physical meaning to

individualones. Subsets of eigenvectors may capture real physical processes (along with “noise”) . . . but finding these subsets requires more work than simply looking and saying, “Hey! That looks like the SAM index,”a laSteig & Co.However, doesn’t prevent you from using them for prediction in regression analysis. The important point is that you must use an objective set of criteria for determining which ones to retain. For our Antarctic work, the criterion was cross-validation testing, wherein you deliberately withhold data and calculate the ability of your eigenvector selection to predict the withheld data. If the prediction is poor, then you have the wrong subset. You iterate through the eigenvectors in this manner in order to choose the optimal set for prediction. In other words, eigenvectors that improve predictive ability are retained; those that do not are discarded. In this way, you attempt to separate those eigenvectors that may contain elements of real physical processes from those that largely do not. It is not perfect, and cannot be. Any finite data set contains a finite amount of information. No matter what you do, you cannot overcome sampling error – because you cannot extract information from the set that is not contained in the set. The smaller the set, the bigger the sampling error, and the greater the chance that your results will deviate from reality.

However, just as it is irresponsible to claim something that the data does not support, it is likewise irresponsible to throw out a legitimate tool because some people happened to use it improperly.

## Layman Lurker said

Mark, your cell phone example is a good case to differentiate from a climate reconstruction. In the case of the cell phone, the algorithms which were developed to recover the signal could be tested and verified against something which is known. I think Ryan has shown through his post that methods for reconstructing unknown climate signal should be developed in a similar manner with simulated data and signal.

## Mark T said

I realize the distinction between what I mentioned and what Ryan mentioned. A fact I’ve always known, and have occasionally mentioned on this site.

Ryan mentions ICA, something I am fond of. It is better at reconstructing original sources blindly – “blind” is nearly always impossible through variance alone, but ICA has the added benefit of an assumed independence between the source vectors which allows higher order estimation methods. Independence, of course, is impossible to prove with the proxy sources used in the climate world; they probably aren’t even uncorrelated as required by PCA.

Mark

## Peter Dunford said

#11

My apologies, I’ve only ever come across PCA in connection with work by the Hockey Team, usually being pulled apart on sites like this and Climate Audit, and so believed it to be another dubious statistical “trick”.

## Mark T said

Peter Dunford said

March 31, 2010 at 7:54 am

Actually, there is probably some legitimate work that could be done using PCA by the climate team, they just refuse to fix their errors and validate their assumptions. Stationarity comes to mind… as does linearity… and a host of other issues.

But still, it is possible to glean information from the proxies, you just have to be careful when you assign any level of statistical validity to the work, calculated or otherwise, when you’re guessing at a lot of the parameters. Certainly you don’t go on record making claims that your treemometers are more accurate 1000 years ago than actual thermometers are today… not without sounding like an idiot that flunked out of a PhD physics program, at least.

Mark

## Steve McIntyre said

I missed this post the first time round. This is really, really well done and interesting.