REGULARIZED LEAST SQUARES RECONSTRUCTION
Posted by Jeff Id on July 15, 2009
Another great guest post. Ryan figured out a method to do a stable multivariate regression without overfitting, with results similar to TPCA. It’s comforting to see that even though the trend is slightly lower, the results are very similar to the corrected EM reconstruction and the area-weighted reconstruction. The math here is pretty sharp stuff.
One of the problems that Jeff Id has raised on numerous occasions is that none of the reconstruction methods we have examined and tried are truly reconstructions of surface air temperature. They are all in some fashion a mix between surface air temperature and skin temperature. Even the weighted eigenvector reconstructions only fully rotate the AVHRR PCs in the limit where the weight assigned to the AVHRR PCs approaches zero, and there are practical computational limits on how low that weight can be set.
There are three different ways to resolve this problem that have been batted around. One is to calibrate the AVHRR data to the ground data. I had attempted to do this in the past without satisfactory results because the station data is so spatially incomplete. A second way is to supplement the weighted eigenvector reconstruction by assigning the stations an area of influence and multiplying them by an additional weighting factor. Jeff Id is currently working on this method. A third way is to not use any AVHRR temperature information at all. Instead of using the PCs (temporal EOFs), we will simply use the eigenvectors scaled by the eigenvalues (spatial EOFs).
To start, let us define our spatial EOFs as:
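In symbols, with Up the rows of the eigenvector matrix at the station locations, Σ the diagonal matrix of eigenvalues, and n the number of rows (the 1/√n convention is a guess; any constant scaling of Lp cancels when the reconstruction is assembled, since the same factor appears on both sides of the fit):

```latex
L_p = \frac{1}{\sqrt{n}} \, U_p \, \Sigma
```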
Lp is then a matrix containing columns with the eigenvector weights at the station locations for the first N satellite EOFs (U), scaled by the eigenvalues (sigma), and additionally scaled by the square root of the number of rows.
The problem we want to solve, then, is how to predict what the PCs should be in order to explain the maximum variance at the ground stations. This provides the maximal rotation of the PCs to the ground stations because it does not require the use of any satellite temperature information. We only use covariance information. Mathematically, this means finding a vector of temporal weights (a) for each time (t) given a vector of ground station temperature measurements (d):
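In matrix form, the model being fit for each month t is:

```latex
\mathbf{d}_t \approx L_p \, \mathbf{a}_t
```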
The least-squares solution to this problem is given by the pseudoinverse of Lp times the station data vector:
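That is, assuming Lp has full column rank:

```latex
\mathbf{a}_t = L_p^{+}\,\mathbf{d}_t = \left(L_p^{\mathsf{T}} L_p\right)^{-1} L_p^{\mathsf{T}}\,\mathbf{d}_t
```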
I had tried this in the past and gotten reconstructions that were massively overfit. The reason for this behavior is that the eigenvectors were derived from the satellite data (so they will not be a perfect representation of station covariance) and that the station data itself contains noise. Dr. Beckers pointed me toward his 2006 paper where they used a regularized version of this to estimate the rms noise for the entire field when infilling. In our case, using the regularized version prevents the overfitting by limiting the fit based on the amount of assumed noise. The regularized equation is:
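This is the standard ridge (Tikhonov) form of the normal equations:

```latex
\mathbf{a}_t = \left(L_p^{\mathsf{T}} L_p + \mu^2 I\right)^{-1} L_p^{\mathsf{T}}\,\mathbf{d}_t
```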
I is the identity matrix and mu squared is the mean square noise. Calculating the mean square noise is fairly simple:
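Written out, with x the temperature calculated from the first N satellite EOFs and ξ the ground station data at the n station points:

```latex
\mu^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \xi_i\right)^2
```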
Here, mu squared is simply the mean of the squared point-wise differences between the temperature calculated using the first N satellite EOFs (x) and the ground station data (the Greek letter xi). In other words, it is the mean square residual: the amount of station variance left unexplained by the satellite data.
Note: If you look at Dr. Beckers’ paper http://www.ocean-sci.net/2/183/2006/os-2-183-2006.pdf?FrameEngine=false , you will note that he calculates mu squared differently. I have sent him an email about this as I believe his listed equation is in error. The equation gives the mean square difference, which is not the same as the mean square error. For example, Dr. Beckers’ equation would yield an error of zero when comparing two sine waves 180 degrees out of phase, which is clearly not the intent. I believe this is a simple misprint.
Regardless, the correct form of the equation for our purposes (which are slightly different than Dr. Beckers’ – though I do not see why that would require use of the mean square difference) is the one I have listed above.
(Incidentally, the math behind our Iterative TSVD is identical to that used by Dr. Beckers’ DINEOF algorithm, so if you are interested in the mathematical basis for our infilling algorithm, his 2003 and 2006 papers provide an excellent reference.)
In order to perform a reconstruction with this method, we must first infill the station data. As we have in the past, we will use the Iterative TSVD algorithm with the mode set to covariance, station set GRID 1C (all on-grid stations with at least 96 months of data), and impute up to 5 retained EOFs. This infilled station set is identical to the station set we used for the previous covariance-based reconstruction.
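As a rough illustration of the infilling idea (this is not the actual Iterative TSVD code; the covariance mode, the real station set, and convergence handling are all omitted), a truncated-SVD imputation can be sketched as:

```python
import numpy as np

def iterative_tsvd_infill(X, n_eofs=5, n_iter=50):
    """Infill NaN gaps in X by iterating a truncated SVD.

    A minimal sketch of DINEOF-style imputation: seed the gaps with
    column means, then repeatedly replace them with the rank-n_eofs
    SVD reconstruction until the values settle.
    """
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X = np.where(miss, col_mean, X)          # initial guess for the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        recon = (U[:, :n_eofs] * s[:n_eofs]) @ Vt[:n_eofs]
        X[miss] = recon[miss]                # update only the gaps
    return X

# toy example: an exactly rank-1 "station" matrix with gaps knocked out
t = np.linspace(0, 10, 200)
truth = np.outer(np.sin(t), [1.0, 0.5, -0.8, 0.3])
gappy = truth.copy()
gappy[::7, 1] = np.nan
filled = iterative_tsvd_infill(gappy, n_eofs=1)
```

Because the toy matrix is exactly rank 1, the iteration recovers the missing entries almost exactly; real station data, of course, is noisier and the retained rank is a judgment call.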
We then take the SVD of the transpose of the AVHRR anomaly data (so that the left-hand eigenvector contains the spatial information and the right-hand eigenvector contains the temporal information). Now we can extract the eigenvector values at the ground station locations for each spatial EOF, arrange them in descending order in matrix U~, and multiply each column by its respective eigenvalue. This defines Lp. From there, we compute the pseudoinverse and calculate matrix A, with each column corresponding to a given satellite EOF and each row equal to the vector a for that particular month. Note that each column of A is a principal component, but calculated by fitting the ground station temperatures to the eigenvectors rather than from satellite data.
From there, the reconstructed temperatures are computed as before: U A, where U is the matrix containing the spatial eigenvectors and A is the matrix of PCs.
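The algebra can be sketched end to end with synthetic data. Everything below (field sizes, station cells, the assumed noise level mu²) is a made-up stand-in rather than the actual AVHRR or station data, and the regularized normal equations are solved directly instead of forming a pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- synthetic stand-ins (all sizes, indices, and noise levels made up) ---
n_t, n_grid, n_eofs = 300, 50, 3
stations = np.arange(2, 50, 4)               # 12 hypothetical station cells

# a low-rank "AVHRR" anomaly field: months x grid cells
pcs_true = rng.standard_normal((n_t, n_eofs))
patterns = rng.standard_normal((n_eofs, n_grid))
sat = pcs_true @ patterns + 0.1 * rng.standard_normal((n_t, n_grid))

# SVD of the transposed anomalies: U holds spatial EOFs, Vt temporal
U, s, Vt = np.linalg.svd(sat.T, full_matrices=False)

# L_p: eigenvector rows at the station cells, scaled by the eigenvalues
# and by the square root of the number of rows (a normalization choice)
Lp = U[stations, :n_eofs] * s[:n_eofs] / np.sqrt(n_grid)

# "ground" data: the field at the station cells plus measurement noise
mu2 = 0.25                                   # assumed mean square noise
d = sat[:, stations] + np.sqrt(mu2) * rng.standard_normal((n_t, len(stations)))

# regularized least squares for all months at once:
# a_t = (Lp'Lp + mu^2 I)^-1 Lp' d_t
G = np.linalg.solve(Lp.T @ Lp + mu2 * np.eye(n_eofs), Lp.T)
A = d @ G.T                                  # rows = months, columns = fitted PCs

# reconstruct the whole grid from the fitted PCs and the scaled spatial EOFs
L_full = U[:, :n_eofs] * s[:n_eofs] / np.sqrt(n_grid)
recon = A @ L_full.T                         # months x grid cells
```

In the real reconstruction, mu² comes from the residual calculation described earlier rather than being assumed, and the station data is first infilled.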
This method is attractive for a number of reasons. First, it is a straight-forward least squares regression of the station data against the spatial EOFs. It does not require any black-box imputation algorithm, and a least squares regression is a readily understood calculation. And most of all, for the purpose of reconstructing surface air temperatures, the solution contains no direct satellite temperature information.
So, without further ado, here is a 28-spatial EOF reconstruction, using ground station set GRID 1C.
The really encouraging part is that even though we used only ground station temperatures and satellite covariance information, our cross-validation statistics against actual satellite temperatures are double those of Steig’s.
Just for the heck of it, I did a 288-spatial EOF reconstruction. Bigger is better. There are a total of 300 possible spatial EOFs from the decomposition of the AVHRR data. The last 12 have essentially zero eigenvalues (~10⁻¹²), so 288 is the maximum number of EOFs that can be used. Essentially, this uses every last bit of covariance information available from the satellite data. The subtitles are slightly incorrect. I retained zero PCs; I retained only spatial EOFs. Everywhere you see “PC”, substitute “spatial EOF”.
So much for cut-and-paste.
Results are below.
As an interesting side note, I used Dr. Beckers’ noise estimation procedure to compute the rms error between the satellite and ground information across the entire grid. The rms error at the ground station locations specifically was about 2.5°C, or a little better than Comiso’s reported ~3°C for his cloud masking procedure in Comiso (2000). Across the whole grid, the average was about 2.8°C. The difference between the satellite information and the stations – and, by extension, the entire grid – is actually a bit more accurate than Comiso advertises his cloud masking to be.
In other words, pretty good.
However, it would do all of us well to remember that the signal we are looking for (~0.1°C per decade) is still about 30 times smaller than the standard deviation of the difference between cloud-masked satellite data and ground data. So even though Comiso delivered on his ~3°C promise, that level of noise still allows a massive difference in trend.
Figured it was a good idea to add a little perspective on what ~3°C rms noise really means in terms of the confidence we should place in trends computed from the data.