A few notes from Jeff:
First, this is exciting work by Ryan. The Air Vent has been lucky to have great guest posts lately, and the work by Ryan and Nic has been excellent in improving my own understanding of the original Steig et al. reconstruction and of methods for improving it.
Ryan has been working on a concept to improve two aspects of the original reconstruction. The first is that the imputation in Steig et al. is actually backwards: the Steig reconstruction is a reconstruction of satellite surface skin temperature (AVHRR) rather than the surface temperature reconstruction it is billed as. It needs to be recalibrated to match surface station trends before it can be considered a valid surface station reconstruction. As we know from previous work here, the trends of the Steig reconstruction are substantially different from surface station trends. The second has to do with whether RegEM converges to a local or a global minimum.
Ryan employs a weighting method in a TSVD reconstruction which involves two separate steps.
The method Ryan used to correct both of these problems involves pre-weighting the surface stations relative to the satellite information. By applying a large, equal multiplier to the surface stations relative to the satellite PCs, Ryan gives a strong weighting to the surface stations versus the satellite data and removes the need for a post-reconstruction calibration.
The second weighting is more clever and more important to climate science. While Ryan explains it pretty well below, sometimes two explanations help, and this mathematical step should be very important in EM processing of similar data fields. In EM, where a large quantity of data is missing, spurious correlations can occur and a non-global minimum can be reached. In this case large portions of the data are missing. Imputing one PC at a time, Ryan weighted the individual surface stations by the PCA eigenvector weighting of the original AVHRR data. This means areas with low information content for the PC are less likely to accidentally become heavily weighted as each iteration progresses, which is very important. Think of it as an improved starting point for the iteration, or as supplying the station locations to the imputation rather than leaving RegEM to figure out where the stations belong.
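To make the two weightings concrete, here is a minimal sketch in Python with NumPy. It is not Ryan's code. The function name `weight_stations`, the input `eigvec_at_station` (the value of one PC's spatial eigenvector at the grid cell nearest each station), the multiplier of 10, and the use of the absolute value of the eigenvector are all assumptions made for illustration.

```python
import numpy as np

def weight_stations(stations, eigvec_at_station, multiplier=10.0):
    """Scale each station column before imputing a single satellite PC.

    stations          : (n_months, n_stations) anomalies, NaN where missing
    eigvec_at_station : (n_stations,) eigenvector value of the PC being
                        imputed at the cell nearest each station (assumed)
    multiplier        : equal weight given to every station relative to
                        the PC column (hypothetical value)
    """
    # Stations where the PC carries little information get a small weight.
    w = multiplier * np.abs(eigvec_at_station)
    return stations * w  # broadcasting scales each column; NaNs stay NaN

# Three hypothetical stations, two months of anomalies.
stations = np.array([[1.0, 2.0, np.nan],
                     [0.5, -1.0, 3.0]])
eigvec_at_station = np.array([0.8, -0.1, 0.0])
weighted = weight_stations(stations, eigvec_at_station)
print(weighted)
```

Column scaling like this only matters if the imputation works on unstandardized columns; a correlation matrix is unchanged when a column is multiplied by a constant.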
Ryan, please feel free to correct any details you see wrong with this description.
Of course, Ryan couldn’t resist a little discussion with RC. Read on, I think you’ll like it.
============================================================================
Guest post by Ryan O
Most of you are aware that Dr. Steig posted a response to our reconstructions over at RealClimate. The link is here: http://www.realclimate.org/index.php/archives/2009/06/on-overfitting/
There were two salient points in his post that we should look at. One point was that “someone called Ryan O” had obtained better verification statistics by first calibrating the satellite data to ground data. This point is easily addressed. To all RealClimate readers: everything that follows was done using the cloud masked AVHRR data provided by Dr. Steig as-is. No calibration. In fact, given the way our reconstructions are done, any such calibration would not affect the results. This, too, will be shown later.
The second point is that Dr. Steig claimed that the verification statistics were degraded as additional AVHRR PCs were included. This is certainly true, if you use the original tools (RegEM TTLS) and the original methodology (impute the whole mess at once). I think this point was lost at RealClimate. There are problems with the method and the math behind the method – so we changed the method to address these issues.
Here is a short summary of the issues:
1. TTLS assumes errors in both the predictors and predictands. However, an error in a PC (which is an abstract quantity) does not mean the same thing as an error in a temperature measurement at a specific location. Additionally, if the pre-1982 portion is calculated based on assuming errors in the ground stations and the PCs, then it is inappropriate to simply add the original, unmodified PCs onto the end. The post-1982 solution needs to be calculated the same way as the pre-1982 solution: assuming errors in both.
2. While Steig refers to the ground stations as the predictors and the satellite PCs as the predictands, in their method, this is not strictly true. Any existing values are the predictors and the missing values are the predictands. This means the ground stations and satellite PCs are both predictors and predictands. Not only that, but the satellite PCs affect each other’s imputation by interacting with each other and with the ground stations.
3. Because the solution is based on the truncated SVD of the correlation matrix, the pre-1982 portion of the AVHRR PCs is not truly an extrapolation of the PCs. It is a rotation of the PCs to the ground station data. This means that the original AVHRR PCs should not simply be tacked on to the end. The rotated PCs should be used from 1957 to 2006. The standard RegEM TTLS algorithm does not return the rotated (unspliced) solution (though Nic L’s modification does return that solution). This problem can be done as an extrapolation, but Steig’s method does not accomplish that.
4. The ground stations are used to predict PC values without regard to whether the PC explains any variance at the station location. This is not necessarily a problem – unless you subsequently recover gridded temperatures using the eigenvector. Because Steig uses the eigenvectors to recover gridded temperatures, the eigenvectors must also be used to constrain the imputation. RegEM TTLS has no means of doing this.
5. An insufficient number of PCs are used. The claim that 3 PCs can represent land the size of Antarctica when the ERSST reconstructions required 15+ PCs to represent open ocean areas of equivalent size defies belief.
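Points 4 and 5 both rest on how gridded temperatures are recovered from PCs and spatial eigenvectors. The sketch below, in Python with NumPy, uses a synthetic field rather than the AVHRR data. The sizes, the six underlying patterns, and the helper names `reconstruct` and `fraction_unexplained` are assumptions chosen only to illustrate the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months, n_cells, true_rank = 120, 50, 6

# Synthetic anomaly field: six spatial patterns plus a little noise.
field = rng.normal(size=(n_months, true_rank)) @ rng.normal(size=(true_rank, n_cells))
field += 0.1 * rng.normal(size=field.shape)
field -= field.mean(axis=0)  # anomalies: remove each cell's mean

# SVD: columns of u * s are the PCs (time series); rows of vt are the
# spatial eigenvectors, one weight per grid cell.
u, s, vt = np.linalg.svd(field, full_matrices=False)
pcs = u * s

def reconstruct(k):
    """Gridded field recovered from the first k PCs and eigenvectors."""
    return pcs[:, :k] @ vt[:k, :]

def fraction_unexplained(k):
    """Share of the field's total variance missed by a k-PC reconstruction."""
    return np.sum((field - reconstruct(k)) ** 2) / np.sum(field ** 2)

for k in (3, 6):
    print(k, round(fraction_unexplained(k), 4))
```

In this notation the contribution of PC `k` at a grid cell is `pcs[:, k] * vt[k, cell]`, so a PC whose eigenvector weight is near zero at a station's cell says almost nothing about temperature there. Truncating below the number of real patterns leaves part of the field unrecoverable.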
Some of these issues cannot be resolved when using RegEM TTLS. To that end, we started using a different imputation tool based on a truncated SVD approach (originally written for R by Steve McIntyre). The benefits of the truncated SVD approach are that it is faster, allows direct access to the unspliced solution, and is simpler to understand. That last benefit is important because we will discover that we need to modify the truncated SVD approach to address some of the methodological problems.
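For readers who have not seen this kind of tool, here is a minimal sketch of a generic iterative truncated SVD infill in Python with NumPy. It is not Steve McIntyre's R code and not the modified version described in this post. The function name `tsvd_impute`, the zero starting values, the stopping rule, and the toy rank-1 data are all assumptions.

```python
import numpy as np

def tsvd_impute(x, rank, n_iter=200, tol=1e-8):
    """Fill NaNs in x by iterating a rank-`rank` SVD approximation.

    Returns (filled, approx): `filled` keeps the observed values and uses
    the low-rank estimate only where data were missing (spliced), while
    `approx` is the low-rank estimate everywhere (unspliced).
    """
    missing = np.isnan(x)
    filled = np.where(missing, 0.0, x)  # start missing anomalies at zero
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        new = np.where(missing, approx, x)  # observed values are kept as-is
        change = np.max(np.abs(new - filled))
        filled = new
        if change < tol:
            break
    return filled, approx

# Toy data: an exactly rank-1 matrix whose first column is unobserved
# for the first half of the record.
u_true = np.sin(np.arange(40) / 3.0)
v_true = np.array([0.5, 1.0, -1.0, 2.0, 1.5, -0.5, 1.0, -2.0])
truth = np.outer(u_true, v_true)
x = truth.copy()
x[:20, 0] = np.nan

filled, unspliced = tsvd_impute(x, rank=1)
print(np.max(np.abs(filled - truth)))
```

Returning both matrices is what "direct access to the unspliced solution" amounts to in this sketch: the low-rank estimate is available for the whole record, not only for the months that were missing.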