Overconfidence Error in BEST
Posted by Jeff Id on October 30, 2011
UPDATE: The post started as a general review of BEST several days ago including some critiques by others. As I finished, an error in the CI calculation became apparent to me. If you are familiar with the work, jump to the section titled Jackknife for the description of that problem.
Finally, a technical post. As most here know, the Berkeley Earth Surface Temperature (BEST) project has released its early results. The media has worked tirelessly to misrepresent the results to the public. I even heard Chicago progressive radio refer to the authors of BEST as denialists who decided they were going to overturn the results of other surface temperature analyses and found that they couldn’t, proving again that global warming is going to doom us all and that skeptics are fools. Really, that’s what mainstream Chicagoan progressives are being told about this. The purpose of the project was actually to create an open and transparent global surface temperature record that people can read and critique at will. The authors made it clear in their mission statement, which begins with:
Our aim is to resolve current criticism of the former temperature analyses, and to prepare an open record that will allow rapid response to further criticism or suggestions.
A difficult proposition considering the quality and number of the temperature records involved. I have read all four papers, including multiple reads of the methods paper to understand some of the more sophisticated points presented. I’ve also read critiques by several bloggers/scientists, some of which I agree with and others which I believe are mistaken, but in the end I don’t believe any of the critiques could possibly have any appreciable effect on the trend results without a surfacestations style analysis of the raw data. The only places I have concern are in the brush-over given to the UHI effect and the obviously over-tight confidence intervals. Even if the CI’s were widened to more correct levels, it wouldn’t change the result, and the UHI effect isn’t going to reverse any trend. So despite some statistical critique, I believe the result is very close to the actual global surface temperature average minus some unknown amount of warming by UHI. Still, I do believe that I have identified a specific error in the confidence interval calculation which must be corrected and is discussed below.
Now the model they give in the methods paper is designed to separate seasonal, latitude, altitude and measurement noise from the ‘climate’ signal. There is nothing I see wrong with that. The series DON’T appear to be smoothed (low pass filtered) before combination and some care was taken to separate good data from the chaff. They refer to the method as the scalpel method but it does look a bit like a saw to me. For instance, series are sorted for quality and deweighted by an automatic weighting scheme. This could have a small impact on the overall trend but it has the potential for a far greater effect on the uncertainty in the mean as currently calculated. They determined uncertainty by re-sampling the data and re-running the same algorithm, which is normally a fantastically reliable way to avoid critique, but in this case more discussion is required. The weighting method is very much ad hoc but, again, I doubt it can change trend results substantially, although the same cannot be said for the CI.
From the methods paper, they describe the process:
Rather than correcting data, we rely on a philosophically different approach. Our method
has two components: 1) Break time series into independent fragments at times when there is
evidence of abrupt discontinuities, and 2) Adjust the weights within the fitting equations to
account for differences in reliability.
I recommend the reader actually check the detail of the papers in the resources section linked above but the following paragraph is a bit alarming to the inner engineer. If everything were understandable the weighting would be ok, but by my reading, the weights seem tweaked to optimize stability of result. This then tweaks the stability of the re-sampling in the jackknife CI calculation and minimizes the confidence intervals.
Due to the limits on outliers from the previous section, the station weight has a range
between 1/13 and 2, effectively allowing a “perfect” station record to receive up to 26 times the
weight of a “terrible” record. This functional form was chosen for the station weight due to
several desirable qualities. The typical record is expected to have a weight near 1, with poor
records being more severely downweighted than good records are enhanced. Using a
relationship that limits the potential upweighting of good records was found to be necessary in
order to ensure efficient convergence and numerical stability. A number of alternative weighting
and functional forms with similar properties were also considered, but we found that the
construction of global temperature time series were not very sensitive to the details of how the
downweighting of inconsistent records was handled.
I do believe the last sentence of the paragraph though – the average isn’t much affected by de-weighting the outliers. I do hope that the details of this become more clear as the release of their results continues. Currently the BEST temperature averages look like this:
These plots are from the GHCN monthly data only. To say the least, I am skeptical of the tight confidence intervals simply from my own work on the data. William Briggs, who is one of my favorite bloggers and should be linked on the right were I not so lazy, wrote this critique of BEST. There are numerous points I agree with, but he critiqued the modeling of autocorrelation, whereas my understanding is that the Monte Carlo style jackknife method naturally incorporates it. Again, I think he’s right that they came up with far too small of an uncertainty interval but the reasons are more straightforward.
This is what William wrote:
The authors use the model:
T(x,t) = θ(t) + C(x) + W(x,t)
where x is a vector of temperatures at spatial locations, t is time, θ() is a trend function, C() is spatial climate (integrated to 0 over the Earth’s surface), and W() is the departure from C() (integrated to 0 over the surface or time)
The model takes into account spatial correlation (described next) but ignores correlation in time. The model accounts for height above sea level but no other geographic features. In its time aspect, it is a singly naive model. Correlation in time in real-life is important and non-ignorable, so we already know that the results from the BEST model will be too sure
Now I don’t know how carefully William read the methods paper at this time, but since a resampling method was used to determine confidence intervals, the temporal autocorrelation is accounted for, no model required. Unfortunately, the whole scheme of the deweighted combination was designed to reduce CI’s, so his conclusion is correct but his reasoning is not. What is going on is that the authors chose to upweight the data which best matched the average and downweight the outliers. The cycle is repeated until weightings are determined for each station. The process is effectively narrowing the distribution based on best fit to the average. By resampling and running the same algorithm, they found the confidence intervals were very narrow. Their CI, though, measures the ability of the algorithm to pick out a consistent mean value even with reduced data. Whether that consistency represents confidence in the known temperature is another matter entirely and it is my contention that it does not.
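To make the upweight/downweight cycle concrete, here is a minimal sketch in Python. It is not the BEST algorithm: the weight formula, the median-based scale and the station values are all hypothetical stand-ins chosen for illustration. Only the weight limits of 1/13 and 2 come from the methods paper quoted earlier.

```python
import statistics

W_MIN = 1.0 / 13.0  # lower weight limit quoted from the methods paper
W_MAX = 2.0         # upper weight limit quoted from the methods paper


def reweighted_mean(values, n_iter=20):
    """Iteratively reweighted mean (hypothetical weight formula, not BEST's)."""
    weights = [1.0] * len(values)
    for _ in range(n_iter):
        mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        residuals = [v - mean for v in values]
        # Typical misfit: a value this far from the mean gets weight 1.
        scale = statistics.median([abs(r) for r in residuals]) or 1e-12
        # Values close to the current mean are upweighted, outliers are
        # downweighted, and everything is clipped to [W_MIN, W_MAX].
        weights = [
            min(W_MAX, max(W_MIN, 2.0 / (1.0 + (r / scale) ** 2)))
            for r in residuals
        ]
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return mean, weights


# Made-up station anomalies: eight that agree with each other and one outlier.
anomalies = [-0.2, -0.1, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 3.0]
plain_mean = sum(anomalies) / len(anomalies)
robust_mean, weights = reweighted_mean(anomalies)

print(f"plain mean      = {plain_mean:.3f}")
print(f"reweighted mean = {robust_mean:.3f}")
print(f"outlier weight  = {weights[-1]:.3f}")
```

In this toy case the outlier ends up pinned at the minimum weight and the reweighted mean sits much closer to the cluster of agreeing values than the plain mean does, which is the narrowing of the distribution described above.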
Briggs also critiques the lack of Kriging model uncertainty analysis. This is a reasonable critique, but I doubt the function fit to the spatial autocorrelation of surface stations will make much/any difference. The real problem is in the application of the Jackknife method and the determination of the CI’s. Keenan also made some critiques of the method which I believe miss the mark as well.
Now this portion of my post will be fairly detailed and requires some understanding of how CI’s can be calculated by resampling. Where I have trouble with this is that there are multiple difficulties in understanding the deweighting of a spread of values. This is the description given in the BEST methods paper:
In their case, the weights are calculated 8 times, each with 1/8th of the data removed. Equation 36 creates an upweighted version of the residual differences between the full reconstruction and the reduced data reconstruction. The reduced data reconstructions contain temperature stations which are re-weighted to produce the trends. Now, the authors claim that this variation in result represents the true uncertainty of the total method mean temperature, but I disagree. What this represents is the ability of the model to choose (upweight/downweight) the same stations in the absence of a small fraction of the data. The resampling methods will necessarily generate a very small CI from this, but the truth is that their algorithm is generating the same mean values within the CI’s as presented in the paper. Think about that. They always get the same result within that CI so they are getting the same result inside a very narrow band. So are these methods a true representation of our confidence in the mean temp?
The problem is that equation 36 generates independent datasets by upweighting residuals from different runs containing the same data. Each run, though, changes the weight of the root data by reweighting individual temperature series. The central values (series most like the mean) of the reduced data runs are effectively upweighted more when data is removed, while outliers experience the opposite effect. The central value is therefore non-normally and non-linearly preferred, invalidating the assumptions of the subsampling methods. More simply, this weighting of the preferred values means that you really don’t have 1/8 less data, which is THE central assumption of the Jackknife methods. Because of the weighting algorithm, they have functionally removed less than 1/8th. This is likely the primary reason why subsampling produced even tighter CI’s than Jackknife, as mentioned in the paper. This is a significant error in the methods paper which will require a rework of the CI calculation methods and a re-write of the CI portion of the methods paper.
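A toy comparison can show the direction of this effect. The sketch below runs a textbook delete-a-group jackknife with 8 groups, once on a plain unweighted mean and once on a mean that is re-weighted inside every reduced run. Everything in it is an assumption made for illustration: the jackknife is the standard textbook form and not necessarily equation 36, the weight formula is a hypothetical stand-in, and the sixteen station values are invented. Only the 1/8 removal and the 1/13 to 2 weight limits come from the paper.

```python
import math
import statistics

W_MIN, W_MAX = 1.0 / 13.0, 2.0  # weight limits quoted from the methods paper
N_GROUPS = 8                     # the paper removes 1/8 of the data per run


def plain_mean(values):
    return sum(values) / len(values)


def reweighted_mean(values, n_iter=20):
    """Hypothetical stand-in for the weighting: favor values near the mean."""
    weights = [1.0] * len(values)
    for _ in range(n_iter):
        mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        residuals = [v - mean for v in values]
        scale = statistics.median([abs(r) for r in residuals]) or 1e-12
        weights = [min(W_MAX, max(W_MIN, 2.0 / (1.0 + (r / scale) ** 2)))
                   for r in residuals]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def jackknife_se(values, estimator, n_groups=N_GROUPS):
    """Textbook delete-a-group jackknife standard error of an estimator."""
    reduced = []
    for g in range(n_groups):
        # Drop every value assigned to group g, then re-run the estimator.
        kept = [v for i, v in enumerate(values) if i % n_groups != g]
        reduced.append(estimator(kept))
    center = sum(reduced) / n_groups
    variance = (n_groups - 1) / n_groups * sum((r - center) ** 2 for r in reduced)
    return math.sqrt(variance)


# Invented station anomalies: fourteen that agree and two outliers.
anomalies = [-0.2, -0.1, 0.0, 0.1, 0.2, -0.1, 0.0, 0.1,
             0.1, 0.0, -0.1, 0.2, -0.2, 0.1, 3.0, -2.5]

se_plain = jackknife_se(anomalies, plain_mean)
se_reweighted = jackknife_se(anomalies, reweighted_mean)
print(f"jackknife SE, unweighted mean        = {se_plain:.3f}")
print(f"jackknife SE, re-weighted in each run = {se_reweighted:.3f}")
```

On this made-up data the re-weighted runs agree with each other far more closely than the unweighted ones, because the outliers carry almost no weight whether or not their group is removed. The sketch says nothing about the size of the effect in the real BEST data; it only illustrates why re-weighting inside each reduced run tightens the jackknife spread.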
If they re-ran the Jackknife without re-weighting the data, an improved (and wider) CI could be calculated but care would need to be taken as the removal of weighted data will result in strikingly non-normal distributions.
In all, I like the general idea of the transparency. The re-weighting is a little hinky and, like William Briggs, I would prefer to see bad data excised, as it is all data and therefore statistically cleaner. The UHI conclusions are not vetted thoroughly enough to support the conclusions about bias in the data that they draw, as it doesn’t require a skyscraper to screw up a temp station. However, excepting the errors in the CI calculations, the results aren’t far from what we would expect.