More BEST Confidence Interval Discussion
Posted by Jeff Id on November 1, 2011
I wrote to Richard Muller and Judith Curry about this yesterday; Richard has yet to acknowledge my email. My previous explanations of the problems in the confidence intervals of the BEST temperature series were confusing for some pretty smart people. I’m hoping the authors can figure out what I mean, but today I want to delve a little deeper. The methods paper is here.
On page 17 they discuss the error calculation methods. It is a bunch of complex stuff, but it boils down to giving higher weight to the stations that agree most closely with the mean value. Limits were placed on how much a station can be deweighted or upweighted. The typical error of a point was assigned the constant value ‘e’.
The scale of the typical measurement error (𝑒 ≈ 0.55 C)
This is a screen grab of the station weighting section on pages 17 and 18:
Equation 31 limits the weights between 1/13 and 2.
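To make the 1/13 to 2 range concrete, here is a minimal sketch of a bounded weighting function. The functional form 2e²/(e² + misfit²) is my own assumption for illustration and is not Equation 31 from the paper; I picked it only because it gives a weight of 1 when a station's misfit equals e and a maximum of 2 for a perfect station. The names `station_weight`, `E_TYPICAL`, `W_MIN` and `W_MAX` are hypothetical.

```python
import numpy as np

E_TYPICAL = 0.55      # typical measurement error quoted in the paper (deg C)
W_MIN = 1.0 / 13.0    # lower weight limit from the text
W_MAX = 2.0           # upper weight limit from the text

def station_weight(misfit, e=E_TYPICAL):
    """Assumed illustrative form: weight falls as station misfit grows.

    misfit: RMS difference between a station and the weighted mean (deg C).
    Only the clipping range comes from the text, not the curve itself.
    """
    misfit = np.asarray(misfit, dtype=float)
    raw = 2.0 * e**2 / (e**2 + misfit**2)
    return np.clip(raw, W_MIN, W_MAX)

for m in (0.0, 0.55, 1.1, 2.75, 5.0):
    print(m, float(station_weight(m)))
```

Under this assumed curve the floor of 1/13 is reached at a misfit of 5e (2.75 C), and anything noisier gets the same minimum weight.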
I’m looking deeper into this because I have yet to receive any reply on the issues of weighting and their effect on the Jackknife calculation. I’m more convinced than ever that the problem is real and that it will require a rewrite of the CI portion of the paper. I set out to improve my explanation and started digging deeper into the equations presented.
The way I read the weighting section now, each station’s variance is measured relative to the weighted mean. Weights are recalculated from this variance and the mean is recomputed from the weighted data. The process is repeated until some convergence threshold is met or the maximum number of iterations is reached. The paper claims that the average station weight should be near 1 by their definition, but the actual result after iteration may be a bit different. The equations appear to leave more room for deweighting stations than for overweighting them.
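The loop described above can be sketched roughly as follows. This is my reading of the procedure, not the BEST code: the weight curve is the same assumed 2e²/(e² + misfit²) form clipped to the 1/13 to 2 range, the data is a gap-free station-by-time array, and `iterate_weights`, `tol` and `max_iter` are hypothetical names and settings.

```python
import numpy as np

def iterate_weights(data, e=0.55, w_min=1 / 13, w_max=2.0, tol=1e-6, max_iter=50):
    """data: 2-D array, rows are stations, columns are time steps (no gaps)."""
    data = np.asarray(data, dtype=float)
    weights = np.ones(data.shape[0])          # start with every station equal
    for _ in range(max_iter):
        # Weighted mean series from the current weights.
        mean = np.average(data, axis=0, weights=weights)
        # RMS misfit of each station against that mean.
        misfit = np.sqrt(np.mean((data - mean) ** 2, axis=1))
        # Assumed weight curve, limited to the 1/13 to 2 range.
        new_weights = np.clip(2.0 * e**2 / (e**2 + misfit**2), w_min, w_max)
        converged = np.max(np.abs(new_weights - weights)) < tol
        weights = new_weights
        if converged:
            break
    mean = np.average(data, axis=0, weights=weights)
    return mean, weights
```

The point to notice is that the weights are an output of the data handed in, so dropping stations and calling the function again will generally return different weights for the stations that remain.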
Jackknife assumes elimination of x percent noise. In the case of this paper, it assumes 1/8 of the stations = 1/8 of the noise. The difference between the full reconstruction and the reduced-data reconstruction is then attributed to that 1/8. Now BEST eliminates the stations and then re-runs the entire weighting algorithm. This changes the weights, meaning the share of noise assigned to each station has been reassigned, which creates a tremendous statistical problem. But if it did not reweight, could jackknife do the job effectively?
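For reference, here is a minimal sketch of the textbook delete-a-group jackknife with 8 groups. I have not confirmed that this is the exact formula used in the paper, so treat it as the generic method; `assign_groups` and `group_jackknife_se` are hypothetical helper names.

```python
import numpy as np

def assign_groups(n_stations, n_groups=8, seed=0):
    """Randomly split station indices into n_groups nearly equal groups."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n_stations) % n_groups
    rng.shuffle(labels)
    return labels

def group_jackknife_se(values, labels, estimator=np.mean):
    """Delete-a-group jackknife standard error of estimator(values)."""
    values = np.asarray(values, dtype=float)
    group_ids = np.unique(labels)
    g = len(group_ids)
    # Re-run the estimator with each group of stations left out in turn.
    leave_out = np.array([estimator(values[labels != k]) for k in group_ids])
    # Standard delete-a-group variance: (g - 1) / g times the sum of squares.
    return np.sqrt((g - 1) / g * np.sum((leave_out - leave_out.mean()) ** 2))

rng = np.random.default_rng(1)
values = rng.normal(size=800)
labels = assign_groups(len(values))
print(group_jackknife_se(values, labels))
```

For a plain unweighted mean with equal-sized groups this reduces to the spread of the group means, which is why it works when every station counts the same. The `estimator` argument can be any function, including one that recomputes weights on each call, and that is where the trouble described above comes in.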
If most of the stations are below the midline of the weighting range, the jackknife method here would underestimate the confidence interval. If every station were weighted at the midline of the weighting distribution, the same weight for everything, it would work perfectly. The median weight of stations was targeted to be 1 by the calculation of the e value, but the midpoint of the 1/13 to 2 range comes out to 1.04, so we have a four percent bias in the range of accepted values. What it boils down to is that if a greater number of stations are weighted below 1.04, the algorithm will tend to underestimate confidence intervals even if the data weren’t reweighted after information is removed.
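The arithmetic on the midline is easy to check, and it also shows how a set of weights can average 1 while most stations sit below the midline. The weight values here are invented for illustration and are not BEST output.

```python
import numpy as np

W_MIN, W_MAX = 1.0 / 13.0, 2.0
midpoint = (W_MIN + W_MAX) / 2.0   # centre of the allowed weight range

# Hypothetical weights: 60 percent of stations down-weighted to 0.5,
# 40 percent up-weighted to 1.75. Invented numbers, not BEST output.
weights = np.array([0.5] * 60 + [1.75] * 40)

print(round(midpoint, 3))          # 1.038
print(weights.mean())              # 1.0
print(np.median(weights))          # 0.5
share_below = np.mean(weights < midpoint)
print(share_below)                 # 0.6
```

So an average weight of 1 says very little about where the bulk of the stations fall relative to the midline.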
I’m out of time now, but think about this point for a bit. Suppose we have a method which downweights stations with high variance, and a region where some stations sit by water and are directly affected by its stabilizing effect while others are farther inland. The inland stations, with their higher variance, would be deweighted intentionally, and their influence on the confidence intervals would be minimized. There really aren’t physical reasons to deweight the high variance stations in this case, but that is what the algorithm does. It may have been wiser for them to use correlation rather than variance.
Still, when any method is run through Jackknife the way BEST did it, the result is very dependent on the structure and distribution of the data. I have plotted the weighting function below. The plot shows how steeply weights are cut off by variance.
The paper claims that the scale of the ‘typical’ measurement error is 0.55 C and shows a calculation for average measurement error, but what matters for Jackknife is not having the average variance (error) weighted to 1 but having the median of the variance at 1. In the plot above, the 0.55 average error falls at a weight of 1. (One here means a value equal to the average regional weight of the station subset being considered, since weights are divided out. Some regions reported widespread high variance, which resulted in large groups of low weight stations being combined. Jackknife just needs to know it has removed a true 1/8 of the data from the region to function properly.) If the median were centered on this regional 1 and you removed 1/8th of the data, you would be removing equal amounts of up- and down-weighted data, and the chances that you are actually removing 1/8th of the error are probably better. That is still not proven accurate, because the distribution of the weighting has to be taken into account. And since the data is re-weighted in each of the jackknife runs, all bets are off on the CI.
If they were to show that the weights of temperature stations had a Gaussian distribution centered on each regional average weight, Jackknife could provide a reasonable estimate, but to date this has not been demonstrated, and the documentation makes no effort to prove it.
Finally, they ran a large Monte Carlo style simulation which they claim verified the accuracy of their method.
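To put a number on "a true 1/8 of the data", the sketch below measures what share of the total station weight each jackknife group actually carries. The skewed case is a deliberately extreme, invented example in which one group happens to contain only floor-weight stations; `removed_weight_share` is a hypothetical helper name.

```python
import numpy as np

def removed_weight_share(weights, labels):
    """Fraction of the total station weight carried by each jackknife group."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    return np.array([weights[labels == k].sum() / total for k in np.unique(labels)])

n_stations, n_groups = 800, 8
labels = np.arange(n_stations) % n_groups   # 100 stations per group

equal = np.ones(n_stations)
# Hypothetical worst case: group 0 holds only floor-weight stations.
skewed = np.where(labels == 0, 1.0 / 13.0, 1.0)

print(removed_weight_share(equal, labels))       # every entry is 0.125
print(removed_weight_share(skewed, labels)[0])   # far below 0.125
```

With equal weights each group removes exactly 1/8. In the skewed case, dropping group 0 removes 1/8 of the stations but only about 1 percent of the weight, so the reconstruction barely moves and the jackknife reads that as low uncertainty.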
We studied the relative reliability of the sampling and jackknife methods using over 10,000 Monte Carlo simulations. For each of these simulations, we created a toy temperature model of the “Earth” consisting of 100 independent climate regions. We simulated data for each region, using a distribution function that was chosen to mimic the distribution of the real data; so, for example, some regions had many sites, but some had only 1 or 2. This model verified that sparse regions caused problems for the sampling method. In these tests we found that the jackknife method gave a consistently accurate measure of the true error (known since in the Monte Carlo we knew the “truth”) while the sampling would consistently underestimate the true error.
There isn’t enough detail to understand what they are claiming here. My impression is that they discovered the algorithm is measuring the stability of its own result, just as it did with the global data. Perhaps an SI will be released with the methodology of this section so that we can understand exactly what these simulations consisted of. I may simply have to read code for days.