the Air Vent

Because the world needs another opinion

More BEST Confidence Interval Discussion

Posted by Jeff Id on November 1, 2011

Well, I wrote to Richard Muller yesterday on this, as well as to Judith Curry; Richard has yet to acknowledge my email. We have seen that my previous explanations of the problems in the confidence intervals of the BEST temperature series were confusing for some pretty smart people. I'm hoping the authors can figure out what I mean, but today I wanted to delve a little deeper. The methods paper is here.

On page 17 they discuss the error calculation methods. It is a bunch of complex stuff which boils down to giving higher weight to stations that correlate best with the mean value. Limits were placed on how much a station can be deweighted or upweighted. The typical error of a point was assigned a constant value 'e'.

The scale of the typical measurement error (e ≈ 0.55 °C)

This is a screen grab of the station weighting section on pages 17 and 18:

Equation 31 limits the weights between 1/13 and 2.

My reasoning for looking deeper into this is because I have yet to receive any reply on the issues of weighting and their effect on the Jackknife calculation. I’ve become more convinced than ever that the problem is real and it will absolutely require a re-write of the CI portion of the paper. I was going to attempt to improve my explanation and started digging deeper into the equations presented.

The way I read the weighting section now, stations are measured for variance from the weighted mean. Weights are recalculated from this variance and the mean is recomputed from the reweighted data. The process is repeated until some convergence threshold is reached or a maximum number of iterations is hit. The paper claims that the average station weight should be near 1 by their definition, but the actual result after iteration may be a bit different. By the equations, there appears to be more room for deweighting stations than for overweighting them.
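
For anyone who wants to play with the idea, here is a minimal sketch of that iteration in Python. The E²/misfit weight formula is my own assumption for illustration, not BEST's exact equation; the clipping limits are the 1/13 and 2 from equation 31.

```python
from statistics import fmean

W_MIN, W_MAX = 1 / 13, 2.0   # clipping limits from equation 31
E = 0.55                     # 'typical' measurement error scale from the paper

def weighted_mean(series_list, weights):
    """Pointwise weighted mean across stations."""
    total = sum(weights)
    n_times = len(series_list[0])
    return [sum(w * s[t] for w, s in zip(weights, series_list)) / total
            for t in range(n_times)]

def iterate_weights(series_list, max_iter=50, tol=1e-9):
    """Weight each station by its misfit variance about the weighted mean,
    clip to [1/13, 2], recompute the mean, and repeat until convergence.
    The E**2 / misfit weight form is my assumption, not BEST's equation."""
    weights = [1.0] * len(series_list)
    for _ in range(max_iter):
        mean = weighted_mean(series_list, weights)
        new = []
        for s in series_list:
            misfit = fmean((x - m) ** 2 for x, m in zip(s, mean))
            new.append(min(max(E * E / max(misfit, 1e-12), W_MIN), W_MAX))
        if max(abs(a - b) for a, b in zip(new, weights)) < tol:
            return new
        weights = new
    return weights
```

A station with several times the noise of its neighbors gets pinned to the lower clip limit almost immediately under this form.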

Jackknife assumes that eliminating x percent of the stations eliminates x percent of the noise. In the case of this paper, removing 1/8 of the stations is assumed to remove 1/8 of the noise, and the difference between the full reconstruction and each reduced-data reconstruction is scaled accordingly. But BEST eliminates the stations and then re-runs the entire weighting algorithm. This changes the weights, meaning the noise percentages have been re-assigned among the stations, creating a tremendous statistical problem. But if it did not reweight, could jackknife do the job effectively?
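
As a reference point, the plain delete-a-group jackknife with no reweighting looks something like this sketch. The (k − 1)/k factor is the textbook grouped-jackknife scaling; whether BEST scales exactly this way is not spelled out in the paper.

```python
from statistics import fmean
import random

def grouped_jackknife_se(stations, estimator, k=8, seed=0):
    """Delete-a-group jackknife sketch: split the stations into k groups,
    recompute the estimate k times with one group withheld, and scale
    the spread of the k estimates by the grouped-jackknife factor."""
    rng = random.Random(seed)
    order = list(stations)
    rng.shuffle(order)
    groups = [order[i::k] for i in range(k)]
    thetas = []
    for i in range(k):
        kept = [s for j, g in enumerate(groups) if j != i for s in g]
        thetas.append(estimator(kept))
    theta_bar = fmean(thetas)
    var = (k - 1) / k * sum((t - theta_bar) ** 2 for t in thetas)
    return var ** 0.5
```

With `k=8` this is the delete-1/8 scheme the paper describes, applied to a fixed station set.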

If most of the stations are below the centroidal midline of the weighting value, the jackknife method here will underestimate the confidence interval. If every station were weighted at the midline of the weighting distribution, the same weight for everything, it would work perfectly. The median station weight was targeted to be 1 by the calculation of the e value, but the midpoint of the accepted range from 1/13 to 2 comes out to 1.04, so we have roughly a four percent bias in the range of accepted values. What it boils down to is that if a greater number of stations are weighted below about 1.04, the algorithm will tend to underestimate the confidence intervals even if the data weren't reweighted after information is removed.
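
The midline arithmetic is just the average of the allowed weight range:

```python
# Average weight over the allowed range [1/13, 2] from equation 31,
# i.e. the integral of w over the interval divided by its length.
w_min, w_max = 1 / 13, 2.0
midline = (w_min + w_max) / 2
print(round(midline, 3))   # 1.038
```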

I'm out of time now, but think about this point for a bit. Suppose we have a method which downweights stations that have high variance. Consider a region with stations near water, for instance, where some stations are directly affected by the stabilizing effect of the water and others sit farther inland. The higher-variance inland stations would be deweighted intentionally, and their effect on the confidence intervals would be minimized. There really isn't a physical reason to deweight the high-variance stations in this case, but that is what the algorithm does. It may have been wiser for them to use correlation rather than variance.

Still, when any method is run through the jackknife the way BEST did it, the result is very dependent on the structure and distribution of the data. I have plotted the weighting function below. The plot shows how steeply weights are cut off by variance.

The paper claims that the scale of the 'typical' measurement error is 0.55 and shows a calculation for average measurement error, but what matters for the jackknife is not having the average variance (error) weighted to 1 but the median of the variance at 1. In the plot above, an average variance error of 0.55 falls at a weight of 1. (One being a value equal to the average regional weight of the station subset being considered, as weights are divided out. Some regions reported widespread high variance, which resulted in large groups of low-weight stations being combined. The jackknife just needs to know it has removed a true 1/8 of the data from the region to function properly.) If the median were centered on this regional 1, then removing 1/8 of the data would remove equal amounts of upweighted and downweighted data, and the chances that you are actually removing 1/8 of the error are probably better, though still not proven accurate, because the distribution of the weighting has to be taken into account. Still, since the data is reweighted in each of the jackknife runs, all bets are off on the CI.
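
To poke at that last point, here is a toy sketch (my own construction, not BEST's code): a scalar "reconstruction" built from variance-based weights of the same assumed E²/misfit form, clipped to [1/13, 2], run through a delete-one-eighth jackknife two ways — holding the full-data weights fixed versus recomputing the weights inside each reduced run.

```python
import random
from statistics import fmean

W_MIN, W_MAX, E = 1 / 13, 2.0, 0.55  # clip limits from equation 31

def station_weights(series_list):
    """Variance-based weights about the pointwise mean; the E**2/misfit
    form is my assumption, not BEST's exact equation."""
    n_t = len(series_list[0])
    mean = [fmean(s[t] for s in series_list) for t in range(n_t)]
    ws = []
    for s in series_list:
        misfit = fmean((x - m) ** 2 for x, m in zip(s, mean))
        ws.append(min(max(E * E / max(misfit, 1e-12), W_MIN), W_MAX))
    return ws

def reconstruction(series_list, ws):
    """Scalar stand-in for a reconstruction: weighted mean of station means."""
    means = [fmean(s) for s in series_list]
    return sum(w * m for w, m in zip(ws, means)) / sum(ws)

def jackknife_se(series_list, reweight, k=8, seed=0):
    """Delete-one-group (1/k of stations) jackknife, with or without
    recomputing the weights inside each reduced run."""
    rng = random.Random(seed)
    idx = list(range(len(series_list)))
    rng.shuffle(idx)
    groups = [set(idx[i::k]) for i in range(k)]
    full_w = station_weights(series_list)
    thetas = []
    for g in groups:
        keep = [j for j in idx if j not in g]
        sub = [series_list[j] for j in keep]
        ws = station_weights(sub) if reweight else [full_w[j] for j in keep]
        thetas.append(reconstruction(sub, ws))
    tb = fmean(thetas)
    return ((k - 1) / k * sum((t - tb) ** 2 for t in thetas)) ** 0.5
```

With a mix of low- and high-variance stations the two standard errors come out different, and which way they move depends on the shape of the weight distribution — which is exactly the concern.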

If they were to show that the weighting of temperature stations has a Gaussian distribution centered on each regional average weight, the jackknife could provide a reasonable estimate, but to date this has not been demonstrated and no effort has been made to prove it in the documentation.
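
A first-pass check of that claim would not even need the full distribution: the sample skewness of the station weights should sit near zero if the weights are symmetric about the regional average. A quick sketch:

```python
from statistics import fmean, pstdev

def weight_skewness(weights):
    """Sample skewness of a list of station weights; a symmetric
    (e.g. Gaussian) weight distribution should give a value near zero."""
    mu = fmean(weights)
    sd = pstdev(weights)
    if sd == 0:
        return 0.0
    return fmean(((w - mu) / sd) ** 3 for w in weights)
```

A pile of stations pinned at the 1/13 clip with a few near 2 shows up immediately as large positive skew.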

Finally, they ran a large Monte Carlo-style simulation which they claim verified the accuracy of their method.

We studied the relative reliability of the sampling and jackknife methods using over 10,000 Monte Carlo simulations. For each of these simulations, we created a toy temperature model of the “Earth” consisting of 100 independent climate regions. We simulated data for each region, using a distribution function that was chosen to mimic the distribution of the real data; so, for example, some regions had many sites, but some had only 1 or 2. This model verified that sparse regions caused problems for the sampling method. In these tests we found that the jackknife method gave a consistently accurate measure of the true error (known since in the Monte Carlo we knew the “truth”) while the sampling would consistently underestimate the true […]

There isn't enough detail to understand what they are claiming here. My impression is that they discovered the algorithm measuring its own stability, just as it did with the global data. Perhaps an SI will be released with the methodology of this section so that we can understand exactly what these simulations consisted of. I may have to simply read code for days.

22 Responses to “More BEST Confidence Interval Discussion”

  1. Thanks, Jeff, for your efforts.

    I also wrote to Professors Richard Muller and Judith Curry and received no reply. I shortened my comment and reposted it as comment #131498:

    And then fixed the links in comment #131511

    If the comment disappears again, I will ask them again to reply.

  2. M. Simon said

    Isn’t there a problem when the local average justifiably differs from the ensemble average (near the ocean, on top of a mountain)? I’m not a stats guy so don’t beat me up. Explain it.

  3. TGSG said

    A real question.

    We can't have temps jumping all over the place, so we try to figure out what they REALLY are, and after cutting, dicing, and slicing we can make the record be what we THINK it really is? Are these outliers NOT part of the record, and if they are, why are they trying to get rid of them? I suppose it's my lack of math that is confusing me.

  4. Jeff Id said


    A true outlier should be chucked because it is caused by faulty collection. The data is really messy. I'm not convinced that BEST is doing that, but if they chuck a little extra to ensure the crap is gone it won't have much effect. Where they have problems is in the breaking of steps, which Steve McIntyre pointed out. This is guaranteed to be biased toward elimination of down steps vs. up steps because of the general uptrend.

    Outliers in trend are definitely part of the record. If the long-term norm is urban bias, as Anthony Watts's surfacestations project demonstrated, the norm isn't what you want to converge to. Someone pointed out at CA that the BEST result correlated to city stations far better than rural ones. I'm not surprised.

  5. lucia said

    Perhaps an SI will be released with the methodology of this section so that we can understand exactly what these simulations consisted of. I may have to simply read code for days.

    I’ll be curious too.

  6. Orson Olson said

    To those emailing Curry or Muller~

    It is worth knowing that both are at a conference in Santa Fe, NM, this week

  7. Brian H said

    Sounds like algebraic confirmation bias to me.

  8. #6. Thanks, Orson. I sent them a joint message after the observations and data that I asked to be addressed (#1) in the BEST reports were deleted.

    I doubt if the debate over global climate change can be concluded to the public’s satisfaction without considering the natural variability of Earth’s heat source – the Sun.

  9. Carrick said

    Jeff, I would consider a well executed Monte Carlo a decent verification of their methodology. I’m not sure what they’ve done is “well executed” though.

    I think the Monte Carlo needs to have similar error sources such as are present in the real data set (including UHI and other land usage effects, station moves and so forth).

    It needs to be testing the complete set of software, with similar numbers of stations, with similar “made up” histories and similar issues with incomplete geographical coverage, etc.

  10. TGSG said

    Thanks Jeff, that helped.

  11. Jeff Id said


    I agree that if the Monte Carlo were done correctly they would have it nailed. I have become completely convinced that they don't have it right, though, so now I read the MC paragraph with a lot of skepticism. I mean, think about the fact that they present the mean variance error as 0.55 and center the weighting on that. Nobody is considering that the regional centering and distribution of the weighting could have any impact on the jackknife. Each region lands on a different part of the weighting curve! Right now, I think the best method they could use to represent the CI would be to remove 1/8 of the stations after weighting and not reweight the result. I really don't know how to fix the mess, but I wasn't paid 600K to make it.

    The algorithm seems designed to reduce the impact of high-variance stations in favor of others, which of course leads to artificially tight CIs. They use the tight CIs to justify the projection farther back in time. McIntyre is right to refer to it as a temperature reconstruction. While I have written that I believe it is a reasonable representation of temperature, I do believe it could be biased high by as much as several tenths per century, and that kind of makes those CIs look pretty useless.

  12. DeWitt Payne said


    I’m reminded of de-noising programs for removing clicks, pops, hiss and rumble from digitized vinyl records or cassette tapes. Run the data through the meat grinder too many times and you end up with distorted sound. The background noise is really low, but it still doesn’t sound very good.

    This article seems to be relevant as well from an analytical chemist’s point of view:

    It is tempting to remove extreme values automatically from a data set, because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias in the calculated mean. There is one golden rule however: no value should be removed from a data set on statistical grounds alone. ‘Statistical grounds’ include outlier testing.

    [my emphasis]

  13. Jeff Id said


    I'm thinking that you are right in this case. Or, if you remove it for trend, you need to leave it for the CI. The data is really a mess, though, and if they collected the removed values in a histogram plot, it would be interesting to see how many crazy ones there were vs. potentially realistic ones. Removing data at the 99.9 percent threshold means that in a 100% clean data group you would lose 0.1% of real data, with real variance implications. I don't think anyone has proven that temp distributions are truly Gaussian, so what looks like 99.9% might really be 99.1%.
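
The 0.1% figure is easy to confirm from the normal distribution itself; this little check (my own illustration, not BEST's cut) shows where the two-sided 99.9% threshold sits and how much perfectly clean Gaussian data it throws away:

```python
from statistics import NormalDist

nd = NormalDist()
cut = nd.inv_cdf(0.9995)        # two-sided 99.9% threshold, about 3.29 sigma
lost = 2 * (1 - nd.cdf(cut))    # fraction of perfectly clean data discarded
print(round(cut, 2), lost)
```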

  14. Bebben said

    @ Lucia

    Sorry for the OT, but for reasons unknown, various other Norwegians and I have independently and suddenly lost access to Lucia's Blackboard. We get the message

    You don’t have permission to access /musings/ on this server.

    Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

    So what’s up, did you block us Lucia… please don’t, we miss you…

    Bebben (and others), Bergen – Norway

  15. lucia said

    I’ll look through my server logs to figure out what the problem is. I’ve been having major numbers of ‘bots hitting the site and bringing it down. I may have blocked the innocent while I’m at it. If you can “guess” my email (starts with lucia. contains an @ uses my domain name….) I’ll look at your IP addresses in the return address and that might help me figure out what is happening with you specifically.

  16. DeWitt Payne said


    I had another thought. If you remove data on statistical grounds, don’t you use up more degrees of freedom than the number of data points you removed? I dunno. Just a thought.

  17. Paul Matthews said

    Re emailing Richard Muller, I did so yesterday and was pleasantly gobsmacked to get a reply in about an hour. He must be getting loads of mail. Maybe your query was harder to deal with than mine.

  18. Jeff Id said

    #17, I may be the bad guy. I'm used to that response from the team. I am certain that my critique is difficult for them, but I'm expecting them to default to disagreement with it on the basis of the unexplained Monte Carlo. We'll see.

  19. John Vetterling said

    Something seems to be amiss in their scalpel technique. They say they look for abrupt jumps > 4 sigma. That indicates the jumps should occur at a rate on the order of 1 in 16,000. If there is a record for each day, that should equate to ~1 jump every 43 years. But they are reporting (if I read correctly) about 1 every 12 years.

    Am I missing something?
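
For what it's worth, the two-tailed 4-sigma arithmetic can be checked directly from the normal distribution; a quick sketch:

```python
from statistics import NormalDist

p = 2 * (1 - NormalDist().cdf(4.0))  # two-tailed probability of a 4-sigma jump
one_in = 1 / p                       # roughly 1 in 15,800
years = one_in / 365                 # with daily records, about one jump per 43 years
print(round(one_in), round(years, 1))
```

which lands at roughly 1 in 15,800, or about one jump per 43 years of daily records — consistent with the numbers above.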

  20. ilyas said

    Reading a number your posts I genuinely discovered this specific 1 to typically be pretty creative. I’ve a internet log also and would wish to repost many snips of your content on my own blogging internet site. Should it be all right if I do this so lengthy as I reference your web site or create a 1 way link towards the article I took the snip from? Otherwise I recognize and would not do it with out your approval . I have book marked this post to twitter and facebook account intended for reference. Nevertheless thank you either way!

  21. SpipiseApposy said

    Help plz!

  22. Brian H said

    Ilyas’ post above is computer spam. The ‘signature’ is generic ungrammatical praise, and a request to cite. Intent is to garner links to a “homepage”.
    As for Spi**********’s post — ??
    Garbage. Pls delete also.
