Overconfidence Error in BEST

UPDATE: This post started several days ago as a general review of BEST, including some critiques by others. As I finished, an error in the CI calculation became apparent to me. If you are familiar with the work, jump to the section titled Jackknife for a description of that problem.

—-

Finally, a technical post.  As most here know, the Berkeley Earth Surface Temperature (BEST) project has released its early results. The media has worked tirelessly to misrepresent the results to the public. I even heard Chicago progressive radio refer to the authors of BEST as denialists who decided they were going to overturn the results of other surface temperature analyses and found that they couldn’t, proving again that global warming is going to doom us all and that skeptics are fools. Really, that’s what mainstream Chicagoan progressives are being told about this. The purpose of the project was actually to create an open and transparent global surface temperature record that people can read and critique at will. The authors made this clear in their mission statement, which begins with:

Our aim is to resolve current criticism of the former temperature analyses, and to prepare an open record that will allow rapid response to further criticism or suggestions.

A difficult proposition considering the quality and number of the temperature records involved. I have read all four papers, including multiple reads of the methods paper to understand some of the more sophisticated points presented. I’ve also read critiques by several bloggers/scientists, some of which I agree with and others of which I believe are mistaken, but in the end I don’t believe any of the critiques could possibly have any appreciable effect on the trend results without a surfacestations-style analysis of the raw data. The only places I have concerns are the brush-over given to the UHI effect and the obviously over-tight confidence intervals. Even if the CI’s were widened to more correct levels, it wouldn’t change the result, and the UHI effect isn’t going to reverse any trend, so despite some statistical critique, I believe the result is very close to the actual global surface temperature average minus some unknown amount of warming by UHI.  Still, I do believe that I have identified a specific error in the confidence interval calculation which must be corrected; it is discussed below.

Now, the model they give in the methods paper is designed to separate seasonal, latitude, altitude and measurement noise from the ‘climate’ signal. There is nothing I see wrong with that. The series DON’T appear to be smoothed (low-pass filtered) before combination, and some care was taken to separate the bad data from the good. They refer to the method as the scalpel method, but it does look a bit like a saw to me. For instance, series are sorted for quality and deweighted by an automatic weighting scheme. This could have a small impact on the overall trend, but it has the potential for a far greater effect on the uncertainty in the mean as currently calculated. They determined uncertainty by re-sampling the data and running the same algorithm, which is normally a fantastically reliable way to avoid critique; in this case, though, more discussion is required. The weighting method is very much ad hoc, but again, I doubt it can change the trend results substantially, although the same cannot be said for the CI.

From the methods paper, they describe the process:

Rather than correcting data, we rely on a philosophically different approach. Our method
has two components: 1) Break time series into independent fragments at times when there is
evidence of abrupt discontinuities, and 2) Adjust the weights within the fitting equations to
account for differences in reliability.

I recommend the reader actually check the details of the papers in the resources section linked above, but the following paragraph is a bit alarming to the inner engineer. If everything were understandable, the weighting would be ok, but by my reading, the weights seem tweaked to optimize the stability of the result. This then tweaks the stability of the re-sampling in the jackknife CI calculation and minimizes the confidence intervals.

Due to the limits on outliers from the previous section, the station weight has a range
between 1/13 and 2, effectively allowing a “perfect” station record to receive up to 26 times the
weight of a “terrible” record. This functional form was chosen for the station weight due to
several desirable qualities. The typical record is expected to have a weight near 1, with poor
records being more severely downweighted than good records are enhanced. Using a
relationship that limits the potential upweighting of good records was found to be necessary in
order to ensure efficient convergence and numerical stability. A number of alternative weighting
and functional forms with similar properties were also considered, but we found that the
construction of global temperature time series were not very sensitive to the details of how the
downweighting of inconsistent records was handled.

I do believe the last sentence of the paragraph though – the average isn’t much affected by de-weighting the outliers. I do hope that the details of this become more clear as the release of their results continues. Currently the BEST temperature averages look like this:

These plots are from the GHCN monthly data only. I am skeptical of the tight confidence intervals, to say the least, simply from my own work on the data. William Briggs, who is one of my favorite bloggers and should be linked on the right were I not so lazy, wrote this critique of BEST. There are numerous points I agree with, but he critiqued the modeling of autocorrelation, whereas, by my understanding, the Monte Carlo-style jackknife method naturally incorporates it. Again, I think he’s right that they came up with far too small an uncertainty interval, but the reasons are more straightforward.

This is what William wrote:

The authors use the model:

T(x,t) = θ(t) + C(x) + W(x,t)

where x is a vector of temperatures at spatial locations, t is time, θ() is a trend function, C() is spatial climate (integrated to 0 over the Earth’s surface), and W() is the departure from C() (integrated to 0 over the surface or time).

The model takes into account spatial correlation (described next) but ignores correlation in time. The model accounts for height above sea level but no other geographic features. In its time aspect, it is a singly naive model. Correlation in time in real-life is important and non-ignorable, so we already know that the results from the BEST model will be too sure

Now, I don’t know how carefully William had read the methods paper at this time, but since a resampling method was used to determine the confidence intervals, the temporal autocorrelation is accounted for; no model is required. Unfortunately, the whole scheme of the deweighted combination was designed to reduce the CI’s, so his conclusion is correct but his reasoning is not. What is going on is that the authors chose to upweight the data which best matched the average and downweight the outliers, and the cycle is repeated until a weighting is determined for each station. The process effectively narrows the distribution based on best fit to the average. By resampling and running the same algorithm, they found the confidence intervals were very narrow. Their CI, though, measures the ability of the algorithm to pick out a consistent mean value even with reduced data. Whether that consistency represents confidence in the known temperature is another matter entirely, and it is my contention that it does not.
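To make the mechanism concrete, here is a toy sketch (my own Python, not BEST’s actual weighting formula) of an iteratively reweighted mean of the kind described above. The clip limits echo the 1/13-to-2 range quoted from the methods paper; everything else is made up for illustration.

import numpy as np

def reweighted_mean(stations, lo=1.0/13, hi=2.0, iters=25):
    # stations: array of shape (n_stations, n_months)
    stations = np.asarray(stations, dtype=float)
    w = np.ones(len(stations))
    for _ in range(iters):
        mean = np.average(stations, axis=0, weights=w)
        misfit = np.sqrt(((stations - mean) ** 2).mean(axis=1))  # RMS departure of each station from the current mean
        w = np.clip(misfit.mean() / (misfit + 1e-12), lo, hi)    # typical station ~1, outliers pushed toward 1/13
    return np.average(stations, axis=0, weights=w), w

Resampling the output of a routine like this mostly measures how stably the loop re-selects the same central stations, which is exactly the distinction I am drawing.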

William also critiques the lack of an uncertainty analysis for the Kriging model. This is a reasonable critique, but I doubt the function fit to the spatial autocorrelation of surface stations will make much, if any, difference. The real problem is in the application of the Jackknife method and the determination of the CI’s. Keenan also made some critiques of the method which I believe miss the mark as well.

Jackknife

Now, this portion of my post will be fairly detailed and requires some understanding of how CI’s can be calculated by resampling. Where I have trouble is that there are multiple difficulties in understanding the deweighting of a spread of values. The description given in the BEST methods paper is the paragraph just above their equation 36; the process works as follows.

In their case, the weights are calculated 8 times with 1/8th of the data removed. Equation 36 creates an upweighted version of the residual differences between the full reconstruction and the reduced-data reconstructions. The reduced-data reconstructions contain temperature stations which are re-weighted to produce the trends. Now, the authors claim that this variation in result represents the true uncertainty of the total-method mean temperature, but I disagree. What it represents is the ability of the model to choose (upweight/downweight) the same stations in the absence of a small fraction of the data. The resampling methods will necessarily generate a very small CI from this, but the truth is that their algorithm is generating the same mean values within the CI’s presented in the paper. Think about that: they always get essentially the same result, so the band is very narrow. So are these methods a true representation of our confidence in the mean temperature?
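For reference, here is a minimal sketch of a grouped (delete-1/8) jackknife in the pseudo-value form being described. The function and variable names are mine, and the variance formula is the textbook one; it may differ in detail from the paper’s equations 36 and 37.

import numpy as np

def grouped_jackknife(theta_full, theta_reduced, z=1.96):
    # theta_full: estimate from all the data
    # theta_reduced: the 8 estimates, each made with one eighth of the stations removed
    theta_reduced = np.asarray(theta_reduced, dtype=float)
    g = len(theta_reduced)                              # 8 groups in BEST's case
    pseudo = g * theta_full - (g - 1) * theta_reduced   # the "+8, -7" upweighted residual form
    est = pseudo.mean(axis=0)
    sigma = np.sqrt(pseudo.var(axis=0, ddof=1) / g)     # jackknife standard error
    return est, est - z * sigma, est + z * sigma

Only the spread of the eight reduced-data reconstructions enters sigma, so anything that makes those reconstructions artificially similar – such as re-selecting the same heavily weighted stations each time – shrinks the interval.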

The problem is that equation 36 generates independent datasets by upweighting residuals from different runs containing the same data. Each run, though, changes the weight of the underlying data by reweighting the individual temperature series. The central values (the series most like the mean) of the reduced-data runs are effectively upweighted more when data is removed, while outliers experience the opposite effect. The central value is therefore non-normally and non-linearly preferred, invalidating the assumptions of the subsampling methods. More simply, this weighting of the preferred values means that you really don’t have 1/8 less data, which is THE central assumption of the Jackknife method. Because of the weighting algorithm, they have functionally removed less than 1/8th. This is likely the primary reason why subsampling produced even tighter CI’s than the Jackknife, as mentioned in the paper. This is a significant error in the methods paper which will require a rework of the CI calculation and a rewrite of the CI portion of the methods paper.

If they re-ran the Jackknife without re-weighting the data, an improved (and wider) CI could be calculated but care would need to be taken as the removal of weighted data will result in strikingly non-normal distributions.

In all, I like the general idea and the transparency. The re-weighting is a little hinky, and like William Briggs, I would prefer to see bad data excised rather than deweighted, leaving equally treated data that is statistically cleaner. The UHI analysis is not thoroughly vetted enough to support the conclusions they draw about bias in the data; it doesn’t require a skyscraper to screw up a temp station. However, excepting the errors in the CI calculations, the results aren’t far from what we would expect.

63 thoughts on “Overconfidence Error in BEST”

  1. Here’s a simpler way to write it. The authors assume they removed 1/8th of the data, but in reality they have removed less – say 1/20th. The resulting CI from the 1/20th will be more stable than you would expect for 1/8, and so the CI is too tight.

  2. Another way to think of it is if you have 5 stations. Say one is weighted approximately 10 times greater than the other 4, which are equally weighted. This is the same as 10 copies of one station and one copy each of the others. If you eliminate one station which happens to be one of the 4, and the weights are recalculated so that one station again carries roughly 10 times the weight of the remaining 3, the change in the temperature reconstruction would reflect about 1/14th of the error rather than the 1/5th that would be assumed by the jackknife.

    These values are arbitrary of course but if you expand the effect across thousands of stations, you can see that the subsampling and reweighting give artificially low CI’s.
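    A quick numerical version of the example above (the weights are made up for illustration):

    w = [10, 1, 1, 1, 1]               # one heavily upweighted station, four ordinary ones
    removed_weight = 1.0 / sum(w)      # fraction of total weight actually removed: 1/14
    assumed_removed = 1.0 / len(w)     # fraction the jackknife assumes was removed: 1/5
    print(removed_weight, assumed_removed)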

  3. Jeff, have you found the Matlab code that does the jackknife? I’ve been traveling so haven’t had much of a chance to look.

    It’d be interesting to see whether their verbal description matches up with what the code actually does.

  4. Given BEST’s underwhelming contribution to the state-of-the-method-art, I’d say it is pretty likely that this is a social experiment. Muller’s set-up performance is signaling to establish credibility with the target audience.

    It is important to realize that what matters the most in credibility is the audience’s perceptions of the source. That is, the most important factor in persuasion is not whether the speaker really is an expert or trustworthy, but whether the audience thinks the source is an expert or trustworthy. If the audience thinks the source is an expert and/or trustworthy, it is more likely that the audience will be persuaded by that source.
    The Yale Approach

    Curry’s “apostasy” can be seen in the same light (Easterbrook’s early posts on Serendipity serve a similar signaling purpose: “I’m on your team, I’m one of the good guys”). Applying shallow advertising psychology to this issue won’t work. Fun to watch though.

  5. Thanks for the information, Jeff.

    We have all been richly blessed to have the opportunity to witness the workings of an old curse, “May you live in interesting times!”

    How will it work out? Nobody knows.

    The AGW campaign has almost unlimited tax funds and threatens the most cherished values of self-government, as President Eisenhower warned in his farewell address on 17 Jan 1961:

    “The prospect of domination of the nation’s scholars by Federal employment, project allocations, and the power of money is ever present – and is gravely to be regarded.”

    I’m betting that truth will prevail, but it won’t happen overnight.

    Oliver K. Manuel
    http://myprofile.cos.com/manuelo09

  6. Jeff ID:

    The code is difficult to read. Endless separate files with a large number of loops.

    That doesn’t add confidence to the view that their code is doing what they think it is doing.

    They’ve made a bit of a mess of things with their releases. I’m going to give this a rest until what they have online is stabilized a bit.

  7. Carrick,

    I’m hoping that the authors will pay some attention to my critique. Judith Curry has stated that she will point out the error to the authors so I expect they will read it. One point they make in their opening mission statement is ‘rapid response’ to criticism. I didn’t expect to find a big problem like this when I was first reading. Now we shall see how they handle it.

  8. When I quickly read through the BEST paper in question, my thought was that if there is a problem with the methods it might well be in calculating the CIs, and particularly with the jackknife procedure. Thanks for the post, Jeff, and I hope some concentrated effort can be made to either conclusively show the error or confirm the capability of the method used to calculate the CIs. I would think that simulated data where the correct answer (CIs) is known could be used to test the jackknife method used – but I may well be getting ahead of myself with that remark.

    While I think skeptical criticism of all climate-related papers is good, and the timelier the better, some time can be wasted on criticisms that are not legitimate and that stem from a poor understanding of what the paper being criticized is attempting to impart. One advantage of the skeptics is that they do not have to appear to be infallible in order to satisfy advocacy and PR positions.

  9. “Here’s the “reluctant testimony” style narrative they were aiming for: Skeptic Finds He Now Agrees Global Warming Is Real. The early press blitz didn’t quite get it right.”

    Perhaps we will see a similar remark in the future that says the BEST data changes the basis for reconstructions and climate models to such a degree that much uncertainty has arisen regarding the AGW consensus. An unnamed member of the consensus might be quoted as saying, “we thought we had everything calibrated until BEST came along and now we are confronted with explaining a ton of divergences”.

  10. Kenneth,

    The simulated data is a great idea. If I don’t receive constructive replies, that’s probably where I’ll go. The problem seems pretty obvious to me now though so I’m hoping it will be a quick oops and a redo.
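    For what it’s worth, the shape of such a test is simple. Here is a rough Python sketch with synthetic stations, a plain mean as the estimator, and a textbook grouped jackknife – none of it BEST’s actual code:

    import numpy as np

    rng = np.random.default_rng(42)

    def grouped_jackknife_sigma(stations, estimator, groups=8):
        # delete-1/8 jackknife of a scalar estimator, in pseudo-value form
        group_id = np.arange(len(stations)) % groups
        full = estimator(stations)
        reduced = np.array([estimator(stations[group_id != g]) for g in range(groups)])
        pseudo = groups * full - (groups - 1) * reduced
        return np.sqrt(pseudo.var(ddof=1) / groups)

    def plain_mean(s):
        return s.mean()

    true_value = 0.0
    jk_sigmas, errors = [], []
    for _ in range(1000):
        stations = true_value + rng.standard_normal(200)   # synthetic station anomalies
        jk_sigmas.append(grouped_jackknife_sigma(stations, plain_mean))
        errors.append(plain_mean(stations) - true_value)
    print(np.mean(jk_sigmas), np.std(errors))   # the two should agree if the CI is honest

    Swapping a reweighted estimator in place of the plain mean is then the real test: if the jackknife sigma comes out well below the Monte Carlo scatter, that is the overconfidence I’m describing.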

  11. Jstults,

    it doesn’t give much confidence that Muller has been a Believer since the 1980’s, when he split with the Sierra Club over whether to support Nuclear since he thought that was the only reasonable way to decrease Carbon production. This is getting the patina of another scam perpetrated by an activist!! Why is he allowing the media to perpetrate the fraud that he was a sceptic when he is on record as a Warmist?? Like Jones, Hansen and others, he apparently believed before there was temp data to support the belief!!

    http://blogs.dailymail.com/donsurber/archives/44855

  12. I homed in on fig 2 of the Rohde et al. pre-release on processing methods, showing for global annual temperatures how correlation coefficients plot against the separation distance of stations. The following might be of interest in the context of your following quote, Jeff, “The model takes into account spatial correlation (described next) but ignores correlation in time. The model accounts for height above sea level but no other geographic features. In its time aspect, it is a singly naive model. Correlation in time in real-life is important and non-ignorable, so we already know that the results from the BEST model will be too sure.”

    I looked at one station with a 150 year temperature record with both Tmax and Tmin, data as raw as I could find. One station shift of tiny significance.
    So the spatial aspect was put aside in my little study, whose method looked at the 150 year data string and smaller parts of it, with lags. What is R when calculated against the day before or the day after? Or in strings lagged by a month or a year? Or 2 years, or 3 …20? In this time domain, the calculation of R depends on the sampling interval (annual averages differ from monthly from daily from smoothed – maybe also for detrended), Tmin has hugely higher coefficients than Tmax no matter what the sampling interval, R also depends on how long the data string is, R for Tmean depends on how it was calculated from Tmax and Tmin (thus has consequences for instrument type changes); and only once in many calculations of R did I find a value in excess of 0.85.
    If I can’t get better than 0.85 by milking the data for the best result at one location, where does this leave the cases in the paper, which rely on locations separated by distances up to 3000 km and by times taken for weather systems to move from one location to another over the globe?
    My conclusion from an initial look at but one station is that such correlation coefficient calculations have a large error band and that the better values, say those above 0.85, are mainly statistical outliers that could have little to no relation with the weather. Test. Take a few of the best pairs of Tmean stations from Rohde fig 2 and recalculate Tmax and Tmin separately.

  13. Dear Jeff, I tried to understand your criticism but I don’t seem to be getting it. The method looks kind of clever to me. May I ask: is your criticism addressed to the method as previously designed by Quenouille and Tukey, or are you saying this stuff was incorrectly used?

    Also, whether one removes 1/8 or something else is completely irrelevant, isn’t it? If they took 19/20 of the data and 20 reduced datasets, the method would still do qualitatively the same, wouldn’t it? Where did you get the “more accurate” number 1/20 and why do you think it matters?

    The weights “+8, -7” in formula 36 are chosen so that the sum is “+1”, so if you imagine that all the stations produce the same temperature, you get it back. On the other hand, the subtraction has to be there to make them independent. If you only took the average of the 7/8 of the stations, you would get 8 almost equal datasets which clearly wouldn’t be independent, so their small differences wouldn’t be representative of the error.

    All the best
    Lubos

  14. Lubos,

    Thanks for taking the time. I agree that the method is clever and originally I was quite happy with it. The assumption of the method though is that 1/8 of the error is removed each time and the delta is scaled up to +1 as you have correctly written. In this manner the variance in the residual difference represents one instance of the noise error. Where the assumption goes wrong is when the individual stations are weighted before averaging and the 1/8 of the station data is removed before weighting.

    For me the problem is clearer with extreme examples. If the algorithm were to select the weighting such that the best station on earth were always weighted at 1 and the rest at zero, how would the removal of 1/8 of the data affect the ability to select that one station? One of the 8 selections would eliminate that best station, resulting in the choice of the most similar second-best station; the other runs would still find only the one. The difference in the sigma from the one miss would amount to 1/8th of its error, because the rest of the runs would show zero error.

    This isn’t what happens of course, but if you have a group of 25 stations and you have seen the temp data, many of them will be a mess. The top station(s) of the group will be most representative of the mean. Removal of 1/8 of the stations will take away some of the deweighted ones and will create less than the expected 1/8 change in noise variance, because they have been deweighted in comparison to the best ones. The authors have unintentionally violated the 1/8 assumption of the Jackknife method, and of the fractional subsampling methods as well.

  15. Kuhnkat, Muller thinks the hockey stick is broken and has blogged to say as much, and he was genuinely dubious about the temperature records. He does not doubt the greenhouse effect or that human activity can enhance it – he’s a physicist. The conceit in the argument you’ve brought here is that only those who disbelieve x and y about AGW are legitimate ‘skeptics’. This is the antithesis of skepticism. It’s tribal.

    (Sorry for the digression, folks. Appreciate the top post and comments beneath)

  16. “Where the assumption goes wrong is when the individual stations are weighted before averaging and the 1/8 of the station data is removed before weighting.”

    The first step in your communication of your criticism to the BEST authors would then be to verify that they indeed did what you surmised they did above. That fact would need to be established if the authors do not quickly test the system under the correct and incorrect processes, and someone else might want to do a with-and-without test of the results.

    A further concern about BEST that I had was that there would appear to be some critical assumptions made in calculating the breakpoints in the station temperature series. I assume that their algorithm used neighboring station difference series (as in Menne and the USHCN data) to avoid down weighting legitimate climate regime changes. Even so the studies that Menne reported when his algorithm was applied to synthetic station data, prepared by an independent body, indicated that it missed some legitimate breakpoints and found some that were not legitimate. It is not at all clear to me whether the method can truly find a gradually evolving change in a station’s micro climate at this point in my own analyses where I am attempting to combine the CRN rating investigations with breakpoints in the “before” TOB series and the “after” Adjusted series. There are some things I see that do not make sense to me currently – but that could be due to my lack of knowledge at this point.

  17. Bruce,

    It is global land only.

    Kenneth,

    The screengrab of the paragraph above eq 36 is fairly specific as to the method. You are right that the authors need to confirm it. Judith has recently written that they are looking into the more serious critiques, so I hope that this has made the cut. I believe it is the most serious on record right now, but Steve Mc has another one coming that is even more difficult to demonstrate.

  18. Dear Jeff,

    thanks for your answer. You didn’t answer explicitly but I implicitly concluded that you object to the whole original method by Quenouille and Tukey.

    “If the algorithm were to select the weighting such that the best station on earth” – well, this is clearly not the point, is it? In fact, no one but you even assumes that something such as “the best station on Earth” matters. The point of the jackknife method is to find the error margin caused by presence of the holes and bad apples, whatever they are.

    I have a big problem with your very assumption that there is something such as “the best station”. What does it mean? If I understand well, it means a station whose temperature is most correlated with the global mean temperature. But what is “good” about it? This is due to coincidences, and whether or not it’s true, you must still calculate the average over the whole globe.

    Your assumption sounds kind of “warmist” to me; it apparently reflects the warmist ludicrous assumption that there is a single “good temperature” and all stations should obey it and if they don’t, it’s their “problem”. But this is complete nonsense. We are trying to reconstruct the global mean temperature which is just a contrived average of all points on the globe. The points on the globe are doing whatever they’re doing, each of them is doing something else, and something else than the average, but each of them may in principle be measured accurately and be “perfectly good”.

    The calculation of the error margin stands on the clever assumption that the holes and imperfections that already exist in the data will be changed – increased or decreased – if you introduce additional holes. The (root mean square average) amount of change you get by introducing new holes – e.g. in 1/8 of the data – is proportional to the error caused by the holes that are already there. However, if you just removed 1/8 of the data and compute the average, you would get a big underestimate – the proportionality could hold but to get the actual error caused by the existing holes, you would have to multiply the change of the average caused by the removal of 1/8 of the data by a large number which includes something like a positive power of 8, either 8 or sqrt(8) or whatever it is.

    So this is counteracted by “damaging” the data more when you remove a random 1/8 of the data. The data are damaged so that the removed stations are not just removed; they’re counted with the opposite sign. With the right coefficients, which I guess are those indicated, the width of the distribution of the new “damaged average” will coincide with the width of the holes that are already there. To say the least, the dependence on the number “8” will be eliminated in working examples.

    I don’t recognize your criticism as being relevant for this method at all. You seem to be solving a different problem, the search for a “best station”, something that doesn’t make sense to me.

    Yours
    Lubos

  19. Jeff–
    On the simulated data – I think if something non-standard is being done, then the BEST team ought to be required to do this test on simulated data to show things are right. This isn’t something that ought to be incumbent on you.

  20. Lubos,

    Thanks again.

    “You didn’t answer explicitly but I implicitly concluded that you object to the whole original method by Quenouille and Tukey.” –

    – No, their method is fine. Sorry for the confusion. I do like it quite a bit and think it is a clever method.

    “I have a big problem with your very assumption that there is something such as “the best station”. What does it mean?”

    You misunderstood; it isn’t me assuming which station is best, it is the Berkeley method weighting the ‘best’ station according to their criteria, which apparently compares each station to the mean, iteratively re-weighting each one (good higher than bad) to a convergence point.

    The calculation of the error margin stands on the clever assumption that the holes and imperfections that already exist in the data will be changed – increased or decreased – if you introduce additional holes. The (root mean square average) amount of change you get by introducing new holes – e.g. in 1/8 of the data – is proportional to the error caused by the holes that are already there. However, if you just removed 1/8 of the data and compute the average, you would get a big underestimate – the proportionality could hold but to get the actual error caused by the existing holes, you would have to multiply the change of the average caused by the removal of 1/8 of the data by a large number which includes something like a positive power of 8, either 8 or sqrt(8) or whatever it is.

    This is what eq 36 does so we agree on this completely.

    So this is counteracted by “damaging” the data more when you remove a random 1/8 of the data.

    This is where we part ways – a little. They say they are removing 1/8th, and they do remove 1/8th of the temperature stations, but we cannot forget that the stations are preferentially weighted both before, with 100% of the data, and re-weighted again after the removal. The “best quality” stations (whatever that means) are the ones most represented, so you have not removed 1/8th of the error variance as expected. In fact, you have removed far less – caused less than 1/8th the damage which is assumed by eq 36. This is what results in an inaccurate estimation of the CI.

    Hopefully that resolves the confusion.

  21. Jeff said: The “best quality” stations (whatever that means) are the ones most represented, so you have not removed 1/8th of the error variance as expected. In fact, you have removed far less – caused less than 1/8th the damage which is assumed by eq 36.
    If I am following you correctly, maybe another way to put it would be that they have fewer effective degrees of freedom for error estimation because they are using some of them up to re-estimate the weights. Is that what you are arguing?

    I agree with Lucia; establishing the credibility of a novel method should fall on the person proposing the method.

  22. Dear Jeff, I am ready to agree that it’s plausible that there is a wrong numerical coefficient in front of the error margin computed in this way. But doesn’t your latest comment indicate that you might agree that if this possible error were fixed, the formula would be correct?

    If one divides the data into 8 parts, the global average is computed as the sum of these multiples:

    (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8).

    On the other hand, the crippled averages look like this:

    (0,1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7). Equation 36 takes 8 times the latter minus 7 times the former, so you get numbers like:

    (-7/8, 15/56, 15/56, 15/56, 15/56, 15/56, 15/56, 15/56).

    The removed eighth of the data is the single most influential station in the jackknifed time series: negatively. However, the remaining 7/8 of the data in total contribute 15/8 if you imagine they’re the same thing. That’s more than the 1/8 separately.
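    A quick numerical check of those coefficients (my own arithmetic, nothing from the paper):

    full = [1.0 / 8] * 8               # equal weights in the full global average
    crippled = [0.0] + [1.0 / 7] * 7   # one eighth removed, the rest renormalized
    combo = [8 * c - 7 * f for c, f in zip(crippled, full)]
    print(combo)        # -7/8 for the removed eighth, 15/56 for each of the others
    print(sum(combo))   # sums to 1, so identical stations reproduce the same average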

    Now, some steps are missing in the calculation because I don’t really know what we’re calculating, how the “real error margin” may be quantified. To guess how their formula 37 is related to the error margin, I would have to apply it to a particular example where I “know” what the error margin should ultimately be. You have probably done this job, haven’t you?

    As far as I can see, BEST never talk about “best stations” – they take an agnostic attitude who is “good” – and I would agree with that attitude.

    Yours
    Lubos

  23. Lubos,

    In your examples, everything adds to 1. In the BEST algorithm, due to the reweighting of stations in each data-reduced step, there is no such limitation. This is the violation of the basic assumptions I referred to in the headpost. The jackknife assumes everything adds to 1. My guess, just from looking at the tiny CI, is that their weighted results aren’t even close to 1.

    “But doesn’t your latest comment indicate that you might agree that if this possible error were fixed, the formula would be correct?”

    Sure. If the error were fixed, the answer would be ok. Trust me, I’m not some kind of whackjob denialist who thinks the problem cannot be solved. I just don’t know how to solve it with the current methodology because the weighting makes it messy and no I haven’t taken the step to demonstrate it with known data yet. I’m rather hoping that the authors will understand my critique and address it themselves. If they don’t, I’ll have something to blog on for a while.

  24. Jeff, I have not read the paper on how they have produced the dataset as I am sure it would be way above my head.
    However I have looked at the data file.
    Have you looked at the actual data?
    It appears to have a rather large number of errors in it for something that is supposed to be a “Quality” dataset.
    It has numerous apparent incorrect signs, i.e. minus signs when all the surrounding data is positive and no minus signs where the data is all negative.
    But the strangest aspect is the number of years where the average for some of the winter months is greater than that of the summer months.
    I asked the question on JC’s forum and Bob Koss replied that it was probably something to do with “It has been detrended by removing the seasonal signal”.
    Does de-trending produce that kind of result?
    As I replied to Bob it is not showing up every year just some of them.

    REPLY: I have not looked at the BEST data in detail. I believe it is presented in RAW format, which I prefer as it hasn’t been tweaked yet. The plus and minus signs could be the result of anomaly calculations designed to remove the seasonal variance – dunno. I wouldn’t refer to that process as ‘detrending’ though. There are a lot of sources for this data and my understanding is that it has been taken from the GHCN archive. There’s no magic in it, just a bit of noise. Well, a lot of noise, but that is just semantics.

  25. Lubos

    I would have to apply it to a particular example where I “know” what the error margin should ultimately be. You have probably done this job, haven’t you?

    I think Jeff’s question is one that needs to be raised. It sounds plausible that their CI’s are wrong. My thought when I read this was: did the authors run Monte Carlo to see if the method works on synthetic data? Synthetic data (if properly concocted) is precisely an example where one “knows” what the error margins should be. Ideally, the authors of BEST should explain why their method gives good confidence intervals – I suspect they will have to run Monte Carlo.

    I don’t think Jeff needs to do this to scratch his head, make a plausible case that the CI’s are wrong, and suggest the question. BEST is making a claim, and they should be the ones to answer. (I have no reason to think they won’t. )

  26. Gut instinct does suggest that there is something rather unacceptable about jackknifing: imagine one influential / high-leverage point in an otherwise bland regression; most single-deletion jackknife results will suggest that we can be very confident in the answer (and even when the influential point is deleted it will appear only as a single point out on the tail of the distribution of results).

    OK, I haven’t read how BEST works, but if there is an element of upweighting stations that agree with each other, then the situation is surely even more complex. Imagine that the heart of this, in a region, is 4 out of 20 stations [let’s say these 4 had airconditioning units fitted at the same time :-) ]; then it would take more than a 1/8 deletion to reliably break this up.

    To pick these sorts of things up one would need to delete half each time, something more ‘Texas Chainsaw Massacre’ than 1/8 jackknife.

    Would jackknifing the proxies that went into the mannomatic have told anyone anything useful?

  27. Keenan also made some critiques of the method which I believe miss the mark as well.

    Perhaps you could explain why you believe that my critiques miss the mark.

    Thus far, I have not seen valid criticisms of my critiques. Lucia and Tamino gave invalid ones; Judy Curry gave an invalid summary. Your work on the Methods paper is not relevant to my critiques, which were of the Decadal Variations paper and the UHI paper:

    http://www.informath.org/apprise/a5700.htm

  28. Chas,

    I don’t have a problem with the Jackknife algorithm. The method is actually necessary for the BEST algorithm to work. As is often the case with stats, though, their CI represents an answer to a different question than what was intended: how accurately does this curve represent our knowledge of the mean temperature?

    Douglas,

    I had a particular disagreement with your interpretation of what the model should/must incorporate. I did not read your critique in the context of the single decadal temperature paper, but some of the comments on autocorrelation and smoothing were strong. I may do a post later on it if you like. This post was going to be my own impressions, yet it got a little sidetracked when I figured out what was going on in the Jackknife section.

  29. Looking at this from the software side, I think it is absolutely essential to test this algorithm, as the authors implemented it, for properties such as bias, accuracy, and stability with synthetic data. Generating the synthetic data with ‘proper’ statistical properties, but without injecting some form of a priori knowledge, would be a challenge in itself.
    It would be also interesting to know with what dataset the authors tested their code for debugging.

  30. Doug-

    Thus far, I have not seen valid criticisms of my critiques. Lucia and Tamino gave invalid ones; Judy Curry gave an invalid summary. Your work on the Methods paper is not relevant to my critiques, which were of the Decadal Variations paper and the UHI paper:

    You sent me an email which included your own diagnosis of the merits of your arguments. I disagree with your assessment. I replied that comments were open, and you could post if you wish.

    I would be happy to respond to whichever of your arguments you consider strong enough to put in the public view. I haven’t seen any comments by you at my blog, and I haven’t seen a ping. Have you posted your argument anywhere? If yes, I’d love to see a link. I think it is an inefficient use of my time to go back and forth privately. I would be more than happy to create a post quoting your argument and my own responses. Drop a note here or at my blog and I’ll go ahead and respond.

    I can’t speak for Judy and especially not Tamino, but I bet Tamino would be happy to read your response to whatever he posted. Have you considered engaging what he said in public? If you have, could you provide a link so we can all read your argument explaining why you think what he wrote is not valid?

  31. Doug,

    You are welcome to post here as well so you have all the outlets you want. The readers are mostly tired of my lack of time at this point. The moderation stinks though.

  32. Jeff, I need more time for my old mind to wrap itself around the problem as you describe it and why Lubos is not understanding it, but when you refer to station weighting, which as I recall BEST says ranges from 2 to 1/13 (and rather arbitrarily in my view), do you consider a weighting of 2 as if a station weighted by 2 is 2 stations and a station weighted by 1/13 is 1/13 of a station? Please note that I am too old to worry about asking dumb questions.

  33. Kenneth,

    I’ve been around you too long to accept the ‘old’ comment. I’m sure that we could have a nice ‘old’ chess match sometime. Of course the questions aren’t stupid. Lubos has proven beyond a doubt that my explanation is poor. Hopefully, the authors are familiar enough with their own work to get my point.

    Here is how I think of it. Error in station data is random, and being random it is naturally orthogonal (uncorrelated). Error comprises multiple things; in this case error is defined as anything local to a temp station which disturbs the average value of temperature: climate, wind, humidity, sun, shade, massive air conditioning units, etc. If temp stations are combined by weighting:

    Tave = (T1*w1 + T2*w2 + ... + Tn*wn) / (w1 + w2 + ... + wn)

    Saying that one station has a weight of 2 and another has 1/13 is mathematically equivalent to 26 copies of one and 1 copy of the other. Elimination of the 1 means little to the average of the 26 others even though it looks like a 50% elimination of the data.

    The Jackknife, as presented, assumes equal weights for all stations. If you eliminate 1/8, then you have to scale the noise by 8 times to equal 1 – eq 36. This works fine if your elimination is truly 1/8 of the noise. In this case it is not equal to 1/8 of the noise, because the stations don’t each carry equal weight in the sum. That is why I’ve stuck my neck out on this. The authors have made a math error and will have to correct the CI of the paper.
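    A tiny check of that equivalence (the numbers are made up; only the weighting arithmetic matters):

    import numpy as np

    a, b = 10.0, 20.0                                      # two hypothetical station values
    weighted = np.average([a, b], weights=[2, 1.0 / 13])   # weights of 2 and 1/13
    copies = np.mean([a] * 26 + [b])                       # 26 copies of one, 1 copy of the other
    print(weighted, copies)                                # the two averages are identical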

  34. Jeff, you said to kenneth “in this case error is defined as anything local to a temp station which disturbs the average value of temperature, climate, wind, humidity, sun, shade, massive air conditioning units etc”.
    Aren’t “climate, wind, humidity” etc. all natural, and therefore shouldn’t they be left uncorrected, unexcluded and unmodified, since that temperature is what the local area is experiencing?
    This kind of weighting system lends itself to data manipulation to show just about anything you like, depending on Who and How the choices are made as to what is a good and bad station.

    REPLY: These are sources of variance in the data. They are not to be excluded, but their effects mask any underlying warming trend. The degree of masking is what creates the confidence intervals. By specifically excluding (deweighting) some of this masking, the CI’s of BEST are artificially narrowed. It isn’t a matter of choosing a good or bad station, and I don’t have a problem with the concept of weighting; it is a matter of quantifying how much variance is created by high-frequency non-climate processes and determining how that affects the true knowledge of temperature.

  35. Jeff Id,

    You’re saying that you have some criticism, but not substantiating it – this is not what I hope to hear from a reasonable person.

    Tamino does not allow me to post comments at his blog.

    Following is the text of the message that I sent to Lucia.

    =================================================

    Your recent blog post “BEST data: Trend looks statistically significant so far” includes the following statement.

    Since smoothing is discussed in Keenan’s letters to the economist,
    I will note that this data is computed by averaging over 12 months. So, it is “smoothed” relative to monthly data.

    If we have a series of n monthly values, and obtain from that a series of n/12 annual values, then we are not doing smoothing in the sense I intended; rather, we are doing aggregation. Aggregation is fine. Smoothing is problematic, and the problem is easy to understand.

    Suppose that our original series is a1, a2, a3, a4, …, an, and we are taking a 3-point moving average. Denote the smoothed series by b1, b2, b3, b4, …, bn-2. Then b1 = (a1 + a2 + a3)/3,
    and b2 = (a2 + a3 + a4)/3, etc. Notice that both b1 and b2 depend upon a2 and a3. Hence b1 and b2 are correlated with each other. In other words, the smoothed series is more autocorrelated than the original series. That is the problem that smoothing (via a moving average) introduces.

    Suppose, on the other hand, that we simply aggregate the original series, to obtain c1, c2, c3, c4, …, cn/12. Then c1 = (a1 + a2 + a3 + … + a12)/12 and
    c2 = (a13 + a14 + a15 + … + a24)/12, etc. Thus c1 and c2 do not depend upon the same elements of the original series. Ergo, the aggregation does not introduce autocorrelation.
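    A quick numerical illustration of the point, using nothing but white noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(120000)                           # uncorrelated monthly values
    smoothed = np.convolve(x, np.ones(3) / 3, mode="valid")   # 3-point moving average
    annual = x.reshape(-1, 12).mean(axis=1)                   # non-overlapping 12-month aggregation

    def lag1(s):
        s = s - s.mean()
        return (s[:-1] * s[1:]).mean() / s.var()

    print(lag1(x), lag1(smoothed), lag1(annual))
    # roughly 0, about 2/3, and roughly 0: the moving average introduces
    # autocorrelation, the aggregation does not.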

    Your post also says this.

    Doug’s claim that we might conclude the “IPCC assumption is insupportable” might be convincing to me if I bought for one second that it
    makes sense to use d=1 in ARIMA(3,1,0). I don’t. I think that arguments for statistical models with d=1 tend to violate the 1st law of thermodynamics.

    The physical plausibility of an ARIMA(p,1,q) could be questioned, at least on long time scales; on the other hand, ARIMA(p,1,q) might be a reasonable approximation, on time scales of interest here, to a more physically-plausible process. As an analogy, Earth is approximately spherical, but if someone is drawing a map of England, it is reasonable to assume flatness. Similarly, given the shortness of the time series, ARIMA(p,1,q) could be reasonable.

    In any case, the comparison of the ARIMA(p,1,q) model and the IPCC model strongly indicates that the IPCC model is failing to explain some substantial structural variation in the data–and that is the sole purpose of the comparison. Thus the comparison gives insight, regardless of physical plausibility.

    Your post further says the following.

    That is: using the preliminary BEST data going back to 1800, the IPCC AR(1) blows the model favored by Doug Keenan in his Wall Street Journal out of the water. The model that says “Statistically significant warming” wins.

    Update(May 25): If I use annual averaged best data, the model that wins reverses. As already promised below, I’ll be discussing other models.

    In the first case, you were using seasonal (monthly) data, which cannot be directly compared like that–as your Update effectively showed.

    Additionally, the ARIMA model was not “favored” by me; it was solely used for comparison with the IPCC model. Indeed, the WSJ piece ended by saying that more, and difficult, research was required.

    To summarize, your criticisms are invalid.

  36. Douglas,

    “You’re saying that you have some criticism, but not substantiating it – this is not what I hope to hear from a reasonable person.”

    You are far too pressed for time apparently. I can try to get back to it later and we can argue those finer points. In the meantime, what are your thoughts on the reweighted results being run through Jackknife? Isn’t that more interesting than whether the IPCC used the right ARIMA model?

    Tamino doesn’t let me post either.

  37. Jeff Id,

    I have not read the Methods paper. For me, the BEST project is of little importance; so I do not want to put much time into it. I only got involved because I was asked to by The Economist.

    About the Jackknife, and bootstrap, generally…. Real data almost always has structure. When selecting, or deselecting, some of the data, it is important to preserve the structure, before doing analysis. Preserving the structure, though, typically requires knowing the structure better than we do. Hence I tend to be somewhat wary of these techniques.

    That’s interesting about Tamino!

  38. “I believe the result is very close to the actual global surface temperature average minus some unknown amount of warming by UHI.”

    I find this statement astounding. The BEST result is the product of raw data which at most covers 15% of the surface of the planet in any meaningful way. Even if the data were pristine, with zero measurement error and infinitely dense coverage both spatial and temporal, it would still have enormous uncertainty as a proxy for global surface temperature. As presented it’s a WAG.
    Not to mention it is hugely different from the satellite data set which has been running for some 33 years, covers nearly the entire planet, is vetted by two separate groups at least mildly antagonistic to each other, and has been extensively tested against balloon data.

    I have no inside track on the motives of the BEST group but their pr blitz and their spin would lead me to believe that they are full on board the CAGW bandwagon and are doing their BEST to keep the gravy train going.

    REPLY: It should say land surface. You are right about that but satellites measure a thick layer and again, we don’t know how much UHI contaminates the data. There are reasons to be skeptical of the series but we need to be careful not to be overly so.

  39. “I’ve been around you too long to accept the ‘old’ comment.”

    OK I’ll check you off my list of people who might fall for my “old” routine.

    Thanks for your explanation. It is in line with what I thought you were arguing. And thanks for keeping (so far) the thread mostly on track.

    By the way, chess skill is an excellent indicator of mental losses that come from aging and, on a more immediate time scale, consumption of alcohol.

  40. Doug–
    As you can see from the trackback, I posted my response.

    FWIW, I too am banned from Tamino’s. Lots of people are banned from Tamino’s. That doesn’t prevent anyone from responding to Tamino if they wish. You could post at your own site or request a guest post. Any number of bloggers would permit you to guest post. Bishop Hill posted the letters Tamino criticized; I suspect that might be your first choice.

  41. Kenneth Fritsch:

    “By the way, chess skill is an excellent indicator of mental losses that come from aging and, on a more immediate time scale, consumption of alcohol.”

    Forgive me all, this is further O/T, but:

    I worked in a psychological research outfit in 1964. This place ran under Skinnerian assumptions of how the world worked. One of the studies was addressed to developing a series of tests which could detect and quantify incremental degradation of performance linked to dosages of whatever was of interest to the client. A very good test was to put a stereo headset on the subject, then read 4 digit numbers into one, then the other ear, separated by perhaps 15 seconds, then pause for 60 seconds and have the subject write them down. The intervals could be varied, the number of “numbers” varied, and the delay before the writing adjusted.

    With this scheme devised, several test stations were set up at one of the client’s facilities and hundreds of subjects (men between ages of 17 and 35) were subjected to these tests.

    The gross result was that the accuracy with which this could be done peaked at about 18 and it was downhill from there. But a significant number of the subjects proved noticeably less capable. Clearly, more study was required. There was a contract modification, and the investigators interviewed the subjects.

    The difference was alcohol. The quantity ingested by event, week, month, or year was not sensitive unless it was close to none. In less scientific terms, any alcohol more than a trace blew this capacity away.

    Those of us who became familiar with this study hoped to hell that this particular ability would never have any useful application in our lives.

    Sorry, Kenneth, but now you can worry about this too.

  42. #55 J Ferg,

    That test of yours sounds remarkably like ‘in one ear and out the other’. That, alcohol consumption, and lack of attention – did you check the correlation against ‘being married’?

  43. #57 Chuckles.

    I doubt if more than a few of the subjects were married given their line of work – heavy duty dispute resolution. Also, no one thought of it, IIRC. What was clever about the test was that remembering 4 or 6 four digit numbers read one at a time into one ear and then the other and alternating followed by a time-out is much harder than if you hear them with both ears simultaneously. I didn’t mention it, but there was a cash payout for correct sequences and it was enough that it was believed that they were getting everyone’s best shot.

  44. So Tamino has banned Lucia, Jeff and Donald, all statisticians; alternatively, I can barely count my toes and I too have been banned. How egalitarian! Who said the alarmists were not democratic?

    Maybe we can start a ‘banned from tammy’s’ club?

  45. BEST promised to be transparent and provide a rapid response. Let’s see:
    1. They used 39,000+ stations. What is the number vs. year between 1950 and 2000?
    Is there the sudden decline after 1970, as seen for GHCN?
    2. Investigate the “demographics” of the stations used:
    A] Repeat plot #1 separately for cooling stations (~one-third) and warming stations
    B] Repeat plot #1 for the Tropics, NH and SH
    C] Show the number of airport stations vs. year
