Time to Fix the thermometers

I’ve updated this article since yesterday to include Chad’s blogland work on the subject. Climate science needs to address this issue, as it is central to its conclusions.

——-

I was reading at Niche Modeling today and ran across another McKitrick and McIntyre paper on trends in models versus measurements. This paper also includes Chad Herman, who, judging from the comments, is probably Chad from treesfortheforest, who replicated and extended Santer’s paper in blogland some time ago. This, however, is the first peer-reviewed and published correction to the Santer08 paper, which claimed that the models were supported by the data. It’s published in Atmospheric Science Letters, a statistically oriented journal, and it demonstrates that model trends are well ahead of the data.

First, Steve’s previous reply to Santer was rejected by the journal it was sent to. This is despite the fact that it used the same methods and data as the original; it just updated the data to 2009 and came to a different conclusion. From the climategate emails, we know that criticism of models is not acceptable in climate science.

The paper is not behind any paywalls and can be read here. I hate paywalls.

Anyway, models are often compared to surface-level data rather than to the satellite data. The satellite data measure a thickness of atmosphere without interference from ground clutter, so the comparison should actually be better.

But it’s not! The best image in the paper is the comparison below, with Santer-style confidence intervals. This passed review in a statistics-oriented journal.

Models are running 2 to 4 times higher in trend than the measured data over the same interval.
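
For readers who want to poke at the numbers themselves, here is a minimal R sketch of the kind of trend-plus-confidence-interval calculation being discussed: an ordinary least squares trend on a monthly anomaly series with the AR(1) “effective sample size” adjustment to the trend standard error that is commonly used in this literature. The function and the synthetic series below are mine for illustration, not data or code from the paper.

ar1_trend_ci <- function(anom) {
  # OLS trend on a monthly anomaly series, with an AR(1) effective-sample-size
  # adjustment applied to the trend standard error.
  n    <- length(anom)
  time <- seq_len(n) / 12                      # decimal years, so the trend is per year
  fit  <- lm(anom ~ time)
  b    <- coef(fit)[2]
  se   <- summary(fit)$coefficients[2, 2]
  r1   <- acf(resid(fit), lag.max = 1, plot = FALSE)$acf[2]   # lag-1 autocorrelation
  neff <- n * (1 - r1) / (1 + r1)              # effective number of independent samples
  se_adj <- se * sqrt((n - 2) / (neff - 2))
  c(trend_per_decade = 10 * b,
    ci_low  = 10 * (b - 1.96 * se_adj),
    ci_high = 10 * (b + 1.96 * se_adj))
}

# Synthetic stand-in for a 31-year monthly series (372 months), illustration only
set.seed(1)
fake_anom <- as.numeric(arima.sim(list(ar = 0.7), n = 372)) * 0.1 + 0.001 * (1:372)
ar1_trend_ci(fake_anom)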

In our example on temperatures in the tropical troposphere, on data ending in 1999 we find the trend differences between models and observations are only marginally significant, partially confirming the view of Santer et al. (2008) against Douglass et al. (2007).

The methods presented nearly confirm the Santer et al. paper, which, although published in 2008, didn’t use any data after 2000. Why is anyone’s guess, except for those of us who recall the quote from Trenberth in 2009.

The fact is that we can’t account for the lack of warming at the moment and it is a travesty that we can’t. The CERES data published in the August BAMS 09 supplement on 2008 shows there should be even more warming: but the data are surely wrong. Our observing system is inadequate.

Uh oh!

Anyway, this is why there are scientific skeptics out there.  We know the models are having problems, and so do the other guys.  We know the climate environmentalists won’t allow people to point them out in the IPCC or in the journals, yet we know the problems exist.  How fun is that?!

Anyway, another fun quote from the conclusions.

The observed temperature trends themselves are statistically insignificant.

I find that quote a little surprising, but it just means the methods they used estimated large confidence intervals — a plus for the climate modelers.

I do like to save the best for last.  It is from the paper, so the bold is mine, of course.

Over the 1979 to 2009 interval, in the LT layer, observed trends are jointly significant and three of four data sets have individually significant trends. In the MT layer two of four data sets have individually significant trends and the trends are jointly insignificant or marginal depending on the test used. Over the interval 1979 to 2009, model-projected temperature trends are two to four times larger than observed trends in both the lower and mid-troposphere and the differences are statistically significant at the 99% level.


The 99% level when you use all the data. OUCH, that’s not good.

Anyway, check out the paper, and call your favorite climate modeler to let them know they’ve got the wrong answer. Either that or call the balloon and satellite guys and tell them to add some trend back in.

88 thoughts on “Time to Fix the thermometers”

  1. Nice summary. I agree that the graphic is a doozie. Do you know who the third author is?
    It would also be interesting to see what the reviewer comments looked like. It might help to determine whether the “worm has turned.”

    OT a bit, but Steve McIntyre did a nice job at Keith Kloor’s Collide-a-scope. Alas, Gavin declined to respond.

  2. There’s been a lot of back-and-forth on claiming significance for model-observation trend differences, and this paper doesn’t represent an advance. The differences vary a lot more than often-claimed analyses would indicate, and the reason is usually failure to account for all sources of variance.

    That’s the case here. The model trends are shown with small error levels, and that’s the reason for the high significance. But if you look through Table 1, yes, the individual models show small s.e.’s, and that reflects fairly small fluctuations within each calculation. But the differences between model trends are very large. There’s a lot of scatter that just doesn’t go into the significance test.

    So it may well be that their panel test is correct as far as it goes. But it isn’t a test of “models” vs observations. It’s a test of some observations vs a particular sample of models, ignoring between-model variation.

  3. Re: CoRev (Aug 7 17:22),
    No, not at all. It’s just saying that, if you had chosen a different set of models, you would have got a different result, and this is a source of variation which should go into a significance test like any other.

    However, on reading the Data section more carefully, I see that although they’ve plotted the models in aggregate, they put them in the panel regression separately, so this may give some accounting for between-model variation.

  4. Nick:
    Weren’t MMH somewhat constrained to replicate Santer’s treatment of the models up to the point of conducting the actual analysis?

  5. Nick, The issue of properly representing the variance is the point of the thread of research that has gone from before Douglass to Santer and now this. How to estimate the variance of observations when you have only one is one main question addressed.

    The individual models mostly have one sample. There are always ‘some’ observations and a sample of model runs anyway, what else is there? That’s why we have stats, to estimate population characteristics from sample characteristics.

    Your statement that between-model variation is ignored is wrong.

    They agree with Santer with data from 1979 to 1999 (and reject Douglass). This would show that Santer and M&M are not in disagreement about the sources of variation.

    All they did was update it with new data (which Santer deleted as per the CRU ‘trick’) using the same approaches as Santer and found the difference is significant.

  6. Could seasonal differences be part of the cause of the measurement variance? It looks like the model output is almost “locked down”.

  7. Re: Bernie (Aug 7 17:53),
    Well, they don’t seem to have done that. They show in Eq 5 how Santer made an AR1 correction, which seems to be generally accepted as necessary. They criticise Santer’s omission of a correlation between the slopes, but say nothing more about AR1, and as far as I can see, haven’t done it.

  8. Nick #2
    “…this paper doesn’t represent an advance.”

    In your opinion, which papers do represent an advance?
    A few examples would be educational (for me) and appreciated (by me).
    Cheers

  9. Re: David Stockwell (Aug 7 17:58),
    “All they did was update it”
    Some people are saying that, but I can’t see that MMH are. Their method looks quite different, and it’s not clear to me that it does the same things, particularly the AR1 correction.

    A caution that people might like to bear in mind – they also find that UAH and RSS have significantly different trend.

  10. Nick, “All they did was update it” replication of Santer occurred in the earlier arXiv papers. The latest paper brings an even more rigorous econometric method to the field, AFAIK. The results, however, are the same.

  11. #12 – Sonicfrog

    Monckton’s graphs were debunked by Lucia quite a while ago, so the RC post is nothing new. What I find interesting is that they debunk the easy target which has been around for a while but ignore Lucia’s much better work on this topic. I wonder what has happened to her paper, which is supposed to be in the pipeline.

  12. Chad has duplicated Santer et al’s (08) methodology for comparison of models with satellite troposphere data, but extended the analysis to 2009. There are many statistically significant discrepancies between the models and the satellite data when the analysis is extended past the limited satellite data that Santer et al considered.

    Nick Stokes: Have you reviewed Chad’s work? I found it very solid. (see http://treesfortheforest.wordpress.com/2009/12/11/ar4-model-hypothesis-tests-results-now-with-tas/ and earlier posts on the same subject)

    Chad shows that most of the models are demonstrably ‘wrong’ at >95% confidence when compared to the whole satellite record, using Santer et al’s methodology.

  13. Nick,

    I also find it astounding that 1) Santer et al limited their analysis to the range of data which (just barely) says the models are not refuted with 95% confidence, and 2) neither Santer et al nor anybody else working in ‘climate science’ has applied Santer et al’s methodology to the whole of the satellite record.

    Bad. Shamefully bad.

  14. #15 Chad is the 3rd author I presume. #16 “Bad…” They simply do not care about validation. I have done a few papers now with “updated data show the claims of xyz are not warranted” and the response is one of:

    1. by the eyeball method the models show a “pretty good match”
    2. the observations will eventually match the models, it’s just too soon to tell
    3. it’s not worth publishing (even though the original claim on a non-significant warrant was!)

    Santer’s methodology seems fine and useful to me, though M&M&C point out it’s not exactly right and extend it in a useful way. The only reason the field is not all over validation of the models is that they don’t want to report the answers.

  15. Re: David Stockwell (Aug 7 23:15),
    One good reason for being cautious about validating models with experimental data is that you may not be sure about whether the data is correctly measuring the same thing. And that’s reasonable here. Let me say again – they found that UAH satellite data was significantly different in trend to RSS. Does that mean RSS is invalidated? UAH?

  16. #18 – Nick

    The UAH/RSS differences are based on their splicing algorithms, which affect the long term trends but not the year to year measurements. A paper by Randall concluded that the UAH algorithm is likely better. I suspect the RSS splicing algorithm was deliberately tweaked to get better agreement with the surface temp trends.

  17. Nick Stokes, in a different paper – submitted to IJC as a comment on Santer et al – available on arxiv.org, we did an exact replication of Santer’s method, AR1 and all, on updated data and got results that contradicted Santer’s. This paper was rejected twice. Rest assured that Santer’s results do not hold up with updated data.

    In one of the Climategate emails, Santer coauthor Peter Thorne, who appears to have been one of our reviewers, snickered to Jones about the rejection of our submission.

  18. #18 “Let me say again – they found that UAH satellite data was significantly different in trend to RSS. Does that mean RSS is invalidated? UAH?”

    Nick, “As shown in Table 1 and Section 3.3, the model trends are about twice as large as [UAH and RSS] observations in the LT layer, and about four times as large in the MT layer.”

    The difference in trend between UAH and RSS is interesting, but not germane to the fitness of the models wrt the range of observational datasets. They differ significantly from ALL of the observational datasets.

  19. SM;
    but, but, the ethics boards etc. said there was no malfeasance, just boys being boys! Surely you’re not implying they were evil, not just wrong?!?

    😀

  20. Re: TimG (Aug 8 00:06),
    OK, RSS and UAH use a splicing algorithm which causes their trends to differ significantly from each other – models use none. The trend difference is being cited here as the validity criterion. Why, on that basis, could the splicing algorithm not produce a significant difference between models and satellite data?

  21. Re: Steve Fitzpatrick (Aug 7 22:05),
    OK, I’ve re-reviewed Chad’s posts, which I guess are the basis for saying that Santer’s method was replicated. And it does seem to do that. The diagrams are very clear, and illustrate again my objection from above that the significance test includes weather noise and model-simulated weather noise, but not the scatter of the model means (nor did Santer’s). In other words, the variability due to the choice of models. You can see that this is important, because most plots showing significance in fact show outliers which will dominate the result.

    You don’t have to take model scatter into account if you want to say that this particular model subset is significantly different from observations. But you do if you want to make that statement about models in general.

  22. Re: Tony Hansen (Aug 7 18:36),
    OK, I should modify that – the paper may represent an advance. I thought based on the Figs that it was comparing the means without autocorrelation lags (AR1), but I see that the AR1 parameters do appear in the model in Eq 11. But only in the diagonal terms – I can’t for the moment see the basis for that.

  23. #25:

    http://arxiv.org/abs/0908.2196

    S08 used a modified t-test (their d1*) to test the statistical significance of the difference between the model ensemble trend, denoted b, and the observed trend, denoted b0, defined in their equation (12) … where s² is the square of the ‘inter-model standard deviation of ensemble-mean trends’ …

    And Santer 08

    The hypothesis H2 tested in the second question involves the multi-model ensemble-mean trend.
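
    Schematically (and not claiming the exact S08 weighting), a d1*-style comparison treats the per-model ensemble-mean trends as a sample and tests their mean against the observed trend. A generic R sketch, where model_trends, obs_trend and obs_se are assumed inputs:

    d1_style <- function(model_trends, obs_trend, obs_se) {
      # generic two-sample style statistic: multi-model mean trend vs observed trend,
      # using the inter-model spread and the observed trend's (adjusted) standard error
      m  <- length(model_trends)
      bm <- mean(model_trends)
      s2 <- var(model_trends)        # square of the inter-model s.d. of trends
      (bm - obs_trend) / sqrt(s2 / m + obs_se^2)
    }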

  24. Just a quick comment on the difference between UAH and RSS since it was brought up. The two series had a step difference in them due to a NOAA-11 satellite correction. When this step is removed, the UAH/RSS match is almost perfect.

    The first link is John Christy’s comments on the topic; he used radiosonde data to determine which series was right.

    https://noconsensus.wordpress.com/2009/01/15/1853/

    When I used a GISS comparison before the step and after the step I got a similar result. This was before I saw his work.

    https://noconsensus.wordpress.com/2009/01/14/give-a-kid-a-toy/

    My result was noisier, though, in that it was a bit sensitive to the before-step and after-step data. I want to redo this someday from gridded data rather than bulk global data.

    The point is, though, that the two series are close enough to be considered identical except for this one step. In recent years UAH has chosen to switch to a station-keeping satellite while RSS stayed with a NOAA satellite whose decaying orbit has to be corrected for, so they are probably slightly divergent again.
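
    For anyone who wants to repeat the exercise, a rough R sketch of the de-stepping idea: regress the RSS-minus-UAH difference on a step dummy and remove the fitted step. The break date and the vector names uah, rss and dates are placeholders, not the values I actually used.

    step <- as.numeric(dates >= as.Date("1992-01-01"))   # illustrative break date
    fit  <- lm(I(rss - uah) ~ step)
    coef(fit)["step"]                                     # estimated step size
    rss_destepped <- rss - coef(fit)["step"] * step
    cor(uah, rss_destepped)                               # how closely the series track after removal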

  25. #25,
    Nick, did you actually read what Chad wrote (or for that matter, Santer et al 08)? The two hypotheses tested:

    H1: The trend in any given climate model realization is consistent with the observational trend.

    H2: The multi-model ensemble mean trend is consistent with the observational trend.

    Which covers both cases, including model variation. Where available, Chad used multiple realizations of each individual model; and if you look, you will see that if a single realization failed then most or all the other realizations of that model did as well.

    What motivated Santer et al in the first place was the statistical treatment in Douglass et al, which did not properly address the issue of variability between models (see the relevant UEA email exchanges for details about why an immediate refutation was needed). Santer et al (and Chad) did include this contribution.

    Really Nick, this is a case where Santer et al simply chose not to use up-to-date satellite data. Had they done so, their statistically robust methodology would have shown the model ensemble and many individual models were in fact NOT consistent with the satellite data at 95% confidence, and in spite of whatever statistical errors they had identified in Douglass et al, the primary conclusion of Douglass et al was correct: the models are in fact not consistent with measurements of tropospheric temperature.

    But that would not have been much of a refutation, would it?

    The plain fact is that the model ensemble (and most of the individual models) predict substantially higher tropospheric and surface warming than has actually been observed. In most fields this would be immediately identified, openly discussed, and then modelers would try to figure out a) why there is so much variation between models, and b) why most models are predicting more warming than has taken place. Not so in climate science, where it seems to me people are too worried about political fallout (see once again the UEA emails) and not enough worried about good science.

    Or as I already said: bad, shamefully bad.

  26. If the measured data do not agree with the model, obviously the measurements have problems and need fixing, or “adjustment”. The models cannot be wrong.

    Pay no attention to the snow on your lawn, it’s HOT outside!

  27. Paraphrasing, but close:
    Keynes: When the facts disagree with my opinions, I change my opinions. What do you do, sir?
    Feynman: When observations don’t match the theory, the theory is wrong. That’s the whole of science!

    Climate science is like a zombie with oozing holes, numerous embedded daggers to the heart and vital organs, and half its skull blown away. Yet it walks!

  28. re: Climate science is like a zombie with oozing holes, numerous embedded daggers to the heart and vital organs, and half its skull blown away. Yet it walks!

    Very funny! I keep telling my computer-game-playing son that there is no such thing as zombies. Perhaps I need to follow Keynes’ lead and adjust my opinion.

  29. Nick Stokes:

    In other words, the variability due to the choice of models

    There is no logical basis for including the variability of the choice of models here. Not all models are created equally, and worse ones deserve less weighting than better ones. It makes sense to look at the ensemble of outputs from a single model (with forcings and other inputs allowed to vary between runs but otherwise constrained by known experimental measurements) in determining how well a given model can explain the data.

    It makes no sense to try and lump in “variability” of the choice of models. Models live separately when tested against data, and they die separately. If none of the models explain the data (when accounting for uncertainties in e.g. the forcings as described above), none of the models are consistent, regardless of the individual spread of the models relative to the data.

    I’ll admit that my experience with climate modeling is very limited, but I have had reasonable exposure to modeling of data. And this is the only field I’ve seen where the models are bad enough that people insist on treating them as if they were a “truth centered” ensemble.

  30. I just finished reading the MM and Herman paper (MMH) and found it to be an advance on the Santer paper methodology, and, of course, it accomplished the all-important updating of the model/observed comparison of the temperature trends from the surface to the lower troposphere in the tropics – and the mid troposphere for good measure. For those of us who have followed this progression from Douglass, to Santer and to this paper, I think most would say that it has been a learning experience.

    Douglass et al. got the ball rolling and provoked the Santer et al. response. That Douglass was flawed by way of using the observed data deterministically (without confidence intervals) and without adjustment for autocorrelation does not detract from the fact that in my view it was instrumental in the Santer and MMH papers following from it.

    Besides the update, the MMH method deals with higher than AR1 autocorrelations and, on first glance, I believe uses the combined model and observed data to obtain a hypothesis test of model versus observed results. As I recall, previously the combined model ensemble results were compared to individual observed results, i.e. UAH or RSS or radiosondes. McKitrick has talked before about the use of panel regression in econometrics and from the paper’s description I now have a better idea of how that is accomplished, but certainly without being able to follow completely all the linear algebra it entails.

  31. Nick Stokes said:

    But the differences between model trends are very large. There’s a lot of scatter that just doesn’t go into the significance test.

    So it may well be that their panel test is correct as far as it goes. But it isn’t a test of “models” vs observations. It’s a test of some observations vs a particular sample of models, ignoring between-model variation.

    We used all the models available, a larger sample than even Santer used. Also, all the within- and between-model variance is preserved in our methods, more so than in the “effective DOF” method. The panel method specifically parameterizes the covariances and lets every panel have its own error variance; the V-F method uses a non-parametric approach that converges asymptotically to an unconstrained estimate of the omega matrix.

    I see that although they’ve plotted the models in aggregate, they put them in the panel regression separately, so this may give some accounting for between-model variation.

    Change “may give some accounting” to “accounts”.

    They show in Eq 5 how Santer made an AR1 correction, which seems to be generally accepted as necessary. They criticise Santer’s omission of a correlation between the slopes, but say nothing more about AR1, and as far as I can see, haven’t done it…. I see that the AR1 parameters do appear in the model in Eq 11. But only in the diagonal terms – I can’t for the moment see the basis for that.

    Equations (13) and (14) deal with the treatment of panel-specific AR1 in the panel regression, and Section 2.3 presents the general non-parametric treatment of AR in the Vogelsang-Franses method, though you’ll have to read the VF paper for the full details of how that works. AR1 is in the off-diagonals of the A matrix (eq 13 not 11) because the diagonal represents lag-zero. Table 1 lists the significant AR coefficients for all the models and observational data sets out to AR6, for both LT and MT layers. On the basis of those calculations we conclude the AR1 specification is likely inadequate.
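
    A generic R sketch of the kind of check behind that last point about AR1 likely being inadequate (this is not the paper’s own estimation; anom stands in for a monthly anomaly series):

    fit <- lm(anom ~ I(seq_along(anom) / 12))     # simple trend fit, time in years
    ar(resid(fit), order.max = 6, aic = TRUE)     # AIC-selected AR order and coefficients up to lag 6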

    The correlations between slopes are all included: they are the off-diagonal elements of the V matrix (equation 12).

    RSS emerges as an outlier compared to the UAH and balloon data. We’re agnostic as to who is more “right”. Taken together the satellites agree with the balloons, but not with the models.

    Steve’s and my comment on Santer is a different paper, where we work within the Santer methodology. I expect it too will make it into print, but progress on it got derailed by climategate.

  32. Re: Ross McKitrick (Aug 8 15:17),
    “AR1 is in the off-diagonals of the A matrix (eq 13 not 11) because the diagonal represents lag-zero.”
    Ross, my query here was not about the A matrix, which in my copy of the paper is defined in Eq (11) (or at least between the labels 10 and 11), but about the block off-diagonals in the Ω model of (11). If the stacked time series correlate with each other, and have lag autocorrelations within themselves, then surely there must be lag cross-correlations. The off-diagonal identity matrices imply otherwise.

    On sampling models, yes, you’ve modelled between-model covariance. But this just expresses the effect of noise from that particular set of models. Essentially you are just working out the linear algebra consequences of the e fluctuations of (9), which are internal to that set of data. It does not account for the possibility that you might have chosen a different set of models. And if the conclusion is about deviation of models in general from observation, then you have to do that.

    This is exacerbated by the fact that you did in fact choose a different (larger) set of model runs from Santer. Then you have to estimate the possibility that you, by chance, chose a set that were apt to make higher predictions. This might seem to be a small effect but when talking about 95% confidence, these things count.

    I’m reading the VF paper, but I’d like to understand the parametric method implementation first.

  33. #38, Nick,

    Before the discussion gets too far into the linear algebra used in the EXTENSION of Santer et al by MM&H, could you please answer one more basic question:

    Do you or do you not agree that duplicating Santer et al’s method, but including satellite data through 2009, shows clearly that the Santer et al conclusion no longer is true, and the models do in fact deviate from tropospheric temperature measurements at the 95% confidence level?

    You do not seem to have addressed this question anywhere in the thread.

  34. #38, You’re a different guy, Nick, and I haven’t figured you out yet. Your knowledge of what is going on is deep, and you are quick to be critical of many things which contradict the standard, yet it’s obvious that Santer didn’t do crap compared to the MMH10 paper and is contradicted/debunked/denatured by treesfortheforest. What’s worse, by the Santer methods, models fail when extended to 2009. We don’t have any statistical methods which can pass the model tests compared to temperature data, yet you are after the inter-model covariance. Which isn’t just noise, it’s a function of using the same exact input data. And I would be remiss not to mention ‘the same responses’.

    Why so critical when this is as complete a work as any?

  35. Re: Steve Fitzpatrick (Aug 8 17:35),
    Steve, I’m afraid I’m easily distracted by linear algebra 😦
    A point I’ve been trying to make is this issue of subset selection as a source of variance which must be taken into account before making a statement about models in general. Santer didn’t do that either. It may be that they saw it as unnecessary, since they had shown that in their case even the dataset-specific sources of variation showed insignificant difference.

    But on your specific question what I see from the MM paper is:
    Table 1: All d values increase – UAH comparison becomes significant, RSS not.
    Table 2: (surface, Hadcrut with T2LT) – both become significant
    Table 3: Again a contrast – RSS low d levels (not significant), UAH quite significant
    Table 4 – no comparison.

    So mixed, with a big difference between UAH and RSS. I note that there are big changes to UAH with V 5.3, so it might be interesting to revisit.

  36. #34 Carrick said

    “I’ll admit that my experience with climate modeling is very limited, but I have had reasonable exposure to modeling of data. And this is the only field I’ve seen where the models are bad enough that people insist on treating them as if they were a “truth centered” ensemble.”

    My experience as well. It is a very odd assumption to make that somehow the ensemble of all models must produce data which ‘by definition’ (????) varies about ‘The Truth’. As Nick Stokes #38 points out:

    “But this just expresses the effect of noise from that particular set of models. Essentially you are just working out the linear algebra consequences of the e fluctuations of (9), which are internal to that set of data. It does not account for the possibility that you might have chosen a different set of models.”

    There speaks an older scientist.

    In other words, by simple inference, it does not account in particular for the possibility that the particular set of models concerned all suffer from one or more flawed ‘consensual’ assumptions, i.e. it suffers from the limitations of the groupthink applying at the time that particular set (of models) arose.

    What would be more productive would be to tease out the major assumption(s) and test them statistically one by one.

    Here we can see clearly the corrosive influence of post-modernist hubris on climate science.

    Not to mention all that watching of Star Trek: The Next Generation on TV, listening over and over to that stupid phrase ‘Make it so’.

  37. Re: Jeff Id (Aug 8 18:05),
    Jeff, I’m less critical now of the MMH method, which I am warming to. But it’s more complicated numerically (big matrices), though simpler (to me) conceptually, and it’s important to get it right. Especially the AR1 issue – ignoring lags in cross-correlation has the potential to undermine it. On the other hand, it could just be that omitting cross-correlation lags reinforces the autocorrelation lags; then the omission wouldn’t matter much.

  38. #38 Jeff,
    “What’s worse, by the Santer methods, models fail when extended to 2009.”

    I think the best you are ever going to get is a “maybe, maybe not” out of Nick.

    But to most people, I think it is pretty clear: most models predict much more warming than has been measured. I blink in disbelief at the scholarly quibbling about 90% vs 95% vs 99% significance, when the overall thrust is perfectly clear: a large number of the models are obviously way wrong. The tropical tropospheric ‘hot spot’ only exists in some (obviously wrong) climate model predictions, and not at all in reality. The predictions of future warming made by these models are so uncertain as to provide no meaningful constraints, and so are virtually useless.

    Why the heck aren’t the models being modified to eliminate what is wrong with them, scrapped, or just ignored? When is somebody in climate science going to stand up and say “models x, y, z, a, b, c… do not make accurate predictions and so have to be wrong”? Or better yet: “models x, y, and z are wrong because these models use the zzzzzz cloud feedback parameter, which is known to be incorrect.” Or even better: “the projected tropospheric hot spot was incorrect because the models that predicted this tropospheric warming incorrectly assumed zzzzzzz amplification”.

    Where is the normal process of scientific ‘creative destruction’ that leads to PROGRESS? Jeesh…

  39. Steve;
    After they had so much fun building their Tinkertoy buckets, they’re unwilling to acknowledge that the buckets don’t hold water, never can, and never could.

  40. Nick, sorry about the 13/11 confusion, I forgot they renumbered the eq’s in the proof. I’ll fix that in the pre-print. The off-diagonal elements allow for panel-specific sample variance, but not correlations across lags. Bear in mind that these are residual terms. This is the monthly noise. If we try to estimate all the lag cross-correlations among all the monthly noise terms we will have more parameters to estimate than there are observations. Ironically, in the 4th (and final) review round, one of the sticking points with a referee was their insistence that we should go the opposite route and constrain all the off-diagonal blocks in the omega matrix to equal zero. But there is no reason to do so, and the estimated off-diagonal elements are not zero, so the editor didn’t uphold this criticism.

    On sampling models, yes, you’ve modelled between-model covariance. But this just expresses the effect of noise from that particular set of models. Essentially you are just working out the linear algebra consequences of the e fluctuations of (9), which are internal to that set of data. It does not account for the possibility that you might have chosen a different set of models. And if the conclusion is about deviation of models in general from observation, then you have to do that.

    This “particular” set of models is the entire set. Your criticism is one that we heard in the first review round when we had, in fact, only used some models. So we got them all. This is the entire population of models used for IPCC analysis, and that group fails to match the observations. I don’t think anyone can criticise our findings based on the theoretical possibility that somebody somewhere might one day construct some other model that matches the data better. I could go into why that would fail as a line of argument but I suspect it would not be necessary for your benefit. If other models appear we can revisit the issue on that day, but in the meantime there is no other set of models, so you cannot say we should have accounted for “the possibility that you might have chosen a different set of models.” There is no such possibility.

    On the other hand, it is true to say that we did not examine all 2^23 linear combinations of models to see if any one or combination of them might match the data better than alternatives that make no use of the GHG sensitivity assumption. That project is underway, but you’ll have to wait for the final paper. Based on the initial results, do not expect any joy there either.

    This is exacerbated by the fact that you did in fact choose a different (larger) set of model runs from Santer. Then you have to estimate the possibility that you, by chance, chose a set that were apt to make higher predictions. This might seem to be a small effect but when talking about 95% confidence, these things count.

    Again, there’s no “chance” here. We used them all. If modelers are going to say, after the fact, that we shouldn’t use certain GCM’s because they give invalid results, then why were they included in the IPCC report? The IPCC takes the position that all models/scenarios are equally likely when they present GCM-based projections. Take it up with them.

  41. Steve;
    Don’t forget that in a pinch they have the “out” of saying the models are actually only “scenarios”, illustrations of what might happen given a particular set of weightings, algorithms, and data selections. As G&T put it, “video games”.

  42. I’m with #34 Carrick on this one. Each model has to realize a different approximation to the climate. Otherwise, why else have so many? Lumping them together makes no sense. Each one should be tested separately.

    I have a second issue as well, the global temperature anomaly. This is a totally unphysical quantity. All that really matters are the actual physical temperatures on the ground and in the atmosphere. A much stronger test of the models would be comparison of predictions of the global temperature map with the actual map as measured by satellites. I’ve never heard of this being done. I’d bet the anomaly sweeps a lot of problems under the rug.

  43. So MMH used ALL the models, used the Santer method but did not cherry pick [fancy that, there was a climate change in 1998] and found that the models are Edsels. Can we please finish now and direct the trillions being spent on AGW back into something useful, like going to Mars and developing an FTL spaceship.

  44. #50 Paul,

    I agree it makes no sense; but the argument is that “the average is more robust than the individuals”, which I guess means that as the number of errors becomes larger they tend on average to be near zero because they are offsetting. A little nuts for sure, but that is how the IPCC wants to frame it.

    WRT global average versus warming pattern, some of this is for sure done already; the original Santer et al addressed the apparent absence of amplified tropospheric warming in the tropics. The IPCC AR4 showed typical profiles of expected temperature increase as a function of latitude and altitude, from ground level to the stratosphere…. showing the “signature” of global warming as a substantial tropical upper troposphere ‘hot spot’, which is obviously missing from the data. Richard Lindzen and others have commented on the absence of the hot spot for some time.

  45. Re: Ross McKitrick (Aug 8 21:09),
    This “particular” set of models is the entire set.
    No, it isn’t all models, and certainly not all runs – there are lots being done all the time. You’ve chosen all of a subset that was chosen by the IPCC. That doesn’t make it any less a subset of model runs. The fact that IPCC is “authoritative” doesn’t cancel the possibility of random variation.

  46. Nick;
    From the horses’ mouths:

    9: Climate Change Scenarios
    Clare Goodess

    Climate change scenarios provide the best-available means of exploring how human activities may change the composition of the atmosphere, how this may affect global climate, and how the resulting climate changes may impact upon the environment and human activities. They should not be viewed as predictions or forecasts of future climate, but as internally-consistent pictures of possible future climates, each dependent on a set of prior assumptions.

    General circulation models (GCMs) (see Information Sheet 8) are complex, gridded, three-dimensional computer-based models of the climate system (developed from numerical weather forecasting models). They are considered to provide the best basis for the construction of climate change scenarios.

  47. The above is, as you will see if you follow the link, from the CRU site.
    The 8) in the text is actually 8 ), as in “Information Sheet 8”

  48. I find this thread fascinating and illuminating. Nick Stokes as devil’s advocate has teased out further information from the authors and has led to greater understanding and greater confidence in the results of MMH2010. This is an example of real peer review in action. It has been carried out with great civility and none of the ad hominems and insults so prevalent at RC.

    This kind of review and questioning can never be carried out at RC, where insults and moderation stop any sensible ongoing discussion and understanding.

  49. I love this quote:

    “But with the addition of another decade of data the results change, such that the differences between models and observations now exceed the 99% critical value.”

    It’s not a close call.

  50. Nick, can we at least agree that the models and runs used in the IPCC report are not supported by the data, as regards the response of the tropical troposphere to increased GHG’s and other observed forcings over the 1979-2009 interval? Defending the IPCC suite of models/runs on the basis of hypothetical results that might be obtained from other models effectively makes the climatology embedded in the existing climate models non-testable and non-falsifiable, which I equate to non-scientific.

    And bear in mind that even if other models exist that might, hypothetically, match the data, be careful what you wish for: they might achieve the fit with parameterizations that imply very low sensitivity and feedback. If you are then tempted to dismiss such parameterizations as unrealistic, you are left arguing that the “realistic” models don’t match the data, while the models that match the data are “unrealistic”. If I were trying to defend such a philosophy in a room full of ordinary scientists, I think I would feel my forehead getting damp and my chair suddenly quite uncomfortable. Although, since this is what the CCSP-2006 did in their immortal words on page 11 (emph added), there would at least be precedent for trying to pull it off.

    In fact, the nature of this discrepancy is not fully captured in Fig. 4G as the models that show best agreement with the observations are those that have the lowest (and probably unrealistic) amounts of warming.

  51. Sigh.

    Too much civility!

    Time for another rant from me on convectively-forced Hadley Cell Circulation.

    I think Jeff’s the only blog where people get their panties in a twist over water vapor mixing ratios. I could be wrong.

  52. At 43, Ecoeng wrote: “we can see clearly the corrosive influence of post-modernist hubris on climate science”

    Ding, ding, ding. We have a winner. Gets my vote for apt description of the year.

  53. Stan;
    Yes, the “post-ideologues” are the most arrogantly ideological beasts of all. Beware those who loudly proclaim honesty! Honest men are too honest to do so.

  54. As I recall, the initial response to “fixing” the apparent discrepancy in observed versus modeled ratios of surface to lower troposphere temperature trends was to use a range of model results and compare it to the observed results. I think Karl published some of these results, and it was these results and methodology that Douglass et al. objected to and in turn did their comparisons, which were better but not without problems of their own.

    The range comparison was obviously susceptible to model outliers, since the criterion for no difference was apparently that the range of the models merely overlapped that of the observations. Also, as I recall, the model outliers (for ratio of trends) were not at all realistic with regard to surface trends in that they showed no significant trends at the surface.

    Overall, I think the progression has been very good from Karl to Douglass to Santer to MMH (both in methodology and adding to the time series) and for me a major learning experience and one that I might well have missed without the online discussions.

  55. This thread recalls discussions a year or so ago at “The Blackboard” about lucia’s trend analysis in re the falsification point of the model assembly.

    At that time we were informed by persons who were indignant over her impertinence that (a) the models cannot be judged in time periods of less than 30 years and (b) in any event, the error ranges are so large that (barring a return to the Precambrian iceball) whatever it is that is meant by “the models” can never really be considered wrong when compared to actual data.

    I would expect that the model assembly mean would necessarily overpredict because while most models would be in the 1.5 to 3.0 deg/century range there is not much content on the down side (i.e. nobody is predicting dramatic cooling) to offset a few whack job disaster scenarios of 5-6 degrees or more. In other words a few high end outliers could bring the mean up from the true median consensus and probably make it less likely the mean will conform to the actual.

    However, the inclusion of the high end predictions not really supported by the consensus does permit the executive summary to say things about heating of “as much as [fill in scary bad number here] degrees” even as it detracts from the accuracy of what most people will invariably assume is the core prediction–the assembly mean.

  56. It seems to me that expanding the number of models and/or runs, becomes highly problematic to maintaining the notion that the models in some general way are reliable. It begins to sound like tuning rather than calibrating without the requisite physical rationale. Presumably all the models have been calibrated to some degree, i.e., they capture some key aspects of climate reality. MMH indicates that they do not appear to be validated. To undermine MMH’s analysis you need to justify why some other combination of models is to be preferred to the ensemble that is currently being used. MMH, at one level, is saying, “Sorry guys you have to go back to the drawing room – but no peeking.”
    Am I being too simplistic?

  57. #66 Bernie,

    The models are already highly tuned with the assumed level of aerosol effects. It is comical that so obvious a kludge passes muster; where I have worked this sort of obvious band-aid would bring endless wise-ass remarks and rounds of laughter. Apparently not so in the world of climate models.

    I think MMH (and earlier efforts) are mostly saying, “You guys have serious problems”. But the wide range of diagnosed sensitivities among the models should have been screaming that same message for 10 years! It is an Alice-in-Wonderland kind of field.

  58. Re: Ross McKitrick (Aug 9 09:43),
    “Nick, can we at least agree that the models and runs used in the IPCC report are not supported by the data, as regards the response of the tropical troposphere to increased GHG’s and other observed forcings over the 1979-2009 interval?”

    The models were cited in the AR4 Ch 8, as being participants in the MMD at PCMDI*. The runs were a particular set (20CEN) chosen by Santer et al as suitable for comparison. So yes, any results claimed relate to that particular set.

    That doesn’t mean that a proposition about the general class of models is untestable. The variability between models is obvious in any one of Chad’s plots, for example. That plot illustrates the issue. You’ve accounted for the central band spread of the observations, and the model noise indicated by the whiskers. The scatter of model runs themselves can be estimated, so you could use this sample to say something about model runs in general. It isn’t easy, because the population of possible model runs is hard to pin down. But that’s what you want to talk about.

    As to the rest of your proposition, I continue to note that, while this test shows significant differences between models and observations, it shows even more significant difference between the two main observations (RSS and UAH) which are supposed to be measuring the same thing.

    *AR4 Ch 8: “Perhaps the most important change from earlier efforts was the collection of a more comprehensive set of model output, hosted centrally at the Program for Climate Model Diagnosis and Intercomparison (PCMDI). This archive, referred to here as ‘The Multi-Model Data set (MMD) at PCMDI’…”

  59. Nick, the UAH/RSS difference is almost 100 percent due to the same data being used. The only big difference is at 1992, and when that is resolved the data are the same. My guess is that the method is correct in finding a highly significant difference between the two. You get perfect short-term covariance, which divides out in the equation, and a step.

    From memory, I also found a significant difference using Santer’s method.

    On reading that, I took it as good evidence that the method is working well.

  60. Jeff: “The methods presented nearly confirm the Santer et al. paper, which, although published in 2008, didn’t use any data after 2000. Why is anyone’s guess, except for those of us who recall the quote from Trenberth in 2009.”

    Can’t hurt to retread this and answer you. Stopping at 2000 is a cherry-picker’s delight. Perfectly transparent and pathetic IMO. That’s not why I’m writing though. I thought I’d offer the other Trenberth quote from the same UEA thread:

    “How come you do not agree with a statement that says we are no where close to knowing where energy is going or whether clouds are changing to make the planet brighter. We are not close to balancing the energy budget. The fact that we can not account for what is happening in the climate system makes any consideration of geoengineering quite hopeless as we will never be able to tell if it is successful or not! It is a travesty!”

    For you, Kevin, I suppose it is. Cheers!

  61. Following up on my comments above about the failure to allow for between-model scatter, I’ve drawn a histogram here of the trends, as given in MMH Table 1, along with their mean and error range. The trends are far more scattered than the stated error ranges indicate.
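
    In R the exercise is only a few lines, assuming model_trends holds the per-model trend values read off Table 1 (deg C per decade):

    hist(model_trends, breaks = 10, xlab = "Trend (C/decade)", main = "MMH Table 1 model trends")
    abline(v = mean(model_trends), lwd = 2)                        # ensemble mean
    se_mean <- sd(model_trends) / sqrt(length(model_trends))       # s.e. of the mean
    abline(v = mean(model_trends) + c(-2, 2) * se_mean, lty = 2)   # +/- 2 s.e. band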

  62. Jeff,

    I’ve got a question. In the CA thread in which you have some comments, another commenter using the handle VS wrote Climate Science with the trademark designation:

    Climate Science™

    I love it. First time I have seen it. Is the use of the trademark designation for hockey team/IPCC consensus “climate science” something that has been used before? As in “proprietors who “own” the intellectual property rights to the designation are not happy whenever the designated phrase is used in a manner which contravenes their business interests.”

    I’d love to see the usage become common. Has it been around before?

  63. Update: Of course, the error bars are for the mean, not the distribution. But the bars seem very tight. A simple se of the mean of the trends would be about 0.022. And that does not allow for the uncertainty of the trends themselves.

    Nick Stokes, I think adding your comment above might be enlightening. I think you and others may be prancing around the issue that a coauthor of Santer et al. held forth on: that using the standard error, as Douglass et al. did in their paper, was wrong in their opinion, and that instead the standard deviation (or maybe even the range, per Tom Karl) should be used. That coauthor was Gavin Schmidt, who made a lot of noise about this issue, i.e. SD versus SE, at RC. I was flabbergasted when I found that the Santer et al. paper that he coauthored used – you guessed it – SE and not SD.

    Also, on the subject of statistical differences between observed temperature series, you will see these differences when you compare various shorter time periods and more regional areas. Remember that this comparison is for the zone 20S to 20N over 30 years. This is true for the various surface and troposphere temperature series trends. The various radiosonde series results show some very large variations. I think what was most striking in the Santer paper was the graphical display of the variations in not only the model results but in the observed results as well. Without a reasonable and established method of a priori selecting the valid model and observed results, I would think we are stuck with using means of the available data.

  64. May I ask if this is an accurate summary of the current discussion?

    a) The MMH paper concludes that the ensemble mean of the models does not predict the observed temperature readings.

    b) Stokes observes that this result is only for a certain set of models and that the result may be due to differences between the MMH technique and the Santer technique. The MMH conclusion is valid only for the set of models used and not for models in general. This may include future models or certain subsets of the models contained within the superset of the set of all models that was used by MMH

    c) McIntyre states that MMH in a different paper have replicated the Santer technique and the same result of lack of predictive capability is confirmed with it

    d) McKitrick states that the sample of models used is the sample used by the IPCC for the reports

    e) Stokes observes that this is a superset of Santer’s set of models

    Wouldn’t there then be three contributions of MMH

    1) the creation of a novel technique for the matching of model predictions to observations

    and

    2) that the conclusions of Santer 08 in respect of this matching with current models are shown to be invalid if the temperature data is extended

    3) The novel technique may be used to test future candidate models and other model sets for their predictive power

  65. Re: TAG (Aug 10 13:16),
    Some comments:
    a) We’re talking about a subset of temp data – tropical lower and middle troposphere trends. And the greatest difference is in lapse rate trend, rather than temp itself.
    b) Pretty much, tho Santer also had the problem of particular set vs general.
    c) Yes
    d) Yes, although “used” could be made more precise.
    e) It’s a different range. Santer used the 20CEN runs, which finish in 2000. M&M’s first paper extrapolated these – this one seems to use A1B. I don’t know why Santer preferred 20CEN.
    1&3) Yes, the technique is attractive – I’m not yet convinced that it works properly here
    2) Yes, the paper says Santer’s results are invalid. That doesn’t mean models are invalid. Again the issue about different subsets, and whether either represents “models”.

  66. A second read of the MMH paper gives me more appreciation for the method of multivariate panel regression that was used. It answers some criticisms currently being made that the model results show statistically significant differences between models, which I assume means that the results are assumed to come from a normal distribution. The method used in MMH does not make or require that assumption, as I read it.

    I think, Nick Stokes, that you have to better frame your criticisms in order to provoke some replies. The MMH paper clearly states that:

    3.1 Data

    We used the same archive of climate model simulations as Santer et al. (2008). The available group now includes 57 runs from 23 models. Each source provides data for both the lower troposphere (LT) and mid-troposphere (MT). Each model uses prescribed forcing inputs up to the end of the 20th century climate experiment (20C3M, see Santer et al. 2005). Projections forward use the A1B emission scenario.

  67. Re: Steve Fitzpatrick (Aug 7 22:22),
    Re: Steve Fitzpatrick (Aug 8 09:49),
    neither Santer et al nor anybody else working in ‘climate science’ has applied Santer et al’s methodology to the whole of the satellite record.
    Gavin has replied on that here. Two points.
    1. Santer et al did submit results up-to-date at the time (2007). The referees would not allow it, presumably because it involved the extrapolation that M&M did in their rejected paper.
    2. They did include the results to 2006 in their SI.

  68. 79,

    Your link to Gavin goes to James’ Empty Blog, not Gavin. But I did find the link to the Santer SI, and there is a very brief reference to data through 2006. Hard to say if it conflicts with M&M’s analysis using data through 2009.

    I find it odd “the referees” would not allow a (seemingly reasonable) analysis using up to date data to be published.

  69. Re: Steve Fitzpatrick (Aug 10 21:07), There was a comment by Gavin on that blog. They were writing in 2007, so 2006 was the last complete year. The test results are called SENS2.

    The issue isn’t up-to-date data but the runs available. Apparently there’s a feeling that runs should use known forcing. 20Cen did that; A1B uses a scenario (A1B) assumption. The IPCC, and PCMDI, seem to ask for runs that have a published paper attached. Getting 23 different codes that all have that synchronised is not easy, and requires the special effort that was made with end 20Cen (end 2000).

  70. If anyone is interested, there is an R library called plm which does panel regression. A fairly extensive discussion of the regression procedure, plm, and examples of its application is available here.

    It should be reasonably understandable for the folks who usually hang out here.
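
    For the curious, a minimal sketch of what a pooled trend regression looks like with plm. This is generic rather than the MMH specification (no series-specific trends and no full error covariance model); panel_df is an assumed data frame with one row per (series, month) and columns series, month and anom.

    library(plm)
    library(lmtest)

    panel_df$t <- panel_df$month / 12                       # time in years
    pdat <- pdata.frame(panel_df, index = c("series", "month"))
    mod  <- plm(anom ~ t, data = pdat, model = "pooling")   # one common trend across all series

    # trend estimate with heteroskedasticity/autocorrelation-robust panel errors
    coeftest(mod, vcov = vcovHC(mod, method = "arellano"))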

  71. After my second read of MMH 2010 it is apparent that the statistic used was not the trend of a difference series between the surface and troposphere temperature anomalies, but rather the LT and MT series. I thought that it was that measure that was important in the debate about the tropical surface and troposphere trends. I also thought that at CA (and in Santer et al. (2008)) it had been shown that using the difference series allowed one to more readily show significant differences between models and observed because the difference series has less variation. Could not a panel regression have been used with a difference series?
