Conversation with a climatologist

I’ve moved this older post to the top for a while.  The issues are complicated and interesting and still being discussed.  New posts are still here but will appear down below.

I’ve been wrong before, comment away.

Blogs which require passwords and codes drive me nuts.  Get WordPress, people, and stop moderating.  What a waste of my time. I’ve been discussing MMH10 with James Annan; it is the single premier skeptic paper ever published, although even blogland hasn’t grasped that yet.  I’ve posted my comment to his blog several times, only to get an uninterpretable error message.  I’ll post all the comments here, but the latest is at the bottom.

It started at Judith Curry’s blog and bled into James.

Paul said…
JeffID is still waiting for an answer! See his and John N_G comments on Curry’s latest meander.

http://judithcurry.com/2010/09/27/no-consensus-on-consensus/

Sorry I can’t link to a specific comment. You will just have to wade through the whole morass.

Paul Middents
28/9/10 8:44 AM
James Annan said…
Unfortunately, Jeff Id does not show sufficient competence to understand the answer.
28/9/10 4:20 PM

Matlab has always been my favorite class.  😀

Jeff said…
James,

This is my first reading of your post. Claims of my competence to follow your argument aside, I actually agree with you about the ensemble mean.

However, you are missing the little detail that so often gets lost in climate science. The magnitude of the various models’ trends was shown to be 2 – 4 times observation. I believe Lucia covered this point in a comment where she said, in paraphrase, that saying something is statistically significant can be nearly meaningless. If you have x +/- 0.001 and the trend differs by 0.002, perhaps nobody cares. However, when the trend is TWO to FOUR times different, then you have something to discuss.

Now you can belittle me all you want, but if you want to pass that off as a non-issue, I may have to return the favor.

As an engineer, it would lead me to question both the data and the models. As I’ve already looked in depth at the data, my opinion is that the problem likely lies in the models, but it should be thoroughly examined.
30/9/10 1:44 AM

James Annan said…
Jeff,

Thanks for the comment. If you agree with me about the method of analysis, you should also agree that the observations fall within the range of model estimates (eg Pat Michaels et al analysis, on which I am a co-author). So I don’t know where your 2-4 times comes from.

Just to be clear, I don’t think the models are perfect, and I would not be surprised if more concrete evidence of a mismatch were to mount up over time. However, it hasn’t happened yet – and when it does, it may be partly due to obs error too, of course.
30/9/10 4:51 PM


Jeff Id said…
James,

According to MMH authors, the model mean runs 2 – 4 times observation.

Now I’m sure that ‘some’ models have far lower trends, and some individual models have wide enough error bars to overlap temp trend, but when the mean runs so much higher than observation, it does call into question the underlying assumptions which are so similar in many of the models.

Your point about the accuracy of the model mean for determining significance has merit, but only to the extent that the accuracy of the result is uncertain (yeah, I said it). But when you average a hundred different images of simulated planes, and the blob looks like an ocean liner, you need to consider the possibility of a systematic bias.

I’m certain that in a less entrenched field where there was less value placed on the result, problems shown by MMH10 would be taken far more seriously.

—-

Here is a question on a different topic. Since models are modeling temp, and we’re comparing models to temp, should the temp-model residual be used to determine the error bars? Or should we just look at the uncertainty of each trend individually?

Also interesting, if not examining residuals, would you use the error bars of one trend and the actual trend of the other to determine significance? Or would you use the error bars of both?
30/9/10 8:49 PM

James Annan said…
IMO, in a less entrenched field, no-one would have taken MMH seriously in the first place. It would just have been ignored as obviously wrong.

The fact that the ensemble mean is some factor larger than the obs in itself means nothing about the ensemble’s reliability. In fact it merely tells us in this case that using these methods we cannot hope to give accurate predictions of trend over a short (say 10y) interval, because we know that natural variability is large relative to the forced response over this time period. (There is ongoing work to try to improve on this situation, but that is not relevant to this analysis.)

Consider a large ensemble of fair coins which are each tossed 6 times. The ensemble mean is 3H. If a single coin (representing “reality”) only gets 1H, then you could say that the ensemble mean is 3x greater than obs. However, this does not provide strong evidence that the real coin has a bias against H. The probability of 0H or 1H is over 10% even for a fair coin, and the distribution of the large ensemble will show this. Similarly, the probability of a very low temperature trend (over 10y say) is non-negligible, and the models show this. You can read it straight off the Michaels et al graphs.

If you wait for another 6 tosses and reality again ends up with only 1H versus the typical 3H from the ensemble, then you have much stronger evidence for a discrepancy…but until that happens there simply isn’t much to go on. This is basically the situation we are in now.
1/10/10 8:41 PM

Jeff Id said…
James,

I understand the bulk concept you are expressing, but there are ways to quantify the statistical validity of the claim that you are making.

For instance, the standard deviation of the number of heads in 6 tosses of a fair coin is about 1.22, so two sigma is 2.44. Take the ensemble mean of 3 heads, subtract 2.44, and the two-sigma lower bound is roughly 0.56 heads.

So it isn’t that big a deal to see 1 once in a while.

In MMH 10 they showed that the mean is well outside of the two sigma 95% level for the model coin tosses. Again, the magnitude of the difference is the key.

You shouldn’t ignore this quote:

“Over the interval 1979 to 2009, model-projected temperature trends are two to four times larger than observed trends in both the lower and mid-troposphere and the differences are statistically significant at the 99% level.”

This means that you already have your additional coin tosses. Despite the fact that an average of models may or may not be physically realistic, the fact that their average and error bars all run so much higher than observation, and are so statistically significant, should not be overlooked with a hand wave.
1/10/10 11:36 PM
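An aside on the coin analogy: here is a minimal R sketch (my own illustration, not part of the exchange) of the ensemble spread and of how often a fair coin really does give only 1 head in 6 tosses.

set.seed(1)
n.tosses   <- 6
n.ensemble <- 100000
heads <- rbinom(n.ensemble, size = n.tosses, prob = 0.5)  # heads per "model"

mean(heads)                  # ensemble mean, about 3 heads
sd(heads)                    # ensemble spread, sqrt(6 * 0.25), about 1.22
mean(heads <= 1)             # P(0 or 1 heads) is about 0.11, so 1H alone proves little
mean(heads) - 2 * sd(heads)  # lower two-sigma bound, about 0.55 heads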

James Annan said…
Jeff, the fundamental error of MMH is that the confidence interval they worked out is not based on the standard deviation of the ensemble, but rather on the standard error of the ensemble mean. Thus re-quoting their claim is irrelevant – it is simply an incorrect calculation.

Using their calculation, you could argue that many of the models fail to be predicted by the ensemble of models…surely you can see this is a crazy claim, as the models can hardly fail to predict themselves. A valid and correctly-formulated statistical test should reject 5% of models at the 5% level, etc – this is what the 5% significance threshold *means*.
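Before my reply, a quick R sketch of the distinction James is drawing (my own illustration; the model trends are invented numbers): an interval built from the standard error of the ensemble mean tightens as models are added, while the ensemble spread does not, and an interval that narrow excludes most of the individual models that produced it.

set.seed(2)
n.models     <- 23
model.trends <- rnorm(n.models, mean = 0.25, sd = 0.10)  # hypothetical trends, C/decade

ens.sd <- sd(model.trends)          # spread of the ensemble; roughly fixed as n grows
ens.se <- ens.sd / sqrt(n.models)   # uncertainty of the ensemble mean; shrinks as 1/sqrt(n)

# Fraction of individual models inside +/- 2 SE of the ensemble mean: well under half
mean(abs(model.trends - mean(model.trends)) < 2 * ens.se)
# Fraction inside +/- 2 SD: roughly 95%, as a range of models should behave
mean(abs(model.trends - mean(model.trends)) < 2 * ens.sd)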

Jeff Id said:

James,

Despite the rocky start, I’ve appreciated the discussion thus far. You almost had me convinced for a bit, to the point where I began writing to say so.  From that experience I can say that my mind is open enough but after some consideration, I am convinced with certainty that you are missing the point.  It comes down to the hypothesis you are working to test.

If you wish to test that observed trends are within model distribution, you would use the standard deviation of the trends of models and you would have the wide confidence intervals you expected.  Visually it is easy to say that you are correct, observed trends are within the edge of model distribution.  The narrow CI of MMH10 does not encompass all model trends (BTW, you don’t need to explain what CI means).

If you wish to test that observed trend is within our understanding of average trend over models, the CI you need is completely different.

The trend is the trend.  A perfectly instrumented/measured temperature series with a trend of 0.25 +/- 1 is still greater than a series of 0.20 +/- 1.  What that means is that even though we haven’t nailed down the certainty of the trend long term, and even though both are dramatically within the CI, the first is actually a higher trend. Of course statistically, neither trend is known to be separable due to the variance of the short term signal.  The assumption of Santer and MMH10 is that the trend from an average of models having different but allegedly realistic assumptions should resolve to something close to the observed data. After all, models are mathematical representations of the climate. I also agree that the model average is something non-physical because of differing assumptions.

So from the average model trend we find a value 3 times the measured.  The first question then becomes the uncertainty in that trend.  In paleoclimate, if you want to know the certainty of the average trend in the blade of the stick, you wouldn’t take the extremes of all the inputs to calculate uncertainty, you take the variance in the output and use some method i.e. Monte Carlo or a DOF estimate.

So with all that preamble you state:

Using their calculation, you could argue that many of the models fail to be predicted by the ensemble of models…surely you can see this is a crazy claim, as the models can hardly fail to predict themselves.  A valid and correctly-formulated statistical test should reject 5% of models at the 5% level, etc – this is what the 5% significance threshold *means*.

to which I reply softly:

By your method, more data never ever improves the certainty to which you know the trend.

Please consider the bold for some time.

The post by Chad Herman which I reviewed that drives this point home is here:

http://treesfortheforest.wordpress.com/2009/12/11/ar4-model-hypothesis-tests-results-now-with-tas/

It was a bit sobering to spend an hour working through why you might be right.  The result of my reconsideration was, again: the models are running high, or the data is running low, or a little of both.

289 thoughts on “Conversation with a climatologist”

  1. Jeff,

    Good luck trying to talk to James; he has a very bad case of climatologist-arrogance. He strikes me as someone who has little inclination to listen to anyone except James. His claims that MMH10 is wrong are nonsense, even if he is intellectually/emotionally unable to appreciate it. All the models are claimed by the IPCC to be representative of the real climate (and I use the words ‘real climate’ with some disdain). The ensemble, with their individual variabilities, if they are at all representative of physical reality, ought to include the actual measured trends. That they do not include the measured trends is a clear indication that the models are a very poor representation of reality.

    James can say people like you are not up to understanding the complexity (how obnoxious!), or put his fingers in his ears and say “nanananana” very loudly (which seems his favored mode of operation). That will not make the models any more correct. They are just way wrong. Someday (maybe) James can pull on his big-boy pants and get over this, but history suggests maybe not.

  2. as the models can hardly fail to predict themselves.

    Perfect model tests show exactly this — that the models fail to predict themselves.

  3. The William Briggs plea to never do stats on smoothed data is about the clearest exposition I’ve seen. It’s worth repeating (a small simulation of the effect follows this comment). The source is the header of the WUWT post on AMO + PDO etc, September 30, 2010.

    ……………………………………

    “Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.

    “If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals—signals that look real to other analytical methods. No matter what you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true—whether there really is a dramatic increase in temperature lately—it is guaranteed that they are now too certain of their conclusion.”
    ……………………………

    Can I please frame a question this idealised way? For this exercise, we are dealing with global temperature, with observations at all stations every 10 minutes of every day. There are statistical ways to measure the variability of the signal over 10 minute intervals. Likewise for hourly, daily, weekly, monthly, seasonally, annually, over decades and even over sunspot cycles. There are also ways to smooth each of these data strings, which might be a visual aid, but is suspect for mathematical work.

    For an extreme case, we can smooth data with so many (say) running mean points that it approaches or becomes a flat line. This is not very useful, usually.

    The expression “propagation of errors” has been ringing in my ears for some years, but I don’t quite know how to express it. When I was an analytical chemist, we would spend about 10% of our effort on running standard samples and on doing duplicates, triplicates, etc. of routine samples. For us, the error to be reported was the worst result of our testing, which contained errors of precision (poor reproducibility) and of bias (e.g. our answer differed from the best estimate of many other labs).

    One can’t do this testing with a time series, so I am left wondering how one actually does estimate both bias and precision in these time series. Some people claim that the Law of Large Numbers takes care of all, but this is not the case. With these time series, one can make some allowances for known effects like latitude and season and day/night, but after these are taken out, surely the error to be reported is the worst case, because one has no guarantee that the worst case can be avoided in the future.

    So, how does one propagate the errors from the range of 10 minute readings through to decades?

    Until some kind person tells me, I cannot understand why there is even any discussion about ensemble projections and actual measurements. At the whim of the author, the bounds can be made so large that nothing is significantly different, or so small that the ensemble mean does not overlap at all with the observations and their stated errors.
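    Following up on the Briggs quote above, a small R sketch (my own, purely illustrative): smooth pure white noise with a running mean, test the smoothed series for a trend with ordinary least squares, and the nominal 5% false-alarm rate is blown wide open.

    set.seed(3)
    n.sims <- 1000
    n.obs  <- 100
    k      <- 11                    # running-mean window
    t.axis <- 1:n.obs
    sig.raw <- sig.smooth <- logical(n.sims)
    for (i in 1:n.sims) {
      x  <- rnorm(n.obs)                              # white noise, no real trend
      xs <- stats::filter(x, rep(1/k, k), sides = 2)  # centred running mean
      sig.raw[i]    <- summary(lm(x ~ t.axis))$coefficients[2, 4] < 0.05
      ok            <- !is.na(xs)
      sig.smooth[i] <- summary(lm(xs[ok] ~ t.axis[ok]))$coefficients[2, 4] < 0.05
    }
    mean(sig.raw)     # near 0.05, the nominal rate
    mean(sig.smooth)  # far above 0.05: smoothing manufactures "significance"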

  4. If any model ever achieves a perfect result it will have recreated the planet earth and we will have to call it God.

  5. Here is an interesting paper by James Annan and JC Hargreaves (“Understanding the CMIP3 multi-model ensemble”). The paper is particularly interesting where it explains “why the multi-model mean is always better than the ensemble members on average, and we also identify the properties of the distribution which control how likely it is to out-perform a single model.”.

    Some of you might be interested in this primer about Kalman filters by Maybeck. It is well written and easy to understand. An obvious problem with combining model results is that they are not independent of each other.

  6. Whoops, I forgot to paste the link to the Maybeck primer on Kalman filters. It is beautiful to see how one combines info from independent pdf’s about the same target of measurement to produce a resultant pdf that is more accurate and with a smaller standard deviation than any of the input pdf’s.

    Of course, the Kalman filter can’t properly be used with the various climate models since they are not independent of each other.

    Click to access maybeck_ch1.pdf
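    To put numbers on the point about combining independent pdf’s (my own sketch, with made-up values): the precision-weighted combination of two independent Gaussian estimates of the same quantity has a smaller standard deviation than either input, and that gain rests entirely on the independence the climate models lack.

    # Precision-weighted combination of two independent Gaussian estimates
    combine <- function(m1, s1, m2, s2) {
      m <- (m1 / s1^2 + m2 / s2^2) / (1 / s1^2 + 1 / s2^2)
      s <- sqrt(1 / (1 / s1^2 + 1 / s2^2))
      c(mean = m, sd = s)
    }
    combine(m1 = 0.20, s1 = 0.10, m2 = 0.30, s2 = 0.15)
    # combined sd is about 0.083, smaller than either 0.10 or 0.15,
    # but only because the two estimates were assumed independent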

  7. #8 Anthony, it is not the irresistible force and the immovable object, more like trying to walk through a sponge.

  8. He didn’t let your last comment in?
    Just like Tamino, he might let the first 1-2 comments in and ‘debunk’ all you have said with a red herring. When you post your counter-argument he just doesn’t let it in.

  9. Oooh, how they must pray for heat. Bow to the sun, Jimmy me boy; it’s your only hope now.
    ===================

  10. I’m not sure I totally follow… so the observed temperature trend is significantly different than the ensemble mean, but he is saying the observations are still within the range of the ensemble so the ensemble has not been falsified (or that there isn’t enough data to falsify the ensemble.) My take away from that would be, the ensemble as a predictor has been falsified, but not all of the individual models are falsified. Would that be fair?

    Something has always bothered me about this whole line of reasoning… what is random between the models that would create a statistical distribution, the initial conditions? Or is the difference in feedback parameters, etc? If it’s the latter, the differences between the models would be systemic, not random, wouldn’t they? I get that no model will ever be perfect, but individual models should either be better or worse than others depending on how well they reflect reality. Why would you average them instead of taking the best? Unless none of them are very good.

  11. #14 J – He didn’t block my comment, the blog did for length of URL or something. The error was partially displayed and unintelligible.

    #16 Matt – sounds right to me. MMH just reveals that on average the models are biased 2 – 4 times over observation. It doesn’t disprove all models or observation but it displays the need for careful examination.

    I really have enjoyed the conversation, hopefully it will continue but the blog wouldn’t let the comment through so I put it here and left a link there.

  12. #16 Matt Y. speaks to a problem with this ensemble mean business that bothers me as well. Why is the mean of a group of theoretical models any more useful than a single model especially given the possibility that some of the ensemble members may be comprised of physically unlikely components?

    Briggs’ objection to combining products of multiple-processing (if I paraphrase it correctly) also seems applicable here.

    It makes sense to my meanest of understandings to work through the iconography of current Climate Science one step at a time, but can’t this whole thing be better dismissed with the objection that a combination of methodical guesses is not better than the best guess? That the idea of the ensemble mean is nonsense?

  13. (I was probably one of the not-supposed-to-comment people). But can I ask a dumb question.

    I’m quite familiar with averaging measurements, and I thought the idea was that the measurements (data) contained some random error (variability from various known/unknown causes) and if we averaged we could in some sense cancel away that variability. Something like: we assume the real value is the mean, and if we take multiple samples and average we get an estimator for the mean along with some way to get at confidence intervals. Maybe this understanding is wrong?

    Anyway, what is the meaning of averaging model ensemble members? (And let me back up a little: does “ensemble” mean multiple runs of the same model with different input, or runs of multiple models with the same input?) Either way, is there a basis for thinking that the model outputs contain random errors? Is there any basis for assuming any error distribution (normal or other)? Averaging wrong numbers just gives another wrong number – right?

    Oh well, I now return you to the intended high level math discussion.

  14. The whole thing is complicated. Certainly model mean doesn’t have much meaning when the assumptions are different but in reality all models are predicting reality. All the mean can tell us is that there is a systematic bias in the models. Not whether some are unbiased predictors. It’s really pretty meaningless and cannot bring about the rejection of ‘all’ models but it can tell us that the models are running hot in general.

    I thought James point about the variance of the average trend not predicting the scatter of the models was a good one, but that’s not the lesson the paper teaches. What the paper is teaching is that the average model is biased well above the observed trend. When combined with the post linked in my last comment above and the fact that similar assumptions are used throughout the models, it is apparent that either some of the assumptions are wrong or that the observed data is wrong.

    Observed data error cannot be discounted either but we’ve looked a lot here at satellite temps. They are imperfect to say the least. So are radiosonde measurements but my bet is on the models running warm.

    If we assume models are the cause of running warmer than observation, the base assumptions of sensitivity to aerosols and CO2 need tweaking. That’s why climate science won’t be addressing this issue anytime soon.

    What stinks though, is the claim that the paper’s calculation is flawed because they want the wider CI caused by model scatter. The scatter doesn’t change our knowledge of average trend though, that knowledge is controlled only by the variance in the average trend.


    If you have 1208 proxies and weight the hockey-stick-like ones to create a series, you don’t go back and look at the scatter of the trends of the individual input series to determine the confidence to which you know the trend of the output. Were we to calculate CIs that way, we could cancel the whole field of paleoclimatology today.

  15. #21 Jeff Id:”Certainly model mean doesn’t have much meaning when the assumptions are different but in reality all models are predicting reality. All the mean can tell us is that there is a systematic bias in the models.”

    Maybe if I forget how the ensemble is comprised and look at it as just another model, and note that it trends differently from the observations, then I can see that there is a problem either with the ensemble or the observations.

    “..predicting reality.” suggests that we can compare the prediction and the reality and if the difference between the two has some consistency, then infer something about biases among the members of the ensemble.

    Please forgive my paraphrasing, but I think I’m finally getting it. I do think that detecting bias through use of an ensemble is not what Annan is doing. He seems to be using it more as you say as “a predictor of reality.”

    Or at least he hopes to.

  16. # 21,

    I think that is a more than fair summary Jeff. I would go further. The individual models versus measured trends show that at least some of the models are almost certainly wrong (>95% CI). At a minimum these obviously incorrect models ought to be culled from the herd, since they are demonstrably incorrect. The remaining set of models could then at least be rationally argued to perhaps represent reality. That this culling process has not already happened is quite bizarre, and I suspect relates as much to the resistance to accepting a lower average model-projected warming trend as to anything else.

    An even more reasonable approach would be to note that since all the models rely on similar approaches (the same basic fluid flow equations), the differences between model projections must be related to parameterized factors (ocean evaporation rates, clouds, aerosols, black carbon, etc.) that differ between models. An examination of these differences between the models that are obviously wrong and the others might actually allow identification of areas where the models are weak.

  17. I think that is a more than fair summary Jeff. I would go further. The individual models versus measured trends show that at least some of the models are almost certainly wrong (>95% CI). At a minimum these obviously incorrect models ought to be culled from the herd, since they are demonstrably incorrect. The remaining set of models could then at least be rationally argued to perhaps represent reality. That this culling process has not already happened is quite bizarre, and I suspect relates as much to the resistance to accepting a lower average model-projected warming trend as to anything else.

    I think what many miss in these arguments is that one needs some a priori reasoning for selecting a model. Curve fitting models with in-sample observed data is a dangerous pursuit. You have not even made the attempt to withhold some of the data for “verification” – which is little better but still has much potential for data snooping. If you do not want to use an ensemble average then you must select models before testing, and with reasonable criteria.

    Another problem with the a posteriori selection process is selecting models for a certain region of the globe and ignoring how the models perform in other regions of the globe. What if I had a model that was dead on in the tropics but got the extra-tropics all wrong, and perhaps with averaging got the global temperature reasonably correct? What if it got the tropical troposphere correct but failed with the tropical surface temperatures? Individual models need to get it all, or nearly all, correct, in my view, in order to even qualify for some out-of-sample testing.

    I believe it was Douglass who pointed to Santer, or perhaps it was Karl, using a model that got the tropical troposphere correct but was way off with tropical surface temperature. I saw an article coauthored by Gavin Schmidt attempting to explain the recent warming of the Antarctica Peninsula using a model that got the Peninsula “correct” but apparently did much poorer in other regions of the globe.

  18. The null hypothesis is that each of the models can predict the “true” trend in global temperatures up to a stationary error associated with difference in random factors such as initial conditions. Only under that hypothesis can one use any of the models to forecast the trend over the next century. Since they are all aiming to predict the same trend, we can use the ensemble of models to get a better fix on what that common trend is than we could get from the noisy signal produced by any one model in isolation. The more “noisy witnesses” we have on that common trend (assuming the noise across models is not correlated) the better the fix we can get on the trend. That is why the standard error of the estimated model trend goes down as you include more models in the ensemble. The MMH analysis shows that the common trend that the models are pointing to is statistically significantly different from the trend in the data. The models run too hot.

  19. I just don’t get the ensemble concept unless the output of any given model is itself a random variable. When we predict statistical phenomena using finite element analysis, fatigue life for example, the output of the simulation is still deterministic. If there is some unknown parameter, a residual stress from a forming operation for example, we might run multiple models and give a range of predictions. We certainly wouldn’t average them all together and claim that that is the most likely answer though. The parameter has a value, we just don’t know what it is. One of the models is closest to “right”, we just don’t know which one.

    Predicting the behavior of a chunk of metal can be challenging enough. One can only imagine the assumptions and simplifications that have to go into modeling something like the climate of the planet. At some point, the amount of uncertainty has to make the utility of the model predictions questionable. Given how much we still don’t understand about the climate, it seems to me like the modelers are overconfident in their results, at least publicly.

    One last comment: is it just me, or are the climate modelers setting an awfully low bar for themselves? If my models were making predictions that seemed to disagree with physical test results after a few samples, and I said “Yeah, but you can’t reject my model predictions at a 99% confidence level, so we should trust this unproven model to make our business decisions until you can”, I’m guessing the response would be rather harsh. Instead of trusting the models until they are proven wrong, shouldn’t we be trusting observations until the models are proven useful?

  20. Using the range of model results versus observed results in attempting to show that the observed results are within the model results was how the climate scientists first attempted to explain away the differences in the ratio/difference between the tropical troposphere and surface temperatures.

    There is no a priori reasoning attached to the selection of models and model results, and therefore one can readily throw in a seemingly bogus model result that expands the range to encompass the observed results. That potential problem makes hash out of the reasoning that some climate scientists claim for the use of the range or a simple standard deviation.

    Another bogus bit of reasoning, used initially by Gavin Schmidt at RC in the Santer/Douglass debate and by others who have since joined the chorus, was that if one used an ensemble mean with sufficient numbers of results, the SE gets vanishingly small, to the point that with infinite numbers of model results the SE becomes zero and therefore the observed results would always be significantly different. I believe it was Lucia who had the best reply to that one: if the difference between large numbers of model results and the observed turns out to be very significant but small, no one would care whether that small difference was significant or not.

    One final point is that, without an a priori selection process (outside of curve fitting to observed results) that selects for better fits to the observed results, the wide range of model results has to indicate that the models differ so much that, taken together, the results are not meaningful or useful.

  21. #24,

    Fair enough. It is of course prudent to examine the behavior of the individual models beyond the global average temperatures, and at Chad H’s blog he does do some of that; there are clear differences in individual model performance for the tropics versus globally, for example. The models’ complete failure to show the kinds of longer term climate variation (especially pseudo-cyclical variations) is also pretty damning, since it suggests that there are substantial fundamental errors in the model in addition to the simpler stuff like selected parameter values.

    I agree that there is a lot more curve fitting going on than what the climate modelers admit to, and that culling curve-fit models based on simple performance measures does not mean that the remaining models are accurate representations of reality. On the other hand, those models which are clearly wrong based on simple measures, like global average temperature trends, for sure ought not be included in pooled estimates of future warming. If climate scientists and the IPCC insist on using a pooled estimate of future warming as a guide to policy makers, then at least they should be somewhat rational about which models go into that pooled estimate.

    The more relevant point is that the climate modelers always insist that the models are not curve-fits at all, and that they are 100% based on the best possible representation of physical processes. That there is a clear correlation between the size of assumed aerosol effects in each model and that model’s diagnosed climate sensitivity shows how dubious these claims are. But if we take the modelers at their word, then culling models which are obviously wrong when compared to data seems to me pretty much demanded.

  22. James Annan is certainly correct that the ensemble of models cannot be treated in a normal statistical sense (e.g., compute mean and error of the mean for purposes of statistical testing). To show the absurdity of this, I could create 100 individual models, run them, average their outputs, and end up with a model mean that is 10x more “certain” than any of the individual models. All without the pesky need to compare any of them to data, or make any real improvement in the model! How convenient!

    The mistake is in deflating the variance of the mean, not computing the mean. You use 1/sqrt(N) when the individual samples are independent of each other. That obviously is completely not true here.

    Whether you use a collection of one group’s models, or one model each from each group, the variation between model outputs is generally not well defined. When comparing against data, the models that were more carefully constructed should in principle correspond more closely to the physical data than those that weren’t, and the difference between such a model and a more poorly written one is only a measure of how much care is needed in constructing the model (if they aren’t very different, then somebody’s made a model that was needlessly detailed).

    What you can treat in a normal statistical fashion is the ensemble of model outputs from the same model, for example, when the inputs and forcings to the model are varied in some well defined fashion (e.g., treated as normally distributed values). This of course is just ordinary Monte Carlo’ing, and when testing model against data, you are required to include this type of uncertainty in your statistical test.

    When comparing models to data, you cannot (in this case) combine the temperature reconstructions together either. The reason is that the data sets are largely the same underlying group of data (with few exceptions) and the only real differences have to do with spatial weightings, methods used for homogenizing data, algorithms (when present) for correcting for urban heating, and so forth. If the logic were correct that it was OK to combine reconstructions in this way, then again I could go to town, generate 100 slightly different ways to combine the same data and voila! we suddenly have a reduction by 10 in our experimental uncertainty!

    What you can do is work out the confidence interval for a given temperature reconstruction, which would have to be carefully developed by considering the uncertainty introduced by the presence of regions with poor or no coverage [e.g, the interior of Africa, parts of Siberia, the Arctic sea, Antarctica, etc], or the effects of variation in the coverage over time, or urban heating and other land usage changes, etc. Unfortunately this is rarely done.

    Which is odd to me, because I’ve had it drummed into me from day one that a physical measurement (and yes, in the measurement theory sense, computing the global mean temperature from the network of existing temperature measurements is itself a type of “measurement”) without an accompanying uncertainty interval is virtually meaningless.

    This is done sometimes; it should be done at all times. By and large the individual reconstructions should and do agree with each other within their statistical bounds; otherwise the statistical bounds haven’t been computed properly or somebody has made a bigger mistake in how they did their reconstruction (the underlying data sets are the same after all).

    In summary, when comparing model to data you should include the effects of serial correlation in the data set when making comparisons, as well as the variability introduced by atmospheric-ocean oscillations (which can’t be modeled directly, and need to be treated in a statistical sense) and you should compare an ensemble of outputs from single models to individual reconstructions.

    If after you’ve performed this exercise you get, e.g., less than a 5% chance of agreement between data and model, then you can probably reject the model. If none of the models individually are consistent with the data, then all of the models are eliminated. That’s how it’s done.

    What you don’t do is average an ensemble of models and use the error of the mean of that model ensemble as the basis for testing consistency between model and data.

    What you could do is assume all of the models belong to the same ensemble (include each individual model output as part of a super-ensemble), and use the variance in that super-ensemble in your comparison of the mean of all of the individual model runs to measurement. This goes the other way from applying the error of the mean equation to the ensemble, because you’ve probably inflated the variance by combining the models rather than artificially deflating it, but it is another way of “getting more runs” to improve the ensemble mean, so there is probably some value in doing this.

    Without a lot of work, we can clearly see that the models are running “high” compared to the data (probably at the top edge of the 95% CL). That almost certainly indicates a problem with the models, it could be serious (e.g., too high a climate sensitivity to CO2 forcing) or benign (due to resolution issues, the models don’t properly handle ocean-atmospheric fluctuations so one could get large excursions over periods of 10-years between model and data).

    I don’t think James is saying you can’t combine independent samples to improve your estimate; he’s saying these samples are in no possible way independent of each other in any statistical sense.

  23. Carrick,

    “James Annan is certainly correct that the ensemble of models cannot be treated in a normal statistical sense (e.g., compute mean and error of the mean for purposes of statistical testing). To show the absurdity of this, I could create 100 individual models, run them, average their outputs, and end up with a model mean that is 10x more “certain” than any of the individual models.”

    I don’t agree with this. The improved CI would simply represent the average of your assumptions in your 100 models. The tightened CI would represent the confidence of the resulting trend, nothing more. The difference between the trend and measured data would represent how realistic your assumptions were on average.

    So in my opinion you certainly can run 100 models, get a tighter CI, and reject or pass significance with respect to measured data. As Lucia pointed out though, if the amount of difference between the two trends is small, statistical significance won’t matter. In this case we have 2 to 4 times observation and 99% significance, where significance is a measure of the certainty of the average model trend, nothing more.

  24. Again, maybe I’m just being dense here, but I don’t understand how confidence intervals even come into play. I assume that each of the models is deterministic (though it is possible I’m wrong on that.) Running different models is not replication. What you have is a parameter sensitivity study, not a statistical distribution.

  25. Matt, we are interested in the trend created by long term changes due to CO2. When we measure the trend, all kinds of interactions create short term signals on top of the trend we’re looking for. This stuff gets treated as noise. So basically when you calculate a trend, you need to judge how much the noise would affect your calculation. That is all the CI is good for.

  26. Jeff Id:

    I don’t agree with this. The improved CI would simply represent the average of your assumptions in your 100 models. The tightened CI would represent the confidence of the resulting trend, nothing more. The difference between the trend and measured data would represent how realistic your assumptions were on average.

    Start with the derivation of the error of the mean and see whether you’ve violated any assumptions by using it to tighten the CI. One big zinger that comes up is the assumption of independence of the samples, totally and completely violated here.

    That’s it, basically: using the error of the mean here is wrong. It’s not even 5% right, it’s 100% wrong.
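    A small R sketch of Carrick’s independence point (my own, with invented numbers): when the “models” all share a common error, the usual sd/sqrt(n) badly understates how far the ensemble mean can actually wander.

    set.seed(5)
    n.models   <- 23
    true.trend <- 0.20
    ens.means <- replicate(5000, {
      shared     <- rnorm(1, sd = 0.05)         # error common to every model
      individual <- rnorm(n.models, sd = 0.10)  # model-specific error
      mean(true.trend + shared + individual)
    })
    sd(ens.means)          # real spread of the ensemble mean, about 0.054
    0.10 / sqrt(n.models)  # what the naive "error of the mean" claims, about 0.021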

  27. Jeff, you said: “What the paper is teaching is that the average model is biased well above the observed trend. When combined with the post linked in my last comment above and the fact that similar assumptions are used throughout the models, it is apparent that either some of the assumptions are wrong or that the observed data is wrong.

    Observed data error cannot be discounted either but we’ve looked a lot here at satellite temps. They are imperfect to say the least. So are radiosonde measurements but my bet is on the models running warm.”

    The models are running warm compared with what? Presumably the observed data, right? Therefore, even if the data are off, don’t we have to look to the models as the problem in the variance from the observations, unless we have a situation in which the observed data used as input for the models was good, and then subsequently the observed data in later years became bad, while the models continued to demonstrate the “real” temperature, unlike the now bad observed data. Is that really the essence of the argument they are putting forward?

  28. #33

    But the calculations in MMH allow for contemporaneous correlation of the errors across the different models and inflate the estimated error variance as a result. Even allowing for that, the best estimate of the trend predicted by the models differs significantly from the data.

  29. Carrick,

    I understand your point but we know each model is at least partially independent. The various tweaks and complexities produce different trends or else there wouldn’t be any variance loss on average. Claiming that they cannot be combined to determine if there is a bias, is therefore not correct. In addition, they are all supposed to represent reality.

    If you were using the model mean CI to represent the behavior of the models, you would be correct. MMH used model mean to compute an average trend and the significance of that average trend is the only thing required to differentiate from observed trend.

    It is reasonable to average the models together to see if they generally match observation, not much more can be determined from that. If the null fails, then the question becomes how high and is the average trend different from observed trends in a big way. It is far simpler than you are making it.

    Now if you want to argue that autocorrelation due to shared features between the models should create reduced DOF and an expanded CI, sure why not. But saying it is incorrect because the models are not completely independent is a bridge too far. Considering that the current CI isn’t even close to encompassing observations I’m not sure how useful that would be.

    Finally, if you were to use the trend to conclude that all models are bad then you have to use a widened CI to encompass the range of all models, and all you have to do is include a couple of low trend models in the group, and you’ll widen the sd and never fail the test. This is not what is being done.

  30. Eric,

    Observed temperatures are completely independent from models, except for the fact that the models are tested against observation and tweaked over time until they match. In this case observed temperatures were done by satellite and radiosonde measurements, surface temperatures were left out. There are plenty of sources of error in satellite temps, and they have been treated fairly badly by the mainstream community for their lower than ground trends. If both satellite and balloon measures have serious problems, we may need to correct the data to match the models rather than the other way around.

    I find the failure of measurements to be very unlikely.

  31. If both satellite and balloon measures have serious problems, we may need to correct the data to match the models rather than the other way around.

    Bite your tongue. The fact that the measurements might be off doesn’t mean the model results are correct, or even any better… Throw in all the caveats you want, but don’t fudge the data. There is already too much of that in climate science.

  32. By your method, more data never ever improves the certainty to which you know the trend.

    It really is an exercise in logic, isn’t it? The output of each model can be thought of as “climate estimate” plus error, with error being the difference between model and observation. If the model was “correct” then you would have a normal distribution of error centered about its mean, and the difference between the model mean and reality would be the normal type of error expected in any such exercise. To the extent that any given model is “not correct”, there would be an offset between the model distribution of outputs (mean + error) and reality. If a model is close to being correct, then the reality offset would be insignificant, and the difference between model output and reality falls within the expected normal distribution.

    The population of model means could be thought of as the normal distribution of the best estimates of these reality “offsets”. It is the outputs of all possible combinations of inputs which are allowed to vary. Climate science assumes (hopes) that reality falls somewhere within this distribution. MMH are suggesting that observations fall outside the confidence intervals of this distribution.

  33. Jeff ID:

    I understand your point but we know each model is at least partially independent. The various tweaks and complexities produce different trends or else there wouldn’t be any variance loss on average. Claiming that they cannot be combined to determine if there is a bias, is therefore not correct. In addition, they are all supposed to represent reality.

    Jeff, I never claimed you can’t combine the models; what I am saying is you can’t apply the error of the mean, computed from the assumption of independence.

  34. Peter Hartley:

    But the calculations in MMH allow for contemporaneous correlation of the errors across the different models and inflate the estimated error variance as a result. Even allowing for that, the best estimate of the trend predicted by the models differs significantly from the data.

    I don’t think they do it correctly though, because they can’t account for the overlap in the physics assumptions made in the various models. (Obviously the physics assumptions should largely overlap, the trouble is how do you model this statistically? I think the correct answer is “you can’t.”)

  35. Layman Lurker, I think it is more complex in this case. What we have is a combination of initial conditions (climate in 1850 for example) which itself isn’t known plus a sum of forcings that vary over time (some of which are not directly measured, e.g., sulfate emissions), plus we have natural climatic variations, e.g. atmospheric ocean oscillations such as the PDO and the ENSO that lead to large variations in climate over short periods plus weather.

    So what you end up with is a deterministic trend generated by e.g., CO2 plus other contributions that you have to characterize in some fashion. The easiest (and most reliable) thing to do is simply take a long-enough running average that all of this other garbage gets washed out, or if you can “reset” the Earth and run it again with slightly different initial and forcing conditions, then you could average all this other junk.

    If you want to do a short-term (e.g., 10 year) comparison with climate, what you basically want to do statistically is compare the trend in the model to the trend in the data, while adjusting the uncertainty for the variability in model and data temperature trend introduced by the short-period “junk” (e.g., atmospheric ocean oscillation, weather, hereafter “SPJ”). Part of a “good” statistical test might be to not only test the trend but to test the distribution associated with SPJ to see if the models are reproducing the distribution of SPJ seen in the data. [In practice this latter test will almost certainly fail, the climate models lack the resolution to reliably reproduce SPJ, so testing it is a needless complication].

    But here’s my main point: combining models doesn’t reduce the variability associated with SPJ that is superimposed on the temperature trend. It’s a feature of climate after all. All you are doing is removing it from your model estimates of global mean temperature.

    What Jeff and others are trying to argue is you use a variance reduced version of SPJ to compare the trend from the model against the trend from the data. Unfortunately that is just wrong. It would only be useful if SPJ were both deterministic and reliably predicted from the climate models.

  36. Carrick,

    I have no idea why removing the independent random ‘spj’ noise to reveal the less independent forcing is an incorrect method. The less independent yet still different forcings naturally have a very smooth trend, and the model reaction to them is exactly what we want.

    You can argue that the CI should be expanded, but it won’t expand by much, because if you remove the atmospheric noise, there ain’t much left.

    Your point that it is ‘wrong’ is far too absolute. I want the base forcing, and I want to know if it is in the right magnitude for observations on average. MMH10 shows it is not.

  37. So what you end up with is a deterministic trend generated by e.g., CO2 plus other contributions that you have to characterize in some fashion. The easiest (and most reliable) thing to do is simply take a long-enough running average that all of this other garbage gets washed out, or if you can “reset” the Earth and run it again with slightly different initial and forcing conditions, then you could average all this other junk.

    That’s great if you already have a handle on all of the “other contributions.” There are an awful lot of unknowns baked into those two words… unless you are a climate modeler.

  38. If you ran exactly the same model a hundred times with slightly different start conditions and averaged, you would reveal a trend for that model to high fidelity. Oversampling. This is a completely valid process. You could then compare knowledge of that trend and its CI to weather noise. The CI is how well you know that trend, nothing else. It states nothing about the variability of various model runs; it states the variability of the forcings.

    Yes, using this method, you could generate narrow CI’s – down to the level of the forcings. And the CI of the curve that remains indicates how well that curve is known. I mean that’s what it comes down to, how well the final curve is known.

    If you take the same model and give it a hundred different start points, you will get an answer almost exactly the same. If you do a hundred start points a thousand times, you get the same curve every time, within the CI.

    Now if you want to argue that the CI should expand some, perhaps you are right, but not by much.

    Annan argued that the combination of models is non-physical; he’s right. I’m arguing, as is MMH10, that the non-physical combination gives us a trend biased well above observation. Since many of the models have similar assumptions, this is proof that those assumptions and/or the data need to be examined again.

  39. #41

    Actually they adjust the standard error of the trend estimate using two different methods. One specifies a parametric form for the error covariance matrix. The other allows for arbitrary patterns of correlation across models and arbitrary patterns of autocorrelation. With fewer assumptions, the latter gives even larger error bands, but it does not matter. The estimated trend still differs significantly from the observed one. It seems to me that if you want to argue they have an incorrect model of the error structure you need to specify a plausible alternative one and re-do the estimation. Just saying their adjustments are wrong does not cut it.

  40. Matt Y:

    That’s great if you already have a handle on all of the “other contributions.” There are an awful lot of unknowns baked into those two words… unless you are a climate modeler.

    This is a bit of a circular argument, it appears to me. I recognize there are issues with the “other contributions”; so does James Annan (a modeler). Jeff is arguing for combining the models in order to reduce their apparent uncertainty; James, a modeler, is saying you cannot.

    Who is arguing for a more precisely known result than is really there? The modeler or the critic?

    Peter, I understand what they do. The methods they are using were developed to study measurements, not model outputs. Also, I don’t have to propose an alternative if, as I do, I think the problem is ill-posed.

  42. “Who is arguing for a more precisely known result than is really there? The modeler or the critic?”

    I’m saying the models’ forcings and the resulting certainty of warming are something highly quantifiable. Quite a few other ‘modelers’ have agreed.

    Claims that we cannot average out the atmospheric noise to nail down the forcing due to your perceived dependence are incorrect.

  43. Carrick,

    How about this example. You have a ramp signal 0-1 for 100 years. You add red noise and make 10000 series. You average the series and calculate a CI for the resulting 0-1 signal. Can you tell me what I’m missing, and why that is not an acceptable CI and answer?
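    Here is that thought experiment in R (my own quick version, with 1,000 series instead of 10,000 and an arbitrary AR(1) choice for the red noise): averaging the noisy realisations recovers the ramp, and the CI on the average tightens with the number of series, which is exactly why the disagreement below comes down to whether the realisations are independent.

    set.seed(6)
    n.years  <- 100
    n.series <- 1000
    ramp <- seq(0, 1, length.out = n.years)
    sims <- replicate(n.series,
                      ramp + arima.sim(list(ar = 0.7), n = n.years, sd = 0.3))
    avg <- rowMeans(sims)
    se  <- apply(sims, 1, sd) / sqrt(n.series)  # pointwise SE of the averaged signal
    range(avg - ramp)  # the average tracks the ramp to within a few hundredths
    mean(2 * se)       # the +/- 2 SE band is only a few hundredths wide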

  44. What you could do is assume all of the models belong to the same ensemble (include each individual model output as part of a super-ensemble), and use the variance in that super-ensemble in your comparison of the mean of all of the individual model runs to measurement. This goes the other way from applying the error of the mean equation to the ensemble, because you’ve probably inflated the variance by combining the models rather than artificially deflating it, but it is another way of “getting more runs” to improve the ensemble mean, so there is probably some value in doing this.

    Carrick, assume a StDev =0.10 for the super ensemble and I have 10 models replicated 5 times each for a total of 50 results. I use that 50 sample data to estimate the StDev which was found to be 0.10. Now if I take an average of 50 samples for a comparison, I would calculate the Std Error of the mean of those 50 samples as StDev/50^(1/2).

    Once you make the assumption of the super ensemble, I do not see where your problem with independent samples comes into play. If I have a normal distribution (I would need to check this) of model results over all the ensemble’s individual results and I randomly sample these results, the average I obtain would require that I use the standard error of the mean.

    Let us say that I have used the 10 individual model means (of 5 replicates each) instead of the 50 model results. The mean is going to be the same, but now my StDev will be smaller and I will be dividing by the square root of a smaller number so the Standard Error of the mean should be essentially equal in both cases.

    # Simulated "super ensemble": 50 results = 10 models x 5 replicates each
    Rnorm <- rnorm(50, mean = 0, sd = 0.1)
    # Standard error of the mean using all 50 individual results
    SEAll <- sd(Rnorm) / sqrt(50)
    # Collapse to 10 per-model means (5 replicates each)
    M1_M10 <- rep(NA, 10)
    for (i in 1:10) {
      M1_M10[i] <- mean(Rnorm[((i - 1) * 5 + 1):(i * 5)])
    }
    # Standard error of the mean using the 10 model means
    SE10 <- sd(M1_M10) / sqrt(10)
    SE10
    SEAll

  45. Jeff ID:

    How about this example. You have a ramp signal 0-1 for 100 years. You add red noise and make 10000 series. You average the series and calculate a CI for the resulting 0-1 signal. Can you tell me what I’m missing, and why that is not an acceptable CI and answer?

    Of course that is an acceptable procedure, but that’s not the version of the problem that we have with climate.

    Let’s suppose you have a detailed model of temperature as a function of forcings. Let’s drive it with white noise (don’t worry the response will be reddened) + a deterministic driving term.

    I give you one instance of that run. Just one. And to make things more interesting, I tell you that I may or may not have added a bias to that temperature (measurement error). I also don’t give you the form of the forcings, other than it is generally monotonic and slowly varying, and regarding the SPJ all you know is it’s unlikely you’ll be able to accurately reproduce it using your computer.

    That’s more akin to the problem we have here: You have to guess part of the forcings, and the only thing that helps you is some of them are known (so you can use variations in those to infer something about the unknown ones).

    Claims that we cannot average out the atmospheric noise to nail down the forcing due to your perceived dependence are incorrect.

    Nobody said you can’t average out the noise, in fact, that’s a good thing to do. The trouble is in using an artificially reduced variance to estimate the likelihood that current temperatures are not consistent with model trends.

  46. Carrick,

    Maybe we’re getting closer to an agreement. If you run your example model over and over with different white noise, wouldn’t you eventually find the response to the monotonic forcing? And wouldn’t your confidence in that response be just the noise in the resulting average?

  47. #48

    I don’t really understand what you are saying here, so these are perhaps dumb questions, but if you don’t think statistical methods can be used, how do you propose we check whether the models can make an accurate enough prediction of the trend that we can use their predictions out of sample as the basis for policy? Also, if we do not view the variations across models as statistical noise, what do we view it as instead? Finally, if we cannot use statistical methods (essentially some form of averaging) to derive an estimate of future trend from model outputs because they are not validly viewed as producing trend (signal) plus noise, what should we do instead to get our prediction for future trend?

  48. Further to my last post, I did not see your comment #52 before I wrote #54. I see now that you are ok with the averaging to reduce the effect of noise. But then I don’t see why you say the way MMH go about measuring the error of that trend estimate is an “Ill-posed” problem. I can see you might say they need a different hypothesis about the structure of the noise, but I thought you were saying #48 that it is not the exact method they used that you object to but the whole idea of modeling the problem as signal plus noise.

  49. Anonn argued that the combination of models is non-physical, he’s right.

    The flaw in this logic is that just because the mean of the population of model runs is not in and of istself an estimate of reality, does not mean that the ensemble cannot have statistical properties. It is the de facto solution space, and while we don’t necessarily expect the mean to converge on reality, we do have an expectation that reality lies within this space. The center of the space is not an estimate of reality, it is our expectation that reality lies within the space. As we move futher away from the center (mean) it is less and less likely that our expectation is realistic.

  50. I may be simple, but I take Carrick’s core point to be that you cannot validate models en masse by using ‘their average’. Models stand or fall utterly independently, since they have necessarily different and mutually exclusive algorithms and pararmeters built-in.

    And is not ‘culling the worst’ just a form of “curve fitting”? It would have to be followed by prospective follow-up and validation. Typically, this produces a new scatter-plot in which what previously looked good is now revealed to be a mere participant in a crowd-scene of drunkard’s walks.

  51. Jeff ID:

    Maybe we’re getting closer to an agreement. If you run your example model over and over with different white noise, wouldn’t you eventually find the response to the monotonic forcing? And wouldn’t your confidence in that response be just the noise in the resulting average?

    Jeff, I don’t think we’re that far off agreement, but I think what you do need to consider is

    1) The forcings and initial conditions aren’t known exactly,
    2) In addition there is “short period junk” (SPJ) that is either not modelable or is currently poorly modeled,
    3) In addition, we have modeling errors that vary between models,
    4) We just have one time series. which contains SPJ that may or may not be characterizable separate to changes in forcings over time.

    I can think of other issues, but I think these are enough to proceed:

    A) Uncertainty in forcings and initial conditions #1 introduces a spread in model responses, which is a separate issues from the SPJ contained already in the models.
    B) You have to average a number of model outputs to get a smooth trend from them (or you have to do e.g a 30-year average as I mentioned).
    C) You can increase the number of model runs by averaging over the outputs from multiple models.
    D) This helps reduce the effect of SPJ (why the multi-model mean looks smoother) but it does not affect the uncertainty associated with not know the model forcings and initial conditions adequately.

    My chief criticize in a nutshell with combining models is I believe not only are they reducing the SPJ, they are artificially deflating the uncertainty from the model forcings and initial conditions. If I am correct, this is a serious flaw in the analysis.

    (And to address Peter on this point, I don’t think the way they are combining the models is able to prevent this, but upon thinking on it, I believe I know a method for combining the models that addresses this.)

    In an ideal situation, you would never, ever combine the models. Instead, you would do a model-by-model comparison to the data (after doing the appropriate Monte Carlo’ing), and if all models were rejected, then that’s a problem for the modeling community. If the only purpose of combining the models is to reduce the SJP, there are other ways of doing that (more runs, faster computers, or longer time averages).

    I would also caution against statistical tests that are explicitly or implicitly testing whether the SJP from the models belongs to the same distribution as the SJP inferred from the single data series. I believe this is 100% guaranteed to fail.

    While it is an important question, it doesn’t address the real question we all want to know, which is “do the models as written give the correct environmental climate sensitivity for CO2”?

  52. #60, Carrick

    I agree, the only really rational approach is to compare each individual model with measurements and conclude if the individual models are consistent with reality, then reject those that are not.

    If you really want to combine the models, then your are faced with nutty choices. If you throw all the models into a pool, then the addition of horrible models, that are very far from measurements, makes the pool always “consistent with” the measured trend, no matter how far the models as a whole are from the measured trend. This is the present situation.

    If you do what Douglass, Santer, and now MMH10 have done (with their joint model hypothesis test), then you artificially reduce the variance, so that adding more models makes the pool always fail…. also not very satisfying.

    The only rational way I can see around this conundrum is to determine the mean model trend from the combined pool, then determine the individual uncertainties for each of the models that go into the trend, and use the average of the model uncertainties as an estimate of the overall uncertainty for the average of the models. Still very ugly, but better than nothing, and more defensible than other ‘combinations’ of the models.

    All of which is not terribly important. The models are just way off.

  53. Model pools are nonsense, unless you intend explicitly to mash them together into one metamodel — which is probably impossible due to logical inconsistencies which would lock the resulting mess up tight.

    IMHO.

  54. I can’t reword this again so people will understand, but there are a lot of us who believe that you can show models are warmer than observations from the ensemble mean. I think it’s rather obvious but am waiting to be enlightened. I really have been wrong enough times that I can listen to qualified opinion. Not that I like to be wrong or will easily concede it.

  55. I have heard it argued that a model rendition of climate gives a single rendition out of a possible infinite number of renditions that are possible given the chaotic nature of climate and that the observed climate could be one of them that was actually realized. Would we expect those renditions to be normally distributed and each rendition analogous to withdrawing a sample from a normal distribution or at least an approximately analogous to that?

    Do any of you recall that Steve M at CA attempted to show that the model results used in the Santer et al tropical troposphere paper had within model results that were much tighter than those between models – but unfortunately used the wrong data? He had indicated he would redo the analysis when he had time.

    My point here is that we should be talking more about how we suspect the models work or at least are expected to work instead of making statements about how to correctly apply statistics without referencing in some detail how the models are expected to work.

  56. I agree, the only really rational approach is to compare each individual model with measurements and conclude if the individual models are consistent with reality, then reject those that are not.

    It has been stated that using the standard error of the mean somehow artificially reduces the variation. Using Carrick’s logic I can run the same model many times and use the resulting average of the several runs to compare with observed result. In that comparison I am obliged to use the standard error of the mean. Is that somehow artificially deflating the error? I suspect with sufficient model runs I could narrow the standard error enough to reject the hypothosis of no difference between the model and observed results. But like Lucia has said if that is a tiny error who would care for all practical, purposes.

    I think that a major part of the problem with having to use ensemble means of many models is because no one steps forward and says here is my a prior selection process for models and using that/those model(s) here is the comparison with the observed climate. This situation truly strikes me as an indication that no one has a good method of evaluating a black box result or better that anyone has a comprehensive view of what goes into all the models.

  57. 64, Jeff
    “you can show models are warmer than observations from the ensemble mean”

    Only if you have a way to preserve the variance of the models when the ensemble mean is calculated.

    Say, for example, a single model has a 95% confidence interval of +/-0.3C from the multi-run mean of that model, and say that you have god-like knowledge, and so you know for certain that single model is a completely accurate representation of the climate. Now you average 20 very similar (almost perfect) models, some very slightly higher in trend than correct, and some very slightly lower in trend than correct, but with each individual model still showing the correct individual uncertainty of +/- 0.3C. When you pool these 20 models as you describe, you will find that the variation of the pooled trend of the individual models is quite low, even though you know already (with your god-like knowledge) that the true variability of a single instance of the climate is +/- 0.3C from the many-instance average trend. So you would have to conclude that the pooled estimate of variation is too small to properly represent the variability of the true climate, which your god-like knowledge says has a +/- 0.3C uncertainty. Increase the number of models to 10,000 and the calculated variance for the pool drops to ~1% of the correct variance. So the problem is how to correctly preserve the variance of individual models when the pooled average is formed.

    I think it would be best to average the model trends and the model variances separately.

    But all that being said, the models are in fact running way too warm.

  58. Re: Climate Models

    One other influence that people aren’t considering here is the numerical error associated with discretization of the differential equations that are (purportedly) being solved by the various computer codes. Because the differential equations are non-linear, you can have a perfect numerical solution (i.e. one that has no truncation error due to discretization) which is chaotic. But it is a given that all climate models have some amount of numerical error (which grows in a time marching scheme unless sufficiently small time steps and spatial grid element sizes are used), although this is hard to get a handle on this since many of these huge computer codes are not very well documented (there are exceptions). To what extent does this error affect the prediction of the true chaotic solutions?

    In addition, there are numerical stability issues that can cause problems, particularly when you have a large number of coupled equations (as you certainly do in climate models). You can develop some stability criteria for simple linear systems, but non-linear systems are never guaranteed.

    By the way, I saw a presentation online by a researcher from the UK where he showed a plot of the absolute global mean surface temperature as a function of time, predicted by several different computer codes. Interestingly, the absolute temperature levels were all over the place, although the anomalies were similar…so why is that???

  59. Steve,

    “Increase the number of models to 10,000 and the calculated variance for the pool drops to ~1% of the correct variance.”

    Then you have calculated the true response of the model. In the same fashion as the question I posed in 53.

    “So the problem is how to correctly preserve the variance of individual models when the pooled average is formed. ”

    Not correct, the problem is to convince the readers here that the multiple runs have simply revealed the underlying model characteristics of forcing everyone is most interested in. The short term variance has nothing to do with long term trend. From a single run, short term varaince determines the uncertainty. Over many runs, the short term variance cancels and we see the nature of the model.

    Hopefully Carrick will return soon and answer my last question.

  60. 69, Jeff

    “Over many runs, the short term variance cancels and we see the nature of the model.”

    Yes, that is right, but that is not really the question. The question is if the measured climate trend is or is not consistent with a particular model. That is, how can you tell if the measured trend is different, not from the average of many riuns of the model, but from a single run of that model? If a model is in fact correct, and the model predicts a certain level of variation from run to run, then any measured trend which lies within the expected range of variation for that model remains consistent with the model; you can’t say the model is wrong unless the measured trend lies outside the expected variation for a specific model. (Chad’s single model hypothesis test).

    Here is a simple case: Suppose I run a model 100 times covering the same 30 year period, and find that for 95% of those runs the model trend varies between +0.5C and +0.1C, with an average of +0.3C. Now I compare the 100 runs with actual temperature measurements over 30 years, and find that the actual trend over 30 years is 0.15C. What can I say about the model? Since a significant number of 100 model runs had trends equal to or lower than 0.15C, I can’t reject the model based on the measuremnts… the measured trend lies within the 95% confidence region of the model. If on the other hand, the measured trend is 0.05C over 30 years, then I can immediately say the model is almost certainly wrong, since 0.05C is lower than than the expected trend for all (or almost all) of the 100 runs.

    With an ensemble, averaging reduces the expected variation for the pool far below what the variation of a single model within the pool would be. The same is true if you average multiple runs of a single model and then calculate the variability of the average for that model. The variability of the average of N runs ought to fall as 1/(N^0.5) when compared to a single run. The “true nature” of the model (its average) is of course more clear as N increases, but that does not tell you if the actual measured trend of the Earth is or is not consistent with the model including its expected variation from one run to the next.

    Or to put is into a very few words: the variability of the average is not the same as the average of the variability.

  61. #7 Geoff Sherrington

    Agreed absolutely. It is this exact point (which I think you have cogently made before) that James Annan persistently deletes from his blog

  62. Jeff, yeahs to #53 both question, but not the right problem.

    The problem here is how you treat the uncertainty with the unknown forcings.

    10,000 runs smooths the effects short period junk, but it doesn’t help you with the range of uncertainty in the model response due to uncertainty in the forcings.

  63. How do you compute a standard deviation for the ensemble when the members of the ensemble are not independent of each other (they are heavily correlated algorithmically)? If you can’t compute a standard deviation, then you can’t compute a confidence interval. How do you do statistically meaningful comparisons of the ensemble average to anything else without a confidence interval for the ensemble average?

    I’m not sure that running the climate models huge numbers of times will be very beneficial. If you are changing the initial conditions for each run, then the results will of course vary a bunch because you effectively are modeling different climate systems.

    You shrink your standard deviation by increasing sample size. It is not clear how running the models beaucoup times with the same data increases sample size.

    Given the mess of the surface temperatures and the ongoing mystery about huge amounts of “missing energy” in the ocean heat content measurements since 2005, it seems like the first step is to improve the basic sensor/data collection system and validate its performance.

    Might be a good idea to review the paper by J. D. Annan and J. C. Hargreaves (link in comment 12) to see if their analysis of the multi-model ensemble makes sense.

  64. I think the meaning of “independence” with regards to the model means in the statistical sense is being confused here. The view from Carrick and Steve Fitzpatrick, as I see it, apparently judges that since the model means come from similar processes that the results cannot be considered “independent” for statistical purposes.

    That would not make sense when considering replicates that come for the same model as they are constructed very similarly and yet I do not believe anyone is saying I cannot average these results and use the standard error of the mean to compare with observed results.

    Further, think about the use of process control charts that use sample means from the same process and the standard error of the mean to statistically control the process. In the Carrick sense one would apparently consider the means dependent and not eligible for this type of statistical treatment. What is required is that the stochastic part of the result varies more or less randomly with a normal distribution.

    Finally if I have a set of model means that follows a normal distribution, I can see no reason at all why I cannot use an average of these means and the standard error of the means in a comparison with the observed results. The hypothesis would define that I am looking the average of the model results and that is necessitated because no one a prior is selecting the preferred model.

  65. …the multiple runs have simply revealed the underlying model characteristics of forcing everyone is most interested in.

    Jeff, this statement makes a lot of sense to me. IIRC, you are asking “what is the likelihood that the set of model forcing characteristics used to produce the ensemble contains the true forcing responses?” The relationship of the ensemble mean with observation should be a clue that we likely need to go back to the drawing board WRT model paramerization – either loosening the constraints on the “tunable” parameters or perhaps even more fundamental then that.

  66. Carrick,

    So you agree we can run the same model over and over and find greater confidence in the result. What do you think if we have two models, the second having different aerosol forcings. Can we then run both models over and over, average them together and use the average to determine the confidence in their model trends?

    After all, no matter how many times we run the two, we won’t get a different answer.

    And wouldn’t the CI for that answer be determined by the variance in the average of the two models?

    Yes the average of the two is non-physical at some level.

  67. Jeff, in a nut shell, you can run the model with as many different settings as you like, but the underlying uncertainty associated with the initial starting conditions and the forcings won’t get washed out by that. This is true whether you run one model many times or combine many models.

    I think what is happening with the way people have been combining the models is the variance introduced by the underlying uncertainty associated with the initial starting conditions and the forcings is getting artificially reduced.

    Think of the difference between systematic uncertainties and statistical uncertainties: Uncertainty in the initial starting conditions and the forcings is akin to systematic error here, you can’t combine a whole bunch of models and all of a sudden know the answer much better than you would have otherwise. It’s a different sort of beast than ordinary measurement error.

    The other problem is the assumption that the models have equal weight and are independent of each other. In practice, they are all based on the same physics, but some models are implemented better than the others: They have more funding, more personnel associated with the project, and greater computational resources.

    This is a case of “science isn’t a democracy”. You pick the answer the best model gives you, you don’t just average over the good, the bad and the ugly.

  68. #80 Carrick: “You pick the answer the best model gives you, you don’t just average over the good, the bad and the ugly.”

    Why isn’t above observation equivalent to “the ensemble mean is unlikely to be useful?”

    Or is it only useful in the sense of exposing bias common to the models?

    As an aside, are any models described in written form including the math, parameters, and so forth or are they creatures of code?

  69. Carrick,

    Of course, the paper is trying to determine whether models (including starting points) are representing reality.

    You wrote, “you can’t combine a whole bunch of models and all of a sudden know the answer much better”. The paper doesn’t ‘know’ any answers about climate better. It knows the models response better.

    So again, can you have two models as in my #78 and know their response better certainty from multiple runs?

    I hope you understand that by going through these basic steps, eventually we will understand each other.

  70. It’s almost basic enough to give me hope of understanding it myself. Or to have my view that the models don’t represent reality validated.
    ===============

  71. 80.Carrick said
    October 3, 2010 at 1:51 pm

    Jeff, in a nut shell, you can run the model with as many different settings as you like, but the underlying uncertainty associated with the initial starting conditions and the forcings won’t get washed out by that.

    100% correct.

    82.Jeff Id said
    October 3, 2010 at 2:04 pm

    The paper doesn’t ‘know’ any answers about climate better. It knows the models response better.

    I think Carrick’s point is that knowing the models’ mean response “better” does not provide any additional (better) information about the physical processes themselves, i.e., it does not give you any indication of the validity of their response. The ensemble mean of multiple models is no more valid/precise than any one model (or any other combination.)

    Mark

  72. Carrick #80

    “I think what is happening with the way people have been combining the models is the variance introduced by the underlying uncertainty associated with the initial starting conditions and the forcings is getting artificially reduced.”

    At the risk of being repetitive and getting us nowhere, I will try to re-state what I said in my first comment in this thread. I think the point is not that the underlying uncertainty per se is reduced but rather the uncertainty about the model trend is reduced. The null hypothesis is that all the models are trying to produce the same trend due to growth in CO2. Furthermore, the common trend they are trying to capture is the real trend in temperature reflected in the data. Under that hypothesis, the deviations from trend that appear in any one realization from any one model arise from differences in starting conditions, ocean cycles modeled in some cases but not others etc are all stationary (non-trending) noise. These noise terms can be correlated across models and runs and can be autocorrelated. However, the null hypothesis (that also is required to make the model projections useful out of sample) is that they all have the same CO2-induced trend. By adding more runs from more models we can get a better fix on that common trend. Uncertainty about the trend coefficient can decline because all the models are trying to estimate the same trend but with differing noise/signal ratio. How much will our uncertainty about the model trend decline as we add more models/runs? That depends on the cross and serial correlation patterns in the noise. It is not a simple matter of making a correction based on N. MMH model the noise and use the model of the noise to correct the standard error of their estimate on the common trend term. Doing that, they found the common trend the models are producing differed significantly from the measured one.

  73. Mark T, if that is the case then Carrick is missing the entire point of the thread, which I doubt.

    Nobody is trying to say models are perfect, we’re trying to ascertain whether MMH has demonstrated models are over trend by a statistically significant amount and whether they can even do the calcualtion this way. James Anonn has claimed you cannot do it this way, Carrick has also made the claim. I’m still confused as to why not. In my experience, I’ve never found anything I couldn’t figure out. Sometimes, I’m a little slow, but if it can be explained, I’m listening.

    I think Carrick may be missing the subtle and non-simple point that the paper is not representing ‘all’ models uncertainty, but the uncertainty in the average. If that is all the difference is, I think we can come to agreement on this. The uncertainty in the average is a far simpler problem and doesn’t require so much discussion, unless of course I’m missing something.

    I still strongly beleive this is the single strongest skeptic paper I’ve seen published. And it still doesn’t refute AGW, it only refutes the extreme models. But it does a heck of a job of it.

  74. Jeff #86,
    “we’re trying to ascertain whether MMH has demonstrated models are over trend by a statistically significant amount and whether they can even do the calcualtion this way. James Anonn has claimed you cannot do it this way, Carrick has also made the claim.”

    I agree with Carrick and James Annan (and believe me, my agreeing with James on anything is not common). On the other hand, the rejection of several of the individual models, as was done by Chad and by Steve Mc (prior to MMH10) by comparing the individual models with up to date data is perfectly defensible. If I wanted to argue that the combined IPCC pool is invalid, I would just point to the invalidity of several of the individual models in the pool that over-predict warming, and so prejudice the ‘average’ to be too warm.

    Of course the suggestion that the models are all valid because the measured temperature trend lies within the entire spread of all the models (including their individual uncertainties) is just nuts.

  75. I’ll throw this out for comment and critique:

    For a single model, a given value for each of the variables x1 and x2 is associated with one and only one value of y…. via beta(x1) and theta(x2). If there is uncertainty WRT to coefficients beta and theta then we allow beta and theta to vary (within constrtaints). When you allow for change in the value of the coefficients, then we have an ensemble. In an ensemble, for each value of x, we have a set of values for beta and theta, and a set of values for y. If we want to know if our ensemble set of beta and theta coefficients is realistic, we check the ensemble distribution of y’s against observation. If it falls outide confidence intervals then the answer to the question “Are the coefficients consistent with reality?” is “Likely not”.

    Note: this test has nothing at all to do with combining multiple models into a single enhanced “ensemble” model of reality.

  76. Perhaps my question “Are the coefficients consistent with reality?” is better phrased thusly:
    “Are the constraints I place on the coefficients consistent with reality?”

  77. Mark T, if that is the case then Carrick is missing the entire point of the thread, which I doubt.</blockquote
    I don't think he is, though I do think you're missing what he's getting at.

    James Anonn has claimed you cannot do it this way, Carrick has also made the claim.

    They are correct as far as I can tell.

    I’m still confused as to why not. In my experience, I’ve never found anything I couldn’t figure out. Sometimes, I’m a little slow, but if it can be explained, I’m listening.

    Because models aren’t independent (identically distributed) random variables you can perform statistical tests on and then make claims about what those statistics represent. Either you have a model that actually describes the system, or you don’t. If you have the right model, then averaging in other, incorrect, models won’t improve your result (you can’t be “more right than right,”) and if you have multiple incorrect models you can’t make any assumptions that their average approaches the true mean, i.e., the true physics. If, then, your models can’t be averaged together in hopes of approaching the “true physics of the system,” then discussing the statistics of their average is pointless. Averaging them together will indeed “smooth” the plots because there is some seemingly random clutter (hardly a surprise given randomized inputs,) but that smoothing is NOT the same as averaging to i.i.d. random variables and getting the 1/sqrt(2) SNR improvement.

    I think Carrick may be missing the subtle and non-simple point that the paper is not representing ‘all’ models uncertainty, but the uncertainty in the average.

    Carrick’s point(s) seems (to me) to be that the uncertainty in the average model response is a meaningless quantity and that averaging model responses will tend to give you a result that has just as much uncertainty as any individual response even if it does not seem to be as noisy.

    Of course, maybe I’m interpreting Carrick incorrectly, in which case… move along. 😉

    Mark

  78. J Fergunsen:

    Why isn’t above observation equivalent to “the ensemble mean is unlikely to be useful?”

    I guess my view is the ensemble mean does knock down what I am calling “short-period junk,” and it’s still expensive to perform repetitions of the model runs. That’s the main argument for it anyway.

  79. Some people commenting here want to throw out some of the models that produce a more extreme warming trend, saying they are the source of the problem for the ensemble. This is a reasonable reaction to the MMH results. It is not an argument that MMH cannot do the test the way they have done it. A reasonable conclusion from the rejection of the null is that the models in fact have different trends and some of the models with too high a trend should be discarded. MMH did not say what rejection of the null should lead to — the null is a composite hypothesis and more than one claim embedded in it could be wrong. I don’t think you can conclude, however, that they rejected the null only because the test was faulty without proposing a different test of the same null and showing why the alternative test is superior to the two tests they implemented.

  80. 1) Throw out the models that are individually in conflict with the measured trend.

    2) Average the trends of the remaining individual models to find the model mean trend.

    3) Average the uncertainties limits for each remaining model (around its own trend) to find the mean model uncertainty limits.

    The model mean trend +/- the mean model uncertainty limits is about as good as you are going to do combining models into a pool. My guess is that the measured trend would lie outside the calculated model mean +/- the model mean uncertainty.

  81. I agree with Carrick and James Annan (and believe me, my agreeing with James on anything is not common).

    When I read what Carrick is saying here and what I read from Annan in the paper titled, “Understanding the CMIP3 multi-model ensemble”, I would say they do not agree with one another.

    Annan uses a paradigm (his words) called statistically indistinguishable from the truth ensemble (which does have a distribution) and contrasts it with what he rejects and that being the truth centered paradigm (a distribution centered on the truth from independent samples). The truth centered perspective would be required to support the statistics used by Santer, Douglass and I believe MMH (2010). Annan in the paper noted above attempts to show why using the ensemble mean is almost always better than using any individual model result in a comparison with the observed result.

    I need to look at his perspective of models in other papers that were cited in the paper above, but it would appear that there is not any hard proofs for his statistically indistinguishable paradigm. I want to assure myself that this is not another climate scientists inventing a statistical method to support his conclusions.

    And by the way, Annan sees the differences between models as being caused primarily by differences in model parameterizations.

  82. Jeff ID:

    I think Carrick may be missing the subtle and non-simple point that the paper is not representing ‘all’ models uncertainty, but the uncertainty in the average.

    Actually you are and have been missing my point: They are computing the uncertainty of the average incorrectly, for reasons I’ve already explained.

    In almost any field that I know, except climate, nobody combines the good, bad and ugly models. Absolutely the only reason for doing it in climate is the cost of generating a single computer run, and as I’ve alluded to, to knock down the variability of short-period “weather/climate noise”.

    Your analogy of red-noise averaging is a good example of that. What that example misses is the uncertainty in the sensitivity introduced by uncertainty in the forcings/initial conditions should not be treated as if it were an independent, stochastic source of error.

  83. Kenneth, I’m not sure whether Annan or I would agree or not. In fact, I think we probably don’t agree 100%, but I wouldn’t guarantee 100% that this is because I’m completely correct, and he’s not, either.

    There are lots of reasons why models look different, but James is too much of a gentleman (at least to colleagues) to be blunt about why.

    I do computer modeling as part of my research, and one of the fights I have is with some other groups taking sloppy short-cuts, all in the name of speeding up the code, regardless of what it implies for the utilizability of their work. I certainly would object to averaging my model outputs against these groups results.

    Annan in the paper noted above attempts to show why using the ensemble mean is almost always better than using any individual model result in a comparison with the observed result

    I think I understand why that is in this case, it’s because of the “red noise”. If you average a bunch of outputs from one model, you’ll get a much more smoothly varying function. Since “goodness of fit” is defined as a sum of squares of residuals, a smoother function is going to have a better goodness of fit automatically than a more noisy one (that generally trends through the same mean). There’s no magic here.

  84. Peter Hartley:

    I don’t think you can conclude, however, that they rejected the null only because the test was faulty without proposing a different test of the same null and showing why the alternative test is superior to the two tests they implemented.

    I’m not absolutely certain that it’s my responsibility to fix problems with others papers (what happens for example if there is no fix? Certainly if somebody publishes a proof of how to geometrically trisect an angle—known to be impossible—it’s not my responsibility to provide a “corrected proof” of how to do it.)

    Nonetheless, I am thinking of how I might do this….

  85. I think I understand why that is in this case, it’s because of the “red noise”. If you average a bunch of outputs from one model, you’ll get a much more smoothly varying function. Since “goodness of fit” is defined as a sum of squares of residuals, a smoother function is going to have a better goodness of fit automatically than a more noisy one (that generally trends through the same mean). There’s no magic here.

    That is not the reasons given by Annan in the paper I cited and linked to above. In his conclusions he says the folloowing:

    By using some simple geometrically-inspired methods, we have explained why the multimodel mean has such good performance, firstly in having a lower root mean square error than
    the average of the individual models, and secondly in terms of how likely it is to outperform the best of the models. While the former result is a trivial algebraic result, the latter depends strongly on the relative widths, and effective dimensions, of the sampling distributions.

    Our analysis here further supports the notion of a statistically indistinguishable ensemble, as the results obtained with observational data are consistent in all respects with those generated by leave-one-out validation…

    …The EOF analysis, and calculation of the effective dimension of subsets of models, also shows that the sample size is too small to fully characterise the distribution from which it is drawn. Thus we might expect a larger set of models (constructed with alternative plausible physical parameterisations and numerical methods) to introduce some additional patterns of climate which are significantly distinct from those already obtained.

    I do not think that Annan’s arguments for the statistically indistinguishable from the truth ensemble and against the truth centered ensemble paradigm have been touched on in this thread other than by way of what I have presented. Until I find out more about the various ways of looking at the model results I find what Annan and some here have said as rather arbitrary at this point – but I am willing to learn.

    Annan claims that his paradigm for dealing with model (versus observed) results is used in other fields but fails to give any references in the paper by Annan and Hargreaves titled, “Reliability of the CMIP3 ensemble”. Here is what he says in the introduction to that paper:

    We consider paradigms for interpretation and analysis of the CMIP3 ensemble of climate model simulations. The dominant paradigm in climate sci ence, of an ensemble sampled from a distribution centred on the truth, is contrasted with the paradigm of a statistically indistinguishable ensemble, which has been more commonly adopted in other fields. This latter interpretation (which gives rise to a natural probabilistic interpretation of ensemble out put) leads to new insights about the evaluation of ensemble performance. Using the well-known rank histogram method of analysis, we find that the CMIP3
    ensemble generally provides a rather good sample under the statistically in distinguishable paradigm, although it appears marginally over-dispersive and exhibits some modest biases.

    These results contrast strongly with the incompatibility of the ensemble with the truth-centred paradigm. Thus, our analysis provides for the first time a sound theoretical foundation, with empirical support, for the probabilistic use of multi-model ensembles in climate research.

    Is anyone here famiar with the statistically indistinguishable concept and its applications in other fields?

  86. Can I ask another dumb question:

    Who made temperature the end-all be-all model output. I mean, the temperature increase, per se, isn’t dangerous. The scary stuff is floods, droughts, storms, sea level rise – right? So, why aren’t we evaluating models based on their ability to predict those things; the things that matter.

    So, let’s say one of these models is spot on in terms of global average temperaure, but is constantly predicting drought where there was flood, or vice versa, or mega-mega-hurricanes when none happened. Wouldn’t that be more important?

    BTW, Averaging different models just can’t be OK – it just can’t

  87. Anthony Watts said
    October 2, 2010 at 1:25 am

    I’m reminded of “the irresistible force meets the immovable object”

    Depends on the order of the aleph.

  88. All well and good (I assume the question) but is it close enough? If 15 models (or whatever) are close wouldn’t 10,000 models be closer?

    If every high school in America produced a model we would know the truth.

    (and that my friends is scarcasm)

  89. Let me cut to the chase with a cut down problem.

    I have a model that shows F=.9ma and another F=1.1ma. Exactly how well do I know from averaging the models that F=ma? +/- 10%? Plus the error bands of the models? Is “No fricken clue” in the answer list?

  90. Your analogy of red-noise averaging is a good example of that. What that example misses is the uncertainty in the sensitivity introduced by uncertainty in the forcings/initial conditions should not be treated as if it were an independent, stochastic source of error.

    I like.

  91. And by the way, Annan sees the differences between models as being caused primarily by differences in model parameterizations.

    Well duh. Now how do you deal with this? Run the most likely value? Pick the values according to some Monte Carlo method? Pick a few and tune the rest? Let “scientists” decide the “truth” and then average the results of various truths to determine the real truth?

    Averaging data is one thing. Averaging models is quite something else.

  92. #106;
    Exactly. It seems incestuous in climate science to fiddle with plugs (parameters) until you get some likely-looking “scenarios” and then claim they have some validity. Parameters are chosen, by definition.

    On what basis are GCM parameters chosen?

  93. Brian and Simon,

    I’ve seen not one coherent objection to averaging models. In fact all the discussion here has done has convince me that it is perfectly acceptable. The CI carrick objects to is not determining the accuracy of the models WRT reality, it is determining the accuracy to which we ‘know’ the average models.

    If you were to ask the question, are the models running high on average? Uncertainties in the driving parameters do not affect the CI of the answer. The parameters are what they are. Do the ones used make the models run high on average is the question. I think that’s where Carrick made his error and until someone can coherently explain why my question in #82 is wrong, the rest of this is just hand waving.

    If I am understanding Carricks statements, he’s mixing up the confidence interval of each model (uncertainty in how well models reflect reality) with the confidence interval to which we can measure the response (how well we know the response of the models). This seems to be similar in some aspects to James Anonn’s misunderstandings as well. James felt that the failure of the CI to encompass the other models, proved that you don’t have the right CI. This is clearly not correct, as adding more model data would never improve your knowledge of the average.

    What people are not understanding is exactly what I and MMH are explaining. The CI represents our knowledge of the response trend only, not the accuracy to which the models are performing. How well do we know the average trend.

    And in MMH10 they showed we know the trend is 2 to 4 times observation to a 99 percent certainty, I believe they are correct.

  94. Maybe averaging the models is like averaging the amount of sand brought to a site by 500 2 1/2 ton trucks made be a dozen manufacturers? Nothing wrong there. And it would tell you about loading bias in the group as a whole.

  95. Jeff:

    I’ve seen not one coherent objection to averaging models

    The issues I’ve raised are standard ones to the science of error analysis. I’m pretty sure I know how to raise a coherent argument, and other people seem to be groking it, so “it’s just you” my man.

    As to my answering you, I think it’s time that you demonstrate you’ve read and understand what I’ve said.

    1) Is science a democracy? (Are all models equally good? )
    2) If individual models have uncertainties associated with forcings, when you combine them, is the uncertainty interval associated with uncertainties in foricng the same or does it get smaller?

  96. Jeff ID:

    If you were to ask the question, are the models running high on average? Uncertainties in the driving parameters do not affect the CI of the answer

    This is one of your errors. Of course they affect the CI in that answer.

  97. J Ferguson:

    Maybe averaging the models is like averaging the amount of sand brought to a site by 500 2 1/2 ton trucks made be a dozen manufacturers? Nothing wrong there. And it would tell you about loading bias in the group as a whole.

    With sand, we have a bit of an advantage, because when you order a specific type of sand, there’s a classification scheme associated with it. We also have the advantage of “knowing the truth”, there is no significant issue with measuring the bias of weight scales.

    But if you didn’t have control of which truck brought the sand to a particular site, and you wanted to make sure you didn’t run out, what you’d do is compute the standard deviation of the group (don’t deflate it using 1/sqrt(number manufacturers), then choose the mean – 2 * standard deviations as the amount per load. If the only variability was the bias (but otherwise the same amount of sand was shipped), this would give you a 95% chance that you would have enough sand to finish the project without having to reorder.

  98. Carrick,

    “As to my answering you, I think it’s time that you demonstrate you’ve read and understand what I’ve said.”

    If you cannot answer 83 then you cannot make this claim. It is you who are wrong.

  99. I’ve seen other people saying yes to you, but there are plenty of people on my side of this Carrick. The one’s saying yes are simply making the same mistake you did.

    I’ve re-explained my position several times, and patiently waited for you to explain why it’s ok to run one model over and over to determine the model response but not two. Before it’s because the data was not ‘independent’ yet the data from the same model is certainly not independent. It seems reasonable that if you change only one forcing, you could average two models and get a average response to a very tight CI. What you can determine from that is another issue.

    Pretending I don’t know enough stats to follow an argument is not something that works with me. Either you can explain it or you cannot. It makes perfect sense to me to average models together to ascertain the median slope. It also makes perfect sense that the CI of that slope can be determined from its own variance. If you read 83, it seems to me that this next step would help me understand your alleged position better but a non-answer, makes the rest of this look like hand waving.

  100. Here’s how you compute the CI for this problem. Let x be the trend, σ be it’s uncertainty. We assume for simplicity gaussian errors.

    Then:

    Δx = x(model) – x(measured)
    σ(Δx) = sqrt[σ(model)^2 + σ(measured)^2]

    The null hypothesis here is “The model trend is the same as the measured trend.” The probability that this is so is given by

    p = erfc(|Δx|/sqrt(2) * σ(Δx) )

    If p < 0.05, we reject the null hypothesis.

    Here’s it written out using an equation editor.

  101. Jeff ID:

    The one’s saying yes are simply making the same mistake you did.

    Or alternatively, they are following the arguments and you aren’t.

    Pretending I don’t know enough stats to follow an argument is not something that works with me.

    Saying you’re not following the argument is different than “pretending that you don’t know enough”. You don’t need to get testy over this.

    I notice you continue not to address any issues that I’ve raised.

    Seems to me in fairness, we both get chances to get answers to our questions. What are your answers to the questions in #110?

  102. The answer to #83 is “no” for reasons I’ve already given (“systematic errors” don’t get reduced by the ensemble average). This is a result from measurement theory.

  103. Thanks for the reply, I think my understanding of your position is now clear.

    Science is not a democracy of course. All models are not equally good. Uncertainties in forcing stay the same.

    Response to the forcings, however, has reduced variance when run again and again. We are asking whether the models have a systematic bias over trends, not whether the uncertainty of forcing creates a wide enough spread to encompass reality. If you were to look as James Annon did to the fact that the CI doesn’t encompass all models, you can never reduce the spread by retesting models. As would be correct if that is what you were testing for.

    If you applied your methods to paleo temps, the proxies all have uncertainty in their own forcings so the analogy seems identical. You would take the slope of each individual proxy and calculate a variance in slopes and end up with a floor to ceiling CI every time. I say that the correct knowledge of slope can be calculated from the averaged data itself and will be far tighter than the CI calculated your way.

  104. I’ll have to admit that I follow the arguments of James Annan in his papers on the subject matter better than those presented by the posters here. First of all it is rather clear to me that the resolution of how to statistically handle the comparison of climate model results with observed results is not and may never be reduced to a single hard and fast methodology. That is perhaps why I see explanations of people’s views on proper handling of model results and comparison with the observed as hand waving.
    I would strongly suggest that we would be better discussing the strengths and weaknesses of those methods for analyzing model results put forward in the published papers. From Annan’s papers that I have read to date, I think I can describe his view of the statistically indistinguishable paradigm (SIP) and what he says is the current dominant theory of the truth centered paradigm (TCP).

    The TCP would hold that the truth (a Annan term that I think means the observed or perhaps a single rendition of a chaotic climate of which could represent the observed) is centered within the model results and I would assume be more likely to be closer to the center than away from the center.

    The SIP would hold that the model results all have the same likelihood of occurring and that the observed is simply a model result somewhere within the range of the model results. This view is what would justify some of the early papers on the tropical surface to troposphere differences using the range of the models and concluding that since the observed falls within the range that the models and the observed are in agreement. In other words, if the observed result is within the range of model results the model results the model and observed results are considered consistent.

    The treatment of the climate model to observed results used by Douglass, Santer and MMH would, in my view, use the TCP theory but with the caveat that the truth is the truth as the models see it and thus one can look for a difference in the model means and the observed means.

    Annan admits that the SIP approach is simply a first approximation and not meant to be a hard and fast theory. He claims the model result distribution fits better to SIP than to TCP and particularly when a model result is held out as a proxy for the observed result. Annan’s analysis was very vague to my untrained eyes and I did not see any hard and fast hypotheses testing of his results in his papers. I find something, on the face of it, of a circular reasoning attached to his approach where assumptions are made about the validity of model results going into the analysis.

    Annan claims that SIP is used by weather forecasters- and I need to follow up on that claim or better asked again if anyone here has a good link to some applications. I would assume that when I see an uncertainty band of the future location of a tropical storm that the center line is not the most likely trajectory for the travel of the storm but rather that the band is an equal opportunity band – given that Annan’s SIP applies here. Annan uses the example of the model results for the climate sensitivities and states that the TCP would give 90% CIs of 2.7 to 3.4 degrees C while the SIP would assign equal probabilities to some 20 line intervals and arrive at a 90% CIs of 2.1 to 4.4 degrees C and with all those intervals in the range having the same probability.

  105. Jeff ID, do you agree with the error analysis in #115?

    We’ve talked about the reduction in variance associated with averaging, and I think we agree why that is happening.

    If you want, the modeling error is given by (approximately):

    σ[model] = sqrt(σ[forcings]^2 +σ[weather]^2/N)

    where N = number of averages.

    Again, in equation form.

    (If we can agree on this equation and with the results in #115, then we are very close to an agreement.)

  106. Carrick and Kenneth,

    This quote is from Christy’s latest paper.

    “In this study we use the results from 21 IPCC AR4 models all of which portrayed the surface trend at or above +0.08 °C decade−1 (minimizing the problem of instability due to small denominators in the SR.). Some of the 21 models were represented by multiple runs which we then averaged together to represent a single simulation for that particular model. With 21 model values of SR, we will have a fairly large sample from which to calculate such variations created by both the structural differences among the models as well as their individual realizations of interannual variability. From our sample of 21 models (1979–1999) we determine the SR median and 95% C.I. as 1.38 ± 0.38. We shall refer to this error range as the ―spread‖ of the SRs as it encompasses essentially 95% of the results. We may then calculate the standard error of the mean and determine that the 95% C.I. for the central value of the 21 models sampled here as 1.38 ± 0.08, and refer to this as the error range which defines our ability to calculate the ―best estimate‖ of the central value of the models’ SR. Thus, the first error range or ―spread‖ is akin to the range of model SRs, and the second error range describes our knowledge of the ―best estimate‖ representing the confidence in determining the central value of a theoretical complete population of the model SRs.”

    and later:

    “With the exception of one SR case (RSS TLT) out of 18, none of the directly-measured observational datasets is consistent with the ―best estimate‖ of the IPCC AR4 [12] model-mean. Based on our assumptions of observational values, we conclude the AR4 model-mean or ―best estimate‖ of the SR (1.38 ± 0.08) is significantly different from the SRs determined by observations as described above. Note that the SRs from the thermal wind calculations are significantly larger than model values in all cases, which provides further evidence that TWE trends contain large errors.”

  107. Jeff ID:

    We are asking whether the models have a systematic bias over trends, not whether the uncertainty of forcing creates a wide enough spread to encompass reality.

    Agreed, but part of testing requires including the uncertainty in the model temperature trend. See #115 and #120.

  108. Carrick 115 How do you define sigma for the models (where you have N models, or N model runs) and for the observations (where you have only one series)?

    And Jeff how do you think sigma or the CI should be computed? Do you disagree with 115?
    The objection is not so much to averaging the models, but in assigning a CI.

    IMHO (a) a lot of this argument could have been avoided if MMH10 had explained what their error bars were. This was not Steve Mc’s finest hour.
    (b) The questions here are really pretty simple and it is frustrating that there has been so much confusion about it.

  109. Thanks for the link, Jeff. I will endeavor to look at it this evening.

    I think what I’ve written out are the standard error analysis procedures. Are there any steps in #115 and #120 that you disagree with?

    If we can agree on the procedure for testing the null hypothesis, then we can apply that agreement to the various papers testing this null hypothesis.

  110. #125,

    As I understand it, you are calculating the uncertainty in the assumptions of the models or total model spread, so it looks right for that. I am saying that the uncertainty in understanding the median trend is all we need when demonstrating a systematic bias.

    I think Christy does a better job explaining it in the quotes of 121.

  111. PaulM:

    Carrick 115 How do you define sigma for the models (where you have N models, or N model runs) and for the observations (where you have only one series)?

    The correct way to do it is perform a number of runs with the same forcings and slightly different initial conditions. This will give you sigma[weather].

    Now run the same initial condition and vary the forcings over the stated range of uncertainty, use this to estimate sigma[forcing].

    Combine them using the result from #120.
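
    A schematic of that two-step recipe in Python, with synthetic trend numbers standing in for actual model runs (every value here is invented):

        import numpy as np

        rng = np.random.default_rng(1)

        # Step 1: same forcings, perturbed initial conditions -> spread estimates sigma[weather]
        trends_ic = 0.20 + rng.normal(0.0, 0.08, size=30)      # synthetic trends, deg C/decade
        sigma_weather = trends_ic.std(ddof=1)

        # Step 2: same initial condition, forcings varied over their stated range -> sigma[forcings]
        trends_forcing = 0.20 + rng.normal(0.0, 0.05, size=30)
        sigma_forcings = trends_forcing.std(ddof=1)

        print(sigma_weather, sigma_forcings)   # then combine in quadrature as in #120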

    The global mean temperature is a combination of thousands of different time-temperature series, and it needs to include systematic effects such as those introduced by too-sparse measurements or other effects like urban heating. For example, here’s NCDC’s product.

    I generally agree with the approach taken by NCDC in computing their errors. I know this is contentious with some of you guys, but I think the urban heating/land usage changes represent a potential 10% systematic error in the trend (and I also think it is not completely clear which sign this error has in the adjusted data).

  112. Addendum to #127, this is an oversimplification. There is obviously an interaction between weather and forcings, so you’d have to perform a more substantial analysis of variance to correctly separate these effects.

  113. Thanks Carrick but the point is that you are using a very different method of computing sigma for the models (which is fine) compared with that for the obs (which is dubious). But this is really a sideshow from the main issue.

  114. I keep reading this, and not fully getting what you are saying:

    As I understand it, you are calculating the uncertainty in the assumptions of the models or total model spread, so it looks right for that. I am saying that the uncertainty in understanding the median trend is all we need when demonstrating a systematic bias.

    In terms of the math, how would you modify and/or interpret my equations from #115 and #120?

    Words can mean a lot of things, equations are much less ambiguous.

  115. The paper Jeff links to is good for focusing the discussion. The models compute something, SR; it doesn’t really matter what it is. The model mean is 1.38 and the SD is 0.38, so the spread is 1.38 +- 0.38.
    Now the estimate of the mean of the models is 1.38 +- 0.08, the difference between 0.38 and 0.08 being a factor of sqrt(21), the number of models.

    Now the multi-million dollar question is this.
    We want to test if the models are consistent with the observations.
    To test this, do we ask whether the observations lie in
    (a) 1.38 +- 0.38, or
    (b) 1.38 +- 0.08?

    Please give your answers to this question, (a) or (b).

  116. #131 Paul

    If you want to ask whether the models are consistent with observation, then (a), because the spread of the models is required to answer that question.

    If you ask whether the model mean is consistent with observation, then (b), because we only need to know the confidence of the mean. And to answer Carrick, a simple trend CI with AR correction is how I would do it. From the model mean, we can determine there is a systematic bias while still having some models matching observation.

    What you can conclude from b is another story.

  117. It seems like if you believe in the statistically indistinguishable paradigm, then it is not enough to look at the mean and determine that there is a systematic bias. I’m guessing that the mean has differing importance depending on whether the distribution is, for example, uniform or Gaussian.

  118. Jeff ID:

    If you ask whether the model mean is consistent with observation, then (b), because we only need to know the confidence of the mean. And to answer Carrick, a simple trend CI with AR correction is how I would do it

    This is the fundamental disagreement then.

    In my opinion, you can’t neglect modeling error when asking the question “is the sensitivity of the models consistent with the measured data”?

  119. PaulM, I think we aren’t quite in agreement, because in my opinion the model mean is being computed wrong here.

    I think #120 is the right way to do it.

  120. OTOH, if you used #120 to compute (b), I would agree with that.

    The point I’m making is that σ[forcings] is getting artificially reduced by computing a naive average over models.

  121. Let me rephrase that “σ[forcings] is getting artificially reduced by computing a naive error of the mean of all of the models.”

  122. Jeff ID:

    Carrick, I see 120 as the same as A, is that your interpretation?

    Not quite the same.

    What I’m saying is separate the uncertainty in the temperature trend associated with the forcings from the uncertainty associated with weather. The uncertainty from weather gets reduced by 1/sqrt(N). The uncertainty in the forcings does not.

    #120 is going to give you a different answer than a, because one source of modeling error has been reduced/eliminated.

    Really there are three choices:

    a) use standard deviation of distribution of models
    b) use standard error assuming errors sigma[ensemble] = sigma[model]/sqrt(N)
    c) correct the error of the mean for contributions that are correlated across runs (Eq. #120).
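
    For what it’s worth, here are the three choices computed side by side on a hypothetical set of 21 model trends; the forcing/weather split used for (c) is assumed rather than estimated:

        import numpy as np

        rng = np.random.default_rng(2)
        trends = rng.normal(0.25, 0.10, size=21)        # hypothetical per-model trends, deg C/decade
        N = len(trends)

        sigma_a = trends.std(ddof=1)                    # (a) spread of the model distribution
        sigma_b = sigma_a / np.sqrt(N)                  # (b) naive standard error of the ensemble mean
        sigma_f, sigma_w = 0.07, 0.07                   # assumed split of the total spread
        sigma_c = np.sqrt(sigma_f**2 + sigma_w**2 / N)  # (c) the correlated (forcing) part is not reduced
        print(sigma_a, sigma_b, sigma_c)                # (c) lands between (a) and (b)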

  123. Annan posts on the reliability of the observations in the ensemble based on the rank histogram of observations here and determines that models are at worst probably running ~0.05C cool.

  124. Carrick @ Post #15:

    I am not sure what you are attempting to do here by combining standard deviations, but look at the link below under the section Combining standard deviations.

    http://en.wikipedia.org/wiki/Standard_deviation#Combining_standard_deviations

    It would appear to me that you are using a formula for determining the standard deviation of the difference of means of two populations, where n1 and n2 both equal 1. In that case there is no standard deviation.

  125. Kenneth:

    It would appear to me that you are using a formula for determining the standard deviation of the difference of means of two populations, where n1 and n2 both equal 1. In that case there is no standard deviation.

    No, you missed some details. #120 refers in general to how to compute the uncertainty for a single model using multiple runs, or to multiple runs from similar models, where each run is added together in an unweighted fashion. That is, “pool” all of the runs from comparable models; that becomes an “ensemble of model runs”.

    When we compute the mean response of the ensemble of model runs, we must separate the uncertainty in the temperature trend associated with forcing errors (which is common to all runs) from “weather noise” (which is unique to each run).

    You can “beat down” the weather-noise contribution to the uncertainty in the temperature trend by averaging over multiple runs. The uncertainty from the forcings isn’t going to be improved by combining the models together.

  126. I believe that Carrick is comparing an individual model versus the observed, while Jeff ID, you want to use the mean and SE of several models (the ensemble) for a comparison with the observed. While you might agree on the single-model comparison, you would continue to disagree on using the inputs of several models – I think.

  127. RB, thanks for the link. I believe Annan is not correctly combining the models together either. Jeff is right that adding more runs will reduce the uncertainty in the trend associated with weather noise (what he’s calling “red noise.”)

    However, I think James is right that MMH have too narrow of a confidence interval.

  128. Kenneth, we could combine models as long as they make equivalent assumptions.

    It’s tougher to come up with a valid way of combining model outputs from a model that is known to be inferior with another more accurate model, though. This is a case of “science is not a democracy.”

  129. “The uncertainty from the forcings isn’t going to be improved by combining the models together.”

    That is completely true.

    What we can do though is take the uncertainties as chosen, calculate the mean of the models, and determine that they are biased high using a CI based on the model mean. It tells us little about the ensemble as a whole other than the bias of the group. This is a different calculation from the one Carrick points out, in which, had different forcings been chosen, the ‘mean’ could fall outside of the CI.

    That is the difference between what we are saying. IOW, I’m saying that with the forcings chosen, the models are running high. Carrick is saying that if you had chosen different forcings, you would need an expanded CI.

    What they could have done and what they did are two different things. The MMH calculation is based on what they did, Carrick is caught up in what they could have done.

  130. Jeff ID, I have not finished the Christy paper that you linked, but I have judged previously that using a scaling ratio (SR) of the surface to atmospheric trend makes it easier to detect significant differences between models and observed, and it can zero out some of the residue of the chaotic nature of climate renditions.

  131. It’s tougher to come up with a valid way of combining model outputs from a model that is known to be inferior with another more accurate model, though. This is a case of “science is not a democracy.”

    I would only combine models that I have no reason to doubt a prior as I have indicated before. If you have a good selection method that can be used a prior for the 23 models under discussion here I would like to hear about it.

  132. 147 Carrick – sorry if this is dim, but why try to combine the output of an inferior model with one that is known to be more accurate?

  133. RB, that link gives a good summary of the Annan view of SIP and TCP and his ranked data that is supposed to show that the evidence supports his SIP view. Note he admits that the ranked test does not have statistically significant uniformity (it rejects the null hypothesis that the ranks are uniform) but adds that the differences are small. I am not sure what that means or that I understand the statistic he used in the test.

  134. #150, Kenneth;
    Yes, there is a lot of dancing around the issue of qualification of the models to be sampled/used in the first place. Some a priori (note sp.), meaning rational and logically traceable, link to underlying physics would be nice, no? 😉

  135. Carrick,

    I think our whole disagreement revolves around the intent of the test. I’ve thought of yet another way to word it.

    The uncertainty in what the models could do with different input assumptions and the uncertainty in what they actually did are two different things.

    The MMH test is only for what the models actually did, and it shows that they are biased warm by 2 to 4 times to a 99% certainty. Now with different input assumptions representing what the models ‘could’ do, you get a wider CI and the other methods you advocate become applicable.

    These models’ forcings were chosen and are therefore known to a perfect certainty. Discussion of other forcings or uncertainty in forcings is outside of the scope of the test. This test is more limited than you give it credit for, but that doesn’t make it without merit. It still indicates the typically chosen forcings are running very high over the observed trend.

    You are correct that different forcings will push it outside of the CI I advocate, but that is a different test. These forcings have already been chosen and are perfectly known. James Annan has missed the point pretty badly also.

  136. “These models’ forcings were chosen and are therefore known to a perfect certainty.”

    MMH tests these assumptions, not the full range of possible assumptions.

  137. Jeff ID, I think this is the problem: “These models’ forcings were chosen and are therefore known to a perfect certainty.”

    Ick. I thought they knew better.

    Sounds like we’re on the same page now, at least.

  138. Curious:

    147 Carrick – sorry if this is dim, but why try to combine the output of an inferior model with one that is known to be more accurate?

    You normally wouldn’t.

    The reasons it gets done here, as far as I can tell, are a combination of politics (can’t afford to exclude the Italians’ very bad model for example–this is a hypothetical example only) and the high cost of a single run of a climate model.

    Adding runs together does help reduce the variance associated with climate weather, as we’ve discussed, and the more you add, the smaller that gets.

    I think a much better approach would be to tier the models in terms of resolution and level of physical approximation. If you can’t show that the improvements are helping, there’s nothing wrong with the simpler model.

  139. Kenneth Fritsch:

    If you have a good selection method that can be used a prior for the 23 models under discussion here I would like to hear about it

    If it were me, I would only pick models that had a lot of runs, and test those models separately. If you could identify a group of them as “equivalent”, I’d be happy to combine them. Equivalent means using the same (or nearly the same) assumptions, the same spatial discretization, and so forth.

    I don’t know what is actually available in terms of variations in forcings by the way. It is likely possible to perform a sensitivity analysis of varying the forcings without having to restart the model a bunch of times. This is in my list of “to read about and understand”.

  140. I’m in agreement with the test as they’ve stated it.

    I’m confused why they would want to construct the test this way.

    It is an important point.

  141. I think this response from Ross M. over at Climate Audit a while back helps to clarify things a bit. I think the discussion of Paul’s choice (a) versus choice (b), as to which test is preferable, is appropriate, but Ross’s point about the ambiguity of the null is what is needed to round out the perspective:

    Ross McKitrick Posted Aug 13, 2010 at 1:47 AM

    One of the benefits of panel regressions is that it forces you to spell your null hypothesis out clearly. In this case the null is: the models and the observations have the same trend over 1979-2009. People seem to be gasping at the audacity of assuming such a thing, but you have to in order to test model-obs equivalence.

    Under that assumption, using the Prais-Winsten panel method (which is very common and is coded into most major stats packages) the variances and covariances turn out to be as shown in our results, and the parameters for testing trend equivalence are as shown, and the associated t and F statistics turn out to be large relative to a distribution under the null. That is the basis of the panel inferences and conclusions in MMH.

    It appears to me that what our critics want to do is build into the null hypothesis some notion of model heterogeneity, which presupposes a lack of equivalence among models and, by implication, observations. But if the estimation is done based on that assumption, then the resulting estimates cannot be used to test the equivalence hypothesis. In other words, you can’t argue that models agree with the observed data, using a test estimated on the assumption that they do not. As best I understand it, that is what our critics are trying to do. If you propose a test based on a null hypothesis that models do not agree among themselves, and it yields low t and F scores, this does not mean the hypothesis of consistency between models and observations is not rejected. It is a contradictory test: if the null is not rejected, it cannot imply that the models agree with the observations, since model heterogeneity was part of the null when estimating the coefficients used to construct the test.

    In order to test whether modeled and observed trends agree, test statistics have to be constructed based on an estimation under the null of trend equivalence. Simple as that. Panel regressions and multivariate trend estimation methods are the current best methods for doing the job.

    Now if the modelers want to argue that “of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t; or words to that effect, then let’s get that on the table ASAP because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate.

  142. 157 – thanks. Point noted about the “political” aspects. Sorry if my comment came in as a bit of a drive by. I’ve enjoyed the discussion and refrained from commenting as I haven’t read the paper and it seemed as if the discussion was on the specific methodology employed therein.

    FWIW I view the idea of an “ensemble mean” as a very poor substitute for individual model audit and testing relative to observation. I looked at the link RB supplied re: ranked histograms and my view is that it seems a self-referential test. My understanding is that the bin range is determined by the upper and lower results from the ensemble runs and then subdivided into bins according to the number of members in the ensemble. The distribution is then “analysed” to see if the models have a “bias”. The relation to the actual measured variable of interest (let’s say “temp.”) does not appear to be relevant; for example, if the histogram shows a bulk of models reporting in the lower end of the range, (I think) the view would be that the models have a “cool bias”, and this view would be unaffected even if the actual measured variable were lower. Fully acknowledge I could have misunderstood.

  143. No, you missed some details. #120 refers in general to how to compute the uncertainty for a single model using multiple runs, or to multiple runs from similar models, where each run is added together in an unweighted fashion. That is, “pool” all of the runs from comparable models; that becomes an “ensemble of model runs”.

    I would like to see an example of where you actually applied this approach – even if for the sake of security and proprietary concerns you have to use bogus numbers.

  144. #161;
    Oh, that’s very good. Yes, trying to have your “accurate representations” while fudging them, too. That about encapsulates the whole farrago!

  145. Layman Lurker @ Post#161:

    I truly think you will see more of the James Annan SIP paradigm used in future arguments against what MMH have shown using, in effect, a model truth centered approach. I also truly think that the SIP is a circular argument about the validity of the model runs.

    Santer et al have previously indicated that they favor the Annan approach, but just to demonstrate that Douglass et al were incorrect using (according to them) an incorrect approach, they used the truth-centered approach to supposedly refute what Douglass found. The most frequently used model-to-observation approach before the Santer paper used the range of the model results in comparison with the observed to determine consistency. A simple way to refute the SIP approach is to show that some models get the observed wrong. Unfortunately, in the Annan world that will not work, because the observed result to them is simply the luck of the draw from the range that models yield. Therefore, to invalidate the models, or at least some of them, the observed result has to fall outside all the model runs used.

  146. Kenneth, I had never really gotten into that whole Santer-Douglass thing. However now, after this discussion and your comment, I might do a little more reading.

  147. Curious #162, as I understand it, I believe the approach is as follows – models make predictions of a given variable such as temperature, and the histogram is constructed by determining in which bin (corresponding to the interval between two model predictions) a given observation falls. The idea behind proving uniformity of models appears to be that one of the conditions is that the histogram should be flat as you repeat the process over several observations, thus demonstrating equal probability for each of the models’ outcomes. Therefore, it is not a histogram of model outcomes alone (excluding observations), as you seem to be stating.

  148. In other words, for each temperature observation on record, there is a spread of model values. The goal is to prove that (a) model spread covers the uncertainty in temperature predictions (b) models are equi-probable

    And Annan tries to show (a) by showing that the rank histogram is not U-shaped (i.e., no excess of observations towards the edges of the histogram), (b) that the models are somewhat equi-probable because the histogram is roughly flat, and (c) that the models are actually conservative, because the histogram is dome-shaped, reflecting that model spread is more than the actual spread in measured values.

  149. RB – not sure as I’ve only looked at the link you supplied and its follow on article which was introduced thus:

    “A standard test of reliability is that the rank histogram of the observations in the ensemble is uniform.”

    http://satreponline.org/vesa/verif/www/english/msg/ver_prob_forec/uos4b/uos4b_ko1.htm

    (as an aside the article starts “The rank histogram is not a verification method per se, but rather a diagnostic tool to evaluate the spread of an ensemble.” so I’m not sure if this really is a “standard test of reliability”.)

    If you look at the examples at the bottom of the page actual measured variables are not present – the categorisation applies only to the “distribution defined distribution”. Seems circular and self referential to me and to my mind a domed distribution indicates a narrower spread than the “rank histogram test” ideal of a flat distribution – but I haven’t read the paper, so I could be way off.

  150. Curious,
    From the link you provide:
    Rank histograms are prepared by determining which of the ranked bins the observation falls into for each case, and plotting a histogram of the total occurrences in each bin, for the full verification sample.

  151. BTW, for the purposes of the article, even the bottom example histograms would have to be constructed by comparing observations with model predictions.

  152. RB – thanks but I’m still not sure what role the observation plays in the preparation of the rank histogram. From the article I linked:

    “As you do the exercise, it is useful to keep in mind the assumption underlying the rank histogram, that the probability that the observation will fall in each bin is equal.”

    so they are saying the observation can be anywhere within the range of outcomes from the models. Hence I think my point stands that even if the actual observation is below a “cold bias” dome, the rank histogram analysis would be that the models have a cool bias. In their examples, IMO, the bars in the histograms are only showing model outcomes – it is only in the first pdf figure that the observation appears at a scale equivalent to a single run result.

  153. Curious, at each instance, models generate a range of temperature predictions and you would order them from lowest to highest and each of the bins represents the interval between neighboring predictions. Then you would mark the bin where the observation for that particular instance fell. Then you would move on to the next time instant and repeat the process (each model is used to construct the boundary of a potentially different bin in each time instant). You cannot generate a histogram without the observations.

  154. Well, there have been a lot of comments, a lot of detail, a lot of information exchanged back and forth…. and no snark.

    Has to be a skeptic blog.

  155. BTW, the X-axis is “bins”, not temperature. Y-axis is also “percentage frequency of occurrence”, not temperature. You cannot look to see whether observation is below a “cold bias” dome. You can use it to test your assumption of uniformity of model outcomes by looking at the shape of the histogram – whether it is flat, U-shaped, or dome-shaped – and judge whether model outcomes are equiprobable.

  156. I should add – however since bins are ordered according to increasing temperature, you can make a statement about whether there is a “cold bias” or a “warm bias” for the period considered.

  157. 174, 176 RB – Thanks. I realised X is bins and Y is relative frequency. I’ll have a look again later but at the moment I’m stuck on the description:

    “For each specific forecast, the bins are determined by ranking the ensemble member forecasts from lowest to highest. The interval between each pair of ranked values forms a bin. If there are N ensemble members, then there will be N+1 bins. The outer bins, lowest and highest – valued, are open-ended.”

    My understanding of the implications of this ties in with the initial comments that Jeff exchanged with James to introduce the thread. It also seems to be supported by comments Kenneth Fritsch made just above. However I fully realise science is not democracy (!) so I will revisit later/tomorrow! It would help me if you could clarify your use of “instance”, how exactly your marking of the bins that the observations fall into affects the construction of the rank histogram, and whether a domed (ie peak upward) distribution indicates a broad or narrow range of outcomes. Thanks.

  158. 177 RB – I should clarify my 173: Where I say “below a cold bias dome” I mean “to the left of” (assuming higher temperature bins are to the right).

  159. Curious, let’s say that we have yearly measurements and 20 model predictions of temperature and “instance” refers to each year. Let’s say measurements and model predictions run from year 1955 to 1980. In 1955, let’s say model predictions of temperature ran from 45C to 50C. Then, bin 1 will represent temperatures less than 45C and bin 21 will represent temperatures greater than 50C. Then, you mark off the bin in which the yearly measurement of temperature for that year fell. Let’s say that for 1956, models predicted temperatures from 55C to 60C. In this case, bin 1 will represent temperatures less than 55C and bin 21 will represent temperatures greater than 60C. As you repeat the process, you can construct a histogram that tells you (a) whether each of the bins is equi-probable, as represented by a flat histogram, and (b) whether they have a bias, i.e., if for example most of the observations are clustered within bins 15-20, then that means that the remaining 75% of the models are predicting cooler temperatures than that observed, indicating a “cool bias” in the models.
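
    RB’s bookkeeping, sketched in Python with synthetic numbers (20 model values per year, 26 years, 21 bins); the bin index for each year’s observation comes straight from the sorted model values:

        import numpy as np

        rng = np.random.default_rng(3)
        n_models = 20
        years = np.arange(1955, 1981)                # 26 "instances"
        counts = np.zeros(n_models + 1, dtype=int)   # N models -> N+1 bins

        for yr in years:
            models = np.sort(rng.normal(15.0, 0.3, size=n_models))  # synthetic predictions for this year
            obs = rng.normal(15.0, 0.3)                             # synthetic observation for this year
            counts[np.searchsorted(models, obs)] += 1   # 0 = below all models, 20 = above all models

        print(counts)   # roughly flat (noisy with only 26 years) if obs and models share a distribution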

  160. I suppose that also means that for 23 models, the probability that observations fall outside the model predictions is 2/24 – therefore a single observation outside model ensemble prediction would still not invalidate the models.

  161. If one of the models was absolutely spot-on perfect, but the average was statistically different from observation, should we continue funding research to improve the ensemble average?

  162. Carrick @ Post #115:

    I am really confused by the procedure/equations that you have listed in this post. If you have a standard deviation you must also have n1 and n2 for the samples used for estimating sd1 and sd2. It also implies that x1 and x2 have to be means of those samples. Under those conditions you must account for n1 and n2 in your equations. What am I missing here?

  163. I suppose that also means that for 23 models, the probability that observations fall outside the model predictions is 2/24 – therefore a single observation outside model ensemble prediction would still not invalidate the models.

    RB, I think you are right that 1 observation outside the range would not invalidate the “consistency” of model and observed results. Which, of course, brings to mind: what would invalidate the consistency – 3 observations outside the range of the models?

    Of course, Annan’s simplistic assumption that all model results are statistically indistinguishable would allow models one incremental step outside the former range to be added (they pass muster with the rank histogram test), and now we have a new and wider range to compare to. None of this sounds right to me.

  164. Jeff ID, I finished reading the Christy paper to which you linked. It comprehensively goes over the strengths and weaknesses as the authors see them of the observed temperature data sets that it uses in the analysis and then makes the comparison using the scaling ratio SR for observed versus model results.

    Just for my own curiosity I want to use the eyeballed model results (for SR) and determine how well the results fit a normal distribution or perhaps some other distribution.

  165. Kenneth,

    Carrick asked what use there is in testing the model mean as written. It is very important to check if the assumed forcings are outside of the normal range. The fact that some are inside a statistically reasonable range means that the models will always pass Annan’s tests. Adopting this thinking allows the IPCC to go right on assuming most cases to have high trends with a few on the low end. Currently the median of the models is way, way out of whack from observation, which sounds to me like someone has their thumb on the scale. Not surprising considering some of the crazy things coming out of the IPCC. It is actually more important to test the models as presented than the possible range of models based on input uncertainty.

    Testing for input uncertainty is also important from my way of thinking, so people can see the floor to ceiling CI’s. It looks like the CI at the ceiling needs to be reduced considerably. I’m still trying to think of how to get James Annan to look at it from this perspective.

  166. Just a quick comment during lunch.

    JeffID, the point is the IPCC. The argument is that, with only one realization of the climate in reality, expecting that one can use a model ensemble without the possibility of expressing the full variance of the real phenomena is an incorrect test.

    Turn it on its head. If we accept that, then how could the IPCC conclude anything other than that the real phenomenon has not been measured long enough and that no definite statements can be made?

    But they did make specific claims based on definite statements. In particular, in the chapter on attributing climate change, they specifically make a definite claim that models could not reproduce the latter 20th century warming without CO2. They then used the Bayesian a priori to conclude with confidence that CO2 was responsible for most of the warming of the latter part of the 20th century.

    They can’t have it both ways. If Tebaldi and Knutti are correct that it could take up to 130 years to know if a 100-year prediction by a model was correct, then the IPCC has given us nothing but reasonable speculation. In which case, their attribution chapter is overstated both in confidence and in the amount that can be attributed to CO2.

    I think if Annan’s test were used, and the question were then asked how useful a model or ensemble that passes such a test is, the answer would be: not very. In fact, the answer should be that it does not support the a priori of the attribution chapter.

  167. Here is what I obtained for a Shapiro-Wilk normality test for the 21 model results used in the Christy paper linked by Jeff ID. I do not know the power of the Shapiro test, but the p = 0.87 is a long way from the p <= 0.05 normally required to reject the hypothesis that the distribution is normal.

    Shapiro-Wilk normality test

    W = 0.9764, p-value = 0.8667

    I would like to do a chi-square test to test the probability that the distribution comes from a statistically indistinguishable distribution, but the subtlety of that concept immediately becomes apparent. What do you use for the expected model (and observed, for that matter) result? The observed result according to Annan's approach can come from the model result range, so that while we have an estimated observed result, that result could have been different. I'll need to understand better how Annan handles this (if indeed he does) with the ranked histogram.
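
    For anyone who wants to run the same kind of normality check, a sketch with scipy; the 21 SR values below are placeholders, not the eyeballed values from the Christy paper:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(4)
        sr = rng.normal(1.38, 0.19, size=21)   # placeholder SR values for 21 models

        W, p = stats.shapiro(sr)
        print(W, p)   # a p-value above 0.05 fails to reject the null hypothesis of normality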

  168. Kenneth:

    I am really confused by the procedure/equations that you have listed in this post. If you have a standard deviation you must also have n1 and n2 for the samples used for estimating sd1 and sd2. It also implies that x1 and x2 have to be means of those samples. Under those conditions you must account for n1 and n2 in your equations. What am I missing here?

    I’m not sure. It’s like we’re not speaking the same language or something.

    First, “standard deviation” is different than “uncertainty”.

    It’s one type of uncertainty, one that is generated from a series of repetitions of data, and its connection to a CI is based on a number of assumptions about the underlying distribution. Bias error associated with an erroneous choice of forcing is a whole different can of worms.

    I used σ to denote uncertainties (not just standard deviations). Perhaps that’s what is leading to part of the confusion.

    When you combine different types of errors, the best way to do it is via a sensitivity analysis; more generally, if that is not available, a sum in quadrature is the standard (approximate) fall-back.

    The error of the mean associated with weather noise will go as the standard deviation divided by the square root of the number of independent replications of a given climate (but with differing weather noise). That’s the origin of σ[weather]^2/N contribution to σ[model]:

    σ[model] = sqrt(σ[forcings]^2 +σ[weather]^2/N)

    You seem to be thinking about this from the perspective of how one rigorously combines two standard deviations coming from different population samples.

    That’s a whole ‘nother can of worms.

  169. At some point I will write a longer paper on the challenge of framing hypothesis tests across multiple models and observations, based on the discussion surrounding MMH. Meanwhile there are a couple of points I want to mention regarding the above thread.

    1. We tested a historical interval, when models were forced with identical observed inputs (GHG, solar, ozone etc) and were trying to simulate an observed historical period. In this context, it is surely reasonable to assume, for the sake of testing, that the models encompass a unique common trend. Had we been looking at 30-year forecasts, you could argue that the modeling community presents something more like a menu of choices: if forcings and sensitivity are X1 then the trend will be Y1, if X2 then Y2, etc. But in this case we know X1, and we know Y1. We have observations on both the inputs and the target outputs. All we need to add is the assumption that the models aimed to represent the same climate system, and the framework of assumptions leads inevitably to our null hypothesis.

    2. Given the null hypothesis that the models attempt to represent one particular climate system, namely Earth’s, each model (and each run) yields, under the null, an observation on a single trend coefficient T. But suppose we want to construct a statistical test based on the view that some models aim to simulate an observed trend T, and other models aim to simulate an unobserved trend S. In this context our null would not apply and our test results would be irrelevant. Our critics would be correct. But isn’t this an odd thing to say about models? Does anyone go around saying that their GCM is constructed in such a way that when it is working correctly, it yields simulations of a planet we don’t live on? If anyone did, why would we care about their model, at least insofar as the study of Earth’s climate goes? We only care what climate models say to the extent that they are representations of _our_ climate system, in which case the MMH testing structure is valid.

    3. That said, there may be various ways of framing the null hypothesis, depending on what you want to find out. The most obvious ones to me are

    H:(The average model has the same trend as the average observational system), or

    H:(each model has the same trend as the average observational system), or

    H:(Each model has the same trend as each observational system.)

    These can be represented in the VF framework (see http://climateaudit.org/2010/08/13/ross-on-panel-regressions/#comment-240698.) But I can’t see how to frame the test of

    H:(model 1 matches the observations _OR_ model 2 matches the observations _OR_ model 3 matches the observations _OR_ … )

    I think some people would like to frame the test that way, but it’s a broken clock fallacy. It proves too little. You can always add more model runs until you get a match. Then what have you shown? Not that models as a group are consistent with observations, only that they are not all wrong, all the time. But that’s pretty meaningless because the one match may be a fluke. A broken clock is right twice a day, but it’s still broken.

    We had a lot of referees for our paper over the 4 review rounds, one of whom signed his review (B. Santer). We were asked to deal with a lot of issues, such as testing all the model runs rather than only a few, using ensemble means rather than individual runs, using the same time period for models and observations, dealing with higher-order AR and cross-panel correlations, adding dummies to account for the fact that some models didn’t include ozone depletion effects, etc. But across the range of reviewers, the usefulness of the basic null hypothesis was presupposed. That is not to say that there aren’t other model-testing approaches out there, but sometimes if you think through the alternative it may start to sound kind of trivial.

  170. I see on further research that doing a Talagrand diagram or Analysis Rank Histogram would require using all the gridded data used for the 20S to 20N Scaling Ratio (SR) in the Christy paper. I suppose I would have to calculate SR for each grid cell over the 31-year period of the study. Annan plays with gridded temperatures and not gridded SRs.

  171. I was disappointed that MMH do not use a scaling ratio or difference as a statistic in their paper.

  172. Carrick @ Post #190:

    I am a fairly fast learner so make this simple and show me a link to where what you show in Post #115 is used in a calculation.

  173. Ross, it would be fair to say then that the reviewers were not ambiguous on the null – at least at the end of the review process. Did you get a sense of whether they agreed with the usefulness of testing models against a unique climate regime?

  174. RB, the fact that it may not be possible to narrow down the climate sensitivity with certainty does not preclude the MMH test. Indeed, it would seem (to me anyway) that the exercise of highlighting the inconsistencies and the ambiguities is a central take-away point.

  175. Ross, it sounds like you believe, by extension, that on a fundamental level, it should be ideally possible to generate a truly objective estimate of climate sensitivity while James Annan says that for all practical purposes it is not.

    Taking James Annan’s perception of climate to its conclusion would appear to allow for a wider and wider range of potential realizations of climate as the number of climate models grows. I guess we could label that prediction of climate by way of a crap shoot, where the luck of the draw would outweigh other climate factors.

  176. Kenneth,

    That is the lesson I took away from this whole thing. If you use the variance created by model assumptions, they don’t provide a range, they provide a mush. Calling all model trends a range of uncertainty is crazy when the variability is so great. I think that’s what Ross means by what he wrote above. It is also along the lines of what Carrick is describing and what we all have understood. The median trend certainty is a result of the choices, not a result of model quality.

    Wow, what a mess.

  177. RB: The problem with identifying a single value for sensitivity is that it appears to depend on the relative strength of offsetting factors (CO2 sensitivity minus aerosol cooling, etc). It may be possible to test the different effects simultaneously, but there’s always the danger of omitting important effects, not counting lagged effects, etc.

    Kenneth:

    Taking James Annan’s perception of climate to its conclusion would appear to allow for a wider and wider range of potential realizations of climate as the number of climate models grow.

    That implies the process that generates models is nonstationary, in the sense of not having finite first and second moments. In statistical terms, if samples from a process do not converge on a finite mean, because the mean (and possibly the variance) is undefined, then the process is nonstationary. A simple example is the ratio of two N(0,1) variables. It’s called a Cauchy distribution, and has no finite mean or variance. It’s simple to generate in a spreadsheet, and the more observations you add, the larger the sample variance gets, and you don’t converge on a finite mean.
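
    Ross’s spreadsheet example takes only a few lines in Python; the ratio of two standard normal draws is Cauchy, and the running mean never settles down:

        import numpy as np

        rng = np.random.default_rng(5)
        n = 100_000
        cauchy = rng.standard_normal(n) / rng.standard_normal(n)   # ratio of two N(0,1) draws

        running_mean = np.cumsum(cauchy) / np.arange(1, n + 1)
        print(running_mean[[99, 999, 9999, 99999]])   # keeps wandering: occasional huge draws dominate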

    But arguing that the climate model-generating process is nonstationary, even with respect to replications of historical episodes, implies the models are not convergent on any one structure or process, in which case they cannot be viewed as representations of reality. In that case, adding more and more models would yield a wider and wider CI of results (rather than a narrower CI). But it would also mean that climate models are arbitrary and meaningless as regards the study of climate. Taking as a premise that models are representations of the climate implies that their output should be testable against observations.

  178. I actually didn’t interpret Annan to allow for a wider and wider range of climate with equal probability as you add more models – my understanding is that if model spread was higher than actual spread in observations, you should see a strongly dome-shaped rank histogram which he says is only mildly demonstrated in a model ensemble-observation comparison. My naive interpretation of the histogram is that the probability of error i.e. that observations fall outside models is 2/(N+1) for N models.

  179. The Cauchy distribution at least has tails no matter that they are fat and can get fatter with more inputs, but my read on the Annan SIP is that the distribution is flat all the way across.

    I should have added that the limit of the model extremes is determined by a Talagrand diagram and the use of a deconstructed chi test. The observed results can evidently fall outside the model range without invalidating consistency between the models and the observed. I am not at all sure how one handles the fact that the observed is a single rendition of the climate that could have been different or that the observed result has CIs or that different sources of measurement produce different observed results.

    Annan did the above analysis for models versus observed for temperature, rainfall and atmospheric pressure and reported that the diagram failed the null hypothesis of uniformity – but not by much. My question would be if it fails can the models that caused the failure by determined, removed and the same test made again minus the offending original member -or maybe it is the failure of the measurement of the observed result.

  180. Kenneth,

    The Cauchy distribution at least has tails no matter that they are fat and can get fatter with more inputs, but my read on the Annan SIP is that the distribution is flat all the way across
    That is, I believe, a wrong interpretation. Annan is only looking at whether or not the current climate model ensemble is “reliable” in the sense of the goodness of the ensemble spread vs observations and whether you should analyze it based on an SIP model or a TCP model – and he concludes that, based on the rank histogram, it is more appropriate to look at it on the basis of the SIP paradigm. Therefore, the model mean is not as significant physically as it would be for the TCP paradigm. More importantly, there is no reason to believe that adding more models would increase the uncertainty and increase the spread of output values – if the ensemble is still good enough to fit the SIP paradigm, it just means that the bin height is lower, which reduces the probability that observations lie outside the model spread. It doesn’t mean that as you add models, you will have a range of output values from daylight to infinity. On the contrary, he finds some evidence that perhaps models are a bit on the conservative side already. With respect to showing that a particular model is wrong, in the link I posted above he recognizes the problem, and it looks like they don’t yet know how to do it.

  181. BTW, to clarify, in the rank histogram, adding models increases the number of bins in the X-axis but that does not necessarily correspond to an increase in the spread of the predicted output values. A poorly constructed ensemble would start to look more and more hump-shaped as you add more models, in which case, the SIP paradigm can no longer be used for analysis.

  182. Does anyone go around saying that their GCM is constructed in such a way that when it is working correctly, it yields simulations of a planet we don’t live on?

    Ross, I believe the question of the nature of uncertainty is important – since you too acknowledge the difficulty of precisely estimating the sensitivity – and the assumptions that go into your test of model uncertainty vs what Annan proposes. In the context of Annan’s model, your test does not appear to be relevant.

  183. Ross, doesn’t RB’s comment about uncertainty cut both ways? The IPCC could not claim the confidence or attribute using Annan’s model without making a circular argument.

  184. 180 – RB, I’ve had another read and I think I see where you are coming from – are you saying that each “entry” in the histogram bins relates to one year’s output from the models and that the bins’ limits change from one year to the next? In your view is the histogram made up of an “integral” of the measured variable’s rank relative to the model ensemble’s outputs over time – ie in your example there will be 26 entries for 1955 to 1980 inc.? Sorry if I’m being slow to get this – if there is another reference please point me to it. I can see another interpretation to this but I want to check I’m understanding your view. Thanks

  185. Curious, yes, that is what I understood the rank histogram to be. In my example, there are twenty-one bins, and the numbers of entries in the bins will add up to 26 in all, corresponding to the total number of observations. Bin 5, for example, does not refer to a specific temperature interval but only to the temperature interval that ranks fifth for each year.

  186. From Jolliffe’s slides on the web, the underlying philosophy behind using rank histograms to evaluate the reliability of an ensemble for purposes of considering the members statistically indistinguishable:
    If the ensemble members and the verifying observation all come from the same probability distribution (desirable), then the probability of the verifying observation falling into a particular bin is the same for all bins.

    Thus the rank histogram should be roughly ‘flat’ or uniform.

    So long as the histogram is flat, one cannot say that the models have a warm bias or a cool bias as MMH concludes by looking at the mean.
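
    One common way to formalize “roughly flat” is a chi-square test of the bin counts against a uniform expectation; a sketch in Python with made-up counts:

        import numpy as np
        from scipy import stats

        counts = np.array([4, 2, 3, 1, 3, 2, 4, 2, 3, 2])   # made-up rank-histogram counts, 10 bins
        chi2, p = stats.chisquare(counts)                    # expected: equal counts in every bin
        print(chi2, p)   # a small p-value rejects the null hypothesis of a flat (uniform) histogram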

  187. RB:

    So long as the histogram is flat, one cannot say that the models have a warm bias or a cool bias as MMH concludes by looking at the mean.

    This conclusion depends on the sources of noise. If the histogram spread is mostly due to weather noise (and in this case it likely is), that wouldn’t be testing what you think you are testing.

  188. Carrick,
    It’s a good point, and this publication, particularly Figure 6, gets into some of the limitations of this tool. I guess one question would be whether one would expect a more strongly U-shaped curve in the presence of noise.

  189. Maybe we should insert Annan’s comment to Jeff here again:

    Just to be clear, I don’t think the models are perfect, and I would not be surprised if more concrete evidence of a mismatch were to mount up over time. However, it hasn’t happened yet – and when it does, it may be partly due to obs error too, of course.

  190. So long as the histogram is flat, one cannot say that the models have a warm bias or a cool bias as MMH concludes by looking at the mean.

    Does not a flat histogram imply that all the model results (and the observed) have an equal probability of occurring? I believe that is what Annan said about the climate forcing calculation he does, and that is what I am saying.

    I did not say that there was no limit to the range of model results that would be consistent with an observed result, just that that range will be and is large.

    RB you would though have to explain in more basic terms what really does limit the range of consistent model results and not keep referring to the shape of the histogram. If Annan is using a chi square test result (deconstructed or otherwise) he would not reject the null hypothesis of a uniform distribution in the histogram with a p > 0.05. That test is not a very powerful one for determining a goodness of fit for a distribution and particularly when the probability p has low values.

    The scaled ratios (SR) from the 21 model results that Christy used in their recent paper fit a normal distribution with a p = 0.87 using the Shapiro-Wilk test for normality. I do not know what Annan would have found with his histogram tests on SR. His results depended on gridded climate measures. And his result failed the significance test for a uniform histogram.

    I think that we need to be very skeptical of a concept like the SIP until someone shows in more concise terms (than Annan did) what it really means and can show that the reasoning is not circular.

    Actually what the SIP concept shows is a great deal of inherent uncertainty in the model results with regards to predicting future or past climate. Who am I to argue with that result? It borrows the concept of statistically indistinguishable from weather forecasters, yet most climate scientists appear loath to consider climate and weather prediction to be anything similar.

  191. Just to be clear, I don’t think the models are perfect, and I would not be surprised if more concrete evidence of a mismatch were to mount up over time. However, it hasn’t happened yet – and when it does, it may be partly due to obs error too, of course.

    It is also important to remember that Annan’s use of the SIP approach to comparing climate model and observed results is stated by Annan to be a (convenient) way of looking at the results without being able to prove the validity of the approach.

    Also, his statement that it has not been shown that there is a mismatch between the models and the observed results implies that all the models are correct, or, at least, that some model results are not more correct than others.

  192. RB you would though have to explain in more basic terms what really does limit the range of consistent model results and not keep referring to the shape of the histogram.
    Don’t know, like the professionals I suppose.

    The scaled ratios (SR) from 21 model results that Christy used in their recent paper fit a normal distribution with a p = 0.87 using Shapiro-Wilk test for normality.

    If the SIP is true, I don’t know whether that implies that the scaled ratios should not fit a normal distribution.

  193. RB, thanks for the candid replies and for the Jolliffe slides. The slides were very instructive on exactly how one constructs the rank histogram.

    On rereading Annan’s summary of a draft paper on doing the rank histogram for model results using gridded temperature, precipitation and atmospheric pressure, he claimed (I think) that the total chi-square test did not reject the null hypothesis of uniformity, but noted some troubles with trend bias on temperature and a humped center on sea level pressure. He does not say, but implies, that those non-uniformities would fail the decomposed chi-square test.

    At his blog in the link below he shows the 3 histograms and says:

    “…the rank histograms (of surface temperature, precipitation and sea level pressure from top to bottom) aren’t quite uniform, but they are pretty good. The non-uniformity is statistically significant (click on the pic for bigger, and the numbers are explained in the paper), but the magnitude of the errors in mean and bias are actually rather small.”

    http://julesandjames.blogspot.com/2010/01/reliability-of-ipcc-ar4-cmip3-ensemble.html

    Take a look at the 3 graphs and tell me if the temperature and sea level pressure histograms look uniform to you. The statement on his blog appears to be different than the one in the draft of the paper. How would I interpret that “the non-uniformity is statistically significant”? It sounds like the null hypothesis of uniformity was rejected.

  194. Kenneth, you are correct that the figures don’t look uniform. The body of the paper explains as follows:
    We should note, however, that these errors are in fact relatively small compared to the ensemble ranges themselves. The surface temperature histogram can be effectively flattened by both subtracting a mean bias of 0.5C, and adding random noise of magnitude 1C to the data to increase their overall spread. These figures only amount to 7% and 13% respectively of the typical ensemble range of 7.7C at each gridpoint.

    Still, the spread looks a little too wide, and also adding more models is unlikely to increase the spread. Annan interprets it as more SIP-like than TCP-like.

  195. 209 – RB, thanks for the clarification and additional link. One thing I’m wondering about wrt the rank histogram is that it may not inform on the modelled vs. measured trends. For example, let’s say we have a model ensemble with a higher trend than the measured trend; could it be that over the course of time the measured variable would gradually traverse across the bin range from high to low? If the difference in trends is approximately constant, it would make this traverse at an approximately constant rate, and therefore at the end of the complete time period the rank histogram would be reasonably evenly populated and therefore “level”. However if snapshots had been taken and compared at (say) 5-year intervals, this traversing bulge would have been visible?

  196. Curious,
    What you’ve described is exactly the issue. How do you describe the trend of a model ensemble? Having said that, you can describe cases where each of the individual models produces an output that is trendless (or flat) over the measuring period, while the observation starts at the lowest model output value and traverses across each of the individual model outputs over the measuring period, generating a rank histogram that meets the uniformity requirement. However, then they would probably not look to have been drawn from the same statistical distribution, and we really need to look at the chi-squared test etc. anyway and not decide on the basis of the shape of the histogram alone.
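
    The trendless-models case RB describes is easy to simulate (all numbers here are invented): 20 flat models at evenly spaced levels with a steadily rising observation give a nearly uniform rank histogram even though the trends disagree completely:

        import numpy as np

        n_models, n_years = 20, 200
        model_levels = np.linspace(0.0, 1.0, n_models)   # trendless models at evenly spaced values
        obs = np.linspace(-0.05, 1.05, n_years)          # observation trends across the whole model range

        counts = np.zeros(n_models + 1, dtype=int)
        for o in obs:
            counts[np.searchsorted(model_levels, o)] += 1

        print(counts)   # roughly equal counts per bin: "reliable" by flatness, despite the trend mismatch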

  197. Hamill01 is a well-cited AMS article which elucidates some of the pitfalls and limitations of the rank histogram method. For example, it presents an illustration (one of a few different case situations) of how one can end up with “illusory” flatness of the histogram.

    Here is an interesting quote where Hamill cites Gilmour and Smith (97) about a diagnostic for illusory flatness:

    These scenarios illustrate that reliability alone is not a good metric of forecast quality, and reliability apparently can be achieved even if samples from ensemble forecasts and the verification are not drawn from the same distribution. When, then, is reliability as diagnosed from rank histograms indicative of proper random sampling and when is it not? The results of Gilmour and Smith (1997) and Smith (1999) suggest that reliability may be illusory unless it is possible to find a model state that “shadows,” or follows, the evolution of the real atmosphere within an error tolerance consistent with magnitude of analysis uncertainty. If a shadowing trajectory can be found, it can be attributed to be sampled from the same distribution as the truth. Conversely, if no model state can be found with this property, then the ensemble is sampling some other probability distribution than the one the truth is drawn from, and hence any noted reliability from a rank histogram may be considered illusory.

  198. 220 – Thanks RB, good to have reached agreement! 🙂

    FWIW – I looked at the Jolliffe slides you linked and I was struck by the lack of diagnostic power of the statistical tests proposed. From the examples I was left with the impression that the numbers that the tests produce are at best subjective guides. I have some old knowledge of FEA modelling and I must say this present area seems to be falling over itself to avoid actually saying the models aren’t very good. IMO until that is on the table it will be hard to make progress. Usual caveats of not having up to date maths skills nor having read the detailed papers apply!

  199. 221 – Layman, that looks like an interesting paper. Can anyone tell me where else the approach of ensemble modelling is used?

  200. Intuitively (correct me if I am wrong), I would think that the monotonic anthropogenic increase of CO2 would be expressed as a low frequency trend in the models, with very little to do with HF variability. IOW, if models do a good job reproducing the higher frequency wiggles of the observations, but observations do not contain the low frequency trend, is this not set up to play out exactly in the manner described by Curious in #219?

  201. Carrick @ Post #115 and for completeness:

    I initially jumped to the wrong conclusion about what you were doing with your equations in this post, but, of course, the variance relates to that of the trend line slope.

    What I would ask now is why you could not simply take a difference series between the modeled and observed time series and hypothesize the null that the trend slope of the difference series is zero. The variance of the difference slope would have to be adjusted for autocorrelation, but I think the autocorrelation of a difference series often requires less adjustment than that of the two individual series.
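
    Here is a sketch of that difference-series test with invented series, using the common lag-1 effective-sample-size adjustment for autocorrelation (one possible choice, not necessarily the adjustment used in the papers under discussion):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    t = np.arange(240)                                   # e.g. 20 years of monthly anomalies
    model = 0.002 * t + rng.normal(0, 0.1, t.size)       # made-up modeled series
    obs = 0.001 * t + rng.normal(0, 0.1, t.size)         # made-up observed series
    d = model - obs

    res = stats.linregress(t, d)                         # null: slope of the difference is zero
    resid = d - (res.intercept + res.slope * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]        # lag-1 autocorrelation of residuals
    n_eff = t.size * (1 - r1) / (1 + r1)                 # effective sample size
    se_adj = res.stderr * np.sqrt((t.size - 2) / (n_eff - 2))
    t_stat = res.slope / se_adj
    p = 2 * stats.t.sf(abs(t_stat), df=n_eff - 2)
    print(f"difference slope = {res.slope:.5f} per step, adjusted p = {p:.3f}")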

  202. LL thanks for the link to the Hamill paper. It would appear on a cursory look that it addresses some of the issues I was thinking about – as a layperson and certainly not as a statistician.

    Curious @ Post #122: Those are my feelings exactly – but feelings they are and not hard counter facts or theory.

  203. …as a layperson…

    Hmmm…. In this age of gender-neutral political correctness I may have to change my moniker. 🙂

  204. 224 – Layman, I’ve wondered that for a while. Aren’t the models, over the “climate relevant time period”, simply going to be driven by their fundamental algorithms? I don’t see how they can’t be, and if they aren’t then my feeling is that we are into non-convergent solution territory, as Ross mentioned upthread. FWIW, in FEA if your “solution” doesn’t converge you do not have a solution, though I realise Ross was making a statistical argument. Similarly, in FEA if your “solution” does not agree with empirical measurement and test it’s not really worth very much.

    226 – Kenneth, I’m not sure this is an area where hard counter facts or theory actually offer anything. My stats are very poor, but things like correlation coefficients etc. strike me as subjective indicators which need to be backed up by experience and good experimental design. If those are there, and the experiment has been set up from the outset with a meaningful pass/fail criterion, then I can be persuaded. But the current situation seems to be that there is a group of programmes producing results that differ from each other, and from reality, and attempts are being made to use after-the-fact analysis to suggest all is well. I’m afraid this seems back to front to me, and hence my question about where else “ensemble modelling” is used – I’ll do some more googling.

  205. #225 Kenneth Fritsch & #228 Curious

    I would also be curious as to the sensitivity of the RH to the removal of anthropogenic CO2 from the ensemble suite. It seems to me that if such a test did not return a definitive “cool bias” shift in the RH then it would really limit the inferential power of SIP.

  206. RB, the assertion on which figure 2.5 was based was stated as this:

    It is likely that there has been significant anthropogenic warming over the past 50 years averaged over each continent (except Antarctica) (Figure 2.5).

    That statement was qualified with this:

    Difficulties remain in simulating and attributing observed temperature changes at smaller scales. On these scales, natural climate variability is relatively larger, making it harder to distinguish changes expected due to external forcings.

    What is the modelled AGW trend for the MMH period and how does this compare with yearly and decadal variability of both the models and observation?

    Since MMH was a 20-year span, I would have to say I’m unconvinced by your suggestion in #230 (I don’t have the link, but didn’t someone say that Annan’s RH indicates a slight model warming bias by eyeball?). I’m open to argument though.

  207. Hamill uses a N(0,1) set, i.e. a normal distribution with mean = 0 and standard deviation = 1, to show some of the problems with using the rank histogram as a diagnostic tool. He makes clear that the data from which the histogram is constructed can come from any distribution, including a normal one.

    The flat histogram can show that all the model results in an ensemble can have the same probability of representing the truth. I think I may have previously made a misleading observation that would have confused the underlying distribution with the histogram diagnostic.

    Anyway, I am a little confused about Annan’s statement that I think ruled out a normal distribution of model results since a normal distribution, in my mind, would be truth centered, or, at least, truth centered as the models see it. Is there a statistical/diagnostic tool that Annan uses that can determine the underlying model distribution or rule out a given distribution?

  208. Kenneth,
    In the link I posted in #141 to Annan’s blog post he outlines his reasoning for why the TCP paradigm is implausible and therefore, in his own words to a question from me “To be honest, I don’t really think the models are truly statistically indistinguishable from the truth, I’m really just trying to promote it as the appropriate standard that the ensemble should be tested against (in contrast to the bizarre and useless truth-centred idea). [I’m surprised and encouraged as to how well the tests have worked out so far, but that doesn’t actually make it *true*.]”

    I don’t think one should try to read anything more into the normal vs uniform distribution than that. Suffice it to say that on the face of it, the model mean-based approach to analyzing current models looks weak.

    (LL: Annan’s histogram shows the models to have a slight cool bias, not a warm bias; I posted this, based on his blog post, somewhere above.)

  209. Kenneth, I’m not sure this is an area where hard counter facts or theory actually offer anything.

    Curious, my point here is that when I see a climate scientist, who would appear also to be a policy advocate, make claims, I have a feeling that those claims must be further scrutinized, and by those with expertise far beyond mine. When I am first introduced to the subject and the claim, I can have a feeling about the author, but I can only attempt to understand clearly what they are claiming, look for others with counter-arguments, and then weigh all the evidence. Usually a few well-conceived sensitivity tests can do the trick.

    In Annan’s case I will borrow a comment that RB provided in a comment above:

    “To be honest, I don’t really think the models are truly statistically indistinguishable from the truth, I’m really just trying to promote it as the appropriate standard that the ensemble should be tested against (in contrast to the bizarre and useless truth-centred idea).”

    I find the adjectives used to describe the truth centered idea a bit beyond what I would expect from an otherwise disinterested scientist.

    I think, after reading the Hamill paper linked above, I have my head around what a statistically indistinguishable model ensemble means. Now perhaps I do not understand what a truth-centered one means. I will go back and read again what Annan says about his evidence (as opposed to hand waving) for SIP and against TCP. I also have to think harder about why Annan favors SIP and then writes a paper about why the mean of the ensemble results is almost always better than a single individual one.

    Obviously Annan could not say that SIP always holds, since it depends on individual models and modelers being able to produce results that are always part of the same distribution.

  210. In the link I posted in #141 to Annan’s blog post he outlines his reasoning for why the TCP paradigm is implausible…

    RB, I am going to excerpt from the link to Annan’s blog what I think are the salient points he makes for SIP versus TCP and then comment on them myself.

    From Annan’s blog we have his following comments linked here:

    http://julesandjames.blogspot.com/2010/01/reliability-of-ipcc-ar4-cmip3-ensemble.html

    The basic paradigm under which much of the ensemble analysis work in recent years has operated is based on the following superficially appealing logic: (1) all model builders are trying to simulate reality, (2) a priori, we don’t know if their errors are positive or negative (with respect to any observables), (3) if we assume that the modellers are “independent”, then the models should be scattered around in space with the truth lying at the ensemble mean. Like so: (My comment here to note that without the picture reproduced I can describe a circle with concentric lines with the red star at center and dots representing model results scattered about on the circle.) where the truth is the red star and the models are the green dots.

    However, this paradigm is completely implausible for a number of reasons. First, since we don’t know the truth (in the widest sense) we have no possible way of generating models that scatter evenly about it. Second, this paradigm leads to absurd conclusions like a 90% “very likely” confidence interval for climate sensitivity of 2.7C – 3.4C, based on the sensitivities reported by the AR4 models (this comes from a simple combinatorial argument based on the number of models you expect to be higher and lower than the truth, if they lie independently and equiprobably on either side). Third, it implies that all we would need to do to get essentially perfect predictions is to build enough models and take the average, without any new theoretical insights or observations regarding the climate system.

    Lastly, it is robustly refuted by simple analyses of the ensemble itself, as observations (of anything) are routinely found to lie some way from the ensemble mean. As has been demonstrated in several papers including the multi-author review paper mentioned above.

    Annan describes the truth, which I assume is the observed result or one rendition of a potential observed result, as lying at the center of the model results under that paradigm. I would call that an egocentric centered paradigm. Those who use a model mean and its variation to compare the model results with the observed are assuming something different: that the model results are centered on a mean value that might well be biased some distance from the truth, or the observed. Why else do the test? I am not sure what Annan means when he says we do not know the truth. Is he saying the models do not know the truth without the observed result, or that the observed result is just one rendition of the truth, i.e. that there could be many possible realizations of the observed result?

    After invoking this uncertainty about the truth for the model results, he then appears to “know” that the 90% CI range for the climate sensitivity estimated from a TCP treatment of the model results is too tight – and without further explanation.
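
    For what it is worth, the combinatorial argument he alludes to can at least be sketched: if each of N models independently falls above or below the truth with probability 1/2, the chance that the truth lies between the k-th lowest and k-th highest model value is a binomial calculation, and it becomes large for a surprisingly narrow central slice of the ensemble. The N below is illustrative only; this is not an attempt to reproduce his 2.7C – 3.4C figure.

    from scipy.stats import binom

    N = 22                        # hypothetical number of AR4 models
    for k in range(1, 10):
        # truth above the k-th lowest and below the k-th highest model value
        p = binom.cdf(N - k, N, 0.5) - binom.cdf(k - 1, N, 0.5)
        print(f"truth between model ranks {k} and {N - k + 1}: P = {p:.3f}")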

    He finally seems to be saying (and I need to check on these references next) that the observations lie “some way from the ensemble mean” (which I suppose is a vague way of saying that the model results might have a less peaked distribution than expected – but I am not at all certain) and that his referenced papers bear out that observation. I am not sure that it means the expected distribution for a TCP is rejected by chi-square or other statistical tests.

    It is interesting, though, that Annan claims there is a climate science consensus favoring the TCP approach and that that approach would fail some “robust” tests indicating it to be invalid, and yet it continues to be used by the consensus. Maybe the consensus and Annan are both wrong?

    On to reading the references.

  211. One more comment on Annan’s problem with an ensemble mean, namely that, under his interpretation of TCP, if a sufficient number of model runs were used and averaged, the truth would be found. I would re-phrase that to what I said above about a potential bias in the models, with the model results scattered about a biased mean. Much depends here on whether the models keep getting better and whether the mean would then become less biased relative to the observed; if they did not, then adding runs would merely narrow the CIs on our estimate of the models’ bias relative to the observed.

    In the end, though, one knows that TCP and SIP are simply convenient ways of looking at model results, and both depend on non-random individual modeler choices, or even political choices, about which models to include in an ensemble.

  212. 237 – Kenneth, yes I agree the terminology is a bit odd – is this used anywhere else? Also, I’d still welcome input on other areas where ensemble modelling is used along with any comments on its value.

  213. To go back to an analogy used during the debate about Douglas and Santer and whether SE or SD was the appropriate statistic for comparison: Presume that we live in the HHG2G’s universe and we have access to the Magrathean planetary construction facility so that we can build any number of Earths. So we build 1,000 Earths and initialize them all at 1850 conditions and start the clock. Why would anyone think that the mean of the climate of all those synthetic Earths would be identical to our Earth’s current climate?

    Our current climate is just one realization out of many of a probably chaotic system and we really have no idea about the spread of the distribution. Weather noise may give some idea, but we really don’t have a long enough time series to define the low frequency part of the noise spectrum. It may be very red indeed, leading to large and rapid dispersion of results. There is no reason at all to believe that our particular realization is going to be anywhere near the center of the distribution.

    I think that may be Annan’s point about truth centered or statistically indistinguishable paradigms.
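
    A toy version of that thought experiment, with a strongly red AR(1) process standing in for each synthetic Earth (all parameters arbitrary), makes the dispersion point:

    import numpy as np

    rng = np.random.default_rng(3)
    n_earths, n_years, phi = 1000, 160, 0.98      # "1850 onwards", very red noise
    x = np.zeros((n_earths, n_years))
    for t in range(1, n_years):
        x[:, t] = phi * x[:, t - 1] + rng.normal(0, 0.1, n_earths)

    final = x[:, -1]
    print(f"ensemble mean at the end: {final.mean():+.3f}")
    print(f"spread (std) at the end:  {final.std():.3f}")
    print(f"one realization:          {final[0]:+.3f}")   # often well away from the mean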

  214. I read Annan’s excerpt as negating and rejecting the value of a model mean, since it could be almost anywhere compared to the truth. He says that it is “implausible” to assume that the model results will be scattered evenly around the actual value.

  215. 241 – DeWitt Payne – I hesitate to say this but isn’t your second para. effectively saying that the models are useless?

  216. Re: curious (Oct 8 21:20),

    Not necessarily. But they are definitely not fit for the purpose of estimating the climate 90 years in the future, especially given the large uncertainty in the forcings over the next 90 years.

  217. One of the concepts that I struggle with wrt SIP and rank histograms goes back to Ross’s comment wrt null ambiguity and possible circularity of logic. If this concept is to have any inferential power at all, it seems to me that it must rest on solid priors. That is to say, for the variable coefficient beta of X1, the prior for the ensemble uncertainty range for beta must be a solid scientific belief in a specific value for beta with an uncertainty range clearly centered on that value. If the priors for beta are not definitive and encompass an uncertainty range centered over multiple values of beta, then how can the claim be made that the ensemble range comes from a single representation of reality?

  218. What DeWitt Payne says at Post #241 is, of course, what is implied by some climate scientists who work with climate models but what they never quite say in my view. If there are many potential realizations of climate (with or without GHG effects included) and we do not know the distribution whence they come, then surely climate predictions for the future (and past) are a crap shoot.

    We could only hope to put limits on the range of possible outcomes (all with equal probability of occurring), but we would never be able to determine whether the limits are correct.

    I am not at all certain what the view described by D. Payne above for alternative climate outcomes means for comparing model results with the single observed result, which could have had several different “correct” values. The first question would be: do we know for sure that the models can emulate the chaotic nature of the climate, and if they do, how much variation do we expect to see when the same model is run several times with the same initial conditions? How does that variation compare with the variation of the mean results of several different models? Is what Annan says about SIP compatible with the view described above by D. Payne?

    Can we refute the TCP from the model results distributions and the observed result? Annan was coy about the references in his blog, and I have not as yet been able to find the papers that supposedly show this – and in robust form.

    Annan also comments that the denialists will not be happy about what he shows supporting SIP, because it makes the model results, in total, correct. I do not know about the denialists, but this skeptic has already asked: who am I to question a climate modeler who is willing to admit that, although all the model results could be correct, the range of possible outcomes covers at least the full range of the models, with every part of that range having the same probability of occurring?

  219. I think another aspect should be considered: Dr. Koutsoyiannis on model predictability, http://climateaudit.org/2008/08/30/?p=3361. Comparing this with Gravel and Browning’s work, can any conclusion be reached other than that the IPCC could not claim what they did?

    Along with this, posted by Dr. Browning on the http://climateaudit.org/2007/05/02/exponential-growth-in-physical-systems-2/ thread:

    “draft of the unpublished manuscript by Sylvie Gravel, G. L. Browning, F. Caracena and H. O. Kreiss entitled

    The Relative Contributions of Data Sources and Forcing Components to the Large-Scale Forecast Accuracy of an Operational Model

    The manuscript contents were presented at CIRA (CSU) and later at a Canadian conference by Sylvie.”

    Just what does Annan mean “in total, correct”?

    Has Annan refuted Browning and Koutsoyiannis? If models are known to be incorrect for short time scales and for present gridsize, and now for 30 years, at what point do we get into long term persistent phenomena, and conclude the null of climate change is valid?

  220. Just what does Annan mean “in total, correct”?

    John Pittman, that was my terminology, and what I mean is that if I have a wide range of model results and they all have the same probability of being correct, then, taking all the results together, Annan can say the models are right. The problem is that that wide range places a lot of uncertainty on the model results, again taken as a whole.

    I excerpt below some comments from your linked Koutsoyiannis paper which indicate agreement with the DeWitt Payne view above and yet seem to compare the model and observed results in another light.

    Specifically, in a HK climate, the uncertainty at a climatic (30-year) scale proves to be only slightly lower than that at the annual one (Koutsoyiannis et al., 2007), in contrast to the classical approach, which yields significant reduction as we proceed from the annual to the climatic scale and justifies different perception of climatic and finer scale views of processes.

    Furthermore, Koutsoyiannis (2006b) has demonstrated, using a toy model with fully known simplified deterministic dynamics capable of producing a HK climate, that even slight perturbations in initial conditions produce very high departures, not only at a fine time scale but also (and mainly) at the climatic time scale. Such a result is in line with Collins (2002), who used a GCM (HadCM3) and, assuming this to be a “perfect” model, concluded that the climate predictability is likely to be severely limited by chaotic error growth…

    ..Our falsification/validation framework merely involves spatial interpolation of the GCM output fields to infer their values at the points of interest…

    ..Moreover, at all scales, we provided comparisons between the observed and modelled average and standard deviation (the former is obviously the same at all scales). In addition, in the annual time series, we calculated and compared the first-order auto-correlation coefficient and the Hurst coefficient, whereas, at the climatic scale, we also compared three fluctuation indices.
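
    The first excerpt is the Hurst-Kolmogorov scaling result: the standard deviation of an n-year mean falls off as n to the power (H - 1) rather than the classical one over the square root of n. A two-line check with H = 0.95, the sort of value quoted for temperature records:

    # Reduction of the annual-scale standard deviation when averaging over n years:
    # classical independence versus Hurst-Kolmogorov scaling with H = 0.95.
    H, n = 0.95, 30
    print(f"classical factor for a 30-year mean: {n ** -0.5:.2f}")     # about 0.18
    print(f"HK factor with H = 0.95:             {n ** (H - 1):.2f}")  # about 0.84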

  221. There’s a post up at Pielke Sr.’s blog on models with a long quote from Kevin Trenberth about climate models. Here’s one sentence that was highlighted:

    None of the models used by IPCC are initialized to the observed state and none of the climate states in the models correspond even remotely to the current observed climate.

    Which is, of course, why you only see global average anomalies reported.

  222. 250 – Hi DeWitt – thanks for flagging that up. First time I’d seen it – I found it pretty shocking but not surprising, if that doesn’t sound contradictory!

    I googled your selected quote and hit on this Nature blog source:

    http://blogs.nature.com/climatefeedback/2007/06/predictions_of_climate.html

    Interesting comments and I agree with poster hunter that Eduardo Zorita hits the nail on the head. Amazing to me this was three years ago – what has changed since?

  223. The material presented in both the links to Kevin Trenberth and Judith Curry appears to me to be a bit nuanced and does not really provide much definitive information about climate models.

    I found the link below to an article that again is not very definitive, but it has a bibliography of other references that might be (I have not read them yet) more definitive and useful to this discussion.

    http://rsta.royalsocietypublishing.org/content/365/1857/1957.full.pdf+html

    I was looking for articles that address directly the issues raised in this thread. I did find a link from the link above to a paper coauthored by James Annan where Annan states:

    Model projections of anthropogenically forced climate change are sensitive to many aspects of model construction, in particular the values of parameters which are not well determined by direct observations or theoretical arguments but which directly impact on the future trajectory of model simulations. Although short-term numerical weather prediction is essentially an initial-value problem, the behaviour of a coupled atmosphere–ocean model in response to a forcing scenario on climatological (multidecadal and upward) timescales is determined much more by the details of its parameterizations rather than the initial state. Therefore, much attention has recently been given to the parameter estimation problem in this context (e.g. Forest et al. 2002; Gregory et al. 2002; Knutti et al.

    Annan is saying that in climate models the major issue is the details of the parameterizations and not the initial conditions; weather prediction, by contrast, is sensitive to initial conditions.

  224. Kenneth,
    You can find more about the initialization versus boundary-condition issues here. As I understood it (any errors are my interpretation), the response to an injected CO2 pulse is sensitive to initial conditions on the 10-30 year timeframe, while the trend over a century timescale due to the injected pulse is more of a boundary-condition problem.

  225. #253 Kenneth Fritsch

    If climate models are not sensitive to initial conditions, why does Trenberth say this:

    “Of course one can initialize a climate model, but a biased model will immediately drift back to the model climate and the predicted trends will then be wrong.”

  226. Brian H:

    Parameters are plugs. The very word means not measured. Data-free input, adjust to suit.

    I wouldn’t go that far. I’d call them “dials”. Obviously there isn’t any first-principles way to determine CO2 or sulfate emissions forcing. These have to be measured, and to the degree they can’t be measured, their values have to be constrained by the model.

    When one does geophysical prospecting, for example, the entire problem is in tweaking the parameters of the model, in this case the 3-d density model of the subsurface you are trying to measure.

    When applied correctly, it does work very well.
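
    A toy version of the “dials constrained by measurement” idea, with synthetic data and a made-up response function, purely to illustrate recovering a parameter from observations rather than plugging in an arbitrary value:

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(5)
    t = np.linspace(0, 10, 50)

    def response(t, sensitivity):
        return sensitivity * np.log1p(t)                  # stand-in response to a ramped forcing

    obs = response(t, 2.0) + rng.normal(0, 0.2, t.size)   # "measurements" with noise

    popt, pcov = curve_fit(response, t, obs, p0=[1.0])
    print(f"recovered parameter: {popt[0]:.2f} +/- {np.sqrt(pcov[0, 0]):.2f}")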

  227. I should have said “the 3-d density model of the subsurface you are trying to determine” to be unambiguous.

    Also are inverse solutions measurements? I would call them that, others might argue.

  228. Another example is that of modern semiconductor process models (BSIM) which have long since moved away from parameters with a purely physical interpretation, but they are constrained by measurements.

  229. #253 Kenneth Fritsch

    “The material presented in both the links to Kevin Trenberth and Judith Curry appear to me to be a bit nuanced and do not really provide much definitive information about climate models. ”

    I was very disappointed in the Judith Curry thread on this for precisely the reason you note above.

    Essentially, much ado about nothing. I believe this was deliberate on the part of those few contributors who are knowledgeable about detailed climate modeling.

  230. Parameterization has a specific meaning in model construction. Because of computational restrictions, it’s not possible to use line-by-line or even band models for radiative transfer calculations in a GCM. Therefore RT is reduced to a simplified empirical expression or parameterized. But just as the entire model cannot possibly be identical to the real world, a parameterization is necessarily less accurate than a method that is more complex and more fundamental with fewer adjustable parameters.

  231. Re: curious (Oct 11 18:09),

    The current projection method works to the extent it does because it utilizes differences from one time to another and the main model bias and systematic errors are thereby subtracted out. This assumes linearity. It works for global forced variations, but it can not work for many aspects of climate, especially those related to the water cycle.

    from the Nature article of Predictions of climate
    Posted by Oliver Morton on behalf of Kevin E. Trenberth.

    I think the problem lies here. On the one hand there is this claim that the method works, to the extent that it does, for global forced variations. But this is exactly what has been tested several times now, and each time the test is completed we get claims that the models do not exhibit these characteristics. Which is it? This goes to the heart of the issue, and of Pielke Sr.’s link.

    Without acceptance of at least this “method”, the attribution in IPCC AR4 WGI is a circular argument: unsupported speculation, compounded by the problems in the proxy reconstructions.

  232. Trenberth talks about scenarios and the climate models, and says that the IPCC cannot make climate predictions because it cannot or does not predict scenarios. Somehow that misdirection bothers me, when it is obvious to all that a scenario (for the emissions of future GHGs) is required for a prediction by the climate models and that, furthermore, that is what the models are supposedly all about, i.e. allowing us to attempt to predict climate given a GHG emission rate into the future. Does Trenberth think the IPCC is being misleading, or does he think the readers of their reports are misled?

  233. DeWitt Payne, obviously the limitation of computational power for doing line-by-line calculations of the GHG radiative transfer, which leads to a parameterized shortcut, is behind just one of the many parameters used in the GCMs. The uncertainties from using parameters are not always of the same kind as the limitations in this example, either.

    DeWitt, you have, as I recall, noted that simple models with energy balances for the entire globe make use of line-by-line radiative transfer. These are the simple models that, I assume, lead us to the well-established physics of GHGs and allow us to predict a global temperature increase with increasing GHGs without the effects of feedback. Obviously these simple models can tell us little about temperatures at scales less than global.

    My question is: how complex does a climate model need to be to start showing the effects of feedback, even if it is limited to the global scale?

  234. Kenneth, I wish preparing a report were this easy – I’ll predict that the 10C is less likely than 1.5C thanks to all of the aerosols coming out of India and China.

  235. RB at Post #254:

    Your link, with statements implying that the influence of initial conditions on climate extends out to 30-40 years, would appear to be diametrically opposed to Annan’s reference to weather versus climate time frames for the initialization effects.

  236. RB, you got me. I neglected to ask for some scenarios from you, but obviously you have some in mind. By the way, I think future levels of aerosols are played down in most of those scenarios commonly proposed.

  237. Kenneth, I didn’t read the paper, but your quote, cited again below, seems to be consistent with Meehl et al., in particular with Figure 3 of the linked paper showing a diminishing uncertainty over longer timeframes:
    Although short-term numerical weather prediction is essentially an initial-value problem, the behaviour of a coupled atmosphere–ocean model in response to a forcing scenario on climatological (multidecadal and upward) timescales is determined much more by the details of its parameterizations rather than the initial state.

    You can argue about what Annan means by multidecadal, but you can hardly call it diametrically opposed.

  238. RB, on further thought I will have to agree that Annan’s comment is not diametrically opposed. I was misled by his reference to weather and its dependence on initial conditions. For clarity, for me anyway, why did he not simply state: “The behaviour of a coupled atmosphere–ocean model in response to a forcing scenario on climatological (multidecadal and upward) timescales is determined much more by the details of its parameterizations rather than the initial state.”

    Annan could have gone on to explain that initial conditions can have a significant influence beyond the time frame of weather and up into the time range where parameterization becomes the major influence – if indeed that is what he intended.

    But thanks, RB, for assisting me through these less than definitive statements I frequently see in these workshop reviews on climate modeling. I continue to be uninformed as to anything definitive in estimating the uncertainty of the climate models.

    I am thinking that perhaps it would be an easier task to make these uncertainty estimations with the simpler models that are restricted to global temperature changes, and then use the GCMs to tell us something about how the changes would be distributed about the globe.

  239. Re: Kenneth Fritsch (Oct 12 19:13),

    My question is: How complex does a climate model need be to start showing the effects of feedback even if it is limited to global scale?

    *sound of hands waving*

    You can include a feedback parameter in a simple model, but the choice of the magnitude of the parameter is arbitrary. If you want the feedback to arise from the model rather than be specified initially, then I think you need a fully coupled air-ocean general circulation model. Even then, it’s not clear, to me at least, whether the magnitude of the feedback arises from the details of the necessary parameterizations or from the basic physics. In the end, feedback is all about water vapor and suspended water droplets (clouds). We know models do a poor job of modelling clouds and suspect strongly that water vapor isn’t done all that well either. The wide range of rate of increase of precipitation/evaporation with temperature between models (a negative feedback as an increase in latent heat transfer lowers the expected temperature increase) is evidence that all is not well on that front.
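
    To make the first point concrete, a zero-dimensional energy-balance sketch: equilibrium warming is roughly dT = F / (lambda0 - f), so the answer is set entirely by the feedback parameter f one chooses to plug in. F and lambda0 below are standard textbook values; the f values are arbitrary.

    F_2xCO2 = 3.7     # W/m^2, forcing for doubled CO2
    lambda0 = 3.2     # W/m^2/K, no-feedback (Planck) response

    for f in (0.0, 1.0, 2.0, 2.6):                   # arbitrary feedback choices, W/m^2/K
        dT = F_2xCO2 / (lambda0 - f)
        print(f"feedback f = {f:.1f} -> sensitivity ~ {dT:.1f} C per doubling")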

  240. There is sooooooo much we do not know and have not observed – the interactions and reactions and relationships of proportions of water vapor and clouds and ocean currents and feedbacks and feedback mechanisms, etc. etc. – we are children drawing childish pictures of a giant head with stick arms and legs, and we toddle about making animated versions of our silly picture while wondering aloud about the precise measurements of the sticks and how they are attached…

  241. Good news!

    Stoat Sacked!!

    Lawrence Solomon: Global Warming Propagandist Slapped Down

    Financial Post, 15 October 2010

    William Connolley, arguably the world’s most influential global warming advocate after Al Gore, has lost his bully pulpit. Connolley did not wield his influence by the quality of his research or the force of his argument but through his administrative position at Wikipedia, the most popular reference source on the planet.

    Through his position, Connolley for years kept dissenting views on global warming out of Wikipedia, allowing only those that promoted the view that global warming represented a threat to mankind. As a result, Wikipedia became a leading source of global warming propaganda, with Connolley its chief propagandist.

    His career as a global warming propagandist has now been stopped, following a unanimous verdict that came down today through an arbitration proceeding conducted by Wikipedia. In the decision, a slap-down for the once-powerful Connolley by his peers, he has been barred from participating in any article, discussion or forum dealing with global warming. In addition, because he rewrote biographies of scientists and others he disagreed with, to either belittle their accomplishments or make them appear to be frauds, Wikipedia barred him — again unanimously — from editing biographies of those in the climate change field.

    I have written several columns for the National Post on Connolley’s role as a propagandist. Two of them appear here and here.

    Financial Post, 15 October 2010

  242. Re: Don Keiller (Oct 15 12:35),

    One, two! One, two! And through and through
    The vorpal blade went snicker-snack!
    He left it dead, and with its head
    He went galumphing back.

    “And hast thou slain the Jabberwock?
    Come to my arms, my beamish boy!
    O frabjous day! Callooh! Callay!”
    He chortled in his joy.

    ’Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe;
    All mimsy were the borogoves,
    And the mome raths outgrabe.

  243. OK Jeff ID, you can let the fat lady sing on this thread. I have done as much research as I am going to for the time being, and all I was able to come up with in regard to the determination of uncertainty in model and model-ensemble results was not very definitive or satisfying. I am not certain whether that results from my lack of understanding of what I read or whether what I read truly was not conclusive.

    It would appear that most estimations of uncertainty in climate model results are based on a Bayesian approach with either an expert or an uninformative prior, which is combined, in the usual Bayesian manner, with a likelihood function to produce a posterior probability density function. The circular reasoning that we think we see apparently comes from an explicit or implicit use of a prior. The scientists doing Bayesian analysis prefer their methods because the assumptions are made upfront and are transparent, with the flexibility to change them and do sensitivity testing.

    Uncertainty estimations have been carried out with GCMs and with less complex models where the statistics can be applied more rigorously. Some use weighted model results and some do not. There is no doubt that most of the uncertainty estimations that I read were based on truth-centered approaches – as James Annan likes to refer to them. The statistically indistinguishable approach that Annan favors, being rather new with Annan, has had much less analysis done to date.

    All this research was a side trip from what we have been discussing with regard to the difference in warming trends between the tropical surface and troposphere temperatures. I really think we strayed from the question of how to handle a ratio of trends for a given region of the globe to one of looking at results from an ensemble of climate models.

    I propose, as a layperson in these matters, that Christy in his recent paper, by using a ratio (he could have used a difference) of surface and troposphere temperature trends (not temperatures), has the initial-condition and parameterization variations averaged out within each individual model that was used. I am assuming here that the results used were always computed from individual model runs and not from averages of several runs of an individual model. What he is then looking at is a rather unique property of the model that should be relatively constant within an individual model. I would find an analysis of the individual-model variance versus that of the ensemble of interest here. This is something that Steve M from CA was starting to look at in a different form.

    I have already looked at the distribution of the 21 model results (ratios of the surface and troposphere temperature trends) and found that it fits a normal distribution, with a high p-value from the Shapiro-Wilk test.
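
    For anyone who wants to repeat that sort of check, here is a sketch using the Shapiro-Wilk test as implemented in scipy; the 21 ratios below are invented stand-ins, not the actual values from the paper.

    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(4)
    ratios = rng.normal(loc=1.4, scale=0.2, size=21)   # stand-in for the 21 model trend ratios

    w, p = shapiro(ratios)
    print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")    # a large p gives no evidence against normality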

  244. Kenneth,

    Neat that you looked at the distribution of the models statistically and found a normal distribution. It gives some credit to Carrick’s envisioning of uncertainty in the inputs rather than good old-fashioned guessing.

    I’ve had several days to consider this issue quietly again. I think that Annan is missing the point that the simple test presented is not whether the model distribution can encompass observation but whether the typical models are running above the observed trend. We see from these methods that they are; however, there is enough spread that we cannot call everything wrong. Someone pointed out above that individual models may need to be compared and rejected one at a time, and I believe that is the definitive approach, but if the models as written represent the full range of the IPCC, and the normally distributed average runs above observation by 2-4 times, the whole top range of trend projections needs to be reconsidered IMO. The atmosphere isn’t as sensitive as these models represent.

  245. Re: Kenneth Fritsch (Oct 15 21:29), Thanks Kenneth. The normal distribution gives me more confidence in my following reasoning.

    I think climate modeling is at a crossroads. The AR4 report has an understandable methodology for resolving the circular argument of climate sensitivity and assigning attribution. The Bayesian prior resolved the issue, but has a weakness: it bounds the answer, but can be biased. The criticisms of the proxy reconstructions which the IPCC used as the method to break the circularity have been shown, at this point, to be valid. Further, the IPCC defined its measure of usefulness, and now that has been shown to be biased by a factor of 2 to 4. At this point, using the same prior, the conclusion would be that CO2 could be attributed as a cause of warming, but it would mean a doubling of CO2 yields about 0.38C to 2C, using the greatest spread of the bias. The conclusion is supported by the increased likelihood that the MWP was as warm as today, and that the actual response of the system to CO2 is smaller than the models predict. The weakness in this is that for the MWP to be warmer, climate sensitivity should be higher, not lower. And we know where this leads. It leads to the assumption of the positive water vapor feedback.

    Of course, I am not surprised. I was taught in entry-level engineering that one uses a heat and mass balance, not weight and temperature. Perhaps they will start paying attention to the critics who pointed this out.

    Nah, those critics are labeled as not being very scientific. 😉

  246. re: 282. John F. Pittman said
    October 16, 2010 at 10:28 am

    The weakness in this, is that for the MWP to be warmer, climate sensitivity should be higher not lower.

    Time for another essay, John? Are you suggesting that there might be a reluctance to move to the higher-sensitivity paradigm (even though paleo + models likely tell them they should) because it excludes the lower-sensitivity (0.15C) scenario? This is the “have your cake and eat it too” scenario.

  247. Re: John F. Pittman (Oct 16 10:28),

    The weakness in this, is that for the MWP to be warmer, climate sensitivity should be higher not lower. And we know where this leads. It leads to the assumption of the positive water vapor feedback.

    Not necessarily. If the climate shows long term persistence (a high Hurst coefficient), then it can wander quite far from the long term trend at any given time. Analyses of temperature records by Koutsoyiannis show Hurst coefficients on the order of 0.95. A warm MWP does not have to mean a high sensitivity. The same goes for glacial/interglacial transitions. Another possibility is that there’s a forcing that’s not being taken into account.

  248. Re: DeWitt Payne (Oct 16 12:11), Yes, I agree. It is a CA thread. Though I can understand those who look at the phenomena and say “Don’t know means don’t know” for such phenomena. But then they have to account for a forcing that they cannot account for, by their own claims.

    No, LL. They don’t want to go to higher sensitivity with negative water vapor, because the result will be fat-tailed below 1.2C for doubling. Which means for practical purposes, we can probably ignore it. I am considering the essay, but will need to research some items.

  249. Jeff ID, given James Annan’s belief in the statistically indistinguishable paradigm as a way of looking at model results, I can readily see why he would use the range of the model results for a comparison with an observed result. He is saying that the observed result does not (necessarily) lie at the center of the model results, and that the model results are not centered on it either, but rather that it lies within the range and, further, that all the results including the observed are part of the same distribution. He then goes on to use the rank histogram in an attempt to show that all the results are from the same distribution. The fit (he showed) was not good in my estimation, and others have shown that data originating from several distributions can give a good rank histogram fit. What Annan admits in his approach is that, although all the model results could be correct, we then have a very wide distribution of possible results for the observed.

    The problem with all of these exercises in model ensemble statistics is that, at the GCM level of complexity, the expense and time of the runs evidently precludes larger sample sizes and the smaller sample sizes make for less certain estimations of the distribution(s) whence they came.

    My point in my last post was that, on researching the estimation of uncertainty in model results and the comparison of those results with observations, nothing appeared to be clear cut. My further point was that looking at ratios of trend outputs from models (surface versus troposphere, as was performed in the recent Christy paper) using individual runs could well difference out some of the effects that make looking at raw results much more difficult. I would like to look further at the ratios versus the raw results for these model runs and to compare the variances of replicate runs within a model with those between models.

    It remains for me a great leap of faith to believe that model runs should come from any particular statistical distribution, although assigning a reasonable distribution might make it easier to do specific analyses when comparing results with the observed ones. And I think, Jeff, that might be where you are coming from.

  250. Kenneth,

    The point that the trend plots described by Annan and Carrick have made convincingly is that the models are all over the place. They have such a huge spread that it is impossible to say the models are completely outside of reality. However, knowing that, the model mean isn’t pointless. I’m convinced by this effort that the models are biased high. When you are two to four times the observed trend on average, many of the models have problems.

    As far as the distribution goes, I wouldn’t have been surprised to see a very one-sided non-Gaussian result, but I’m cynical about modeled climate science.

  251. Does anybody know if the model outputs that MMH10 studied took this into account?

    The figures from the IPCC report show the models doing a good job over the 20thC. But what’s not made clear is that each model has had its bias subtracted out before this was plotted, so you’re looking at anomalies relative to the model’s own climatology. In fact, there is an enormous spread of the models against reality.
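
    A small illustration of the point in that quote, with made-up baselines and trends: models with very different absolute climatologies look tightly clustered once each is plotted as an anomaly relative to its own baseline.

    import numpy as np

    rng = np.random.default_rng(6)
    years = np.arange(1900, 2000)
    n_models = 10
    baselines = rng.normal(14.0, 1.5, n_models)        # very different absolute climates, C
    trends = rng.normal(0.01, 0.003, n_models)         # C per year
    absolute = baselines[:, None] + trends[:, None] * (years - years[0])

    print(f"absolute spread in 1999: {np.ptp(absolute[:, -1]):.2f} C")   # several degrees

    # subtract each model's own 1900-1929 mean (its "climatology")
    anomalies = absolute - absolute[:, :30].mean(axis=1, keepdims=True)
    print(f"anomaly spread in 1999:  {np.ptp(anomalies[:, -1]):.2f} C")  # much smaller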
