the Air Vent

Because the world needs another opinion

OLMC 10, What does it mean…

Posted by Jeff Id on December 10, 2010


Nic and Ryan both emailed their views to Andy Revkin and gave permission to post their emails here.  Rather than post in their entirety, I’ll find the good parts so you don’t miss your hockey or football games.  Nic’s letter first, my bold throughout.

Please note that if you quote any comparisons of our and Steig et al’s 2009 Nature paper (S09) regional temperature trends, to be comparable the S09 figures should be those quoted in our paper, which have been recomputed using the same natural geographical boundaries used in our study.  The regional boundaries used in the original S09 study were slightly different, so the regional temperature trends stated therein are, without recomputation, not comparable to those per our study.  (NB Eric Steig may prefer his own definition of the boundary between West Antarctica and the Antarctica Peninsula, but the boundary we use is supported by, for instance, Wikipedia’s page on the Antarctic Peninsula.)

Judging from a recent post at RealClimate, there appears to be an attempt to gloss over the differences between the results of our study (per the main RLS reconstruction) and those of S09 (per its main reconstruction), some of which are pretty fundamental. For instance:

- we show no statistically significant warming for the continent as a whole over 1957-2006 (our finding is 0.06±0.08 degrees C/decade, using a standard 95% confidence interval; I state all subsequent trends on this basis), whereas S09 showed statistically significant warming of 0.12±0.08.  S09’s central estimate of the continental trend is double ours, and the difference between the central trend estimates is statistically significant (0.06±0.05).

- S09 showed that warming in West Antarctica was considerably greater than in the Peninsula; we show the contrary.

- S09 shows fast warming in West Antarctica, with a central estimate over twice its lower 95% confidence limit (0.20±0.09, using our geographical definitions). Our central estimate is half S09’s and is only marginally statistically significant (0.10±0.09). Again the difference in the two central trend estimates is statistically significant (0.09±0.06).

You can see that some have been missing the point from the beginning.  I warned that the reconstructions are statistically significantly different, yet some can’t stop parroting the line that these results confirm S09.  It’s half the trend over the majority of the area, folks, not very darned close.  I wonder how many Joules that represents.

Nic writes further, a point which I had not considered.

As I see it, one of the most important points to come out of all this is that Nature’s peer review process completely failed to prevent a mathematically badly flawed paper being published. And it took a bunch of amateur researchers to publish a paper that brought these flaws to light and correct them – no mathematically competent professional climate scientist did so, perhaps because of fear it would do their careers no good.  One has to wonder how many other papers with incorrect results have been published by authors who go along with ‘consensus’ views, and have never been corrected.  Papers such as ours that question the status quo, on the other hand, are subject to a stringent peer review process, in our case involving one reviewer who from some of his extensive critical comments (almost all of which were invalid) clearly had a personal interest in avoiding a paper contradicting S09 being published!  IMHO, in the interests of science the peer review process needs to be made more transparent and even handed.

There were other points but diluting Nic’s commentary is not in the cards today.

Ryan’s email was a little different, but he also made similar points.  First, since this was an email to Andy, a little commentary for the two bloggers on the paper.

Unfortunately, some of my coauthors have been portrayed as trying to “dig” for evidence of “cooling”, as if by way of showing a cooling Antarctic, the demon of anthropogenic global warming would be banished.  I believe this characterization is unfair, as Steve McIntyre publicly posted on Climate Audit early in this process that he felt it should not be surprising that the Antarctic was warming along with the rest of the world, and his goal was to determine if the method used by S09 was appropriate.  Nicholas Lewis, Jeff Condon, and I have all publicly stated that we believe anthropogenic activities contribute to warming, though we may disagree with the consensus on the magnitude of that contribution.  Finally, the “digging” we had to do was not in finding something wrong with the S09 method (it was rather easy to verify the S09 method improperly spreads Peninsula warming throughout the continent) but in designing a method that avoided many of the deficiencies of the S09 method.

Then, in his matter-of-fact style, Ryan followed up with this:

In my opinion, the important and significant differences between our paper and S09 include:

1.  Improvements to the method, which include demonstrating that certain steps performed by S09 were not mathematically valid (regardless of whether they “worked” in terms of results)
2.  Demonstrating that the S09 method does, indeed, cause the Peninsula warming to be geographically relocated to the rest of the continent
3.  Demonstrating that the strong warming throughout West Antarctica shown in S09 – which was the primary claim in that paper – is an artifact, and the only statistically significant warming in West Antarctica is occurring in Ellsworth Land and the northern portion of Marie Byrd Land immediately adjacent
4.  Demonstrating that the seasonal patterns of change in S09 (which are important for distinguishing between possible physical mechanisms for the changes in Antarctic climate) were strongly influenced by the Peninsula contamination, particularly in West Antarctica and the half of East Antarctica from the South Pole to the Weddell Sea

See, the seasonal trends of the peninsula represent certain physical warming processes, and these trends appeared in S09 all across the West Antarctic.  This was simply evidence of peninsula station information spreading across the continent.  Ryan follows the above with this:

While both studies show statistically significant warming in Ellsworth Land (which is what RealClimate seems to be focused on right now, as a way of saying our work “confirms” S09), evidence that Ellsworth Land was warming rather significantly was already present in the literature (e.g., Shuman and Stearns, 2001; Kwok and Comiso, 2002; King and Comiso, 2003; Chapman and Walsh, 2007).  Even a paper entitled “Antarctic Climate Cooling” (Nature, 2002, Doran et al.) shows warming in Ellsworth Land.  The novelties of S09 were statistically significant warming throughout the rest of West Antarctica, a statistically significant continental average, and a seasonal pattern of change that differed from previous gridded reconstructions (Chapman and Walsh, 2007 and Monaghan et al., 2008).  Our paper demonstrates that all of these novel results in S09 are artifacts.

Certainly other portions of S09 are confirmed by our paper (such as overall positive trends).  However, we note that earlier studies also showed the same things, so these were not newly introduced with S09.  The results that were newly introduced with S09, on the other hand, are all shown to be artifacts.

There you go.  No matter how you spin that kind of confirmation of result, it is hard to separate from the fact that the S09 method simply smeared the peninsula information across the continent.  It was demonstrated in the overly high trend in the east, the overly low trend in the peninsula and, more obviously, in the seasonal trend distortions which caused the continent in S09 to match the peninsula.

All the authors have their own opinions; mine is that this is more than a simple improvement.

What is still missing from this discussion, though, is the description of the multiple novel methods Ryan came up with for solving the problem and the one finally settled on.  There were definite improvements in the algorithms; Nic also contributed to these, and I hope that we will hear more on that after the paper is published.  Ryan’s code is very clean and, despite the fact that R isn’t my favorite language, easy to follow.  Can’t wait for that.

92 Responses to “OLMC 10, What does it mean…”

  1. [...] This post was mentioned on Twitter by Science Blog News, C Jenkins. C Jenkins said: OLMC 10, What does it mean… http://goo.gl/fb/ljhFY [...]

  2. benpal said

    Thanks, Jeff, for this summary. As a layman, studying the actual paper would be out of my range.

  3. Chev said

    Came across this issue here: http://www.c3headlines.com/2010/12/peer-reveiwed-analysis-by-amateurs-corrects-bogus-antarctic-temperature-study-by-the-experts.html

    It’s fantastic to learn that amateurs are besting the experts.

  4. We truly live in post-normal times when reporting for a piece on a climate paper gets published before the piece is finished, but so be it… ; ) If someone would turn off the firehose of news, might actually get this published on Dot Earth..

  5. Eric Anderson said

    To call the new paper an improvement is not simply being nice, it is flat out incorrect and muddies the discussion. In every meaningful point made by S09, OLMC10 is a refutation, and should be stated as such. I think Steve was also pretty clear on this point in his post.

    The math was simply wrong in S09; you don’t call the correct math an “improvement.” If the math was wrong, it was wrong. If I screwed up a math problem in class, I can’t imagine one of my math teachers giving me a soft pat on the head and kindly saying, so as to not hurt any feelings, that the correct math was simply an “improvement” to my bogus efforts.

    S09 was bogus. It stands refuted, plain and simple.

  6. Jeff,
    Re your claim that the results are “statistically significantly different”. I note that for the continental trend, S09 has 0.12±0.08, while OLMC10 has 0.06±0.08. Their central estimate is within your range, and yours is within theirs, so I was surprised at that claim, and the narrow limits on the difference 0.06±0.05 quoted by Nic. I would have thought it is hard for a difference to have a smaller range than the things that are differenced.

  7. Jeff Id said

    #6 If you use the same data for two different methods, the residuals have less variance. In an extreme example, say you found a unique way to calculate global temps and you wanted to compare it to the current methods. It isn’t hard to imagine that, with so much data and both methods being reasonable, the residual trends would result in a much tighter CI. What might be surprising in this case is that the short term variance (using the same data) is different enough to give the +/-0.05 range.
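
    To illustrate (a minimal R sketch with made-up numbers, not the actual reconstructions): when two estimates are built from the same data, most of their error is shared, so the spread of their difference is much smaller than either individual spread.

      # Toy trend estimates from two methods applied to the same data.
      set.seed(3)
      shared <- rnorm(10000)                          # error common to both methods
      A <- 0.12 + 0.9 * shared + 0.1 * rnorm(10000)   # method A's trend estimates
      B <- 0.06 + 0.9 * shared + 0.1 * rnorm(10000)   # method B's trend estimates
      sd(A); sd(B)                                    # individual spreads are ~0.9
      sd(A - B)                                       # spread of the difference is ~0.14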

  8. Jeff Id said

    Andy,

    The paper was supposed to be out several days ago but was delayed by some technical issues. This wouldn’t be the first time a paper was discussed before final release, but I do need to point out that none of the methodology or the results have been discussed in any detail.

    What is going on here is that this little blog is far different from your well read publications. What seems like two days to you, seems a lot like two years of work with ten months of review to us. We are a bunch of technical non-climatologists who like math combined with a few lurking climatologists who generally fear the “lunatics” on the internet.

    When the JOC paper is released, quite a few of the lurkers will download and read it. Hopefully, some of the media will pick up on it because it is quite an important issue in the public’s mind when considering the future of the Antarctic. When S09 came out, the media storm was tremendous. I fully expect this improved work to be ignored by much of the media for various reasons, but I bet the climate community will actually like it.

    What I am very concerned with is the spin that the paper isn’t really different. Then the climate community doesn’t take the time to read it, and the effort becomes less important. Hopefully, that isn’t the result.

    Anyway, when you do write something, I will place a link to your post here and you will get a 3% boost in readership from a bunch of climate bloggers.

  9. Steve McIntyre said

    #3. The characterization of this particular dispute as between “amateurs” and “experts” misses an important point. Steig is not an “expert” in statistics; his background is in isotope geochemistry. The original paper does not appear to have been reviewed by competent statisticians.

  10. Jeff Id said

    My recent comment, which was clipped at RC, made the point that S09 wouldn’t have passed the review process that we were put through. Our review forced us away from the retained PC methods as a defensive position against one reviewer’s disagreements. In fairness, Steig said he didn’t clip the comment, so something may have gone wrong.

  11. RomanM said

    #6 Nick:

    would have thought it is hard for a difference to have a smaller range than the things that are differenced.

    This is true for differences in independent samples, but not necessarily when there is (positive) correlation. That’s why paired data usually produces better results than unpaired.

  12. curious said

    9 Agreed – IMO the “amateur” ref. can be just another snarky attempt to adhom in response to valid arguments. I remember watching (online) G North’s presentation which used the “enter the amateurs” tag. Amateur vs. expert/professional is irrelevant – the question is what are the correct techniques, results and interpretations? Put them out for open review and let them stand or fall on their merits.

    10 “Something may have gone wrong” – sorry Jeff, but what’s right about RC? There is a paper under discussion where a contributing author’s comment is blocked and the head post author (who is also a responding author of earlier relevant work) doesn’t even know about it. The site is a joke.

  13. Kenneth Fritsch said

    ..the characterization of this particular dispute as between “amateurs” and “experts” misses an important point.

    I am not sure from Eric Steig’s reaction that the amateur/expert designation is not part of the sting – since it is in the eyes of the beholder(s).

  14. mrpkw said

    Thanks !!!
    These emails really help to explain the real differences of the two studies !!!

    Now Mongo understand.

  15. Kenneth Fritsch said

    Nic and Ryan have provided talking and discussion points pertinent to the differences in findings between S09 and RO10. Let us see how often these points are part of the wider discussions of these two papers.

  16. Bad Andrew said

    “Nicholas Lewis, Jeff Condon, and I have all publicly stated that we believe anthropogenic activities contribute to warming”

    What does this religious belief have to do with an objective analysis of Antarctica? Might not this belief have an effect on the analysis? I would be embarrassed to associate this statement with any serious work I had done. Seriously, guys. Why?

    Andrew

  17. Jeff Id said

    Cause CO2 does capture heat and there have been some goofy statements that we don’t believe that.

    Ya gotta say your prayers before you publish. :D

  18. Hoi Polloi said

    We truly live in post-normal times when reporting for a piece on a climate paper gets published before the piece is finished, but so be it… ; )

    Be careful using excuses like “post-normal times” when “post-normal science” has been widely accepted by the AGW community, including you, Mr. Revkin…

  19. Bad Andrew said

    JeffId,

    But your paper has nothing to do with the properties of CO2, does it? Is this just some kind of political game that you feel you need to play?

    Andrew

  20. Jeff Id said

    Sure, if the press is writing an article on climate science, can you imagine if they accidentally called us a bunch of deniers or something like that? How crappy would that be?

  21. Bad Andrew said

    Jeff,

    I just think it’s unfortunate that you guys can’t just present a paper, and what matters is what is in the paper. Beliefs that have nothing to do with the paper shouldn’t be relevant. You should be able to believe that Elvis Helped Shovel My Driveway yesterday, and have your work speak for itself.

    Andrew

  22. Bad Andrew has a valid point.

    Science has been abused for a noble purpose: To convince the public of a common threat to all (AGW).

    The goal: To reduce nationalism and save planet Earth from destruction by nuclear warfare.

    I endorse the goal: Not the means, that include distorting or hiding experimental data and pledges of belief. Those means may lead us to a tyrannical one world government like that described by George Orwell in the book, “1984”.

    With kind regards,
    Oliver K. Manuel

  23. Jon P said

    Here is what it means over at RC and note that Eric Steig agrees with this comment.

    So the lead paragraphs in an accurate news article on the O’Donnell paper would read something like this:

    NEW STUDY SHOWS SIGNIFICANT WARMING OVER MUCH OF ANTARCTICA

    A new paper published in the highly regarded Journal of Climate shows statistically significant warming in over 70% of West Antarctica, consistent with a previous study. However, the new study shows much higher warming throughout the Antarctic Peninsula than the previous work showed. West Antarctica has some of the most threatened ice sheets, glaciers, and ice shelves on the continent.

    The new study also shows significant warming over 30% of the much larger East Antarctica ice sheet, whereas previous studies showed no statistically significant warming. In contrast, less than 5% of East Antarctica showed significant cooling, inconsistent with some forecasts expecting significant cooling due to effects from the ozone hole over the South Pole.

    Is this a reasonable summary of the new results?

    [Response: Probably something like that. We'll have to look at those numbers carefully when we get a chance.--eric]

    Comment by Paul K2 — 11 December 2010 @ 1:19 PM

    http://www.realclimate.org/?comments_popup=5606

  24. curious said

    Thanks Jon P – that has an uncanny correlation with Zinfan’s comments on the “Doing it ourselves” thread! Obviously, “correlation” isn’t “causation” but it looks like he could be on the right track – zinfan* and paul K2 should touch base!! Good to know there is something substantial going on over at RC after all!

  25. It would be helpful for many of us to see a reply to the RC comment Jon P mentions at #23. Is it accurate?

    URL of the comment:

    http://www.realclimate.org/?comments_popup=5606#comment-194402

  26. Jeff Id said

    Fabius,

    Our West Antarctic trend is about 1/2 of Steig’s, and we all think there still could be bleeding in of higher peninsula trends in that region. Still, even with some minor smearing in our result, the two results are statistically significantly different in the West. If halving the trend and getting a statistically different result is confirmation in climate science, then the S09 result is confirmed.

    You could also add – “NEW PAPER CONFIRMS ANTARCTIC CONTINENTAL WARMING NOT STATISTICALLY SIGNIFICANT”

    I wouldn’t say that though, even though it is true, there are still areas of warming. Primarily the peninsula with the potential for mild warming in the West Antarctic.

  27. Jeff #7, Roman #11
    The fact that the difference was calculated using the same data then raises the question of what the “statistical significance” of the difference means. Different calculations done on the same numbers seem to produce a deterministic difference. So I’d be interested to see the basis of NicL’s calculation.

    What does the difference then mean – that the difference in the methods is “statistically significant”?

    To take a simplified analogy, suppose A looks at data on heights of recruits, and offers the mean as a central measure. B looks at the same data and extracts rms. Both have a sensible result with a similar variance. B>A – is this “significantly different”?

    You might object that they are just different things, but so are your and Steig’s trends – they are calculated differently, and assigned the same name.

    “Tests” on B-A could yield a variety of results, depending on how far you delve into the methods. It’s 100% significant if you use the Schwarz inequality, even though for practical purposes the difference might be very small.

  28. Jeff Id said

    Nic,

    I can’t wait until the thing publishes. The code is very clean and you will be able to see everything you could want.

  29. Ryan O said

    Re-created image for the RLS reconstruction, 1957 – 2006, with areas of statistically insignificant trends overlaid in gray:

    Also left a comment at RC. In moderation.

  30. Ryan O said

    Nick Stokes,

    This question came up in the review process. Rather than redo everything, I will simply quote from my response:

    There are three primary concerns we have with this comment. In summary, they are:

    1. Comparing whether 95% CIs overlap does not yield a 5% significance level for rejection of the two-sample null hypothesis

    2. Confidence intervals mathematically cannot be added to yield a combined p-value

    3. The comparison the reviewer makes is only valid under the conditions of independent samples and independent errors

    We would like to take some time to explain each in turn.

    1. Comparing whether 95% CIs overlap does not yield a 5% significance level for rejection of the two-sample null hypothesis

    Comparing the difference in location (trend) for two samples is not the same as comparing the difference in location for one sample to a fixed point. In the latter case, the fixed point – the null hypothesis – has no associated uncertainty. In the former case, both samples have uncertainty.

    Since mutual probabilities are multiplicative (i.e., P(event) = P(1) * P(2), where the event is defined as the simultaneous occurrence of 1 and 2), requiring the difference in location between two samples to exceed the sum of their 95% CIs is equivalent to requiring a two-tailed significance level of 0.25%, not 5%.

    2. Confidence intervals mathematically cannot be added to yield a combined p-value

    Confidence intervals for linear regressions may be expressed as:

    CI = c * SE

    where s is the sample standard deviation, n is the number of observations, SE is the standard error of the mean, and c is a scalar multiplier that scales the standard error to a confidence interval. Since confidence intervals are simply scaled standard deviations, they cannot be added. Instead, one must take the square root of the pooled variance. The corresponding hypothesis test is the two-sample pooled-variance t-test (for samples) or z-test (for populations):

    t = (A_mean – B_mean) / sqrt(SE_A^2 + SE_B^2)

    where A_mean and B_mean are the regression coefficients for the series being compared. For identical standard deviations and sample sizes, this yields a pooled standard deviation of sqrt(2) * SE, not 2 * SE. This means the 5% significance level using this test corresponds to the point at which the 95% CIs overlap by approximately 40%.

    3. The comparison the reviewer makes is only valid under the conditions of independent samples and independent errors

    The null hypothesis for the two-sample test discussed above is typically taken to be that the samples were obtained from the same population (with the alternative hypothesis being that they were obtained from different populations). The assumptions for this test are that the two samples are comprised of independent observations and that the errors are likewise independent. The requirement of independent errors is explicit in the formula, which adds the error variances to calculate the pooled standard deviation. Variances only add when the variables are uncorrelated.

    Neither assumption holds in the comparison the reviewer makes. The assumption of independent observations is violated since S09 and RO10 use largely the same data for conducting the analysis. Even were we to assume that the data used by S09 and RO10 was different enough to be considered independent, the errors are clearly not. There is at least one underlying confounding factor that destroys the independence of the errors: time. Only a subset of the population (where the population consists of all possible measurements of near-surface Antarctic temperatures from time zero to the present) is available for observation at any given time, regardless of the source of the observation. Because the possible observations are limited to a subset of the population and S09 and RO10 draw the samples out of the same subset, the errors in both are necessarily dependent on the time the observations were made. The errors are not independent, and the pooled variance cannot be accurately calculated by adding the error variances.

    If the samples are known not to be independent and/or confounding factors are suspected, the proper test for significance is a one-sample t-test on the residuals (or, equivalently, the paired t-test). When this test is performed, only 4 (RLS) and 3 (E-W) of the 20 regional comparisons (4 regions, once with all seasons and once with each of the 4 seasons) fail to show significance at the 5% level.

    Along with the three items above, from a Bayesian point of view, the value of this test is rather limited. If the samples are identical, unless the mathematical treatments – and, hence, subsequent results – are exactly equivalent (and in this case they are not), the posterior probability of a real difference in results is precisely 1.0. The situation is analogous to using a hypothesis test to answer the question of whether using n – 1 or n degrees of freedom to calculate sample variance yields different results. It is an absolute certainty that a real difference exists, regardless of the outcome of the hypothesis test or whether the difference “matters”. Since the probability is already known prior to the test being conducted, one might question whether the test adds confusion rather than value.

    It is important to remember that the question of “where is A located?” and “what is the difference in location between A and B?” are different questions that can sometimes be answered with very different precision. In practice, one is rarely able to use the former to accurately estimate the latter. The former – “where is A located” – uses the sample variance to calculate uncertainty. The latter – “what is the difference in location between A and B” – uses the residual variance between A and B to calculate the uncertainty. When the samples are the same (or nearly so), or a confounding factor can be identified, the latter question can be answered with much higher precision than the former.

    In the event that one wishes to estimate the magnitude of the difference and associated uncertainty, knowing only that there is a difference is not very informative. In this case, the t-test on the residuals will yield the desired information. We agree that this information can be useful (though potentially subject to misinterpretation), and have provided both regional summaries and spatial maps that indicate whether the estimate of the difference is significant at the 5% level.

    We caution that one should evaluate these results in the context that the posterior probability of a real difference in results is 1.0, regardless of the calculated significance level of the hypothesis test. The important information is the residual variance, not the p-value itself.

    With respect to the original comment – that the West Antarctic trends between S09 and RO10 are not statistically different – when the correct test is used, they are, indeed, statistically different. The residual trend is 0.09 +/- 0.05, which, if one is curious, yields a p-value of about 5*10^-5.
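
    For readers who want to see the shape of that residual-trend test, here is a hedged R sketch (my own illustration, not the paper’s code): difference two regional series, fit a trend to the difference, and widen the trend’s confidence interval with a lag-1 autocorrelation (effective sample size) adjustment. The function name, the input series names, and the AR(1) correction are choices made here for illustration.

      # Sketch of a paired (residual-trend) comparison with an AR(1) adjustment.
      # recA and recB are assumed to be two reconstructions' monthly series
      # for the same region and period (hypothetical inputs).
      paired_trend_test <- function(recA, recB) {
        d    <- recA - recB                                # residual (difference) series
        tt   <- seq_along(d)
        fit  <- lm(d ~ tt)                                 # trend of the difference
        r1   <- acf(resid(fit), plot = FALSE)$acf[2]       # lag-1 autocorrelation of residuals
        neff <- length(d) * (1 - r1) / (1 + r1)            # effective sample size
        se   <- summary(fit)$coefficients["tt", "Std. Error"] *
                sqrt((length(d) - 2) / max(neff - 2, 2))   # inflate SE for serial correlation
        c(trend = unname(coef(fit)["tt"]),
          halfwidth95 = qt(0.975, max(neff - 2, 2)) * se)
      }
      # e.g. paired_trend_test(s09_region, ro10_region)    # hypothetical series names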

  31. Jeff Id said

    Only 86 pages to go.

  32. nvw said

    Your financial adviser named Bernie Steig promises you a return on your investments of 10% each year. You pay him $100,000 and expect to be able to live off the $10,000 earnings generated each year.

    In actuality the mathematical model Bernie used only generates half of what he promised and you are forced to live off $5,000 each year by careful consumption of cans of dog food.

    Bernie claims because his model still generated a positive investment return to you, he has proven his model is successful.

    Do you give more money to Bernie to invest?

  33. putra said

    good job for you!

  34. Jon P said

    Ryan,

    Thank you for the follow-up and I see your post made it onto RC.

  35. Ryan O said

    Yep! I also put some additional data online and let Eric know where he can get it (comment in moderation – but I’m sure it will show soon).

  36. Re: Ryan O (Dec 11 20:58),
    “We caution that one should evaluate these results in the context that the posterior probability of a real difference in results is 1.0, regardless of the calculated significance level of the hypothesis test. The important information is the residual variance, not the p-value itself.”

    That sounds pretty much like what I’m saying. So my question persists – what does the error range, and “statistical significance”, actually mean here? And why is the variance important?

    I note in point 1 you’re saying that it would not be right to say that the results were significant only if the separation exceeded the sum of the individual ranges. That’s pretty much point 2, as well. But for the continent, it’s actually less than each individual range. And for WA, it barely exceeds the individual ranges and is much less than the sum.

    I’m still trying to work out how the difference uncertainty ranges were calculated, and as Jeff says, I might just have to wait. But your point 3 argument suggests to me that, while most of the data used was the same, there was some different selection, and the range is based on that difference. If so, it seems artificial.

  37. bigcitylib said

    Jeff wrote: “The paper was supposed to be out several days ago now and had problems due to some technical issues.”

    Did you guys screw something up?

    REPLY: Not that I’m aware of.

  38. RomanM said

    #27,36 Nick.

    What does the difference then mean – that the difference in the methods is “statistically significant”?

    There is no conceptual problem here. It means exactly the same as the result for any other statistical test.

    Suppose we have a population with a parameter T. We take a sample from that population. You and I have two different methods which we can use to calculate estimates of T: Tr and Tn. These methods are pretty complex and either one or both of these statistics could in fact be biased.

    Now, we would like to test whether our two methods are different with respect to the bias of their estimation of T. The (unknown) expected values (means, for non-statisticians) of these statistics are Mr and Mn. Does it not make sense to you that we can test the null hypothesis, H0: Mr = Mn (or Mr – Mn = 0), using the test statistic D = Tr – Tn?

    The same sample can be used to perform the test. In that case, what is needed is an estimate of the standard error of the test statistic, D, to produce a p-value which can be interpreted exactly as it is for any other test.

    The proper question to focus on is whether the details of the Ryan et al approach are correctly done.

  39. Ryan O said

    Nick,

    I’m not sure I understand the confusion. The test is a paired t-test.

  40. Ryan O said

    Nick,

    Let me explain in a different way. All that is done is to take S09’s reconstruction, subtract ours, and measure the leftover trend (corrected for autocorrelation of the leftovers) in each gridcell for the gridcell comparisons, or in the regional averages for the regional summaries. There’s no artificial correction or anything like that. Simply a paired t-test.

    Now, if you are wondering why a paired t-test would yield significances that are very different from the pooled variance test (which is what looking at the overlap of the CIs is), try this simple test:

    1. Generate a white noise series of length 10 or so.

    2. Take that white noise series and divide by the number of points in the series. Call this “A”.

    3. Take that same white noise series and divide by n-1. Call this “B”.

    4. Use the pooled-variance t-test to determine if there is a statistically significant difference in means between A and B (there won’t be).

    5. Use a paired t-test to determine if there is a statistically significant difference in means (there will be).

    This demonstrates that if the observations (your white noise series) are not independent in “A” and “B”, the pooled variance test can have very little statistical power. The paired test, on the other hand, will properly detect a real difference in means even with as few as 2 or 3 observations.
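
    (A minimal R sketch of steps 1-5 above, if anyone wants to try it. One added assumption on my part: the series is given a nonzero mean, so the 1/n versus 1/(n-1) scaling produces a nonzero mean difference for the tests to detect.)

      set.seed(1)
      x <- rnorm(10, mean = 5)                 # step 1: short noisy series (nonzero level assumed)
      n <- length(x)
      A <- x / n                               # step 2: divide by n
      B <- x / (n - 1)                         # step 3: divide by n - 1
      t.test(A, B, var.equal = TRUE)$p.value   # step 4: pooled-variance test, not significant
      t.test(A, B, paired = TRUE)$p.value      # step 5: paired test, highly significant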

    The importance of the leftover variance is related to a different point, which is that S09’s method of regressing the PCs against the ground station data results in more variance loss than the RLS method of directly regressing the AVHRR spatial structure . . . which is highlighted by the difference in residual variance when the t-test is performed on S09 vs. the E-W reconstructions (which regress the PCs) and vs. the RLS reconstructions (which do not).

  41. Re: Ryan O (Dec 12 10:50),

    Ryan,
    My issue comes back to your caution which I highlighted. It says (correctly, I think), that people shouldn’t take much notice of p-values here. But statements about stat significance are just statements about p-values.

    I think your noise example is to the same effect as my recruits argument. Yes, a paired t-test will have greater power. But if, say, the samples were positive, there is a simple math argument available that says that B is certainly greater than A, p=1. Whatever the data. That’s a powerful test, but what does “significant” then mean?

    And RomanM, that’s the conceptual problem. Suppose the difference of method just led to a constant predictable offset, Tr – Tn = Mr – Mn = 0.0001. Then the test is whether 0.0001 is “significantly” different from 0.

  42. steven mosher said

    BCL,

    It’s in the green book. You’re familiar with that, correct?

  43. Ryan O said

    Nick,

    Ah, okay. I understand what you are saying. You hit the nail on the head . . . what does it mean?

    The posterior probability that there is a real difference in results is 1.0 regardless of the p-value, so the utility of the calculation is, indeed, limited. This is why, on the first submission, we did not do any tests like these; we saw only opportunity for confusion. However, 2 of the reviewers insisted that this calculation be done, so we did it.

    The test only tells you that there is a difference – but we already knew that. It makes no statement on whether the difference is physically meaningful. The latter question is in the eye of the beholder.

  44. RomanM said

    #41: Nick:

    Suppose the difference of method just led to a constant predictable offset, Tr – Tn = Mr – Mn = 0.0001.

    So? What’s the problem?

    Why are you (and some other people) hung up on the word “significant”? Do you stop at that point and say, “Well, that’s it! They’re different. Now, we move on.”? A statistical test is there to help you decide whether an observed difference might be really there or could just be a result due purely to random variation in the system. If I can really see that a small difference is real, that’s great – that is NOT a problem.

    However, no self-respecting statistician would stop at that point. They would continue by looking at the size of the difference (perhaps through a confidence interval) to evaluate whether that difference is also meaningfully large (that’s the other meaning of significant) in the context of what they are studying.

  45. Re: RomanM (Dec 12 17:59)

    Why are you (and some other people) hung up on the word “significant”?

    Stat significance is the topic of this thread:
    “I warned that the reconstructions are statistically significantly different “

  46. Jeff Id said

    Nick, Big City Liberal was stating that the results aren’t different because they fall within each other’s uncertainty range. Statistically, they are very much differentiable results. It is up to readers to determine if a 2X difference is enough to be a difference they are concerned about. In climate science, some models are running 4X trends and yet people still defend them, so perhaps those of us in engineering-type fields (like yourself) just need to widen our field of view ;)

  47. Ryan O said

    Also, Nick, in terms of posterior probabilities, all of the differences are statistically significant. Every last one of them.

    Statistics only answers whether the apparent difference could be due to a random effect. In this case, the differences between S09 and ours are not due to random effects. Statistics will not tell you if the differences are physically meaningful.

    BCL is conflating physically meaningful with statistically significant.

  48. curious said

    47 “BCL is conflating physically meaningful with statistically significant.”

    yep, but probably doesn’t know any better.

  49. Re: Ryan O (Dec 12 19:11),
    Ryan,
    I agree with what you’re saying, though I haven’t been able to find the BCL post mentioned. But in NicL’s email to Andy Revkin, I see:
    the difference between the central trend estimates is statistically significant (0.06±0.05).

    Again the difference in the two central trend estimates is statistically significant (0.09±0.06).

    Jeff bolded these, so I presume they are, well, significant.

  50. Ryan O said

    Nick,

    They are. All the differences are significant, so what Nic and Jeff wrote is most certainly true. However, the reviewers and journal wanted us to put ranges on those, so that is what we did. The range corresponds to the uncertainty in locating the pairwise difference between A and B using the residual variance and ~560 DoF (after correction for serial correlation of the residuals). That’s it.

    We could have truthfully put “the difference between the central trend estimates is statistically significant (0.06 +/- 0.00)”, but people have a hard time accepting this.

    Another thing we could have done (and I mentioned this in the review process) is to repeat the S09 and our analysis thousands of times, each time making a different modification to the ground station observations. This analysis would show that the differences are systematic, not random (as the regression coefficients would, in both cases, move in similar directions with each modification), and would further reduce the range of the calculated uncertainty. We could do this enough times until the uncertainty range was reduced to effectively zero.

    The point I wanted to make (even though I conceded and calculated the paired differences) was that the value of these tests is quite limited if the posterior probabilities are a priori known. In general, I warned that it would add confusion. However, we did agree to comply with what the reviewers wanted.

    It would seem that you are the first victim of the warned-about confusion! :)
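
    To make the perturbation idea concrete, here is a hedged R sketch with toy stand-ins for the two methods (the real exercise would recompute both full reconstructions for every perturbation): the difference between the methods stays tightly clustered around a nonzero value, i.e. it is systematic rather than random.

      set.seed(5)
      tt    <- 1:600                                     # 50 years of monthly steps
      base  <- 0.001 * tt + rnorm(600)                   # toy "station" record with a weak trend
      trend <- function(y) unname(coef(lm(y ~ tt))[2])
      methodA <- function(y) trend(y)                    # stand-in for one method
      methodB <- function(y) 0.5 * trend(y)              # stand-in for a method recovering half the trend
      diffs <- replicate(1000, {
        yp <- base + rnorm(600, sd = 0.2)                # perturb the observations
        methodA(yp) - methodB(yp)
      })
      mean(diffs); sd(diffs)                             # tight spread around a nonzero value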

  51. Jeff Id said

    Nick,

    “Jeff bolded these, so I presume they are, well, significant.”

    Your odd side is coming out again. I wonder what your opinion is.

  52. Jeff Id said

    “It would seem that you are the first victim of the warned-about confusion!”

    Naw, I don’t believe it. Nick always gets these things, but he likes to play.

  53. Re: Jeff Id (Dec 12 20:06)

    Well, I’m pretty much agreeing with Ryan down the line. I quite understand about coping with reviewers who get good ideas – it does muddy things.

    But I don’t agree with the assertions that O10 refutes S09 or whatever, and that’s why I resist the use of statistical significance here, which isn’t the right concept. I agree with Steig etc. who say that O10 is an improvement. That’s a good outcome. Someone else will improve on yours – that’s the way science goes.

  54. Mark T said

    Of course you do, Nick. It would be impossible for you to see it any other way.
    Mark

  55. Howard said

    Ryan O:

    Do you have a link for a similar map, like the one you posted above, but for the S09 calculations?

    That might help illustrate the significant difference in the two papers.

  56. RomanM said

    Nick’s confusion stems from his statement in comment 27:

    The fact that the difference was calculated using the same data raises then the question of what the “statistical significance” of the difference means. Different calculations done on the same numbers seems to produce a deterministic difference.

    Once a set of data has been collected, unless we are using Monte Carlo methods, ALL calculations on that same data become “deterministic”. Calculate the mean of one of the variables in the set. Calculate it again. The difference of the two results will always be zero. However, we can still perform statistical tests and find confidence intervals for the mean of the population. That is because we can simultaneously calculate estimates of the variability of the sample mean to evaluate how it may relate to the population mean.

    Now suppose we want to examine the size of the difference between the mean and median of a given population. I decide to take a single sample of data and calculate both of them from the same numbers. No matter how many times I do the calculation using this sample, that difference will always give the same result. Are you now saying that if I know the standard error of that difference (or can properly estimate it from the data), I cannot apply and interpret appropriate statistical methods to the situation? If that is the case, then you are rejecting all of the methods of statistical inference.
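
    As a concrete version of Roman’s mean-versus-median example, here is a minimal R sketch (made-up data): bootstrap the single sample to estimate the standard error of the mean-minus-median difference, then test it against zero. For a normal population the mean and median estimate the same value, so this will usually, and correctly, fail to reject.

      set.seed(4)
      x <- rnorm(200, mean = 10, sd = 2)       # one sample from the population
      obs_diff <- mean(x) - median(x)          # a single, "deterministic" difference
      boot_diff <- replicate(5000, {           # resample the same data
        xb <- sample(x, replace = TRUE)
        mean(xb) - median(xb)
      })
      se <- sd(boot_diff)                      # bootstrap standard error of the difference
      2 * pnorm(-abs(obs_diff / se))           # two-sided p-value against H0: no difference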

  57. J Bowers said

    Personally, I’d be interested in seeing a collaboration between Steig and O’Donnell.

    There’s as much chance of my winning the Euromillions lottery, but it’d be nice ;)

  58. stan said

    I’d like a little clarification here, please. From my point of view, it appears that Nick thinks that the purpose of O10 was to make a statement about the science of temperatures in the Antarctic. I was under the impression that the “amateurs” aren’t trying to make any kind of definitive statement about the science. They know the data quality is so crappy that anyone who attempts to make a definitive statement is setting himself up to play the fool. I thought their purpose was simply and only to show that Steig screwed up the math/stats.

    Boiled down to its essence, O10 says — If (big if) a climate scientist decides to try to use certain unusual stats techniques to investigate temperature trends on this continent, the method used in S09 is NOT the proper way to use those stat techniques. That’s it. I didn’t understand that they were trying to make a definitive statement about temperatures. Given how sparse the coverage and how many quality control issues were involved, they know better.

    If this understanding is wrong, someone please set me straight. If my understanding is correct, it would appear that some of the problem with the “team” is that people are talking past each other. They think that the study is trying to say something definitive about the science instead of being limited to pointing out that Steig screwed up his stats/math.

  59. Bad Andrew said

    Exactly stan,

    It’s almost like people who believe in Global Warming are so ate up with it that they can’t consider objectively a paper whose contents don’t have anything to do with it. Even the guys that wrote the paper felt it necessary to comment genuflections to Global Warming for some reason.

    O Science, where art thou?

    Andrew

  60. Jeff Id said

    “Even the guys that wrote the paper felt it necessary to comment genuflections to Global Warming for some reason.”

    Just for the purposes of the totally absent press. I’m sorry if you don’t believe that CO2 captures heat, but it does. It’s like stating that ‘yes, we believe in physics’. There could otherwise be an effort to paint us into a corner as non-believers. It’s not like it would be unique for me to be described that way. When climategate broke last year, many of the articles which discussed tAV described me, and by association you guys, as deniers of global warming. It would be a shame if all this hard work received that label again.

  61. Ryan O said

    Roman,

    Not quite.

    The null hypothesis in the case Nick and I are discussing is that there is no difference in the linear trends between S09 and ours. I personally think this question is meaningless, but this is what we were asked to provide by two of the reviewers.

    The argument I had against providing this information is that the probability that there is a difference is already known. Both studies rely on the same data, and the mathematical treatments are different. If you take any number and perform one operation on it for Case A and a different operation for Case B (and the two operations are not equivalent), the posterior probability that there is a real difference in results is 1.0.

    Now, given that there are slight differences in the data used and that RegEM is affected unpredictably by noise on the data, there are some effects of the calculations that might be considered “random” in that it is difficult or impossible to predict how a change to the observations will exactly influence the results. So perhaps there is some merit to the question from that perspective, in that one could estimate the magnitude of these “random” effects. That doesn’t change the fact that the posterior probability that there is a real difference in results is 1.0.

    This, of course, makes no statement on the population of the underlying observations. We also estimate statistics and standard errors for the population based on using our results as the sample, in which case the null is whether a given statistic is consistent with zero. Asking this question makes sense, and these were the types of calculations we provided in the initial draft.

    However, what the reviewers wanted, essentially, was an answer to the class of questions that would also include, “is there a statistically significant difference between using n or n-1 degrees of freedom for estimating the variance of a sample?” when the set of observations used were the same for both methods. This, in my opinion, is a senseless question.

  62. Bad Andrew said

    JeffId,

    I understand that you are trying to work in the current political climate. ‘Denier’ is just another word for ‘political opponent’. I wouldn’t project any science onto it. Do you think the people who sling ‘denier’ around care about understanding your work? Let them name call. It’s meaningless.

    Andrew

  63. J Bowers said

    “It would be a shame if all this hard work received that label again.”

    I don’t think it will, especially once the dust settles. Your paper has obviously raised a number of interesting scientific questions, with many on both sides of the debate’s “fence” wanting to know more, which in itself is quite the result.

  64. Kenneth Fritsch said

    The discussion at this thread is giving good background on what and how the reviewers and defenders here of the consensus think on these issues.

    In order to say there are no differences between S09 and RO10, one really has to either be “rooting” for the home team or merely be looking at small differences as small differences, whether they are statistically significant or not, and, of course, missing the whole point about differences in methods. The Antarctic continental trend from 1957-2006, even from S09’s methods, was “small”, but of course the initial consensus reaction was that there was statistically significant warming of the Antarctic continent.

    The importance that reviewers put on statistical significance is rather obvious from these comments above, and, if I can read between the lines, the publication of RO10 appeared to hinge on seeing some statistically significant differences between methods. Ryan O’s comments to me imply that he was putting forth the worth of RO10 as a correction of methods (in S09) and that, with correct methods, the paper was worthy of publishing regardless of whether the application resulted in a statistically significant difference in final results.

    In my view, the authors of Steig 09 had to be very aware of the need for showing something statistically significant for their paper to be published and, in turn, making the splash it did in the MSM. I would be interested to know whether the authors saw initially that, to show significance from sparse Antarctic data, the data base would have to be expanded in time and space in the manner that they eventually did, and whether finding the significant warming simply shut down any other thoughts and evaluations of the methods they had applied. Surely they knew, as indicated in the S09 paper, that their methods were not correctly representing the Peninsula trends. Perhaps they had Gavin Schmidt, or someone with his disposition, to rationalize the use of truncation in the PC method in the name of removing noise over being entirely correct on a regional basis – and do it without making any sensitivity tests to show that to be the proper compromise.

    A side point here: I have heard that the novelty of Steig 09 was in showing that the warming of West Antarctica was as great as or greater than the Peninsula, which overturned previously published evidence. I am assuming that the more novel a paper, the more readily it can get published. I also have in the back of my mind that using the satellite data to obtain spatial relationships with a surface measurement that could be carried back in time from the “calibration” period to the “reconstruction” period was novel. A further novelty would be showing that sufficient data could be accumulated going back sufficiently far in time to show statistical significance for a warming trend in Antarctica. What do the participants here think was considered novel about S09?

    A further question would be the worthiness of S09 without the showing of statistically significant warming of Antarctica, i.e. was showing the method sufficient? Certainly the public relations for AGW probably needed the significant warming.

    I do want to note from the blog discussions that Ryan O was very much more interested in determining the validity of the methods in S09, and whether “improvements” could be found, than he was in showing that the warming trend was or was not significant. From the beginning I was noting that, in any publication, showing statistical significance and CIs would be of utmost importance to the reviewers. If that sounds a little too much like TCO, please let me know.

    I want to reiterate that all these discussions of S09 and RO10 have made for very enjoyable learning experiences for me and given a glimpse behind the reviewing and publishing processes.

  65. RomanM said

    #61, Ryan:

    The null hypothesis in the case Nick and I are discussing is that there is no difference in the linear trends between S09 and ours.

    This is what I am talking about as well. I have no problem with this hypothesis despite the views of some people who promulgate the fallacy that in real life two parameters are always unequal. What this hypothesis says is that, in your case, both of the methods are on the average estimating the same value (whether that is the correct population value or not).

    The argument I had against providing this information is that the probability that there is a difference is already known. Both studies rely on the same data, and the mathematical treatments are different. If you take any number and perform one operation on it for Case A and a different operation for Case B (and the two operations are not equivalent), the posterior probability that there is a real difference in results is 1.0.

    Except in rare cases where your data is discrete (e.g. integer-valued), there will always be a difference. The mathematical treatments are different because you are using a different estimation procedure on the same data. So what? Look at my example above. Suppose I sample a normal population (where the mean and the median are theoretically equal). How often will the mean and median of any sample be equal… even to 3 or 4 decimal places?

    As long as you are not using the characteristics of the sample calculated difference to form your hypotheses (e.g. method B had a greater trend than method A so that will be my alternative hypothesis), you are on reasonably solid ground.

    Now, given that there are slight differences in the data used and that RegEM is affected unpredictably by noise on the data, there are some effects of the calculations that might be considered “random” in that it is difficult or impossible to predict how a change to the observations will exactly influence the results.

    This, of course, makes no statement on the population of the underlying observations. We also estimate statistics and standard errors for the population based on using our results as the sample, in which case the null is whether a given statistic is consistent with zero. Asking this question makes sense, and these were the types of calculations we provided in the initial draft.

    The population here is the Antarctic temperatures for the given time period. The sample is the specific measurements made, so the “random” noise is the combined effect of measurement errors, spatial coverage limitations, etc. The effect of these factors is quantified and used to evaluate the size of the difference to decide whether it is large enough to be deemed “significant”.

    Testing whether a given statistic is “consistent with zero” is no different. Calculate the difference between the two results above. Now you have a “given statistic” and your original hypothesis is that it is “consistent with zero”.

    However, what the reviewers wanted, essentially, was an answer to the class of questions that would also include, “is there a statistically significant difference between using n or n-1 degrees of freedom for estimating the variance of a sample?” when the set of observations used were the same for both methods. This, in my opinion, is a senseless question.

    I agree that this is not the proper question which needs to be answered here. What they are asking for is a sensitivity test on using variations in a method. What they really wish is to determine how much the change in calculation of a methodology changes the end results relative to the magnitude of the estimated values themselves. Comparing the change to a measure of random variation tells them nothing.

    In the case of comparing the two trends, we first wish to decide whether the two methods are estimating the same thing. If not, the second question is “how large is the difference?” – which can be evaluated by looking at confidence intervals and comparing to how substantially the original estimated values have changed.

  66. Ryan O said

    Roman,

    I apologize if I am being obtuse, but I do not quite understand all of your reasoning. I understand everything (I think) with the exception of this statement:

    Except in rare cases where your data is discrete (e.g. integer-valued), there will always be a difference. The mathematical treatments are different because you are using a different estimation procedure on the same data. So what? Look at my example above. Suppose I sample a normal population (where the mean and the median are theoretically equal). How often will the mean and median of any sample be equal… even to 3 or 4 decimal places?

    This does not seem to apply to what we are doing. In this case, the underlying data is the same. S09’s sample and our sample are the same sample . . . they are not independently drawn samples from the same underlying population.

    Now, if you were presented with 2 treatments that used independently drawn samples from the same population, then asking the question, “Is there a statistically significant difference in the results?” makes sense to me, because the answer depends not only on the treatments (which are a priori known to be different), but also on the samples. In the case that both treatments use the same samples, then the probability that the answers will be different is 1.0 as long as the methods are not equivalent.

    This is a different question than asking if both treatments yield answers that are consistent with a null hypothesis that they both could represent the same underlying population. This question, also, makes sense . . . and I think (based on the above) that this is the question that you are concerned with. Unfortunately, this is not the question to which we were asked to supply an answer. For this question, since both S09 and we were restricted to the same sample, I would imagine that the proper way to answer it would be a Monte Carlo analysis, or to restrict the mathematical treatments to different subsets of the observations.

  67. Ryan O said

    Roman,

    As an aside, I think the way that we answered the question (i.e., using a paired t-test) provides an approximate answer to the question of whether the two results are consistent with the same underlying population. I only think it is approximate, however, as the samples were not independent.
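
    As a rough illustration of that kind of paired comparison, and not the actual calculation from the paper, something along these lines could be run on the two sets of grid-cell trends; the arrays here are synthetic stand-ins generated at random, and the grid size is only illustrative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_cells = 5509                                # illustrative grid size, not the real grid
    trend_s09 = rng.normal(0.12, 0.10, n_cells)   # hypothetical per-cell trends (deg C/decade)
    trend_o10 = trend_s09 - rng.normal(0.06, 0.05, n_cells)   # hypothetical, correlated with the S09 values

    # Paired t-test on the per-cell differences: is the mean difference consistent with zero?
    t_stat, p_val = stats.ttest_rel(trend_s09, trend_o10)
    print(f"mean difference = {np.mean(trend_s09 - trend_o10):.3f}, t = {t_stat:.2f}, p = {p_val:.3g}")
    # Caveat, per the point above: the grid cells are spatially correlated, so the
    # effective sample size is smaller and this p-value is only approximate.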

  68. RomanM said

    #66 Ryan:

    This does not seem to apply to what we are doing. In this case, the underlying data is the same. S09’s sample and our sample are the same sample . . . they are not independently drawn samples from the same underlying population.

    It is exactly the same situation in my example. I take a single sample of values from my population. I calculate the mean from those values. I calculate the median from the same values. I subtract to calculate the difference (just as in the trends). Now, “with probability 1” that difference will not be equal to zero in the same way that you claim for your situation. If I am clever enough, I can calculate the theoretical distribution (or if less clever, :) , I could use bootstrapping) to determine the standard error of that difference and carry out the statistical test.
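
    In code, the bootstrap route might look something like this minimal sketch, with a made-up normal sample standing in for the data:

    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=0.0, scale=1.0, size=200)    # one draw from a normal population

    observed_diff = sample.mean() - np.median(sample)    # the "given statistic"

    # Bootstrap: resample the same data with replacement and recompute both
    # statistics each time to estimate the standard error of their difference.
    boot_diffs = []
    for _ in range(10000):
        resample = rng.choice(sample, size=sample.size, replace=True)
        boot_diffs.append(resample.mean() - np.median(resample))
    se_diff = np.std(boot_diffs, ddof=1)

    z = observed_diff / se_diff
    print(f"mean - median = {observed_diff:.4f}, bootstrap SE = {se_diff:.4f}, z = {z:.2f}")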

    What you (and Nick) are not taking into account is the fact that if the data collection were to be “repeatable” (e.g. similar data from other satellites of the same type were to be magically available), a repetition of the same comparison calculations within each of these new samples would produce (with probability 1) a trend difference which is not equal to the original one in the paper (or to each other).

    If someone calculated only the variability from these extra results, would you agree that it would be proper to then perform the test using only your calculated trend difference and the outside estimate of variability? If so, why is it improper to perform the same test when you can produce an estimate of that same variability internally from your single sample?

    This is a different question than asking if both treatments yield answers that are consistent with a null hypothesis that they both could represent the same underlying population.

    Not really. I thought that what we are talking about here is whether the following line from Jeff’s post has genuine meaning:

    S09’s central estimate of the continental trend is double ours, and the difference between the central trend estimates is statistically significant (0.06±0.05).

    This is a question of whether the two calculated trends are estimating the same parameter value (whether that value actually represents the “true” trend for Antarctica or not). What I am saying is that the statement can indeed be meaningfully supported by comparing the trends of both methods within this single set of data, something which Nick denied in comment #27.

  69. Ryan O said

    Haha!

    I now get it. ;) Duh. Seems quite obvious now!

    With my confusion put to rest, I would like to ask your opinion on whether our paired t-tests were the appropriate way to measure this. For the moment, I think that they are . . . but I would like to get your take as well.

    As always, Roman, thanks for your incredible patience . . . I can be quite slow at times. ;)

  70. Ryan O said

    Sometimes it’s a matter of a few key phrases to make it click:

    What you (and Nick) are not taking into account is the fact that if the data collection were to be “repeatable” (e.g. similar data from other satellites of the same type were to be magically available), a repetition of the same comparison calculations within each of these new samples would produce (with probability 1) a trend difference which is not equal to the original one in the paper (or to each other).

    I must admit, though, that I am “less clever” and would resort to bootstrapping. ;)

  71. Kenneth Fritsch said

    I agree that this is not the proper question which needs to be answered here. What they are asking for is a sensitivity test on variations in a method. What they really wish to determine is how much a change in the calculation methodology changes the end results relative to the magnitude of the estimated values themselves. Comparing the change to a measure of random variation tells them nothing.

    I hope that the participants at this blog understand where RomanM and Ryan O agree and where they (originally) disagreed. I think that is an important point. That I think I might understand may only mean that I am having another TCO moment.

  72. RomanM said

    #69, 70 Ryan

    I’d like to wait and see the full paper first, to get the contextual sense of the whole thing before looking at details, if you don’t mind.

    The calculation of the joint distribution of the mean and the median (or the distribution of their difference) can be a daunting task for most underlying distributions and would require a knowledge of a lot of mathematical tricks (real ones, not the team sort) to get anywhere. I would probably join in the less clever bunch as well. :)

  73. AMac said

    A bit off-topic, but interesting. At GNXP, Razib Khan discusses Principal Component Analysis of genetic information in various European ethnic groups. The topic of interest is the pitfalls in using PCA to analyze the spatial and temporal changes in these populations, specifically in reconstructing the patterns of migration and admixture during the post-Ice-Age migrations of the Paleolithic and Neolithic.

    This is a little different from Antarctic temps, but I see some of the same general themes there, and here (I think). In particular, there are seemingly-commonsensical rules of thumb that may obscure rather than illuminate historical patterns.

    But, the original [PCA-based] synthetic maps have become prominent for many outside of genetics… And yet a reliance on these sorts of tools must not be blind to the reality that the more layers of abstraction you put between your perception and comprehension of concrete reality, the more likely you are to be led astray by quirks and biases of method.

  74. RB said

    If I’m not mistaken, I read OLMC09 as having restored the pre-Steig view of the Antarctic, showing yet again how the odds are stacked against overturning a long-held consensus. Now, about those theories of Arrhenius and Plass …

  75. Re: RomanM (Dec 13 09:15),
    Roman,
    I haven’t disputed that you can do a test of statistical significance on the trends differently calculated on the same data. I have just, in the spirit of this thread, asked “what does it mean?”.

    The reason for asking is the question “what has it been taken to mean?”. NicL’s letter, helpfully bolded by Jeff, is a guide:
    “there appears to be an attempt to gloss over the differences between the results of our study (per the main RLS reconstruction) and that of S09 (per its main reconstruction) , some of which are pretty fundamental”

    The “gloss” seems to be the observation that the differently calculated trends lie within each other’s uncertainty intervals, so that seems OK. The counter is that, being calculated from the same data, the uncertainty of the difference is much less.

    My observation was that, at first glance, the difference is then not random – it comes from predictable arithmetic. We know the methods are different, and don’t need a statistical test to tell us that. But yes, you can still ask what the distribution of differences would be if you repeated the calc on 10000 different Antarcticas. And you might find that the expected mean of those differences was significantly different from zero.
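
    A toy version of that thought experiment, with a deliberately skewed synthetic population and two generic estimators (mean and 20% trimmed mean) standing in for the two methods, nothing to do with the actual reconstructions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    diffs = []
    for _ in range(10000):
        sample = rng.gamma(shape=2.0, scale=1.0, size=100)     # one fresh synthetic "Antarctica"
        diffs.append(sample.mean() - stats.trim_mean(sample, 0.2))
    diffs = np.asarray(diffs)

    print(f"mean difference = {diffs.mean():.3f}, spread (2 sd) = {2 * diffs.std(ddof=1):.3f}")
    # A systematically nonzero mean difference with scatter around it: a deterministic
    # methodological offset overlaid with random sampling variation.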

    The point of my rms vs mean example, or the more extreme 0.0001 example (http://noconsensus.wordpress.com/2010/12/10/olmc-10-what-does-it-mean/#comment-43076), is to say that when you have a deterministic difference overlaid with some random variation, a test of statistical significance needs careful interpretation. And it doesn’t refute the “gloss”, which is that S09 and O10 are somewhat different methods which give similar predictions, assessed in the way predictions normally are. That is, the uncertainty due to methods is not large compared with the uncertainty of data fluctuation.

    You asked, re that 0.0001 example, “what’s the problem?”. The problem is that you have tested a null hypothesis that you could have determined, by prior analysis, to be false. But the fact that the statistical test says that it is false is said to confer statistical significance. If it was false anyway, what does this mean?

  76. Kenneth Fritsch said

    And it doesn’t refute the “gloss”, which is that S09 and O10 are somewhat different methods which give similar predictions, assessed in the way predictions normally are. That is, the uncertainty due to methods is not large compared with the uncertainty of data fluctuation.

    With all due respect, Nick, what the heck does this mean and where again is the evidence for what you say here? The language appears to be getting a bit fuzzy.

  77. Re: Kenneth Fritsch (Dec 13 19:02),

    Well, take the continent trend. O10 got 0.06 ±0.08, S09 got 0.12 ±0.08. Relative to their errors, they don’t seem very different. But NicL says they are, because the difference is 0.06±0.05.
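
    For what it is worth, taking the quoted ±0.05 as an approximate 95% half-width and assuming normality (my assumptions here, not necessarily how the paper computed it), the arithmetic looks like this:

    from scipy import stats

    diff, half_width = 0.06, 0.05         # central difference and quoted 95% half-width
    se = half_width / 1.96                # implied standard error of the difference
    z = diff / se
    p = 2 * stats.norm.sf(abs(z))
    print(f"z = {z:.2f}, two-sided p = {p:.3f}")   # roughly z = 2.35, p = 0.019

    # The +/-0.05 on the difference is much narrower than a naive combination of the
    # individual intervals, sqrt(0.08**2 + 0.08**2) ~ 0.11, because the two estimates
    # come from the same data and their errors are correlated:
    # Var(a - b) = Var(a) + Var(b) - 2*Cov(a, b).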

    What that seems to mean is that if you notionally repeated the process many times, while the predicted trends might fluctuate a lot, the difference between S09 and O10 would still be mostly positive, centred on 0.06.

    My query is, so what? They are different methods – we knew that. And their difference in estimate of this statistic may be biased. It might even be predictable. That doesn’t refute the proposition that the estimates are similar. It’s just an aspect of their difference.

  78. RomanM said

    75 Nick,

    I really feel that I have addressed most of the points you raised in my earlier comments. However,

    My observation was that, at first glance, the difference is then not random – it comes from predictable arithmetic. We know the methods are different, and don’t need a statistical test to tell us that.

    The “not random” is a red herring which I discussed above. As I pointed out, samples are no longer random once they have been collected. However, when you apply your method to another sample, the predictable arithmetic will not produce the same result as it did for this one.

    Knowing the methods are different does not necessarily mean that the end results are “different”. Two methods can both be unbiased, but the random variation will ensure that the calculated results will not be identical in a particular sample.

    And it doesn’t refute the “gloss”, which is that S09 and O10 are somewhat different methods which give similar predictions, assessed in the way predictions normally are. That is, the uncertainty due to methods is not large compared with the uncertainty of data fluctuation.

    I am not sure exactly what you are trying to say here. The test compares the difference between the methods to the variation in the method results due to the random factors in the system including the uncertainty of data fluctuation. I indicated that the next step is to compare the difference to the variation in the parameters of the system, i.e. the “actual temperatures” in Antarctica. These have already occurred so they can be treated as deterministic (although our measurement of them includes uncertainties which hopefully a good test is taking into account).

    You asked, re that 0.0001 example, “what’s the problem?”. The problem is that you have tested a null hypothesis that you could have determined, by prior analysis, to be false.

    Do you honestly think that that is a genuine problem in this case? Frankly, that is just another red herring to the matter at hand.

    Nick, a lot of statistics is about quantifying effects and deciding how much of what we can see is real as opposed to artifacts of the random elements in the system we are analyzing. That is why, instead of seeing whether confidence bounds overlap, it is preferable to get a better numeric handle on observed differences. O10 (some new kind of oxygen?) was written because they thought that S09 was not done correctly. This particular portion of the analysis was sparked by someone making a request to see the difference in trends quantified in this fashion. What they did appears to me to be a valid and meaningful approach.

    Just before posting this reply to your previous comment, I noticed your latest:

    And their difference in estimate of this statistic may be biased. It might even be predictable. That doesn’t refute the proposition that the estimates are similar. It’s just an aspect of their difference.

    This is a scientific statement? I agree with Kenneth, the language does appear to be getting fuzzy…

    I think the horse has breathed its last despite my stick.

  79. Re: RomanM (Dec 13 19:52),
    Roman,
    This is about as concisely as I can say it. NicL objects to a claim that the S09 estimate of trend (0.12 ±0.08) is similar to the O10 estimate (0.06 ±0.08). He says the difference is biased. I say that there is no reason to expect different methods to predict the same statistic from the same data with unbiased difference, and this doesn’t bear on the question of whether they are similar.

  80. Jeff Id said

    RB, that’s right. There are multiple pre-Steig papers which did far better jobs of trend distribution. It ain’t like temperature data is rocket surgery. When it gets difficult is when you apply unique methods for distribution of trends.

    That’s what I have difficulty understanding. If you’re going to present a new method for distribution of trends, isn’t it most important to compare to the simple ones? You know, like averaging or closest station infilling or some other method that should be within tenths of a degree of the fancy ones?

    That’s the difference IMO between S09 and an engineering approach.

    I’ve had a few minutes today to follow the discussion on stats. It seems pretty clear to me that the variance in the difference between two methods using the same data is important. It has been described as random and not random and compared to different methods like mean and median.

    What our method and S09 were supposed to represent was the mean, not mean and median. We are both estimating exactly the same thing, which is why the difference between them, and any non-climate weather noise variance left as a residual difference, is key in determining the significance of the delta trend.

    That is also why I mentioned above that +/-0.05 variance as a residual should be an alarm bell by itself. There is no way that temperature means from two working methods using Earth data over fifty years should have that kind of differential variance. By itself, the +/-0.05 is a refutation IMO but others may see it differently.

    S09 had semi-broken methods which spread trends across the continent. O10 fixed that to some extent. However, if I were to look at regional trends for serious examination, I would still use the ground stations. O10 may have improved the ground information in some areas over simple ground stations, but it may have reduced the fidelity in others. Over the continent, it could be better than closest station trend, but nobody has proven that out.

    Finally, with this long winded comment, I’m not sure how much longer tAV is going to be around. I don’t have any time to enjoy this or engage in any of the fun. It was very disappointing not to be more involved in this paper. I believe there is plenty I could have helped with. My time is simply too limited these days. Running the business has been taking all of my time for the past 10 months. In the last 3 the few hours I had dropped to near zero. It’s too bad really because it is fun, but I don’t see any time in the near future either.

  81. Kenneth Fritsch said

    http://www.climatescience.gov/Library/sap/sap1-1/finalreport/sap1-1-final-appA.pdf

    Under section 7 in the above link Tom Wigley recommends using differences for comparisons in (a) and (b) below, and the pooled estimate of the SE for (c); a rough sketch of the pooled-SE comparison follows the list.

    (a) comparing data sets that purport to represent the same variable (such as two versions of a satellite data set) – an example is given in Figure 2;

    (b) comparing the same variable at different levels in the atmosphere (such as surface and tropospheric data); or

    (c) comparing models and observations.
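
    A generic sketch of the pooled-SE comparison for case (c), with placeholder numbers rather than anything from the Wigley appendix or the papers, and assuming the two estimates are independent:

    import math
    from scipy import stats

    b_obs, se_obs = 0.10, 0.04        # hypothetical observed trend and its standard error
    b_mod, se_mod = 0.24, 0.05        # hypothetical modelled trend and its standard error

    diff = b_mod - b_obs
    se_pooled = math.sqrt(se_obs**2 + se_mod**2)
    z = diff / se_pooled
    p = 2 * stats.norm.sf(abs(z))
    print(f"difference = {diff:.2f} +/- {1.96 * se_pooled:.2f}, z = {z:.2f}, p = {p:.3f}")

    # Here the two individual 95% intervals overlap (0.10 +/- 0.08 and 0.24 +/- 0.10),
    # yet the test on the difference rejects at the 5% level, which is why judging
    # significance by whether the separate intervals overlap can mislead.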

    By the way, I believe Wigley in this article also corrected a misconception that fellow contributor Ben Santer made in an earlier paper, on judging the significance of differences in trends between observed and modeled troposphere temperature anomalies by observing the overlap (or lack thereof) of the extremes of the CIs. Wigley points out that that is misleading and wrong. I am not sure what Nick Stokes is talking about here, other than generalizations that the S09 and RO10 methods, and their results, are little different. Perhaps statistical significance has suddenly lost meaning.

    Am I correct that we all agree that a better/correct method following on a problematic method should not be required to show results that are statistically different in order to be published? But if you can, as was the case here, all the better, as the correction certainly becomes more relevant.

  82. Kenneth Fritsch said

    Running the business has been taking all of my time for the past 10 months. In the last 3 the few hours I had dropped to near zero. It’s too bad really because it is fun, but I don’t see any time in the near future either.

    That is a bittersweet development, but with the state of the economy being what it is, it is good to hear about a business that is really taking off. Regardless of what our left wing friends think, the entrepreneurial spirit is what will lead us out of this economic mess, provided the government stays out of the way.

  83. Jeff Id said

    Kenneth,

    I’ll keep going for a bit longer yet. Actually the paper is what I’ve been waiting for. What would be perfect is a blog where I could do something technical every couple of weeks depending on my time. Of course there wouldn’t be much readership with just one spot. This little blog has more readers/week than my hometown newspaper right now. It (read you guys) has done more than I could ever have expected and more importantly I’ve learned a ton. More than I thought possible outside of school but there is so much more to do. Unfortunately us mortals only get so much time on this rock and after 2 1/2 years here, I’d like to do a lot of different things before I keel over. My thought is that maybe in another few years I’d come back and run it again but who knows.

    My intent right now is to run a post requesting people who would like to help/take over/contribute regularly. If not enough agree to contribute, I will look to some of the technical bloggers for a future outlet. WUWT has the popular market. SteveM has a specific paleo technical niche, Lucia is a permanent web presence who covers some unusual topics. I just need a spot to mess around with enough readers to make it fun and no pressure to put another post out. Study time!!

  84. Jeff,
    Sorry to hear about the time pressures. Your presence will be missed.

    If some version of tAV keeps going, I’ll be happy to continue being my normal pesky self.

    And congratulations again on persisting with the paper. I think it has been genuinely well received.

  85. Steve Fitzpatrick said

    Jeff,

    I’d consider a post or two as well… nothing as elegant as O’Donnell et al of course, but maybe enough to provoke some thought. The ocean temperature profile and downmixing of surface heat seem to me to be key in evaluating the true state of the Earth’s energy balance. Of course, I’m very busy too… I’m at a hotel in Auckland, New Zealand right now but returning to Florida tomorrow morning. I will try to see if I can come up with something in a couple of weeks.

  86. steven mosher said

    “A side point here: I have heard that the novelty of Steig 09 was in showing that the warming of West Antarctica was as great or greater than the Peninsula, which overturned previously published evidence. I am assuming that the more novel a paper the more readily it can get published.”

    I asked eric the question on RC: “what was novel about your paper”

    failed moderation.

  87. Kenneth Fritsch said

    I asked eric the question on RC: “what was novel about your paper”

    failed moderation.

    The RC moderator may have considered that a “trick” question, or he may have felt that the novelty was very obvious. I did not get an answer here either from those whom I expected might want to guess.

  88. RB said

    Jeff,
    There was another story recently regarding the bias that comes from eagerness to arrive at a new conclusion. This story, recently featured here and described here, concerns the discovery of an organism that supposedly feeds on arsenic, but the more likely conclusion is that, in their excitement over the possibility of a new kind of life, the scientists did not check for contamination in their experiments, and peer review failed as well.
    Sorry to hear about your time constraints, but I couldn’t see how you managed the blog with work and family and it’s good to know you’re human too :)

  89. RB said

    I’m not on the other side of the CAGW camp, but I think you guys might find this interesting.

    I don’t know whether the authors are just bad scientists or whether they’re unscrupulously pushing NASA’s ‘There’s life in outer space!’ agenda. I hesitate to blame the reviewers, as their objections are likely to have been overruled by Science’s editors in their eagerness to score such a high-impact publication.

  90. curious said

    86, 87 Steve, Kenneth – From comments by Ryan somewhere along the way, wasn’t the novelty one of using the satellite period data’s more complete geographical information to inform an historic temperature reconstruction from the sparse station records? Or had this been done before in other areas?

    FWIW I also thought it was novel using actual temperatures for a reconstruction instead of proxies. From my limited understanding this played a part in the whole “negative temperature” debate – a sort of confusion of treating an actual parameter as if it were a proxy? Once the paper comes out I intend to bottom this out for myself … “going deep” to quote our favourite climate commando (h/t TCO :-))

  91. Kenneth Fritsch said

    Curious, the novelty you mention in the first paragraph is the one I judged to be novel about the paper, even though I think that without showing statistically significant warming for the Antarctic continent it would not have been as readily published or publicized by the MSM.

    I also judge that the method came about as much from the authors wanting to show significance, and thus from the necessity of attacking the sparse data from 1957 to the 1980s. I think where the authors bogged down and went wrong was that once they found significant warming they rushed to publish without much further thought. I’ll not impart motivations here to such actions.

  92. kim said

    Pictures tell story;
    All pretty little colors.
    One is true, one not.
    ===========
