Roman on Anomaly Trend Regression

I’ve been spamming Roman M’s thread on the correct method to calculate anomaly trend. It’s a correction for an error in anomaly slope calculations by least squares fit. The method of correction is laid out mathematically and in code. I do enjoy math threads, but when skeptics of AGW (if Roman even is one) make improvements in method and publish them online, the rest of us in blogland should take notice. This is the second step forward in calculating temperature trend for a global average that Roman has proposed in the last two weeks! Like Roman’s posts on combining temperature time series with a seasonal component, it is again a step forward. After reading the math, it made sense; it then required beating my head on the R code for several hours. Like good science, the result, openness and method are like a clean glass of water. No hand waving about teleconnections or rubbish about moisture feedback being between 2 and 5 times blah, blah… It’s real science with an actual answer that differs from the standard methods of climatology.

Anyway, when I see his thread with only ten comments (half of them mine), it’s a shame. If you’re interested in why some of us get so excited about science and math, check it out.

Anomaly Regression – Do It Right!

20 thoughts on “Roman on Anomaly Trend Regression”

  1. I appreciate Roman M’s work but I just clog your threads! Maybe I should learn to run R before even doing that much…

  2. When I read what Roman writes I have trouble formulating a smart question. It’s kinda like being in a class where you know you didn’t finish all the prerequisites. More exposure will happen via guest posts here and at CA.

    I think CA would be a good place to repost his original posts.

  3. #3 I sent an email several weeks ago to SteveM suggesting the repost on a different topic, but no answer. I agree on the prereq comment. Each step takes time and thought, but I enjoy that. He’s really got me interested in applying this to tree rings, but I’ve got too many posts to work on before then.

  4. I thought Roman had posting rights? The “do it right” meme is strong enough to branch out into an interesting discussion that is both technical and sociological.

    Roman’s code has cooties. I’d rather do it wrong.

  5. I need to think this through. Correct me if I’m misunderstanding, but:

    The reason for the “error” is that you are comparing a data set that is continuous against one that is discrete. It is really not a surprise that a discrete representation of a continuous cycle will yield results that are slightly different. The discrete set is meant, almost by definition, to be a simplification. There is always a balance between simplification and accuracy.

    I disagree with the solution that breaking it down by annual periods improves results. This works in his example because it is a continuous trend around a perfectly sinusoidal cycle. Thus, the annual period eliminates all bias, but this is only because of the assumption of the curve used to represent temperature anomalies. However, I am not sure that is a proper representation of global patterns at all, since the global nature of readings supposedly balances this effect out. Further, because of solar cycles, oceanic cycles, El Nino effects, etc. there are longer-term cycles that come into play which then make the assumption of annual periods being the proper isolating factor incorrect.

    No, the better approach, in my opinion, will continue to be towards more discrete periods. Taking this further, 24 periods in a year would bring us closer to the continuous result, as would 48, as would 96, etc. until we converge on the exact result.

    I guess I’m just not sure I agree. I could be missing something, though.

  6. Re: The Diatribe Guy (Mar 23 01:13),

    The reason for the “error” is that you are comparing a data set that is continuous against one that is discrete.

    No, this is incorrect. It is due purely to the simple fact that the anomalizing procedure turns a linear trend component into a step function which jumps at the end of each year regardless of how the year is defined. The “perfect sinusoidal” oscillation is irrelevant as any monthly pattern periodically repeated will give the same type of result.

    To see that discreteness is not the issue, simply try it on the function

    f(t) = sin(2πt) + ct, for t between 0 and N, an integer.

    Anomalize this function by subtracting the “mean” function

    A(t) = f(t) – (1/N) Σ f({t}+k)

    where {t} is the fractional portion of t (i.e. t-floor(t)) and the sum is taken over the values of k = 0, 1, …, N-1. You will notice that the sin portion is removed (as it should be), but the linear portion becomes a step function with equal steps of size c. This is exactly the same as occurs in the discrete case. Replace f(t) by any periodic function with period equal to 1 and the result is the same.

    The annual cycle is an order of magnitude greater than the variation of any of the other quasi-cyclic periods which may come into play, so assuming annual periodicity is important. Changing the number of subperiods to “isolate” for solar, oceanic or other effects is not going to have a substantial impact. (A quick numerical illustration of the step behaviour follows below.)
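
    To make that concrete, here is a quick R sketch (just an illustration, not the code from the linked post) that anomalizes f(t) = sin(2πt) + ct on a fine grid and checks that what remains is a pure step function with steps of size c:

    # Anomalize f(t) = sin(2*pi*t) + c*t over N "years" and inspect the residual.
    N  <- 10                                    # number of years
    c0 <- 0.5                                   # linear trend per year
    t  <- head(seq(0, N, by = 1/360), -1)       # fine grid on [0, N)
    f  <- function(t) sin(2*pi*t) + c0*t

    frac    <- t - floor(t)                     # {t}, position within the year
    mean_fn <- sapply(frac, function(u) mean(f(u + 0:(N - 1))))  # (1/N) * sum of f({t}+k)
    A       <- f(t) - mean_fn                   # the anomalized series

    # The sine is removed and A equals c0*floor(t) minus a constant:
    max(abs(A - (c0*floor(t) - c0*(N - 1)/2)))  # ~ 0: equal steps of size c0 each year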

  7. “Taking this further, 24 periods in a year would bring us closer to the continuous result, as would 48, as would 96, etc. until we converge on the exact result.”

    It seems intuitive that this would be the case; however, the size of the stairstep is actually set by the length of the season rather than by the math used to calculate it. More points mean you are measuring the run of the stair more thoroughly (the anomaly method still flattens it no matter how well we measure), while the rise is still created on an annual basis. (A sketch of this follows below.)
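
    A small R sketch of that point (my own toy construction, with a made-up repeating pattern): whether you slice the year into 12 or 96 sub-periods, the anomalized series is flat within each year and still jumps by the full annual trend at the year boundary.

    # Build several years of data (within-year pattern + linear trend), anomalize
    # by sub-period-of-year, and report the year-to-year jump in the anomalies.
    step_demo <- function(periods_per_year, years = 10, trend_per_year = 1) {
      n    <- periods_per_year * years
      t    <- (0:(n - 1)) / periods_per_year        # time in years
      pat  <- runif(periods_per_year, 0, 100)       # arbitrary repeating pattern
      x    <- trend_per_year * t + rep(pat, years)  # trend + "seasonal" cycle
      idx  <- rep(1:periods_per_year, years)        # sub-period index
      anom <- x - ave(x, idx)                       # subtract each sub-period's mean
      yr   <- rep(1:years, each = periods_per_year)
      unique(round(diff(tapply(anom, yr, mean)), 6))  # jump between successive years
    }
    set.seed(1)
    step_demo(12)   # 1: the anomalies step up by the annual trend each year
    step_demo(96)   # 1: finer sub-periods give exactly the same step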

  8. OK, I admit to having only been half paying attention last night (probably shouldn’t have been trying to read through that and think it through at 1 am).

    Upon further review, I’m still not sure the anomalizing “process” has anything at all to do with it, but set that aside for the moment.

    As an example – I’m a simple person, so forgive the simple analysis – I established a time zero at a value of zero. In a simple spreadsheet I set time = 1 at a value of one, and increased each time value by 1 until reaching 120 – ten “years”.

    Then, I just used the random number generator to give me a 12-month seasonality. It makes no sense from a practical standpoint, but I’m just observing the math, so I don’t care about that right now. My seasonality additives are: 45,6,25,26,86,18,53,63,9,21,74,97 for times 0 – 11, respectively, repeating continually.

    The trend should be 1 per month. It’s not. It’s 1.0366 per month. So, while I agree with your point in that regard, this isn’t a new problem to me. The slope is the same whether I anomalize it or just take the slope of the full value. I am using a discrete function here, and it demonstrates that it is not the anomalization process but rather the nature of the seasonality/cyclicality inherent in the data. This, of course, assumes that there actually is a seasonality that is consistent, and further that there is no other kind of cyclicality in the data.

    The same issue can occur if the effect is multiplicative. A further issue arises from taking a continuous trend and moving to a discrete presentation of the data. I was wrong to suggest this was the entire issue; it is probably not tremendously impactful.

    This trending issue has traditionally been dealt with in one of two ways: (1) compute trends with different starting periods over the period of cyclicality and take their average (which you did in your post). This works because if you have a known trend along with any kind of cyclical or seasonal data and the trend is overstated in one period, that must be offset by other periods. (2) The second, more common approach is to simply identify the seasonal nature of the data and trend the adjusted data. Of course, to do this you need a reasonable idea of what adjustments to make.

    I’ve looked at the seasonality of the temperature data, and there seems to be a very slight seasonality to global temps, but when I trended the adjusted data versus the raw data the difference was minuscule. Such adjustments would absolutely be required for local analysis. But actually, the best way is probably option #1 above, because in doing that you are not introducing an assumption error on what the seasonal adjustments are; you are just letting it take care of itself. I may have to consider doing that when looking at my trends. (A rough sketch of that approach follows below.)
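
    A rough R sketch of option #1 (a toy construction, not code from anyone’s post): build a series with a trend of 1 per month plus an arbitrary monthly pattern, fit the slope starting from each of the 12 months over same-length windows, and average the twelve slopes.

    set.seed(2)
    years    <- 11
    seasonal <- runif(12, 0, 100)                    # arbitrary monthly pattern
    x        <- 1:(12*years) + rep(seasonal, years)  # trend of 1/month + seasonality

    slope_from <- function(start, n = 12*(years - 1)) {
      y <- x[start:(start + n - 1)]                  # window starting at month 'start'
      coef(lm(y ~ seq_along(y)))[2]                  # least-squares slope
    }
    slopes <- sapply(1:12, slope_from)
    slopes        # each individual slope is biased up or down by the seasonal pattern
    mean(slopes)  # 1: the average of the twelve slopes recovers the true trend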

  9. Re: The Diatribe Guy (Mar 23 10:07),

    The trend should be 1 per month. It’s not. It’s 1.0366 per month. So, while I agree with your point in that regard, this isn’t a new problem to me. The slope is the same whether I anomalize it or just take the slope of the full value.

    I put your numbers into R and ran the regressions there.

    The trend per month was calculated as 1.036009 for the “raw” data and .9900688 for the anomalized data. Since these results are good to about 15 decimal places, your statement is incorrect.

    In the raw case, you do not get a trend of 1 per month because of the order in which the seasonal values you generated occur, with larger values later in the year. Cycle the last two (74 and 97) to the front and the trend becomes .9661435 – about 7% smaller. The anomalization effect is over and above that, as is demonstrated by the difference in trends calculated above.

    In fact, that is the reason for using our suggested methodology which fixes both of these problems. (A short R reproduction of these numbers is sketched below.)
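
    For anyone wanting to check the numbers, here is a short R sketch (my reconstruction of the setup: add the seasonal values to the sequence 1, 2, …, 120 and regress on time, as spelled out a couple of comments below):

    seasonal <- c(45, 6, 25, 26, 86, 18, 53, 63, 9, 21, 74, 97)
    time <- 1:120
    temp <- time + rep(seasonal, 10)           # trend of 1 per month + seasonality

    coef(lm(temp ~ time))[2]                   # 1.036009  ("raw" slope)

    anom <- temp - ave(temp, rep(1:12, 10))    # subtract each month's mean
    coef(lm(anom ~ time))[2]                   # 0.9900688 (anomalized slope)

    # Cycle the last two seasonal values (74 and 97) to the front:
    temp2 <- time + rep(c(74, 97, 45, 6, 25, 26, 86, 18, 53, 63, 9, 21), 10)
    coef(lm(temp2 ~ time))[2]                  # 0.9661435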

  10. I calculated the trend based on a simple least-squares regression line in Excel. The months are assigned as individual periods (not 1/12). I should have noted that my trend is based on the averages of the two raw numbers, since t=0 and t=1 represent endpoints in my numbers and weren’t intended to be the monthly average. Maybe that’s our difference. So the first value I am regressing is 26.

    Lest you take my posts as an argument, I’m more thinking out loud. You guys are smart.

  11. All right, I am confused now. I assumed that you did the following:

    Take the trend sequence: 1,2,3,… 119, 120.

    Add the seasonal values: 45,6,25,26,86,18,53,63,9,21,74,97. Add these individually in order to the first twelve trend values, then repeat this cycle nine more times until each of the trend values has their monthly additive.

    This gives the temperature sequence. The time sequence is 1,2,3,4, …, 120, the same as the trend sequence. Do a simple regression of temperature on time.

    How can you get 26 from the numbers 1, 2, 45 and 6? (1+45+6)/2???

  12. So Roman, as a statistician, what is your take on the 5 sigma vs 3 sigma “dust up” between Tamino and Lubos?

  13. Time 0 = 0 + 45

    Time 1 = 1 + 6

    Average for month 1 = 26. Obviously that makes an assumption about the relationship between the endpoints that determines the average. The point is that a monthly anomaly is derived from a series of observations that results in an overall average. My average is simple for illustration purposes.

    I wasn’t presenting the data for reproduction, so I apologize for not being clearer. I was just tossing out my own observations.

  14. In fact, that is the reason for using our suggested methodology which fixes both of these problems.

    Ours, eh. Too kind. It is the correct method though, and it was entertaining to figure out what you did. A good puzzle.

    #13, I’m not following what you did. Maybe my simple way of thinking about this can help. Anomalizing by month, week or day ensures that, for each time period, the average over the whole time series is zero. This means that if you have a perfect linear uptrend, the annual signal is flat, with each successive time period inside a single year averaging to the same value. At the beginning of the next year there is a step up. Of course this applies equally to noisy data, but the effect can disappear visually in the noise. The anomaly calculation process cannot take into account the fact that there is a linear trend inside the 1 year period. Roman’s method restores the correct intra-annual trend. (A tiny worked example follows below.)
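
    A tiny R example of that description (toy numbers only): a perfect linear uptrend of 1 per month, anomalized by month, comes out flat inside each year and steps up by 12 at each new year.

    x    <- 1:60                          # five years of a pure uptrend, 1 per month
    anom <- x - ave(x, rep(1:12, 5))      # subtract each month's average
    matrix(anom, nrow = 12)               # each column (year) is constant; successive columns differ by 12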

  15. Re: Layman Lurker (Mar 23 12:58),

    I don’t believe in hard and fast rules for determining the level of significance required when doing statistical tests, nor is there anything inherent in statistics which automatically makes one level or another “more correct” (including .05 and .01, which are really historical artificialities from BC (Before Computers) times).

    What needs to be considered are the relative consequences of the two possible types of error in the statistical test. Type I error is the rejection of a true null hypothesis and Type II is not rejecting the null hypothesis when the alternative hypothesis is true. This is further complicated by the fact that the alternative is often not a simple hypothesis but consists of many possibilities. Thus, a simple evaluation of the p-value (which relates only to the Type I error) may not be sufficient, since the consequence may depend on which specific alternative is the true one.

    If a group wishes to use a specific criterion, that is their prerogative and it makes it neither right nor wrong. But this is OT, sorry Jeff, no more…

  16. #19, Don’t worry about OT too much. On occasion we rein active threads back in, but this seems as good a spot as any to discuss stats. Tech stuff gets the right of way.
