# the Air Vent

## Because the world needs another opinion

• JeffId1 at gmail dot com

Everything you need to know about Climategate. Now available at Amazon.com

• ## Follow Blog via Email

Join 170 other followers

## Demonstration of Improved Anomaly Calculation

Posted by Jeff Id on March 21, 2010

This post is an application and further explanation of  Roman’s work on improved anomaly trend analysis.  At this post, Roman gave an example of a perfect sine wave with a trend of 0.2C added to it.  By taking the monthly anomaly of the data, steps are introduced on an annual basis.   These steps cause a distortion of trend which are actually unnecessary, however, there is a greater understanding and a little bit of a tricky regression involved(in my opinion-not Roman’s) .  The programming of the regression is absolutely simple in R where the difference between the standard call first and the improved call second are below.

```
coes = coef(lm(anomalydata~ time))

coes = coef(lm(rawdatat~0+month+time))    #Improved calculation

```

The first line returns two coefficients for the anomaly regressed against time (standard form) y = ax + b where a(slope) and b(intercept) are returned.   The second example line above, also returns coefficients but there are actually 13 returned.  The function is again y= ax+b, however 12 b’s are returned for a single a.  One  best fit line is created for each month with the only difference between the lines being the offset!

In the second line of code above, “month” is a special type of indicator variable which instructs R to find the offsets separately from the slope.  It’s a sort of mask which separates the data by month for the intercept but not by slope.  If you haven’t seen it before, it probably will take some study to understand.  What it means though is simple, the function returns 12 offsets based on each month, yet only one slope which is based on the best simultaneous fit to all the months together.

Of course the best slope is dependent on the best offsets by month so solving together is necessary.

There is a code advantage to the improved call though, in that you no longer need to calculate the anomaly.   Figure 1, shows the fit to GHCN station number 60360.

Figure 1 - 12 offset least squares slope calculation

See we got the slope without ever actually calculating the anomaly and using a single line in R.  What’s more is, we’ve removed the anomaly step problem created by monthly averaging.  There are some interesting implications of this method as well.  Since the simple monthly average anomaly decreases the quality of the fit (increases the residuals), the improved method trend uncertainty, is narrowed but more interestingly is that there is a “true” anomaly value which can be extracted from this method.  The true anomaly, takes into account the trend information by month such that monthly steps created by the simple removal of monthly mean and the simple linear trend is identical to the improved calculation results.

There are dozens of demonstrations which could be done on this topic, I’ve used a single temperature station, but it might be fun to try with simple linear data or some simple waveforms.  Anyway, that is left for the reader to figure out.  Below, I’ve written an appropriate slope calculation function which returns the 12+1 coefficients, residuals and true anomaly from a temperature series as shown above.

```######################################################################
##### slope plotting function with modifications for 12 month anomaly
######################################################################
get.slope=function(dat=dat)
{
len=length(dat)
month = factor(rep(1:12,as.integer(len/12)+1))
month=month[1:len]	#first month by this method is not necessarily jan
#the important bit is that there are 12 steps
time2=time(dat)
coes = coef(lm(dat~0+month+time2))
residual=ts(rep(NA,len),start=time2[1],deltat=1/12)
fitline=array(NA,dim=c(len,12))
for(i in 1:12)
{
fitline[,i]=coes[13]*time2+coes[i]
}
anomaly=rep(NA,len)
for(i in 1:len)
{
residual[i]=dat[i]-fitline[i,month[i]]
}
anomaly = residual+time2*coes[13]-mean(time2*coes[13],na.rm=TRUE)
list(coes,residual,anomaly)
}

```

So, employing the station data above, the anomaly calculation using the standard method results in a plot which looks like this:

Figure 2 - Standard anomaly calculation

And the anomaly returned by the above improvement is shown in Fig 3:

Figure 3, Improved "True anomaly" calculation

Totally and completely different  — as you can see ;).   The next plot is a little weird, I subtracted the anomaly from Fig 2 from the residual of Fig 3.  The reason I  didn’t just subtract anomaly is that you get a huge up and down cyclic scribble which gives no feel for the magnitude of the  difference, engineers need to understand the magnitude.  Simple version – Think of this plot as the difference between Fig 2 minus Fig 3 with the slope of Fig 3 artificially added in for scale.  The wiggle itself is the error in the standard anomaly calculation corrected for by using the simple  improved line of code presented at the top of this post.

Figure 4 - Annual signal error in standard anomaly calculation.

In this case the increased trend from 0.2253 to 0.2256 C/Decade – or 0.0003 C/Decade.  Not much!  The confidence interval tightened from 0.0818 to 0.0817 C/Decade. Again, not terribly much.

There are of course examples which can be created that will have bigger differences between the methods.  Now that I’ve completed this though, I’ve been able to create complete global gridded trend without direct calculation of anomaly at any point.  I’ll post that next.

1. ### steven moshersaid

Nice work.

The global work should be interesting to look at.

It would be cool when you are done to use the list of temperature stations to do some testing with Synthetic data.

That is, Zeke has his version, and CCC have GISSTEMP running and CRU code is out there.

So, one could create a series of synthetic data sets ( kinda like what chad was doing to test various methods ) so that
the benefit of Roman’s method ( if any of course ) could be demonstrated.

A draft of Hansen’s next paper (linked on Lucias ) is out covering various issues, you should have a look. Issues covered are:

2. Nightlights for the world
3. W/WO the artic
5. Which SST dataset.

2. ### Jeff Idsaid

Thanks Steve,

I haven’t had time to look much at Zeke’s latest. My plan is to push forward with what I can do as time permits. I would like to look at the US in particular b/c we have the best station density –and probably quality.

3. ### RomanMsaid

Why did you choose an example where the improvements aren’t that obvious? 😉 One thing that should be stressed is that the entire procedure is done without having to deal with the problems that would be encountered in the anomaly approach due to missing data values.

Good stuff! I have been messing with your script and I can offer a suggestion or two that might tighten it up a bit if you wish. I do have a question about the definition of the “anomaly” in this case. With respect to what “baseline” do you see the anomaly as being defined? As it is, the anomalies do not necessarily sum to zero (despite the fact that the residuals total zero) because you subtract the mean of the linear portion using all of the observations, but where actual data is missing, the linear part does not enter into the calculation. This may need a little thought.

4. ### Jeff Idsaid

#3, It’s a bit lazy on my part, but you get what you pay for. I don’t have a baseline for the anomaly so the algorithm just used the mean. Tighten away.

5. ### SteveWHsaid

Jeff ID

I’d hate to be a pain, but could you post the data set you’ve used to get the results above and what is required to get that script to run. I’ve download V2.mean, put it in Excel, extracted 60360 in an Excel workbook and put it in R (using commander).

All I get is [9] ERROR: unexpected ‘}’ in “}”

which tells me absolutely nothing.

If you would be so good as to spoon feed me here just to get me started maybe I could start to learn how to use some of your scripts.

6. ### Jeff Idsaid

#5, I set you a copy of the complete script for this post along with some instruction.

7. ### Jeff Idsaid

Roman,

Actually, it’s not really lazyness. Looking for an example which showed more or less effect is a bit against what I try to do. If I show a greater effect, I’m biasing the impression. If I show less, it’s the same.

My hope is that people like Steve in #5 will try it and figure out for themselves so in recognizing my own belief that this is a far better method, I just chose the very first example which by luck, happened to have enough data. It’s the same station which was defaulted in previous posts too.

I hope it is apparent to people that there are examples in which the improvements will have a greater effect on trend.

8. ### SteveWHsaid

#6

Thanks very much Jeff. I took me a while but I am getting some results (like the data ghmean,tinv, ccode) Lots of stuff to graph (V1 – 17, but the jpeg graphs show an icon with 0 bytes (it took me a while to unremark the plot line), but something about the margins being too big? Anyway enough to start learning how its done.

Steve :]

9. ### Kenneth Fritschsaid

This comment is a bit off topic, but I wanted to discuss my most recent analysis of individual station differences within a 5 x 5 degree grid. I have reached the point where I judge the important comparison to be difference series pair-wise with all stations within a grid. I am using the 1950-1990 period and annual anomalies for complete to very near complete station records.

I started by looking at trend differences over the 1950-1990 period – with difference series trends. I then looked at the actual difference series for paired stations with a significant trend and those with little or no trends. What I found was that significant trends can be derived from a difference series that shows an initial 20 year plateau and then a trending increase or decrease over the next 20 years. Important is the fact that there are really 2 rather apparent trends.

What I found with differences series with no trends were often an initial upward (downward) trend followed by a downward (upward trend) that together yielded no trend. Important again is the fact there are 2 rather distinct trends.

My difference statistic for my ongoing analysis will now be a sum of squares of the difference series from zero. I want to be able to obtain a reasonably unbiased measure of differences between individual stations and then compare those levels of difference to station properties. I do not know whether this sum of squares is a proper statistic.

I do think I see where some climate scientists who appear to invent their own statistics are motivated: by impatience and lack statistical skills. In my case, as perhaps contrasted with the some climate scientists, I will happily take the advice of those more expert than I – and let the result lead where it may.

10. ### SteveWHsaid

Hi Jeff

So I got everything to run and decided to play a little with it. Just running the script with the WMO changes.

Ran Cambridge Bay Nunavut WMO 71925 (69 Lat 105 Long) Slope 0.172/Dec and Eureka NT, WMO 71925 (80N 86 W) 0.2122/Dec. Being Canadian eh.

Both less Annaba 0.2256/Dec(in Algeria). Isn’t the Arctic supposed to be the fastest warming area? Hmmmm. Could I be using the program incorrectly?

11. ### Ausie Dansaid

Kenneth Fritsch

Hi – if you look at the NCDC global land and ocean data from 1880 to now, you will see an overall upward linear trend.
Superimposed on this is a distint 65 year zigzag pattern, with highs in 1880 (I think really 1878), 1943 and recently.
The lows were in 1911 and 1975.

Your “50 year” data sample would cover part of the 1943 down leg and part of the 1976 – 2008 up leg.

Why some individual stations show a “V” pattern, while others a “^”, I do not know.
But looking at a number of Australian stations, there are a number of distinct but different patterns, most persisting throughtout the data series and all but one (that I have found so far), do NOT exhibit any Hockey Stick like upturn after 1975.

I am asking myself just what does a global average really mean?

12. ### SteveWHsaid

Jeff

I am looking at the locales used by CRU? in Canada for their temperature data. The program seems to look at the station ID to separate out the one you are interested in. However, the first 5 digits of the ID can be used by locations very far apart. For example, Vancouver Int (airport) WMO 71892 0. Nanaimo A WMO 71892 1 (across the Strait on Vancouver Island); Ladner (a suburb) WMO 71892 3 etc.
It appears the program combines all these sites. I guarantee you they are not the same! Any way to get just one site?

What does getsinglestation=function(staid=71043, version=0)do? What is version?

13. ### Jeff Idsaid

#12, the version number sorts the individual instruments apart. In past versions, my program simply combined different series together, in this version the series are taken instrument by instrument. It’s an improvement. I’m working on a post with it right now.

14. ### SteveWHsaid

Jeff

Do you mean that the version number should separate the sites? Because when I run the script you gave me with(staid=71892 version=0) the plots have an x axis running from 1937 to 2001. Now Vancouver Int only runs from 1937 to 1991 in the ghmean (wow is this data base out of date – I have the complete Canadian WS info to 2007 and Van Int has records up to Jan 2010 {available on line})71892 1 from 1949-1990;71892 2 1971 to 1980;71892 4 1987 to 2001. Therefore it seems that the program is combining these four temperature sites – which are definitely not the same instrument. Also there are other 71892 designations with 10,20, 30 and 40 which the program seems not to recognize at all one running back to 1879, probably because they don’t seem to be in the ghmean database produced by your script – but they are in v2.mean. Ghmean database only seems to have 0 – 9?

15. ### Jeff Idsaid

I’ve got ghmean running to 2001 for Vancouver

237197 403 71892 0 4 1998 48 71 79 97 134 165 192 191 161 108 79 37
237198 403 71892 0 4 1999 50 55 64 87 111 145 166 183 146 96 77 49
237199 403 71892 0 4 2000 35 50 70 97 117 156 175 171 150 105 55 35
237200 403 71892 0 4 2001 51 39 67 88 123 147 172 175 148 -9999 -9999 -9999

Column 4 is tinv[, 3] which equals 0

From tinv —
2957 403 71892 0 VANCOUVER INT 49.18 -123.17 2 30 U 1169FLxxC O 1 A 2WATER

16. ### SteveWHsaid

Jeff

I was getting my info off an Excel spreadsheet from v2.mean and a tinv.txt
If I look at ghmean I can only get to 105 something about maxprint at the bottom. Sorry, changing preferences or whatever can be somewhat obtuse in R. Besides it is time to hit the sack. I’ll get back to this sometime tomorrow. Thanks for the help. Learning a little about R and scripting anyway. :]

17. ### SteveWHsaid

Jeff

Finally got ghmean into Excel which is a little easier to work with.

The last of the numbers above you got me was 237200 403 71892 0 4. The last two digits are the modifier for WMO # (?). The number 237200 403 71892 0 4 is for New Westminister, BC (according to tinv). Therefore, New Westminister’s temperatures are being included in the record for Vancouver Int by your script. My Excel WS of v2.mean includes double digit modifiers which give very different temperatures and dates. I don’t know what sites they refer to. tinv only gives single digit modifiers

If you put getsinglestation=function(staid=71892, version=4 (version 40 according to my Excel v2.mean WS) you get temperatures from before 1880 to 1980 from GOK where. Can’t find a reference on the WHO website. All very confusing, but I do think your program could combine different sites.