## GHCN Another Way

Posted by Jeff Id on January 25, 2010

Ok so I’ve spent far too much time this weekend messing around with GHCN data in R. R has memory limitations which have driven me to the edge of sanity, but doggedly I pushed on. In case you weren’t sure, slogging through temperature data is a crap way to spend a weekend, but I’m getting closer to an independently created gridded temperature dataset by someone who makes not one penny for promoting one conclusion or another. Unfortunately, we’re stuck with the crap data archive at GHCN.

Yesterday my goal was simple: instead of collecting timeseries into a single WMO number, I wanted to collect every individual series into its own column in a matrix. To that end I wrote the following code. It looks simple, but there were a ton of issues with getting this to work. Two days’ worth, in fact. R doesn’t have good memory handling, doesn’t have breakpoints, and runs impossibly slowly if certain commands are or are not used. All of it makes sense after you beat your head on the wall enough, which I did. The code for compiling all of the GHCN series individually is below.

```r
allstations = as.numeric(levels(factor(ghmean[, 2] * 100 + ghmean[, 3])))
nost = length(allstations)
rows = nrow(ghmean)
tempsta = array(NA, dim = c(1920, nost))
indexvec = (ghmean[, 4] - 1850) + 1
mask = indexvec < 1
indexvec[mask] = 161
stmaskvec = ghmean[, 2] * 100 + ghmean[, 3]
td = array(unlist(ghmean[, 5:16]), dim = c(595758, 12))
datmask = rep(TRUE, 1921)
datmask[1921:1932] = FALSE

for (i in 1:nost) {
  mask = allstations[i] == stmaskvec
  dat = array(NA, dim = c(12, 161))
  dat[, indexvec[mask]] = t(td[mask, ])
  dim(dat) = c(1932, 1)
  mask = dat == -9999
  dat[mask] = NA
  dat = calc.anom(dat / 10)
  tempsta[, i] = as.numeric(dat[datmask])
  print(i / nost)
}

tempsta = ts(tempsta, start = 1850, freq = 12)
dimnames(tempsta)[[2]] = allstations
#save(tempsta, file = "ALL GHCN DATA")
#load(file = "ALL GHCN DATA")
```
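
The function `calc.anom` isn’t shown above. Based on Jeff’s later reply (#38: anomalies taken over the whole timeseries), a minimal version might subtract each calendar month’s full-record mean. This is a hedged sketch, not the actual function:

```r
# Hypothetical sketch of calc.anom: convert one station's monthly series
# (a numeric vector, Jan-Dec repeating) into anomalies by subtracting
# each calendar month's mean over the whole record.
calc.anom <- function(dat) {
  m <- matrix(dat, nrow = 12)          # rows = months, cols = years
  clim <- rowMeans(m, na.rm = TRUE)    # per-month climatology
  as.numeric(m - clim)                 # recycled down columns, flattened
}
```

The recycling works because `clim` has length 12 and the matrix has 12 rows, so each column (year) has its monthly climatology subtracted.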

This code loads ghcn into a matrix 1920 months (160 years) long and 13464 columns wide. We’ve seen a variety of numbers for GHCN station count but 13464 is the true number of total series. Availability by month of GHCN data is as follows:

So we can see the actual peak number of available stations crossed 8000 at one point. However, there really were over 13,000 instruments in operation over the last 150 years. My understanding is that most of those are still in operation today but, for inexplicable reasons, are not collected together. One of the biggest scandals of climate science is that since 1992 there has been an average of only 2000 stations recorded in the primary global database. With all the media outlets screaming global warming doom unless we pay more taxes, don’t you think someone would notice that NOBODY IS ARCHIVING PROBABLY 80 PERCENT OF THE TEMPERATURE STATIONS!!!! Oh yeah, but we have doom in India from melting glaciers, THAT NOBODY MEASURED EITHER! Doom from drought which will never happen. Doom from fish shrinkage due to global warming. Doom!! they say, didn’t you hear them? Fig 2 shows the mean of all the GHCN data, unweighted and averaged.

I just realized, a couple thousand of us are sitting here on a blog called a ‘skeptic’ blog, ‘denier’ blog, ‘contrarian’ blog and our SIN is plotting the DATA!!! That’s where we went wrong, plotting the damned data. You evil denialist beasts. Sorry, the past two months of climate change have permanently altered my perception of the authorities. — not for the better, in case you were wondering 😉

d²Climate/dt² = f(Politics)

The change in the climate of climate change over time as a function of politics. I know it’s a crap expression, but it makes me laugh.

Anyway, there ain’t a lot of trend in Figure 2. In Figure 3, the data is filtered to reveal the incredibly small change in temperatures recorded by the single most complete set of temperature stations on earth over the last 150 years.

There it is, global warming doom. Now most of these stations came from the US, but I wonder how much that really matters. Most of the warming this century happened in the Northern Hemisphere, yet not in the US. If I’m Russia, I would hand out thousands of Bic lighters to temperature station operators, if you know what I mean, and hope that the US committed unilateral suicide with the data. Anyway, I’m concerned with Fig 3 because there isn’t enough warming to match the consensus. Before we go into that, I want to say that NONE of my work has 1998 as the warmest year on record. EVER. Yet every consensus metric on earth does. What does that mean? Am I not selective enough with the data? It’s odd, and it could be something wrong here, because nobody gave me 22 million dollars over the last ten years. Instead of ridiculous $$, we “skeptics”, who actually care about the effing data and whether the JIC’s are lying to us, have a weekend of effort.

The gridding code looks like this:

```r
latlon = tinv[, 5:6]
stid = tinv[, 2]
coln = as.integer(as.numeric(unlist(dimnames(tempsta))) / 100)
Ncol = length(coln)
collatlon = array(NA, dim = c(Ncol, 2))
for (i in 1:Ncol) {
  mask = stid[i] == coln
  collatlon[mask, 2] = latlon[i, 1]  # latitude
  collatlon[mask, 1] = latlon[i, 2]  # longitude
}
dt = rep(.5, Ncol)
plt.map(dt, collatlon)

gridval = array(NA, dim = c(Ncol, 2))
# place stations on grid
for (i in 1:length(coln)) {
  gridval[i, 1] = as.integer((collatlon[i, 2] + 90) / 5) + 1   # lat
  gridval[i, 2] = as.integer((collatlon[i, 1] + 180) / 5) + 1  # lon
}

griddata = function(dat = dat, coord = coord) {
  gridtem = array(NA, dim = c(36, 72, 160 * 12))
  for (ii in 1:72) {
    print("col")
    print(ii)
    for (jj in 1:36) {
      maska = coord[, 2] == ii
      maskb = coord[, 1] == jj
      maskc = maska & maskb
      maskc[is.na(maskc)] = FALSE
      print(jj)
      if (sum(maskc, na.rm = TRUE) > 1) {
        gridtem[jj, ii, ] = rowMeans(dat[, maskc], na.rm = TRUE)
      } else {
        if (sum(maskc, na.rm = TRUE) == 1) {
          print(sum(maskc, na.rm = TRUE))
          print(jj)
          gridtem[jj, ii, ] = dat[, maskc]
        }
      }
    }
  }
  gridtem
}

gt = griddata(tempsta, gridval)

globt = rep(NA, 1920)
for (i in 1:1920) {
  globt[i] = mean(gt[, , i], na.rm = TRUE)
}
globt = ts(globt, start = 1850, freq = 12)
plot(ff(globt), main = "Mean of All GHCN Data Gridded 5x5 Deg.",
     ylab = "Anomaly C", xlab = "Year")
```

So what’s the deal then? Why is this result different? Again, I’m not terribly sure, another weekend and still no solid answers. There is a clue though. The data recently released from CRU which is supposed to be raw GHCN has an interesting difference.

It’s a reasonably continuous series, yet the GHCN timeseries which CRU claims Figure 5 is from have not one single continuous set of data.

You can see that the data records from all the long stations stopped around 1990, with only one series recorded in recent times (where all the global warming occurred). So, since CRU claims that the data they have for each station is directly from GHCN, how did they get so many years of continuous data for station number 25563? I’ve already spent hours looking into this question in an attempt to reverse engineer the process. I could take the easy road and send an email to UEA — do you think they would help the climategate blogger figure this out? Anyway, it’s time to go to work this Monday morning, so we’ll get to look at the next exciting bit tonight or tomorrow. CRUGlue, adhering the data.

## Bruce said

Jeff

I’ve been wondering since it was discovered that there is no unadjusted station data available – why is everyone expending so much time and energy analyzing and plotting this data?

It’s all meaningless until we know we have the raw station data – no?

## CNY Roger said

Jeff,

Thank you for all your work here and especially the effort to compile these data.

It may take a while but I think that someday it will make a difference.

Roger Caiazza

Certified Consulting Meteorologist


## Carrick said

Here are my random thoughts. Feel free to point out stupid ones among them…this is on the fly:

Isn’t it the case that some of these 13,000+ series are actually the same data (e.g., series 1-4 of station 25563), or are they using multiple instruments at some stations? I think this issue needs to be resolved soon.

Also it’s my understanding that many of these stations are still operating, GHCN just isn’t scraping the data for their archive anymore.

We need something like a surfacestations.org project to get some of these missing stations’ data back into the database. I understand there are also long-running stations in other countries with sparse representation in GHCN that are not being included. That’d be interesting to add too.

Finally, the grid averaging program needs to be updated to interpolate over land-surface area with missing 5×5 grid points. Something like RegEM would probably be the best approach here.

We’d also need a database of 5×5 grid points with the percent land in each (no point in interpolating 0% land grid points, for example, but we’d also like to do a weighted mean over grid points with less than 100% land surface area).

## Jeff Id said

#4 ” Isn’t it the case that some of these 13,000+ series are actually the same data”

In this case, I initially thought it was the same, however, now it doesn’t appear to be. The metadata is confusing to say the least. Version numbers don’t match up and there are even stations which have WMO numbers yet no metadata exists in the record for them.

## Steveta_uk said

Not sure if this is relevant, but the area of a polar 5×5 grid element is way smaller than an equatorial element, so if the grids are averaged, don’t the polar elements need to be given much reduced weights?

## alf said

“I just realized, a couple thousand of us are sitting here on a blog called a ’skeptic’ blog, ‘denier’ blog, ‘contrarian’ blog and our SIN is plotting the DATA!!! That’s where we went wrong, plotting the damned data. You evil denialist beasts. Sorry, the past two months of climate change have permanently altered my perception of the authorities.”

Jeff; Welcome to the real world. I have a son who is also a rationalist and expects people to behave rationally. Happens to be one of our main sources of conflict. You have discovered the mass hysteria that humans have exhibited throughout history. Maybe history is useful after all. You have found the “BEAST” and he is alive and prospering.

## Carrick said

#4, Jeff, one way to fix this regardless of whether it is from one instrument or many, is to combine all of the data from each station into one time series. Is that how you are doing it? Sorry I haven’t had time to look at the R yet (swamped by work issues).

## Jeff Id said

#6, Shoot, I forgot to cosine correct the grid. I rewrote this code this weekend and left that little detail out.

Good catch.

## Jeff Id said

#8 Figures 2 and 3 are just ungridded un-messed with averages of the anomaly data.

## Carrick said

Steveta_uk :

Yes it is absolutely relevant.

When I do the average, I first average over grid points in each latitude (say 42.5-47.5°N), then I use cos(latitude) correction to weight the data as you move from the equator towards the poles.


## turkeylurkey said

Hey Jeff

Very honorable of you to volunteer for yet another weekend of head-butting and relentlessly flogging a balky tool into grudging performance of ill-fitting tasks. Negative perspiration on the cos(lat) weighting oversight.

Whenever I say things like ‘took me to the edge of sanity’ after such an exercise, my friends tend to say things like:

‘Wow, so close, yet so far away…Imagine if you’d actually gotten all the way into the sanity zone, however briefly…’

Anyway, awesome effort sir. This salient would seem to be eventually capable of mapping the remote contributions of distant stations that become increasingly indefensible as station density goes down in Non-US locations.

I’m trying to remember how it was done back in the ‘olden daze’, pre-breakpoints.

I seem to recall that the front panel microswitch for bit 17 of my Varian 620i was very temperamental, and so loading the code required great vigilance…

And the second bank of (core) could be rather fussy; but I didn’t often need the extra 4K x 18 …

TL


## RomanM said

#11 Carrick:

From what I can tell, this appears to be the method used by GISS as well: Calculate a single latitude band average, then combine these with cosine weights.

However, when applied to land surface data, this seems to be problematical because it basically treats the globe as composed entirely of land. The result is to overweight the tropics and the SH and underweight the mid-latitudes of the NH where there is a higher proportion of land surface than in the rest of the globe.

An extra weighting proportional to the land area in a latitude band seems to be needed. A simple approximation is to weight each grid cell with the cosine correction and then to average the available grid cells using these weights.
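
RomanM’s simple approximation can be sketched directly against the `gridtem` array from Jeff’s gridding code (36 latitude bands × 72 longitude bands × months). The 5° cell-centre latitudes and the weighting details below are my assumptions, not Roman’s or Jeff’s actual code:

```r
# Cosine-weighted global mean over available cells only: weight each
# non-missing grid cell by cos(latitude) and normalize by the sum of the
# weights actually used, per RomanM's suggested approximation.
coslat <- cos(seq(-87.5, 87.5, by = 5) * pi / 180)  # 5-deg cell centres

weighted.glob <- function(gridtem) {
  nm <- dim(gridtem)[3]
  out <- rep(NA, nm)
  w <- matrix(coslat, nrow = 36, ncol = 72)  # weight depends on row (lat)
  for (i in 1:nm) {
    slab <- gridtem[, , i]
    ok <- !is.na(slab)
    out[i] <- sum(slab[ok] * w[ok]) / sum(w[ok])
  }
  out
}
```

Because the normalization uses only the cells present in each month, a cell dropping out changes the weighting rather than biasing the sum toward zero.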

## Jeff Id said

#15, I intend to weight each gridcell by cos and average as you suggested.

An additional method I wanted to use was to do an area weighted plot using the closest stations on a slightly finer grid. This would spread a lot of information over the oceans but if that’s all we have to sample the area with I don’t see it as an incorrect approach.

## RomanM said

#16. Warning: The anomalies may appear larger than they are now… 😉

## P Solar said

Greetings Jeff. Just a few thoughts to throw in here.

1. What is the justification for the cosine weighting? Are polar temps somehow less “global” than tropical zones?

2. Are all of these data from ground stations? If some of these are ship based, your land area weighting would need to separate out the SST data.

3. There is a suggestion that the recent reduction in station count is itself a bias. Only four in California: 3 coastal + LAX = hardly typical of CA land area. NONE in Bolivia. Land averaging other “nearby” countries to effectively substitute for high-altitude Bolivia is a bias. Just to cite two.

4. Urban heat islands are unaccounted for and maybe more significant in the recently reduced data set.

5. I recall that NASA/GISS modified the pre-2000 “raw” data to correct their Y2K blunder rather than the other way around, so how “raw” is the raw data you start with?

Not to pour cold water on your efforts — someone is going to need to start some valid analysis of the global record sometime — but I’m wondering as to the value of the eventual results on the road you’re taking.

The first few plots you give are useful as a base line visualisation without much of the dubious homogenisation etc.

Where do you see this line of work leading, and what will be the value of the result?

Maybe just define this clearly for yourself if you have not already. I’m not asking for you to account for yourself in public or anything. You’re clearly capable but I’d hate to see you expending a lot of time as you are if it’s not leading to a useful reworking of the data.

Best regards.

## Jeff Id said

1 – Gridcell temperatures are a way to make sure that one area is not overweighted when compared to others. If we weight a tiny gridcell 5×5 deg at the pole the same as a large one at the equator, it’s unfairly weighting the pole stations. The correct method for fixing this is to use the cos of the angle from the equator.

2 – all ground stations.

3 – Station count changes are one of the most glaring issues in the record. It’s disgusting that we can spend trillions yet not actually check the station data.

5 – As raw as GHCN will allow. There is some discussion of this at their site.

The value of what I’m doing will hopefully to be an understanding of the temperature data as it is. If we have an open workable source of the data, we can then check urban vs rural and look at choices CRU made, how area weighting affects the result, the spatial distribution of warming etc..

## Carrick said

RomanM:

As I use it, it’s based on the assumption

Tdot(theta1, theta2; t1, t2) = Lambda(theta1, theta2; t1, t2) * Tdot(-pi/2, pi/2; t1, t2)

where Lambda is a positive (real-valued) function, and Tdot(theta1,theta2; t1, t2) is the temperature trend (dT/dz) for temperatures in the range of latitudes theta1 to theta2 averaged over the period t1 to t2 (t2 > t1).

In other words,

Lambda(theta1,theta2; t1, t2) = Tdot(theta1, theta2; t1, t2)/ Tdot(-pi/2, pi/2; t1, t2)

If we’re really lucky, Lambda doesn’t depend on time (as long as t2-t1 is sufficiently large):

Lambda(theta1,theta2) = Tdot(theta1, theta2)/ Tdot(-pi/2, pi/2)

The idea behind neglecting the longitude phi is that rapid west-to-east circulation “smears” out mean trends over latitude bands. If one is looking at averages over a month, that corresponds to roughly 5 complete circulations of the globe in the troposphere in the temperate climate zones.

The way I am thinking of it conceptually is a simple extension of “teleconnections” in which Lambda(theta1,theta2) represents a weighting function for latitude. If you measure Lambda(theta) = Lambda(theta-eps, theta+eps), you should be able to correct for missing latitudes. This can be generalized to a true Lambda(theta, phi) teleconnections function with a bit of work…. as long as Lambda(theta,phi) is sufficiently smoothly varying this will work. [In practice, you’d probably want to fit Lambda to the data using an expansion of spherical harmonics.]

All that said, if you want to do the weighting in the manner suggested by RomanM, you don’t need to average over latitude first: Just apply the cos(latitude) weight to each non-zero grid point, and use the sum of cos(latitude) as a normalization factor. As long as you aren’t interpolating missing land grid points, this is exactly equivalent to what Roman suggested doing.

On the other hand, if you want to perform the interpolations for missing grid points (using e.g RegEM), and you don’t want to assume a simple form for Lambda as I did, then you need to use the weightings as described by RomanM.
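
As a hedged illustration of measuring Lambda(theta) from data: given a matrix of zonal-mean anomalies (one column per latitude band), the ratio of each band’s trend to the global trend gives a Lambda per band. The band construction and the OLS trend fit here are my choices, not Carrick’s actual procedure:

```r
# Sketch: Lambda(theta) estimated as the ratio of a latitude band's
# temperature trend to the global trend.
# zonal: nmonths x nbands matrix of band anomalies; time: numeric vector.
lambda.est <- function(zonal, time) {
  trend <- function(y) {
    ok <- !is.na(y)
    if (sum(ok) < 2) return(NA)
    coef(lm(y[ok] ~ time[ok]))[2]        # OLS slope, deg C per time unit
  }
  band.tr <- apply(zonal, 2, trend)      # trend in each latitude band
  glob.tr <- trend(rowMeans(zonal, na.rm = TRUE))  # global-mean trend
  band.tr / glob.tr                      # Lambda per band
}
```

A band warming at twice the global rate gets Lambda = 2; one warming at half the rate gets Lambda = 0.5.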

## Carrick said

Make that the temperature trend dT/dt.

My hands betray me because I’m much more used to writing dT/dz.

## Kenneth Fritsch said

Jeff ID, I agree that your approach should provide some insights into the gridding of temperature data and that it is very worthwhile. We need a better understanding of all the implications of the processes used to do it. You might want to start looking into the details of what the data set operators do in the gridding process. Is the adjustment of temperature data part of the gridding process, and if so, would taking adjusted data and then gridding it lead to a different end result?

I have finally used R to pretty much automate getting individual station data from 5 X 5 degree grids from GHCN, producing a statistical summary of the trend variations within the grid, plots of the station time series with breakpoints and storing the plots in files. That probably does not sound like much of a feat for those proficient in R or other statistical packages, but for me it was a big accomplishment and not without frustrations. My trouble with R is that there are some data manipulations that I know I can do in Excel without even thinking about, but in R I have to think and learn. When you say that R can be slow, I am sure that you are aware that Excel, amongst its other deficiencies, is very slow compared to R. I attempted to download the GHCN adjusted data into Excel and found to do it I would have had to first download to Notebook (no problem there as I do that with zipped files that go eventually into R) and then load the data in parts into 8 different Excel worksheets.

I did have a frustration with the GHCN data in that the -9999 entry for no data did not leave a space between adjacent data and confused R when I attempted table loading. I finally loaded it by using line by line entry and then replacing -9999 with NA and grabbing the columns using string manipulations in R. There is probably an easier way to do this and I’ll learn about it later, but at least I feel my brute force efforts are my own.

Jeff, I am retired and I do things pretty much at my own pace now, but as the very impatient person that I am to a fault, in your shoes, I might be very frustrated with your weekend experiences using R. My only advice is to take your time and savor the experience.

I know that GHCN does much infilling of data, or at least with the USHCN series and particularly in earlier times. That is why I have my doubts about data that goes back in time beyond the 1900-1920 period. GHCN uses an algorithm for the infilling. GISS does not infill missing data. There are infilling rules from GHCN, but I do not think infilling for a completely missing year’s worth of data is allowed.

## RomanM said

#20 Carrick

What I am talking about calculating is the usual global monthly (or annual) land anomaly that GISS calculates from the gridded data. IMHO, ideally, one would like to take all land grid cells, estimate the mean anomaly within each and then average those over the globe using weights proportional to the land area of each cell. This is complicated in practice by the fact that there are land regions which have no stations from which to estimate the temperatures. That and the fact that some grid cells are part land and part water are the reason that I wrote “a simple approximation is” to use only available grid cells weighted by the full area of the grid cell at that latitude band to estimate the global average.

If one were to calculate the latitude band average first, then one would need to use weights which were proportional to how much area of that band consisted of land. Using the usual cosine weights (which are proportional to how much of the area of the globe, both land and water, is present in the band) would give biased results. I don’t see why the same considerations would not apply if one were averaging trends instead of anomalies.

What GISS appears to do (from several tests using their gridded monthly data) is to (IMO, incorrectly) do a simple average of the available cells within each band and then weight them using the usual band area cosine.

## Margaret said

About the issue of the “missing” stations.

It seems to me that this issue is not going to go away and so the real question is – is there anything that can be done about it? I was thinking along the lines of the http://www.surfacestations.org approach where lots of people each did a bit and produced a major resource. As I understand it most of these stations are still there — and producing temperature readings that are collected at a national level. The problem is that they are not being collated.

Would it be possible to use the power of the internet to start collecting them?? I know New Zealand puts its data online — so do other countries?? Could we co-ordinate a web search then email requests to met services etc to gather the data together?

It seems to me the major issue in doing this is knowledge. I wouldn’t know what to look for, and even if I found it, someone would need to check it is the real deal, so it needs…

– directions that any sucker can follow of exactly what is needed;

– co-ordination so we don’t all look for the same data over and over again;

and

– validation so that the final dataset stands up to scrutiny.

I appreciate this would be a mammoth undertaking — but it seems to me that sooner or later we are going to get to the point of saying “what does that station drop-off really do to the statistics” and the only way to find out is to collect the larger data set and so the sooner the project gets off the ground the better.

## Nick Stokes said

Jeff,

In Figs 2 and 3, how did you convert from C to anomalies? You don’t mention a base period.

## Carrick said

Here is my derivation of Tdot from the crutem3v data.

Figure.

This is quick and dirty… I just used theta2-theta1=10°. Vertical axis is °C/year.

And this is my version of Lambda:

Figure.

## Carrick said

RomanM:

I’m assuming just land too.

The simplest assumption is that land temperature trend is just a constant * global mean temperature (i.e., my Lambda function is one constant for land, another for ocean).

The next simplest is that the land temperature trend is a function of latitude only (Lambda is function of theta only), and the third is it’s a full 2-d field (Lambda is a function of theta and phi).

It’s not necessarily incorrect. If you follow my discussion you see it can be based on assuming that east-west variations in Lambda can be neglected.

Remember Lambda is a measurement of climate change, not weather. You need a sufficiently long integration period to wash out most important weather fluctuations, which is why I chose 50 years for my measurement interval.

There is an advantage to making this assumption, because if you don’t, then dropping or adding stations in a given latitude band (without some form of interpolation for missing grid points) can result in the introduction of bias into your global temperature trend. In other words, if you choose to include the “phi” dependence, you must correct for changes in the geographic distribution of surface stations, either by only including grid points which have values for the entire duration you are integrating over, or by interpolating for grid points that “drop out” during the integration period.

## RomanM said

#27 Carrick

OK. I see where you are coming from. I am not sure that I like the extra assumptions you are making and I prefer to do KISS (Keep It Statistically Simple). Measure the weather and let the time variable average out the trend. 🙂

## Carrick said

RomanM:

I prefer the KISS principle. But I also like the Einstein Relativity version:

KIASAPUBNOS

(Keep It As Simple as Possible, BUt NO Simpler.)

😄

Give me a few minutes, I’ll show the bias effect from doing it your way. It will also illustrate the utility of introducing global scaling functions (“teleconnection functions”).

## Nick Stokes said

Jeff,

I asked earlier (#25) how you calculate anomalies. My guess is that it is done within the function calc.anom(), which is not spelt out. But since it is within a loop over stations, my guess is that you just subtract the mean of each fragment that is passed.

This is not the regular way of calculating anomalies (fixed base period) and has a powerful effect in suppressing trend. The reason is that from old time fragments, you are subtracting an old mean. From new fragments, you subtract a new mean, which includes warming. If you think of the combined set of means that you subtract, they have a warming trend. This is subtracted from the original signal as you prepare Fig 2.

To give a concrete example, suppose you had a station covering the 20th century, which had a uniform trend from 10C to 11C. If it’s a complete set, you’d subtract the average 10.5C, and the anomaly would show that 1C/cen trend. But suppose the data was in two fragments, split in 1950. From the first half you’d subtract 10.25, from the second 10.75. Net trend now, only 0.5C/cen.
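
Nick’s worked example can be checked numerically. In this sketch (my construction, not Nick’s), an OLS fit on the fragment-anomaly series recovers roughly 0.25 °C/century of the true 1 °C/century; the exact degree of suppression depends on the trend estimator, but the direction and mechanism are exactly as he describes:

```r
# Trend suppression from per-fragment anomalies: a century of annual data
# with a true 1 C/century trend, split into two fragments at 1950, each
# fragment centered on its own mean before recombining.
yr   <- 1901:2000
temp <- 10 + (yr - 1900) / 100                 # true trend: 1 C/century
a1   <- temp[yr <= 1950] - mean(temp[yr <= 1950])  # fragment anomalies
a2   <- temp[yr >  1950] - mean(temp[yr >  1950])
anom <- c(a1, a2)                              # step down at the 1950 join
slope.true <- coef(lm(temp ~ yr))[2] * 100     # = 1 C/century
slope.frag <- coef(lm(anom ~ yr))[2] * 100     # ~0.25 C/century here
```

The per-fragment centering introduces a downward step at the join, which the regression trades off against the within-fragment slopes, eating most of the trend.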


## Carrick said

Try #2: Forgot that Jeff doesn’t allow inlined comments:

OK, so the first question is what is the bias in latitude in the net sampling of land temperatures, if we weight latitude by fraction of area? Is there a shift in this bias? We only have to worry if there is a shift, and it is radical over time.

Here is the formula for computing the mean latitude that I used:

Equation 1: Computation of Mean Latitude

Here N(theta) is the number of grid points that are filled for a given latitude theta. Note if we infill N(theta) so that N(theta) is constant over time, there would be no net effect. You only get an effect if theta_avg varies.

Answer is yes:

Figure 1: Mean Latitude

So this looks potentially like a problem, so let’s apply the assumption of teleconnections above and assume Lambda(theta) is constant over time (I’ll use the value of Lambda(theta) computed from 1960–2010). Here’s what I derive for the equation for bias error:

Equation 2: Computation of bias error in mean temperature

Result:

Figure 2: Bias in mean temperature

The 10% bias you see at the end is just a result of the fact that there is more land in the northern hemisphere than southern. It is a real bias. The drop in bias from 25% circa 1860 to 10% in 1950, however, is artifactual: It is due entirely to the change in the spatial sampling of the land surface temperature field.

As usual, point out errors, hopefully I got this pretty close, but I’m doing this in between spats of my “real work”.

## Carrick said

Nick Stokes:

Nick is right about this. Might be interesting to modify the code to take first differences instead for each segment and average those. Your final temperature trend would then just be a sum over the averaged first differences, and you could adjust the integration constant to give you whatever baseline you like.
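
A minimal sketch of the first-difference combination Carrick describes (my own construction, not NCDC’s implementation): difference each station’s series, average the differences across stations, then re-integrate with an arbitrary integration constant of zero.

```r
# First-difference combination of station series.
# stations: nmonths x nstations matrix, possibly with NAs.
first.diff.combine <- function(stations) {
  d  <- apply(stations, 2, function(x) c(NA, diff(x)))  # per-station diffs
  md <- rowMeans(d, na.rm = TRUE)          # mean first difference per step
  md[is.nan(md)] <- 0                      # no station reporting: no change
  cumsum(ifelse(is.na(md), 0, md))         # re-integrate; constant = 0
}
```

Jeff’s reply below explains the weakness of this scheme: the assumed zero starting anomaly makes the result sensitive to weather noise at each series start when station counts are small.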

## boballab said

Interesting. NCDC abandoned the First Difference method, as it introduced random errors with decreasing numbers of stations, when they made their own gridded anomaly map:

http://www.ncdc.noaa.gov/oa/climate/research/ghcn/ghcngrid.html

You can also get the NCDC Gridded data of GHCN from that link to compare to.

## Jeff Id said

#33, 34: First differences conceptually make very good sense; however, when you work with small numbers of stations starting at different times, a problem occurs. I learned of this while calculating Antarctic trend anomaly for Ryan’s paper.

When you re-integrate a first difference the anomaly is restored; however, having no value for the offset, the first point always has an assumed anomaly of zero, and the first value returned becomes the offset for the entire anomaly. Since there is plenty of monthly weather noise in the signal, this offset causes problems with small station numbers. Other than this assumed zero, first difference is the same as simply averaging anomalies together, except that the start point is always pushed toward the zero line. Anomalies have their own assumed offsets. I found that instead of using a first difference, I could use overlapping months with the next closest station at the beginning of the series to calculate an appropriate offset for the anomaly.

It’s a bit trickier a calculation, but by doing this I found a much more stable and reliable trend than simply averaging anomalies or using first differences. In the end it pushed continental averages for the Antarctic to 0.06C/decade, which was slightly higher than the anomaly-only method’s 0.053, and had a very good quality spatial distribution.

However, the whole experience taught me that first differences are simply not a good option for temperature data. It might work better on filtered data, but anyway, that’s my experience.
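The overlap-offset idea Jeff describes might be sketched roughly like this in R, on hypothetical data (the function name and the toy series are invented for illustration, not his actual code): rather than assuming a zero first anomaly, shift the incoming series so that its mean over the months it shares with a nearby reference series matches that reference.

```r
# Align a new series to a reference using only their overlapping months.
align_by_overlap <- function(ref, new) {
  ok <- !is.na(ref) & !is.na(new)      # months where both series report
  if (!any(ok)) return(new)            # no overlap: leave unchanged
  new - mean(new[ok] - ref[ok])        # remove the mean offset over the overlap
}

# Toy example: same climate signal, but the new station reads ~3 C higher
# and the two records only overlap in the middle.
ref <- c(10.0, 10.2,  9.8, 10.1,   NA,   NA)
new <- c(  NA,   NA, 12.8, 13.1, 12.9, 13.2)
adj <- align_by_overlap(ref, new)      # new series shifted down by 3 C
```

After adjustment the new station continues the reference series without the artificial step that a zero-anchored first difference would introduce.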

## Carrick said

Thanks for the comments and insight, Jeff. The problem with first differences is that they do inflate the noise, and this data is intrinsically noisy, so…

Is it generally the case that within each time series there aren’t sudden jumps?

Also, is enough metadata provided so a time of observation bias correction can be automatically applied?
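Carrick’s point that first differences inflate the noise is easy to check numerically: for white noise e with variance s², the differenced series e_t − e_(t−1) has variance 2s², since the two terms are independent. A quick R check:

```r
# Differencing white noise doubles its variance:
# Var(e_t - e_(t-1)) = Var(e_t) + Var(e_(t-1)) = 2 * s^2.
set.seed(42)
e <- rnorm(1e5)                 # white noise, variance 1
ratio <- var(diff(e)) / var(e)  # should be close to 2
```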

## Jeff Id said

#30, I used the whole timeseries for the anomaly. You are correct that it causes an offset of sorts, which would suppress the trend as new stations are added in later. Perhaps recentering on the 1978–2008 period would be a good way to make it work. My hope, though, is to work toward an offset method like I described in #35, so it won’t make a difference then.
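A recentering step like the one Jeff suggests might look roughly like this in R (the function and the toy matrix are hypothetical, not his code): subtract each station column’s mean over the chosen base period from that column.

```r
# Recenter a monthly ts matrix so every column has zero mean over a base
# period (here 1978-2008), mimicking a common anomaly reference period.
recenter <- function(tsmat, start = 1978, end = 2008) {
  base <- floor(time(tsmat)) >= start & floor(time(tsmat)) <= end
  sweep(tsmat, 2, colMeans(tsmat[base, , drop = FALSE], na.rm = TRUE))
}

# Toy stand-in for tempsta: 1920 months x 3 stations starting in 1850.
x  <- ts(matrix(rnorm(1920 * 3), ncol = 3), start = 1850, frequency = 12)
xr <- recenter(x)
base <- floor(time(x)) >= 1978 & floor(time(x)) <= 2008
```

Every column of `xr` now averages to zero over 1978–2008, so stations entering late are measured against the same reference window.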

## Carrick said

I went back and verified that computing the CRU average the way I (and GISTemp) do it gives a lower trend than the way CRU does it (i.e., the same method advocated by RomanM). The slope difference works out to 0.10°C/century, or about 10% of the century-long trend, right where my model predicted. If you start from 1960, the trend bias gets smaller, as I expected (though numerically it’s about a factor of two lower than I computed from my estimate of the Lambda function). I didn’t estimate the uncertainty introduced into the value by met noise; that may account for some or all of the discrepancy.

But what is clear from this is that if you don’t adjust for the missing 5×5 grid points, you will artificially inflate the temperature trend in the early part of the century, because of the higher mean latitude of these earlier data. Since 1960 it’s not so important; you can use either approach as long as you are consistent with the global mean temperature computed by the global climate models.
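The adjustment Carrick alludes to amounts to area weighting: on a 5×5 degree grid, weight each cell by cos(latitude) and renormalize over only the cells that actually contain data. A sketch in R, with a hypothetical flat anomaly field to show that the renormalized weights recover the correct mean even when a whole latitude band is missing:

```r
# Area-weighted global mean from a 5x5 degree grid. Cell area scales with
# cos(latitude); weights are renormalized over the cells that have data, so
# missing high-latitude cells do not drag the mean toward other latitudes.
lat_centers <- seq(-87.5, 87.5, by = 5)                # 36 latitude bands
w <- outer(cos(lat_centers * pi / 180), rep(1, 72))    # 36 x 72 cell weights

grid <- matrix(1, nrow = 36, ncol = 72)                # flat 1 C anomaly field
grid[1:6, ] <- NA                                      # pretend the far south is missing

ok    <- !is.na(grid)
gmean <- sum(grid[ok] * w[ok]) / sum(w[ok])            # area-weighted mean
```

For the flat field the weighted mean stays exactly 1 despite the missing cells; an unweighted mean over raw 5×5 cells would instead overcount the small polar cells whenever they are present.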

## Geoff Sherrington said

The conceptual basis for weighting and 2D areal or 3D volume distribution has long been handled by mining people. My inclination would be to get them involved.

One should not take a 5 x 5 lat-long grid cell size as useful just because someone else did some calculations on it. The grid size should be determined from the properties of the available data by methods familiar to miners (e.g. the semivariogram for spatial data). You might even have to go to 3D, particularly as some of the critical points around the South Pole are so high above sea level. It depends on your purpose in deriving a global average, which can be approached in different ways for different uses.

The optimum grid size will grow larger as the number of data points decreases. This interferes with the choice of stations used in the reference period if anomalies are used.

I’ve an aversion to anomaly methods. The purist in me says use K, the compromise is deg C, the least desirable is deltaT. Delta T has to assume some properties of the time period chosen for reference. In the simplest case, if you drop out 50% of global stations over several decades, do you also drop out 50% of stations in the reference period? What governs your choice of what to delete from the reference period? If you leave the reference period fixed in composition, then you change the effect of dropping out 50% of stations as in the march of the thermometers.
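Geoff’s semivariogram suggestion can be sketched in base R (the station field below is synthetic, invented for illustration): the empirical semivariance at lag h is half the mean squared difference between all pairs of stations roughly h apart, binned by distance.

```r
# Empirical semivariogram: gamma(h) = 0.5 * mean squared difference between
# pairs of values whose separation falls in each distance bin.
semivariogram <- function(x, y, z, breaks) {
  d   <- as.matrix(dist(cbind(x, y)))   # pairwise station distances
  g   <- 0.5 * outer(z, z, "-")^2       # half squared value differences
  iu  <- upper.tri(d)                   # count each pair once
  bin <- cut(d[iu], breaks)
  tapply(g[iu], bin, mean)              # mean semivariance per distance bin
}

# Synthetic stations on a unit square: smooth spatial field plus noise.
set.seed(7)
x <- runif(200); y <- runif(200)
z <- x + y + rnorm(200, sd = 0.1)
gam <- semivariogram(x, y, z, breaks = seq(0, 1.5, by = 0.25))
```

For spatially correlated data the semivariance rises with distance and the range where it levels off suggests how large a grid cell (or interpolation radius) the data can support.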

## Kenneth Fritsch said

I think it is important that we all know what we are looking at in terms of the GHCN temperature data set. It is linked here:

http://www.ncdc.noaa.gov/oa/climate/research/ghcn/ghcngrid.html

There has been confusion on the Version 1 or 2 being used by GHCN and this excerpt clearly says that V2 is being used:

How the anomaly method is used to apply the gridded station averages to obtain a gridded value is straightforward and described below:

A 5 x 5 grid is relatively large area-wise in most parts of the globe, which would make me question why a weighting procedure is not used within the grid to obtain an average. I am working with individual station trends within grids, and the variation amongst stations can be large. If the stations were uniformly distributed within the grid we would not need weighting, but such a distribution is going to be rare in my view.

What really frustrates me is that I do not understand the precaution noted in the excerpt below:

What does “treated months and grid boxes independently” mean??
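One simple form of the within-cell weighting Kenneth asks about is declustering: down-weight stations that sit in a tight cluster so the cluster does not dominate the cell average the way a plain mean would. A toy R sketch (the station positions and the 1-degree neighbor radius are hypothetical choices):

```r
# Crude declustering inside one 5x5 cell: three clustered stations reading
# 10 C and one isolated station reading 14 C.
lon  <- c(0.10, 0.20, 0.15, 4.5)
lat  <- c(0.10, 0.15, 0.20, 4.5)
temp <- c(10, 10, 10, 14)

d     <- as.matrix(dist(cbind(lon, lat)))
nnear <- rowSums(d < 1)             # neighbors within 1 degree (incl. self)
w     <- 1 / nnear                  # each clustered station gets 1/3 weight

cellmean <- sum(w * temp) / sum(w)  # weighted mean: 12
plain    <- mean(temp)              # plain mean:    11
```

The cluster of three effectively counts as one station, so the cell mean moves from 11 to 12; which answer is "right" depends on how representative the cluster is, which is exactly the spatial-statistics question Geoff raises above.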

## NicL said

Jeff

Am I right in thinking that you are treating all GHCN v2 station monthly mean temperature series with different Duplicate numbers, where the WMO number and Modifier are the same, as relating to different stations?

Peterson’s July 1997 article in the AMS Bulletin, “An Overview of the GHCN Temperature Database”, describing GHCN v2, states that the GHCN v2 mean dataset contains multiple versions of many stations, so I think that the various duplicate series are all thought to belong to the same station.

It is certainly possible that in some cases GHCN may be mistaken in believing that the multiple series (given different duplicate numbers, but the same WMO number and modifier) in fact all relate to the same station.
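If one did want to collapse duplicates into a single station record rather than keep each series as its own column, a sketch in R might look like this (the id encoding and the toy matrix are hypothetical, invented for illustration, not GHCN’s actual format): strip the duplicate digits from each column’s id and average the columns that share what remains.

```r
# Toy data matrix: two duplicate series of one station plus one other station.
# Column names encode (station id * 100 + duplicate number), a made-up scheme.
tempsta <- matrix(c(1, 2, NA,  1.2, NA, 3,  5, 5, 5), ncol = 3)
colnames(tempsta) <- c("1234501", "1234502", "6789001")

station <- as.numeric(colnames(tempsta)) %/% 100   # drop the duplicate digits
merged  <- sapply(split(seq_len(ncol(tempsta)), station),
                  function(cols) rowMeans(tempsta[, cols, drop = FALSE],
                                          na.rm = TRUE))
```

The duplicates fill each other’s gaps: months reported by either copy survive in the merged column, and months reported by both are averaged.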

## Nick Stokes said

I think it just means that they calculated station anomalies independently, by month. This is the best, if enough stations have enough data in the reference period (hence the reference to removal of duplicates, which gives longer data sequences). But they seem to discard stations which fail this test, which is a cost.

They are contrasting with the Reference Stations Method, a variant of which is used by Gistemp. There the process of anomaly formation uses a hierarchy of stations within a grid cell (not independent). It makes better use of the whole set of stations (few or no discards), but does not get independent station anomalies, so can’t be used well for resolution within cells.

If that is what they mean, it is not well described by the phrase “treat months and grid boxes independently”.

## Carrick said

Jeff and crew,

I’ve had a chance to generate the % land as a function of latitude.

Here is my result.

I’ve broken it down by all, land and land above different elevations.

Here are the cumulative averages over the Earth. First column is “what”, second column is percent.

Ocean 71.1

Land 28.9

Above 1000 7.5

Above 2000 3.2

Above 3000 1.5

Above 4000 0.5

Above 5000 0.1

The mean latitude of the land mass is at 15.5°N. Note this number is very close to the gridded mean CRUTEM post-1950 (see this.). No doubt they checked this.