Global Gridded GHCN Trend by Seasonal Offset Matching

This post builds a global gridded temperature series from GHCN data using Seasonal Offset Matching.  RomanM’s recent posts on the topic might even become a standard method in climate science; certainly the Met Office should at least pay attention to his methods if they are planning on redoing the dataset.  Of course, this example uses the sorely lacking GHCN dataset, full of station dropouts, weak metadata and urban warming, but it provides a good example set to apply the methods to.  In my previous versions of the GHCN work, the code spent a fair amount of time determining whether the data came from the same instrument or not; this method does a good job of combining the station info either way.  With that said, this is still a preliminary result, but it is getting closer to a finished set.

The way the algorithm is applied here, stations with multiple copies of the same data are weighted more heavily in the averaging.  Also, this temperature series is for land-based data only.

Roman’s station combination code is explained here; if you’ve not read up on it, that post is the best description to date of what he has proposed.

The concept is to estimate twelve monthly offsets for each temperature series when combining data, making the best use of the available overlap when merging multiple incomplete series.  The function he wrote is deceptively simple.
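As I read Roman’s posts, the model being fit for station $s$, month $m$ and year $y$ is roughly

$$T_{s,m,y} = \tau_{m,y} + \mu_{s,m} + \varepsilon_{s,m,y}$$

where $\tau_{m,y}$ is the common monthly temperature and $\mu_{s,m}$ is a station- and month-specific offset. The offsets are estimated by least squares one month at a time and centered to mean zero across stations (the c(offs)-mean(offs) line below), which pins down the single arbitrary additive constant in the solution.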


calc.offset = function(tdat) {
  delt.mat = !is.na(tdat)                # indicator of available data (rows = years, cols = series)
  delt.vec = rowSums(delt.mat)           # number of series reporting in each year
  co.mat = diag(colSums(delt.vec*delt.mat)) - (t(delt.mat)%*%delt.mat)    # reduced normal equations for the offsets (common yearly terms eliminated)
  co.vec = colSums(delt.vec*delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt.mat)    # right-hand side
  offs = psx.inv(co.mat)%*%co.vec        # solve the (singular) system with a pseudo-inverse
  c(offs)-mean(offs)}                    # return offsets centered to mean zero
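One piece not shown above is psx.inv, the pseudo-inverse routine Roman uses to solve the normal equations, which are singular by exactly one dimension.  If you don’t have his version handy, a minimal SVD-based Moore-Penrose stand-in (my sketch, not necessarily his exact code) does the same job:

psx.inv = function(mat, tol = sqrt(.Machine$double.eps)) {
  # Moore-Penrose pseudo-inverse via SVD -- stand-in for Roman's psx.inv,
  # which may handle tolerances differently
  s = svd(mat)
  pos = s$d > tol * max(s$d)             # keep only non-negligible singular values
  inv.d = ifelse(pos, 1/s$d, 0)          # invert those, zero out the rest
  s$v %*% diag(inv.d, length(inv.d)) %*% t(s$u)
}

MASS::ginv(co.mat) would also work if you prefer a packaged version.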

### generally assumes that tsdat starts in month 1 and is a time series.

monthly.offset = function(tsdat) {
  dims = dim(tsdat)
  off.mat = matrix(NA,12,dims[2])        # one offset per month per series
  dat.tsp = tsp(tsdat)
  for (i in 1:12) {
    dats = window(tsdat,start=c(dat.tsp[1],i), deltat=1)   # month i across all years
    offsets = calc.offset(dats)
    off.mat[i,] = offsets
  }
  colnames(off.mat) = colnames(tsdat)
  rownames(off.mat) = month.abb
  nr = dims[1]
  nc = dims[2]
  matoff = matrix(NA,nr,nc)
  for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)  # expand the 12 offsets to the full series length
  temp = rowMeans(tsdat-matoff,na.rm=T)  # combined series: mean of the offset-corrected data
  pred = c(temp) + matoff                # fitted values for each input series
  residual = tsdat-pred
  list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12), pred = pred, residual = residual, offsets = off.mat)
}
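A quick synthetic check (toy data I made up, not part of the GHCN work) shows what the function returns.  Two series tracking the same signal with different fixed biases and only partial overlap should combine into one clean series, with offsets near the true differences:

# toy example: two fake stations reading the same signal with different biases
set.seed(1)
yrs = 30
signal = ts(rep(10*sin((1:12)*pi/6), yrs) + rnorm(12*yrs, sd=0.5), start=1950, freq=12)
stn1 = signal + 1.3          # reads 1.3 C high
stn2 = signal - 0.7          # reads 0.7 C low
stn1[1:60] = NA              # first station starts five years late
stn2[301:360] = NA           # second station ends five years early
res = monthly.offset(ts.union(stn1, stn2))
round(res$offsets, 2)        # roughly +1.0 and -1.0 (mean zero within each month)
plot(res$temps)              # combined series tracks the common signal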

This version, which includes Roman’s own hand-written regression, doesn’t fail on some of the weird series with no overlapping months of data. I had already written the monthly station-combining code from my getstation efforts, so it was a simple matter to combine Roman’s new algorithm with my work and plot a complete GHCN global temperature trend.
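Two more helpers in the gridding code below come from my earlier getstation posts and aren’t reproduced here: getallcurves(id), which pulls every raw GHCN record for a station ID as a monthly time-series matrix, and calc.anom(), which converts a series to anomalies.  For anyone following along without those posts, a simple stand-in for calc.anom that removes each calendar month’s mean over the full record (the posted results may use a different base period) will get you close:

calc.anom = function(x) {
  # stand-in anomaly calculation: subtract each calendar month's mean
  x = as.ts(x)
  anom = x
  for (m in 1:12) {
    idx = (cycle(x) == m)
    anom[idx] = x[idx] - mean(x[idx], na.rm=TRUE)
  }
  anom
}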

gridval=array(NA,dim=c(length(sid),2))

#place each station in its 5x5 degree grid cell
for(i in 1:length(sid))
{
	gridval[i,1]=as.integer((rawcoord[i,2]+90)/5)+1  #lat
	gridval[i,2]=as.integer((rawcoord[i,1]+180)/5)+1  #lon
}
gridtem=ts(array(NA,dim=c(36,72,161*12)),start=1850,deltat=1/12)   # 36 lat x 72 lon cells, monthly 1850-2010
for(ii in 1:72)
{
	for(jj in 1:36)
	{
		maska=gridval[,2]==ii
		maskb=gridval[,1]==jj
		maskc= maska & maskb   # stations that fall in this 5x5 degree cell

		if( sum(maskc)>0)
		{
			# one column per station in the cell, monthly from 1850
			tar=array(NA,dim=c(161*12,sum(maskc)))
			tar=ts(tar,start=1850,deltat=1/12)
			kk=1
			for(i in sid[maskc])
			{
				# all raw GHCN records for this station id (from my getstation code)
				rawsta = getallcurves(as.numeric(i))
				if(ncol(rawsta)>1)
				{
					# several copies/records: combine them with the seasonal offset method
					sta=monthly.offset(rawsta)$temps
				}else{
					sta=rawsta
				}

				ind=which(tinv[,2]==i,arr.ind=TRUE)

				if(max(time(sta))>1850)
				{
					# align the station series to the 1850 start of the grid array
					sta=window(sta,start=1850)
					index=as.integer(((time(sta)-1850)*12+.02)+1)
					tar[index,kk]=as.vector(sta)
				}
				kk = kk+1
			}
			if(ncol(tar)>1)
			{
				# combine the stations in the cell, then convert to anomalies
				gridtem[jj,ii,]=calc.anom(monthly.offset(tar)$temps)
			}else{
				gridtem[jj,ii,]=calc.anom(tar)
			}
			print(jj)
			plot(gridtem[jj,ii,],type="l")

		}
	}
	print ("COLUMN")
	print(ii)
}

glav=rep(NA,161*12)
weight=sin(seq(2.5,180, 5)*(pi)/180)   # area weight per latitude row: sin(colatitude) at cell centers
meanweight=mean(weight)
weight=weight/meanweight               # normalize the weights to average 1
#weight=rep(1,36) # calculate unweighted temp
mask=rep(FALSE,36)
mask[1:18]=TRUE                        # southern hemisphere rows

# southern hemisphere average
for(kk in 1: (161*12))
{
	mm=rep(NA,72)
	for(ii in 1:72)
	{
		mm[ii]=mean(gridtem[mask,ii,kk]*weight[mask],na.rm=TRUE)  # weights subset to match the masked rows
	}
	glav[kk]=mean(mm,na.rm=TRUE)
}
glava=glav

# northern hemisphere average
mask=!mask
for(kk in 1: (161*12))
{
	mm=rep(NA,72)
	for(ii in 1:72)
	{
		mm[ii]=mean(gridtem[mask,ii,kk]*weight[mask],na.rm=TRUE)
	}
	glav[kk]=mean(mm,na.rm=TRUE)
}

glavb=glav

#glav=(glava+glavb)/2

Again, the code is preliminary, but most of the bugs have been worked out over time. It calculates a 5×5 degree gridded temperature series directly from the GHCN data without any hidden steps, employing a unique and publishable method by RomanM for combining temperature series and computing anomalies, and results in the following temperature plots.

The trend above looks pretty reasonable for a GHCN dataset.  I expanded it to include 1850 to present below.

The trend since 1978 is below.  Plenty of warming.  Once again, 1998 is not the highest year in the complete GHCN dataset.

Global trend since 1900 plotted by gridcell.



Global temperature trend since 1978
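For anyone who wants to put a number on the 1978–present trend, a least-squares slope can be pulled from the series computed above.  This is a sketch, assuming the two hemispheric averages are combined with equal weight (the commented-out line in the code), not necessarily how the plotted trend lines were generated:

# sketch: equal-weight the hemispheric averages and fit an OLS trend from 1978 on
glav = (glava + glavb)/2
glts = ts(glav, start=1850, deltat=1/12)
recent = window(glts, start=1978)
fit = lm(recent ~ time(recent))
coef(fit)[2] * 10            # slope in deg C per decade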


29 thoughts on “Global Gridded GHCN Trend by Seasonal Offset Matching”

  1. Nice work!

    These sorts of posts are great as “see, it’s not so hard to show your work” exemplars (and usually when you know something is for public consumption, you write better, more thoroughly commented and documented stuff anyway).

    Thanks to you both, keep it up.

  2. omg, you’re using the adjusted GHCN data. Don’t even need a link to your data source to see that (though a link would be nice). Drop by justdata.wordpress for bash scripts to parse the raw data files and a link to the unadjusted temps.

    So sad, that so many people have wasted time working with that junk. “Scientists” start with the same padded & fluffed data. Never mind that Jim Hansen is lead author on the paper used to justify the adjustments…

  3. That is great work.

    Next some joker will ask for error bars.

    A while back ( like 2 years ) Roman discussed with me the way Brohan jones 06 handled the grids that had
    part ocean and part land.. I recall that he had issues with that as well..

    So when you get to including SSTs that will be a fun discussion.

  4. Jeff
    I don’t know anything about code, but remember some trigonometry – is
    weight=sin(seq(2.5,180, 5)*(pi)/180) the bit where you adjust for the relative size of the 5 x 5 cells?

  5. #5, Why would you assume I’m using adjusted data? This is what they call raw, although Ryan has pointed out even the raw isn’t really raw.

    #7, Yup.

  6. #6 Steven

    A while back ( like 2 years ) Roman discussed with me the way Brohan jones 06 handled the grids that had
    part ocean and part land.. I recall that he had issues with that as well..

    Your memory is better than I thought! Yes, rather than combining grids with both land and water using a fixed pair of weights (such as relative area contribution), they “minimise” the variance of the end result by combining with weights related to the uncertainties of the estimates of the two components. Although it sounds like a good thing to do, it means that these weights can vary from month to month, making grid anomalies at different times incomparable and introducing a bias into the end result.

    It is like trying to estimate a combined mean of some variable for two equal size populations (that differ drastically from each other). However, 90% of my sample is taken from population A and 10% from B. Instead of weighting the two calculated averages equally to calculate a combined average, since my error bars are so much smaller for group A, I weight them 90% and 10% instead.

    Learning some stratified sampling stuff could have been helpful for them…

  7. #9 My understanding is there are huge issues with grid cells containing water that is covered in ice part of the year. This plays havoc with the averages if the time of the ice advance/retreat changes.

  8. Jeff,

    Nice work. It’s interesting to note that the net effect of the adjustments is to lower the global trend in recent years for the major series (GISS/NCDC/HadCRUT), since they all have a slope of ~0.16 C per decade for the period 1978-present (versus 0.233 for the raw data).

    It would be interesting to see how the Tamino/Roman method of station combination affects the trend vis-a-vis the reference station method used by GISS.

  9. #13 – Zeke, I would not get too excited. Many of the dubious adjustments that people find by pulling up individual stations seem to involve adding cooling to the earlier record rather than adding warming to the recent record. This would have the effect of making the 1940 peak look lower than it should be.

    Also, UAH has a trend of .13 degC/decade. That may be closer to the true trend once UHI is taken into account.

  10. #9

    Of course I remember! Dr. M, you ever have one of those students who waved his hand wildly with a question he thought was smart?
    And maybe 1 time out of 100 it was smart? I’m that dude, so of course I’d remember that.

  11. Important to note here, an Anti-Groupthink behavior;

    Jeff did publish a graph which is not notably dissimilar from some Jones-esque graphs.

    This would seem to be a prime candidate for Self-Censorship if Jeff was constrained by GroupThink pressure.

    Instead, it seems to be a good example of
    Derive/Define the Analytical Method,
    then look unflinchingly at the results.

    Integrity on Parade…
    RR

  12. I agree with #17. I find it rather difficult to derive basic mathematical methods from uncommented code. I would like to see an exact analytical description of how weighting is performed before delving into code.

  13. Having followed the links after my last post, this seems to be a suspect method. There does not seem to be a proper physical assumption behind the method, and estimating the offsets (by a minimisation procedure, I assume) is likely to be ill-conditioned due to the large number of unknowns. This underlies the comment that there are many possible solutions for the offsets. Should there be a geographical weighting between stations?

    It simply reinforces my view that any amount of processing lousy data, unless it had been degraded in a very precise way, is not going to produce good estimates of the underlying process.

  14. It simply reinforces my view that any amount of processing lousy data, unless it had been degraded in a very precise way, is not going to produce good estimates of the underlying process.

    Yeah, being successful at ill-posed problems depends heavily on knowing the forward model and the noise process pretty well.

  15. #20 RC S:

    I don’t know what “proper physical assumption” you are looking for. Perhaps you should tell us. I presume that you saw the posts here for the model and here for some of the unbiasedness properties.

    This is an estimation problem. This particular solution treats each station equally. The method deals with the problems created by missing data in the various series and has some very good properties. If you want weighted stations, that is possible within this context.

    The calculation methodology is proper statistical Analysis of Variance. The “non-uniqueness” consists of a single additive constant which could be added to each of one set of coefficients and subtracted from each coefficient in the other set. Ill-conditioned? I don’t think so. I used a pseudo-inverse to simply avoid choosing a single superfluous equation to eliminate when solving for the coefficients.

    I have no idea what you mean by

    It simply reinforces my view that any amount of processing lousy data, unless it had been degraded in a very precise way, is not going to produce good estimates of the underlying process.

    I would be interested in hearing an explanation.

  16. RomanM said:

    I have no idea what you mean by

    It simply reinforces my view that any amount of processing lousy data, unless it had been degraded in a very precise way, is not going to produce good estimates of the underlying process.

    I would be interested in hearing an explanation.

    I thought he was talking about something along the lines of deconvolution: if you know that your signal was blurred with a Gaussian, it’s pretty easy (if numerically ill-posed); if the ‘degradation’ process was more complex than that, though…

  17. The rawest of global temperature data that I have been able to find is from the Oak Ridge DoE CDIAC files TR022 and TR027, two thick books that go to 1986. They have been put into machine form as described at the Warwick Hughes blog http://www.warwickhughes.com/blog/?p=510

    Anyone who uses adjusted data does so at the mercy of the adjusters, of course.

    It was interesting to see a partial explanation for one of my hot topics, the global hot year of 1998.

    Roy Spencer (07:31:23) noted:

    “Steve Goddard: the January 1998 warm spike in the satellite data compared to surface data is because peak El Nino warmth is in the eastern tropical and subtropical Pacific, which is poorly sampled by thermometers. The same is true during most if not all El Nino years.”

    An alternative or perhaps complementary explanation comes from a recent, fascinating blink graph at a site I can’t find now (bookmark went wrong), where the hot 1998 was compared to the coldest year early in the 1900s, month by month. The peak maxima and minima were roughly the same; it’s just that 1998 had a longer, warmer spring and autumn. How does GHG do that? How does El Nino do that? From where does the troposphere gain heat? Of course, large or medium monthly excursions from the customary annual pattern can upset infilling.

  18. I would like to see this repeated using rural stations only.
    From what I have seen elsewhere (WUWT I think),
    using rural stations only, plus your method, would almost eliminate ALL warming.

    In that case, it would certainly be worse than they thought!
