CRU #2 – Why we look at data and code
Posted by Jeff Condon on December 30, 2009
Sorry for the lack of posting lately. I’ve spent the holiday with the family, reading books and playing with the kids. In the background I’ve been continuing the exploration of the CRU stations available from the GHCN dataset. The purpose was just to continue exploring how the global temperature series are created.
First, there are 1643 series of raw data in GHCN with 6-digit IDs that correspond to the 1741 stations CRU listed recently on their website. I got this number by having software search through the GHCN station IDs for a match to CRU. My number is slightly different from the station count reported at CA last week, and there should be no discrepancy. I can’t find any error in the code, but this is a post about general differences over a lot of data, so errors in a few stations won’t cause much trouble. This result means GHCN has data for all but 98 of the CRU stations recently published (1741 - 1643 = 98). One question is: how representative of the full GHCN dataset are these 1643 series?
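The matching step can be sketched roughly as follows. This is a minimal Python illustration, not the R code used for this post, and it assumes both ID lists have already been reduced to comparable strings. The function name and the toy IDs are hypothetical.

```python
def match_station_ids(cru_ids, ghcn_ids):
    """Split CRU IDs into those found in GHCN and those not found."""
    ghcn_set = set(ghcn_ids)  # set lookup keeps the search fast
    matched = [sid for sid in cru_ids if sid in ghcn_set]
    missing = [sid for sid in cru_ids if sid not in ghcn_set]
    return matched, missing


# Toy IDs for illustration only, not real station numbers
cru_ids = ["100010", "100020", "100030"]
ghcn_ids = ["100010", "100030", "100040"]
matched, missing = match_station_ids(cru_ids, ghcn_ids)
print(len(matched), "matched;", len(missing), "missing")
```

Printing the `missing` list is a cheap way to chase a count discrepancy, since it names the exact stations that failed to match.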
The R software presented below collects data for each of the 1643 CRU stations from GHCN rather than from the “raw” data presented by CRU. In GHCN, many individual station IDs have multiple sub-stations. The software here simply averages all the sub-stations for a particular ID number and creates a single time series for each station number. Admittedly, this method has very little QC involved and needs improvement, but it should give us some baseline understanding of the data as a whole.
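For readers who want to see the sub-station averaging idea in isolation, here is a minimal Python sketch. It is not the R code from this post. The record layout (station ID, sub-station, year, month, temperature) and the use of `None` for missing values are assumptions made for the example.

```python
from collections import defaultdict


def average_substations(records):
    """Collapse sub-stations into one monthly series per station ID.

    records: iterable of (station_id, substation, year, month, temp),
    where temp is None when the month is missing (assumed layout).
    Returns {station_id: {(year, month): mean over reporting sub-stations}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for station_id, _substation, year, month, temp in records:
        if temp is None:
            continue  # missing months do not contribute to the mean
        cell = sums[station_id][(year, month)]
        cell[0] += temp
        cell[1] += 1
    return {
        sid: {ym: total / n for ym, (total, n) in months.items()}
        for sid, months in sums.items()
    }


# Toy records for illustration only
records = [
    ("A", 0, 1950, 1, 10.0),
    ("A", 1, 1950, 1, 12.0),
    ("A", 1, 1950, 2, None),
    ("B", 0, 1950, 1, 5.0),
]
print(average_substations(records))
```

A plain mean like this does no QC. If sub-stations cover different periods and sit at different base temperatures, the combined series can pick up artificial steps where one of them starts or stops.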
The 1643 individual station series are then averaged by month to create the graph below (2-year Gaussian filter). Please ignore the trend lines in the following graphs; they have little meaning and were left over from other plots.
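The averaging and smoothing can be sketched like this. It is a Python illustration under stated assumptions, not the R code behind the figures. In particular, the exact filter definition used for the plots isn't given here, so the `sigma` value of 12 months is only a placeholder, and the function names are hypothetical.

```python
import math


def monthly_mean(station_series):
    """Unweighted mean across stations for each (year, month).

    station_series: {station_id: {(year, month): value}}
    """
    totals = {}
    for series in station_series.values():
        for ym, value in series.items():
            total, n = totals.get(ym, (0.0, 0))
            totals[ym] = (total + value, n + 1)
    return {ym: total / n for ym, (total, n) in sorted(totals.items())}


def gaussian_smooth(values, sigma):
    """Gaussian-weighted moving average over a list of monthly values.

    Weights are truncated at 3 sigma and renormalized at the ends of the
    series and across gaps (None), so the output has the same length.
    """
    half = int(3 * sigma)
    out = []
    for i in range(len(values)):
        wsum = 0.0
        vsum = 0.0
        for j in range(max(0, i - half), min(len(values), i + half + 1)):
            if values[j] is None:
                continue
            w = math.exp(-0.5 * ((j - i) / sigma) ** 2)
            wsum += w
            vsum += w * values[j]
        out.append(vsum / wsum if wsum > 0 else None)
    return out


# Toy data for illustration only
series = {"A": {(1950, 1): 1.0, (1950, 2): 3.0}, "B": {(1950, 1): 3.0}}
means = monthly_mean(series)
smoothed = gaussian_smooth(list(means.values()), sigma=12.0)
print(means, smoothed)
```

Renormalizing at the ends keeps the curve from sagging toward zero, but the end points are then averages of fewer months and should be read with more caution.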
The station count for each month of each year is important.
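Counting the stations that report in each month is simple once each station has a single series. Here is a small Python sketch, not the R code from this post, that assumes each station is stored as a dictionary keyed by (year, month) with missing months simply absent.

```python
from collections import Counter


def station_count(station_series):
    """Number of stations reporting a value in each (year, month).

    station_series: {station_id: {(year, month): value}} (assumed layout)
    """
    counts = Counter()
    for series in station_series.values():
        counts.update(series.keys())  # each station counts once per month
    return dict(sorted(counts.items()))


# Toy data for illustration only
series = {
    "A": {(1950, 1): 1.0, (1950, 2): 3.0},
    "B": {(1950, 1): 3.0},
}
print(station_count(series))
```

Plotting this next to the average shows which parts of the curve rest on only a handful of stations.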
You can see that the spike at the most recent end of Figure 1 is based on very few stations, as are the temperatures pre-1900. Overall, Figure 1 has an upslope which is visually quite similar to CRU Figure 3, which seems to provide initial verification of CRU accuracy. Remember, in a previous post we recreated the CRU gridded average from the code and data presented. Not a bad match between Figures 3 and 1, I think.
GHCN provides adjusted versions of its data as well. The adjusted version supposedly corrects for steps, time-of-day biases and such. Unfortunately, not all stations have an adjusted version. Below is an unweighted average of the 917 stations with adjusted GHCN data that match the CRU codes released.
It’s odd that the upslope since 1970 basically disappears from adjusted data. Not at all what we’ve grown to expect from climate science in general, but again it may have to do with regional differences on these selected stations and the net correction may still be positive.
Figure 5 is the station count per year for the adjusted data in Figure 4. Note the huge drop-off in stations in recent years. Such a serious drop-off has to have an effect on the station trend, but of course we don’t know how much yet. It certainly could be responsible for some of the differences between Figures 4 and 1, but it is a pretty big difference.
So the next question I had was: how representative are the 1643 CRU choices of the original 4495 individual GHCN surface station IDs available? Again, these are not area weighted and some necessary corrections are missing, so the trend itself is not particularly reliable. Also, certain areas (such as the US) are guaranteed to be oversampled in relation to the ‘developing’ nations (hahaha, can’t resist), but it probably gives some indication of what the subsample should look like.
The upslope in Figure 6 is again non-existent compared to the stations CRU chose to use. It doesn’t mean a huge amount at this point without area gridding taken into account, but it looks a little fishy to me that there is such a large difference between the chosen GHCN stations and the available GHCN dataset. It’s also interesting that the adjusted version by scientists at GHCN (Figure 4) doesn’t produce the same hockeystick as the raw data selected by CRU. The GHCN raw (Figure 6) matches the adjusted Figure 4 much better than CRU Figures 1 or 3.
Station count (by averaged station ID) for all GHCN.
Finally, just to check the total of the available adjusted GHCN version, I plotted Figure 8.
The adjusted data has a bit of a downslope compared to the raw, again due mostly to too few series in history, but besides that it’s far more similar to raw than the CRU selections. The next step should be gridded averaging of the above data to see what results we get.
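As a rough idea of what gridded averaging involves, here is a minimal Python sketch for a single month. It is only an illustration under assumptions: the 5-degree cell size, the cosine-of-latitude cell weights and the function name are my choices for the example, not a reproduction of the CRU method.

```python
import math
from collections import defaultdict


def gridded_mean(stations, cell_deg=5.0):
    """Area-weighted mean of station values for one month.

    stations: iterable of (lat, lon, value), degrees north and east.
    Stations are first averaged within each lat/lon cell, then the cell
    means are weighted by cos(latitude of the cell center).
    Returns None if there are no stations.
    """
    n_rows = int(180.0 / cell_deg)
    n_cols = int(360.0 / cell_deg)
    cells = defaultdict(list)
    for lat, lon, value in stations:
        # Clamp so lat = 90 and lon = 180 fall in the last cell
        row = min(math.floor((lat + 90.0) / cell_deg), n_rows - 1)
        col = min(math.floor((lon + 180.0) / cell_deg), n_cols - 1)
        cells[(row, col)].append(value)
    wsum = 0.0
    vsum = 0.0
    for (row, _col), values in cells.items():
        center_lat = -90.0 + (row + 0.5) * cell_deg
        w = math.cos(math.radians(center_lat))
        vsum += w * sum(values) / len(values)
        wsum += w
    return vsum / wsum if wsum > 0 else None


# Toy example: three stations crowded into one cell, one station alone
stations = [
    (40.1, -100.1, 1.0),
    (41.0, -101.0, 1.0),
    (42.0, -102.0, 1.0),
    (41.0, 10.0, 0.0),
]
print(gridded_mean(stations))
```

In the toy example the plain station average would be 0.75, while the gridded value comes out at about 0.5, because the three crowded stations count as one cell. That is the oversampling problem gridding is meant to reduce.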
I’ve copied the code below in the first comment. WordPress doesn’t allow uploading a linked text file, so you need to fix the fancy quotes and download the GHCN data to make it work.
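If you would rather not fix the quotes by hand, a few lines of Python can do it on the pasted text. This is just a convenience sketch; the function name is made up, and it only handles the four common curly quote characters.

```python
def straighten_quotes(text):
    """Replace curly single and double quotes with plain ASCII quotes."""
    table = str.maketrans({
        "\u201c": '"',  # left double quote
        "\u201d": '"',  # right double quote
        "\u2018": "'",  # left single quote
        "\u2019": "'",  # right single quote / apostrophe
    })
    return text.translate(table)


# Example line of pasted code with WordPress-style quotes
print(straighten_quotes("x <- \u201cabc\u201d"))
```

Paste the comment text into a string or read it from a saved file, run it through the function, and save the result before running it in R.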