the Air Vent

Because the world needs another opinion

Introduction to RghcnV3 – Steve Mosher

Posted by Jeff Id on July 18, 2011

Steve has put together a high-quality series of algorithms for combining surface station data in R.  He has combined methods from a wide variety of sources, with the following example incorporating ideas from my favorite blog by Tamino.  This is the kind of science I like: open source, nothing hidden, no motivations, just basic numbers combined in a reasonable fashion.  I do hope that these packages get recognized for their value, as they have quite a bit of power for general climatology use.  — Jeff

Today I’m going to show you some of the basics of the RghcnV3 R package. Version 1.2 is staged for release on CRAN (http://cran.r-project.org/) and should be going live any day now. If you can’t wait, just ping me and I’ll ship you the sources, which will work on any OS.
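Once it appears, installation should be the usual CRAN route; here is a quick sketch that also loads the supporting packages used later in this post:

install.packages("RghcnV3")
library(RghcnV3)
library(zoo)     # time series objects
library(maps)    # map.where(), used below to pick out Texas stations
library(raster)  # extent(), used below for bounding boxes and plotting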

First, a few preliminaries. The package is designed as a set of common functions that a programmer can use to quickly write his own studies. In this example I’ve gone very light on the graphics code. I could spend hours crafting graphics functions dedicated to these problems, but in the end it’s very hard to parameterize or abstract that part of the analysis. In short, for now, you have to make your own pretty graphs. What I’ll show here is just basic sanity checks.

The data this package is focused on is GHCN v3, which has just come out of beta. We start by downloading the data.

files <- downloadV3()

For every type of file you want to download in RghcnV3 there is a corresponding download function, cleverly named download*. Each of these functions has defaults:

downloadV3(url = V3.MEAN.ADJ.URL, directory = getwd(), overwrite = TRUE, remove = FALSE)

The functions download the data and unpack it using the R.utils package. Hands free! The download functions are designed to return the name(s) of the uncompressed file(s).
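For example, staying with the parameters shown in that signature, you could point the download somewhere other than the working directory (the "ghcn" path is just an illustration):

downloadV3(directory = "ghcn")   # download and unpack under ./ghcn instead of getwd()

Printing the list returned by the default call above shows what you get back: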

> files

$DataFilename

[1] "C:/Users/steve/Documents/GHCNPackage2/ghcnm.v3.0.0.20110714/ghcnm.tavg.v3.0.0.20110714.qca.dat"

$InventoryFile

[1] "C:/Users/steve/Documents/GHCNPackage2/ghcnm.v3.0.0.20110714/ghcnm.tavg.v3.0.0.20110714.qca.inv"

$Date

[1] "2011-07-14"

For every file type there is also a function, cleverly named read*:

v3Mean <- readV3Data(filename = files$DataFilename)

v3Inv <- readInventory(filename = files$InventoryFile)
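A quick sanity check on what came back never hurts; this is plain R, assuming the returned objects behave like ordinary data frames:

dim(v3Mean)   # one row per station-year
nrow(v3Inv)   # one row per station in the inventory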

GHCN v3 has a new format with 51 columns of data; they’ve added QC flags to every temperature record. To maintain compatibility with old code that works with the version 2 format, readV3Data() outputs a data structure that looks just like the V2 format: 14 columns (id, year, and 12 months of data). Next, we want to trim or window the temperature data, because it starts in 1801:

startYear <- 1900

endYear <- 2010

v3Mean <- windowV3(v3Mean, start = startYear, end = endYear)

If you are familiar with the "window" command from working with time series data, you’ll understand what this does: windowV3() "windows" a v3 object.
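For readers who haven’t met window() before, base R’s version does the same kind of trimming on an ordinary time series; a tiny self-contained illustration:

x <- ts(rnorm(24), start = 2000, frequency = 12)   # two years of monthly noise
window(x, start = 2001)                            # keep only 2001 onward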

For this study I’m going to use Tamino’s regional station combining algorithm, so I have to select a region of the world and get the stations for that region. That means working with the inventory data, which contains location information. I have a couple of ways to do that. The first is a gross spatial "crop" of the inventory using the cropInv() command, which takes a geographical extent: extent(lonMin, lonMax, …). But I also have a much slicker way to skin this cat. Using the "maps" package, I’ll select those stations deep in the heart of Texas:

dex <- which(map.where("state", getLonLat(v3Inv)) == "texas")

TexasInv <- v3Inv[dex, ]

There are a few functions here to note. map.where() from the maps package takes a map database ("state") and a set of locations (x, y) and returns the name of the region each point falls in. getLonLat() is an RghcnV3 function which takes an inventory and returns the longitude and latitude. The line thus returns an index to those stations in v3Inv which are in the state of Texas. We then create an inventory of only those stations by using the index "dex" to select those rows of v3Inv: v3Inv[dex, ].
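For comparison, the coarser cropInv() route mentioned above would look something like this; the argument order is my assumption, so check the package help before relying on it:

texasBox <- extent(-107, -93, 25, 37)   # rough lon/lat bounding box around Texas
boxInv   <- cropInv(v3Inv, texasBox)    # keep every station inside the box

For the rest of this post, though, I’ll stick with the map-based TexasInv selection.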

We can map those stations easily; to do that I’ll use the extent() command from the raster package. Nothing fancy for this map, I’m just doing a sanity check on the method I used to select Texas stations. You can use the same method to select stations from various countries, although the country database is slow; I’ll probably add some custom work to make this fast. Don’t inventories have country codes? Yes, they do, so you can slog through that if you like: translate the ids into country codes and load a country-code table. NCDC does not use a standard (ISO or FIPS) country code. They should, and they should not encode the country code into the station id. If you want to work with things like states or provinces, a map-based approach is superior. Sometime in the future I’ll show how you can select stations in different "climate" regions using raster.

texas <- extent(-115, -85, 25, 40)

plot(texas)

points(getLonLat(TexasInv), col = "red")

map("state", add = TRUE)

Figure 1

After we select our stations from the inventory, we have to prepare the temperature data. The GHCN data is in a packed format that looks something like this (id, year, 12 monthly temperatures):

42512345000 1980 12 13 14 15 16 17 12 13 14 15 14 13

42512345000 1981 12 NA 14 15 16 17 14 13 14 15 14 13

42512345000 1983 12 13 14 15 18 17 12 13 14 15 14 13

You’ll note that 1982 is missing. What we need to do is turn these three rows of data into one vector of data, with the year 1982 having 12 missing values (NAs). Typically that’s done in a loop of sorts, which is slow in R, so there is a custom function to do this called v3ToZoo(). "zoo" is a package aimed at regular and irregular time series, so I used that to do the conversion. Essentially, v3ToZoo() takes the packed 14-column v3 data format and creates a two-dimensional object where every column contains one station’s data and every row contains data for a month. All months from start to finish are present.
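To see concretely what that means, here is a tiny zoo-only illustration (ordinary zoo code, not RghcnV3) built from the packed rows above; expanding onto a regular monthly index inserts twelve NAs for the missing year 1982:

library(zoo)
vals <- c(12,13,14,15,16,17,12,13,14,15,14,13,   # 1980
          12,NA,14,15,16,17,14,13,14,15,14,13,   # 1981
          12,13,14,15,18,17,12,13,14,15,14,13)   # 1983 (1982 absent)
idx  <- as.yearmon(rep(c(1980, 1981, 1983), each = 12) + (0:11)/12)
z    <- zoo(vals, order.by = idx)
full <- merge(z, zoo(, as.yearmon(1980 + (0:47)/12)))  # regular monthly index
length(full)       # 48 months in all
sum(is.na(full))   # 13: all of 1982 plus the single missing Feb 1981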

Temp <- v3ToZoo(v3Mean)

Let’s go a bit deeper into the zoo object:

time(Temp)

[1] 1900.000 1900.083 1900.167 1900.250 1900.333 1900.417 1900.500 1900.583 1900.667

The time variable is years and months in fractional representation. We have a row for every month and the frequency of the data is 12. If I don’t like fractional dates, I can coerce the index into year-months with the zoo command as.yearmon():

nrow(Temp)

[1] 1332

frequency(Temp)

[1] 12

as.yearmon(time(Temp))

[1] "Jan 1900" "Feb 1900" "Mar 1900" "Apr 1900" "May 1900" "Jun 1900" "Jul 1900"

[8] "Aug 1900" "Sep 1900" "Oct 1900" "Nov 1900" "Dec 1900" "Jan 1901" "Feb 1901"

What do the columns look like?

ncol(Temp)

[1] 7208

Every column is a station of data. Note, we ‘windowed’ the data to 1900 to 2010 so some early stations will be dropped from the collection of time series. And we can just plot temperatures by indexing into the columns of the object. Nothing pretty, just a sanity check:

plot(Temp[ , 1])

Figure 2

I’ve also added something during the conversion to zoo: every column name is now a station identifier:

colnames(Temp)[1:10]

[1] "10160355000" "10160360000" "10160390000" "10160395001" "10160400001" "10160402000"

[7] "10160403000" "10160419000" "10160425000" "10160425001"
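Since the column names are station ids, a single station can be pulled out by id rather than by position; for example, the first id in the listing above:

plot(Temp[, "10160355000"], main = "10160355000", xlab = "date", ylab = "Temp (C)")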

Those who prefer to use R’s time series objects can use the v3ToMts() function, which does the same thing but outputs a multiple time series object.

The stations in Temp number 7208; the stations in TexasInv number 100. The temperature data thus must be pruned to match the inventory data. That’s accomplished with the intersectInvZoo() function. Those familiar with R will recognize that this function wraps R’s intersect(), but unlike intersect() it takes objects of different types and operates on their shared element, the station ids. In an OOP design, the intersect function would just be overloaded to accept different object types. We pass an inventory and a zoo object to the function, and it returns two objects: a rectified inventory and a rectified zoo object.

DATA <- intersectInvZoo(TexasInv, Temp)

> identical(getStations(DATA$Inventory), getStations(DATA$Zoo))

[1] TRUE

So the function returns an Inventory and a temperature object where the stations are identical. Identical in order and identical in reference. “getStations()” is a function that returns the stations in the Object. It knows how to find the stations in whatever kind of object you pass it. There are a variety of “get*” functions that, as the name implies, “get” data from objects.

Now that the data is prepared we feed it to Tamino’s function:

regionAve <- regionalAverage(DATA$Zoo)

And Tamino’s function will return a list of three items: a list of zoo objects; a data.frame of that same data (if you don’t like zoo); and the offsets (see his algorithm for a description). I’ll probably reduce this to just a zoo object and the offsets, but for now you can use the data.frame if you like. You will note that it explicitly includes "time", or t; "num" is the number of stations.

> names(regionAve)

[1] "Zoo" "data" "offsets"

> names(regionAve$data)

[1] "t" "x" "num" "se" "std.dev"

> names(regionAve$Zoo)

[1] "x" "num" "se" "std.dev"

From here, plotting the regional average temperature is easy. Plotting just the first 20 years of data, we see the following:

plot(regionAve$Zoo[, 1], main = "Texas Heat", xlab = "date", ylab = "Temp (C)",
     ylim = c(0, 40), xlim = c(1900, 1920), col = "red", lwd = 2)

Figure 3

And looking at the station count we see this:

plot(regionAve$Zoo$num, main = "Stations", xlab = "date", ylab = "Stations", col = "red", lwd = 1)

Figure 4

To complete our picture, we turn the temperatures into anomalies and then create annual anomalies. Two more RghcnV3 functions get used: anomalize() allows you to set a base period, and annualize() creates an annual average. Options for these functions are not shown in this example, but there are optional settings for each, explained in the manual:

regionAnomaly <- anomalize(regionAve$Zoo$x)

regionAnnual <- annualize(regionAnomaly)

And a less than fancy plot:

plot(regionAnnual, main = "anomaly")

Figure 5

It’s also then easy to look at the correlation between the regional average and all 100 stations.

Y <- cor(x = coredata(regionAve$Zoo$x), y = as.matrix(coredata(DATA$Zoo)),
         use = "pairwise.complete.obs", method = "pearson")

And that gives us what we expect: the stations are highly correlated with the regional average.
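A quick way to see that, again nothing fancy, is to summarize and plot the correlations themselves (plain R):

summary(as.vector(Y))                 # distribution of station-vs-region correlations
hist(Y, main = "Correlation with regional average", xlab = "correlation")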

That’s a simple introduction to RghcnV3. The process for any analysis typically follows the same path: get data and inventory information, select time periods and inventory subsets, create a time series object, check it against the inventory, and compute. That process nominally looks like this:

files <- downloadV3()

v3Mean <- readV3Data(filename = files$DataFilename)

v3Inv <- readInventory(filename = files$InventoryFile)

v3Mean <- windowV3(v3Mean, start = startYear, end = endYear)

TexasInv <- v3Inv[which(map.where("state", getLonLat(v3Inv)) == "texas"), ]

Temp <- v3ToZoo(v3Mean)

DATA <- intersectInvZoo(TexasInv, Temp)

regionAve <- regionalAverage(DATA$Zoo)

The package can be downloaded from CRAN. Make sure you check back for newer versions as new features get added. Version 1.3 is under construction and there will be a few minor changes here and there (function names and variable names), but nothing major.


32 Responses to “Introduction to RghcnV3 – Steve Mosher”

  1. Andrew said

    I think the code is helpful for testing certain broad hypotheses about what does and does not have an effect on the large scale record. But while we are looking at small scale things like the state of Texas, there is perhaps an opportunity to do some examination of the individual records to check for spurious biases. I’m curious if Steven is interested in doing a Christy style analysis of various regions like was done in Alabama, California, and East Africa? I realize this may entail considerably more work than writing up code to aggregate and analyze global data, but based on the analyses that John has done, it could well be worth it.

  2. Steven Mosher said

    Sure, propose a hypothesis to test.

  3. Kenneth Fritsch said

    I have not had time to read this post in detail, but what I have read indicates that Mosher has provided an access for us do-it-yourselfers to individual station data that is not available in convenient form elsewhere. I use KNMI for conveniently obtaining temperature series, but obtaining loads of station data is not very convenient from that site. A bonus is that by going through his code I can learn more about R programming.

    Thanks, Steve.

  4. timetochooseagain said

    2-Lessee…Uh, I guess my hypothesis will be that individual stations when examined against large numbers of neighboring stations will contain un-explained shifts that can bias their trends. Oh, and I guess if you want a specific prediction, there will be plenty of warming in minimum temperatures and much less if any in maximum temperatures. Those were the consistent findings of John Christy’s papers. Actually, in John’s papers, he usually used much more data than available from GHCN, often going out and getting data keyed himself that was not collected yet. This was especially useful for doing homogeneity adjustment in the California study, where it was crucial not to use Valley data to adjust the Sierra data and vice versa, but in GHCN the nearest stations to the available Sierra stations were all in the irrigated valley which warmed quite a bit relative to the mountains. This suggests that one needs to isolate the conditions stations are under when doing homogeneity adjustment, so it is possible to isolate reasons for differential trends, rather than the adjustment spreading them around.

  5. Earle Williams said

    mosh,

    Curse you! Now I’ve got no excuse not to dig into the data and explore a few of my pet theories. What are your system specs for running R? I’m contemplating buffing up the RAM in my desktop computer, and am curious how the data and functions tax your system.

  6. Steven Mosher said

    #3 Kenneth

yes, it’s a toolkit for do-it-yourself types. basically all the horrible messy stuff like getting files, getting them loaded into the right data format, creating time series, anomalies, etc. is all taken care of.

Version 1.3 is just finishing up (I had bug reports to make upstream) and is probably worth the wait since there is an analyst guide as well.
Check my web site in a couple of days and I’ll release the source there.

  7. Steven Mosher said

    #5 earle, The memory requirements are not that great I used to run it on my old mac laptop with 2gb RAM.

    Obviously if you work with GIS data that exceeds that then you are swapping to disk a bunch of time, but for the temperature stuff
    that normally doesnt happen. If its slow then I gratefully take a pee, cigarette, coffee, read Lucia, comment on Judith break.

  8. Steven Mosher said

    “2-Lessee…Uh, I guess my hypothesis will be that individual stations when examined against large numbers of neighboring stations will contain un-explained shifts that can bias their trends. Oh, and I guess if you want a specific prediction, there will be plenty of warming in minimum temperatures and much less if any in maximum temperatures. Those were the consistent findings of John Christy’s papers. Actually, in John’s papers, he usually used much more data than available from GHCN, often going out and getting data keyed himself that was not collected yet. This was especially useful for doing homogeneity adjustment in the California study, where it was crucial not to use Valley data to adjust the Sierra data and vice versa, but in GHCN the nearest stations to the available Sierra stations were all in the irrigated valley which warmed quite a bit relative to the mountains. This suggests that one needs to isolate the conditions stations are under when doing homogeneity adjustment, so it is possible to isolate reasons for differential trends, rather than the adjustment spreading them around.”

    somebody need to learn how to frame a testable hypothesis.

    you seem to be saying this:

    1. take an individual station ( how to select them? randomly, all possible individual stations,)
    2. Compare it to a large number of neighbors: define large, define neighbor.
    3. Which dataset?

    Prediction: you will find “unexplained” shifts

    1. “unexplained” is not an observable. “unexplained” describes the explainers state of mind
    2. define “shift”. what constitutes a “shift” .1C? 10%, 1 SD?
    3. What do we expect from “random data” That is, given a set of random data how often will we find
    “unexplained” shifts ( hehe all the shifts are unexplained.

    Better prediction.
    1. “plenty of warming in the minimums
    A. define plenty, what kind of signal are you looking for?
    B. in the individual site? in all of them? what fraction? again how big is the effect you are looking for
    2. Much less in the maximum?
    A. define much less

    Adjustments to GHCN.

    You should keep up with the changes made to homogeniety

    In any case, frame up your hypothesis more clearly and provide a pointer to data sources. Define the method you want to use
    more clearly.

    Then, agree that you will accept the results of the test and change your beliefs if proven wrong.

    Steve

  9. Carrick said

    Good point, Steven. With 1/f^n noise, you get periods where the climate seems to “stick” around on one set point, then sudden shifts to a new set point. This by the way is a typical behavior of a complex system.

    This is globally averaged SST. It just gets worse as you limit yourself geographically (esp. just one station).

  10. timetochooseagain said

    I think the problem is that you are focusing on taking people’s “beliefs” and showing them their beliefs are wrong. I don’t have “beliefs” or expectations. I’m genuinely curious. I know it’s hard to believe but I don’t have a priori expectations before I see the data.

    But I think one should look at a station’s nearest neighboring stations that have similar characteristic environments (ie same use of local land, same terrain,etc.) by “large” I mean literally as many stations as possible, within about, say, 50 kilometers (this is about the radius of the “near” stations in the Alabama study). I think a “significant” shift would be one which is outside of two standard errors of the mean of the nearby stations during a specific period where the stations are thought to have been stable and have homogeneous data. “unexplained” means that the meta-data does not tell us what happened at that point that would account for a shift that time, but nevertheless we detect one. For instance 1994-95 there was a sudden warming of Athens, Alabama, relative to the three stations nearest to it, of 1.5 degrees in the JJA maximum temperature. There was no explanation for why it occurred in metadata but it was clearly spurious, so it’s effect was removed in John’s paper.

    As for data sources I think you can get a lot of data from GHCN. However I am not sure if more data will be needed. Certainly John used much more data.

    As for the max versus min, I was saying that you will find that minimum temperature trends are signficantly more positive than maximum temperature trends in many if not all cases, and maximum temperature changes will tend to be smaller, although how much I am unsure. I want to know how much, I don’t want to be told, “go on, guess, and I’ll show your wild guess is wrong” Well of course a wild guess is wrong. But is the picture qualitatively like I describe? I suspect it will indeed be.

    To get an idea of the kind of study I’d be interested in, see these papers:

    http://www.met.sjsu.edu/~wittaya/journals/WhenwasthehottestSummer.pdf

    http://journals.ametsoc.org/doi/pdf/10.1175/JCLI3627.1

    http://journals.ametsoc.org/doi/pdf/10.1175/2008JCLI2726.1

  11. Steven Mosher said

    version 1.3 is now verified

    http://stevemosher.wordpress.com/2011/07/20/version-1-3-verified/

    At this stage I’m not going to make any changes to the core API. that is an API that gets you from reading the data in to
    creating the necessary data structures you need for analysis.

    next Up.. turning Roman’s thermal Hammer into a set of functions ( its very close ) just like we did with Tamino.

    CRAn seems a bit slow in updating my package, so I’ll post source on my site and folks can just install from source.

the demo code shows you how to do things. later some “Sweave” documents which are reproducible research

  12. Steven Mosher said

    Time:

    “I think the problem is that you are focusing on taking people’s “beliefs” and showing them their beliefs are wrong. I don’t have “beliefs” or expectations. I’m genuinely curious. I know it’s hard to believe but I don’t have a priori expectations before I see the data.”

    well, you do have expectations; You believe that the concept of neighbor is important. You think certain characteristics matter. if you are merely curious, then why not this question? do sites at even latitudes differ from sites at odd latitudes. Thats a curious question. here is what I am NOT interested in. I am not interested in creating a TOOL FOR THE CURIOUS and then having the curious tell me to go muck about in data FOR THEM.
    Aint gunna happen. When I started this thing I thought that good old jones and Hansen should just drop everything and answer my damn questions. Kinda self centered. Seeing that, I turned around and said, how can I build a tool for the curious so that they can get their hands dirty and answer questions that interest them. How can I give them Power. So I donate my time. To give you power. Not the power to tell me what to look at. The power for you to do your own darn looking. When you invest time looking at things. when you see your secret theories crash and burn in the still of the night, you gain some measure of respect for people who put their theories out there for others to test

    “But I think one should look at a station’s nearest neighboring stations that have similar characteristic environments (ie same use of local land, same terrain,etc.) by “large” I mean literally as many stations as possible, within about, say, 50 kilometers (this is about the radius of the “near” stations in the Alabama study). I think a “significant” shift would be one which is outside of two standard errors of the mean of the nearby stations during a specific period where the stations are thought to have been stable and have homogeneous data. “unexplained” means that the meta-data does not tell us what happened at that point that would account for a shift that time, but nevertheless we detect one. For instance 1994-95 there was a sudden warming of Athens, Alabama, relative to the three stations nearest to it, of 1.5 degrees in the JJA maximum temperature. There was no explanation for why it occurred in metadata but it was clearly spurious, so it’s effect was removed in John’s paper.”

    1. define similar land use. Land use comes in various categories that are captured by certain datasets. Define these land uses. There are
    many different approaches. I dont read minds. Point to a data source, tell me why you think it is accurate.
    2. as many stations as possible within 50km? why not 100km? if you are just looking for stations that have random 2 sigma jumps, guess what?
    the longer the time, the more stations, the more likely you see that odd ball. Athens odd ball.

    3. A sudden warming of Athens: 1.5 degrees in the JJA maximum temperature. How often should such a thing happen randomly?
    with no model of the noise structure, no model of spatial coherence of a climate field, you dont have anything there more than
    chance. Not sure what you can conclude except… every JJA shift of temperature in one isolated year of a 150 year record is
    not explainable. Well we knew that before we started to study. we can find 1% events. i’m not so sure that it is clearly spurious.
    I dont know what spurious means.

    “As for data sources I think you can get a lot of data from GHCN. However I am not sure if more data will be needed. Certainly John used much more data.

    As for the max versus min, I was saying that you will find that minimum temperature trends are signficantly more positive than maximum temperature trends in many if not all cases, and maximum temperature changes will tend to be smaller, although how much I am unsure. I want to know how much, I don’t want to be told, “go on, guess, and I’ll show your wild guess is wrong” Well of course a wild guess is wrong. But is the picture qualitatively like I describe? I suspect it will indeed be.”

    If you have no idea of the effect size then you cannot do a power test prior to doing your study. That means you dont know how much data
    to use. you say trend in minimum are significantly more positive that trends in max. For which stations? you’re still a bit confusing on how you want this set up.

    WRT your reading assignment. having me read a document and assuming that I will come away with the same idea that you do, is really dangerous. I’d much prefer that you just define your test in a complete manner. If I run off and do test A, for you and you object and say.. “oh I meant test B!, then I will have wasted my time.

    you see, if you just muck about in the data looking for things you will find them. heck I found urban stations that cooled! So you have a theory of sorts, you wanna test it, so lay it out there. Crap I had theories about airports, and coastal stations, and mountain valleys and stations by dams and “pitch black stations” and super rural stations and desert stations and no population growth stations and theories about micro site and well theories about everything. All wrong or unprovable.

  13. Steven Mosher said

    Time to choose:

Glancing at Christy’s paper I can tell you the first mistake he made. He does not appear to understand how the TOBS adjustment is made, and the factors he cites for not using it are in fact accounted for in the algorithm. The algorithm is an empirical adjustment which adjusts for changes in TOB based on the local factors. So, I am not too interested in duplicating that kind of approach. However, if you can get the code and data from him for that, I’ll gladly take a look at it.

  14. timetochooseagain said

    13-Perhaps you could explain why you think John’s approach to estimating the bias due to changes in time of observation won’t work. Certainly no papers or comments came out saying that it wouldn’t

    I’ll see if I can get the data and get back to you.

    12-I apologize for seeming to want to place any burden on you to investigate this, if you don’t want to, I’d don’t want you to do so. Truth be told the reason I am not trying this myself is that I lack the knowledge of R to use your code or other such tools to assess this.

    With regard to how often such jumps as Athens occur relative to their neighbors, I am sorry but I don’t believe in a priori knowledge. Hard to believe, I know. One has to look at the data. Of course, we have no reason to suppose, a priori, that the jumps will always lead to spurious warming, they probably lead to cooling about as often or possibly more often, we cannot know a priori, only a posteriori. However, you appear to think that if a station is suddenly, permanently 1.5 degrees warmer than it’s nearest neighbors (relative to how much warmer or cooler it was relative to them before) that this could just be “random”. Better tell that to the creators of USHCN v2 since they seem to believe that such jumps are real biases that must be accounted for. Their algorithm certainly tries to.

    Other than that you mostly shoot down my specifics. Well that’s why I wasn’t initially so specific (and didn’t get specific about some other things). ;) But seriously, I was only trying to be consistent in describing the methodology as used in Christy’s papers. One could, for instance, use a 100 km radius if one wished, but this would be studying a much broader area. Does it matter? Good question. Can’t know without trying.

  15. Steven Mosher said

    13-Perhaps you could explain why you think John’s approach to estimating the bias due to changes in time of observation won’t work. Certainly no papers or comments came out saying that it wouldn’t

    You got the burden of proof wrong. The Tobs adjustment laid out by Karl and verified by his and subsequent work has a stated error of prediction and confidence intervals. It was developed to account for just the sort of issues that Christy raised. If he wants to invent his own method of adjustment he must illustrate that his method is an improvement on the existing method. He must establish ( which is easy ) that his adjustment outperforms the karl adjustment. For example, he notes that some stations may adjust from early to late, while others change from late to early. That is precisely what the Karl adjustment takes account of. The Karl adjustment is TUNED to the local conditions. This requires reading the Karl paper to figure out. So, when I see someone doing a “home built” approach WITHOUT showing it’s superiority to a published ( code available if you ask Karl) method, I lose interest. I’m not interested in what someone “says” they did. I’m interested in getting their code so that others can build on and improve on the work.

    Finally your ideas about “looking” to find out illustrates a serious failure to understand. Of course one could, for example, look through proxies, find a few hockey sticks and then decide that the correct method was to screen for correlations. If you snoop the data you will eventually find something. looking through the rest of Christy’s method I find no analytical support that suggests his method of combining stations or finding breakpoints has any statistical merit. One of the things we criticized mann for was making up new methods without testing them first. Typically on synthetic data. Contrast that with Romans method or Tamino’s method which is textbook. I’m really not too keen on
    replicating a untested method

    Finally I am not “shooting down” your specifics. I’m getting you to define the parameters that must be defined.
    Lets take “land use” As I noted there are several ways to look at that. Do you mean vegetation type? do you mean irrigated? not irrigated, built? golf courses?, secondary forest? urban? etc..

    Let me give you an example. I read an article. It said that Dams can change precipitation in an area. My thought?
    I wonder if they change temperature. How to test that? Simple. I have to find a database of dams. Then I have to decide what counts as a big dam and I have to decide a radius of influence. Maybe I use those as regressors. Then I split my data and I run the test. But you see since I have to program something I have to make specific decisions. So, when I ask you to define things Im pointing out to you that there are analytical decisions that you make. I wont even start on the whole apriori thing. too long of a discussion

  16. intrepid_wanders said

    Ok Steve. Tech Question on RghcnV3 ;)

    I have managed an error. Here is the console:

    > v3Inv v3Mean startYear endYear v3Mean dex<- which(map.where(“state”, getLonLat(v3Inv)) == “texas”)
    Error: unexpected input in "dex<- which(map.where(“"

    Reading the data into the v3Mean took 25 minutes, are there any environmental parameters with R that I need to set?

  17. intrepid_wanders said

    Opps, no hard returns…

    > v3Inv v3Mean startYear endYear v3Mean dex<- which(map.where(“state”, getLonLat(v3Inv)) == “texas”)

    Error: unexpected input in "dex<- which(map.where(“"

  18. intrepid_wanders said

    One more time:

    > v3Inv v3Mean startYear endYear v3Mean dex<- which(map.where(“state”, getLonLat(v3Inv)) == “texas”)
    Error: unexpected input in "dex<- which(map.where(“"

  19. Steven Mosher said

    If you’re trying to run it from grabbing code in the post, That seldom works.

    come to my site, I’ll put the code up for you to download.

  20. Steven Mosher said

    On the V3 read. There are some things I’m looking at to speed that up. However, you should really only have to do that read ONCE.

    after you read it in and turn it into a zoo object v3ToZoo() you should just save it as .Rdata, then work from that.
    refreshing that data should probably be done monthly, but they refresh every day.

  21. Steven Mosher said

    codes up, You should use version 1.3. Install from the zip if you are using windows, otherwise use source install ( tar.gz)

  22. mct said

    Steve,

As an ageing developer, your efforts might just get me to “do my own digging”. I downloaded R a while back and have been doing some fiddling, but this kind of library is ideal for getting one’s teeth into the data, not getting bogged down in the run of the mill.

    For which, many thanks. I can only think how many hours must have been spent on it.

  23. steven mosher said

    Mct.

    lots of hours, mostly learning the language, testing packages, building various versions of the same thing,

    Roman’s and jeffids stuff will go in next,

    There are two approaches. one is ugly, the other one will require work upstream with the package maintainers.

    That’s

  24. intrepid_wanders said

    Thanks Steve. All the data loads either from the CRAN site and your site, but the mapping seems to be the issue. I loaded the “mapdata” package and it still returns the unexpected input. Is there another package?

    Thanks again.

  25. Andrew said

    15-Fair enough, a method must be tested to see if it will work. Still, until one performs a test, why is the null hypothesis that the TOBS adjustment is “correct” and the “default” and that other methods won’t do better? In this situation it seems the appropriate null hypothesis is that a method will do no better or worse than any other. until demonstrated otherwise. So our question is, does Christy’s method work better worse, or no different than the TOBS adjustment for correcting shifts due to TOBS change? So aside from who has to do the testing to answer this question, I’d don’t think we disagreed about this point. You think Christy has to prove his method works, I don’t care who proves it does or does not, I just want to see it done. So how to test it? I am not asking you to answer that, just wondering in general if anyone has ideas, including you, about how to do so, because I sure don’t know.

    I have recieved a reply from John about my request for the data and code. He is currently traveling, but can get the data to me when he gets back. Sadly, this will not be for three weeks. More sadly, the code may not be recoverable, as William Norris, who mainly wrote it, recently passed away. I think he will see what he can get.

  26. Andrew said

    As for land use classification, I think one should use which ever data provides the most accurate and detailed classification information. I’ll bet good money you and almost anyone other than me knows better what datasets to use for this. My approach would probably be to try everything available, just to check if it makes a difference. If it doesn’t, then any method will do. Just to be clear here, I mean this in the context of analysis involving homogeneity and testing local scale differences.This may wel be different than if it “matters” whether one tosses out stations from certain classifications on a global scale.

  27. Steven Mosher said

    #24. the package is the maps package not the mapdata package.

    The code on my site in the drop box AirVentPost.R should just work.

    1. Load your R.
    2. attach the packages
    3. load the libraries.
    4. Type sessionInfo()
    5. run the code.
    6. look at the results.

    In version 1.3 if you look in the external directory, you’ll see that I added a little extra: a smaller version of V3 data
    that only includes texas.

  28. Steven Mosher said

    Intrepid. You can also just run the demo Texas. That demo collects texas data without using map.where()

  29. Steven Mosher said

    “15-Fair enough, a method must be tested to see if it will work. Still, until one performs a test, why is the null hypothesis that the TOBS adjustment is “correct” and the “default” and that other methods won’t do better? In this situation it seems the appropriate null hypothesis is that a method will do no better or worse than any other. until demonstrated otherwise. So our question is, does Christy’s method work better worse, or no different than the TOBS adjustment for correcting shifts due to TOBS change? So aside from who has to do the testing to answer this question, I’d don’t think we disagreed about this point. You think Christy has to prove his method works, I don’t care who proves it does or does not, I just want to see it done. So how to test it? I am not asking you to answer that, just wondering in general if anyone has ideas, including you, about how to do so, because I sure don’t know.”

    First read the karl paper and the subsequent follow ups. Karl tested his method in a very simple way. Let me see if I can explain.

    1. You start by collecting hourly data from sites all over the US.
    2. You divide that sample in two.
    3. you use 1/2 of the data to build a model. The model calculates the error introduced by a change in TOBS. That is, if you change the tobs from
    11pm to 7am how does Tmax and Tmin change. You then build a model of this error. That model is dependent on the lat/lon, the season,
    sun position, etc. What you end up with is a regression formula that predicts the error from a Tobs change given those parameters.
    4. you then use your out of sample stations to check the standard error of prediction.

    That’s what karl did. So I know the standard error of prediction. I know that’s been tested on the entire Conus. I know its been checked by skeptics
    ( like me who thought tobs was bogus and a guy who used to read john dalys site.)

    Sorry, but I don’t think you understand how a Null works and I’m not going to waste time explaining.

    “I have recieved a reply from John about my request for the data and code. He is currently traveling, but can get the data to me when he gets back. Sadly, this will not be for three weeks. More sadly, the code may not be recoverable, as William Norris, who mainly wrote it, recently passed away. I think he will see what he can get.”

    Well, then you have science that is not reproducible. I believe Jones also lost some of his work. The point is this. If he doesnt have the original raw data, and all the steps required to produce the actual charts in the paper, then it’s not science. It’s an advertisement for science. I

  30. Andrew said

    29-I didn’t ask how the TOBS adjustment was tested. I asked how to test Christy’s approach. Your response? As far as I can tell it’s that you think there is no need, we have Karl’s approach, and it works good enough for you. Anyone who uses a different approach is wrong until proven right. I don’t think TOBS adjustment is nonsense, I think it is possible there is a better way. Christy’s approach also is trying to find changes simultaneous with TOBS adjustment. TOBS cannot do that. You say that Christy’s concern with the TOBS adjustment was something that it already accounts for (ie, where it is located, the specific change in the time observations were taken) but you never mention what I regard as the most important concern he states, which is not accounted for by TOBS adjustment, because TOBS adjustment isn’t designed to do so:

    “such observational shifts are often accompanied by another change in the observational context, such as location or observer, which only confuses the issue.”

    This, to me, is something one should be mindful of, and I think it makes more sense to do break point analysis than a general statistical model that only addresses the TOBS change. The errors due to very site specific effects that make up the uncertainty of the TOBS adjustment model do not account for this. So again, why won’t using breakpoints work? We could test whether they do or do not. But “why bother” since we already have the TOBS adjusment, right? I would also note that I don’t think either approach should lead to any systematic bias, so the question should ultimately be, whose method is less uncertain, and that’s how one would test it, presumably (although it may be that one approach is systematically biased, I see no reason to think so ahead of time. If one is that is an important finding, too). Is one method better? And why, or, how do we determine?

    “Sorry, but I don’t think you understand how a Null works and I’m not going to waste time explaining.”

    No no, by all means, explain to me what my misunderstanding is. You condescendingly explained that greenhouse gases would lead to warming at Lucia’s even though this is something I already know. Go ahead, I’m sure I haven’t the slightest clue what I am talking about. Honestly, if I am totally not grasping the point, why not explain it to me? Because I am unreasonable and unconvinceable? And how would you know, you don’t even bother to explain!

    “Well, then you have science that is not reproducible. I believe Jones also lost some of his work. The point is this. If he doesnt have the original raw data, and all the steps required to produce the actual charts in the paper, then it’s not science. It’s an advertisement for science. I”

    When I get the raw data, and I expect I will, when John gets back from traveling, I will see if I can’t piece things back together. As it stands now, we can’t do anything. This does not mean that the information is lost forever. Indeed, John hasn’t even said he can’t get the code, he’s said he is not sure if he can get the code. This is unfortunate. It will be an obstacle to checking the work. But John is going to make the effort to give people what data he can. Jones on the other hand, well his first response wasn’t, “sorry, I can’t get the data and code to you right now” or “I lost it” it was “No, you can’t have it!”. There is no comparison. When I do get the data, we can check just how well we are able to replicate the results. It will be more work without the code. Shame about William Norris, but there is little to be done about it. I have zero doubt if I’d asked before his passing John would have gotten the code from him.

  31. steven mosher said

    Andrew you totally misunderstand. Read carefully.

    When I see a method’s like Christy’s developed and used on a single dataset with NO reference to standard testing methodology I have to reject it. I have to reject it the same way I reject Mann, who invented methods and applied them without simple testing. Tobs isnt really the issue. I’m using that as an EXAMPLE of what is wrong with the paper. You have probably spent enough time at CA to realize that this was continual refrain. When you introduce a new method you start by doing a study of the method. Is the method good? what are its strong points and weak points. THEN you apply it to the problem. since I agreed with this approach WRT mann I can hardly cut christy any slack. So its NOT just the TOBs adjustment. Far from it. its the entire methodological approach. If you sat there and agreed when mcintyre criticized mann for his nonce methods, then you must be consistent and agree with my point.

    So, hopefully you will get the data. But you really dont need it to do what I am suggesting. In fact, you can work without the data. Cause what you have to do is show that the method works. It probably does, most methods do ‘work” the real question is how well.

    In the end, however, the substantive question is does it have anything to do with global warming? not likely

  32. timetochooseagain said

    31-Fair enough. I want to test the method. I don’t know how to do so. I probably am not sufficiently clever to figure out what is necessary to do so, but I don’t want to burden you with figuring it out.
