The Azimuth Project
Blog - exploring climate data (part 3)

This is a blog article in progress, written by Nadja Kutz. To see discussions of the article as it was being written, visit the Azimuth Forum.


This blog article is about the temperature data used in the reports of the Intergovernmental Panel on Climate Change (IPCC). I present the results of an investigation into the completeness of global land surface temperature records. There are noticeable gaps in the data records, but I leave the discussion about the implications of these gaps to the readers.

The data used in the newest IPCC report, the Fifth Assessment Report (AR5), does not yet seem to be available at the IPCC data distribution centre at the time of writing.

The temperature databases used for the previous report, AR4, are listed here on the website of the IPCC. These databases are:

• CRUTEM3,

• NCDC (probably, as a guess, using the data set GHCNM v3),

• GISTEMP, and

• the collection of Lugina et al.

The temperature collection CRUTEM3 was put together by the Climatic Research Unit (CRU) at the University of East Anglia. According to the CRU temperature page, the CRUTEM3 data, and in particular the CRUTEM3 land air temperature anomalies on a 5° × 5° grid-box basis, have now been superseded by the so-called CRUTEM4 collection.

Since the CRUTEM collection appeared to be an important data source for the IPCC, I started by investigating the land air temperature data collection CRUTEM4. In what follows, only the availability of so-called land air temperature measurements will be investigated. (The collections often also contain sea surface temperature (SST) measurements.)

Usually only ‘temperature grid data’ or other averaged data is used for the climate assessments. Here ‘grid’ means that data is averaged over regions that cover the earth in a grid. However, the data is originally generated by temperature measuring stations around the world. So, I was interested in this original data and its quality. For the CRUTEM collection the latest station data is called the CRUTEM4 station data collection.
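To make the notion of ‘grid data’ more concrete, here is a minimal sketch of how station values could be averaged into 5° × 5° boxes. The function names `grid_box` and `grid_average` and the example stations are made up for illustration. The actual CRUTEM procedure works on anomalies and involves further steps that are not reproduced here.

```python
import math
from collections import defaultdict

def grid_box(lat, lon, size=5.0):
    """Return the (row, col) index of the size-degree box containing (lat, lon)."""
    # Clamp the north pole (lat = 90) into the topmost row.
    row = min(int(math.floor((lat + 90.0) / size)), int(180 / size) - 1)
    # Wrap longitude 180 onto the same column as -180.
    col = int(math.floor((lon + 180.0) / size)) % int(360 / size)
    return row, col

def grid_average(stations, size=5.0):
    """Average station values per grid box; boxes without stations stay absent."""
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, value in stations:
        acc = sums[grid_box(lat, lon, size)]
        acc[0] += value
        acc[1] += 1
    return {box: total / count for box, (total, count) in sums.items()}

# Hypothetical stations: (latitude, longitude, monthly value in °C)
stations = [(52.5, 13.4, 1.0), (53.0, 14.0, 3.0), (-33.9, 151.2, 22.0)]
averages = grid_average(stations)
```

Note that a box with a single station and a box with fifty stations both end up as one number, so the gridded product alone does not show how many stations stand behind each value.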

I downloaded the station data file, which is a simple text file, from the bottom of the CRUTEM4 station data page. At first glance I noticed that there are big gaps in the file for some regions of the world. The file is huge, though: it contains monthly measurements starting in January 1701 and ending in 2011, from 4634 stations altogether. Quickly finding a gap in such a huge file was disconcerting enough to persuade my husband Tim Hoffmann to help me investigate this station data in a more accessible way, via a visualization.

The visualization takes a long time to load, and due to some unfortunate software configuration issues (not on our side) it sometimes doesn’t work at all. Please open it now in a separate tab while reading this article:

• Nadja Kutz and Tim Hoffmann, Temperature data from stations around the globe, collected by CRUTEM 4

For those who are too lazy to explore the data themselves, or in case the visualization is not working, here are some screenshots from the visualization which document the missing data in the CRUTEM4 dataset.

The images should speak for themselves. However, an additional explanation is provided after the images. In particular, it looks as if the deterioration of the CRUTEM4 data set has been greater in the years 2000-2009 than in the years 1980-2000.

Now you could say: okay, we know that there are budget cuts in the UK, and so probably the University of East Anglia was subject to those, but what about all these other collections in the world? This will be addressed after the images.

[Screenshots: three sets of images, the first labeled North America, each showing Jan 1980, Jan 2000 and Jan 2009.]

[Screenshots: two sets of images, the first labeled Eurasia/Northern Africa, each showing Jan 1980, Jan 2000 and Jan 2009.]

These screenshots comprise various regions of the world for the month of January for the years 1980, 2000 and 2009. Each station is represented by a small rectangle around its coordinates. The color of a rectangle indicates the monthly temperature value for that station: blue is the coldest, red is the hottest. Black rectangles are what CRU calls ‘missing data’, denoted with -999 in the file. I prefer instead to call it ‘invalid’ data, in order to distinguish it from the missing data due to stations that have been closed down. In the visualization, closed down stations are encoded by a transparent rectangle and their markers are also present.
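The distinction between valid, ‘invalid’ and closed down stations can be sketched in code. This assumes a simplified, hypothetical record layout (a station with a first and last year of operation plus a table of monthly values). It is not the code behind our visualization, and only the -999 marker is taken from the CRUTEM4 file. For simplicity, months before a station opened are treated like months after it closed, and a month with no entry at all is treated like -999.

```python
MISSING = -999  # marker CRU uses for 'missing data' in the station file

def classify(station, year, month):
    """Return 'closed', 'invalid' or 'valid' for one station and month.

    `station` is a hypothetical dict with keys 'first_year', 'last_year'
    and 'values', where 'values' maps (year, month) to a temperature.
    """
    if not station["first_year"] <= year <= station["last_year"]:
        return "closed"   # station not operating: transparent rectangle
    value = station["values"].get((year, month), MISSING)
    if value == MISSING:
        return "invalid"  # station exists but has no usable value: black rectangle
    return "valid"        # colored rectangle, blue (cold) to red (hot)

def coverage(stations, year, month):
    """Count stations per status for one month."""
    counts = {"valid": 0, "invalid": 0, "closed": 0}
    for station in stations:
        counts[classify(station, year, month)] += 1
    return counts

# Two made-up stations, for illustration only
stations = [
    {"first_year": 1950, "last_year": 2011,
     "values": {(1980, 1): -3.2, (2009, 1): -999}},
    {"first_year": 1900, "last_year": 1990,
     "values": {(1980, 1): 5.0}},
]
jan_1980 = coverage(stations, 1980, 1)
jan_2009 = coverage(stations, 2009, 1)
```

Comparing such counts for Jan 1980, Jan 2000 and Jan 2009 would give a rough numerical counterpart to what the screenshots show.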

We couldn’t find the reasons for this invalid data. At the end of the post John Baez has provided some more literature on this question. It is worth noting that satellites can replace surface measurements only to a certain degree, as was highlighted by Stefan Rahmstorf in a blog post on RealClimate:

“the satellites cannot measure the near-surface temperatures but only those overhead at a certain altitude range in the troposphere. And secondly, there are a few question marks about the long-term stability of these measurements (temporal drift).”

What about other collections?

Apart from the collections already mentioned, which were used in the IPCC’s AR4 report, there are some more institutional collections, and I also found some private weather collections. Among those private collections I haven’t found any that goes back in time as far as CRUTEM4. However, some of them might be more complete in terms of actual data than the collections that reach further back in time.

After we discussed our visualization on the Azimuth Forum, it turned out that Nick Stokes, who runs the blog MOYHU in Australia, had had the same idea as me, but already in 2011, when he visualized station data. He used Google Earth for his visualization, and he used different temperature collections.

If you have Google Earth installed then you can see his visualizations here:

• Nick Stokes, click here.

The link is from the documentation page of Nick Stokes’ website.

What are the major collections?

As far as we can tell, the major global collections of temperature data that go back to the 18th or 19th or at least early 20th century seem to be the following. First, there are the collections already mentioned, which are also used in the AR4 report:

• The CRUTEM collection from the University of East Anglia (UK).

• the GISTEMP collection from the Goddard Institute for Space Studies (GISS) at NASA (US).

• the collection of Lugina et al, which is a cooperative project involving NCDC/NOAA (US) (see also below), the University of Maryland (US), St. Petersburg State University (Russia) and the State Hydrological Institute, St. Petersburg, (Russia).

• the GHCN collection from NOAA.

Then there are these:

• the Berkeley Earth collection, called BEST

• The GSOD (Global Summary Of the Day) and Global Historical Climatology Network (GHCN) collections. Both of these are run by the National Climatic Data Center (NCDC) at the National Oceanic and Atmospheric Administration (NOAA) (US). It is not clear to me to what extent these two databases overlap with those of Lugina et al, which were made in cooperation with NCDC/NOAA. It is also not clear to me whether the GHCN collection was used for the AR4 report (it seems so). There is currently also an only partially working visualization of the GSOD data here. The sparse data in specific regions (see images above) is also apparent in this visualization.

• There is a comparatively new initiative called the International Surface Temperatures Initiative (ISTI), which gathers collections in a databank and seeks to provide temperature data “from hourly to century timescales”. As written on their blog, this data seems not to be quality controlled:

“The ISTI dataset is not quality controlled, so, after re-reading section 3.3 of Lawrimore et al 2011, I implemented an extremely simple quality control scheme, MADQC.”
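The quote does not spell out how MADQC works. To illustrate the general idea of a quality check based on the median absolute deviation (MAD), here is a generic sketch. The function name `mad_outliers` and the threshold of 5 are my assumptions and are not taken from the quoted post or from Lawrimore et al.

```python
from statistics import median

def mad_outliers(values, threshold=5.0):
    """Return indices of values farther than threshold * MAD from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to compare against; flag nothing
    return [i for i, v in enumerate(values) if abs(v - center) > threshold * mad]

# Made-up monthly values with one obviously suspicious entry at the end
suspicious = mad_outliers([1.0, 1.2, 0.9, 1.1, 1.0, 25.0])
```

The median is used instead of the mean because a single wild value barely moves it, so the outlier cannot hide itself by inflating the measure of spread.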

What did you visualize?

As far as I understand, the visualization by Nick Stokes, which you just opened, represents the collection BEST (before 1850-2010), the collections GSOD (1921-2010) and GHCN v2 (before 1850-1990) from NOAA, and CRUTEM3 (before 1850-2000).

CRUTEM3 is also visualized in another way by Clive Best. In Clive Best’s visualization, however, it seems that apart from the station name one has no access to further data such as station temperatures. Moreover, it is not possible to set a recent time range, which is important for checking how much the dataset has changed in recent times.

Unfortunately this limited ability to set a time range also holds for two visualizations by Nick Stokes, here and here. In his first visualization, which is more exhaustive than the second, the following datasets are shown: GHCN v3 and an adjusted version of it (GADJ), a preliminary dataset from ISTI, BEST and CRUTEM4. So his first visualization seems quite exhaustive also with respect to newer data. Unfortunately, as mentioned, setting the time range didn’t work properly (at least when I tested it). The same holds for his second visualization of GHCN v3 data. So, I was only able to trace the deterioration of recent data manually (for example, by clicking on individual stations).

Tim and I visualized CRUTEM4, that is, the updated version of CRUTEM3.

What did you not visualize?

Newer datasets after 2011/2012, for example from the aforementioned ISTI or from the private collections, are not visualized in the two visualizations you just opened.

Moreover, the visualizations mentioned here do not cover the GISS collection, which however now uses NOAA’s GHCN v3 collection. The historical data of GISS could nevertheless differ from the other collections. The visualizations may also not cover the Lugina et al. collection, which was mentioned above in the context of the IPCC report. Lugina et al. could however be similar to GSOD (and GHCN) due to the cooperation. Moreover, GHCN v3 could be substantially more exhaustive than CRUTEM or GHCN v2 (as shown in Nick Stokes’ visualization). However, here too the latest collection was, like CRUTEM4, released in the spring of 2011.

GHCN v3 is also represented in Nick Stokes’ visualizations (here and here). Upon manually investigating it, it didn’t seem to contain much crucial additional data not found in CRUTEM4. Since this manual exploration was not exhaustive, I may be wrong, but I don’t think so.

Hence, to our knowledge, the two visualizations you just opened cover quite a lot of the available data, and seemingly “almost all” (?) of the far-back-reaching, original, quality controlled global surface temperature data collections as of 2011 or 2012. If you know of other similar collections please let us know.

As mentioned above, private collections and in particular the ISTI collection may contain much more data. At the time of writing we don’t know to what extent those newer collections will be taken into account for the new IPCC reports, and in particular for the AR5 report. Moreover, it is not so clear how quality control may be ensured for those newer collections.

In conclusion, the previous IPCC reports seem to have been informed by the collections described here. Thus the coverage problems you see here need to be taken into account in discussions about the scientific basis of previous climate descriptions.

Hopefully the visualizations from Nick Stokes and from Tim and me are ready for exploration! You can start to explore them yourself, and in particular see that the ‘deterioration of data’ is—just as in our CRUTEM4 visualization—also visible in Nick’s collections.

Note: I would like to thank people at the Azimuth Forum for pointing out references, and in particular Nick Stokes and Nathan Urban.

The effects of missing data

supplement by John Baez

There have always been fewer temperature recording stations in Arctic regions than other regions. The following paper initiated a controversy over how this fact affects our picture of the Earth’s climate:

• Kevin Cowtan and Robert G. Way, Coverage bias in the HadCRUT4 temperature series and its impact on recent temperature trends, Quarterly Journal of the Royal Meteorological Society, 2013.

Here is some discussion:

• Kevin Cowtan, Robert Way, and Dana Nuccitelli, Global warming since 1997 more than twice as fast as previously estimated, new study shows, Skeptical Science, 13 November 2013.

• Stefan Rahmstorf, Global warming since 1997 underestimated by half, RealClimate, 13 November 2013 in which it is highlighted that satellites can replace surface measurements only to a certain degree.

• Anthony Watts’ protest about Cowtan, Way and the Arctic, HotWhopper, 15 November 2013.

• Victor Venema, Temperature trend over last 15 years is twice as large as previously thought, Variable Variability, 13 November 2013.

However, these posts seem to say little about the increasing amount of ‘missing data’.
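As a toy illustration of why sparse coverage in the Arctic can matter for a global average, here is a sketch of an area-weighted mean over grid boxes, where each box is weighted by the cosine of its latitude. The function name `area_weighted_mean` and all numbers are made up for illustration. This is not the method of any of the studies or posts listed above.

```python
import math

def area_weighted_mean(boxes):
    """Mean of (latitude, anomaly) pairs, weighted by cos(latitude)."""
    weights = [math.cos(math.radians(lat)) for lat, _ in boxes]
    total = sum(w * anomaly for w, (_, anomaly) in zip(weights, boxes))
    return total / sum(weights)

# Made-up anomalies in °C: tropics, mid-latitudes, and a fast-warming Arctic box
with_arctic = area_weighted_mean([(0.0, 0.2), (45.0, 0.3), (82.5, 1.5)])
without_arctic = area_weighted_mean([(0.0, 0.2), (45.0, 0.3)])
```

In this made-up example, leaving out the single Arctic box lowers the average even though that box carries little area weight. Averaging only over boxes that have data implicitly assumes that the uncovered regions behave like the covered ones.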
