Monthly Archives: June 2010

Why should scientific papers be "spatially enabled"?

Now that I'm starting to build the databases needed for my new lithological database, I'm coming back to how I created my Devonian database. The papers I generally worked with contained reports from the field, including lithology, measurements, location, etc. That can be a LOT of information. Collecting it all from each paper is time consuming, to say the least. However, there was another problem…

That problem is being overly focused on the data in front of you rather than the data you need: the forest-for-the-trees problem, if you will. In the earth sciences, there are a number of research biases. North America and Europe are far better studied than Africa, for example, so most publications focus on those regions. Similarly, some localities are studied extensively, because of their accessibility or because of something interesting there, while others are rarely visited. This becomes a problem when you keep entering papers from the same area but miss important work from more rarely studied areas.

To combat this problem for the Devonian database, I created a "recon" or "search" database. I tried to find any paper that might be relevant to the project and collected some basic information, such as the time range and the general lat/lon area of the field study. I could then map these records in a GIS application (at the time, I was using MapInfo, Terra Mobilis, and PGIS).
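A modern sketch of what such a recon table might look like, using Python's built-in sqlite3 module (the table layout, column names, and the two sample papers are all hypothetical, not the original database's format):

```python
import sqlite3

# One row per paper: a short citation, a time range in Ma, and a
# lat/lon bounding box for the field area covered by the paper.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE recon (
        id       INTEGER PRIMARY KEY,
        citation TEXT NOT NULL,      -- short reference to the paper
        age_max  REAL,               -- older bound, Ma
        age_min  REAL,               -- younger bound, Ma
        lat_min  REAL, lat_max REAL,
        lon_min  REAL, lon_max REAL
    )
""")
conn.executemany(
    "INSERT INTO recon (citation, age_max, age_min, lat_min, lat_max, lon_min, lon_max) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Smith 1987",  385.0, 375.0, 41.0, 43.5, -91.0, -88.0),  # hypothetical
        ("Ivanov 1979", 398.0, 359.0, 54.0, 58.0,  55.0,  61.0),  # hypothetical
    ],
)

# Find every paper whose coverage rectangle contains a given point.
lat, lon = 42.1, -89.5
rows = conn.execute(
    "SELECT citation FROM recon "
    "WHERE ? BETWEEN lat_min AND lat_max AND ? BETWEEN lon_min AND lon_max",
    (lat, lon),
).fetchall()
print(rows)  # → [('Smith 1987',)]
```

With the bounding boxes stored this way, the records can be dumped straight into a GIS layer for mapping.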

As an example, I found about 500 of these records remaining in my archives.  Here is a global map example:

The yellow dots are entries in the Devonian Lithological Database.  The blue rectangles are “coverages” for particular scientific papers.  Where papers overlap, the blue color gets darker.  This is more evident regionally, for example:

As you can see, I can now show the data I have versus the field areas represented by papers I've found. Careful examination of this sort of map highlights both papers I might not need to bother with (blue rectangles with lots of yellow dots) and papers I should prioritize (blue rectangles with few, if any, yellow dots).
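The triage the map supports can be sketched in a few lines of Python: count how many existing entries (the "yellow dots") fall inside each paper's coverage rectangle (the "blue rectangles"), and flag rectangles with no hits as priorities. All coordinates and citations below are made up:

```python
# (lat, lon) of existing database entries -- hypothetical values
entries = [(42.0, -89.0), (42.5, -88.5), (10.0, 20.0)]

# Paper coverage rectangles: (lat_min, lat_max, lon_min, lon_max)
coverages = {
    "Smith 1987":  (41.0, 43.5, -91.0, -88.0),
    "Ivanov 1979": (54.0, 58.0,  55.0,  61.0),
}

def hits(box, points):
    """Number of points falling inside a lat/lon bounding box."""
    lat_min, lat_max, lon_min, lon_max = box
    return sum(lat_min <= la <= lat_max and lon_min <= lo <= lon_max
               for la, lo in points)

# Papers whose coverage contains no existing entries are the ones
# worth prioritizing for data entry.
priority = [paper for paper, box in coverages.items()
            if hits(box, entries) == 0]
print(priority)  # → ['Ivanov 1979']
```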

These maps by no means represent all the papers I looked at in developing the database. I think I physically looked at at least 3000-4000 papers, but only about 500 are represented in the maps above. Including everything would take a great deal of work.

In any case, I hope this short example shows that, in at least one case, geospatially enabled papers can be very important. Now, the question is how to implement it!

Developing a new lithological database: Can I do it better this time?

It's now over 10 years since I published the Devonian Lithological Database as part of my PhD thesis. Clearly, it's not perfect or even what I can consider “finished”, but I'm proud of the work anyway. The data I collected have been used by oil companies and incorporated into newer and bigger databases. I hope people will still find it useful for years to come.

This year, I've at least begun the planning process for building a new lithological database. To really start that process, I need to recognize what worked and what didn't work in the Devonian database.

The design and structure of the Devonian database was based on the system developed at the University of Chicago by Fred Ziegler and crew. It was a relatively simple system of collecting basic information: units, lithology, location, etc. However, when they started, computers were relatively cumbersome to use. They filled out this information on big sheets of paper with roughly an 80-character limit per record, a limit imposed by the old punch-card computer systems. Despite those limitations, the database remains one of the best available (and it is available online at the Geon Grid).

The main limitation in the University of Chicago and Devonian databases was a lack of flexibility. That lack of flexibility came from the original concept being essentially a flat table. Put simply, one record was one line of text in a file. Generally speaking, you can do a lot with those kinds of files, but for complicated data like lithological databases, flat files create real problems.
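As a concrete illustration of the flat-file model, here is one fixed-width record carved into fields by column position. The layout and the sample record are invented for illustration; this is not the actual UC format:

```python
# Build one fixed-width "record" line. Every field lives at a fixed
# column range, punch-card style. All field widths here are invented.
line = (
    "D0123 "
    + "Cedar Valley".ljust(20)  # unit name, padded to 20 chars
    + "LSH "                    # packed lithology codes, one char per type
    + "385 375 "                # age bounds in Ma (older, younger)
    + "+42.0 -089.0"            # latitude, longitude
)

# Parsing means slicing by column position -- change any field width
# and every downstream slice breaks. That is the inflexibility.
record = {
    "id":        line[0:5].strip(),
    "unit":      line[6:26].strip(),
    "lithology": list(line[26:29]),   # ['L', 'S', 'H'], one code each
    "age_max":   float(line[30:33]),
    "age_min":   float(line[34:37]),
    "lat":       float(line[38:43]),
    "lon":       float(line[44:50]),
}
print(record["unit"])  # → 'Cedar Valley'
```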

One example of the problems flat files present for lithological databases is lithology itself. In the original UC system and the Devonian system, lithologies were listed in a single field using alphanumeric codes, in order of prominence. Each code was limited to a single character, A-Z or 0-9, so you could only have 36 lithology types. That's not much at all.
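For contrast, here is a minimal sketch of how a relational design lifts that 36-code ceiling: lithologies get their own table, and a join table records which lithologies appear in which unit while preserving order of prominence. Table and column names, and the sample unit, are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lithology (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE unit      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE unit_lithology (
        unit_id      INTEGER REFERENCES unit(id),
        lithology_id INTEGER REFERENCES lithology(id),
        rank         INTEGER          -- 1 = most prominent
    );
    INSERT INTO lithology (name) VALUES ('limestone'), ('shale'), ('dolostone');
    INSERT INTO unit (name) VALUES ('Cedar Valley');  -- hypothetical unit
    -- (unit, lithology, rank): limestone is most prominent, then shale
    INSERT INTO unit_lithology VALUES (1, 2, 2), (1, 1, 1);
""")

# List a unit's lithologies in order of prominence. The lithology
# table can grow without limit -- no more one-character codes.
rows = conn.execute("""
    SELECT l.name FROM unit_lithology ul
    JOIN lithology l ON l.id = ul.lithology_id
    WHERE ul.unit_id = 1 ORDER BY ul.rank
""").fetchall()
print([r[0] for r in rows])  # → ['limestone', 'shale']
```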

Another example is the time scale. One of the key things the database must be able to handle is time. Rocks are most meaningful in the context of other rocks that formed at the same time. In most database searches, this means searching by a number, but you might want to search by epoch or series as well. This gets more complicated if you want to search by number using a different time scale, where the early and late boundaries for your desired time range might be a little off.
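One common way to handle this is to store each record's age as a numeric interval in Ma and translate named intervals into numeric bounds at query time. A rough sketch follows; the stage boundaries and records below are illustrative only, and boundary numbers do shift between published time scales, which is exactly the problem described above:

```python
# Named Devonian stages mapped to (older, younger) bounds in Ma.
# These numbers are illustrative; different time scales disagree.
STAGES = {
    "Frasnian":  (382.7, 372.2),
    "Famennian": (372.2, 358.9),
}

# Hypothetical records: (id, older bound, younger bound) in Ma.
records = [
    ("rec-001", 390.0, 380.0),
    ("rec-002", 375.0, 360.0),
]

def overlaps(record, older, younger):
    """True when the record's age interval overlaps the query interval."""
    _, r_older, r_younger = record
    # Two intervals overlap when each one starts before the other ends.
    return r_younger < older and r_older > younger

# Search by stage name: resolve the name to numbers, then compare.
older, younger = STAGES["Famennian"]
found = [r[0] for r in records if overlaps(r, older, younger)]
print(found)  # → ['rec-002']
```

Swapping in a different time scale then only means swapping the lookup table, not re-coding every record.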

Recognizing these problems in the original databases was minor compared to actually doing something about them. For example, I had to use Microsoft Excel for my database and was limited to file sizes of about 1 megabyte, the size of a 3.5-inch floppy. Thus, you might notice that all record numbers have a region code; the region code also indicated which file contained the record.

Today, however, fully relational databases are everywhere. Oracle, Access, FileMaker, even Bento are examples of commercially available databases. On the open-source side, there are MySQL, Postgres, and SQLite, in addition to other file formats like XML, JSON, and a host of others.

My preference today is SQLite. It doesn't require a server and is fully open, with no GPL hindrances. Furthermore, there is an important extension to SQLite: Spatialite. Spatialite adds open GIS data types and functions to SQLite. This allows direct import into some GIS apps, such as Quantum GIS, or the creation of shapefiles for use in other GIS platforms.
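Even without the Spatialite extension loaded, a plain SQLite table of lat/lon records can be exported as WKT (Well-Known Text) points, which most GIS applications can import. A small sketch, with a hypothetical table and made-up coordinates; with Spatialite you would store real geometry columns instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
conn.executemany("INSERT INTO record (lat, lon) VALUES (?, ?)",
                 [(42.0, -89.0), (56.0, 58.0)])

# WKT uses (x y) coordinate order, i.e. longitude first.
wkt = [f"POINT({lon} {lat})" for lat, lon in
       conn.execute("SELECT lat, lon FROM record ORDER BY id")]
print(wkt)  # → ['POINT(-89.0 42.0)', 'POINT(58.0 56.0)']
```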

In any case, with modern relational databases, the limits of the old UC approach fall away. However, this comes at the price of more complexity: either you have to be good with SQL, or you need a software interface to do the hard work for you.

In the next few weeks/months, I hope to update everyone on my design progress.