Jun 11, 2010

Developing a new lithological database: Can I do it better this time?

It’s now over 10 years since I published the Devonian Lithological Database as part of my PhD thesis. Clearly, it’s not perfect or even what I can consider “finished”, but I’m proud of the work anyway. The data I collected have been used by oil companies and incorporated into newer and bigger databases. I hope people will still find it useful for years to come.

This year, I’ve at least begun planning a new lithological database. To really start that planning process, I need to recognize what worked and what didn’t work in the Devonian database.

The design and structure of the Devonian database was based on the system developed at the University of Chicago by Fred Ziegler and crew. It was a relatively simple system of collecting basic information: units, lithology, location, etc. However, when they started, computers were relatively cumbersome to use. They filled out this information on big sheets of paper with about an 80 character limit - a limit imposed by the old punch card computer systems. Despite those limitations, the database remains one of the best available (and available online at the Geon Grid).

The main limitation in the University of Chicago and the Devonian databases was a lack of flexibility. This lack of flexibility is because the original concept was essentially a flat table. Put simply, one record was one line of text in a file. Generally speaking, you can do a lot with those kinds of files. For complicated data like lithological databases, those flat files create stark problems.

One example of the problems flat files pose for lithological databases is lithology itself. In the original UC system and the Devonian system, lithologies were listed in a single field using alphanumeric codes in order of prominence. Each code was limited to one character, A-Z or 0-9. Thus, you could only have 36 lithology types. That’s not much at all.
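To make the contrast concrete, here is a minimal sketch of how the single coded field might be split apart and how a relational layout could lift the 36-type ceiling. The table names (`lithology`, `record_lithology`), the column names, and the record ID `X1` are all hypothetical, invented for illustration; they are not the schema of the UC or Devonian databases.

```python
import sqlite3
import string

# The whole universe of one-character codes: A-Z plus 0-9.
CODE_ALPHABET = string.ascii_uppercase + string.digits

def split_codes(field):
    """Split a flat lithology field into (prominence, code) pairs.

    The first character is the most prominent lithology (prominence 1).
    """
    return list(enumerate(field, start=1))

# Relational alternative (hypothetical schema): lithologies live in their
# own table, so the number of types is no longer tied to an alphabet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lithology (
        lithology_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL UNIQUE
    );
    CREATE TABLE record_lithology (
        record_id    TEXT    NOT NULL,
        lithology_id INTEGER NOT NULL REFERENCES lithology(lithology_id),
        prominence   INTEGER NOT NULL,
        PRIMARY KEY (record_id, prominence)
    );
""")

conn.executemany(
    "INSERT INTO lithology (name) VALUES (?)",
    [("sandstone",), ("shale",), ("limestone",)],
)

# Attach two lithologies to the made-up record 'X1', most prominent first.
conn.executemany(
    """
    INSERT INTO record_lithology (record_id, lithology_id, prominence)
    SELECT ?, lithology_id, ? FROM lithology WHERE name = ?
    """,
    [("X1", 1, "limestone"), ("X1", 2, "shale")],
)

def lithologies_for(conn, record_id):
    """Return lithology names for a record, most prominent first."""
    rows = conn.execute(
        """
        SELECT l.name
        FROM record_lithology AS rl
        JOIN lithology AS l ON l.lithology_id = rl.lithology_id
        WHERE rl.record_id = ?
        ORDER BY rl.prominence
        """,
        (record_id,),
    ).fetchall()
    return [name for (name,) in rows]

print(len(CODE_ALPHABET))
print(split_codes("LS"))
print(lithologies_for(conn, "X1"))
```

In this layout, adding a new lithology type is just another row in `lithology`, and the order of prominence is kept as an explicit number rather than as a position in a string.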

Another example is the time scale. One of the key things the database must be able to handle is time. The rocks are most meaningful in the context of other rocks that formed at the same time. In most database searches, this usually means searching by a numerical age, but you might want to search by epoch or series as well. This gets more complicated if you want to search by number using a different time scale, where the early and late boundaries for your desired time range might be a little off.
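One way to handle this, sketched below, is to store each record against a named interval and keep the numerical boundaries in a separate table, one row per interval per time scale. A numerical search then becomes an overlap test against whichever scale you choose. Everything here is an assumption made for illustration: the table and column names, the scale labels `scale_1` and `scale_2`, the interval names, and especially the ages, which are placeholder numbers and not taken from any published time scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_interval (
        scale   TEXT NOT NULL,   -- which time scale the ages come from
        name    TEXT NOT NULL,   -- named interval (stage, epoch, ...)
        base_ma REAL NOT NULL,   -- older boundary, millions of years
        top_ma  REAL NOT NULL,   -- younger boundary, millions of years
        PRIMARY KEY (scale, name)
    );
    CREATE TABLE record (
        record_id     TEXT PRIMARY KEY,
        interval_name TEXT NOT NULL
    );
""")

# Placeholder boundaries: the same named intervals, dated slightly
# differently by two hypothetical time scales.
conn.executemany(
    "INSERT INTO time_interval VALUES (?, ?, ?, ?)",
    [
        ("scale_1", "Stage A", 400.0, 390.0),
        ("scale_1", "Stage B", 390.0, 380.0),
        ("scale_2", "Stage A", 402.0, 392.0),
        ("scale_2", "Stage B", 392.0, 383.0),
    ],
)
conn.executemany(
    "INSERT INTO record VALUES (?, ?)",
    [("R1", "Stage A"), ("R2", "Stage B")],
)

def records_in_range(conn, scale, old_ma, young_ma):
    """Return IDs of records whose interval overlaps [old_ma, young_ma].

    Two age ranges overlap when each one starts before the other ends:
    the interval's base is older than the query's young end, and the
    interval's top is younger than the query's old end.
    """
    rows = conn.execute(
        """
        SELECT r.record_id
        FROM record AS r
        JOIN time_interval AS t ON t.name = r.interval_name
        WHERE t.scale = ? AND t.base_ma > ? AND t.top_ma < ?
        ORDER BY r.record_id
        """,
        (scale, young_ma, old_ma),
    ).fetchall()
    return [record_id for (record_id,) in rows]

# The same numerical query gives different answers under each scale.
print(records_in_range(conn, "scale_1", 391.0, 385.0))
print(records_in_range(conn, "scale_2", 391.0, 385.0))
```

Because the records point at interval names rather than raw numbers, switching time scales only changes which rows of `time_interval` are consulted; the records themselves stay untouched.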

Identifying these problems in the original databases was really a minor matter compared to actually doing something about them. For example, I had to use Microsoft Excel for my database and was limited to file sizes of about 1 megabyte, the size of a 3.5 inch floppy. Thus, you might notice that all record numbers have a region code; the region code also indicated which file contained the record.

Today, however, fully relational databases are everywhere. Oracle, Access, FileMaker, even Bento are examples of commercially available databases. For open source, there’s MySQL, Postgres, and SQLite, in addition to file formats like XML, JSON, and a host of others.

My preference today is SQLite. It doesn’t require a server and is fully open with no GNU hindrances. Furthermore, there is an important extension to SQLite: SpatiaLite. SpatiaLite adds open GIS data fields and commands to SQLite. This allows direct import into some GIS apps, such as Quantum GIS, or the creation of shapefiles for use in other GIS platforms.

In any case, with modern relational databases, the limits of the old UC approach fall away. However, this comes at the price of more complexity. Either you have to be good with SQL or you have to have a software interface to do the hard work for you.

In the next few weeks/months, I hope to update everyone on my design progress.