Aug 7, 2010

Migrating the Devonian Lithological Database to a Fully Relational System: The Story So Far

The Devonian Lithological Database (DLD for short) is a database I published as part of my PhD work at the University of Arizona. As databases go, it was quite primitive, but it got the job done. Over the past year or so, I’ve been migrating the database to a more modern SQL format using SQLite. SQLite is a public-domain database engine designed to work without a server. It is easy to use (for a SQL database), and the data file is generally cross-platform.

The migration from the original DLD format to the SQLite format has not been easy. DLD originally consisted of two basic tables: the data records and the source list. The data records were kept in a Microsoft Excel spreadsheet with 34 columns of information. The reference list was just an EndNote database. Inserting these two tables into SQLite is actually quite easy. However, issues became apparent early on.

The first issue was database normalization (making sure you don’t store the same data more than once), which suggested that there were actually far more than two basic tables in the database. I had used various codes to represent information in the database. For example, I came up with a letter code to represent the error in position for each record, that is, how far off I thought I might be with the latitude and longitude. Each of those code systems had to become a table so that an end user could at least translate the codes. These code systems added 5 more tables to the database.
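To make the idea of a code table concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The table names, column names, and code meanings are all made up for illustration; they are not the actual DLD schema or its real letter codes.

```python
import sqlite3

# Hypothetical schema: names and code meanings are invented for
# illustration and are not the actual DLD design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE position_error (
    code        TEXT PRIMARY KEY,   -- the letter code stored in each record
    description TEXT NOT NULL       -- what the letter means in plain words
);
CREATE TABLE record (
    record_id           INTEGER PRIMARY KEY,
    formation           TEXT,
    position_error_code TEXT REFERENCES position_error(code)
);
INSERT INTO position_error VALUES
    ('A', 'position known precisely'),
    ('B', 'position approximate');
INSERT INTO record VALUES
    (1, 'Formation X', 'A'),
    (2, 'Formation Y', 'B');
""")

# A join translates each record's letter code into its description.
translated = conn.execute("""
    SELECT r.record_id, r.formation, e.description
    FROM record AS r
    JOIN position_error AS e ON e.code = r.position_error_code
    ORDER BY r.record_id
""").fetchall()

for row in translated:
    print(row)
```

The records still carry only the short letter code, but anyone with the file can look up what the letter means without a separate key.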

I also discovered I had a few records that used more than a single source from the reference list. This meant I had to have yet another table to list all the references associated with each record.
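A linking table of this kind might look roughly like the sketch below. Again, the names (`source`, `record_source`) and the placeholder rows are assumptions for illustration only, not the real DLD tables or citations.

```python
import sqlite3

# Hypothetical names and placeholder data; not the actual DLD schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    record_id INTEGER PRIMARY KEY,
    formation TEXT
);
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    citation  TEXT NOT NULL
);
-- One row per (record, source) pair, so a record can cite many sources
-- and a source can back many records.
CREATE TABLE record_source (
    record_id INTEGER NOT NULL REFERENCES record(record_id),
    source_id INTEGER NOT NULL REFERENCES source(source_id),
    PRIMARY KEY (record_id, source_id)
);
INSERT INTO record VALUES (1, 'Formation X'), (2, 'Formation Y');
INSERT INTO source VALUES (1, 'Placeholder source 1'), (2, 'Placeholder source 2');
INSERT INTO record_source VALUES (1, 1), (1, 2), (2, 2);
""")

# List every source attached to record 1.
sources_for_record_1 = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.citation
        FROM record_source AS rs
        JOIN source AS s ON s.source_id = rs.source_id
        WHERE rs.record_id = ?
        ORDER BY s.source_id
        """,
        (1,),
    )
]
print(sources_for_record_1)
```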

So the database, which I thought was only 2 tables, was now 8. It was more complicated than I had originally hoped, but it was far better than the original Excel/EndNote combination. This approach tied together all the diverse data into one generally easy-to-use file.

Of course, there were more problems. The next one was that the file is slow in the iPod/iPhone version I created last year. The reasons for the speed issue are complicated, and I’m not sure that I can fully resolve them. Two of the main causes are my letter code system and redundant data.

The letter codes are a nice human-readable way to convey information. SQLite isn’t human. In some cases, there is more than one letter code in the field (a one-to-many relationship). For example, the letter code system for lithology allows many letter codes in the same field, and the order in which they appear is important. Parsing and understanding that sort of text field is time-consuming. So, I need to make a new table to replace this field. I haven’t done this yet, mainly because it seems a bit scary to do with over 5000 records.
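One way that replacement table could work is sketched below: each letter becomes its own row, with a position column to preserve the order. This assumes each code is a single letter with no separators, which may not match the real DLD field; the table name, column names, and sample values are likewise invented.

```python
import sqlite3


def split_codes(record_id, field):
    """Turn a multi-code text field into (record_id, position, code) rows.

    Assumption: every code is exactly one letter and the field has no
    separators. Position starts at 1 and keeps the original order.
    """
    return [
        (record_id, position, letter)
        for position, letter in enumerate(field.strip(), start=1)
    ]


# Hypothetical table and sample data; not the actual DLD schema or codes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record_lithology (
    record_id INTEGER NOT NULL,
    position  INTEGER NOT NULL,   -- order of the code within the old field
    code      TEXT    NOT NULL,
    PRIMARY KEY (record_id, position)
);
CREATE INDEX idx_record_lithology_code ON record_lithology(code);
""")

old_fields = {1: "LS", 2: "SL", 3: "D"}
for record_id, field in old_fields.items():
    conn.executemany(
        "INSERT INTO record_lithology VALUES (?, ?, ?)",
        split_codes(record_id, field),
    )

# Records where 'L' is the first code: an indexed lookup, no text parsing.
first_is_l = [
    row[0]
    for row in conn.execute(
        "SELECT record_id FROM record_lithology "
        "WHERE code = ? AND position = 1 ORDER BY record_id",
        ("L",),
    )
]
print(first_is_l)
```

The parsing happens once, at migration time, instead of on every search.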

Data redundancy takes many forms in the lithological database. The prime example, however, is localities. Each record in the original database fully describes its location: place names, coordinates, etc. While there are over 5000 records in the database, there are fewer than 4000 unique localities. This leads to several problems. First, you have more data to sift through than you need: an obvious slowdown. Second, maintaining the information is harder than it needs to be. With each location entered only once, you only have to maintain it in one place. If the same information sits in several places in the database, I have to fix it everywhere, which increases the chance of error.
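Pulling the localities out into their own table could be done along these lines. This is only a sketch: the flat table, its columns, and the three sample rows are hypothetical stand-ins for the real 34-column records, and it assumes two rows describe the same locality only when place name and coordinates match exactly.

```python
import sqlite3

# Hypothetical tables and sample rows; not the actual DLD schema or data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The old flat layout: every record repeats its full location.
CREATE TABLE flat_record (
    record_id  INTEGER PRIMARY KEY,
    place_name TEXT,
    latitude   REAL,
    longitude  REAL,
    lithology  TEXT
);
INSERT INTO flat_record VALUES
    (1, 'Site A', 58.4, -116.0, 'L'),
    (2, 'Site A', 58.4, -116.0, 'S'),
    (3, 'Site B', 40.0, -110.0, 'L');

-- Each distinct location is stored exactly once.
CREATE TABLE locality (
    locality_id INTEGER PRIMARY KEY,
    place_name  TEXT,
    latitude    REAL,
    longitude   REAL,
    UNIQUE (place_name, latitude, longitude)
);
INSERT INTO locality (place_name, latitude, longitude)
    SELECT DISTINCT place_name, latitude, longitude FROM flat_record;

-- Records now point at a locality instead of repeating it.
-- Note: rows with NULL in any location column would not match this join.
CREATE TABLE record (
    record_id   INTEGER PRIMARY KEY,
    locality_id INTEGER NOT NULL REFERENCES locality(locality_id),
    lithology   TEXT
);
INSERT INTO record
    SELECT f.record_id, l.locality_id, f.lithology
    FROM flat_record AS f
    JOIN locality AS l
      ON l.place_name = f.place_name
     AND l.latitude   = f.latitude
     AND l.longitude  = f.longitude;
""")

locality_count = conn.execute("SELECT COUNT(*) FROM locality").fetchone()[0]
record_count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
print(locality_count, record_count)
```

Fixing a coordinate then means updating one row in `locality`, and every record that points at it picks up the change.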

As of today, the database has gone from 8 tables to 26, with a few more expected. Why so many? Going through this process has made it clear that there was also a quality assurance problem with the original database. Using a flat file like Excel was nice and easy. However, what it didn’t do was force you to follow rules for data entry. Every time you enter something into a database, you have a chance to make a mistake. For example, I have formation names that are repeated in multiple records but are written differently: e.g. “Ft Vermillion” and “Ft. Vermillion”. In a search, it would be hard to find both. By using what are essentially look-up tables, the new system helps enforce consistent terms.
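As a rough illustration of how a look-up table can enforce consistency, the sketch below stores each formation name once and has records refer to it by ID. The schema is hypothetical. One real SQLite detail worth knowing: foreign keys are only enforced when `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

# Hypothetical schema for illustration; not the actual DLD design.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE formation (
    formation_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE   -- one agreed spelling per formation
);
CREATE TABLE record (
    record_id    INTEGER PRIMARY KEY,
    formation_id INTEGER NOT NULL REFERENCES formation(formation_id)
);
INSERT INTO formation (formation_id, name) VALUES (1, 'Ft. Vermillion');
INSERT INTO record VALUES (1, 1), (2, 1);
""")

# A record pointing at a formation that is not in the look-up table
# is rejected instead of silently creating a new spelling.
try:
    conn.execute("INSERT INTO record VALUES (3, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# One search term now finds every record for the formation.
matches = conn.execute(
    "SELECT COUNT(*) FROM record JOIN formation USING (formation_id) "
    "WHERE name = ?",
    ("Ft. Vermillion",),
).fetchone()[0]
print(rejected, matches)
```

A uniqueness constraint alone would not catch “Ft Vermillion” versus “Ft. Vermillion”, since they are different strings; the benefit comes from records choosing an existing entry rather than retyping the name.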

Designing a new database has been quite enlightening.