
Finishing up DLD 2.0

It’s been a long time coming, but I’m finishing up DLD 2.0 for iOS. I hope to submit next week.

What’s new? Everything. Assuming it gets through the approval process, here’s a summary:

1. An entirely new map interface! Now the map is actually useful and will make use of Google Maps data as well!

2. A custom Time Scale controller! Now you can change the age range of the data displayed on the map without leaving the map view!

3. An entirely new database back-end. Now, this isn’t exciting for the user in general, but the nasty torture I’ve been inflicting on the database is improving the overall quality of the data.

Do you have any other wants for future versions? Let me know.

As always, contact me if you want a copy of the raw database file.

Tom

Migrating the Devonian Lithological Database to a Fully Relational System: The Story So Far

The Devonian Lithological Database (DLD for short) is a database I published as part of my PhD work at the University of Arizona. As databases go, it was quite primitive, but it got the job done. Over the past year or so, I've been migrating the database to a more modern SQL format using SQLite. SQLite is a public domain database engine designed to work without a server. It is easy to use (for a SQL database), and the data file is generally cross-platform.

The migration from the original DLD format to the SQLite format has not been easy. DLD originally consisted of two basic tables: the data records and the source list. The data records lived in a Microsoft Excel spreadsheet with 34 columns of information. The reference list was just an EndNote database. Inserting these tables into SQLite is actually quite easy. However, early on, issues made themselves apparent.
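To give a concrete picture of what that step looks like, here is a minimal sketch in Python using the standard csv and sqlite3 modules. The file name and the handful of columns are invented for illustration; the real sheet has all 34.

import csv
import sqlite3

conn = sqlite3.connect("dld.sqlite")

# A pared-down stand-in for the data-records table (the real spreadsheet has 34 columns).
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        record_id TEXT PRIMARY KEY,
        latitude  REAL,
        longitude REAL,
        lithology TEXT  -- the old letter-code field, imported as-is for now
    )
""")

# Pull each row of a hypothetical CSV export of the Excel sheet into the table.
with open("dld_records.csv", newline="") as f:
    for row in csv.DictReader(f):
        conn.execute(
            "INSERT INTO records (record_id, latitude, longitude, lithology) VALUES (?, ?, ?, ?)",
            (row["record_id"], row["latitude"], row["longitude"], row["lithology"]),
        )

conn.commit()
conn.close()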

The first issue was that database normalization (making sure you don't store the same data more than once) suggested there were actually far more than two basic tables in the database. I had used various codes to represent information in the database. For example, I came up with a letter code to represent the error in position for each record, that is, how far off I thought the latitude and longitude might be. Each of those code systems had to become a table so an end user could at least translate the codes. These code systems added another five tables to the database.
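For illustration, one of those translation tables might be sketched like this; the table name, columns, and sample row are my own invention rather than the actual DLD schema.

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# Translates the one-letter position-error codes into something an end user can read.
conn.execute("""
    CREATE TABLE IF NOT EXISTS position_error_codes (
        code        TEXT PRIMARY KEY,  -- e.g. 'A'
        description TEXT NOT NULL      -- plain-language meaning of the code (hypothetical)
    )
""")
conn.execute(
    "INSERT OR IGNORE INTO position_error_codes (code, description) VALUES (?, ?)",
    ("A", "coordinates good to within about 1 km"),
)

conn.commit()
conn.close()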

I also discovered I had a few records that used more than a single source from the reference list. This meant I had to have yet another table to list all the references associated with each record.
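That is a classic many-to-many relationship, and in SQLite it boils down to a small junction table along these lines (names illustrative; the real reference list has its own keys):

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# One row per (record, source) pair, so a record can cite any number of references.
conn.execute("""
    CREATE TABLE IF NOT EXISTS record_sources (
        record_id TEXT    NOT NULL REFERENCES records(record_id),
        source_id INTEGER NOT NULL,  -- key into the reference-list table
        PRIMARY KEY (record_id, source_id)
    )
""")

conn.commit()
conn.close()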

So, the database I thought was only two tables was now eight. It was more complicated than I had originally hoped, but it was far better than the original Excel/EndNote combination. This approach ties all the diverse data together into one generally easy-to-use file.

Of course, there were more problems. The next one is that the file is slow in the iPod/iPhone version I created last year. The reasons for the speed issue are complicated, and I'm not sure that I can fully resolve them. Two of the main culprits are my letter-code system and redundant data.

The letter codes are a nice, human-readable way to convey information. SQLite isn't human. In some cases, there is more than one letter code in a field (a one-to-many relationship). For example, the letter-code system for lithology allows many codes in the same field, and the order in which they appear is important. Parsing and understanding that sort of text-field information is time-consuming. So, I need to make a new table to replace this field. I haven't done this yet, mainly because it seems a bit scary to do with over 5000 records.
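When I do get to it, the replacement will probably look something like the sketch below: one row per record/lithology pair, with a position column preserving the order of prominence. Table and column names are placeholders, and the exploding function is only there to show the idea.

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# One row per lithology occurrence in a record; 'position' preserves the original code order
# (1 = most prominent). The single-letter code could later point at a lithology look-up table.
conn.execute("""
    CREATE TABLE IF NOT EXISTS record_lithologies (
        record_id      TEXT    NOT NULL REFERENCES records(record_id),
        lithology_code TEXT    NOT NULL,
        position       INTEGER NOT NULL,
        PRIMARY KEY (record_id, position)
    )
""")

def explode_codes(record_id, code_string):
    """Split an old multi-letter field like 'SLC' into ordered rows."""
    for pos, letter in enumerate(code_string, start=1):
        conn.execute(
            "INSERT INTO record_lithologies (record_id, lithology_code, position) VALUES (?, ?, ?)",
            (record_id, letter, pos),
        )

# Hypothetical usage with an invented record id and code string.
explode_codes("NA-0001", "SLC")
conn.commit()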

Data redundancy takes many forms in the lithological database. The prime example, however, is localities. Each record in the original database fully describes its location: place names, coordinates, etc. While there are over 5000 records in the database, there are fewer than 4000 unique localities. This leads to several problems. First, you have more data to sift through than you need: an obvious slowdown. Second, maintaining the information is harder than it needs to be. With each location entered only once, you only have to maintain that record in one place. If the same information lived in several places in the database, I'd have to fix it everywhere, which has a greater chance of error.
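The fix is the textbook one: pull the places out into their own table and have each record point at a locality by id. A rough sketch of that split, with invented column names:

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# Each unique place is stored exactly once...
conn.execute("""
    CREATE TABLE IF NOT EXISTS localities (
        locality_id INTEGER PRIMARY KEY,
        place_name  TEXT NOT NULL,
        latitude    REAL NOT NULL,
        longitude   REAL NOT NULL
    )
""")

# ...and each data record refers to it by id instead of repeating the whole description.
# (The ALTER raises an error if the column has already been added; ignore it on reruns.)
try:
    conn.execute(
        "ALTER TABLE records ADD COLUMN locality_id INTEGER REFERENCES localities(locality_id)"
    )
except sqlite3.OperationalError:
    pass

conn.commit()
conn.close()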

As of today, the database has gone from 8 tables to 26, with a few more expected. Why so many? Going through this process has made it clear that there was also a quality-assurance problem with the original database. Using a flat file like Excel was nice and easy. However, it didn't force you to follow rules for data entry. Every time you enter something into a database, you have a chance to make a mistake. For example, I have formation names that are repeated in multiple records but written differently, e.g. “Ft Vermillion” and “Ft. Vermillion”. In a search, it would be hard to find both. Using what are essentially look-up tables, the new system helps force the use of consistent terms.
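Concretely, the kind of look-up table I mean is nothing fancy; a sketch with invented names is below. The UNIQUE constraint means a formation name can exist in only one spelling, and data entry then picks from (or deliberately adds to) this list instead of typing free text into each record.

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# A formation name exists exactly once; records would reference it by id.
conn.execute("""
    CREATE TABLE IF NOT EXISTS formations (
        formation_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL UNIQUE
    )
""")

# An exact repeat is ignored; a variant spelling shows up as a separate row here,
# where it is far easier to spot and fix than when buried in thousands of records.
conn.execute("INSERT OR IGNORE INTO formations (name) VALUES (?)", ("Ft. Vermillion",))

conn.commit()
conn.close()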

Designing a new database has been quite enlightening.

Developing a new lithological database: Can I do it better this time?

It's now over 10 years since I published the Devonian Lithological Database as part of my PhD thesis. Clearly, it's not perfect or even what I can consider “finished”, but I'm proud of the work anyway. The data I collected have been used by oil companies and incorporated into newer and bigger databases. I hope people will still find it useful for years to come.

This year, I've begun at least the planning stages of building a new lithological database. To really start that process, I need to recognize what worked and what didn't work in the Devonian database.

The design and structure of the Devonian database were based on the system developed at the University of Chicago by Fred Ziegler and crew. It was a relatively simple system for collecting basic information: units, lithology, location, etc. However, when they started, computers were relatively cumbersome to use. They filled out this information on big sheets of paper with about an 80-character limit, a limit imposed by the old punch-card computer systems. Despite those limitations, the database remains one of the best available (and is available online at the Geon Grid).

The main limitation of the University of Chicago and Devonian databases was a lack of flexibility. That lack of flexibility came from the original concept being essentially a flat table. Put simply, one record was one line of text in a file. Generally speaking, you can do a lot with those kinds of files. For complicated data like lithological databases, though, flat files create stark problems.

One example of the problems flat files pose for lithological databases is lithology itself. In the original UC system and the Devonian system, lithologies were listed in a single field using alphanumeric codes in order of prominence. Each code was limited to a single character from A-Z or 0-9, so you could only have 36 lithology types. That's not many at all.

Another example is the time scale. One of the key things the database must handle is time. The rocks are most meaningful in the context of other rocks that formed at the same time. In most database searches, this means searching by a numeric age, but you might want to search by epoch or series as well. It gets more complicated if you want to search by number using a different time scale, where the early and late boundaries for your desired time range might be a little off.
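One relational way to cope with that is to store each named interval with numeric boundaries, so a search by epoch or series becomes a plain numeric range query. The sketch below uses made-up table and column names, approximate ages, and assumes the records table carries a numeric age_ma column; none of that is the final design.

import sqlite3

conn = sqlite3.connect("dld.sqlite")

# Named intervals with numeric boundaries in Ma (values approximate, for illustration only).
conn.execute("""
    CREATE TABLE IF NOT EXISTS time_intervals (
        name    TEXT PRIMARY KEY,
        rank    TEXT NOT NULL,   -- 'period', 'epoch', 'stage', ...
        top_ma  REAL NOT NULL,   -- younger boundary
        base_ma REAL NOT NULL    -- older boundary
    )
""")
conn.execute(
    "INSERT OR IGNORE INTO time_intervals VALUES (?, ?, ?, ?)",
    ("Middle Devonian", "epoch", 382.7, 393.3),
)

# Translate a named interval into a numeric range and find the matching records
# (assumes an age_ma column on the records table).
rows = conn.execute("""
    SELECT r.record_id
    FROM records AS r
    JOIN time_intervals AS t ON t.name = ?
    WHERE r.age_ma BETWEEN t.top_ma AND t.base_ma
""", ("Middle Devonian",)).fetchall()

Swapping time scales would then mean swapping the contents of the intervals table rather than touching the records themselves.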

These problems were really minor issues in the original databases compared to the difficulty of actually doing something about them. For example, I had to use Microsoft Excel for my database and was limited to file sizes of about 1 megabyte, the size of a 3.5-inch floppy. Thus, you might notice that all record numbers have a region code; the region code also indicated which file contained the record.

Today, however, fully relational databases are everywhere. Oracle, Access, FileMaker, even Bento are examples of commercially available databases. For open source, there are MySQL, Postgres, and SQLite, in addition to file formats like XML, JSON, and a host of others.

My preference today is SQLite. It doesn't require a server and is fully open, with no GNU licensing hindrances. Furthermore, there is an important extension to SQLite: SpatiaLite. SpatiaLite adds open GIS data types and functions to SQLite. This allows direct import into some GIS apps, such as Quantum GIS, or the creation of shapefiles for use in other GIS platforms.
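For what it's worth, hooking SpatiaLite up from code looks roughly like the sketch below. It assumes the mod_spatialite extension is installed on the system, that your Python build allows loading SQLite extensions, and that a localities table with longitude/latitude columns already exists; treat it as an outline, not a tested recipe.

import sqlite3

conn = sqlite3.connect("dld.sqlite")
conn.enable_load_extension(True)
conn.load_extension("mod_spatialite")  # requires the SpatiaLite library to be installed

# Create the spatial metadata tables, then add and populate a WGS84 point geometry column.
conn.execute("SELECT InitSpatialMetaData(1)")
conn.execute("SELECT AddGeometryColumn('localities', 'geom', 4326, 'POINT', 'XY')")
conn.execute("UPDATE localities SET geom = MakePoint(longitude, latitude, 4326)")

conn.commit()
conn.close()

From there, a GIS package that understands SpatiaLite can open the file directly, or the geometry can be exported to shapefiles.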

In any case, with modern relational databases, the limits of the old UC approach fall away. However, this comes at the price of more complexity: either you have to be good with SQL, or you need a software interface to do the hard work for you.

In the next few weeks/months, I hope to update everyone on my design progress.