Reading SQL with PySqlite

About a year or so ago, I wrote a special script to run the FOAM climate model. The primary goal of this script, besides running the model, was to store a wide variety of information about the run, including settings, system information (like CPU temperature), and the timing and duration of the run. The storing process stored some of the information before the model starts and after the model ends. It's a great log of my model run and system performance history.

The drawback to this data was the database itself. Up until today, I've been using a single database to store all of the run data. However, I've been wanting a separate database for each model.

I didn't develop this approach in the first version of the script because I didn't know how to read the template SQL and directly insert it into the database. In the command line with Sqlite, you simply enter “.read mysqlfile.sql” or something similar. In python, that's not possible. Nor is it possible for PySqlite to accept more than one SQL command at a time. Without this ability, I couldn't automatically create a complete Sqlite file with all of the required tables.

The solution turned out to be remarkably easy. The SQL file was a straightforward text file. Reading a text file into python is very easy:

data_file = open("path_to_my_SQL_,'r')
theSQL =

Since pysqlite only handles one statement at a time, the commands need to be split into separate statments:

theStatements = theSQL.split(";")

The file can be split into discrete statements because the semicolon always marks the end of a statement.

At this point, you simply need to loop through each of the statements and execute in the sqlite file:

for statement in theStatements:
sqlite_cursor.execute(statement + ";")

Keep in mind, you have to reattach the trailing semicolon at the end of each statement.

There's probably even an easier way to do this, but it's good enough for me.

Finishing up DLD 2.0

It’s been a long time coming, but I’m finishing up DLD 2.0 for iOS. I hope to submit next week.

What’s new? Everything. Depending on whether it gets through the approval process, here’s a summary:

1. An entirely new map interface! Now, the map is actually useful and will make use of Google map data as well!

2. A custom Time Scale controller! Now you can change the age range of the data displayed on the map without leaving the map view!

3. An entirely new database back-end. Now, this isn’t exciting for the user in general, but the nasty torture I’ve been inflicting on the database is improving the overall quality of the data.

Do you have any other wants for future versions? Let me know.

As always, contact me if you want a copy of the raw database file.


Migrating the Devonian Lithological Database to a Fully Relational System: The Story So Far

The Devonian Lithological Database (DLD for short) is a database I published as part of my PhD work at the University of Arizona. As databases go, it was quite primitive but it got the job done. Over the past year or so, I've been migrating the database to a more modern SQL format using SQLite. SQLite is a public domain database designed to work without a server. It is easy to use (for a SQL database) and the data file is generally cross platform.

The migration from the original DLD format to the SQLite format has not been easy. DLD originally consisted of two basic tables: the data records and the source list. The data records were based in Microsoft Excel with 34 columns of information. The reference list was just a Endnote database. Inserting these tables into SQLite is actually quite easy. However, early on, issues made themselves apparent.

The first issue was database normalization (making sure you don't repeat data more than once) suggested that there were actually far more than two basic tables to the database. I had used various codes to represent information in the database. For example, I came up with a letter code to represent the error in position for each record. That is, how off I thought I might be with the latitude and longitude. Thus, each of those code systems had to be a table so an end-use could at least translate the code. These code systems added an additional 5 tables to the database.

I also discovered I had a few records that used more than a single source from the reference list. This meant I had to have yet another table to list all the references associated with each record.

So, now the database which I thought was only 2 tables was now 8. It was more complicated than I had originally hoped but it was far better than the original Excel/Endnote combination. This approach tied together all the diverse data into one generally easy-to-use file.

Of course, there were more problems. The next problem is that the file is slow in the iPod/iPhone version I created last year. The reasons for the speed issue are complicated and I'm not sure that I can fully resolve them. Two of the main problems with the speed are my letter code system and redundant data.

The letter codes are nice human-readable way to convey information. SQLite isn't human. In some of the cases, there is more than one letter code in the field (a one-to-many relationship). For example, the letter code system for lithology allows many letter codes in the same field and the order in which they appear is important. Parsing and understanding that sort of text field information is time consuming. So, I need to make a new table to replace this field. I haven't done this yet mainly because it seems a bit scary to do with over 5000 records.

Data redundancy takes many forms in the lithological database. The prime example, however, is localities. Each record in the original database fully describes its location: place names, coordinates, etc. While there are over 5000 records in the database, there are fewer than 4000 unique localities. This leads to several problems. First, you have more data to sift through than you need: an obvious slow down. Second, maintaining information is harder than needed. By having each location entered only once, you only have to maintain that record in one place. If I had that information several places in the database, then I'd have to fix it everywhere which has a greater chance of error.

As of today, the database has gone from the 8 tables to 26 with a few more expected. Why so many? Going through this process has made it clear that there was also a quality assurance problem with the original database. Using a flat file like Excel was nice and easy. However, what it didn't do was force you to use rules for data entry. Every time you enter something into a database, you have a chance to make a mistake. For example, I have formation names that are repeated in multiple records but are written differently: e.g. “Ft Vermillion” and “Ft. Vermillion”. In a search, it would be hard to find both. Using what are essentially look-up tables, the system would help force the use of consistent terms.

Designing a new database has been quite enlightening.

Why should scientific papers be "spatially enabled"

Now that I'm starting to build the databases needed for my new lithological database, I'm coming back to how I created my Devonian database.  The papers I generally worked with contained reports from the field, including lithology, measurements, location, etc.  That can be a LOT of information.  Collecting it all from each paper is time consuming to say the least.  Howevever, there was another problem…

That problem is being overly focused on the data in front of you and not the data you need.  The forest for the trees problem, if you will.  In the earth sciences, there are a number of research biases.  North America and Europe are far better studied than Africa, for example.  Thus, most publications are focused in those regions.  Similarly, some specific localities can be studied extensively, because of location or because of something interesting, while others are rarely visited.  This becomes a problem when you keep entering papers from the same area but miss important work from more rarely studied areas.

To combat this problem for the Devonian database, I created a “recon” or “search” database.  I tried to find any paper that might be relevant to the project and collect some basic information such as time range, and the general lat/lon area of the field study.   I could then map these records in a GIS application (at the time, I was using MapInfo, Terra Mobilis, and PGIS). 

As an example, I found about 500 of these records remaining in my archives.  Here is a global map example:

The yellow dots are entries in the Devonian Lithological Database.  The blue rectangles are “coverages” for particular scientific papers.  Where papers overlap, the blue color gets darker.  This is more evident regionally, for example:

As you can see, I can now show the data I have versus the field areas represented by papers I've found.  Careful examination of this sort of map highlights both papers I might not need to bother with (blue rectangles with lots of yellow dots) versus papers I should prioritize (blue rectangles with few if any yellow dots). 

These maps by no means represents all the papers I looked at in developing the database.  I think I physically at least looked at 3000-4000 papers but only 500 are represented in the above maps.  So, to include everything, it would take a great deal of work.

In any case, in this short example, I hope i've shown that in at least once case that geospatially enabled papers can be very important.  Now, the question is how to implement it!

Developing a new lithological database: Can I do it better this time?

It's now over 10 years since I published the Devonian Lithological Database as part of my PhD thesis. Clearly, it's not perfect or even what I can consider “finished”, but I'm proud of the work anyway. The data I collected have been used by oil companies and incorporated into newer and bigger databases. I hope people will still find it useful for years to come.

This year, I've begun at least the planning process of embarking on building a new lithological database. So, to really start the planning process, I need to recognized what worked and what didn't work in the Devonian database.

The design and structure of the Devonian database was based on the system developed at the University of Chicago by Fred Ziegler and crew. It was a relatively simple system of collecting basic information: units, lithology, location, etc. However, when they started, computers were relatively cumbersome to use. They filled out this information on big sheets of paper with about an 80 character limit – a limit imposed by the old punch card computer systems. Despite those limitations, the database remains one of the best available (and available online at the Geon Grid).

The main limitation in the University of Chicago and the Devonian databases was a lack of flexibility. This lack of flexibility is because the original concept was essentially a flat table. Put simply, one record was one line of text in a file. Generally speaking, you can do a lot with those kinds of files. For complicated data like lithological databases, those flat files create stark problems.

One example of the problems presented by flat files and lithological databases is lithology. In the original UC system and the Devonian system, lithologies were listed in a single field using alphanumeric codes in order of prominence. So, the codes were limited to 1 character from A-Z, 0-9. Thus, you could only have 36 lithology types. That's not much ameanifult all.

Another example is the time scale. One of the key things the database must be able to handle is time. The rocks are most meaningful in context of other rocks that formed at the same time. In most database searches, this usually requires searching by a number but you might want to search by epoch or series as well. This gets more complicated if you want to search by number using a different time scale where the the early and late boundaries for your desired time range might be a little off.

These problems are really minor issues in the original databases compared to actually doing something about them. For example, I had to use Microsoft Excel for my database and was limited to file sizes of about 1 megabyte, the size of a 3.5 inch floppy. Thus, you might notice that all record numbers have a region code; the region code also represented what file contained the record.

Today, however, fully relational databases are everywhere. Oracle, Access, Filemaker, even Bento are examples of commercially available databases. For open source, there's MySQL, Postgres, and Sqlite in addition to other types of file formats like XML, JSON, and a host of others.

My preference today is Sqlite. It doesn't require a server and is fully open with no GNU hindrances. Furthermore, there is an important extension to Sqlite: Spatialite. Spatialite adds open GIS data fields and commands to Sqlite. This allows direct import into some GIS apps, such as Quantum GIS, or the creation of shapefiles for use in other GIS platforms.

In any case, with modern relational databases, the limits of the old UC approach fall away. However, this comes with the price of more complexity. either you have to be good with SQL or you have to have a software interface to do the hard work for you.

In the next few weeks/months, I hope to update everyone on my design progress.

FOAM Output Variables

Since I get many questions on what's contained in FOAM output, here's a list of all the variables contained in the standard atmosphere, coupler, and ocean output files.

Atmosphere Output
ALB Albedo
ALBCLR Clear sky albedo
CLDHGH Vertically-integrated, random overlap, total cloud amount
CLDLOW Vertically-integrated, random overlap, low cloud amount
CLDMED Vertically-integrated, random overlap, mid-level cloud amount
CLDTOT Vertically-integraed, random overlap, total cloud amount
CLOUD Cloud fraction
CMFDQ Q tendency – moist convetsion
CMFDT T tendency-moist convection
CMFMC Moist convection mass flux
CNVCLD Random overlap total convective cloud amount
DC01 convective adjustment tendency
DTCOND T tendency – convective adjustment
DTH T horizontal diffusive heating
DTV T vertical diffusion
FLNS Net longwave flux at surface
FLNSC Clearsky net longwave flux at surface
FLNT Net longwave flux at top
FLNTC Clearsky net longwave flux at top
FLUT Top of Atmosphere Outgoing Longwave Flux
FLUTC Clearsky Top of Atmosphere Outgoing Longwave Flux
FSDS Flux Shortwave Downwelling at Surface
FSDSC Clearsky Flux Shortwave Downwelling at Surface
FSNS Net solar flux at surface
FSNSC Clearsky net solor flux at surface
FSNT Net solar flux at top
FSNTC Clearsky net solar flux at top
FSNTOA Net Solar Flux at the Top of the Atmosphere
FSNTOAC Clearsky Net Solar Flux at the Top of the Atmosphere
LHFLX Surface latent heat flux
LWCF Longwave Cloud Forcing
OMEGA Vertical pressure velocity
ORO ocean(0), land(1), sea ice(2)
PBLH Planetary Boundary Layer Height
PHIS surface geopotential
PRECC Convective precipitation rate
PRECL Large-scale (stable) precipitation rate
PRECSC Convective snow rate (water equivalent)
PRECSL Large-scale (stable) snow rate (water equivalent)
PS surface pressure
PSL Sea level pressure
Q specific humidity
QFLX Surface water flux
QPERT Perturbation specific humidity (eddies in PBL)
QRL Longwave heating rate
QRS Solar heating rate
RELHUM Relative humidity
SHFLX Surface sensible heat flux
SNOWH Water equivalent snow depth
SOLIN Solar insolation
SRFRAD Net radiative flux at surface
SWCF Shortwave Cloud Forcing
T temperature
TAUGWX East-west gravity wave drag surface stress
TAUGWY North-south gravity wave drag surface stress
TAUX X-component (eat-weat) at surface stress
TAUY Y-component (north-south) of surface stress
TMQ Water Vapor
TPERT Perturbation temperature (eddies in PBL)
TS1 Surface temperature (level 1)
TS2 Subsurface temperature (level 2)
TS3 Subsurface temperature (level 3)
TS4 Subsurface temperature (level 4)
U zonal wind component
USTAR Surface friction velocity
UTGW U tendency – gravity wave drag
V meridional wind component
VD01 vertical diffusion tenency of water vapor
VQ Meridional water transport
VT Meridional heat transport
VTGW V tendency – gravity wave drag
VVPUU Kenetic Energy
VZ Meridional transport
WET Soil moisture
Z3 Geopotential Height


EVP moisture flux
FRAC ice fraction
ICET1 ice layer 1 temp
ICET2 ice layer 2 temp
INTERT interface temp
LHF latent heat flux
LWV longwave out
MELTP melt potential
OHEAT ocean heat forcing field
OPREC precip from atm
OQFLX ocean freshwater forcing field
ORNF runoff into the ocean
ORO land mask
RAD surface radiation
RNF land runoff
SHF sensible heat flux
SNDPTH snow depth
SNM snow melt
SNOWT snow temp
TAUX ocean taux foring field
TAUY ocean tauy forcing filed
THCK seaice thickness flag
TSSUB1 top soil layer temp
TSSUB2 soil temp layer 2
TSSUB3 soil temp layer 3
TSSUB4 bottom soil temp layer
VOLR river volume
WS soil moisture


CONVEC Upper Ocean Convective adjustment frequencey
CONVEC2 Deep Ocean Convective adjustment frequencey
Currents Currents
HEATF Ocean heat forcing
P Normalized perturbation presure
S Salinity
SALTF Sfc salinity tendency due to freshwater forcing
Sconv Surface layer S convective adjustment
Sconvm1 Near surface layer S convective adjustment
Szz Surface layer S vertical mixing
Szzm1 Near surface layer S vertical mixing
T Temperature
TAUX X-component (east-west) of surface stress
TAUY Y-component (north-south) of surface stress
Tconv Surface layer T convective adjustment
Tconvm1 Near surface layer T convective adjustment
Tzz Surface layer T vertical mixing
Tzzm1 Near surface layer T vertical mixing
U Zonal current component
V Meridional current component
W Vertical velocity

Using SQLite and Python to Store Model Metadata

As I continue to run a range of climate models, I've learned from painful lessons that I need to record as much information about the model run as possible. When I first started this process, I simply kept files used to make the run (the geography and configuration files for the model) and the model output. At first, this seemed sufficient because, in the end, these were the data that were most important. As it turns out, however, that having a history of everything you did during the model run, such as adjustments to the settings or geography, is also important both historically to the run and possibly sorting out problems later.

My initial solution to this problem was to create a log file. Every time I ran the model, the important setting information was sent to a simple flat-file log. It turned out that this log was very important to debugging a model-run issue because it kept a record of how the model was initially run. I also started keeping information about the hardware in this log. Along with the model information, I began to store hardware temperature data from before and after the run in the log just in case I needed to debug hardware issues. However, these data turned out to be virtually useless in a flat log file. Other information I haven't been keeping that I wanted in the log was geography version control information. I use version control to track all my geography work, so I can both track how I change the geography and get an idea how much time I spend on it. However, the exact geography used in a run is important to know. However, even more info in a flat log file makes it even more difficult to review.

My new solution is to dump the flat file approach and go with SQLite. SQLite is a lightweight, public domain SQL file format that works well with a variety of languages. SQLite has become one of my preferred file formats over the years (nudging out XML for custom data structures). The Python scripting language seems a natural fit to work with SQLite as well.

So, how does this solution work? FIrst, I have a simple run script for the model using bash (for some reason, I could never get the model to run using Python). This script calls my python script before the model starts and after the model ends. It sends two pieces of information, a uuid and the model directory path. The python script assembles everything it needs on its own.

Why a uuid? Each time I run the model, I need to identify the run in the database with a unique id that can be used to link across a number of SQL tables. A uuid ensures that the id is unique. I've considered using a uuid for the overall simulation but I haven't implemented that.

To pull in settings data and temperature data, I've written parsers for each format. For the model I've been running, FOAM, I have parsers that read the atmos_params and run_params files in addition to parsing the temperature monitor software and subversion “svn info” command. The script then inserts these data into their own tables marked by the uuid. While most of these tables have fields for each value I pull out of files, the temperature data is stored in key->value type table since the number of temperature sensors is dependent on hardware and thus may change from machine to machine (and is also Mac only).

Here is the schema for the main table, “runs”:

uuid text,
starttime text,
endtime text,
runduration text,
std_out blob,
completed text,
comments text,
runcmd text,
yearsPerDay text

Some of these fields are not yet used. std_out and the rancmd is not yet implemented in the script . Right now, I'll do the comments field manually. My currently running simulation looks like this at the moment:

uuid = **deleted
starttime = 2010-01-16 23:33:45
endtime = 2010-01-17 10:29:57
runduration = 39372.0
std_out =
completed = NO
comments = manual shutdown because of memory problem
runcmd =
yearsPerDay = 14.2

For the geography source location, here's the results for the run above:

uuid = **deleted
url = file:///Volumes/**deleted
type = svn
date = 2009-08-28 09:42:01 -0500 (Fri, 28 Aug 2009)
author = tlmoore
rev = 191
branch =

the branch is empty in anticipation for moving from subversion to git.

For temperatures, I can now look at before and after values for specific sensors for a run:

uuid = *deleted
temperature = 64.4

uuid = *deleted
temperature = 98.6

One thing I'd change here is specifying pre- versus post-run measurements.

So far, I'm happy with most of this new solution. It just need refinements.

Trying to go paperless

Almost since the inception of personal computing, the paperless office has been an unrealized promise. I've been wanting to move in that direction for many years, but it still not easy to do. Today, my office and basement are filled to capacity with documents, scientific papers, and books. I'm out of space and managing everything is killing my productivity. I throw away and give away what I can, but that is only a small dent in the problem. Some stuff I'll always have to keep as paper, sadly. The remaining stuff, however, needs to move to the computer.

The rest of this article will be on what I'm using to at least start this transition in terms of software and hardware. I'm primarily Macintosh based, so much of this will related to Macintosh software. Feel free to contact me to suggest Windows and Linux alternatives for the same tasks. The following are just some quick thoughts…


My software requirements appear relatively simple. I need to scan and OCR documents and convert them to PDF. Once I have the PDF, I need to be able to organize the documents quickly and with a high degree of flexibility. Simple, right?

No, not simple at all. I don't have any application that fits the complete bill for this work. Frustrating, to say the least. Part of the problem, in reality, is me. Just like the problems the IRS has on making simple tax code, my document management reality is too complicated for any one application to fit the bill. Unfortunate, but in the end, I hope it will not be too bad.

Here are quick reviews of some applications I'm trying out in my work environment.

Evernote (

At least on the Macintosh (I don't know about Linux and Windows) notebook applications are flooding the market: DevonThink, Yojimbo, Evernote, etc. The idea behind all of these applications is to capture all those little pieces of information you gather each day and put them into an organized system to help you find it again. Evernote is nice because it's cross platform, stores data in the cloud (if you want), and free for certain usage levels. Originally, I was going to make Evernote my PDF document solution, but that experiment was a failure. Evernote bogged down quickly with a large number of PDFs and simply won't import large documents (25 meg or larger). Even worse, there is no simple way to export the PDFs once their in the system; you have to export them in an html-style folder system and then pull the pdfs out from that. Yuk. You can find the PDFs within Evernote's file system under ~/Library/Application Support/Evernote but the file names are encoded and fairly meaningless to a human reader. That said, this system is fine for most text and html documents. Collecting tidbits of information together is still a strong use-case for Evernote. So, Evernote will likely remain a tool I continue to use, but it will not be in my paperless office workflow.

Endnote (

Endnote is a reference manager that has been around for a long time. It's a powerful database solution that's a great tool for writing scientific papers. For a while, Endnote included the ability to embed a PDF of a paper along with it's reference. As it turned out, however, that this process is clumsy, slow, and annoying. Like evernote, the PDFs are renamed as they are included in the database. Great for the database; not great for human readers. Since the goal of Endnote is academic, it's not good for many day to day PDFs either. So, endnote wont do for managing PDFs, though I will continue to use it to manage references and write papers.

Papers (

Papers is an example of what a modern Endnote could have been. Paper's is a PDF document-centric reference manager. For citations, it falls far short of Endnote and has a lot of problems. For example, much of the clever interface is too clever for it's own good. Just try to get a Jr. on the end of an author's name! Not to mention, it essentially thinks all PDFs are journal articles and does not contain enough fields to properly cite a reference (this could be my noobie-ness with the app,though). It's a bit buggy as well. But despite these shortcomings, it's a powerful program. In particular, you can use online resources, such as Google Scholar and Google Books to identify the PDFs and quickly give them a citation. Viewing PDFs is quite nice. And just to make you more mobile, there is an iPhone app as well. So, I suspect I'll spend a great deal of time in this app, but Papers viewed as companion app to Endnote rather than a replacement.

Adobe Acrobat (

The main reason I have the pro version of Acrobat these days is for its OCR capabilities. I can scan a PDF and quickly OCR the document right within Acrobat (and make sure that I keep the original images and hide the OCR text). OCR is not an option for a paperless office, it's required for modern operating systems like Snow Leopard because of all the content information that's index in the OS making the files easy to find.

Yep (

This application is specifically for PDF management. It's designed to avoid dealing the directory structure directly and supports tagging using their openmeta standard. The idea here is that the combination of spotlight searching and tags would provide enough resources to quickly find the material you need. I'm still experiencing the demo version of this app, but I plan to purchase their bundle which goes beyond PDFs. Part of the advantage of Yep is that it doesn't take control of your PDFs, it leaves there where they are. Other solutions, including Papers and Evernote above, move your files into their own system. Papers at least mimics a system I would have used for sorting papers (based on publication years and author names). However, it will certainly take some getting used to. It also plays nicely with Papers, since it can actually search within the Papers folder structure. The latest version of Yep also supports Microsoft Word and Pages files along with PDFs. Yep even supports scanning, although it doesn't OCR.

The obvious question of why get both Yep and Papers arises. The answer is that my PDF documents play different roles. Papers handles research papers; Yep handles everything else that's a PDF. This difference is critical. Scientific papers have a large set of metadata associated with them; their citations. Yep doesn't support this at all, whereas Papers does. Plus, Papers is partially designed for reading the PDFs as well Рit has a full screen mode, for example. Yep is about management and passes off the viewing to other applications. So, together, the apps do nicely…

XCode (

Apple's software development environment has an API for working with PDFs (probably what Papers and Yep use). I use this API for something very specific. I commonly scan double-sided, multi-page documents using a document feeder. However, I can only scan one side at a time. As a result, if I flip the stack and scan the back sides, the pages are seriously out of order. So, I use the API to rearrange the pages automatically. Otherwise, I'd have to manually sort the pages in Acrobat or some other app. I did have an Acrobat script that used to do this work, but this solution seems a bit faster.



Right now, I have an HP all-in-one with a document feeder. It's a bit slow, but it works. It's also very noisy, so I can only use it when the kids are awake. Finally, it's big, so I can't move it to where the documents are, I have to move the documents to the scanner. This combination discourages me from getting all the scan work done. If I had the money, I'd probably get a small Fujita portable scanner that can scan double-sided paper.


Today, I have a great deal of storage on a Drobo. I like the device as it's easy to use and flexible and relatively safe for data storage. The data on the Drobo are well protected from a drive failure, but not from a Drobo hardware failure. So, the Drobo is about reducing risk, but not eliminating it. But even without the drobo, all of my documents probably take up less than 2 gb today. If I get everything converted, it would probably be less than 200 gb. Even today, 200 gb is a fairly trivial amount of space. Thus, backing up this content should be relatively easy and maybe a good candidate for cloud storage, such as Amazon S3 (, Mozy (, or other such service.


Today, there is no single solution that works for my paperless office needs. Such a solution probably will never exist because some documents must be treated differently than the rest. Furthermore, you have to have solutions for each step: 1) acquiring, via external source or scanning, 2) ensure you have the text (e.g. OCR), 3) name and file the content in a meaningful way, and 4) have an easy way to pull it out of the massive amounts of content on your system.