Using SQLite and Python to Store Model Metadata
As I continue to run a range of climate models, I’ve learned from painful lessons that I need to record as much information about the model run as possible. When I first started this process, I simply kept files used to make the run (the geography and configuration files for the model) and the model output. At first, this seemed sufficient because, in the end, these were the data that were most important. As it turns out, however, that having a history of everything you did during the model run, such as adjustments to the settings or geography, is also important both historically to the run and possibly sorting out problems later.
My initial solution to this problem was to create a log file. Every time I ran the model, the important setting information was sent to a simple flat-file log. It turned out that this log was very important to debugging a model-run issue because it kept a record of how the model was initially run. I also started keeping information about the hardware in this log. Along with the model information, I began to store hardware temperature data from before and after the run in the log just in case I needed to debug hardware issues. However, these data turned out to be virtually useless in a flat log file. Other information I haven’t been keeping that I wanted in the log was geography version control information. I use version control to track all my geography work, so I can both track how I change the geography and get an idea how much time I spend on it. However, the exact geography used in a run is important to know. However, even more info in a flat log file makes it even more difficult to review.
My new solution is to dump the flat file approach and go with SQLite. SQLite is a lightweight, public domain SQL file format that works well with a variety of languages. SQLite has become one of my preferred file formats over the years (nudging out XML for custom data structures). The Python scripting language seems a natural fit to work with SQLite as well.
So, how does this solution work? FIrst, I have a simple run script for the model using bash (for some reason, I could never get the model to run using Python). This script calls my python script before the model starts and after the model ends. It sends two pieces of information, a uuid and the model directory path. The python script assembles everything it needs on its own.
Why a uuid? Each time I run the model, I need to identify the run in the database with a unique id that can be used to link across a number of SQL tables. A uuid ensures that the id is unique. I’ve considered using a uuid for the overall simulation but I haven’t implemented that.
To pull in settings data and temperature data, I’ve written parsers for each format. For the model I’ve been running, FOAM, I have parsers that read the atmos_params and run_params files in addition to parsing the temperature monitor software and subversion “svn info” command. The script then inserts these data into their own tables marked by the uuid. While most of these tables have fields for each value I pull out of files, the temperature data is stored in key->value type table since the number of temperature sensors is dependent on hardware and thus may change from machine to machine (and is also Mac only).
Here is the schema for the main table, “runs”:
`
CREATE TABLE runs (
uuid text,
starttime text,
endtime text,
runduration text,
std_out blob,
completed text,
comments text,
runcmd text,
yearsPerDay text
);
`
Some of these fields are not yet used. std_out and the rancmd is not yet implemented in the script . Right now, I’ll do the comments field manually. My currently running simulation looks like this at the moment:
uuid = **deleted
starttime = 2010-01-16 23:33:45
endtime = 2010-01-17 10:29:57
runduration = 39372.0
std_out =
completed = NO
comments = manual shutdown because of memory problem
runcmd =
yearsPerDay = 14.2
For the geography source location, here’s the results for the run above:
uuid = **deleted
url = file:///Volumes/**deleted
type = svn
date = 2009-08-28 09:42:01 -0500 (Fri, 28 Aug 2009)
author = tlmoore
rev = 191
branch =
the branch is empty in anticipation for moving from subversion to git.
For temperatures, I can now look at before and after values for specific sensors for a run:
uuid = *deleted
sensor = SMC CPU A HEAT SINK
temperature = 64.4
uuid = *deleted
sensor = SMC CPU A HEAT SINK
temperature = 98.6
One thing I’d change here is specifying pre- versus post-run measurements.
So far, I’m happy with most of this new solution. It just need refinements.