The GCAM execution subsystem
by Beau Cronin Thursday, November 3, 2016

GCAM is the model that we will use to track energy generation and usage, as well as agricultural land use and yield. Like the CESM, it has been under development in an academic setting for many years. While its core functionality is impressive and immensely valuable - incorporating a vast amount of scientific wisdom - the tools and data formats that provide job control and input/output certainly didn't have our use in mind.

So, again like the CESM, it is important to package GCAM in such a way that it can be used as a docile and reliable software component of the larger Earthers system. At this stage of the project, our strategy will be to keep the GCAM code intact, and wrap it in an interface that is more suited to our needs. Should Earthers succeed, it's likely that we'd want to dig into some of the core I/O routines in GCAM and make some different optimizations.

GCAM run execution and behavior

(see here for GCAM tutorial slides)

The core GCAM code is implemented in C++, and can be built without too much trouble on POSIX systems. A future post will describe this process in more detail.

Each GCAM run by default spans all of the years from 2015 to 2100 in five-year time steps. We would like to change the time step to be a single year, and also to make it possible to stop and start runs at any year. A complete run takes about 10 minutes and requires about 4 GB of RAM (i.e., it will not fit within the limits of a Lambda call - this is a real bummer).
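To make the difference between the default and the desired time step concrete, here is a small sketch that simply enumerates the model years in each case. The function name `model_years` is ours, purely for illustration; it is not part of GCAM.

```python
def model_years(start=2015, end=2100, step=5):
    """Return the model years from start to end (inclusive) at the given step."""
    return list(range(start, end + 1, step))


default_years = model_years()        # five-year steps, as GCAM runs by default
annual_years = model_years(step=1)   # the single-year steps we would like

print(len(default_years), len(annual_years))  # 18 86
```

Moving to annual steps means going from 18 model periods to 86, which is worth keeping in mind when thinking about run time and output size.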

Both the input and output data formats are in XML, and GCAM uses the Xerces parser to read and write them. The de facto XML schema (no actual XSD seems to exist) has clearly evolved over time, and is not exhaustively documented. It does not conform to XML best practices: element names are a mixture of hyphenated and camel-cased, attributes and text elements are used inconsistently, element types are mixed together within parents, etc.
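One way to cope with the mixed naming is to normalize element and attribute names as we read the XML. The following is a minimal sketch using Python's standard library. The sample document and its element names are invented for illustration and are not taken from the real GCAM schema; `normalize_name` and `iter_records` are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET


def normalize_name(name):
    """Convert a hyphenated or camelCase name to snake_case."""
    name = name.replace("-", "_")
    # Insert an underscore at each lowercase/digit -> uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


def iter_records(xml_text):
    """Yield one dict per element, with normalized tag and attribute names."""
    root = ET.fromstring(xml_text.strip())
    for elem in root.iter():
        text = (elem.text or "").strip()
        yield {
            "tag": normalize_name(elem.tag),
            "attrs": {normalize_name(k): v for k, v in elem.attrib.items()},
            "text": text or None,
        }


# Invented sample mixing camelCase and hyphenated names (not real GCAM output).
SAMPLE = """
<region name="USA">
  <landAllocation year="2020">12.5</landAllocation>
  <physical-output year="2020">3.1</physical-output>
</region>
"""

for record in iter_records(SAMPLE):
    print(record)
```

This only smooths over naming. The inconsistent use of attributes versus text elements would still need to be handled case by case.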

For interactive scientific use, GCAM comes with a graphical query interface implemented in Java. We will not use this interface, but rather postprocess the results from XML into another form that better suits our needs (see below).

Requirements

With the above in mind, we would like our GCAM execution system to have the following properties:

  • It must be horizontally scalable without tears; in the absence of a serverless option (which would be ideal), provisioning additional capacity should be as simple as adding properly configured instances to a pool.
  • It must be controllable through an HTTP service interface, with model options provided as POST data.
  • Results must be queryable through an HTTP service interface.
  • Overhead should be as low as possible.

It's also important to enumerate the model options that the control interface supports, as well as the query types that the results interface is able to serve. Both of these still need to be fleshed out.
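As a starting point for that enumeration, here is a sketch of how the control interface might validate the options in a POST body. The option set is an assumption: it covers only the start year, end year, and time step discussed above, and the names `RunOptions` and `parse_run_options` are hypothetical. The real list of options is still to be decided.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunOptions:
    """Hypothetical run options; defaults mirror GCAM's default run."""
    start_year: int = 2015
    end_year: int = 2100
    time_step: int = 5

    def __post_init__(self):
        if self.time_step < 1:
            raise ValueError("time_step must be at least 1")
        if not 2015 <= self.start_year <= self.end_year <= 2100:
            raise ValueError("years must satisfy 2015 <= start <= end <= 2100")


def parse_run_options(body):
    """Parse a JSON POST body into RunOptions, rejecting unknown options."""
    data = json.loads(body) if body else {}
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    allowed = {f.name for f in fields(RunOptions)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    for key, value in data.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer")
    return RunOptions(**data)


print(parse_run_options('{"start_year": 2020, "time_step": 1}'))
```

Rejecting unknown options up front keeps a typo in a request from silently falling back to a default and costing a ten-minute run.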

Proposed Architecture

With these requirements in mind, we propose the following architecture:

  • Chalice microservice for job control
  • SQS queue for job tracking (or Dynamo table?)
  • Docker container for each worker, containing a process that:
    • polls the job queue
    • executes a run
    • writes job metadata to Dynamo and raw output to S3
  • Lambda function triggered by S3 object creation (or Dynamo stream?) to postprocess results and write them to...
  • Postgres database for model results
  • Chalice microservice for querying model results
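The worker's part of this design can be sketched as follows. To keep the example self-contained, it uses in-memory stand-ins: a `queue.Queue` in place of SQS and plain dicts in place of Dynamo and S3. The function names, the `raw/<job_id>.xml` key layout, and the status values are all assumptions for illustration; a real worker would call the AWS services and invoke the GCAM binary where `fake_run_model` appears.

```python
import queue
import uuid


def submit_job(job_queue, options):
    """Enqueue a job and return its id (what the control service would do)."""
    job_id = str(uuid.uuid4())
    job_queue.put({"job_id": job_id, "options": options})
    return job_id


def run_worker_once(job_queue, run_model, metadata_store, output_store):
    """Process at most one job; return its id, or None if the queue is empty."""
    try:
        job = job_queue.get_nowait()
    except queue.Empty:
        return None

    job_id = job["job_id"]
    metadata_store[job_id] = {"status": "running", "options": job["options"]}
    try:
        raw_output = run_model(job["options"])
    except Exception as exc:
        # Record the failure so the control service can report it.
        metadata_store[job_id].update(status="failed", error=str(exc))
        return job_id

    key = f"raw/{job_id}.xml"  # hypothetical key layout
    output_store[key] = raw_output
    metadata_store[job_id].update(status="succeeded", output_key=key)
    return job_id


def fake_run_model(options):
    """Stand-in for a GCAM run; returns placeholder XML."""
    return "<results/>"


jobs, metadata, outputs = queue.Queue(), {}, {}
job_id = submit_job(jobs, {"time_step": 1})
run_worker_once(jobs, fake_run_model, metadata, outputs)
print(metadata[job_id]["status"])  # succeeded
```

Passing the queue, stores, and model runner in as arguments keeps the worker logic testable without any AWS resources, and it leaves the SQS-versus-Dynamo question open.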