Grid / Batch / Secure / Cloud / Fun / Powerful Computing with Civis Platform

November 9, 2017 Matthew B.

We’ve all been there. You are running some compute-heavy code on your laptop at work. Now, your computer is making a horrible noise, and more importantly, it is really slowing down your Facebook feed. Wouldn’t it be nice if we could do that computing elsewhere so that we can all get back to cat photos and dumpster fire GIFs?

To the cloud! Amazon and Google sell access to their massive compute infrastructure so that we don’t all have to stand up physical machines ourselves. Unfortunately, even with awesome cloud computing providers, there is a whole host of other issues that a lot of data scientists just don’t want to, or don’t know how to, deal with. How do we make sure this infrastructure is properly secured, especially over the internet, and also cost-effective? How do we make it easy and intuitive to use? What do we do about getting code, data, and packages there and back?

Grid Computing as a Common Cloud Access Pattern

Clearly, access to computing power alone is not enough. Data scientists need tooling on top of this infrastructure that enables them to use it securely and cost-effectively for the most common access patterns and workflows. Civis Platform is a core part of this tooling, providing a variety of interfaces to cover many of these common use cases. For users who want to run scripts without worrying about the underlying computing environment, Civis Platform provides Python, R, JavaScript, and SQL scripts with predetermined compute environments. For users who need more flexibility, Civis Platform provides container scripts with customizable Docker images and access to version control providers (e.g., GitHub, Bitbucket). These computing interfaces are the core way in which our own data scientists automate their work and make it reproducible.

Interestingly, the high performance and grid computing communities maintain tools aimed at similar use cases (e.g., HTCondor, OpenPBS/TORQUE, Slurm). These tools specialize in executing ad-hoc, computationally intensive analyses on large clusters of computers.

Collectively they support the following data and computing access pattern:

  1. Put your data and code on a centralized file system
  2. Write a script to run your code
  3. Submit your job to the compute cluster
  4. Wait for it to finish (e.g., they send you an email)
  5. Download results to your local machine for lightweight analysis/plotting
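With Slurm, for instance, steps 2 and 3 reduce to a short batch script plus a one-line submission (a sketch only; `fit_model.py`, the job name, and the resource numbers are placeholders, not from the original post):

```shell
#!/bin/bash
# Batch script for step 2: declare resources, then run the code.
#SBATCH --job-name=fit_model
#SBATCH --cpus-per-task=8
#SBATCH --mem=4G
#SBATCH --mail-type=END          # step 4: email when the job finishes

python fit_model.py
```

Step 3 is then `sbatch fit_model.sh`, which prints a job ID you can use with `squeue` and `scancel` — much like the `civis-compute submit`/`status`/`cancel` workflow described below.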

Here at Civis, we identified a need to use our own computing infrastructure, Civis Platform, in the same way:

  1. Put your code and/or data in the cloud (e.g., GitHub, AWS S3, AWS Redshift)
  2. If needed, build a Docker container with your compute environment
  3. Submit your script to Civis Platform
  4. Wait for it to finish (Civis Platform sends you an email or you check the GUI)
  5. Download results to your local machine (or a notebook on Civis Platform!) for lightweight analysis/plotting

Notice the only real differences here are 1) where the data and code go, 2) the potential use of a GUI, and 3) containers!

For data scientists, using Civis Platform as grid computing infrastructure offers several advantages over managing your own instances on a cloud computing provider. We provide all of the security and credential management for you — we even achieved SOC 2 Type II certification end-to-end with Civis Platform. With Civis Platform, you can access data in your Redshift cluster and on S3 through the Civis API, simplifying data access. We provide autoscaling for your compute instances so that they go away when you don’t need them. We provide a GUI to monitor your jobs and notifications when they finish. Each job exists in Civis Platform with a unique identifier and can be rerun at any time. Finally, Civis Platform seamlessly integrates with all of your public Docker containers, meaning your compute environment is up and running on Civis Platform in minutes. (Don’t have a Docker container? Sure, use one of ours! They have all of the common stuff, curated by our own data scientists no less, that you need in R or Python.)

Meet civis-compute

civis-compute is our grid computing interface for Civis Platform. It is a small command line utility that seamlessly runs code on Civis Platform. It was built to feel and function like traditional grid computing tools, while also having modern features for typical cloud computing stacks, like the ability to specify a Docker image. We wrote it to be a quick and easy way to run ad-hoc analyses in Civis Platform without incurring the overhead of manually pushing your data to the cloud and composing a container script with the proper commands. Underneath, civis-compute uses Civis Platform container scripts to execute jobs and Civis Platform’s file storage for job inputs and outputs.

So how do you try this out? Sign up for a Civis Platform free trial!

Then install civis-compute with pip:

$ pip install civis-compute

Make sure to put your Civis API key in your local environment.
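Concretely, the Civis client libraries read the key from the CIVIS_API_KEY environment variable, so setting it looks like this (the key value below is a placeholder):

```shell
# Store your Civis API key where the client libraries look for it.
export CIVIS_API_KEY="your-api-key-here"
```

You would typically add this line to your shell profile so that it is set in every session.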

Next, let’s take a look at the civis-compute command line interface:

$ civis-compute
Usage: civis-compute [OPTIONS] COMMAND [ARGS]...

  Welcome to the civis-compute command line interface!

  Make sure to have your Civis API key in the local environment as
  CIVIS_API_KEY.

Options:
  --help  Show this message and exit.

Commands:
  cache   Manage the cache of file IDs.
  cancel  Cancel Civis Platform container scripts.
  get     Download the outputs for a given SCRIPTID.
  status  Inspect Civis Platform container scripts.
  submit  Submit a SCRIPT to Civis Platform.

You can submit shell commands, shell scripts, Python scripts, Jupyter notebooks, and R scripts.

Configuration is possible either via command line switches or via comments in the code:

$ cat 
import time

#CIVIS name=zombie  # the name of the job in Civis Platform
#CIVIS required_resources={cpu: 256, memory: 1024, disk_space: 1}
#CIVIS docker_image_name=civisanalytics/datascience-python
#CIVIS docker_image_tag=3.2.0

t0 = time.time()
while True:
    if time.time() - t0 > 10:
        print('oooooooooooh!', flush=True)
        t0 = time.time()

Submitting the job works in one line:

$ civis-compute submit

Determining the status of the job requires the job ID:

$ civis-compute status 7519314
name: zombie
id: 7519314
  export CIVIS_JOB_DATA=/data/civis_job_data_${CIVIS_JOB_ID}_${CIVIS_RUN_ID} && \
  mkdir -p ${CIVIS_JOB_DATA} && \
  civis files download 6509699 && \
  chmod a+rwx && \
  python && \
  echo "Job Output:" && \
  tar -czvf ${CIVIS_JOB_DATA}.tar.gz -C /data civis_job_data_${CIVIS_JOB_ID}_${CIVIS_RUN_ID} && \
  if [ "$(ls -A ${CIVIS_JOB_DATA})" ]; then \
    civis scripts post-containers-runs-outputs ${CIVIS_JOB_ID} ${CIVIS_RUN_ID} \
    `civis files upload ${CIVIS_JOB_DATA}.tar.gz` File; \

Finally, if you write data to the directory given by the environment variable ${CIVIS_JOB_DATA} as your job is running, then you can download it in one command:

$ civis-compute get 7519314

A compressed, tarred archive with the specific job and run ID will be downloaded to your local working directory.
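Inside a job, taking advantage of this is just a matter of writing files to that directory. Here is a minimal sketch (the `results.pkl` file name and its contents are illustrative, not from the original post):

```python
import os
import pickle

# Inside a running job, CIVIS_JOB_DATA names a per-run scratch directory;
# anything written there is archived and attached to the job as an output.
# Fall back to the current directory so the script also runs locally.
out_dir = os.environ.get("CIVIS_JOB_DATA", ".")

results = {"test_accuracy": 0.913}
with open(os.path.join(out_dir, "results.pkl"), "wb") as f:
    pickle.dump(results, f)
```

After the run finishes, `civis-compute get <SCRIPTID>` retrieves the archive containing `results.pkl`.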

civis-compute has several other nice features including the ability to inspect your jobs, the ability to cancel jobs, and the ability to cache uploads of large files to Civis Platform in order to increase the speed at which you can iterate while working. See the full documentation for more details.

Putting Civis Platform to Work

Let’s use Civis Platform and civis-compute to do something non-trivial. We have written a convolutional neural network classifier using muffnn, our TensorFlow-based neural network package. Let’s fit the Fashion-MNIST dataset using eight CPU cores in a Civis Platform container script. Here is a notebook with code to fit the classifier.

Note that the classifier itself is in a separate file,

$ cat 
import numpy as np

from sklearn.base import ClassifierMixin, BaseEstimator
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import check_random_state
from sklearn.exceptions import NotFittedError

import tensorflow as tf

from muffnn.core import TFPicklingBase, affine

class ConvNetClassifier(TFPicklingBase, BaseEstimator, ClassifierMixin):
    """Scikit-learn compatible convolutional neural network classifier.

    conv_hidden_units : list of tuples, optional
        Indicates the size and number of the convolutional filters.
    max_pool_size : int
        Size of max pooling regions applied after each convolutional

I am using the civis-compute files option to upload this source code into the job as well. Note that the Docker container, civisanalytics/datascience-python, already has muffnn installed. However, if it did not, we could use the shell_cmd parameter to install the package on the fly.

We can submit the notebook to Civis Platform in one command:

$ civis-compute submit fashion_mnist.ipynb

Underneath, the notebook is first converted to a Python script and is then executed. Let’s check the outputs:

$ civis-compute status 7519366
name: fashion mnist
id: 7519366
  export CIVIS_JOB_DATA=/data/civis_job_data_${CIVIS_JOB_ID}_${CIVIS_RUN_ID} && \
  mkdir -p ${CIVIS_JOB_DATA} && \
  civis files download 6509754 && \
  civis files download 6509755 fashion_mnist.ipynb && \
  chmod a+rwx fashion_mnist.ipynb && \
  ipython --InteractiveShell.colors='nocolor' fashion_mnist.ipynb && \
  echo "Job Output:" && \
  tar -czvf ${CIVIS_JOB_DATA}.tar.gz -C /data civis_job_data_${CIVIS_JOB_ID}_${CIVIS_RUN_ID} && \
  if [ "$(ls -A ${CIVIS_JOB_DATA})" ]; then \
    civis scripts post-containers-runs-outputs ${CIVIS_JOB_ID} ${CIVIS_RUN_ID} \
    `civis files upload ${CIVIS_JOB_DATA}.tar.gz` File; \
log file:
  [2017-10-04T03:55:07.000Z] Queued
  [2017-10-04T03:55:07.000Z] Running
  [2017-10-04T03:55:08.000Z] Dedicating resources
  [2017-10-04T03:55:10.000Z] Downloading code and container
  [2017-10-04T03:55:11.000Z] Executing script
  [2017-10-04T03:58:01.000Z] epoch 0.20: test accuracy = 0.834
  [2017-10-04T03:58:29.000Z] epoch 0.40: test accuracy = 0.857
  [2017-10-04T03:58:57.000Z] epoch 0.60: test accuracy = 0.870
  [2017-10-04T04:07:52.000Z] epoch 4.40: test accuracy = 0.919
  [2017-10-04T04:08:20.000Z] epoch 4.60: test accuracy = 0.916
  [2017-10-04T04:08:48.000Z] epoch 4.80: test accuracy = 0.910
  [2017-10-04T04:09:16.000Z] epoch 5.00: test accuracy = 0.913
  [2017-10-04T04:09:17.000Z] Job Output:
  [2017-10-04T04:09:17.000Z] civis_job_data_7519366_61052013/
  [2017-10-04T04:09:17.000Z] civis_job_data_7519366_61052013/cnn.pkl
  [2017-10-04T04:09:24.000Z] link:
  [2017-10-04T04:09:24.000Z] name: civis_job_data_7519366_61052013.tar.gz
  [2017-10-04T04:09:24.000Z] objectId: 6509829
  [2017-10-04T04:09:24.000Z] objectType: File
  [2017-10-04T04:09:24.000Z] Process used approximately 842.75 MiB of its 16384 MiB memory limit
  [2017-10-04T04:09:24.000Z] Finished
  [2017-10-04T04:09:25.000Z] Script complete.

Finally, we can download the results to our local machine:

$ civis-compute get 7519366
$ tar xzvf civis_job_data_7519366_61052013.tar.gz
x civis_job_data_7519366_61052013/
x civis_job_data_7519366_61052013/cnn.pkl

Compute All the Things

Civis Platform with civis-compute is a powerful tool for general purpose computing tasks that are too expensive to run locally. Here at Civis, we use this tool to do everything from testing new algorithms to performance benchmarking. By exposing Civis Platform as a traditional grid computing resource, we help our data scientists iterate on their work faster and more easily.

