The Civis API: Scale Up Your Data Science

September 2, 2015 | James Michelson

In the final stretch of a major client project back in 2014, a fellow data scientist at Civis Analytics whispered a modest proposal to me: “We should be able to build all these models from massive amounts of survey data — and actually get some sleep.” A reasonable idea, and one we ended up realizing with the Civis API. While the interface of the Civis platform gives me the tools (which you can use too!) to solve client problems and build workflows, with the API, I can script the Civis platform to process more data and build more models, faster, than I would ever have time to do by hand.

I work as a data scientist in the Applied Data Science department, building analytics solutions for clients in a variety of industries and sectors. At first, I thought the hardest problems I would be solving at Civis Analytics would be around modeling methods and algorithms — and trust me, those problems are hard! — but even more challenging is how to design, scale and execute whatever methodological solution I come up with. No matter the project, my toughest questions are about scale and workflows: Can I build a predictive model on this much data in less than an hour? Can I scale this process to build hundreds of models as our business ramps up? How can I empower the rest of my team to use my work? Can we plug these Civis workflows into processes that happen elsewhere?

I rely on the Civis API every day to develop workflows and pipelines that solve these scaling problems for my team. Many of the problems I face are already solved by the large suite of tools built into the platform. However, when I encounter a new problem, I can write a script that calls existing functionality through the API and supplements it with whatever custom solutions I write myself.

My current challenge is building an automated modeling pipeline that can handle approximately 200,000 new data points generated from rolling, weekly surveys. That comes with three challenges:

  • Challenge 1: Survey data can be messy and needs extensive quality control.
  • Challenge 2: We’ll use hundreds of predictors in our models, and yet all of those models will need to run over the course of just a few hours.
  • Challenge 3: Our clients want new information as soon as possible, and we need to build new reports or update existing ones just as quickly.

In the not-so-distant past, we’d have anywhere from two to a dozen full-time data scientists working on this sort of pipeline. One person would check the data, another would model, another would analyze, and maybe some more people would build reports.

All these smart data scientists should be able to do more than operate just one part of our business. And with the Civis platform, I can orchestrate that entire workflow by myself. It means that the rest of our data scientists can build excellent workflows of their own, handle more clients, and produce more data science insight. We can even collaborate and share our solutions across projects and clients.

The platform is our force-multiplier, and one aspect of it in particular has been instrumental in making my life easier: the API. With the API and some basic Python skills, I can develop chained workflows that start automatically and on a schedule. Those workflows can operate on more datasets than I would have the ability to set up by hand. And best of all, I can set them to run even when I’m not in the office or when I’m working on something else, so I don’t need to repeat the same tasks over and over again.

Part of my code for creating and building a model in Python with a set of parameters would look something like this:

import requests
from urllib.parse import urljoin  # urlparse on Python 2

# api_url, api_key, and model_params are assumed to be defined elsewhere

# create model job
model = requests.request('POST', urljoin(api_url, 'models'),
                         auth=(api_key, ''), data=model_params)
model_id = model.json()['id']

# launch model
launch_url = urljoin(api_url, 'models/{id}/builds'.format(id=model_id))
requests.request('POST', launch_url, auth=(api_key, ''))
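The model_params payload above is where the actual modeling choices are encoded. Purely as a sketch (the field names below are hypothetical placeholders, not the documented Civis API schema), it might look something like this:

# hypothetical illustration of a model configuration payload;
# the real parameter names come from the Civis API documentation
model_params = {
    'name': 'weekly_survey_support_model',
    'model_type': 'sparse_logistic',            # which algorithm to fit
    'dependent_variable': 'supports_issue',
    'table_name': 'surveys.weekly_responses',   # source table in Redshift
    'interaction_terms': 'true',                # search for interactions algorithmically
    'limit': "survey_wave = '2015-08-31'",      # restrict which cases get modeled
}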

I can specify everything from the type of model used to an option to algorithmically find interaction terms, and even an option to limit the cases I want to model in my Amazon Redshift tables. Once this model has finished building, I can apply it to generate predictions with code that might look something like this:

# score_params is assumed to be defined elsewhere, analogous to model_params

# create prediction job
url = urljoin(api_url, 'models/{id}/predictions'.format(id=model_id))
predictions = requests.request('PUT', url, auth=(api_key, ''), data=score_params)
predict_job_id = predictions.json()['id']

# score table
url = urljoin(api_url, 'predictions/{id}/runs'.format(id=predict_job_id))
requests.request('POST', url, auth=(api_key, ''))
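Chaining steps like these also means knowing when a run has finished. One way to do that is to capture the response from the final POST above and poll the run until it is done; the GET call and the 'state' field in this sketch are assumptions for illustration, not behavior documented in this post:

import time

# capture the response from the POST that kicks off the scoring run
run = requests.request('POST', url, auth=(api_key, '')).json()

# hypothetical status check: poll the run until it reaches a terminal state
status_url = urljoin(api_url, 'predictions/{pid}/runs/{rid}'.format(
    pid=predict_job_id, rid=run['id']))
while True:
    state = requests.request('GET', status_url, auth=(api_key, '')).json()['state']
    if state in ('succeeded', 'failed'):
        break
    time.sleep(60)  # check once a minute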

This code makes modeling incredibly easy to scale. It is straightforward to pipeline and extend to cover whatever use cases you need. In fact, all platform functions can be programmatically kicked off and their results returned. As a result, building a single custom workflow for a particular client does not monopolize all our person hours, meaning that we can handle more work than ever before.
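As a concrete sketch of that kind of scaling, the same two calls from the first snippet can be wrapped in a loop over many parameter sets, so hundreds of models launch from a single script (all_model_params below is a placeholder for however the configurations get generated):

# sketch: launch one model build per parameter set, reusing the calls above
model_ids = []
for params in all_model_params:  # placeholder: one dict of model options per model
    response = requests.request('POST', urljoin(api_url, 'models'),
                                auth=(api_key, ''), data=params)
    model_id = response.json()['id']
    build_url = urljoin(api_url, 'models/{id}/builds'.format(id=model_id))
    requests.request('POST', build_url, auth=(api_key, ''))
    model_ids.append(model_id)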

The tools in the platform, coupled with the flexibility of the API, let you customize your solutions so that, once everything is set up, you can catch up on some shut-eye.
