SciPy 2015: Building Civis’s predictive modeling with Python

August 13, 2015 Stephen Hoover

At the beginning of July, I traveled to Austin for SciPy 2015, the annual conference dedicated to scientific computing in Python. Python is a powerful, easy-to-use programming language that has become very popular in the data science community. I was at the conference because the data science department at Civis Analytics uses Python to write all of the predictive modeling code that goes into the new Civis platform, and it’s been a wonderful tool for us.

Why do we like Python at Civis Analytics? Python’s expressiveness lets us write software faster. Beyond the initial creation of software features, we’ve found that Python works well at every stage of our code’s lifecycle: creation, testing, review, deployment, maintenance, and improvement. The popularity of Python also means that resources are readily available to help developers through any problems. Python programmers are friendly and helpful people, and there’s a great store of knowledge available in books and on websites to anyone using the language.

The other advantage to working in Python is the open-source community represented by the attendees of the SciPy conference. We at Civis are able to take advantage of the accumulated knowledge of the machine-learning research community contained in scikit-learn, high-performance numerical tools in NumPy, and the powerful tools for working with tabular data in the pandas library, among many others. The availability of this professional-grade, open-source software lets Civis’s data scientists focus on building new tools for organizations to take advantage of all of the data available to them.

While at the SciPy conference, I had the opportunity to talk about some of our experiences with writing Civis’s modeling software. Watch the video here:

One of the earliest choices you need to make when developing in Python is which version of the language to use. Even though Python 3 was released in 2008, adoption has been slow. By this point, however, it’s clear that new projects should use Python 3 (and Civis’s data science department has been doing so for over a year). There’s no one “killer” feature in Python 3 compared to Python 2, but there are a lot of little things that add up to make Python 3 much nicer to use. You get better handling of concurrency with asyncio, the improved “forkserver” context for multiprocessing, function annotations for enhanced bug checking and more helpful IDEs, and more. The only thing that might require you to use Python 2 is third-party software which isn’t Python 3 compatible. All the larger, actively maintained libraries are Python 3 compatible, but there are still lots of older libraries which aren’t. If you’re forced to use Python 2, you should at least make sure that your own software is Python 3 compatible. Statements such as from __future__ import division, print_function and libraries such as six make this easy.
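
As a rough illustration of that last point, here is a minimal sketch of a module written to behave the same way on Python 2 and Python 3. The mean function is a made-up example, not part of the Civis codebase. It relies only on the __future__ imports and leaves out six, which you would add for things like string types and renamed standard library modules.

```python
# The __future__ imports must come before any other code in the module.
# On Python 3 they have no effect; on Python 2 they switch on the
# Python 3 behavior for "/" and print.
from __future__ import division, print_function


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers.

    Hypothetical helper for illustration. With true division enabled,
    mean([1, 2]) is 1.5 on both Python 2 and Python 3 (without the
    import, Python 2 would truncate the result to 1).
    """
    return sum(values) / len(values)


# print is a function on both versions thanks to print_function.
print("mean:", mean([1, 2]))

# When you really do want integer division, say so explicitly with "//".
print("floor:", 7 // 2)
```

Code written this way keeps working unchanged when you eventually drop Python 2.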

Another thing to remember (and this is vital for any software project) is to build in comprehensive testing from the ground up. If there’s any part of your software which isn’t being tested, it’s probably broken. If you’re lucky, it’s broken in a way which doesn’t affect the output you care about… how lucky do you feel? I recommend you make liberal use of libraries such as unittest (Python standard library) and nose (a helpful third-party package). Make sure to check out the unittest.mock module for a powerful way to divide your testing into small, fast portions. Every time you find a bug, add a new test to make sure it never comes back. Never merge any new change until it passes all of your tests!
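
To make that concrete, here is a small sketch of what such a test might look like. The score_rows function and its loader argument are hypothetical stand-ins for code that would normally talk to a database or other slow resource. unittest.mock replaces that resource so the test stays small and fast, and the second test shows the kind of regression check you might add after finding a bug with empty input.

```python
import unittest
from unittest import mock


def score_rows(loader):
    """Hypothetical function under test: average the "y" field of the
    rows returned by loader.fetch()."""
    rows = loader.fetch()
    return sum(row["y"] for row in rows) / len(rows)


class ScoreRowsTest(unittest.TestCase):
    def test_mean_of_y(self):
        # Stand in for the real (slow) data source with a Mock object.
        loader = mock.Mock()
        loader.fetch.return_value = [{"y": 1.0}, {"y": 3.0}]

        self.assertEqual(score_rows(loader), 2.0)
        # Check that the data source was queried exactly once.
        loader.fetch.assert_called_once_with()

    def test_empty_input_raises(self):
        # Regression-style test: pin down what happens with no rows.
        loader = mock.Mock()
        loader.fetch.return_value = []

        with self.assertRaises(ZeroDivisionError):
            score_rows(loader)
```

Saved in a file whose name starts with test_, these tests can be run with python -m unittest or picked up by nose.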

Finally, one of the biggest lessons I’ve learned from building the Civis modeling library is that sometimes you just have to go back and rewrite. When you’re building a new software library, it’s not always clear at first how all of the parts are going to fit together. Moreover, you sometimes choose the easiest solution to your immediate problem, and that choice makes future work more difficult. If your software will only be used once, or will never change, that’s not a problem. But if you want your software to be reusable, extensible, and easy to maintain, sometimes you’ve got to change what you actually wrote into what you wish you’d written.

And don’t forget to check out all of the other wonderful talks and tutorials on the SciPy 2015 website.

The post SciPy 2015: Building Civis’s predictive modeling with Python appeared first on Civis Analytics.
