Workflows in Python: Getting data ready to build models

December 17, 2015 Katie Malone

A couple of weeks ago, I had the opportunity to host a workshop at the Open Data Science Conference in San Francisco, where I walked through my process of rapidly prototyping a machine learning model and then iterating on it. When I'm building a machine learning model in scikit-learn, I usually don't know at the outset exactly what the final model will look like. Instead, I've developed a workflow that focuses on getting a quick-and-dirty model up and running as fast as possible, and then going back to iterate on the weak points until the model seems to be converging on an answer.

This process has three phases, which I'll highlight in an example I created to predict failures of wells in Africa. In this blog post, I'll show how I got the raw data machine-learning ready and built a few quick models. In subsequent posts, I'll revisit some of the choices made in this first pass, effectively cleaning up some of the messes I made in the interest of moving quickly. Lastly, I'll introduce scikit-learn Pipelines and GridSearchCV, a pair of tools for quickly chaining together pieces of data science machinery and comprehensively searching for the best model.

The example problem is the “Pump it Up: Mining the Water Table” challenge on DrivenData, which has examples of wells in Africa, their characteristics, and whether they are functional, non-functional, or functional but in need of repair. My goal is to build a model that takes the characteristics of a well and correctly predicts which category that well falls into. A quick print statement shows that the labels are strings:

import pandas as pd
import numpy as np

features_df = pd.DataFrame.from_csv("well_data.csv")
labels_df   = pd.DataFrame.from_csv("well_labels.csv")
print( labels_df.head(20) )
                  status_group
id
69572               functional
8776                functional
34310               functional
67743           non functional
19728               functional
9944                functional
19816           non functional
54551           non functional
53934           non functional
46144               functional
49056               functional
50409               functional
36957               functional
50495               functional
53752               functional
61848               functional
48451           non functional
58155           non functional
34169  functional needs repair
18274               functional

The machine learning algorithms downstream are not going to handle it well if the class labels used for training are strings; instead, I’ll want to use integers. The mapping that I’ll use is that “non functional” will be transformed to 0, “functional needs repair” will be 1, and “functional” becomes 2. When I want a specific mapping between strings and integers, like here, doing it manually is usually the way I go. In cases where I’m more flexible, there’s also the sklearn LabelEncoder.
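For the flexible case, a LabelEncoder sketch looks like this. Note that LabelEncoder chooses its own mapping, assigning integers in the sorted order of the unique labels, which is exactly why I encode manually when I need a specific mapping:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers in sorted order of the unique
# labels, so the mapping differs from the manual one above
encoder = LabelEncoder()
labels = ["functional", "non functional", "functional needs repair"]
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # sorted unique labels
print(encoded)
```

Here "functional" sorts first and so becomes 0, whereas my manual mapping makes it 2, so LabelEncoder is only a good fit when any consistent mapping will do.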

There are a number of ways to do the transformation here; the framework below uses applymap() in pandas (see the pandas documentation for the details of applymap()). In the code below, the body of label_map(y) returns 2 if y is “functional”, 1 if y is “functional needs repair”, and 0 if y is “non functional”.

As an aside, I could also use apply() here if I like. The difference between apply() and applymap() is that applymap() operates on a whole dataframe while apply() operates on a series (or you can think of it as operating on one column of your dataframe). Since labels_df only has one column (aside from the index column), either one will work here.
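To see that equivalence on a toy version of labels_df (using the status_group column name from this dataset and a trimmed-down mapping):

```python
import pandas as pd

# toy single-column frame mirroring labels_df
df = pd.DataFrame({"status_group": ["functional", "non functional"]})
mapping = {"functional": 2, "non functional": 0}

# applymap operates elementwise on the whole dataframe...
whole_frame = df.applymap(mapping.get)
# ...while apply on one column operates elementwise on that Series
one_column = df["status_group"].apply(mapping.get)

print(whole_frame["status_group"].tolist())
print(one_column.tolist())
```

Both calls produce the same encoded values; the only difference is that applymap() hands back a dataframe while apply() on a column hands back a series.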

def label_map(y):
    if y=="functional":
        return 2
    elif y=="functional needs repair":
        return 1
    else:
        return 0
labels_df = labels_df.applymap(label_map)
print( labels_df.head() )
       status_group
id
69572             2
8776              2
34310             2
67743             0
19728             2

Now that the labels are ready, I’ll turn my attention to the features. Many of the features are categorical, where a feature can take on one of a few discrete values, which are not ordered. In transform_feature( df, column ), I take features_df and the name of a column in that dataframe, and return the same dataframe but with the indicated feature encoded with integers rather than strings. This is something I’ll revisit in the next post, where I talk about dummying out categorical features with OneHotEncoder in sklearn or get_dummies() in pandas.
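As a quick preview of that dummying alternative, here's a sketch of what get_dummies() does to a toy version of one categorical column (the values are borrowed from this dataset, but this isn't the real dataframe):

```python
import pandas as pd

# toy version of one categorical column from the well data
toy = pd.DataFrame({"basin": ["Lake Nyasa", "Pangani", "Lake Nyasa"]})

# get_dummies makes one 0/1 indicator column per distinct value,
# instead of a single integer-coded column
dummies = pd.get_dummies(toy["basin"], prefix="basin")
print(dummies.columns.tolist())
```

Three rows with two distinct values become a 3x2 frame of indicators, one column per value, which avoids imposing a fake ordering on the categories.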

      amount_tsh date_recorded        funder  gps_height     installer  \
id
69572       6000       3/14/11         Roman        1390         Roman
8776           0        3/6/13       Grumeti        1399       GRUMETI
34310         25       2/25/13  Lottery Club         686  World vision
67743          0       1/28/13        Unicef         263        UNICEF
19728          0       7/13/11   Action In A           0       Artisan

       longitude   latitude              wpt_name  num_private  \
id
69572  34.938093  -9.856322                  none            0
8776   34.698766  -2.147466              Zahanati            0
34310  37.460664  -3.821329           Kwa Mahundi            0
67743  38.486161 -11.155298  Zahanati Ya Nanyumbu            0
19728  31.130847  -1.825359               Shuleni            0

                         basin   ...   payment_type  \
id                               ...
69572               Lake Nyasa   ...       annually
8776             Lake Victoria   ...      never pay
34310                  Pangani   ...     per bucket
67743  Ruvuma / Southern Coast   ...      never pay
19728            Lake Victoria   ...      never pay

      water_quality quality_group      quantity quantity_group  \
id
69572          soft          good        enough         enough
8776           soft          good  insufficient   insufficient
34310          soft          good        enough         enough
67743          soft          good           dry            dry
19728          soft          good      seasonal       seasonal

                     source           source_type source_class  \
id
69572                spring                spring  groundwater
8776   rainwater harvesting  rainwater harvesting      surface
34310                   dam                   dam      surface
67743           machine dbh              borehole  groundwater
19728  rainwater harvesting  rainwater harvesting      surface

                   waterpoint_type waterpoint_type_group
id
69572           communal standpipe    communal standpipe
8776            communal standpipe    communal standpipe
34310  communal standpipe multiple    communal standpipe
67743  communal standpipe multiple    communal standpipe
19728           communal standpipe    communal standpipe

[5 rows x 39 columns]

def transform_feature( df, column_name ):
    unique_values = set( df[column_name].tolist() )
    transformer_dict = {}
    for ii, value in enumerate(unique_values):
        transformer_dict[value] = ii

    def label_map(y):
        return transformer_dict[y]
    df[column_name] = df[column_name].apply( label_map )
    return df

# list of column names indicating which columns to transform;
# this is just a start! Use some of the print( features_df.head() )
# output upstream to help you decide which columns get this
# treatment
names_of_columns_to_transform = ["funder", "installer", "wpt_name", "basin", "subvillage",
                                 "region", "lga", "ward", "public_meeting", "recorded_by",
                                 "scheme_management", "scheme_name", "permit",
                                 "extraction_type", "extraction_type_group",
                                 "management", "management_group",
                                 "payment", "payment_type",
                                 "water_quality", "quality_group", "quantity", "quantity_group",
                                 "source", "source_type", "source_class",
                                 "waterpoint_type", "waterpoint_type_group"]
for column in names_of_columns_to_transform:
    features_df = transform_feature( features_df, column )

print( features_df.head() )

# remove the "date_recorded" column--we're not going to make use
# of time-series data today
features_df.drop("date_recorded", axis=1, inplace=True)


      amount_tsh date_recorded  funder  gps_height  installer  longitude  \
id
69572       6000       3/14/11     614        1390        685  34.938093
8776           0        3/6/13    1206        1399       1252  34.698766
34310         25       2/25/13     878         686        555  37.460664
67743          0       1/28/13     920         263       2059  38.486161
19728          0       7/13/11     906           0        164  31.130847

        latitude  wpt_name  num_private  basin   ...   \
id                                               ...
69572  -9.856322     15454            0      5   ...
8776   -2.147466     19453            0      8   ...
34310  -3.821329     10040            0      0   ...
67743 -11.155298     19651            0      7   ...
19728  -1.825359      7904            0      8   ...

      payment_type  water_quality  quality_group  quantity  quantity_group  \
id
69572            0              2              3         4               4
8776             2              2              3         2               2
34310            6              2              3         4               4
67743            2              2              3         3               3
19728            2              2              3         1               1

      source  source_type  source_class  waterpoint_type  \
id
69572      3            0             2                3
8776       8            4             0                3
34310      7            3             0                2
67743      0            2             2                2
19728      8            4             0                3

      waterpoint_type_group
id
69572                     2
8776                      2
34310                     2
67743                     2
19728                     2

[5 rows x 39 columns]

Just a couple of final steps to get everything ready for sklearn: the features and labels come out of their dataframes and into a numpy.ndarray and a list, respectively.

X = features_df.as_matrix()
y = labels_df["status_group"].tolist()

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score(). This splits my data into three equal portions, trains on two of them, and tests on the third; the process repeats three times, with each portion taking a turn as the test set. That's why three numbers get printed in the code block below.

import sklearn.linear_model
import sklearn.cross_validation
clf = sklearn.linear_model.LogisticRegression()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )
[ 0.65363636  0.6569697   0.6560101 ]
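As an aside, in newer scikit-learn versions cross_val_score() moved from sklearn.cross_validation to sklearn.model_selection; a self-contained sketch of the same check on a toy dataset, covering both locations:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
try:
    # newer scikit-learn
    from sklearn.model_selection import cross_val_score
except ImportError:
    # scikit-learn as of this writing
    from sklearn.cross_validation import cross_val_score

X_toy, y_toy = load_iris(return_X_y=True)

# cv=3: split into three folds, train on two, test on the held-out
# one, and rotate so each fold is the test set exactly once --
# which is why three scores come back
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=3)
print(len(scores))
```

The toy dataset swap is just so the snippet runs standalone; on the well data the call is exactly the one shown above.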

I now have a baseline logistic regression model for well failures. There's an assumption implicit in this model (and in the other classifiers below) that classification is the correct approach to take here. Classification is designed for unordered categorical tasks, like predicting whether my favorite ice cream flavor is chocolate, vanilla, or strawberry. Regression gives a continuous output, which implies a built-in ordering among the possible answers; an example would be predicting my age or my income. The task of predicting well failures could be modeled either way: it has discrete categories for answers (functional / functional needs repair / non functional), but there's also an ordering to those categories that a classifier isn't necessarily going to pick up on. So I have a choice between modeling with a classifier, and potentially getting slightly worse performance, or building a regression and adding a post-processing step that turns the continuous (i.e. float) predictions into integer category labels. I've decided to go with classification for this example, but this is a decision made for convenience that I could revisit when improving the model down the road.
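To make the regression alternative concrete, that post-processing step could look something like this sketch (continuous_to_label is a hypothetical helper, not part of this analysis): round the continuous prediction to the nearest integer label and clip it into the valid range, 0 to 2 here.

```python
import numpy as np

def continuous_to_label(predictions, low=0, high=2):
    """Hypothetical post-processing for the regression route:
    round regression outputs to the nearest integer label,
    clipping anything outside the valid range."""
    return np.clip(np.rint(predictions), low, high).astype(int)

preds = np.array([-0.3, 0.6, 1.4, 2.7])
print(continuous_to_label(preds))  # [0 1 1 2]
```

Note how -0.3 and 2.7 get pulled back into range: the rounding alone isn't enough, because a regression is free to predict outside the label set.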

I started with a simple logistic regression above (despite the name, this is a classification algorithm) and now I’ll compare to a couple of other classifiers, a decision tree classifier and a random forest classifier, to see which one seems to do the best.

import sklearn.tree
import sklearn.ensemble
clf = sklearn.tree.DecisionTreeClassifier()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )
clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )
[ 0.73590909  0.73691919  0.73005051]
[ 0.78777778  0.7889899   0.78409091]

And the winner appears to be the random forest. That's not really a surprise, but you'll have to wait for the next post to learn why the random forest is such a strong algorithm.

This brings me to the end of the “getting started” portion of this analysis. I now have a working data science setup, in which I have:

  • read in data
  • transformed features and labels to make the data amenable to machine learning
  • picked a modeling strategy (classification)
  • made a train/test split (this was done implicitly when I called cross_val_score)
  • evaluated several models for identifying wells that are failed or in danger of failing

In the next post I’ll clean up some of the technical debt that I’ve accrued by moving so quickly toward getting a model working.

The post Workflows in Python: Getting data ready to build models appeared first on Civis Analytics.
