Workflows in Python: Using Pipeline and GridSearchCV for More Compact and Comprehensive Code

January 6, 2016 Katie Malone

The last two posts in this series have been about getting a data science analysis quickly up and running, and then circling back to improve it or understand the patterns I find, for example, which algorithms are working best and why. The upshot was a better handle on my workflow, but I’m left with a lot of free parameters of my algorithms to tune, and messing around with my workflow often leads to spaghetti code that becomes less and less understandable/easy to experiment with as I go. Enter the scikit-learn Pipeline and GridSearchCV objects: two tools that effectively allow me to pour gasoline on my data science fire, tightening up the code and doing parameter scans in just a few lines of code.

First up is Pipeline. There are a number of tools that I’ve chained together to get where I am now, like SelectKBest and RandomForestClassifier. After selecting the 100 best features, the natural next step is to run my random forest again to see if it does a little better with fewer features. In this case, I have SelectKBest doing selection, with the output of that process going straight into a classifier. Pipeline packages the transformation step of SelectKBest with the estimation step of RandomForestClassifier into a coherent workflow.

Why might I want to use Pipeline instead of keeping the steps separate?

  • It makes code more readable (or, if you like, it makes the intent of the code clearer).
  • I don’t have to worry about keeping track of data during intermediate steps, for example between transforming and estimating.
  • It makes it trivial to move ordering of the pipeline pieces, or to swap pieces in and out.
  • It allows you to do GridSearchCV on your workflow.

This last point is, in my opinion, the most important. I will get to that point very soon, but first I’ll get a Pipeline up and running that does SelectKBest followed by RandomForestClassifier.

import sklearn.pipeline
import sklearn.feature_selection
import sklearn.ensemble
import sklearn.metrics
import sklearn.cross_validation

select = sklearn.feature_selection.SelectKBest(k=100)
clf = sklearn.ensemble.RandomForestClassifier()
steps = [('feature_selection', select),
         ('random_forest', clf)]
pipeline = sklearn.pipeline.Pipeline(steps)
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=0.33, random_state=42)

### fit your pipeline on X_train and y_train
pipeline.fit( X_train, y_train )

### call pipeline.predict() on your X_test data to make a set of test predictions
y_prediction = pipeline.predict( X_test )

### test your predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )

### and print the report
print(report)

             precision    recall  f1-score   support

          0       0.80      0.56      0.66      5007
          1       0.12      0.00      0.01       942
          2       0.69      0.92      0.79      7119

avg / total       0.69      0.72      0.68     13068

I make a list of steps, each of which is a transformer (like SelectKBest) or, for the last one in the list, an estimator (RandomForestClassifier), and then turn that list into a Pipeline. The Pipeline is then a single coherent workflow, with the transformed data from SelectKBest being seamlessly passed along to the RandomForestClassifier. Depending on exactly what I want to do in a given case, I could have many transformers strung together, with or without an estimator at the end.
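To make that last point concrete, here is a minimal sketch of a longer chain, with a hypothetical scaling step added in front of the feature selection (the step names, toy data, and parameter values below are illustrative, not from the analysis above). In current scikit-learn versions the splitting helpers live in sklearn.model_selection rather than sklearn.cross_validation, but the Pipeline idea is unchanged:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

# A hypothetical three-step workflow: two transformers, then an estimator.
steps = [('scale', StandardScaler()),
         ('feature_selection', SelectKBest(k=5)),
         ('random_forest', RandomForestClassifier(random_state=42))]
pipeline = Pipeline(steps)

# Toy data, just to show the whole chain runs end to end.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
y = (X[:, 0] > 0.5).astype(int)

# fit() pushes the data through both transformers before the forest sees it;
# predict() applies the same (already-fitted) transforms automatically.
pipeline.fit(X, y)
predictions = pipeline.predict(X[:5])
```

Because each intermediate step is fit and applied inside the Pipeline, there are no loose `X_selected`-style variables to keep track of between steps.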

By the way, I’ve slightly changed the way that I am evaluating my model, using classification_report. It gives me more information than cross_val_score, which I was using before, although it’s a little more involved to use (I am responsible for doing the training/testing split now, whereas cross_val_score did that automatically).
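For comparison, here is a sketch of the cross_val_score approach on a Pipeline, using toy data; note that in current scikit-learn releases it lives in sklearn.model_selection (it was in sklearn.cross_validation when this post was written):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('feature_selection', SelectKBest(k=5)),
                     ('random_forest', RandomForestClassifier(random_state=42))])

# Toy data standing in for the real features and labels.
rng = np.random.RandomState(0)
X = rng.rand(120, 20)
y = (X[:, 0] > 0.5).astype(int)

# cross_val_score does the train/test splitting internally and returns one
# accuracy number per fold -- convenient, but less detail than the
# per-class precision/recall breakdown in classification_report.
scores = cross_val_score(pipeline, X, y, cv=3)
```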

Now to GridSearchCV. When I decided to select the 100 best features, setting that number to 100 was kind of a hand-wavey decision. Similarly, the RandomForestClassifier that I’m using right now has all its parameters set to their default values, which might not be optimal.

So, a straightforward thing to do now is to try different values of k (the number of features being used in the model) and any RandomForestClassifier parameters I want to tune (for the sake of concreteness, I’ll play with n_estimators and min_samples_split). Trying lots of values for each of these free parameters is tedious, and there can sometimes be interactions between the choices I make in one step and the optimal value for a downstream step. In other words, to avoid local optima, I should try all the combinations of parameters, and not just vary them independently. If I want to try 5 different values each for k, n_estimators and min_samples_split, that means 5 x 5 x 5 = 125 different combinations to try. Not something I want to do by hand.
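As a sanity check on that arithmetic, scikit-learn’s ParameterGrid (found in sklearn.model_selection in current releases) will multiply out a parameter dictionary into its combinations; the candidate values below are made up for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Five candidate values for each of the three free parameters...
grid = ParameterGrid(dict(
    feature_selection__k=[20, 50, 100, 200, 500],
    random_forest__n_estimators=[10, 25, 50, 100, 200],
    random_forest__min_samples_split=[2, 3, 4, 5, 10],
))

# ...multiply out to 5 x 5 x 5 = 125 combinations to try.
n_combinations = len(grid)
```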

GridSearchCV allows me to construct a grid of all the combinations of parameters, tries each combination, and then reports back the best combination/model.

import sklearn.grid_search

parameters = dict(feature_selection__k=[100, 200],
                  random_forest__n_estimators=[50, 100, 200],
                  random_forest__min_samples_split=[2, 3, 4, 5, 10])

cv = sklearn.grid_search.GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_predictions = cv.predict(X_test)
report = sklearn.metrics.classification_report( y_test, y_predictions )

GridSearchCV seems a little scary at first, because the parameter grid is easy to mess up. There’s a particular convention being followed in the way that the parameters are named in the parameters dictionary; I need to have the name of the Pipeline step (e.g. feature_selection, not select; or random_forest, not clf), followed by two underscores, followed by the name of the parameter (in sklearn parlance) that I want to vary. To put this all together in a painfully simple example:

clf = RandomForestClassifier()
steps = [("my_classifier", clf)]
### "my_classifier" is the name of the random forest classifier in the steps list;
### min_samples_split is the associated sklearn parameter that I want to vary
parameters = dict(my_classifier__min_samples_split=[2, 3, 4, 5])
pipe = Pipeline(steps)
cv = GridSearchCV(pipe, param_grid=parameters)
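One way to double-check those names before fitting: every grid-searchable parameter of a Pipeline shows up in its get_params() dictionary, already spelled in the step__parameter form. A quick sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([("my_classifier", RandomForestClassifier())])

# get_params() lists every tunable name in step__parameter form, so a typo
# in the parameter dictionary can be caught before an expensive grid search.
param_names = sorted(pipe.get_params().keys())
matches = [n for n in param_names if "min_samples_split" in n]
```

Here `matches` contains `my_classifier__min_samples_split`, exactly the key the parameter dictionary needs.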

But once I’ve got the parameter grid set up properly, the power of GridSearchCV is that it multiplies out all the combinations of parameters and tries each one, making a 3-fold cross-validated model for each combination. Then I can ask for predictions from my GridSearchCV object and it will automatically return to me the “best” set of predictions (that is, the predictions from the best model that it tried), or I can explicitly ask for the best model/best parameters using methods associated with GridSearchCV. Of course, trying tons of models can be kind of time-consuming, but the outcome is a much better understanding of how my model performance depends on parameters.
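Those "explicit" methods are attributes set after fitting: best_params_, best_score_, and best_estimator_. A small self-contained sketch on toy data (using the current sklearn.model_selection import path; the grid values here are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('feature_selection', SelectKBest()),
                     ('random_forest', RandomForestClassifier(random_state=42))])
parameters = dict(feature_selection__k=[2, 5],
                  random_forest__n_estimators=[10, 25])

# Toy data standing in for the real problem.
rng = np.random.RandomState(0)
X = rng.rand(90, 10)
y = (X[:, 0] > 0.5).astype(int)

cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X, y)

# The winning parameter combination and its mean cross-validated score...
best_params = cv.best_params_
best_score = cv.best_score_
# ...and predict() automatically routes through the refit best model,
# which is also available directly as cv.best_estimator_.
predictions = cv.predict(X[:3])
```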

I should also mention that I can use GridSearchCV on just a single object, rather than a full Pipeline. For example, I can optimize SelectKBest or the RandomForestClassifier on its own and that will work just fine. But since there can sometimes be interactions between various steps in the analysis, being able to optimize over the full Pipeline is really useful. It’s also trickier to do, which makes it a good example for teaching. Last, GridSearchCV will automatically cross-validate all steps of the analysis, such as the feature selection: it’s not just the final algorithm that should be cross-validated, but the upstream transforms as well!
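When searching over a bare estimator like this, the parameter names simply drop the step__ prefix. A minimal sketch on toy data (current import path; grid values illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real features and labels.
rng = np.random.RandomState(0)
X = rng.rand(90, 10)
y = (X[:, 0] > 0.5).astype(int)

# No Pipeline here, so the parameter names carry no "step__" prefix --
# they are just the estimator's own parameter names.
parameters = dict(n_estimators=[10, 25], min_samples_split=[2, 4])
cv = GridSearchCV(RandomForestClassifier(random_state=42),
                  param_grid=parameters, cv=3)
cv.fit(X, y)
best_params = cv.best_params_
```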

This brings me to the end of this series about end-to-end data analysis in scikit-learn and pandas. My goal in these posts is not to show a perfect analysis, or even one that demonstrates all the steps one might try, but instead to focus on the process. If I can get something up and running quickly, even if it’s imperfect, I’m in a much better position to understand later on how much my refinements are indeed improving the analysis. At the same time, there are definitely best practices and tools (like Pipeline and GridSearchCV) that will make my life much easier as my work expands. Having a great set of tools in the Python data science stack, and knowing when and how to deploy them, leaves me free to spend my time and energy on the most interesting, important, and difficult-to-automate tasks, like trying to find the uninsured.

The post Workflows in Python: Using Pipeline and GridSearchCV for More Compact and Comprehensive Code appeared first on Civis Analytics.
