Data Storytelling and Feature Creation

December 9, 2015 Civis Analytics

Employees at Civis have many interests; three at the top of the list are data, Chicago and biking to work. At the intersection of those three is the Divvy dataset, which details some of the rider information for the Chicago bicycle-sharing program. It is a fun dataset to play around with, even if it has a limited number of fields. It gives basic demographics (sex and age) for Divvy subscribers, as well as details about all trips made on Divvy bikes, including date and departure location. From this dataset, it is relatively easy to extract some simple information, such as the average age of a Divvy subscriber (35.7), or to learn that male riders outnumber female riders more than three to one. However, a common task at Civis Analytics is creating additional features from simple data. So let’s start digging into the data and see if there is anything else interesting.

I usually use Divvy to get to and from work, but is this how most other people use Divvy? To begin, let’s look at departure times from each station. Limiting the data to just weekdays, we can count how many rides started in each hour of the day. As expected, there are spikes around the AM and PM rush hours (7-9 AM and 4-6 PM).
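As a rough illustration of the weekday-by-hour count described above, here is a minimal sketch using only the Python standard library. The sample timestamps, their format, and the function name `weekday_rides_by_hour` are assumptions made for the example; they are not taken from the Divvy files or from the code behind this post.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trip start times; the real data has one row per trip.
start_times = [
    "2015-06-01 07:45",  # Monday
    "2015-06-01 08:10",  # Monday
    "2015-06-01 17:05",  # Monday
    "2015-06-02 08:30",  # Tuesday
    "2015-06-06 11:00",  # Saturday, dropped by the weekday filter
]


def weekday_rides_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count rides per starting hour, keeping Monday through Friday only."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, fmt)
        if dt.weekday() < 5:  # weekday() is 0 for Monday, 5 and 6 for the weekend
            counts[dt.hour] += 1
    return counts


hourly = weekday_rides_by_hour(start_times)
for hour in sorted(hourly):
    print(f"{hour:02d}:00  {hourly[hour]} rides")
```

Plotting these counts against the hour of day is what would produce the rush-hour spikes.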

Total Divvy Rides

That makes sense, so let’s move to the stations themselves. If we map the number of rides to each station, we can see which stations are most popular, measured as rides per dock over the first half of the year. As you can see, the stations at the center of the city get the most traffic, but no real patterns are emerging yet.
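The rides-per-dock measure is a simple normalization, sketched below. The station names and counts are invented placeholders, and `rides_per_dock` is a hypothetical helper rather than anything from the original analysis.

```python
# Placeholder inputs: total rides and dock counts per station (not real Divvy figures).
ride_counts = {"Station A": 1200, "Station B": 300, "Station C": 450}
dock_counts = {"Station A": 30, "Station B": 15, "Station C": 15}


def rides_per_dock(rides, docks):
    """Divide each station's ride count by its dock count, skipping stations with no docks."""
    return {
        station: rides.get(station, 0) / n_docks
        for station, n_docks in docks.items()
        if n_docks > 0
    }


popularity = rides_per_dock(ride_counts, dock_counts)
# List stations from most to least popular.
for station, value in sorted(popularity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{station}: {value:.1f} rides per dock")
```

Dividing by docks keeps large stations from looking popular purely because they hold more bikes.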

Total Divvy Rides on map

Let’s go back to the rush-hour element of this problem and tweak the map to show the percentage of each station’s rides that start during the AM rush and during the PM rush. With this view, certain patterns start to emerge. But what we really want to see is which stations people leave from on their way to work and which stations they arrive at. Let’s subtract the AM rush-hour percentage from the PM rush-hour percentage to get a sense of whether a station serves as a destination or a departure point for the morning commute. For example, if a station has 40 percent of its rides starting during the PM rush and only 10 percent during the AM rush, the difference is 30 percentage points, which indicates the station is predominantly a place where people arrive for work. You can see these three views in the animation below:
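The rush-hour feature can be sketched as follows. This is an illustration under stated assumptions: the rush windows are taken to be the hours starting at 7 and 8 (AM) and at 16 and 17 (PM), and the function and key names are hypothetical.

```python
AM_RUSH = range(7, 9)    # departures from 7:00 to 8:59 (assumed window)
PM_RUSH = range(16, 18)  # departures from 16:00 to 17:59 (assumed window)


def rush_hour_features(departure_hours):
    """Return AM share, PM share, and PM minus AM for one station's departures."""
    total = len(departure_hours)
    if total == 0:
        return None  # no rides, so the shares are undefined
    am_share = sum(h in AM_RUSH for h in departure_hours) / total
    pm_share = sum(h in PM_RUSH for h in departure_hours) / total
    return {
        "am_share": am_share,
        "pm_share": pm_share,
        "pm_minus_am": pm_share - am_share,
    }


# Mirrors the example in the text: 10 percent AM departures, 40 percent PM departures.
station_hours = [8, 16, 16, 17, 17, 11, 12, 13, 20, 21]
features = rush_hour_features(station_hours)
print({name: round(value, 2) for name, value in features.items()})
```

A clearly positive `pm_minus_am` suggests a station people leave from after work, meaning one they arrived at in the morning. A clearly negative value suggests a station closer to where riders live.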

Divvy over time

Now we get a sense of which locations people are arriving at to go to work (with minimal math!). As you might expect, the main concentration of Divvy arrivals is in the heart of downtown. However, we can also see a path of Divvy stations extending out of the center of the city that seem to be destinations as well. As one final step, we will overlay another dataset to figure out what might explain these points. Below is the same map with CTA 'L' stop locations from the Chicago Data Portal.
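One way to turn this kind of overlay into a model feature is to compute each Divvy station's distance to the nearest 'L' stop. The sketch below uses the haversine great-circle formula. The coordinates are made-up placeholders near downtown Chicago, not actual station or stop locations, and the helper names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_stop_km(station, stops):
    """Distance from one (lat, lon) station to the closest (lat, lon) stop."""
    return min(haversine_km(station[0], station[1], lat, lon) for lat, lon in stops)


# Placeholder coordinates for illustration only.
l_stops = [(41.880, -87.630), (41.910, -87.650)]
divvy_station = (41.881, -87.630)

print(f"{nearest_stop_km(divvy_station, l_stops):.2f} km to the nearest stop")
```

The resulting distance could sit alongside the rush-hour difference as an input to a predictive model.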

Divvy over time

Finally, we have a data story that makes sense. We now have a better understanding of the data and some additional features we can use for predictive modeling. Thinking a little about the data involved and extracting more from what is already available is the first step of effective modeling.

If you are interested in recreating these visualizations, you can find the code here.

Written by Dennis Hume

The post Data Storytelling and Feature Creation appeared first on Civis Analytics.
