Data Science on State of the Union Addresses: Obama (2016) vs. Obama (2015) vs. … vs. George Washington (1790)

January 15, 2016 Michael Heilman

Barack Obama recently gave his final State of the Union address, and since we’re interested in analyzing text data at Civis Analytics, I figured I ought to see if I could discover anything interesting. Rather than trying to understand the conversation on social media as we’ve done in previous work, I decided to take a somewhat longer view, comparing the text of this year’s speech to the texts of all of the previous addresses, starting with George Washington’s first address in 1790.

It turns out that one can use data science to get some pretty interesting insights out of State of the Union addresses with just some very simple text analysis methods.

  • There appear to be a few major turning points in the State of the Union timeline where there are large, lasting shifts in the language used. In particular, the period from 1815 until World War I and the period after World War II form coherent blocks of time during which addresses are similar to each other but dissimilar from other time periods.
  • Speeches near each other in time tend to be more similar, but the words that make them similar differ: for example, Barack Obama’s 2016 speech was similar to George W. Bush’s because of discussion of terrorism, but it was similar to Clinton’s speeches because of discussion about jobs, young people, and technology.
  • The salient topics in Barack Obama’s addresses are jobs, kids, college, clean energy, terrorism, Iraq, and Afghanistan.
  • Looking at a few key topics from the past 40 years, we see that Bill Clinton spoke a lot about kids and families compared to other recent presidents, George W. Bush spoke a lot about terrorism, and Barack Obama spoke a lot about jobs and businesses. Also, mentions of oil and energy fell off after Jimmy Carter’s addresses but have increased in the past few years.

Here’s a quick visualization for that last point. I’ll explain it in more detail later.
SOTU trends plot

In the rest of the post, I’ll explain a simple method for computing similarities between addresses before presenting a broad historical overview with analyses of topical and linguistic change over time. I’ll then focus in on Barack Obama’s addresses and how they related to some trends in the past 40 years, before offering a few parting thoughts.

Computing similarities between addresses

A lot of this post is about comparing different State of the Union addresses to
see how similar they are. Here’s the (relatively simple) text analysis methodology for doing that.
I took the data set of State of the Union address texts and performed the following steps:

  • split each address into sentences, and each sentence into words;
  • combined the list of sentences for each address into a bag of words;
  • removed stopwords (e.g., “the”, “a”, “an”), very rare words, numbers, and punctuation;
  • created a sparse matrix with the word counts for each speech (number of addresses by number of words);
  • weighted the words using log entropy weighting;
  • and finally computed the cosine similarity between each pair of address row vectors.
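
The steps above can be sketched in plain Python. This is only a minimal illustration, not the code used for this post: the regex tokenizer, the tiny stopword list, and all function names are assumptions, and the sketch skips sentence splitting and rare-word filtering. The weighting follows the common log entropy formulation (local weight log(1 + tf), global weight 1 + Σ p·log p / log N, where p is a word's share of its total count that falls in a given address and N is the number of addresses).

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def bag_of_words(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def log_entropy_vectors(bags):
    """Weight each count by log(1 + tf) times a per-word global entropy weight."""
    n_docs = len(bags)
    if n_docs < 2:
        raise ValueError("log entropy weighting needs at least two documents")
    totals = Counter()
    for bag in bags:
        totals.update(bag)
    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n_docs), p_ij = tf_ij / gf_i.
    # Words spread evenly over all documents get a weight near 0.
    global_weight = {}
    for word, total in totals.items():
        entropy = sum(
            (bag[word] / total) * math.log(bag[word] / total)
            for bag in bags
            if bag[word] > 0
        )
        global_weight[word] = 1.0 + entropy / math.log(n_docs)
    return [
        {w: math.log(1 + tf) * global_weight[w] for w, tf in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(texts):
    """Pairwise cosine similarities between log entropy-weighted addresses."""
    vectors = log_entropy_vectors([bag_of_words(t) for t in texts])
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

Because cosine similarity divides by the vector norms, long and short addresses end up on the same scale, which is one simple way to normalize for length.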

This results in a matrix of similarity scores between addresses, which can be
visualized as follows.


In this similarity matrix visualization, each address corresponds to a row and a column. The similarity between address i and address j is shown at the cell at row i and column j (or row j and column i). Darker cells indicate higher similarities, with the diagonal being maximally dark because it corresponds to an address’s similarity to itself. Note that the cells above the diagonal are a reflection of those below it. This makes it easier to compare a single address across time by looking across a single row or column.
Click here for a version with a label for each address.

Looking across the history of the State of the Union

There are probably many observations to be made about the similarity matrix above, but the thing that seems most salient is that there are large blocks of similar speeches, perhaps representing important eras of American history: 1815 to 1912 (pre-WWI), 1923 to 1932 (Coolidge and Hoover), 1946 to 2016 (post-WWII). (Note: these blocks probably also to some extent represent changes in dialect, both as American English changed over time and since the State of the Union changed from being frequently written to being delivered orally.)

The pre-1815 speeches varied quite a lot from each other, though each presidency makes up a little block. Jefferson’s addresses in particular form a dark block of similar cells along the diagonal in the upper left of the plot.

Focusing on post-WWII addresses, the post-Reagan addresses appear to make up a block, perhaps because they shift away from talking about international issues such as the Cold War and focus a lot on jobs and families. (I’ll try to interpret this shift a bit more below.)

There are also some outliers in the above plots that are interesting to explore. For example, Bush’s post-9/11 speech, largely about terrorism, is dissimilar to everything except Bush’s subsequent speeches. Carter’s 1981 address is the longest address at over 33,000 words, many times longer than most speeches since 1900 (e.g., Barack Obama’s 2016 address had about 5,200 words). The 1981 address was written rather than spoken (as were many State of the Union addresses in America’s early years), and though I normalized for the length of speeches in our analyses, its extreme length probably resulted in partial overlap with a lot of other addresses.

We can also “zoom in” by restricting the matrix plot to only include addresses from a particular time period. Note that this causes the mapping from similarity scores to colors to change a bit because the general level of similarity is a bit higher for smaller time periods. This may allow finer-grained distinctions to be made. In the plot below, we’ll zoom in on the post-WWII period.
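
As a rough sketch of what “zooming in” involves, the snippet below slices a similarity matrix down to a range of years and recomputes the range of values that the color map has to cover. The function names are hypothetical, and taking the color limits from the off-diagonal cells is an assumption about how such a plot might be scaled, not a description of how the plots in this post were drawn.

```python
def zoom(matrix, years, start, end):
    """Keep only the rows and columns for addresses delivered in [start, end]."""
    keep = [i for i, year in enumerate(years) if start <= year <= end]
    sub = [[matrix[i][j] for j in keep] for i in keep]
    return sub, [years[i] for i in keep]


def color_limits(matrix):
    """Smallest and largest off-diagonal similarity, for rescaling the color map."""
    off_diagonal = [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j
    ]
    if not off_diagonal:
        raise ValueError("need at least two addresses to set color limits")
    return min(off_diagonal), max(off_diagonal)
```

Since addresses within a shorter period tend to be more alike, the lower limit usually rises after zooming, which is why the same similarity score can map to a different shade in the zoomed plot.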

SOTU Post-WWII Matrix

Examining language and topical change over time

To help better understand the structure of the matrix visualization, I computed the mean of the log entropy scores for each word during various time periods (e.g., pre-WWI). I then ranked words for several time periods in an attempt to get the most salient or interesting words for those time periods. For lack of a better term, we’ll call these the most “salient” words.
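
Here is a minimal sketch of that ranking. It assumes each address is represented as a dict mapping words to log entropy weights, and that words are ranked simply by their mean weight within the period; the exact ranking used for the tables in this post may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def salient_words(vectors, years, start, end, top_n=20):
    """Rank words by their mean weight over the addresses in [start, end]."""
    selected = [vec for vec, year in zip(vectors, years) if start <= year <= end]
    if not selected:
        return []
    sums = defaultdict(float)
    for vec in selected:
        for word, weight in vec.items():
            sums[word] += weight
    # Words missing from an address count as weight 0 in the mean.
    means = {word: total / len(selected) for word, total in sums.items()}
    # Sort by descending mean weight, breaking ties alphabetically.
    ranked = sorted(means, key=lambda w: (-means[w], w))
    return ranked[:top_n]
```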

For example, there is a stark contrast in the salient words before and after Franklin Delano Roosevelt’s (FDR) presidency, which spanned some of the most difficult years the nation has faced because of the Great Depression and WWII. There appears to be a turning point in the language of addresses around WWI or WWII. Much of the language that shows up as particular to the period of time prior to this turning point pertains to the growth of the country, its relationship with colonial powers in Europe, treaties, territories, etc. During and following FDR’s presidency, the language shifts to focus on government programs, jobs, the economy, the Cold War, energy. There also appears to be a larger focus on statistics (e.g., “millions” and “billions” show up as salient words), which at a glance appears related to increased discussion about jobs and government revenues and expenditures.

Salient words for addresses by era

| Pre-FDR period (1790 – 1932) | FDR’s presidency (1934 – 1945) | Post-FDR period (1946 – 2016) |
| --- | --- | --- |
| mexico | program | program |
| states | economic | programs |
| constitution | objectives | tonight |
| subject | today | billion |
| treasury | democratic | jobs |
| treaty | democracy | million |
| united | production | help |
| public | problems | americans |
| spain | thinking | budget |
| territory | recovery | soviet |
| government | groups | economic |
| general | dollars | percent |
| law | men | nuclear |
| department | million | america |
| commerce | fighting | today |
| war | japanese | dollars |
| duties | overwhelming | spending |
| vessels | planes | health |
| officers | unemployed | energy |
| expenditures | world | job |

Barack Obama’s State of the Union addresses

We can also zoom in on just President Obama’s speeches.

SOTU Obama Matrix

Obama’s addresses are all relatively similar to each other. However, not surprisingly, similarity is generally highest between addresses in consecutive years. Looking at the words that are strongly associated with Obama’s addresses, we see a focus on jobs, kids, college, clean energy, terrorism, Iraq, and Afghanistan. Comparing the top words for his first term and second term, the most notable thing seems to be a shift away from talking about Iraq and toward talking about terrorism and Afghanistan. (Note that “al” and “qaeda” show up as different words because I didn’t do any detection of multiword expressions.)

Focusing just on Obama’s 2016 speech, note the use of the word “voices”, which appeared 10 times in singular or plural form (e.g., “democracy breaks down when the average person feels their voice doesn’t matter”). The string “voice” only appears 94 times total in the other 229 addresses.

Salient words for Obama’s addresses

| Obama’s presidency (2009 – 2016) | Obama’s first term (2009 – 2012) | Obama’s second term (2013 – 2016) | Obama’s final address (2016) |
| --- | --- | --- | --- |
| jobs | jobs | kids | voices |
| businesses | businesses | jobs | hardworking |
| kids | college | businesses | kids |
| tonight | innovation | tonight | syria |
| college | clean | folks | qaeda |
| solar | tonight | networks | planet |
| oil | iraq | oil | everybody |
| republicans | trillion | terrorists | got |
| democrats | energy | job | harder |
| energy | solar | college | terrorists |
| job | kids | student | incredible |
| iraq | republicans | got | al |
| innovation | democrats | afghanistan | isolating |
| afghanistan | tuition | workforce | big |
| al | recession | republicans | terrorist |
| qaeda | lending | qaeda | student |
| americans | breaks | democrats | tougher |
| folks | deficit | solar | muster |
| student | oil | al | entrepreneur |
| terrorists | million | iran | immigrant |

We can also look at which specific words make Obama’s 2016 speech similar to previous speeches. To do this, I took the log entropy-weighted word vector for the 2016 speech and computed the elementwise product with the vectors for each previous speech, respectively. I then found the words for each speech with the largest magnitude for that product. These are essentially the salient or interesting words that overlapped between the 2016 speech and the previous speech. To avoid information overload (if we aren’t there already), the table below just shows the results going back to President Jimmy Carter’s addresses, and just the top three overlapping salient words. One thing to note is that similarity generally decreases as we go back in time, as can be seen in the similarity matrices plotted above (e.g., the 2016 speech was most similar to Obama’s other speeches as well as the other speeches in the last 20 years or so).
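
The elementwise product described above can be sketched as follows. The sketch assumes each speech is a dict mapping words to log entropy weights; the function name is hypothetical.

```python
def top_overlapping_words(target, other, top_n=3):
    """Words with the largest elementwise product between two weighted vectors."""
    # Only words present in both speeches can have a nonzero product.
    products = {
        word: weight * other[word]
        for word, weight in target.items()
        if word in other
    }
    # Sort by descending magnitude of the product, breaking ties alphabetically.
    ranked = sorted(products, key=lambda w: (-abs(products[w]), w))
    return ranked[:top_n]
```

The sum of these products is the dot product that forms the numerator of the cosine similarity, so the top-ranked words are the ones that contribute most to the similarity score between the two speeches.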

| Address | Top overlapping words with 2016 |
| --- | --- |
| Carter 1978 | jobs, oil, hardworking |
| Carter 1979 | tonight, jobs, commitment |
| Carter 1980 | oil, iran, afghanistan |
| Carter 1981 | oil, sector, solar |
| Reagan 1982 | voices, jobs, sits |
| Reagan 1983 | jobs, job, sector |
| Reagan 1984 | voices, tougher, tonight |
| Reagan 1985 | jobs, pushing, tonight |
| Reagan 1986 | tonight, planet, commitment |
| Reagan 1987 | syria, tonight, kids |
| Reagan 1988 | fighters, tonight, talk |
| Bush 1989 | tonight, voices, kids |
| Bush 1990 | kids, tonight, got |
| Bush 1991 | voices, iraq, tonight |
| Bush 1992 | big, jobs, tonight |
| Clinton 1993 | got, jobs, cuts |
| Clinton 1994 | kids, everybody, got |
| Clinton 1995 | voices, kids, got |
| Clinton 1996 | harder, businesses, voices |
| Clinton 1997 | internet, tonight, college |
| Clinton 1998 | got, internet, college |
| Clinton 1999 | tonight, computer, iraq |
| Clinton 2000 | internet, big, college |
| Bush 2001 | tonight, big, energy |
| Bush 2001 #2 | terrorists, terrorist, tonight |
| Bush 2002 | terrorist, terrorists, coalition |
| Bush 2003 | al, terrorist, terrorists |
| Bush 2004 | terrorists, iraq, terrorist |
| Bush 2005 | terrorists, iraq, got |
| Bush 2006 | terrorist, qaeda, terrorists |
| Bush 2007 | qaeda, al, terrorists |
| Bush 2008 | qaeda, al, terrorists |
| Obama 2009 | jobs, college, businesses |
| Obama 2010 | kids, businesses, jobs |
| Obama 2011 | internet, kids, qaeda |
| Obama 2012 | kids, jobs, got |
| Obama 2013 | qaeda, kids, al |
| Obama 2014 | kids, businesses, jobs |
| Obama 2015 | kids, terrorists, networks |

This suggests that the similarity between Obama’s 2016 speech and his previous speeches was because of his discussion of kids, college, jobs, and terrorism. Its (more moderate) similarity to George W. Bush’s addresses was due to discussion of terrorism, whereas the similarity to Bill Clinton’s addresses was due to college, jobs, and technology. We can even go back and compare Obama’s 2016 speech to Carter’s speeches. Though there is less similarity there compared to more recent speeches, there is some interesting overlap in discussion about energy and about Iran and Afghanistan. Across this time period, we also see some overlap in the use of modern colloquial language (e.g., “got”, as in Obama’s 2016 statement that “we’ve actually got to cut the cost of college”).

Recent trends in selected topics

We can also find some interesting trends in the discussion of particular topics.
From looking at the tables of words above as well as the words that were salient
in addresses from the past 40 years, a few topics jumped
out at me, so I decided to take a closer look by plotting topical frequency
over time.

Individual words are rare, and so plots of word frequencies can
show a lot of variance, but grouping related words together can give us a clearer picture.
While unsupervised learning techniques such as topic modeling
can be used to automatically find groups of words, here I decided for simplicity
to manually assemble small groups of closely related words, as follows.

| Group label | Words |
| --- | --- |
| families | kid(s), parent(s), child(ren), family, families |
| terrorism | terror, terrorism, terrorist(s) |
| jobs | job(s), business(es), worker(s) |
| energy | oil, gas, solar, energy, coal, petroleum, fuel(s) |

The results of these analyses are the plots at the beginning of the post.
They show the percentage of the total number of words in each
address that belong to each topical group.
Note that this analysis doesn’t use the log entropy weighting or stopword removal described above.
Please also note that a lot has happened
since Jimmy Carter became president, and so for brevity I’m going to omit some
really important trends (e.g., the end of the Cold War).
The plots also take the words out of context, and
in an effort to keep them focused and avoid polysemy, I have omitted potentially
related words (e.g., the word “power” could be put in the energy-related topic,
but it would also include the sense of “power” related to influence).
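
As a minimal sketch of the calculation behind those plots, the snippet below computes, for one address, the percentage of all tokens that belong to each group. The word groups come from the table above with the “(s)” variants expanded; the simple regex tokenizer and the function name are assumptions, so exact percentages would differ somewhat from the ones plotted.

```python
import re

# Word groups taken from the table above, with the "(s)" variants expanded.
TOPICS = {
    "families": {"kid", "kids", "parent", "parents", "child", "children",
                 "family", "families"},
    "terrorism": {"terror", "terrorism", "terrorist", "terrorists"},
    "jobs": {"job", "jobs", "business", "businesses", "worker", "workers"},
    "energy": {"oil", "gas", "solar", "energy", "coal", "petroleum",
               "fuel", "fuels"},
}


def topic_percentages(text):
    """Percentage of all tokens in an address that fall into each topic group."""
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {topic: 0.0 for topic in TOPICS}
    return {
        topic: 100.0 * sum(token in words for token in tokens) / len(tokens)
        for topic, words in TOPICS.items()
    }
```

Stopwords stay in the denominator here, matching the note above that this analysis skips stopword removal.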

Despite the shortcomings of the simple methodology here, we see some interesting
trends. Each topic has a fairly distinct peak during one of the presidencies
on the timeline. Bill Clinton devoted a relatively large fraction of his speeches to
the “families” topic compared other presidents. George W. Bush spoke a lot about
terrorism, and Barack Obama spoke the most about the jobs topic.

The energy topic in particular shows an interesting trend: a few of Carter’s addresses focused
a lot on energy (e.g., due to the oil crisis), and then it was mentioned less
frequently for many years until Obama’s recent addresses.
We also saw this trend to some extent in the previous section, where the salient words
that led to similarity between Obama’s 2016 address and Carter’s addresses included
the words “oil” and “solar”.

Parting Thoughts

In this post, I’ve presented some pretty simple but hopefully compelling (hey, you read this far) analyses of State of the Union addresses. The State of the Union addresses represent the president’s outlook on the past, present, and future of America, and historical analyses provide us with a glimpse of how the country is evolving. The analyses also help to put President Obama’s recent address in a broader context.

Of course, while this is super interesting, an analysis of political addresses over many scores of years isn’t particularly “actionable” (e.g., “Mr. President, other presidents who mentioned jobs and families were also interested in this product”). However, similar historical analyses can be performed on other text data, which may help us gain insights about other types of conversations (e.g., analyses of tweets over days or hours instead of years, or analyses of customer feedback over time, etc.) and thereby help us make practical decisions. Each new data set brings new challenges (especially when there is text), and whether the data involves the future of America or a more practical goal, we at Civis Analytics are interested in using the latest and greatest data science methods for modeling, visualization, etc. to address those challenges.

Update (Jan. 25, 2016): A similar, recently published analysis by Rule, Cointet, and Bearman (2015) was brought to our attention after we released this post. If you’re interested in a really nice deep dive into the history of the State of the Union, please check it out.


  • The corpus of addresses was adapted from this site, whose author gathered the texts from Project Gutenberg and updated them with more recent addresses from the White House and various news sources, as discussed here. I added the 2016 address text from the White House website.
  • A few addresses to Congress with different titles (e.g., “Address on Administration Goals”, “Address to Joint Session of Congress”) were included in these analyses.
  • It is likely that there are many other such analyses of State of the Union addresses in the academic literature and elsewhere (e.g., here, and here). I apologize if I am missing references to related work. We’ve come across a few other similar analyses since the initial release of this post: this one from 2013, this one from early 2016, this 2016 Washington Post infographic, and the work of Rule, Cointet, and Bearman mentioned above.
  • In writing this post, as in just about all of my work, I relied heavily on open source projects. The main ones I used here were the SciPy stack, gensim, seaborn, matplotlib, and segtok.

The post Data Science on State of the Union Addresses: Obama (2016) vs. Obama (2015) vs. … vs. George Washington (1790) appeared first on Civis Analytics.
