Text Analytics with Yellowbrick
A Tutorial Using Twitter Data
By Yue Zhang
Text Analytics, also known as text mining, is the process of deriving information from text data. This process often involves parsing and reorganizing text input data, deriving patterns or trends from the restructured data, and interpreting the patterns to facilitate tasks, such as text categorization, machine learning, or sentiment analysis.
In this tutorial, we are going to perform text analytics on Twitter data, and explore two very useful text visualizers from Python's Yellowbrick package. We will further perform sentiment analysis on hourly tweets and investigate people's sentiment patterns throughout different hours of a day. Hopefully by the end of the tutorial, readers will feel comfortable using Yellowbrick's text visualization tools, and become more interested in performing text analytics!
Getting Started with Twitter
This tutorial assumes readers have experience creating a Twitter account and accessing Twitter's API to download tweets. Although we won't go over it in detail, for demonstration purposes, the following code snippet can be used to access live Twitter data. Simply fill in the consumer_key, consumer_secret, access_token, and access_token_secret placeholders with your own credentials.
from itertools import islice

import requests
import simplejson
from requests_oauthlib import OAuth1

# Fill in your own Twitter API credentials
consumer_key = 'xxx'
consumer_secret = 'xxx'
access_token = 'xxx'
access_token_secret = 'xxx'

auth = OAuth1(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

BOUNDING_BOX = "-125.00,24.94,-66.93,49.59"  # geo coordinates of USA

def tweet_generator():
    # open a streaming connection filtered to the bounding box
    stream = requests.post('https://stream.twitter.com/1.1/statuses/filter.json',
                           auth=auth,
                           stream=True,
                           data={"locations": BOUNDING_BOX})
    for line in stream.iter_lines():
        if not line:
            continue
        tweet = simplejson.loads(line)
        if 'text' in tweet:
            yield tweet['text']

# download tweets into a file named "tweets"
with open('tweets', 'a') as tweetfile:
    # stream 1000 tweets at a time
    for tweet in islice(tweet_generator(), 1000):
        tweetfile.write('{}|'.format(tweet.encode('utf-8')))
        tweetfile.write("\n")
For a more detailed tutorial on streaming data from Twitter into a MySQL database with Tweepy, see “Streaming Tweets from Twitter to Database”.
Text Visualizers in Yellowbrick
Yellowbrick is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, to produce visualizations for your models.
Yellowbrick's yellowbrick.text module contains two text visualizers:
- Token Frequency Distribution (FreqDistVisualizer): plots the frequency of tokens in a corpus.
- t-SNE Corpus Visualization (TSNEVisualizer): plots similar documents closer together to discover clusters.
Before starting the analysis, let's import necessary packages referenced in this tutorial:
%matplotlib notebook
import matplotlib
import yellowbrick
from yellowbrick.text import FreqDistVisualizer
from yellowbrick.text import TSNEVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
Token Frequency Distribution Visualizer
The first text analytics method we are going to walk through is token frequency distribution. Yellowbrick's Token Frequency Distribution Visualizer displays the frequency of each vocabulary item in the corpus as a bar chart. Besides counting single words, the visualizer can count any kind of observable event, such as bigrams, trigrams, and other combinations. It is a distribution because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. The visualizer creates the distribution plot after the corpus data has already been tokenized and vectorized.
Tokenization, in the context of text analytics and natural language processing, is the process of parsing the documents in our corpus into smaller, atomic elements which we recognize as words. During tokenization, certain data elements, such as punctuation and common English words (i.e. stop words), are discarded, ensuring that only nontrivial tokens are subject to the frequency distribution analysis.
Python's Scikit-Learn module has a built-in list of English stop words, which can be referenced by using CountVectorizer(stop_words='english'). To customize the stop word list for Twitter data, we could enhance the list with a few more common Twitter text terms (stop_words2) as demonstrated below, and input the user defined stop word list to CountVectorizer to perform tokenization.
from sklearn.feature_extraction import text
stop_words2 = text.ENGLISH_STOP_WORDS.union(['http', 'https'])
print(stop_words2)
frozenset(['all', 'show', 'anyway', 'fifty', 'four', 'go', 'mill', 'find', 'seemed', 'whose', 're', 'herself', 'whoever', 'behind', 'should', 'to', 'only', 'under', 'herein', 'do', 'his', 'get', 'very', 'de', 'myself', 'cannot', 'every', 'yourselves', 'him', 'is', 'cry', 'beforehand', 'these', 'she', 'where', 'the', 'ten', 'thin', 'eleven', 'namely',
...
After tokenizing and vectorizing the corpus data using CountVectorizer, we instantiate a FreqDistVisualizer object and then call fit() on that object with the count-vectorized documents and the features (i.e. the words from the corpus), which computes the frequency distribution. The visualizer then plots a bar chart of the top 50 most frequent terms in the corpus, with the terms listed along the y-axis and frequency counts along the x-axis. As with other Yellowbrick visualizers, when the user invokes poof(), the finalized visualization is shown.
vectorizer = CountVectorizer(stop_words=stop_words2)
docs = vectorizer.fit_transform(tweets)
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.poof()
As the Token Frequency Distribution Plot demonstrates, from a typical download of 1000 tweets, positive terms such as "like," "love," "happy," "good," and "best" appear frequently in tweets.
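As noted above, the visualizer is not limited to single tokens. Here is a minimal sketch of counting bigrams instead, assuming the same tweets corpus and stop word list used above; the only change is the vectorizer's ngram_range parameter:

# Count bigrams instead of single tokens by changing ngram_range; everything
# else matches the single-token example above.
bigram_vectorizer = CountVectorizer(stop_words=stop_words2, ngram_range=(2, 2))
docs = bigram_vectorizer.fit_transform(tweets)
features = bigram_vectorizer.get_feature_names()

visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.poof()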
t-SNE Corpus Visualizer
Besides a bar chart, we could also visualize text data in a scatter plot. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction algorithm for visualizing high-dimensional datasets. Specifically, it visualizes high-dimensional data in two- or three-dimensional space, by decomposing high-dimensional document vectors into lower dimensions using probability distributions from both the original dimensionality and the decomposed dimensionality. The technique allows us to create scatter plots so that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
Since t-SNE is a computationally expensive algorithm, typically a simpler decomposition method, such as SVD or PCA, is applied ahead of time. Yellowbrick's TSNEVisualizer creates an inner transformer pipeline that applies such a decomposition first (SVD with 50 components by default), then performs the t-SNE embedding. The visualizer then plots the scatter plot, coloring by cluster or by class, or neither if a structural analysis is required.
In the example below, I have tweets categorized by the date when I live-streamed data from Twitter. You can see the individual data points in the scatter plot corresponding to 8 dates. Tweets downloaded on 3/19 and 3/28 tend to form their own clusters, while boundaries for tweets downloaded on other dates are not so apparent.
# data: the list of raw tweets; target: the date each tweet was collected
tfidf = TfidfVectorizer()
docs = tfidf.fit_transform(data)
labels = target
tsne = TSNEVisualizer()
tsne.fit_transform(docs, labels)
tsne.poof()
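The decomposition step can also be customized. The sketch below assumes the TSNEVisualizer keyword arguments decompose and decompose_by (check the Yellowbrick documentation for your installed version), swapping in PCA with 25 components before the t-SNE embedding:

# Customize the inner decomposition: decompose selects the method, decompose_by
# the number of components to reduce to before running t-SNE (parameter names
# assume the Yellowbrick API described in its documentation).
tsne = TSNEVisualizer(decompose='pca', decompose_by=25)
tsne.fit_transform(docs, labels)
tsne.poof()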
Sentiment Analysis Using Twitter Data
Now that we've seen the basics of text visualization using Yellowbrick's text visualizers, let's apply the skills to a small sentiment analysis project. In this project, we are going to analyze people's sentiment trends (positive vs. negative) throughout 24 hours of a day, expressed by the content of their tweets. The Department of Sociology at Cornell University conducted a similar analysis in 2011. In this tutorial, we will apply the same method on a smaller set of Twitter data from 2018, and see if results still hold.
Sentiment analysis is the field of study that aims to extract opinions and sentiments from natural language text using computational methods. In a broader sense, the goal of sentiment analysis is to analyze people’s opinions, sentiments, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Bing Liu 2015). People around the world use Twitter to express their opinions and attitudes, thus making tweet data a great research target for sentiment analysis.
For this tutorial, we are interested in extracting general positive and negative sentiments for Twitter users from the United States, treating the specific topics of their tweets indifferently. The goal of the analysis is to identify whether certain sentiment trends exist across the 24 hours of a day, and in which time frames people demonstrate the highest levels of positive or negative sentiment.
1) Lexicon-Based Approach
Positive and negative sentiments can be expressed through word choice, punctuation, emoji, or content from tweets. We will first walk through a lexicon-based approach, which flags sentiments of individual tweets by words of positive or negative indicators. For example, in the first tweet below, "beautiful," "heart," and "happy" are clear positive sentiment words, whereas "never" and "hate" are negative sentiment words.
Positive sentiment:
"This is BEAUTIFUL & makes my heart happy. Instant follow. 😊💖"
Negative sentiment:
"I grew up never using the word hate but that’s the only word I can use to…"
The lexicon-based approach is the most traditional method used to perform sentiment analysis, where positive and negative sentiment words are collected and made available by prior research. For this tutorial, we use Bing Liu's Opinion Lexicon to conduct the analysis. Once we identify words indicating positive or negative sentiment in each tweet, we compute the sentiment score of the tweet using Laplacian smoothing:
positive_ratio = float(len(positive_words) + 1) / float(len(positive_words) + len(negative_words) + 2) - 0.5
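To make the smoothing concrete, here is a quick check of the formula against the two example tweets above; the word lists are just the sentiment terms called out earlier, not a full lexicon lookup:

def sentiment_score(positive_words, negative_words):
    """Laplace-smoothed positive ratio, shifted so that 0 is neutral."""
    return float(len(positive_words) + 1) / \
           float(len(positive_words) + len(negative_words) + 2) - 0.5

# Positive example: 3 positive terms, 0 negative -> (3+1)/(3+0+2) - 0.5 ≈ 0.3
print(sentiment_score(['beautiful', 'heart', 'happy'], []))

# Negative example: 0 positive terms, 2 negative -> (0+1)/(0+2+2) - 0.5 = -0.25
print(sentiment_score([], ['never', 'hate']))

# A tweet with no sentiment terms scores exactly 0 thanks to the smoothing
print(sentiment_score([], []))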
import os
import io
import nltk
from string import punctuation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

# collected 1000 tweets at each hour for 7 days, each hour's tweets are saved in one .txt file
data = '.../tweet_data/'

# set paths for sentiment lexicons
negfile = '.../OpinionLexiconNeg.txt'
posfile = '.../OpinionLexiconPos.txt'
negative = io.open(negfile, 'r', encoding='utf8').read().split('\n')
positive = io.open(posfile, 'r', encoding='utf8').read().split('\n')

items = [a for a in os.listdir(data)]

result = []
positive_list = []
negative_list = []

for a in range(0, len(items)):
    infile = os.path.join(data, items[a])
    tweets = io.open(infile, 'r', encoding='utf8').read()
    tweet_list = tweets.split('\n')
    ratio_list = []
    for tweet in tweet_list:
        # 1) Tokenize the tweet
        words = []
        if tweet == '':
            pass
        else:
            tokens = [t.lower() for t in nltk.wordpunct_tokenize(tweet)]
            for i in tokens:
                if i not in list(punctuation) and i != '':
                    words.append(i)

        # 2) Extract positive and negative terms from this tweet
        positive_words = []
        negative_words = []
        for w in words:
            if w in negative:
                negative_words.append(w)
                negative_list.append(w)
            elif w in positive:
                positive_words.append(w)
                positive_list.append(w)

        # 3) Compute sentiment score using Laplacian smoothing
        ratio = float(len(positive_words) + 1) / float(len(positive_words) + len(negative_words) + 2) - 0.5
        ratio_list.append(ratio)

    if len(ratio_list) != 0:
        score = sum(ratio_list) / float(len(ratio_list))  # average score over the 1000 tweets collected at each hour
    else:
        score = 0
    result.append(score)
Let's take a look at the top 50 positive and negative terms extracted from the tweets:
# Frequency distribution of the extracted positive terms
vectorizer = CountVectorizer(stop_words=stop_words2)
docs = vectorizer.fit_transform(positive_list)
features = vectorizer.get_feature_names()

visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.poof()

# Frequency distribution of the extracted negative terms
vectorizer = CountVectorizer(stop_words=stop_words2)
docs = vectorizer.fit_transform(negative_list)
features = vectorizer.get_feature_names()

visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.poof()
The chart below presents hourly sentiment scores generated from the Twitter data using the lexicon-based approach. Although there are some gaps in the data collection, we can see a cyclical pattern in which sentiment scores peak around the early morning and bottom out shortly after the start of each day.
# Build a datetime index for the x-axis; dates and times correspond to each hourly tweet file
date_time = [dates[i] + '-' + times[i] for i in range(0, len(dates))]
df = pd.DataFrame()
df['date'] = dates
df['time'] = times
df['date_time'] = date_time
df['score'] = result
df.date = pd.to_datetime(df.date)
df.date_time = pd.to_datetime(df.date_time)
plt.plot(df.date_time, df.score)
plt.xticks(rotation=45)
Let's look at the hourly sentiment scores more closely by averaging the scores at each hour across the 7 days; now the trend is more obvious. From the sample of tweets, we can see that people's sentiment scores are highest around 9 or 10 AM, and lowest from 2 to 4 AM each day. During the other hours of the day, the sentiment scores fluctuate within a fairly constant range. This suggests that people's energy levels are highest when they are fresh in the morning, and lowest after midnight for the night owls.
avg_time = df.groupby('time')['score'].mean()
plt.xticks(rotation=90)
plt.plot(avg_time)
2) Classification-Based Approach
The traditional lexicon-based approach is standardized and easy to implement, but it may not be applicable to all forms of text data. The definition of positive versus negative sentiment can vary with culture and context, as well as with humor, slang, or pop culture. For data collected from Twitter, where people do not need to follow conventional writing, it can be more appropriate to build your own lexicon than to use a research-based one. This is what we call a classification-based approach, where we apply machine learning algorithms to automatically extract features that correspond to positive or negative sentiment, and then use the model to classify new documents.
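The tutorial does not build such a classifier, but a minimal sketch could look like the following; the labeled_tweets and labels lists below are toy placeholders, and in practice you would supply your own hand-labeled tweets:

# A minimal classification-based sketch (not part of the original analysis).
# labeled_tweets/labels are toy placeholders standing in for a real labeled set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

labeled_tweets = [
    "This is BEAUTIFUL & makes my heart happy",
    "love this so much, best day ever",
    "so happy and grateful today",
    "what a good game, great win",
    "I hate waiting in this line",
    "worst service ever, never again",
    "this traffic is terrible and awful",
    "so sad and angry about the news",
]
labels = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg']

model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB()),
])
model.fit(labeled_tweets, labels)

print(model.predict(["happy to be here", "I hate mondays"]))  # e.g. ['pos' 'neg']

With a realistic training set, the fitted pipeline would then be applied to each hourly batch of tweets in place of the lexicon lookup.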
3) TextBlob
The classification-based approach requires building a training dataset with known sentiment labels, which is not easily accessible for projects with limited time or resources. Another alternative is to use third-party tools that are trained on data close to your target dataset. TextBlob is a Python package that performs a suite of text analytics jobs, such as tokenization, parsing, classification, and sentiment analysis. It ships with a Naive Bayes analyzer trained on pre-classified movie reviews that returns positive and negative probabilities, as well as a default, pattern-based analyzer that returns a polarity score between -1 and 1, which is what we use below.
from textblob import TextBlob

result_tb = []
for a in range(0, len(items)):
    infile = os.path.join(data, items[a])
    tweets = io.open(infile, 'r', encoding='utf8').read()
    tweet_list = tweets.split('\n')
    ratio_list = []
    for tweet in tweet_list:
        textblob = TextBlob(tweet)
        ratio_list.append(textblob.sentiment.polarity)
    if len(ratio_list) != 0:
        score = sum(ratio_list) / float(len(ratio_list))  # average score over the 1000 tweets collected at each hour
    else:
        score = 0
    result_tb.append(score)
df['score_tb'] = result_tb
plt.plot(df.date_time, df.score_tb)
plt.xticks(rotation=45)
avg_time = df.groupby('time')['score_tb'].mean()
plt.xticks(rotation=90)
plt.plot(avg_time)
Results from the lexicon-based approach and TextBlob both seem to conclude that positive sentiments concentrate from 8 AM to 10 AM and negative sentiments concentrate from 2 AM to 4 AM. Two clusters are formed when we visualize all tweets extracted from these two periods on a t-SNE plot, reinforcing our conclusion.
# data: tweets drawn from the 8-10 AM and 2-4 AM windows; target: the window each tweet came from
tfidf = TfidfVectorizer()
docs = tfidf.fit_transform(data)
labels = target
tsne = TSNEVisualizer()
tsne.fit_transform(docs, labels)
tsne.poof()
As we can see, outputs from TextBlob are comparable to outputs from the lexicon-based approach, confirming that people's sentiment level follows a cyclical pattern throughout 24 hours of a day. The level is highest around 8 AM to 10 AM in the morning and lowest around 2 AM to 4 AM in the early morning.
The experiment done by Cornell University’s Department of Sociology in 2011 analyzed positive sentiment and negative sentiment separately, and found the following:
- Positive sentiment peaks early in the morning and again near midnight.
- Negative sentiment is lowest in the morning and rises throughout the day to a nighttime peak.
There is some overlap between our conclusions and the Cornell researchers', but a couple of discrepancies stand out: positive sentiment does not peak again near midnight, and negative sentiment peaks from 2 AM to 4 AM instead of around midnight. Perhaps around midnight, peak positive and negative sentiments cancel each other out. Or perhaps people stay up later to express negative sentiment in 2018 than they did in 2011. I'll leave that for you to find out!
Conclusion
Yellowbrick is a powerful tool that generates numerous diagnostic visualizations to facilitate the model selection process. In the text analytics space, it produces token frequency distribution visualization and t-SNE corpus visualization. This tutorial walks through how one would use Yellowbrick's text visualizers to perform text analytics. I find Yellowbrick's text visualizers to be very helpful and easy to use, and strongly encourage you to explore other powerful functionalities of Yellowbrick. Hope you enjoyed the tutorial!