Practical Application of Data Science and Machine Learning
Using Data Science to Solve a Real-World Problem
By Linwood Creekmore III
Lots of people write or publish data science tutorials. Yet most of these tutorials -- including those written by folks who hold the title of data scientist -- fail to provide examples where machine learning solves an everyday problem. Newcomers entering the field struggle initially because tutorials and formal education rarely align with the problems (or data) you face in the workplace.
This real-world data science tutorial aims to address that gap by building a machine learning pipeline that solves a problem you would face as a data scientist in an actual work environment. We will cover, in detail, the steps to train a model that labels content as "relevant" or "non-relevant" based on the requirements of a customer. Along the way, you will learn a few tricks for picking and tuning the best model to achieve optimum performance. Most importantly, we close out by creating a web service that delivers our machine learning goodness to customers. This is a crucial step because data scientists need the skills to deliver data products to both technical and non-technical audiences. By following the steps in this tutorial, you will complete the entire cycle of the data science pipeline.
The Scenario and Goal of this Tutorial
You did it! You landed your first job as a data scientist on a cross-functional team that does consulting work for clients. The senior partners of the company were so impressed with your resume that they gave you a lead position on a major account. Our fictitious customer is a sports agency that specializes in representing professional American football quarterbacks (QBs) during contract negotiations. The agency needs a system to track news articles about current or prospective clients, so they hired our firm to develop an application with a tailored news feed.
As the data scientist, you must build a model that can process the content of a news article and determine whether it is relevant to the customer. To make matters interesting, this client is heavily involved in the development process and has asked to see a prototype of the model in action before it is integrated into the application. Additionally, our developer teammates need an API endpoint to access the output of our model for testing in the application.
Note: This scenario is representative of a task you would have working as a data scientist in industry. While there are lots of difficult-sounding problems out there, the workplace problems are usually less flashy.
Based on our scenario, this tutorial solves a classification problem. Classification is the process of predicting the category of some data by learning the differences in features between different categories. For example, we could classify days as hot or cold based on the temperature reading at midday. By learning the differences in temperatures associated with hot or cold days, a model would eventually develop enough "intuition" to organize days into two buckets that correspond to hot and cold categories. In a similar fashion, we are going to create a model to organize news articles into two categories. Our example is trivial but any time you need to separate relevant data from irrelevant data, the tools you learn here will help. Here's a graphic showing what our model plans to do:
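To make the hot/cold example concrete in code, here is a minimal, hypothetical sketch (the temperature readings and labels are made up for illustration) of a classifier learning to separate hot days from cold days:

from sklearn.linear_model import LogisticRegression

# hypothetical midday temperature readings (Fahrenheit) and their labels
temps = [[34], [41], [48], [55], [78], [84], [90], [97]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = cold day, 1 = hot day

# the model learns the temperature boundary that separates the two buckets
clf = LogisticRegression()
clf.fit(temps, labels)

# predict the category of two unseen days
print(clf.predict([[45], [88]]))

Our news-article model works the same way, except the "temperature" features are the words in each article.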
Our journey to build and deploy a machine learning model will use real, unclean data; the data is not overly "dirty" but it will be a welcome change for those who need to expand beyond the Titanic, Iris, or Auto MPG datasets used in most data science tutorials. I did not provide a copy of the data, but I will provide the necessary code so that you can access and prepare the data yourself.
Below are the steps we will cover in this tutorial:
- Creating the environment.
- Pulling, cleaning, and loading data (cleaning handled by custom function).
- Modeling the data.
- Testing and scoring data against multiple models to pick the best one.
- Tuning the model to improve performance.
- Building a web application and API endpoint to expose your model to technical and non-technical audiences.
Create the Environment
Let's start by creating an environment with the tools we will need. But first, let's briefly cover why you need a virtual environment.
Chances are, this tutorial is one of many tutorials or projects you are completing to improve your data science skills. You have also realized by now that different projects require different tools; one project may require a library that only runs in Python 2 and another may require a library that is only available in Python 3. By using virtual environments, you can create isolated spaces on a single computer that support both projects, eliminating the potential for version conflicts and the serious errors that come from corrupting a system dependency. For more details on virtual environments, read this blog post.
Note: Using virtual environments has another, more important benefit. Sharing! Outside of creating an isolated environment for testing, we can export our virtual environment into easily shared formats, making it easier for others to recreate our work.
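For example, if you are using conda, exporting the active environment to a shareable file is a single command (a quick sketch; the output filename is your choice):

# export the active environment's packages to a file others can rebuild from
conda env export > environment.yml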
You will use my exported file to build a mirror environment with all the tools I used to make this tutorial. I recommend using the Anaconda Python distribution platform for setup. Why? It's free, easy to install, and widely used in the data science community. Moreover, some Python libraries have OS-level dependencies which fail to build when using pip install; this leads to lots of frustration and hand-wringing. Avoid these challenges and use Anaconda because it makes setup easy. Visit this page and follow the install instructions for your system.
Next, clone or download the GitHub project for this tutorial which has the yml file to recreate my environment. Navigate to the base folder in the command line and execute these commands:
# finds environment.yml in the current folder and builds the environment
conda env create -q
# activate the environment
source activate classificationEnv
If you did not receive an error, you have successfully recreated my environment. If you received errors or decided not to use Anaconda, you'll need a bit of luck and determination to get things installed. I recommend using Google and StackOverflow to search for and solve the error messages you receive.
Note: Developing your Google and StackOverflow skills is something you should work on as well. This will be a critical skillset for your data science job. Why? Often, people have already encountered and solved the problem you're facing. Don't reinvent the wheel!
Access and Clean the Data
With environment setup complete, we shift our focus to accessing data. Let's assume our customer prefers to use ESPN news, so we will get our training data from their website. I built up a corpus of 1100+ news articles from ESPN covering the 2017-2018 football season. This website provides a snapshot of the ESPN articles used to build the corpus.
I decided to use the requests library to build my corpus. By changing the month and year in the linked URL above, you can retrieve a list of URLs covering NFL news for that specific timeframe. Then you just need to cycle through each URL and download the text of those stories.
Note: If you are doing a topic outside of NFL news, make sure you have a way to access the data and build a respectable training set. I suggest having at least 500 samples that have a near even split of the desired classes you're trying to predict.
To pull the ESPN articles, I used the multiprocessing library and a custom script to download, clean, and create 100-row parquet files of NFL news stories on local disk. My script works on any web page so feel free to use it if you switch to another topic. Below is an example where my script pulls data from three urls:
import re

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

from utilities import textgetter

# empty dataframe to hold the articles
df = pd.DataFrame()

# pull the archive page from the ESPN website
r = requests.get(
    "http://www.espn.com/nfl/news/archive/_/month/march/year/2018")

# parse the website data
soup = BeautifulSoup(r.content, 'html.parser')

# regex to identify links to NFL news stories
regexp = re.compile(r"(http://www\.espn\.com/nfl/story).*")

# store all urls to NFL stories in the urls array
urls = np.asarray([
    link['href'] for link in soup.find_all('a', href=True)
    if regexp.search(link['href'])
])

# download and extract the first 3 urls
for l in urls[:3]:
    try:
        o = next(textgetter(l))
        df = df.append(pd.Series(o), ignore_index=True)
    except Exception:
        print('Could not download {}'.format(l))
This returns a Python dictionary of clean, extracted data for each article. We now have key-value pairs for the title, author, publish date, text, and several other data fragments that are useful for machine learning. I processed over 1,000 urls and built a pandas dataframe of the output. The first 3 rows look like this:
Now we're ready to label our data.
Label the Data: Find Which Articles Discuss Quarterbacks
There are several approaches to labeling this data. We could label each article manually based on the presence of the word "quarterback" (easy), or we could use other features (such as names) to automatically label our data and hope that our model learns to associate the word "quarterback" with our target class (a more useful demonstration of machine learning). To keep this interesting, we'll opt for the second approach.
The rules are simple: if an article mentions a QB's name, it receives a label of "1" (positive class). If it doesn't mention a QB name, it receives a "0" label (negative class). With simple rules like that, all we need is a machine-readable list of QB names.
Luckily, ESPN has a webpage with QB names in a table. After downloading and parsing this table, we use pandas and regular expressions (regex) to test whether an article mentions a QB's name and label it accordingly. Note that you will need to install some extra dependencies for the pandas.read_html method to work; review the pandas documentation for help. Here is the code to download the table of QB names:
# download the table of active QBs
qb = pd.read_html('http://www.espn.com/nfl/players/_/position/qb')[0].rename(
    columns={
        0: 'name',
        1: 'team',
        2: 'college'
    })
And here is the regex to isolate the QB names and filter our dataframe.
# clean columns to isolate names; drop repeated header rows and single-letter section rows
r = re.compile(r'^[A-Z](?![\w])')
qb = qb[~(qb['team'] == 'TEAM')
        & (~qb['name'].apply(lambda x: True if r.search(x) else False))]

# build the regex string of "First Last" names joined by OR
h = "|".join(
    qb.assign(
        name=qb.apply(lambda x: " ".join(x['name'].split(',')[::-1]), axis=1))[
            'name'].values)
r = re.compile(h, flags=re.IGNORECASE)

# create a vectorized filter that checks text for any QB name
vmatch = np.vectorize(lambda x: bool(r.search(x)))

# create the label column
df = df.assign(label=df.apply(lambda x: 1 if vmatch(x['text']) else 0, axis=1))
Our rule set created a new label column. This column is our ground truth and organizes the articles into our target categories. The automated approach we used was quick, but it is not representative of what you will face in a real project. In an ideal world, we would manually label these articles with a "0" or "1" or have a pre-labeled data set. Obtaining labeled data is the biggest challenge to using supervised learning methods; it takes considerable time to build up a properly labeled dataset.
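Before moving on, it's worth a quick check of how balanced the resulting labels are; here is a minimal sketch using the label column we just created:

# count the number of articles in each class
print(df.label.value_counts())

# proportion of positive ("mentions a QB") articles
print(df.label.mean())

A severely imbalanced split would be a warning sign that our classifier could score well simply by predicting the majority class.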
Note: If you are using different data for this tutorial, consider labeling it manually to achieve the best results.
Let's preview the first 3 rows of our new label column:
Picking the Best Model
At this point, we have all the tools and data we need to build our model. We will use scikit-learn to train our model, given that it's Python's most popular machine learning library. Besides being widely used and having extensive documentation, scikit-learn has many implementations of classification, regression and clustering algorithms. Because of the abundance of algorithms, this library is well suited to solve just about any machine learning problem. But with that variety also comes a curse. With so many models to choose from, which one should you use?
This decision of which model to use is often influenced by past experiences. Data scientists, biased by formal education, online tutorials, or prior work successes, maintain a "go to" list of algorithms to solve data analytics problems. Why? In simple terms, we use a smaller subset of models because familiarity breeds preference. But sometimes, another algorithm or model may work better for your data. This tutorial introduces a repeatable process (and reusable code) to overcome our model biases and test new algorithms.
The repeatable process focuses on sampling the performance of multiple algorithms to see which one performs best. To get started, we need a seed list of algorithms to try. The scikit-learn machine learning flow chart and the page on supervised learning methods provide a selection of models to test. To cycle through and test models easily, we build a custom scikit-learn pipeline that preprocesses our documents and converts them into the multidimensional arrays scikit-learn uses to train its models.
We also want to make sure we adhere to best practices while implementing machine learning pipelines. To ensure that our models are generalizable, we include a 10-fold cross-validation step which evaluates our model on different combinations of test/training splits.
Note: Cross validation is useful for avoiding problems such as overfit and gives a more realistic idea for how your model will perform on unseen data.
Let's take all the guidance above and create a generic function to test our models:
import datetime

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score


# custom transformer to convert sparse matrices to dense arrays
class DenseTransformer(BaseEstimator, TransformerMixin):

    def transform(self, X, y=None, **fit_params):
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self


def modeler(estimator):
    """
    Creates a pipeline to convert text to a
    multidimensional array and fit it to an
    estimator of choice.

    Parameters
    ----------
    estimator: scikit-learn object
        A scikit-learn object that corresponds to
        an estimator

    Returns
    -------
    scikit-learn pipeline
        Returns a new scikit-learn pipeline object
        with the new estimator appended to the
        end.
    """
    # make a pipeline for any estimator
    sklearn_pipe = Pipeline([('vectorizer', TfidfVectorizer(stop_words='english',
                                                            ngram_range=(1, 1))),
                             ('to_dense', DenseTransformer()),
                             ('est', estimator)])
    return sklearn_pipe


def model_selection(estimator, X, y, num=5, size=.05, state=42):
    """
    Tests a scikit-learn estimator on a text
    dataset for classification.

    Parameters
    ----------
    estimator : scikit-learn estimator
        A scikit-learn estimator; can use hyperparameters
        or just use with default settings
    X: pandas.Series, array, or list
        The features used for classification
    y: pandas.Series, array, or list
        The label used for testing
    num: integer
        Specifies the number of cross validation folds
    size: float
        Determines the test size left out to evaluate
        the model
    state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    """
    # set random state
    estimator.random_state = state
    # create pipeline
    model = modeler(estimator)
    # cross validation set up
    cv = StratifiedShuffleSplit(
        n_splits=num, test_size=size, random_state=state)
    # get the estimator name
    es_name = estimator.__class__.__name__
    # these estimators would not run with multiprocessing, so use a single core
    if es_name in ['MultinomialNB', 'BernoulliNB', 'GaussianNB',
                   'Perceptron', 'MLPClassifier']:
        core_use = 1
    else:
        core_use = -1
    # set start time
    start = np.datetime64(datetime.datetime.now())
    # get scores
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=core_use, scoring='f1')
    # get the average duration per fold
    dur = np.timedelta64(np.datetime64(datetime.datetime.now()) - start,
                         'ms') / num
    # get standard deviation
    var = np.std(scores)
    return es_name, np.float64(np.mean(scores)), var, dur
Next, we only need to make a list of all the algorithms we want to try. Using the f1 score, standard deviation, and time as measures, we will compare the overall performance of each model. But, what are we looking for? The perfect model will have a high f1 score, low compute time, and very low standard deviation.
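As a refresher, the f1 score is the harmonic mean of precision and recall. Here is a quick sketch (with made-up labels) showing how scikit-learn computes it:

from sklearn.metrics import f1_score, precision_score, recall_score

# made-up ground truth and predictions for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # 3 true positives / 4 predicted positives = 0.75
r = recall_score(y_true, y_pred)      # 3 true positives / 4 actual positives = 0.75
print(f1_score(y_true, y_pred))       # 2 * p * r / (p + r) = 0.75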
Because we are applying the same function across a list of objects where the results of one operation do not depend on the results of another, this process is considered embarrassingly parallel.
Note: Embarrassingly parallel processes can take advantage of multiprocessing to speed up computation times.
While testing the multiprocessing workflow, there was a small hiccup, which is why we split our estimators into two lists. The naive Bayes and neural network families of scikit-learn models would not process in parallel, so I iterated over those models separately and appended their scores to the array holding the other results. This code sets up the 16 algorithms we will train on our data:
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

# our first list of algorithms that can be trained in parallel
fiers = [LinearSVC(), NuSVC(), SVC(),
         KNeighborsClassifier(), DecisionTreeClassifier(), ExtraTreeClassifier(),
         LogisticRegression(), SGDClassifier(max_iter=1000),
         RandomForestClassifier(), BaggingClassifier(), AdaBoostClassifier()]

# our list of algorithms that must be trained individually
singles = [MLPClassifier(max_iter=10, hidden_layer_sizes=(10,)), Perceptron(max_iter=10),
           MultinomialNB(), BernoulliNB(), GaussianNB()]
Now we build a partial function to run through our list of scikit-learn estimators in parallel. After finishing up with the parallel job, we use the traditional Python list iteration to pass through our list of estimators that cannot be processed in parallel. At the end of this step, we should have scores for all 16 models.
from functools import partial
from concurrent.futures import ProcessPoolExecutor

# structured dtype for the score records
score_dtype = [('model', np.dtype('U100')), ('score', np.float64),
               ('std_dev', np.float64), ('fittime', 'timedelta64[ms]')]

# set up the multiprocessor
e = ProcessPoolExecutor()

# build a partial function to run in parallel
midies = partial(model_selection, X=df.text, y=df.label, num=10, state=42, size=.05)

# map the function to the list of estimators
results = np.array(list(e.map(midies, fiers)), dtype=score_dtype)

# now we add the estimators that could not be run in parallel
new = results.copy()
for m in singles:
    new = np.append(new, np.array(model_selection(m, df.text, df.label, num=10, size=.05),
                                  dtype=score_dtype))
Our score processing is done but we need to compare the performance. The scores are stored in a structured numpy array. The tabular view of a pandas dataframe is an excellent tool for comparison and we can also use function mapping capabilities of pandas to create a custom weighting system. The weighting will reward a model for having a high f1 score and low processing time. At the same time, it will penalize models for having a larger standard deviation. Remember, our ideal model would process the training data quickly (low processing time), perform consistently (low standard deviation), and have a high f1 score. Here's our code:
import re

# regex to pull the leading digits out of the timedelta string
digits = re.compile(r'^[\d]+')

# store the scores in a dataframe
evals = pd.DataFrame.from_records(new)

# convert the fit time into a numeric column (milliseconds)
evals = evals.assign(
    numbertime_ms=evals.fittime.apply(
        lambda x: float(digits.search(str(np.timedelta64(x, 'ms'))).group())
        if digits.search(str(np.timedelta64(x, 'ms'))) else np.nan))

# combined score: reward a high f1 and low standard deviation, penalize long fit times
evals.assign(
    combined=evals.apply(
        lambda x: ((x['score'] * .75) / (x['std_dev'] * .15)) / np.log10(x['numbertime_ms'] * .10)
        if x['score'] > 0 else np.nan,
        axis=1)).sort_values('combined', ascending=False)
Here is our table of scores:
The pandas dataframe aligns our scores neatly, enabling a quick comparison. Note that all scores are based on estimators using default settings. Hyperparameter tuning sometimes leads to substantial performance gains, but we use the default configurations to triage and decide which model is worth the time to tune. Looking at the table above, how did the models perform?
The BaggingClassifier estimator had a high f1 score but ranked lower because of its longer compute time (~12 seconds to train the model). The model with the fastest training time, the Multinomial Naive Bayes estimator (along with the other Bayesian estimators), was at the top of the pack despite its lower score and wider standard deviation (our ranking wasn't perfect, but it's good enough). Looking over the results, the Linear Support Vector Classification (Linear SVC) estimator provides the best balance of speed, performance, and consistency. This is the model we will explore going forward.
What does the Linear SVC do?
Since the Linear SVC model was our best performer, let's take some time to understand how it works. Linear SVC is a supervised machine learning method that searches for the hyperplane that best separates the target classes; the distance between the hyperplane and the closest observations on either side is called the margin. Those closest observations are the support vectors, and the algorithm solves a quadratic optimization problem over the data set to find the hyperplane that provides the most distance between the two classes. In short, this model draws the best line between the classes to separate them. This graphic shows the major components of the Linear SVC calculation:
Read the following for more details on Support Vector Machines:
- National Instruments Discussion of Support Vector Machines
- Linear and nonlinear support vector classification theory sections in this paper
- Brief scikit-learn discussion of Support Vector Machines
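Before we move on to tuning, here is a minimal, hypothetical sketch (the 2-D points are made up) showing the pieces described above: LinearSVC learns a weight vector and intercept that define the separating hyperplane, and the decision function measures how far a point sits from that hyperplane.

import numpy as np
from sklearn.svm import LinearSVC

# two tiny, made-up clusters of 2-D points for illustration
X = np.array([[1.0, 1.2], [1.5, 0.8], [0.9, 1.0],
              [4.0, 4.2], [4.5, 3.8], [3.9, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svc = LinearSVC().fit(X, y)

# coef_ and intercept_ define the hyperplane w.x + b = 0
print(svc.coef_, svc.intercept_)

# signed distance (up to scaling) of new points from the hyperplane
print(svc.decision_function([[1.0, 1.0], [4.2, 4.1]]))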
Tuning to Improve Model Performance
From our scoring tests above, we see that the LinearSVC model performs well with a ~91% f1 score, but we can likely improve performance by tuning hyperparameters. Hyperparameters are parameters that are not directly learned by the model, and tweaking a small subset of them can substantially impact a model's performance. It's best to read the docstring for the scikit-learn estimator you are using to see how different values affect your model. Once you identify the hyperparameters you want to tune, you need to test different values to see how performance changes. Scikit-learn can perform a randomized search over parameters using RandomizedSearchCV to find the combination of values that achieves the highest score.
Note: I am using scipy.stats distributions to specify a distribution over possible values. Any object that provides an rvs (random variate sample) method for sampling values can be passed. See the scipy.stats documentation for details.
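To see what that means in practice, the rvs method simply draws candidate values from the distribution, which is what RandomizedSearchCV does on each iteration; a quick sketch:

import scipy.stats

# draw a few candidate values for C the same way the randomized search will
print(scipy.stats.exponnorm(1000).rvs(size=3, random_state=42))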
After looking at the possible LinearSVC parameters, our code to tune is as follows:
# import random search and the scipy distributions
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats

# establish our model
clf = modeler(LinearSVC())

# set up different parameters to test
param_dist2 = {"est__C": scipy.stats.exponnorm(1000),
               "est__tol": scipy.stats.expon(.001),
               "est__dual": [True],
               "est__penalty": ["l2"],
               "est__loss": ["hinge"]}

# define number of iterations
n_iter_search = 300

# set up cross validation
cv = StratifiedShuffleSplit(n_splits=3, test_size=.05, random_state=42)

# define the search
random_search2 = RandomizedSearchCV(clf, param_distributions=param_dist2,
                                    cv=cv, n_iter=n_iter_search, random_state=42)
Let's execute our randomized search for the best parameters:
# fit the parameters to find the best combination
random_search2.fit(df.text, df.label)
Our best parameters and scores were:
print(random_search2.best_params_)
[Out]:
{'est__C': 753.95491923611723,
'est__dual': True,
'est__loss': 'hinge',
'est__penalty': 'l2',
'est__tol': 2.2761232251945227}
print(random_search2.best_score_)
[Out]:
0.94029850746268662
We were able to improve model performance slightly, achieving a ~94% f1 score. This is a good score, but let's expand how we review the performance of our model.
Evaluating Model Performance Visually using yellowbrick
Up until now, we printed the numeric scores to evaluate the performance of our model. But, we can also compare performance visually. The Yellowbrick library provides a suite of visual diagnostic tools that help us pick the best model based on performance in several key metric areas.
Let's review model performance using the classification report visualizer, which is an intuitive display of your model's scores in precision, recall, and the f1-measure. Evaluation is simple; you want a matrix that has all dark colors and high numbers closer to 1.0. Let's build the function to visualize model performance:
from yellowbrick.classifier import ClassificationReport
from sklearn.model_selection import train_test_split


def visual_model_selection(X, y, estimator, testing=0.1, path=None):
    """
    Test an estimator and return a visual representation
    of its performance.

    Parameters
    ----------
    estimator : scikit-learn estimator
        A scikit-learn estimator; can use hyperparameters
        or just use with default settings
    X: pandas.Series, array, or list
        The features used for classification
    y: pandas.Series, array, or list
        The label used for testing
    testing: float
        Size of the test set
    path: string, optional
        If given, the visualization is written to this path
        instead of being displayed
    """
    # build the same text-processing pipeline around the passed estimator
    modpipe = Pipeline([('vectorizer', TfidfVectorizer(stop_words='english', ngram_range=(1, 1))),
                        ('to_dense', DenseTransformer()),
                        ('est', estimator)])
    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(modpipe, classes=['Not about quarterback', 'About quarterback'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=testing, random_state=42)
    visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
    visualizer.score(X_test, y_test)    # Evaluate the model on the test data
    if path:
        visualizer.poof(path)
    else:
        visualizer.poof()
Now we visualize performance for the LinearSVC model we tuned. Earlier, our f1 score reached 94%, but now we can use this tool to understand strengths and weaknesses of the model. Remember, we want to see all dark red blocks and numbers closer to 1.0 in our heat map.
visual_model_selection(X=df.text, y=df.label,
                       estimator=LinearSVC(C=753.95491923611723, dual=True, loss='hinge',
                                           penalty='l2', tol=2.2761232251945227,
                                           random_state=42))
This visual gives us a better understanding of our model. The good news is that our heatmap has a high concentration of dark colors, which means the model performs well. A look at the numbers provides more detail. The model has better precision for finding articles that discuss quarterbacks, so you can have more confidence that when the model returns "about a quarterback," the article is indeed discussing a quarterback. We see the opposite relationship in the recall metric: the model correctly classifies more of the articles that are not discussing quarterbacks than of the articles that are, which means we will miss some "about a quarterback" articles. In other words, you are seeing a real-life example of the trade-off between precision and recall. Again, this performance is adequate, but we'll go one step further to see how we can improve our scores.
Ensembling/Stacking to get a "Boost"
RandomizedSearchCV gave us a small uptick in model performance, but we will close out our model tuning section by introducing a powerful machine learning approach. Ensembling is the process of combining the outputs of multiple models on the same data to create a stronger learner. In boosting, a form of ensembling for supervised learning problems, models are trained in sequence, with each round placing special emphasis on improving performance on the previously misclassified outputs. The approach will often outperform the individual models and result in a higher performing model. What does this mean for our tutorial? After we have found a high performing model, we can use that estimator as the base learner in a boosting ensemble to squeeze out more performance. For more reading on boosting:
- Adaboost discussion
- Boosting Algorithms
- Improving Machine Learning with Ensemble methods
Scikit-learn has a method for the popular boosting algorithm, AdaBoost. Let's try this estimator and get the visual inspection as well:
visual_model_selection(X=df.text, y=df.label,
                       estimator=AdaBoostClassifier(
                           base_estimator=LinearSVC(C=753.95491923611723, dual=True, loss='hinge',
                                                    penalty='l2', tol=2.2761232251945227,
                                                    random_state=42),
                           algorithm='SAMME', random_state=42))
Wow! Note the change in numbers. Averaging scores across rows or columns, the score is consistently ~96%. Our tinkering has been successful because we boosted our model's performance by 5%! Now, we're ready to save the boosted model for later use.
Storing the Model: Persisting for later use
To make the model available for later use, we will pickle it to disk. You may hear others in industry refer to this as "persisting" a model. In this case, our model includes the full pipeline: reading the data in, converting strings to vectors, passing those vectors to the estimator, and predicting the class of the observation. Scikit-learn bundles an external tool, joblib, that we can use to persist our model. Here is the code:
from sklearn.externals import joblib  # in newer scikit-learn versions: import joblib

# create the model
clf = modeler(AdaBoostClassifier(base_estimator=LinearSVC(C=753.95491923611723, dual=True,
                                                          loss='hinge', penalty='l2',
                                                          tol=2.2761232251945227,
                                                          random_state=42),
                                 algorithm='SAMME', random_state=42))

# fit all the data; we're done tuning and testing
clf.fit(df.text, df.label)

# store the model to disk
joblib.dump(clf, 'final_boostedLinearSVC.pkl')
Our model is stored and ready for reuse. Now it's time for a real test!
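When you need the model again (for example, inside the web application), loading it back is one line; a quick sketch, using the filename we saved above and a made-up sentence for illustration:

from sklearn.externals import joblib  # in newer scikit-learn versions: import joblib

# reload the persisted pipeline and use it exactly like before
clf = joblib.load('final_boostedLinearSVC.pkl')
print(clf.predict(["The quarterback threw for 300 yards in the season opener."]))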
Testing the model: Does it work on new, unseen data?
While improving the performance of your model against known data is positive, nothing proves the viability of your model like a real world test. I will test our model on four articles with the following expected outcomes:
- Article about an NFL quarterback; expected outcome → class 1
- Article about a college quarterback; expected outcome → class 1
- Article about football but not discussing a quarterback; expected outcome → class 0
- Article about a current event; expected outcome → class 0
Pay extra special attention to tests 2 and 3; if we predict the correct class for these articles, we can rejoice because the model is generalizable enough to make relevant predictions on news articles it has not seen. We will discuss why below.
After a random search, I found several non-ESPN.com articles that fit our designed test. Again, I'll use the custom script I made to strip data from websites and return it as a structured record. Here is the code:
from utilities import textgetter

url1 = "http://www.nfl.com/news/story/0ap3000000926907/article/packers-aaron-rodgers-my-role-is-to-play-quarterback"
url2 = "https://www.ocregister.com/2017/08/08/highlights-from-tuesdays-ucla-football-training-camp-practice/"
url3 = "http://www.nfl.com/news/story/0ap3000000927788/article/giants-not-interested-in-excowboys-wr-dez-bryant"
url4 = "https://edition.cnn.com/2018/04/20/asia/north-korea-closes-nuclear-site/index.html"

# NFL quarterback article, not from ESPN
article1 = next(textgetter(url1))
print("The first article has prediction of {}".format(clf.predict([article1['text']])[0]))

# college quarterback article, not from ESPN
article2 = next(textgetter(url2))
print("\n\nThe second article has prediction of {}".format(clf.predict([article2['text']])[0]))

# NFL non-quarterback article, not from ESPN
article3 = next(textgetter(url3))
print("\n\nThe third article has prediction of {}".format(clf.predict([article3['text']])[0]))

# random non-football article
article4 = next(textgetter(url4))
print("\n\nThe fourth article has prediction of {}".format(clf.predict([article4['text']])[0]))
[Out]:
The first article has prediction of 1
The second article has prediction of 1
The third article has prediction of 0
The fourth article has prediction of 0
We got every prediction right! Take special note of the significance of predictions 2 and 3.
We trained our model using labels based only on the names of NFL QBs. Prediction 2 is significant because the model was able to generalize and recognize articles that discuss the "quarterback" position in general. Without human intervention, we successfully created a link between the word "quarterback" and our topic of interest (remember, we only labeled articles based on the presence of an active NFL QB's name). Because of this association, the model can accurately recognize articles about college QBs as well. And our notable takeaways don't stop with the positive class; the negative class shows some pretty powerful intuition too.
Prediction 3 is significant because it proves that our model is specific enough to distinguish between articles that discuss "football" and articles that discuss "football and quarterbacks." This article is just about football, and does not mention the word "quarterback."
Based on these tests, we know that our model is sufficient for the task of identifying articles on American football quarterbacks. We're done tuning and testing!
Making a Machine Learning Service to Deliver a Data Product
We are confident that our model performs well, but we need to deliver this product to our technical and non-technical customers. Most tutorials would end after the paragraph above and this is a major reason why data scientists struggle to find purpose when they start a new job. We all make beautiful and powerful models, but how do you get the model to the customer? The final section of this tutorial will provide an example of delivering your model.
In our scenario, we must demonstrate the model to our non-technical customer (the sports agency) but also provide a machine-readable path for our developers to access the model. Because of this dual requirement, our product should be user-facing and simple enough for a non-technical audience, but it should also have a query-able back end for our technically inclined customers. A web application seems like a good solution.
Here are the functional requirements:
- Allow a user to visit a web site.
- Allow the user to paste the url of an article.
- Pass the data to our model and get a prediction.
- Show that prediction to the user.
- Allow repeats.
- Provide the same functionality for technical consumers (developers or other data scientists) in a machine-readable endpoint.
Luckily, it doesn't take much to build a web application with these features. In fact, we can pretty much copy the web application example from the CherryPy documentation. It's important to note that we are not creating a robust, production-ready web application (that's the developers' job). We just need a prototype to deliver the model to our customers.
Building the Web Application
To get started, we use the CherryPy framework to make an application. One example in the tutorials section provides a good baseline of code for us to copy. The project folder for this tutorial has a folder named qb_intspector that has all the files we need to run the application; you will see explanations of each file below. First, let's see the file structure of the application:
qb_intspector/
│   index.html
│   my.db
│   QBapp.py
│   utilities.py
│
└─── models/
│    │
│    │   qbmodel_final.pkl
│
└─── public/
     │
     └─── css/
     │    │
     │    │   style.css
     │
     └─── images/
          │
          │   pic1.jpg
          │   pic2.jpg
          │   pic3.jpg
Here is an explanation for each file:
- index.html - this is the code that defines how our website looks. It also contains some JavaScript and jQuery to dynamically load content based on user actions. An explanation of the JavaScript and jQuery code is outside the scope of this tutorial.
- my.db - this is our SQLite database. Each time a user requests a prediction, that data is stored in the database for later use. Developers can query this database for predictions that were made in the past. The database is created and destroyed when the application starts and stops.
- QBapp.py - this is the workhorse code of our application. It defines all the functionality we need to run the application and pass data to our machine learning model to get predictions (a stripped-down sketch of its core idea follows this list).
- utilities.py - contains utility functions used by our web application. Specifically, this file contains the code to extract structured data from a web-based news article.
- models - this folder contains the scikit-learn model we built earlier.
- public - this folder contains the images and other formatting files for our HTML code.
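The full QBapp.py in the repository does more than this (serving the HTML front end, writing to the database, and handling errors), but here is a hedged, stripped-down sketch of the core idea: a CherryPy endpoint that loads the persisted pipeline and returns a prediction for a submitted url. The class name and handler details here are for illustration only and are not copied from the repository.

import cherrypy
from sklearn.externals import joblib  # newer scikit-learn versions: import joblib

from utilities import textgetter


class QBApp(object):

    def __init__(self, model_path='models/qbmodel_final.pkl'):
        # load the persisted scikit-learn pipeline once at startup
        self.model = joblib.load(model_path)

    @cherrypy.expose
    @cherrypy.tools.json_out()
    def generator(self, url=None):
        # pull and clean the article, then run its text through the pipeline
        record = next(textgetter(url))
        record['prediction'] = int(self.model.predict([record['text']])[0])
        return record


if __name__ == '__main__':
    cherrypy.quickstart(QBApp(), '/')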
To start the application:
- Open a terminal or command prompt.
- Navigate to the folder where the QBapp.py file is located.
- Execute this command: python QBapp.py
The web application should now be running. Visit http://127.0.0.1:8080/ in a browser to see it in action.
Using the QB-INTSpector Web Application
Non-Technical Use: Getting Predictions Using the Web Application
Once the application is running, getting a prediction is easy. Copy a web url, paste it into the input box, and click Get Prediction. After you press the button, the web application reaches out to the website, extracts the news article data, converts the text into an array of vectors, passes the vectors to our classifier model, gets a prediction, stores the data in the database, and returns a message indicating the category of the article. Here's what happens when the article is discussing a quarterback.
All of that happens with one button click. This is far simpler than requiring a person to install Python, load libraries, and execute some code.
Technical Use: Using the Prediction Services and Data
The last piece is giving our developers access to the model. Our web application has a backend API service that can retrieve predictions. This endpoint returns a JSON string with all the extracted data for the news article, plus added key-value pairs for the prediction. Here is the code a developer would use to access your prediction service:
import requests


def pulls(x):
    """Function to access the prediction web service.

    Parameters
    ----------
    x: string
        The url of the web article you want a prediction for.

    Returns
    -------
    record: json string
        Returns a json string with the extracted data and a key/value
        pair for the prediction added in.
    """
    s = requests.Session()
    param_dict = {"url": x}
    r = s.post('http://127.0.0.1:8080/generator?', params=param_dict)
    return r.content
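For example, with the application running locally, a developer could request a prediction for one of the test urls defined earlier:

# request a prediction for the Aaron Rodgers article used in our tests above
print(pulls(url1))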
Developers can also access cached results using this code:
import sqlite3
# connect to database
conn = sqlite3.connect("qb_intspector/my.db")
# make the cursor
c = conn.cursor()
# search for all records
c.execute("SELECT * FROM user_string")
# store results in dataframe
d = pd.DataFrame(list(c.fetchall()))
print(d)
Remember, this database is completely deleted when the web application is shut down. You can modify the code to load and delete the database if you want to persist this data.
Using a combination of the services above, our developers could pass in a daily stream of urls, get predictions, and load up a custom web page or production web application that satisfies our sports agency customer's needs.
Conclusion
The journey was long, but we successfully completed one full cycle through the data science pipeline. Moreover, we used a real world example that included tips on how to pick and tune the best model to achieve optimum performance. And then finally, we created a web application that can deliver the output of our model to technical and non-technical audiences. The tools and tricks you learned in this tutorial can be applied to many machine learning projects.
As a side note, the JavaScript, jQuery, and HTML I used above are beyond the scope of this discussion. I wanted to focus on the model and give a quick example of a delivery mechanism. If you'd like to suggest any improvements to this process or let me know about any errors you see, feel free to reach out on Twitter!