The final project for this course is the creation of a data analytics project proposal that addresses an issue or opportunity from the scenario and data set(s) you chose in Module One that leverages data analytics, evaluates the current use of data, and highlights recommended tools with the ultimate goal of improving business value.
For this milestone, submit the conclusion portion of the final project (Part III). Be sure that you are addressing how the proposal will benefit the organization. Specifically, the following critical elements must be addressed:
III. Conclusion
A. Value: Determine the value of applying data analytics to this company or business based on your analysis of the value of the initiative you proposed. In other words, describe the benefit of using data analytics to meet the goals, needs, or opportunities of your company, and derive actionable insight.
B. Insights: Communicate the insights you gained from your analysis of the initiative, the data, and the data analytic tools and technology you explored with management. How are these insights potentially beneficial to the company, the industry, and the company’s future? How are they beneficial to your future as an analytics professional?
Running head: DATA ANALYTICS
Data Analytics
Name of Student
Institutional Affiliation
Date
Tool applicability to Initiative
The significance of data science is reflected in many situations. Businesses today are
continually developing and applying tools that help them make informed decisions. In this
case, Sonia needs a data analytics tool to determine the likelihood of patients having a heart
attack. As the data scientist, my goal is to inspect, transform, and model her patients’ data in
order to discover key information that indicates when a first or second heart attack is likely
to occur and to support the decisions that follow. Hence, an appropriate analytic tool is
needed to analyze the data and predict the risk factors.
When assessing and selecting the right tool for analyzing the data, one is required to
explore the data and find the patterns and relationships among the data sets. One should
also be able to apply statistical techniques to determine whether the hypotheses about the
data set are true or false.
My initiative will use quantitative analytic tools, which involve numerical data
analysis of quantifiable variables that can be compared or measured statistically. These tools
will include data mining, in which large data sets are searched to identify trends, patterns,
and relationships, and predictive analytics, which aims to determine the likelihood of heart
attacks and any future risk factors that may trigger them.
Initiative
In this case, testing the applicability of tools to the initiative requires building an
analytical model with predictive modeling tools and programming languages such as R. I
prefer to use R in this case because I consider it the best-suited analytic tool for this kind of
statistical work. The R model will first be run against a partial data set to test its accuracy
and output. It will then be revised and tested again until all of its functions work together as
required. Once it performs according to the aim of the project, it will be run in production
mode against Sonia’s full data set.
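The partial-then-full workflow described above can be sketched as a simple holdout check. The sketch is written in Python rather than R purely for illustration. The records, the `cholesterol` and `heart_attack` fields, the `threshold_rule` stand-in model, and the cutoff of 240 are all hypothetical and are not taken from Sonia's data.

```python
import random

def split_partial(records, fraction=0.3, seed=42):
    """Return (partial, remainder): a random sample for testing and the rest."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def threshold_rule(record, cutoff=240):
    """Hypothetical stand-in for a model: flag high cholesterol as at-risk."""
    return record["cholesterol"] >= cutoff

def accuracy(records, predict):
    """Share of records where the prediction matches the recorded outcome."""
    correct = sum(1 for r in records if predict(r) == r["heart_attack"])
    return correct / len(records)

# Made-up records for illustration only.
patients = [
    {"cholesterol": 180, "heart_attack": False},
    {"cholesterol": 250, "heart_attack": True},
    {"cholesterol": 265, "heart_attack": True},
    {"cholesterol": 200, "heart_attack": False},
    {"cholesterol": 245, "heart_attack": False},
    {"cholesterol": 230, "heart_attack": True},
    {"cholesterol": 190, "heart_attack": False},
    {"cholesterol": 280, "heart_attack": True},
    {"cholesterol": 210, "heart_attack": False},
    {"cholesterol": 255, "heart_attack": True},
]

partial, remainder = split_partial(patients)
print("accuracy on partial set:", accuracy(partial, threshold_rule))
print("accuracy on full set:", accuracy(patients, threshold_rule))
```

Fixing the random seed keeps the partial sample reproducible between revisions, so a change in accuracy reflects a change in the model rather than a different sample.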
Tool applicability to Data
Several data analytic tools can be used in this instance. However, it is challenging to
identify the best tool for the task to be completed. In this case, Sonia requires a tool that will
improve the prediction of heart attacks. It should also have an easy-to-use interface that
guides non-technical users through the modeling process. Further, the tool should support
advanced analysis algorithms and pattern finding in data. Lastly, the tool should make it
possible to define test parameters for the analysis (Sedkaoui, 2018).
R, my preferred tool, has standard data analysis capabilities and can access data in
various formats. It also provides statistical models such as regression and ANOVA, which
make it easier to extract the required information from the available data. With the available
medical data, R can be used for cleaning, analysis, and representation of the data to show
future predictions by computing the mean, maximum, and standard deviation of the risk
factors of the patients (Laxmi Lydia E; Shankar K; S.Sheeba Rani; Lakshmanaprabu S.K,
2019). R can be used to draw out patients whose risk factors, such as age, marital status,
stress management, and cholesterol, among others, produce a score likely to lead to a heart
attack. R also runs on Windows and macOS, making it more suitable for use in predictive
analytics when predicting heart attacks in various individuals. R is also strong at creating the
visually appealing graphics required to gain meaningful insights about the patients. Weight,
anxiety traits, second heart attack, cholesterol, age, and gender will all be represented
graphically with R, making it the best tool for the initiative.
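The summary statistics mentioned above (mean, maximum, and standard deviation of each risk factor) can be sketched as follows. This is a minimal illustration in Python rather than R; the column names and all values are invented for the example and do not come from Sonia's data.

```python
import statistics

# Made-up values for illustration only.
risk_factors = {
    "age": [52, 61, 47, 70, 58],
    "cholesterol": [180, 250, 265, 200, 245],
    "weight_kg": [72, 95, 88, 64, 101],
}

def summarize(values):
    """Return the mean, maximum, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }

summary = {name: summarize(values) for name, values in risk_factors.items()}
for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.1f} max={stats['max']} stdev={stats['stdev']:.1f}")
```

`statistics.stdev` computes the sample standard deviation; if the data covered every patient of interest rather than a sample, `statistics.pstdev` would be the population equivalent.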
G-tool Recommendations
The favorable data analytic tools that can be used to analyze the available data are
Python, SAS, and Tableau. Python is an analytic tool that is easy to use and very powerful.
Python is essential because it outmatches R by combining statistical analytic tools with other
functionality such as machine learning. This means that Python can be used to analyze the
patient data presented by Sonia and estimate when the patients are likely to have a heart
attack. SAS is a programming language used for data manipulation, and its data can be
easily managed and accessed (Laxmi Lydia E; Shankar K; S.Sheeba Rani; Lakshmanaprabu
S.K, 2019). It is favorable in this case because it can profile patients and prospects as a way
of predicting their health outcomes, managing them, and optimizing communication. Lastly,
Tableau is a data visualization tool used for exploring data and performing various kinds of
analysis. It can access data in multiple formats for analysis and visualization (Sedkaoui,
2018). Tableau is also favorable for this analysis because it is very powerful at presenting
information visually through graphical representations. Sonia’s employer is more likely to
consider Tableau because it can help reduce costs and improve the patient experience. It
also reduces the time taken to connect to the data, visualize and analyze it, and find the
right prospects.
References
Laxmi Lydia E; Shankar K; S.Sheeba Rani; Lakshmanaprabu S.K. (2019). Statistical
Predictive Modelling through R Programming. Evincepub Publishing.
North, M. (2016). Data Mining for the Masses, Second Edition: With Implementations in
RapidMiner and R.
Sedkaoui, S. (2018). Data Analytics and Big Data. Hoboken, NJ: John Wiley & Sons.
Data Mining
for the Masses
Dr. Matthew North
A Global Text Project Book
This book is available on Amazon.com.
© 2012 Dr. Matthew A. North
This book is licensed under a Creative Commons Attribution 3.0 License
All rights reserved.
ISBN: 0615684378
ISBN-13: 978-0615684376
DEDICATION
This book is gratefully dedicated to Dr. Charles Hannon, who gave me the chance to become a
college professor and then challenged me to learn how to teach data mining to the masses.
Table of Contents
Dedication
Table of Contents
Acknowledgements
SECTION ONE: Data Mining Basics
  Chapter One: Introduction to Data Mining and CRISP-DM
    Introduction
    A Note About Tools
    The Data Mining Process
    Data Mining and You
  Chapter Two: Organizational Understanding and Data Understanding
    Context and Perspective
    Learning Objectives
    Purposes, Intents and Limitations of Data Mining
    Database, Data Warehouse, Data Mart, Data Set…?
    Types of Data
    A Note about Privacy and Security
    Chapter Summary
    Review Questions
    Exercises
  Chapter Three: Data Preparation
    Context and Perspective
    Learning Objectives
    Collation
    Data Scrubbing
    Hands on Exercise
    Preparing RapidMiner, Importing Data, and Handling Missing Data
    Data Reduction
    Handling Inconsistent Data
    Attribute Reduction
    Chapter Summary
    Review Questions
    Exercise
SECTION TWO: Data Mining Models and Methods
  Chapter Four: Correlation
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Five: Association Rules
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Six: k-Means Clustering
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Seven: Discriminant Analysis
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Eight: Linear Regression
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Nine: Logistic Regression
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Ten: Decision Trees
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Eleven: Neural Networks
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
  Chapter Twelve: Text Mining
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise
SECTION THREE: Special Considerations in Data Mining
  Chapter Thirteen: Evaluation and Deployment
    How Far We’ve Come
    Learning Objectives
    Cross-Validation
    Chapter Summary: The Value of Experience
    Review Questions
    Exercise
  Chapter Fourteen: Data Mining Ethics
    Why Data Mining Ethics?
    Ethical Frameworks and Suggestions
    Conclusion
GLOSSARY and INDEX
About the Author
ACKNOWLEDGEMENTS
I would not have had the expertise to write this book if not for the assistance of many colleagues at
various institutions. I would like to acknowledge Drs. Thomas Hilton and Jean Pratt, formerly of
Utah State University and now of University of Wisconsin—Eau Claire who served as my Master’s
degree advisors. I would also like to acknowledge Drs. Terence Ahern and Sebastian Diaz of West
Virginia University, who served as doctoral advisors to me.
I express my sincere and heartfelt gratitude for the assistance of Dr. Simon Fischer and the rest of
the team at Rapid-I. I thank them for their excellent work on the RapidMiner software product
and for their willingness to share their time and expertise with me on my visit to Dortmund.
Finally, I am grateful to the Kenneth M. Mason, Sr. Faculty Research Fund and Washington &
Jefferson College, for providing financial support for my work on this text.
SECTION ONE: DATA MINING BASICS
Chapter 1: Introduction to Data Mining and CRISP-DM
CHAPTER ONE:
INTRODUCTION TO DATA MINING AND CRISP-DM
INTRODUCTION
Data mining as a discipline is largely transparent to the world. Most of the time, we never even
notice that it’s happening. But whenever we sign up for a grocery store shopping card, place a
purchase using a credit card, or surf the Web, we are creating data. These data are stored in large
sets on powerful computers owned by the companies we deal with every day. Lying within those
data sets are patterns—indicators of our interests, our habits, and our behaviors. Data mining
allows people to locate and interpret those patterns, helping them make better informed decisions
and better serve their customers. That being said, there are also concerns about the practice of
data mining. Privacy watchdog groups in particular are vocal about organizations that amass vast
quantities of data, some of which can be very personal in nature.
The intent of this book is to introduce you to concepts and practices common in data mining. It is
intended primarily for undergraduate college students and for business professionals who may be
interested in using information systems and technologies to solve business problems by mining
data, but who likely do not have a formal background or education in computer science. Although
data mining is the fusion of applied statistics, logic, artificial intelligence, machine learning and data
management systems, you are not required to have a strong background in these fields to use this
book. While having taken introductory college-level courses in statistics and databases will be
helpful, care has been taken to explain within this book the necessary concepts and techniques
required to successfully learn how to mine data.
Each chapter in this book will explain a data mining concept or technique. You should understand
that the book is not designed to be an instruction manual or tutorial for the tools we will use
(RapidMiner and OpenOffice Base and Calc). These software packages are capable of many types
of data analysis, and this text is not intended to cover all of their capabilities, but rather, to
illustrate how these software tools can be used to perform certain kinds of data mining. The book
is also not exhaustive; it includes a variety of common data mining techniques, but RapidMiner in
particular is capable of many, many data mining tasks that are not covered in the book.
The chapters will all follow a common format. First, chapters will present a scenario referred to as
Context and Perspective. This section will help you to gain a real-world idea about a certain kind of
problem that data mining can help solve. It is intended to help you think of ways that the data
mining technique in that given chapter can be applied to organizational problems you might face.
Following Context and Perspective, a set of Learning Objectives is offered. The idea behind this section
is that each chapter is designed to teach you something new about data mining. By listing the
objectives at the beginning of the chapter, you will have a better idea of what you should expect to
learn by reading it. The chapter will follow with several sections addressing the chapter’s topic. In
these sections, step-by-step examples will frequently be given to enable you to work alongside an
actual data mining task. Finally, after the main concepts of the chapter have been delivered, each
chapter will conclude with a Chapter Summary, a set of Review Questions to help reinforce the main
points of the chapter, and one or more Exercises to allow you to try your hand at applying what was
taught in the chapter.
A NOTE ABOUT TOOLS
There are many software tools designed to facilitate data mining; however, many of these are often
expensive and complicated to install, configure and use. Simply put, they’re not a good fit for
learning the basics of data mining. This book will use OpenOffice Calc and Base in conjunction
with an open source software product called RapidMiner, developed by Rapid-I, GmbH of
Dortmund, Germany. Because OpenOffice is widely available and very intuitive, it is a logical
place to begin teaching introductory level data mining concepts. However, it lacks some of the
tools data miners like to use. RapidMiner is an ideal complement to OpenOffice, and was selected
for this book for several reasons:
• RapidMiner provides specific data mining functions not currently found in OpenOffice,
  such as decision trees and association rules, which you will learn to use later in this book.
• RapidMiner is easy to install and will run on just about any computer.
• RapidMiner’s maker provides a Community Edition of its software, making it free for
  readers to obtain and use.
• Both RapidMiner and OpenOffice provide intuitive graphical user interface environments
  which make it easier for general computer-using audiences to experience the power
  of data mining.
All examples using OpenOffice or RapidMiner in this book will be illustrated in a Microsoft
Windows environment, although it should be noted that these software packages will work on a
variety of computing platforms. It is recommended that you download and install these two
software packages on your computer now, so that you can work along with the examples in the
book if you would like.
OpenOffice can be downloaded from: http://www.openoffice.org/
RapidMiner Community Edition can be downloaded from:
http://rapid-i.com/content/view/26/84/
THE DATA MINING PROCESS
Although data mining’s roots can be traced back to the late 1980s, for most of the 1990s the field
was still in its infancy. Data mining was still being defined, and refined. It was largely a loose
conglomeration of data models, analysis algorithms, and ad hoc outputs. In 1999, several sizeable
companies including auto maker Daimler-Benz, insurance provider OHRA, hardware and software
manufacturer NCR Corp. and statistical software maker SPSS, Inc. began working together to
formalize and standardize an approach to data mining. The result of their work was CRISP-DM,
the CRoss-Industry Standard Process for Data Mining. Although
the participants in the creation of CRISP-DM certainly had vested interests in certain software and
hardware tools, the process was designed independent of any specific tool. It was written in such a
way as to be conceptual in nature—something that could be applied independent of any certain
tool or kind of data. The process consists of six steps or phases, as illustrated in Figure 1-1.
[Figure 1-1: CRISP-DM Conceptual Model. Six phases arranged around the data at the center: 1. Business Understanding, 2. Data Understanding, 3. Data Preparation, 4. Modeling, 5. Evaluation, 6. Deployment.]
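As a rough aid to memory, the six phases in Figure 1-1 can be written down as an ordered enumeration. The Python sketch below is illustrative only: the `Phase` and `next_phase` names are invented here, and treating Deployment as leading back to the first phase is an assumption based on the figure's circular layout, not a rule stated by CRISP-DM.

```python
from enum import Enum

class Phase(Enum):
    """The six CRISP-DM phases, numbered as in Figure 1-1."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6

def next_phase(phase):
    """Return the following phase; Deployment wraps around to phase 1 (assumed)."""
    return Phase(phase.value % len(Phase) + 1)

for phase in Phase:
    print(f"{phase.value}. {phase.name} -> {next_phase(phase).name}")
```

In practice a project may move backward as well as forward, for example returning to Data Preparation after a disappointing model, so the strict ordering here is a simplification.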
CRISP-DM Step 1: Business (Organizational) Understanding
The first step in CRISP-DM is Business Understanding, or what will be referred to in this text
as Organizational Understanding, since organizations of all kinds, not just businesses, can use
data mining to answer questions and solve problems. This step is crucial to a successful data
mining outcome, yet is often overlooked as folks try to dive right into mining their data. This is
natural of course—we are often anxious to generate some interesting output; we want to find
answers. But you wouldn’t begin building a car without first defining what you want the vehicle to
do, and without first designing what you are going to build. Consider these oft-quoted lines from
Lewis Carroll’s Alice’s Adventures in Wonderland:
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where–” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“–so long as I get SOMEWHERE,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”
Indeed. You can mine data all day long and into the night, but if you don’t know what you want to
know, if you haven’t defined any questions to answer, then the efforts of your data mining are less
likely to be fruitful. Start with high level ideas: What is making my customers complain so much?
How can I increase my per-unit profit margin? How can I anticipate and fix manufacturing flaws
and thus avoid shipping a defective product? From there, you can begin to develop the more
specific questions you want to answer, and this will enable you to proceed to …
CRISP-DM Step 2: Data Understanding
As with Organizational Understanding, Data Understanding is a preparatory activity, and
sometimes, its value is lost on people. Don’t let its value be lost on you! Years ago when workers
did not have their own computer (or multiple computers) sitting on their desk (or lap, or in their
pocket), data were centralized. If you needed information from a company’s data store, you could
request a report from someone who could query that information from a central database (or fetch
it from a company filing cabinet) and provide the results to you. The inventions of the personal
computer, workstation, laptop, tablet computer and even smartphone have each triggered moves
away from data centralization. As hard drives became simultaneously larger and cheaper, and as
software like Microsoft Excel and Access became increasingly more accessible and easier to use,
data began to disperse across the enterprise. Over time, valuable data stores became strewn across
hundreds and even thousands of devices, sequestered in marketing managers’ spreadsheets,
customer support databases, and human resources file systems.
As you can imagine, this has created a multi-faceted data problem. Marketing may have wonderful
data that could be a valuable asset to senior management, but senior management may not be
aware of the data’s existence—either because of territorialism on the part of the marketing
department, or because the marketing folks simply haven’t thought to tell the executives about the
data they’ve gathered. The same could be said of the information sharing, or lack thereof, between
almost any two business units in an organization. In Corporate America lingo, the term ‘silos’ is
often invoked to describe the separation of units to the point where interdepartmental sharing and
communication is almost non-existent. It is unlikely that effective organizational data mining can
occur when employees do not know what data they have (or could have) at their disposal or where
those data are currently located. In chapter two we will take a closer look at some mechanisms
that organizations are using to try to bring all their data into a common location. These include
databases, data marts and data warehouses.
Simply centralizing data is not enough, however. There are plenty of questions that arise once an
organization’s data have been corralled. Where did the data come from? Who collected them and
was there a standard method of collection? What do the various columns and rows of data mean?
Are there acronyms or abbreviations that are unknown or unclear? You may need to do some
research in the Data Preparation phase of your data mining activities. Sometimes you will need to
meet with subject matter experts in various departments to unravel where certain data came from,
how they were collected, and how they have been coded and stored. It is critically important that
you verify the accuracy and reliability of the data as well. The old adage “It’s better than nothing”
does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a
data mining activity, because decisions based upon partial or wrong data are likely to be partial or
wrong decisions. Once you have gathered, identified and understood your data assets, then you
may engage in…
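One simple way to begin checking completeness is to count how many records lack a value in each column you expect to rely on. The Python sketch below is only an illustration of that idea; the `missing_report` function, the column names, and the sample rows are all hypothetical, and a real check would also need to cover accuracy, not just presence.

```python
def missing_report(rows, required):
    """Count, per required column, how many rows lack a usable value."""
    counts = {column: 0 for column in required}
    for row in rows:
        for column in required:
            value = row.get(column)  # None when the column is absent entirely
            if value is None or str(value).strip() == "":
                counts[column] += 1
    return counts

# Made-up records for illustration only.
rows = [
    {"customer_id": "1001", "region": "West", "signup_date": "2011-04-02"},
    {"customer_id": "1002", "region": "", "signup_date": "2011-05-17"},
    {"customer_id": "1003", "region": "East"},
    {"customer_id": "", "region": "East", "signup_date": None},
]

report = missing_report(rows, ["customer_id", "region", "signup_date"])
print(report)
```

A report like this gives you concrete questions to take to the subject matter experts, such as why a particular column is so often empty.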
CRISP-DM Step 3: Data Preparation
Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text,
and others are in picture form such as charts, graphs and maps. Some data are anecdotal or
narrative, such as comments on a customer satisfaction survey or the transcript of a witness’s
testimony. Data that aren’t in rows or columns of numbers shouldn’t be dismissed though—
sometimes non-traditional data formats can be the most information rich. We’ll talk in this book
about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be
one of our most common layouts, we’ll also get into text mining where paragraphs can be fed into
RapidMiner and analyzed for patterns as well.
Data Preparation involves a number of activities. These may include joining two or more data
sets together, reducing data sets to only those variables that are interesting in a given data mining
exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or reformatting data for consistency purposes. For example, you may have seen a spreadsheet or
database that held phone numbers in many different formats:
(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555
Each of these represents the same phone number, but stored in a different format. A data
mining exercise is most likely to yield good, useful results when the underlying data are as
consistent as possible. Data preparation can help to ensure that you improve your chances of a
successful outcome when you begin…
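Before moving on, here is one way the phone number clean-up described above might look in code. This book relies on RapidMiner and OpenOffice rather than programming, so treat the following Python as an illustrative sketch only. The function name normalize_phone and the chosen target format (NNN-NNN-NNNN) are assumptions made for this example, not something the text prescribes.

```python
import re


def normalize_phone(raw: str) -> str:
    """Reduce a 10-digit phone number in any common layout to NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# The six layouts listed in the text above.
samples = [
    "(555) 555-5555",
    "555/555-5555",
    "555-555-5555",
    "555.555.5555",
    "555 555 5555",
    "5555555555",
]

normalized = [normalize_phone(s) for s in samples]
print(normalized)
```

All six inputs collapse to the same stored value, which is exactly the consistency that data preparation is after. Real phone data (extensions, country codes) would need more rules than this sketch includes.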
CRISP-DM Step 4: Modeling
A model, in data mining at least, is a computerized representation of real-world observations.
Models are the application of algorithms to seek out, identify, and display any patterns or messages
in your data. There are two basic kinds or types of models in data mining: those that classify and
those that predict.
Figure 1-2: Types of Data Mining Models.
As you can see in Figure 1-2, there is some overlap between the types of models data mining uses.
For example, this book will teach you about decision trees. Decision Trees are a predictive
model used to determine which attributes of a given data set are the strongest indicators of a given
outcome. The outcome is usually expressed as the likelihood that an observation will fall into a
certain category. Thus, Decision Trees are predictive in nature, but they also help us to classify our
data. This will probably make more sense when we get to the chapter on Decision Trees, but for
now, it’s important just to understand that models help us to classify and predict based on patterns
the models find in our data.
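To make the idea of "strongest indicators" a little more concrete, the sketch below scores two attributes by information gain, one common criterion that decision tree algorithms use to decide which attribute best separates the outcomes. The tiny data set, the attribute names (has_card, weekend) and the outcome name (bought) are all invented for illustration, and the tools used later in this book may rely on a different criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """How much knowing `attribute` reduces uncertainty about `target`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical observations: did a shopper buy the promoted item?
rows = [
    {"has_card": "yes", "weekend": "yes", "bought": "yes"},
    {"has_card": "yes", "weekend": "no", "bought": "yes"},
    {"has_card": "no", "weekend": "yes", "bought": "no"},
    {"has_card": "no", "weekend": "no", "bought": "no"},
]

gains = {a: information_gain(rows, a, "bought") for a in ("has_card", "weekend")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this made-up data, has_card separates buyers from non-buyers perfectly while weekend tells us nothing, so a tree would split on has_card first. That single split both predicts the outcome and classifies each observation.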
Models may be simple or complex. They may contain only a single process, or stream, or they may
contain sub-processes. Regardless of their layout, models are where data mining moves from
preparation and understanding to development and interpretation. We will build a number of
example models in this text. Once a model has been built, it is time for…
CRISP-DM Step 5: Evaluation
All analyses of data have the potential for false positives. Even if a model doesn’t yield false
positives however, the model may not find any interesting patterns in your data. This may be
because the model isn’t set up well to find the patterns, you could be using the wrong technique, or
there simply may not be anything interesting in your data for the model to find. The Evaluation
phase of CRISP-DM is there specifically to help you determine how valuable your model is, and
what you might want to do with it.
Evaluation can be accomplished using a number of techniques, both mathematical and logical in
nature. This book will examine techniques for cross-validation and testing for false positives using
RapidMiner. For some models, the power or strength indicated by certain test statistics will also be
discussed. Beyond these measures however, model evaluation must also include a human aspect.
As individuals gain experience and expertise in their field, they will have operational knowledge
which may not be measurable in a mathematical sense, but is nonetheless indispensable in
determining the value of a data mining model. This human element will also be discussed
throughout the book. Using both data-driven and instinctive evaluation techniques to determine a
model’s usefulness, we can then decide how to move on to…
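As a small illustration of the mathematical side of evaluation, the sketch below counts false positives by comparing a model's predictions against known outcomes. The lists of actual and predicted values are invented for the example, and the function name confusion_counts is an assumption; this is not RapidMiner's own procedure.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally true/false positives and negatives for a two-class prediction."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Hypothetical known outcomes and the model's predictions for the same cases.
actual = ["yes", "no", "no", "yes", "no", "no"]
predicted = ["yes", "yes", "no", "no", "no", "yes"]

counts = confusion_counts(actual, predicted)
# Share of truly negative cases that the model wrongly flagged as positive.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

A number like this is only a starting point. Whether a given false positive rate is acceptable depends on the operational knowledge described above.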
CRISP-DM Step 6: Deployment
If you have successfully identified your questions, prepared data that can answer those questions,
and created a model that passes the test of being interesting and useful, then you have arrived at
the point of actually using your results. This is deployment, and it is a happy and busy time for a data
miner. Activities in this phase include automating your model, meeting with consumers
of your model’s outputs, integrating with existing management or operational information systems,
feeding new learning from model use back into the model to improve its accuracy and
performance, and monitoring and measuring the outcomes of model use. Be prepared for a bit of
distrust of your model at first—you may even face pushback from groups who may feel their jobs
are threatened by this new tool, or who may not trust the reliability or accuracy of the outputs. But
don’t let this discourage you! Remember that CBS did not trust the initial predictions of the
UNIVAC, one of the first commercial computer systems, when the network used it to predict the
eventual outcome of the 1952 presidential election on election night. With only 5% of the votes
counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide;
something no pollster or election insider considered likely, or even possible. In fact, most ‘experts’
expected Stevenson to win by a narrow margin, with some acknowledging that because they
expected it to be close, Eisenhower might also prevail in a tight vote. It was only late that night,
when human vote counts confirmed that Eisenhower was running away with the election, that
CBS went on the air to acknowledge first that Eisenhower had won, and second, that UNIVAC
had predicted this very outcome hours earlier, but network brass had refused to trust the
computer’s prediction. UNIVAC was further vindicated later, when its prediction was found to
be within 1% of what the eventual tally showed. New technology is often unsettling to people,
and it is hard sometimes to trust what computers show. Be patient and specific as you explain how
a new data mining model works, what the results mean, and how they can be used.
While the UNIVAC example illustrates the power and utility of predictive computer modeling
(despite inherent mistrust), it should not be construed as a reason for blind trust either. In the days of
UNIVAC, the biggest problem was the newness of the technology. It was doing something no
one really expected or could explain, and because few people understood how the computer
worked, it was hard to trust it. Today we face a different but equally troubling problem: computers
have become ubiquitous, and too often, we don’t question enough whether or not the results are
accurate and meaningful. In order for data mining models to be effectively deployed, balance must
be struck. By clearly communicating a model’s function and utility to stakeholders, thoroughly
testing and proving the model, then planning for and monitoring its implementation, data mining
models can be effectively introduced into the organizational flow. Failure to carefully and
effectively manage deployment, however, can sink even the best and most effective models.
DATA MINING AND YOU
Because data mining can be applied to such a wide array of professional fields, this book has been
written with the intent of explaining data mining in plain English, using software tools that are
accessible and intuitive to everyone. You may not have studied algorithms, data structures, or
programming, but you may have questions that can be answered through data mining. It is our
hope that by writing in an informal tone and by illustrating data mining concepts with accessible,
logical examples, data mining can become a useful tool for you regardless of your previous level of
data analysis or computing expertise. Let’s start digging!
CHAPTER TWO:
ORGANIZATIONAL UNDERSTANDING AND DATA
UNDERSTANDING
CONTEXT AND PERSPECTIVE
Consider some of the activities you’ve been involved with in the past three or four days. Have you
purchased groceries or gasoline? Attended a concert, movie or other public event? Perhaps you
went out to eat at a restaurant, stopped by your local post office to mail a package, made a
purchase online, or placed a phone call to a utility company. Every day, our lives are filled with
interactions – encounters with companies, other individuals, the government, and various other
organizations.
In today’s technology-driven society, many of those encounters involve the transfer of information
electronically. That information is recorded and passed across networks in order to complete
financial transactions, reassign ownership or responsibility, and enable delivery of goods and
services. Think about the amount of data collected each time even one of these activities occurs.
Take the grocery store for example. If you take items off the shelf, those items will have to be
replenished for future shoppers – perhaps even for yourself – after all you’ll need to make similar
purchases again when that case of cereal runs out in a few weeks. The grocery store must
constantly replenish its supply of inventory, keeping the items people want in stock while
maintaining freshness in the products they sell. It makes sense that large databases are running
behind the scenes, recording data about what you bought and how much of it, as you check out
and pay your grocery bill. All of that data must be recorded and then reported to someone whose
job it is to reorder items for the store’s inventory.
However, in the world of data mining, simply keeping inventory up-to-date is only the beginning.
Does your grocery store require you to carry a frequent shopper card or similar device which,
when scanned at checkout time, gives you the best price on each item you’re buying? If so, they
can now begin to keep track not only of store-wide purchasing trends, but of individual purchasing
trends as well. The store can target market to you by sending mailers with coupons for products
you tend to purchase most frequently.
Now let’s take it one step further. Remember, if you can, what types of information you provided
when you filled out the form to receive your frequent shopper card. You probably indicated your
address, date of birth (or at least birth year), whether you’re male or female, and perhaps the size of
your family, annual household income range, or other such information. Think about the range of
possibilities now open to your grocery store as they analyze that vast amount of data they collect at
the cash register each day:
• Using ZIP codes, the store can locate the areas of greatest customer density, perhaps
  aiding their decision about the construction location for their next store.
• Using information regarding customer gender, the store may be able to tailor marketing
  displays or promotions to the preferences of male or female customers.
• With age information, the store can avoid mailing coupons for baby food to elderly
  customers, or promotions for feminine hygiene products to households with a single
  male occupant.
These are only a few of the many examples of potential uses for data mining. Perhaps as you read
through this introduction, some other potential uses for data mining came to your mind. You may
have also wondered how ethical some of these applications might be. This text has been designed
to help you understand not only the possibilities brought about through data mining, but also the
techniques involved in making those possibilities a reality while accepting the responsibility that
accompanies the collection and use of such vast amounts of personal information.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
• Define the discipline of Data Mining
• List and define various types of data
• List and define various sources of data
• Explain the fundamental differences between databases, data warehouses and data sets
• Explain some of the ethical dilemmas associated with data mining and outline possible
  solutions
PURPOSES, INTENTS AND LIMITATIONS OF DATA MINING
Data mining, as explained in Chapter 1 of this text, applies statistical and logical methods to large
data sets. These methods can be used to categorize the data, or they can be used to create predictive
models. Categorizations of large sets may include grouping people into similar types of
classifications, or identifying similar characteristics across a large number of observations.
Predictive models, however, transform these descriptions into expectations upon which we can
base decisions. For example, the owner of a book-selling Web site could project how frequently
she may need to restock her supply of a given title, or the owner of a ski resort may attempt to
predict the earliest possible opening date based on projected snow arrivals and accumulations.
It is important to recognize that data mining cannot provide answers to every question, nor can we
expect that predictive models will always yield results which will in fact turn out to be the reality.
Data mining is limited to the data that has been collected. And those limitations may be many.
We must remember that the data may not be completely representative of the group of individuals
to which we would like to apply our results. The data may have been collected incorrectly, or it
may be out-of-date. There is an expression which can adequately be applied to data mining,
among many other things: GIGO, or Garbage In, Garbage Out. The quality of our data mining results
will directly depend upon the quality of our data collection and organization. Even after doing our
very best to collect high quality data, we must still remember to base decisions not only on data
mining results, but also on available resources, acceptable amounts of risk, and plain old common
sense.
DATABASE, DATA WAREHOUSE, DATA MART, DATA SET…?
In order to understand data mining, it is important to understand the nature of databases, data
collection and data organization. This is fundamental to the discipline of Data Mining, and will
directly impact the quality and reliability of all data mining activities. In this section, we will
examine the differences between databases, data warehouses, and data sets. We will also
examine some of the variations in terminology used to describe data attributes.
Although we will be examining the differences between databases, data warehouses and data sets,
we will begin by discussing what they have in common. In Figure 2-1, we see some data organized
into rows (shown here as A, B, etc.) and columns (shown here as 1, 2, etc.). In varying data
environments, these may be referred to by differing names. In a database, rows would be referred
to as tuples or records, while the columns would be referred to as fields.
Figure 2-1: Data arranged in columns and rows.
In data warehouses and data sets, rows are sometimes referred to as observations, examples or
cases, and columns are sometimes called variables or attributes. For purposes of consistency in
this book, we will use the terminology of observations for rows and attributes for columns. It is
important to note that RapidMiner will use the term examples for rows of data, so keep this in
mind throughout the rest of the text.
A database is an organized grouping of information within a specific structure. Database
containers, such as the one pictured in Figure 2-2, are called tables in a database environment.
Most databases in use today are relational databases—they are designed using many tables which
relate to one another in a logical fashion. Relational databases generally contain dozens or even
hundreds of tables, depending upon the size of the organization.
Figure 2-2: A simple database with a relation between two tables.
Figure 2-2 depicts a relational database environment with two tables. The first table contains
information about pet owners; the second, information about pets. The tables are related by the
single column they have in common: Owner_ID. By relating tables to one another, we can reduce
redundancy of data and improve database performance. The process of breaking tables apart and
thereby reducing data redundancy is called normalization.
Most relational databases which are designed to handle a high number of reads and writes (retrievals
and updates of information) are referred to as OLTP (online transaction processing) systems.
OLTP systems are very efficient for high volume activities such as cashiering, where many items
are being recorded via bar code scanners in a very short period of time. However, using OLTP
databases for analysis is generally not very efficient, because in order to retrieve data from multiple
tables at the same time, a query containing joins must be written. A query is simply a method of
retrieving data from database tables for viewing. Queries are usually written in a language called
SQL (Structured Query Language; pronounced ‘sequel’). Because it is not very useful to only
query pet names or owner names, for example, we must join two or more tables together in order
to retrieve both pets and owners at the same time. Joining requires that the computer match the
Owner_ID column in the Owners table to the Owner_ID column in the Pets table. When tables
contain thousands or even millions of rows of data, this matching process can be very intensive
and time consuming on even the most robust computers.
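For readers curious about what such a join looks like, here is a minimal sketch using Python's built-in sqlite3 module. Only the Owner_ID column and the Owners and Pets table names come from the example above; the other column names (Owner_Name, Pet_ID, Pet_Name) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Owners (
        Owner_ID   INTEGER PRIMARY KEY,
        Owner_Name TEXT
    );
    CREATE TABLE Pets (
        Pet_ID   INTEGER PRIMARY KEY,
        Pet_Name TEXT,
        Owner_ID INTEGER REFERENCES Owners(Owner_ID)
    );
    INSERT INTO Owners VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Pets VALUES (10, 'Rex', 1), (11, 'Tom', 1), (12, 'Fido', 2);
    """
)

# The join matches Owner_ID in Pets against Owner_ID in Owners.
rows = conn.execute(
    """
    SELECT Pets.Pet_Name, Owners.Owner_Name
    FROM Pets
    JOIN Owners ON Pets.Owner_ID = Owners.Owner_ID
    ORDER BY Pets.Pet_ID
    """
).fetchall()
conn.close()

for pet_name, owner_name in rows:
    print(pet_name, "belongs to", owner_name)
```

Notice that each owner's name is stored only once in Owners, even though Alice has two pets. That is the redundancy saving normalization provides, and the join is the price paid for it at query time.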
For much more on database design and management, check out geekgirls.com
(http://www.geekgirls.com/menu_databases.htm).
In order to keep our transactional databases running quickly and smoothly, we may wish to create
a data warehouse. A data warehouse is a type of large database that has been denormalized and
archived. Denormalization is the process of intentionally combining some tables into a single
table in spite of the fact that this may introduce duplicate data in some columns (or in other words,
attributes).
Figure 2-3: A combination of the tables into a single data set.
Figure 2-3 depicts what our simple example data might look like if it were in a data warehouse.
When we design databases in this way, we reduce the number of joins necessary to query related
data, thereby speeding up the process of analyzing our data. Databases designed in this manner are
called OLAP (online analytical processing) systems.
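The following sketch shows denormalization in miniature: two related collections are flattened into a single list of rows so that no join is needed later. It uses plain Python dictionaries rather than a real data warehouse, and every name other than Owner_ID is a hypothetical chosen for the example.

```python
# Normalized source data: each owner is stored exactly once.
owners = {
    1: {"Owner_Name": "Alice"},
    2: {"Owner_Name": "Bob"},
}
pets = [
    {"Pet_ID": 10, "Pet_Name": "Rex", "Owner_ID": 1},
    {"Pet_ID": 11, "Pet_Name": "Tom", "Owner_ID": 1},
    {"Pet_ID": 12, "Pet_Name": "Fido", "Owner_ID": 2},
]

# Denormalize: copy the owner's attributes onto every pet row.
warehouse = [{**pet, **owners[pet["Owner_ID"]]} for pet in pets]

for row in warehouse:
    print(row)
```

Alice's name now appears twice, once for each of her pets. That duplication is the cost of denormalizing, and it is accepted because analysis can then read one table without matching keys.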
Transactional systems and analytical systems have conflicting purposes when it comes to database
speed and performance. For this reason, it is difficult to design a single system which will serve
both purposes. This is why data warehouses generally contain archived data. Archived data are
data that have been copied out of a transactional database. Denormalization typically takes place at
the time data are copied out of the transactional system. It is important to keep in mind that if a
copy of the data is made in the data warehouse, the data may become out-of-synch. This happens
when a copy is made in the data warehouse and then later, a change to the original record
(observation) is made in the source database. Data mining activities performed on out-of-synch
observations may be useless, or worse, misleading. An alternative archiving method would be to
move the data out of the transactional system. This ensures that data won’t get out-of-synch;
however, it also makes the data unavailable should a user of the transactional system need to view
or update it.
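One simple way to picture the out-of-synch problem is to compare the warehouse copy of each observation with its current source record. The sketch below does this with in-memory dictionaries; the record layout and field names are invented for illustration, and a real system would more likely compare timestamps or checksums across two databases.

```python
import copy

# Hypothetical source (transactional) records, keyed by Owner_ID.
source = {
    1: {"Owner_Name": "Alice", "City": "Boston"},
    2: {"Owner_Name": "Bob", "City": "Denver"},
}

# The warehouse takes a copy of the data at archive time...
warehouse = copy.deepcopy(source)

# ...and later the original record changes in the source system.
source[2]["City"] = "Austin"

# Any key whose warehouse copy no longer matches the source is out-of-synch.
out_of_synch = sorted(k for k in warehouse if warehouse[k] != source.get(k))
print(out_of_synch)
```

Observations flagged this way would need to be refreshed from the source before they are mined.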
A data set is a subset of a database or a data warehouse. It is usually denormalized so that only
one table is used. The creation of a data set may contain several steps, including appending or
combining tables from source database tables, or simplifying some data expressions. One example
of this may be changing a date/time format from ‘10-DEC-2002 12:21:56’ to ‘12/10/02’. If this
latter date format is adequate for the type of data mining being performed, it would make sense to
simplify the attribute containing dates and times when we create our data set. Data sets may be
made up of a representative sample of a larger set of data, or they may contain all observations
relevant to a specific group. We will discuss sampling methods and practices in Chapter 3.
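The date simplification mentioned above can be sketched in a few lines of Python using the standard datetime module. The function name simplify_timestamp is an assumption, and the month abbreviation is parsed according to the current locale, so this sketch assumes an English-language locale.

```python
from datetime import datetime


def simplify_timestamp(raw: str) -> str:
    """Convert a value like '10-DEC-2002 12:21:56' to a short month/day/year date."""
    parsed = datetime.strptime(raw, "%d-%b-%Y %H:%M:%S")
    return parsed.strftime("%m/%d/%y")  # the time of day is deliberately dropped


simplified = simplify_timestamp("10-DEC-2002 12:21:56")
print(simplified)  # 12/10/02
```

Dropping the time of day is a one-way simplification, so it only makes sense when the data mining question does not depend on it.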
TYPES OF DATA
Thus far in this text, you’ve read about some fundamental aspects of data which are critical to the
discipline of data mining. But we haven’t spent much time discussing where those data are going to
come from. In essence, there are really two types of data that can be mined: operational and
organizational.
The most elemental type of data, operational data, comes from transactional systems which record
everyday activities. Simple encounters like buying gasoline, making an online purchase, or
checking in for a flight at the airport all result in the creation of operational data. The times,
prices and descriptions of the goods or services we have purchased are all recorded. This
information can be combined in a data warehouse or may be extracted directly into a data set from
the OLTP system.
Oftentimes, transactional data is too detailed to be of much use, or the detail may compromise
individuals’ privacy. In many instances, government, academic or not-for-profit organizations may
create data sets and then make them available to the public. For example, if we wanted to identify
regions of the United States which are historically at high risk for influenza, it would be difficult to
obtain permission and to collect doctor visit records nationwide and compile this information into
a meaningful data set. However, the U.S. Centers for Disease Control and Prevention (CDC) does
exactly that every year. Government agencies do not always make this information immediately
available to the general public, but it often can be requested. Other organizations create such
summary data as well. The grocery store mentioned at the beginning of this chapter wouldn’t
necessarily want to analyze records of individual cans of green beans sold, but they may want to
watch trends for daily, weekly or perhaps monthly totals. Organizational data sets can help to
protect people’s privacy, while still proving useful to data miners watching for trends in a given
population.
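To illustrate how detailed operational data can be rolled up into summary data, the sketch below totals individual sales into daily quantities per item. The transactions, dates and item names are invented for the example.

```python
from collections import defaultdict

# Hypothetical operational data: one entry per checkout scan (date, item, quantity).
transactions = [
    ("2011-06-01", "green beans", 2),
    ("2011-06-01", "green beans", 1),
    ("2011-06-01", "cereal", 1),
    ("2011-06-02", "green beans", 4),
]

# Summarize into one total per (day, item) pair.
daily_totals = defaultdict(int)
for day, item, quantity in transactions:
    daily_totals[(day, item)] += quantity

for (day, item), total in sorted(daily_totals.items()):
    print(day, item, total)
```

The summary no longer records who bought what or in which transaction, which is why data of this kind can help protect privacy while still revealing trends.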
Another type of data often overlooked within organizations is something called a data mart. A
data mart is an organizational data store, similar to a data warehouse, but often created with the
needs of specific business units in mind, such as Marketing or Customer Service, for
reporting and management purposes. Data marts are usually intentionally created by an
organization to be a type of one-stop shop for employees throughout the organization to find data
they might be looking for. Data marts may contain wonderful data, prime for data mining
activities, but they must be known, current, and accurate to be useful. They should also be
well-managed in terms of privacy and security.
All of these types of organizational data carry with them some concern. Because they are
secondary, meaning they have been derived from other more detailed primary data sources, they
may lack adequate documentation, and the rigor with which they were created can be highly
variable. Such data sources may also not be intended for general distribution, and it is always wise
to ensure proper permission is obtained before engaging in data mining activities on any data set.
Remember, simply because a data set may have been acquired from the Internet does not mean it
is in the public domain; and simply because a data set may exist within your organization does not
mean it can be freely mined. Checking with relevant managers, authors and stakeholders is critical
before beginning data mining activities.
A NOTE ABOUT PRIVACY AND SECURITY
In 2003, JetBlue Airlines supplied more than one million passenger records to a U.S. government
contractor, Torch Concepts. Torch subsequently augmented the passenger data with
additional information such as family sizes and social security numbers—information purchased
from a data broker called Acxiom. The data were intended for a data mining project in order to
develop potential terrorist profiles. All of this was done without notification or consent of
passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed
against JetBlue, Torch and Acxiom, and several U.S. senators called for an investigation into the
incident.
This incident serves several valuable purposes for this book. First, we should be aware that as we
gather, organize and analyze data, there are real people behind the figures. These people have
certain rights to privacy and protection against crimes such as identity theft. We as data miners
have an ethical obligation to protect these individuals’ rights. This requires the utmost care in
terms of information security. Simply because a government representative or contractor asks for
data does not mean it should be given.
Beyond technological security however, we must also consider our moral obligation to those
individuals behind the numbers. Recall the grocery store shopping card example given at the
beginning of this chapter. In order to encourage use of frequent shopper cards, grocery stores
frequently list two prices for items, one with use of the card and one without. The answer to the
following question may vary from person to person, but answer it for yourself: At what price mark-up has
the grocery store crossed an ethical line between encouraging consumers to participate in frequent
shopper programs, and forcing them to participate in order to afford to buy groceries? Again, your
answer may differ from others’; however, it is important to keep such moral obligations in mind
when gathering, storing and mining data.
The objectives hoped for through data mining activities should never justify unethical means of
achievement. Data mining can be a powerful tool for customer relationship management,
marketing, operations management, and production; however, in all cases the human element must
be kept sharply in focus. When working long hours at a data mining task, interacting primarily
with hardware, software, and numbers, it can be easy to forget about the people, which is why
they are emphasized here.
CHAPTER SUMMARY
This chapter has introduced you to the discipline of data mining. Data mining brings statistical
and logical methods of analysis to large data sets for the purposes of describing them and using
them to create predictive models. Databases, data warehouses and data sets are all distinct kinds of
digital record keeping systems; however, they do share many similarities. Data mining is generally
most effectively executed on data sets extracted from OLAP, rather than OLTP, systems.
Both operational data and organizational data provide good starting points for data mining
activities; however, both come with their own issues that may inhibit quality data mining activities.
These should be mitigated before beginning to mine the data. Finally, when mining data, it is
critical to remember the human factor behind manipulation of numbers and figures. Data miners
have an ethical responsibility to the individuals whose lives may be affected by the decisions that
are made as a result of data mining activities.
REVIEW QUESTIONS
1) What is data mining in general terms?
2) What is the difference between a database, a data warehouse and a data set?
3) What are some of the limitations of data mining? How can we address those limitations?
4) What is the difference between operational and organizational data? What are the pros and
cons of each?
5) What are some of the ethical issues we face in data mining? How can they be addressed?
6) What is meant by out-of-synch data? How can this situation be remedied?
7) What is normalization? What are some reasons why it is a good thing in OLTP systems,
but not so good in OLAP systems?
EXERCISES
1) Design a relational database with at least three tables. Be sure to create the columns
necessary within each table to relate the tables to one another.
2) Design a data warehouse table with some columns which would usually be normalized.
Explain why it makes sense to denormalize in a data warehouse.
3) Perform an Internet search to find information about data security and privacy. List three
web sites that you found that provided information that could be applied to data mining.
Explain how it might be applied.
4) Find a newspaper, magazine or Internet news article related to information privacy or
security. Summarize the article and explain how it might be related to data mining.
5) Using the Internet, locate a data set which is available for download. Describe the data set
(contents, purpose, size, age, etc.). Classify the data set as operational or organizational.
Summarize any requirements placed on individuals who may wish to use the data set.
6) Obtain a copy of an application for a grocery store shopping card. Summarize the type of
data requested when filling out the application. Give an example of how that data may aid
in a data mining activity. What privacy concerns arise regarding the data being collected?
CHAPTER THREE:
DATA PREPARATION
CONTEXT AND PERSPECTIVE
Jerry is the marketing manager for a small Internet design and advertising firm. Jerry’s boss asks
him to develop a data set containing information about Internet users. The company will use this
data to determine what kinds of people are using the Internet and how the firm may be able to
market their services to this group of users.
To accomplish his assignment, Jerry creates an online survey and places links to the survey on
several popular Web sites. Within two weeks, Jerry has collected enough data to begin analysis, but
he finds that his data needs to be denormalized. He also notes that some observations in the set
are missing values or they appear to contain invalid values. Jerry realizes that some additional work
on the data needs to take place before analysis begins.
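A first pass at the kind of check Jerry needs might look like the sketch below, which flags observations with missing or implausible values. The survey fields (age, hours_online_per_week) and the validity rule are assumptions invented for this example; Jerry's actual survey questions are not specified in the scenario.

```python
# Hypothetical survey observations; None marks a question left unanswered.
survey = [
    {"id": 1, "age": 34, "hours_online_per_week": 10},
    {"id": 2, "age": None, "hours_online_per_week": 5},
    {"id": 3, "age": 27, "hours_online_per_week": 400},
    {"id": 4, "age": 45, "hours_online_per_week": 0},
]


def find_problems(record):
    """Return a list of human-readable problems found in one observation."""
    problems = [f"{field} is missing" for field, value in record.items() if value is None]
    hours = record.get("hours_online_per_week")
    # A week has only 168 hours, so anything outside 0-168 must be invalid.
    if hours is not None and not 0 <= hours <= 168:
        problems.append("hours_online_per_week is outside 0-168")
    return problems


flagged = {}
for record in survey:
    problems = find_problems(record)
    if problems:
        flagged[record["id"]] = problems

print(flagged)
```

Flagging is only the first step. Deciding whether to drop, correct, or fill in such values is the subject of the rest of this chapter.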
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
• Explain the concept and purpose of data scrubbing
• List possible solutions for handling missing data
• Explain the role of, and perform basic methods for, data reduction
• Define and handle inconsistent data
• Discuss the importance and process of attribute reduction
APPLYING THE CRISP DATA MINING MODEL
Recall from Chapter 1 that the CRISP Data Mining methodology requires three phases before any
actual data mining models are constructed. In the Context and Perspective paragraphs above, Jerry
has a number of tasks before him, each of which falls into one of the first three phases of CRISP.
First, Jerry must ensure that he has developed a clear Organizational Understanding. What is
the purpose of this project for his employer? Why is he surveying Internet users? Which data
points are important to collect, which would be nice to have, and which would be irrelevant or
even distracting to the project? Once the data are collected, who will have access to the data set
and through what mechanisms? How will the business ensure privacy is protected? All of these
questions, and perhaps others, should be answered before Jerry even creates the survey mentioned
in the second paragraph above.
Once answered, Jerry can then begin to craft his survey. This is where Data Understanding
enters the process. What database system will he use? What survey software? Will he use a
publicly available tool like SurveyMonkey™, a commercial product, or something homegrown? If
he uses a publicly available tool, how will he access and extract data for mining? Can he trust this
third party to secure his data and if so, why? How will the underlying database be designed? What
mechanisms will be put in place to ensure consistency and integrity in the data? These are all
questions of data understanding. An easy example of ensuring consistency might be if a person’s
home city were to be collected as part of the data. If the online survey just provides an open text
box for entry, respondents could put just about anything as their home city. They might put New
York, NY, N.Y., Nwe York, or any number of other possible combinations, including typos. This
could be avoided by forcing users to select their home city from a dropdown menu, but
considering the number of cities there are in most countries, that list could be unacceptably long! So
the choice of how to handle this potential data consistency problem isn’t necessarily an obvious or
easy one, and this is just one of many data points to be collected. While ‘home state’ or ‘country’
may be reasonable to constrain to a dropdown, ‘city’ may have to be entered freehand into a
textbox, with some sort of data correction process to be applied later.
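The "data correction process to be applied later" could take many forms. The following is a minimal sketch of one possibility, assuming a short list of accepted city names and a small table of abbreviations. `KNOWN_CITIES`, `ALIASES`, and `normalize_city` are hypothetical names introduced here for illustration; they are not part of the survey described above.

```python
import difflib

# Hypothetical list of accepted city names; a real survey would need a far longer one.
KNOWN_CITIES = ["New York", "Los Angeles", "Chicago"]

# Hypothetical abbreviations that respondents might type.
ALIASES = {"ny": "New York", "n.y.": "New York", "nyc": "New York"}


def normalize_city(raw):
    """Map a free-text entry to a known city name, or return None if no match is found."""
    cleaned = " ".join(raw.strip().split())
    key = cleaned.lower()
    if key in ALIASES:
        return ALIASES[key]
    # Case-insensitive exact match against the known list.
    for city in KNOWN_CITIES:
        if city.lower() == key:
            return city
    # Fuzzy match catches simple typos such as transposed letters.
    lowered = [c.lower() for c in KNOWN_CITIES]
    close = difflib.get_close_matches(key, lowered, n=1, cutoff=0.8)
    if close:
        return KNOWN_CITIES[lowered.index(close[0])]
    return None


for entry in ["New York", "NY", "N.Y.", "Nwe York", "Springfield"]:
    print(repr(entry), "->", normalize_city(entry))
```

Entries that cannot be matched come back as `None`, so they can be reviewed by hand instead of being silently assigned to the wrong city.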
The ‘later’ would come once the survey has been developed and deployed, and data have been
collected. With the data in place, the third CRISP-DM phase, Data Preparation, can begin. If
you haven’t installed OpenOffice and RapidMiner yet, and you want to work along with the
examples given in the rest of the book, now would be a good time to go ahead and install these
applications. Remember that both are freely available for download and installation via the
Internet, and the links to both applications are given in Chapter 1. We’ll begin by doing some data
preparation in OpenOffice Base (the database application), OpenOffice Calc (the spreadsheet
application), and then move on to other data preparation tools in RapidMiner. You should
understand that the examples of data preparation in this book are only a subset of possible data
preparation approaches.
COLLATION
Suppose that the database underlying Jerry’s Internet survey is designed as depicted in the
screenshot from OpenOffice Base in Figure 3-1.
Figure 3-1: A simple relational (one-to-one) database for Internet survey data.
This design would enable Jerry to collect data about people in one table, and data about their
Internet behaviors in another. RapidMiner would be able to connect to either of these tables in
order to mine the responses, but what if Jerry were interested in mining data from both tables at
once?
One simple way to collate data in multiple tables into a single location for data mining is to create a
database view. A view is a type of pseudo-table, created by writing a SQL statement which is
named and stored in the database. Figure 3-2 shows the creation of a view in OpenOffice Base,
while Figure 3-3 shows the view in datasheet view.
Figure 3-2: Creation of a view in OpenOffice Base.
Figure 3-3: Results of the view from Figure 3-2 in datasheet view.
The creation of views is one way that data from a relational database can be collated and organized
in preparation for data mining activities. In this example, although the personal information in the
‘Respondents’ table is only stored once in the database, it is displayed for each record in the
‘Responses’ table, creating a data set that is more easily mined because it is both richer in
information and consistent in its formatting.
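For readers who would like to see the idea outside OpenOffice Base, here is a small sketch using Python's built-in sqlite3 module. The table names follow the text, but the column names, the sample rows, and the view name `Survey_View` are assumptions made for illustration; the actual design in Figure 3-1 may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Respondents (
        Respondent_ID INTEGER PRIMARY KEY,
        Gender TEXT,
        Birth_Year INTEGER
    );
    CREATE TABLE Responses (
        Response_ID INTEGER PRIMARY KEY,
        Respondent_ID INTEGER REFERENCES Respondents(Respondent_ID),
        Online_Gaming TEXT
    );
    INSERT INTO Respondents VALUES (1, 'F', 1980), (2, 'M', 1975);
    INSERT INTO Responses VALUES (10, 1, 'Y'), (11, 1, 'N'), (12, 2, NULL);

    -- The view is a named, stored SELECT statement; it holds no data of its own.
    CREATE VIEW Survey_View AS
        SELECT r.Response_ID, p.Respondent_ID, p.Gender, p.Birth_Year, r.Online_Gaming
        FROM Responses AS r
        JOIN Respondents AS p ON p.Respondent_ID = r.Respondent_ID;
""")

# Querying the view returns one collated row per response.
view_rows = conn.execute("SELECT * FROM Survey_View ORDER BY Response_ID").fetchall()
for row in view_rows:
    print(row)
```

In this made-up data, respondent 1 is stored once in Respondents but appears on two rows of the view, one for each of that person's responses.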
DATA SCRUBBING
In spite of our very best efforts to maintain quality and integrity during data collection, it is
inevitable that some anomalies will be introduced into our data at some point. The process of data
scrubbing allows us to handle these anomalies in ways that make sense for us. In the remainder of
this chapter, we will examine data scrubbing in four different ways: handling missing data, reducing
data (observations), handling inconsistent data, and reducing attributes.
HANDS ON EXERCISE
Starting now, and throughout the next chapters of this book, there will be opportunities for you to
put your hands on your computer and follow along. In order to do this, you will need to be sure
to install OpenOffice and RapidMiner, as was discussed in the section A Note about Tools in
Chapter 1. You will also need to have an Internet connection to access this book’s companion
web site, where copies of all data sets used in the chapter exercises are available. The companion
web site is located at:
https://sites.google.com/site/dataminingforthemasses/
Figure 3-4. Data Mining for the Masses companion web site.
You can download the Chapter 3 data set, which is an export of the view created in OpenOffice
Base, from the web site by locating it in the list of files and then clicking the down arrow to the far
right of the file name, as indicated by the black arrows in Figure 3-4. You may want to consider
creating a folder labeled ‘data mining’ or something similar where you can keep copies of your
data—more files will be required and created as we continue through the rest of the book,
especially when we get into building data mining models in RapidMiner. Having a central place to
keep everything together will simplify things, and upon your first launch of the RapidMiner
software, you’ll be prompted to create a repository, so it’s a good idea to have a space ready. Once
you’ve downloaded the Chapter 3 data set, you’re ready to begin learning how to handle and
prepare data for mining in RapidMiner.
PREPARING RAPIDMINER, IMPORTING DATA, AND
HANDLING MISSING DATA
Our first task in data preparation is to handle missing data; however, because this will be our first
time using RapidMiner, the first few steps will involve getting RapidMiner set up. We’ll then move
straight into handling missing data. Missing data are data that do not exist in a data set. As you
can see in Figure 3-5, missing data are not the same as zero or some other value. They are blank, and
their values are unknown.
Missing data are also sometimes known in the database world as null.
Depending on your objective in data mining, you may choose to leave missing data as they are, or
you may wish to replace missing data with some other value.
Figure 3-5: Some missing data within the survey data set.
In this example, our database view has missing data in a
number of its attributes. Black arrows indicate a couple of these attributes in Figure 3-5 above. In
some instances, missing data are not a problem; they are expected. For example, in the Other
Social Network attribute, it is entirely possible that the survey respondent did not indicate that they
use social networking sites other than the ones listed in the survey. Thus, missing data are
probably accurate and acceptable. On the other hand, in the Online Gaming attribute, there are
answers of either ‘Y’ or ‘N’, indicating that the respondent either does or does not participate in
online gaming. But what do the missing, or null, values in this attribute indicate? It is unknown to
us. For the purposes of data mining, there are a number of options available for handling missing
data.
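The difference between a missing value and a zero can be made concrete with a short sketch. It assumes a tiny, made-up CSV extract; the `Years_Online` attribute and the `parse_cell` helper are hypothetical and used only for illustration.

```python
import csv
import io

# Hypothetical extract: the second data row leaves Online_Gaming blank,
# and the third leaves Years_Online blank.
raw_csv = "Years_Online,Online_Gaming\n0,N\n5,\n,Y\n"


def parse_cell(text):
    """Treat an empty field as missing (None) instead of coercing it to 0 or ''."""
    return None if text == "" else text


parsed_rows = [
    {name: parse_cell(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw_csv))
]

for row in parsed_rows:
    print(row)
```

The first row records a real answer of zero, while the blank in the third row stays unknown; converting that blank to 0 would invent an answer the respondent never gave.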
To learn about handling missing data in RapidMiner, follow the steps below to connect to your
data set and begin modifying it:
1) Launch the RapidMiner application. This can be done by double clicking your desktop
icon or by finding it in your application menu. The first time RapidMiner is launched, you
will get the message depicted in Figure 3-6. Click OK to set up a repository.
Figure 3-6. The prompt to create an initial data repository for RapidMiner to use.
2) For most purposes (and for all examples in this book), a local repository will be sufficient.
Click OK to accept the default option as depicted in Figure 3-7.
Figure 3-7. Setting up a local data repository.
3) In the example given in Figure 3-8, we have named our repository ‘RapidMinerBook’, and
pointed it to our data folder, RapidMiner Data, which is found on our E: drive. Use the
folder icon to browse and find the folder or directory you created for storing your
RapidMiner data sets. Then click Finish.
Figure 3-8. Setting the repository name and directory.
4) You may get a notice that updates are available. If this is the case, go ahead and accept the
option to update, where you will be presented with a window similar to Figure 3-9. Take
advantage of the opportunity to add in the Text Mining module (indicated by the black
arrow), since Chapter 12 will deal with Text Mining. Double click the check box to add a
green check mark indicating that you wish to install or update the module, then click
Install.
Figure 3-9. Installing updates and adding the Text Mining module.
5) Once the updates and installations are complete, RapidMiner will open and your window
should look like Figure 3-10:
Figure 3-10. The RapidMiner start screen.
6) Next we will need to start a new data mining project in RapidMiner. To do this we click
on the ‘New’ icon as indicated by the black arrow in Figure 3-10. The resulting window
should look like Figure 3-11.
Figure 3-11. Getting started with a new project in RapidMiner.
7) Within RapidMiner there are two main areas that hold useful tools: Repositories and
Operators. These are accessed by the tabs indicated by the black arrow in Figure 3-11.
The Repositories area is the place where you will connect to each data set you wish to
mine. The Operators area is where all data mining tools are located. These are used to
build models and otherwise manipulate data sets. Click on Repositories. You will find that
the initial repository we created upon our first launch of the RapidMiner software is
present in the list.
Figure 3-12. Adding a data set to a repository in RapidMiner.
8) Because the focus of this book is to introduce data mining to the broadest possible
audience, we will not use all of the tools available in RapidMiner. At this point, we could
do a number of complicated and technical things, such as connecting to a remote
enterprise database. This, however, would likely be overwhelming and inaccessible to many
readers. For the purposes of this text, we will therefore only be connecting to comma-separated
values (CSV) files. You should know that most data mining projects incorporate extremely large
data sets encompassing dozens of attributes and thousands or even millions of observations. We
will use smaller data sets in this text, but the foundational concepts illustrated are the same for
large or small data. The Chapter 3 data set downloaded from the companion web site is very
small, comprising only 15 attributes
and 11 observations. Our next step is to connect to this data set. Click on the Import
icon, which is the second icon from the left in the Repositories area, as indicated by the
black arrow in Figure 3-12.
Figure 3-13. Importing a CSV file.
9) You will see by the black arrow in Figure 3-13 that you can import from a number of
different data sources.
Note that by importing, you are bringing your data into a
RapidMiner file, rather than working with data that are already stored elsewhere. If your
data set is extremely large, it may take some time to import the data, and you should be
mindful of disk space that is available to you. As data sets grow, you may be better off
using the first (leftmost) icon to set up a remote repository in order to work with data
already stored in other areas. As previously explained, all examples in this text will be
conducted by importing CSV files that are small enough to work with quickly and easily.
Click on the Import CSV File option.
Figure 3-14. Locating the data set to import.
10) When the data import wizard opens, navigate to the folder where your data set is stored
and select the file. In this example, only one file is visible: the Chapter 3 data set
downloaded from the companion web site. Click Next.
Figure 3-15. Configuring attribute separation.
11) By default, RapidMiner looks for semicolons as attribute separators in our data. We must
change the column separation delimiter to be Comma, in order to be able to see each
attribute separated correctly. Note: If your data naturally contain commas, then you
should be careful as you are collecting or collating your data to use a delimiter that does
not naturally occur in the data. A semicolon or a pipe (|) symbol can often help you avoid
unintended column separation.
Figure 3-16. A preview of attributes separated into columns
with the Comma option selected.
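The note about delimiters in step 11 can be illustrated with Python's built-in csv module. This is a sketch on an invented record, not on the Chapter 3 data set; the name and values are assumptions chosen so that one field contains a comma.

```python
import csv
import io

# Hypothetical record whose first field contains a comma.
comma_line = "Smith, Jane,F,Y\n"
pipe_line = "Smith, Jane|F|Y\n"

# Splitting on commas breaks the name into two columns.
comma_fields = next(csv.reader(io.StringIO(comma_line), delimiter=","))

# A pipe delimiter that never occurs in the data keeps the name intact.
pipe_fields = next(csv.reader(io.StringIO(pipe_line), delimiter="|"))

print(len(comma_fields), comma_fields)
print(len(pipe_fields), pipe_fields)
```

With the comma delimiter the record yields four fields instead of the intended three, which is the unintended column separation the note warns about.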
12) Once the preview shows columns for each attribute, click Next. Note that RapidMiner has
treated our attribute names as if they are our first row of data, or in other words, our first
observation. To fix this, click the Annotation dropdown box next to this row and set it to
Name, as indicated in Figure 3-17. With the attribute names designated correctly, click
Next.
Figure 3-17. Setting the attribute names.
13) In step 4 of the data import wizard, RapidMiner will take its best guess at a data type for
each attribute. The data type is the kind of data an attribute holds, such as numeric, text or
date. These can be changed in this screen, but for our purposes in Chapter 3, we will
accept the defaults. Just below each attribute’s data type, RapidMiner also indicates a Role
for each attribute to play. By default, all columns are imported simply with the role of
‘attribute’; however, we can change these here if we know that one attribute is going to play
a specific role in a data mining model that we will create. Since roles can be set within
RapidMiner’s main process window when building data mining models, we will accept the
default of ‘attribute’ whenever we import data sets in exercises in this text. Also, you may
note that the check boxes above each attribute in this window allow you to not import
some of the attributes if you don’t want to. This is accomplished by simply clearing the
checkbox. Again, attributes can be excluded from models later, so for the purposes of this
text, we will always include all attributes when importing data. All of these functions are
indicated by the black arrows in Figure 3-18. Go ahead and accept these defaults as they
stand and click Next.
Figure 3-18. Setting data types, roles and import attributes.
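To give a feel for what "taking a best guess at a data type" can involve, here is a simple sketch. It is not RapidMiner's actual logic, which the text does not describe; `guess_type`, its four type labels, and the assumed date layout are all illustrative choices.

```python
from datetime import datetime


def guess_type(values):
    """Return 'integer', 'real', 'date' or 'text' for a column of strings; blanks are ignored."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "text"

    def all_parse(parser):
        # True only if every non-blank value can be parsed without error.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    # Assumed date layout for this sketch: YYYY-MM-DD.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


print(guess_type(["1980", "1975", ""]))
print(guess_type(["Y", "N", ""]))
```

A guess like this depends entirely on the values it happens to see, which is one reason an import screen lets you review and change each attribute's type.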
14) The final step is to choose a repository to store the data set in, and to give the data set a
name within RapidMiner. In Figure 3-19, we have chosen to store the data set in the
RapidMinerBook repository, and given it the name Chapter3. Once we click Finish, this
data set will become available to us for any type of data mining process we would like to
build upon it.
Figure 3-19. Selecting the repository and setting a data set name
for our imported CSV file.
15) We can now see that the data set is available for use in RapidMiner. To begin using it in a
RapidMiner data mining process, simply drag the data set and drop it in the Main Process
window, as has been done in Figure 3-20.
Figure 3-20. Adding a data set to a process in RapidMiner.
16) Each rectangle in a process in RapidMiner is an operator. The Retrieve operator simply
gets a data set and makes it available for use. The small half-circles on the sides of the
operator, and of the Main Process window, are called ports. In Figure 3-20, an output (out)
port from our data set’s Retrieve operator is connected to a result set (res) port via a spline.
The splines, combined with the operators connected by them, constitute a data mining
stream. To run a data mining stream and see the results, click the blue, triangular Play
button in the toolbar at the top of the RapidMiner window. This will change your view
from Design Perspective, which is the view pictured in Figure 3-20 where you can
change your data mining stream, to Results Perspective, which shows your stream’s
results, as pictured in Figure 3-21. When you hit the Play button, you may be prompted to
save your process, and you are encouraged to do so. RapidMiner may also ask you if you
wish to overwrite a saved process each time it is run, and you can select your preference on
this prompt as well.
Figure 3-21. Results perspective for the Chapter3 data set.
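As a loose analogy only, a stream of operators joined by splines behaves like a chain of functions, each passing its output to the next. The sketch below is not how RapidMiner is implemented; `retrieve`, `passthrough`, and `run_stream` are hypothetical stand-ins, and the example set is made up.

```python
def retrieve():
    """Stand-in for a Retrieve operator: returns a small, made-up example set."""
    return [{"Online_Gaming": "Y"}, {"Online_Gaming": None}, {"Online_Gaming": "N"}]


def passthrough(example_set):
    """Stand-in for any operator that receives an example set and hands one on."""
    return list(example_set)


def run_stream(source, operators):
    """Feed the source's output through each operator in turn and return the result."""
    data = source()
    for operator in operators:
        data = operator(data)
    return data


result = run_stream(retrieve, [passthrough])
print(len(result), "examples delivered to the result port")
```

Adding another operator to the stream corresponds to adding another function to the list passed to `run_stream`.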
17) You can toggle between design and results perspectives using the two icons indicated by
the black arrows in Figure 3-21. As you can see, there is a rich set of information in results
perspective. In the meta data view, basic descriptive statistics are given. It is here that we
can also get a sense for the number of observations that have missing values in each
attribute of the data set. The columns in meta data view can be stretched to make their
contents more readable. This is accomplished by hovering your mouse over the faint
vertical gray bars between each column, then clicking and dragging to make them wider.
The information presented here can be very helpful in deciding where missing data are
located, and what to do about it. Take for example the Online_Gaming attribute. The
results perspective shows us that we have six ‘N’ responses in that attribute, two ‘Y’
responses, and three missing. We could use the mode, or most common response to
replace the missing values. This of course assumes that the most common response is
accurate for all observations, and that assumption may not hold. As data miners, we must be
responsible for thinking about each change we make in our data, and whether or not we
threaten the integrity of our data by making that change.
In some instances the
consequences could be drastic. Consider, for instance, if the mode for an attribute of
Felony_Conviction were ‘Y’. Would we really want to convert all missing values in this
attribute to ‘Y’ simply because that is the mode in our data set? Probably not; the
implications about the persons represented in each observation of our data set would be
unfair and misrepresentative. Thus, we will change the missing values in the current
example to illustrate how to handle missing values in RapidMiner, recognizing that what we
are about to do won’t always be the right way to handle missing data. In order to have
RapidMiner handle the change from missing to ‘N’ for the three observations in our
Online_Gaming variable, click the design perspective icon.
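The counts and the mode discussed in this step can be reproduced in a few lines. The sketch below uses a list built to match the figures reported above (six ‘N’, two ‘Y’, three missing); the order of the values is invented, and `None` stands in for a missing value.

```python
from collections import Counter

# Online_Gaming values matching the counts reported above:
# six 'N', two 'Y', and three missing (None). The row order is invented.
online_gaming = ["N", "Y", None, "N", "N", None, "Y", "N", "N", None, "N"]

present = [v for v in online_gaming if v is not None]
missing_count = len(online_gaming) - len(present)
counts = Counter(present)

# The mode is the most common non-missing response.
mode_value, mode_count = counts.most_common(1)[0]

print("counts:", dict(counts))
print("missing:", missing_count)
print("mode:", mode_value)
```

Seeing the mode next to the number of missing values makes it easier to judge how much of the attribute would be filled in by assumption.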
Figure 3-22. Finding an operator to handle missing values.
18) In order to find a tool in the Operators area, you can navigate through the folder tree in
the lower left hand corner. RapidMiner offers many tools, and sometimes, finding the one
you want can be tricky. There is a handy search box, indicated by the black arrow in Figure
3-22, that allows you to type in key words to find tools that might do what you need. Type
the word ‘missing’ into this box, and you will see that RapidMiner automatically searches
for tools with this word in their name. We want to replace missing values, and we can see
that within the Data Transformation tool area, inside a sub-area called Value Modification,
there is an operator called Replace Missing Values. Let’s add this operator to our stream.
Click and hold on the operator name, and drag it up to your spline. When you point your
mouse cursor on the spline, the spline will turn slightly bold, indicating that when you let
go of your mouse button, the operator will be connected into the stream. If you let go and
the Replace Missing Values operator fails to connect into your stream, you can reconfigure
your splines manually. Simply click on the out port in your Retrieve operator, and then
click on the exa port on the Replace Missing Values operator. Exa stands for example set,
and remember that ‘examples’ is the word RapidMiner uses for observations in a data set.
Be sure the exa port from the Replace Missing Values operator is connected to your result
set (res) port so that when you run your process, you will have output. Your model should
now look similar to Figure 3-23.
Figure 3-23. Adding a missing value operator to the stream.
19) When an operator is selected in RapidMiner, it has an orange rectangle around it. This will
also enable you to modify that operator’s parameters, or properties. The Parameters pane
is located on the right side of the RapidMiner window, as indicated by the black arrow in
Figure 3-23. For this exercise, we have decided to change all missing values in the
Online_Gaming attribute to be ‘N’, since this is the most common response in that
attribute. To do this, change the ‘attribute filter type’ to ‘single’, and you will see that a
dropdown box appears, allowing you to choose the Online_Gaming attribute as the target
for modification. Next, expand the ‘default’ dropdown box, and select ‘value’, which will
cause a ‘replenishment value’ box to appear. Type the replacement value ‘N’ in this box.
Note that you may need to expand your RapidMiner window, or use the vertical scroll bar
on the left of the Parameters pane in order to see all options, as the options change based
on what you have selected. When you are finished, your parameters should look like the
ones in Figure 3-24. Parameter settings that were changed are highlighted with black
arrows.
Figure 3-24. Missing value parameters.
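The effect of the parameters chosen in step 19 (a single attribute, filled with a fixed value) can be sketched in plain Python. This is an illustration of the idea, not RapidMiner's operator; `replace_missing` is a hypothetical helper and the three examples are made up.

```python
def replace_missing(examples, attribute, replacement):
    """Return copies of the examples with None in one attribute replaced by a fixed value."""
    filled = []
    for example in examples:
        updated = dict(example)
        if updated.get(attribute) is None:
            updated[attribute] = replacement
        filled.append(updated)
    return filled


# Made-up examples; only Online_Gaming is targeted, so Online_Shopping keeps its gap.
examples = [
    {"Online_Gaming": "Y", "Online_Shopping": None},
    {"Online_Gaming": None, "Online_Shopping": "Y"},
    {"Online_Gaming": "N", "Online_Shopping": "N"},
]

filled = replace_missing(examples, "Online_Gaming", "N")
print(filled)
```

Because the function works on copies, the original examples are left untouched, which makes it easy to compare the data before and after the change.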
20) You should understand that there are many other options available to you in the
parameters pane. We will not explore all of them here, but feel free to experiment with
them. For example, instead of changing a single attribute at a time, you could change a
subset of the attributes in your data set. You will learn much about the flexibility and
power of RapidMiner by trying out different tools and features. When you have your
parameters set, click the play button. This will run your process and switch you to results
perspective once again. Your results should look like Figure 3-25.
Figure 3-25. Results of changing missing data.
21) You can see now that the Online_Gaming attribute has been moved to the top of our list,
and that there are zero missing values. Click on the Data View radio button, above and to
the left hand side of the attribute list to see your data in a spreadsheet-type view. You will
see that the Online_Gaming variable is now populated with only ‘Y’ and ‘N’ values. We
have successfully replaced all missing values in that attribute. While in Data View, take
note of how missing values are annotated in other variables, Online_Shopping for example.
A question mark (?) denotes a missing value in an observation. Suppose that for this
variable, we do not wish to replace the null values with the mode, but rather, that we wish
to remove those observations from our data set prior to mining it. This …