California State University Cleaning and Profiling Code Worksheet

Cleaning and Profiling Code


Use only Hadoop MapReduce in this part of your project.

Do not use anything else.

You must write and submit 2 separate MapReduce jobs:

MR Job 1.


Data profiling – to explore your data

– Name the files: CountRecs.java, CountRecsMapper.java, CountRecsReducer.java

(Please use these exact names for your classes)

– This MR job counts the number of records in a dataset

– Run it on the original dataset, before cleaning, and output the number of records

– Run it on the cleaned dataset (result of MR Job 2, described below) and output the number of records

– If the record counts do not match, figure out why

– Re-submit a schema if it has changed.
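The counting logic itself is small; most of the three required files is Hadoop boilerplate (the `Job` setup, the `Mapper`/`Reducer` subclasses, the `Writable` types). As a sketch with the Hadoop plumbing stripped away, the core of a record-counting job might look like the following — the class and method names here are illustrative, not part of the assignment:

```java
import java.util.List;

// Illustrative sketch only: in the real job, CountRecsMapper emits
// ("records", 1) for each input record, and CountRecsReducer sums the 1s.
class CountRecsLogic {

    // Mapper side: one partial count per input split (skip empty lines).
    static long countRecords(List<String> lines) {
        return lines.stream().filter(l -> !l.isEmpty()).count();
    }

    // Reducer side: sum the partial counts from all mappers.
    static long sum(long... partialCounts) {
        long total = 0;
        for (long c : partialCounts) total += c;
        return total;
    }
}
```

Running this before and after cleaning gives the two counts to compare; a mismatch usually points at records your cleaning job dropped or malformed lines it split differently.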

MR Job 2.

Data cleaning – to avoid nasty exceptions later on in your analytic

– Name the files: Clean.java, CleanMapper.java, CleanReducer.java

(Please use these exact names for your classes)

– This MR job cleans the data – for example, by dropping columns you don’t need.

– It should write out a new file with only the columns you will use in your analytic.

– Keep only the columns selected for your data schema
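The heart of the cleaning job is column projection. A hedged sketch of what a `CleanMapper`'s per-record logic might do, assuming a comma-delimited input (the class and method names are illustrative):

```java
// Illustrative sketch: the map() method of a real CleanMapper might
// project each record down to the selected columns like this.
class CleanLogic {

    // keep: zero-based indices of the columns in your schema.
    static String project(String line, int... keep) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keep.length; i++) {
            if (i > 0) out.append(',');
            out.append(fields[keep[i]]);
        }
        return out.toString();
    }
}
```

With a real quoted-CSV dataset you would use a proper CSV parser instead of `split`, since commas can appear inside quoted fields.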

For full credit, provide the classes for each job

Data Profiling
Data Cleaning
Data Profiling
Data profiling helps you discover, understand and organize your data.
Data profiling covers the basics, verifying that the information in
your tables matches its description.
For example, a state column might use a combination of both two-letter codes
and the fully spelled out (sometimes incorrectly) name of the state. Data
profiling would uncover this inconsistency and inform the creation of a
standardization rule that could make them all consistent, two-letter codes.
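The state-column example above can be turned into a standardization rule. A minimal sketch, assuming a lookup table of known spellings (the table below is a tiny sample, not a complete list):

```java
import java.util.Map;

// Illustrative standardization rule: map assorted state spellings to
// two-letter codes. The table here is a small sample for demonstration.
class StateNormalizer {
    static final Map<String, String> STATES = Map.of(
            "california", "CA", "ca", "CA",
            "new york", "NY", "ny", "NY");

    static String normalize(String raw) {
        String key = raw.trim().toLowerCase();
        // Fall back to uppercasing the input when no rule matches,
        // so unrecognized values are at least consistently cased.
        return STATES.getOrDefault(key, raw.trim().toUpperCase());
    }
}
```

Profiling tells you which variant spellings actually occur, so the lookup table only needs entries for the values observed in the data.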
Sometimes the data profiling process reveals that your dataset is unusable.
Why profile data?
Data profiling allows you to answer the following questions about your data:

Is the data complete? Are there blank or null values?

Is the data unique? How many distinct values are there? Is the data duplicated?

Are there anomalous patterns in your data? What is the distribution of patterns in your data?

Are these the patterns you expect?

What range of values exist, and are they expected? What are the maximum, minimum, and average
values for given data? Are these the ranges you expect?
Structure discovery, also known as structure analysis, validates that the data that you have is
consistent and formatted correctly. There are several different processes that you can use for this,
such as pattern matching.
For example, if you have a data set of phone numbers, pattern matching helps you find the
valid sets of formats within the data set.
Pattern matching also helps you understand whether a field is text- or number-based along with
other format-specific information.
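One simple way to implement this kind of pattern matching is to collapse every value to a "shape" string and then count the distinct shapes. A sketch of that idea (the class name and the 9/A convention are illustrative):

```java
// Illustrative pattern-matching helper: collapse each value to a shape
// string (digits -> 9, letters -> A) so distinct formats can be counted.
class PatternProfiler {
    static String shape(String value) {
        StringBuilder p = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) p.append('9');
            else if (Character.isLetter(c)) p.append('A');
            else p.append(c); // punctuation is kept as-is
        }
        return p.toString();
    }
}
```

Feeding a phone-number column through `shape` and grouping by the result immediately shows how many formats the column mixes, e.g. `(999) 999-9999` versus `999-999-9999`.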
Structure discovery also examines simple basic statistics in the data. By using statistics like the
minimum and maximum values, means, medians, modes and standard deviations, you can gain
insight into the validity of the data.
Content discovery is the process of looking more closely into the individual elements of the database
to check data quality. This can help you find areas that contain null values or values that are incorrect
or ambiguous.
Many data management tasks start with an accounting for all the inconsistent and ambiguous entries
in your data sets.
For example, finding and correcting your data to fit street addresses into the correct format is an
essential part of this step. The potential problems that could arise from non-standard data, like being
unable to reach customers via mail because the data set includes incorrectly formatted addresses, are
costly and can be addressed early in the data management process.
Data Profiling – Statistics Gathering
Attribute-level (column-level) profiling
• All data types
• Null count / null percentage – number and/or percentage of records with a null
value
• Mode – most frequent value
• Pattern count – number of distinct patterns observed (mm/dd/yyyy or
999999.99, for example)
• Datatype observed always (or almost always) in the column
• Length of data in the column
• Uniqueness
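Two of the statistics above, null percentage and mode, can be sketched in a few lines each. This is a plain-Java illustration of the computation, not the assignment's MapReduce form (in a real profiling job, the mapper would emit per-value counts and the reducer would aggregate them):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative column-level statistics: null percentage and mode.
class AttributeStats {

    static double nullPercentage(List<String> column) {
        long nulls = column.stream()
                .filter(v -> v == null || v.isBlank()).count();
        return 100.0 * nulls / column.size();
    }

    // Most frequent value in the column.
    static String mode(List<String> column) {
        Map<String, Long> freq = new HashMap<>();
        for (String v : column) freq.merge(v, 1L, Long::sum);
        return Collections.max(freq.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```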
• Numeric Data Types
• Mean
• Median
• Precision
• Standard Deviation
For fields with non-unique data, the frequency distribution
(group-by) can yield very interesting results
• Can be compared with allowed values
• Frequent and infrequent values should be studied
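The numeric statistics listed above are straightforward to compute; a minimal sketch follows (this uses the population standard deviation — divide by n-1 instead if you want the sample version):

```java
import java.util.Arrays;

// Illustrative numeric profiling statistics.
class NumericStats {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Population standard deviation (divides by n).
    static double stdDev(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / xs.length);
    }
}
```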
Data Cleaning
Finding incorrect records in a dataset and removing or replacing them with
clean data.
The complete data cleaning process can be broken down into two broad
data cleaning steps:
1. Identify and fill in missing values.
2. Correct existing data.
Remove entries that have letters or non-numeric values where there should be only
numbers (such as zip codes and phone numbers) and entries with invalid characters (like
@ or ‘ symbols in names or physical addresses).
Fix missing values
Fix unwanted values that do not fit in the dataset
Be mindful of outliers; in some cases, when they are suspicious and do not make sense,
they should be removed.
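The validation rules above translate naturally into a few regular expressions. A hedged sketch (the rules are US-centric examples; your dataset's formats may differ, and the class and method names are illustrative):

```java
// Illustrative cleaning rules: validate numeric-only fields and strip
// characters that should not appear in names or addresses.
class CleanRules {

    // 5-digit ZIP, optionally with a 4-digit extension.
    static boolean isValidZip(String v) {
        return v.matches("\\d{5}(-\\d{4})?");
    }

    // Accept exactly ten digits after removing common separators.
    static boolean isValidPhone(String v) {
        return v.replaceAll("[\\s().-]", "").matches("\\d{10}");
    }

    // Strip characters like @ and ' that should not appear in names.
    static String stripInvalid(String v) {
        return v.replaceAll("[@']", "").trim();
    }
}
```

A record failing one of these checks is either dropped or repaired, depending on whether the rest of the record is still usable for your analytic.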
Datasets – for team organization
All datasets should be “brought together”
Once you have your data profiled and cleaned you work together to produce your
analytic
You might want to bring in another dataset
MapReduce *can* be used to bring your datasets together, but this is not required for
the project
