PrepareDataforListings123 π
Use the information in the email you received from your manager, Stevie, along with the full Listings123 data (DataExport_2020). You will be preparing your data and applying data cleaning best practices to the DataExport_2020data. Make sure you fill out the Prompts sheet when completing each step.
This project’s email thread is located below, in case you need to reference it.
You will prepare your data for analysis by applying the following steps:
- Explore the Data
- Summarize Data Using a Pivot Table
- Clean Inconsistent, Invalid, Duplicate, and Missing Data
- Clean Text Data I (Search and Count)
- Clean Text Data II (Split Text to Columns)
- Merge Tables
- Summarize Data Wrangling
IMPORTANTBefore submitting your project, make sure:
β―β― All the column names AND sheet names are named exactly as stated in the instructions. Do not change any existing names in the spreadsheet unless instructed.
β―β― The indicated columns are organized in the correct sheet.
β―β― True and False values are stated as TRUE and FALSE, not 1/0, T/F, etc.
Otherwise, your project will be graded incorrectly.
TIPIf you’d like more of a challenging experience or less guidance when completing this project do not open the
- HINT
drop downs until after you’ve attempted the project step.
STEP 1: Explore the Data
Before jumping into wrangling your data, it’s important to explore the data and address the requests mentioned in the email.
INSTRUCTIONS
Go to the Step 1 in the Promptssheet in your spreadsheet and answer Questions 1-3. You will need to reference the DataExport_2020sheet to answer these questions.Before answering Question 4 in the Promptssheets, calculate the Number of Reviews Per Day column in the DataExport_2020 sheet.HINT
- Then proceed to answer Step 1, Question 4 in the Prompts sheet
- NOTEAll of your work for Step 1 should be in the Prompts (Step 1) and DataExport_2020 sheets. STEP 2: Summarize Data Using A Pivot Table Listings123 is hoping to find specific traits within the listings data that might indicate the success of a host. Listings123 believes it might be different for each city. Summarize the data by city using a Pivot Table. INSTRUCTIONSCreate a Pivot Table in the Summary By City sheet using the data from the DataExport_2020 sheet. The pivot table should contain the following fields summarized by City.COUNT of Host ID(Rename this field as “COUNT of Listings”)
- COUNTUNIQUE of Host ID
- AVERAGE of Bathrooms
- AVERAGE of Bedrooms
- AVERAGE of Beds
- SUM of Number of Reviews
- AVERAGE of Days Listed
- AVERAGE of Review Scores Location
- AVERAGE of Review Scores Value
HINTCreate a new field in column L of the Summary By City sheet that finds the percentage of listings with existing hosts. Name this column “% of Listings with Repeat Hosts”HINTFormat your pivot table to make it legible. Apply customization best practices that include your new column.HINTGo to the Step 2 in the Prompts sheet and answer Questions 1-3. You will need to reference the Summary By City sheet to answer these questions.
NOTEAll of your work for Step 2 should be in the Prompts (Step 2) and Summary By City sheets.
STEP 3: Clean Inconsistent, Invalid, Duplicate, and Missing Data
Mid-way through your project, you’re messaging David (Database Administrator) directly, for more information about the quality of the data, and you receive the following Slack message: Analysts do find bad listings in the data. We once found a listing with 852 bedrooms! I have been removing these listings from the data, and the engineering team has worked to implement validation checks to prevent this type of dirty data.
I’d recommend you clean DataExport_2020 before continuing your project.
As an analyst, you should always check your data before starting your analysis whether you are cleaning your data for dirty data or to prepare it to fit the analysis.
For this step you’ll be working in a copy of DataExport_2020 sheet called Copy of DataExport_2020.
IMPORTANTYou’ll notice there are new blank columns (J:Q) in theCopy of DataExport_2020sheet. Do not remove them as part of your data cleaning process. Feel free to ignore them for now. You will come back to them later in the project.
INSTRUCTIONS
- Go to the Copy of DataExport_2020 sheet in your spreadsheet and examine the data.Use the following checklist to start your data cleaning process:
HINT
- Go through the Copy of DataExport_2020 sheet and clean the inconsistent data types.
HINTClean the invalid values in the Copy of DataExport_2020 sheet. Ignore columns I and X.HINTClean the duplicate rows in the Copy of DataExport_2020 sheet. HINTClean the missing values in the Copy of DataExport_2020 sheet. Ignore columns J:Q.Based on information you gathered from David you are able to replace missing values, such as:β―β― If the Room Type is a “Private room” or “Shared room” replace the blank Bathrooms with 1, Beds with 1, or Bedrooms with 1.β―β― There are ~17 empty values in the Amenities column. If this is less than 10% of the data delete the rows with empty amenity values. Otherwise, replace accordingly.
- You want to get rid of the #DIV/0! error message in the Number of Reviews Per Day column. (You originally calculated this in DataExport_2020.) You know this error is happening because there are values being divided by 0. To remove the error, you want to adjust your formula to account for specific conditions.Go to the Number of Reviews Per Day (Clean) column in the Copy of DataExport_2020 sheet and adjust your formula to divide the Number of Reviews by the Days Listed, if the value for the days listed is greater than 0, otherwise, divide the Number of Reviews by 1.
HINT
- Go to the Step 3 in the Prompts sheet and answer Questions 1. You will need to reference the Copy of DataExport_2020 sheet to answer these questions.
TIPIn Step 7, Question 1 in your Prompts sheet you will have to document all the data cleaning steps you have completed in your analysis. It’s recommended you start noting them down now.
NOTEAll of your work for Step 3 should be in the Prompts (Step 3) and Copy of DataExport_2020 sheets.
STEP 4 – Clean Text Data I (Search and Count)
You know that amenities are a key factor for renters choosing between listings. This is a column you want to analyze in the future, but notice that right now it is not in a clean format.
Continue using the following checklist in your data cleaning process:
HINTINSTRUCTIONS
- When pulling the data, there were some amenities that didn’t transfer correctly. Specifically, there are two non-descriptive amenities in the amenities columns. Find and replace the following amenities with “Other” in the Copy of DataExport_2020 sheet:β―β― “translation missing: en.hosting_amenity_49″β―β―”translation missing: en.hosting_amenity_50″HINT
- In the same sheet indicate whether each listing has the amenities listed below in its assigned column. Return True or False.
- Internet (column J)
Kitchen (column K)24-Hour Check-In (column L)Free Parking (column M)Private Entrance (column N)Smoking Allowed (column O)Other (column P)HINTGo to the Step 4 in the Prompts sheet and answer Question 1. You will need to reference the Copy of DataExport_2020 sheet to answer this question.HINT
TIPIn Step 7, Question 1 in your Prompts sheet you will have to document all the data cleaning steps you have completed in your analysis. It’s recommended you continue noting them down now.
NOTEAll of your work for Step 4 should be in the Prompts (Step 4), Copy of DataExport_2020, and Scratch Work sheets.
STEP 5 – Clean Text Data II (Split Text to Columns)
Continue using the following checklist in your data cleaning process:INSTRUCTIONS
- Copy the Amenities column from the Copy of DataExport_2020 sheet and paste it in column A of the Scratch Work sheet.
- Split the text within the Amenities column into separate columns in the Scratch Work sheet.
- Calculate the Number of Amenities column in the Copy of DataExport_2020 sheet.HINT
- Go to the Step 5 in the Prompts sheet and answer Questions 1-2. You will need to reference the Copy of DataExport_2020 sheet to answer these questions. HINT
- TIPIn Step 7, Question 1 in your Prompts sheet you will have to document all the data cleaning steps you have completed in your analysis. It’s recommended you continue noting them down now. NOTEAll of your work for Step 5 should be in the Prompts (Step 5), Copy of DataExport_2020, and Scratch Work sheets. STEP 6: Merge Tables While preparing your data your manager, Stevie sends another email message:STEP 6: Merge Tables While preparing your data your manager, Stevie sends another email message:
- Hi, great to meet you! I have attached the data below. Let me know if you have any questions. π Click here to access Alexis’ data INSTRUCTIONSManually copy the data into your current spreadsheet in the sheet called Restaurants Hotels. Update the formatting if necessary.Fill in the missing values in the Number of full-service restaurants column in the Restaurants Hotels sheet by manually looking up the data in city-data.com.HINTIn the Merged City Data sheet paste the values from the Summary By City sheet (excluding the Grand Total row). Merge the Number of full-service restaurants and Fire-safe hotels and motels data with the data in the Restaurants Hotels sheet.HINT
- You’ll notice there is no data for New Orleans, LA. Alexis has been using a computer program to automatically pull data from city-data.com, but due to an error, the program is not able to collect data for New Orleans, Look up Number of full-service restaurants and Fire-safe hotels and motels data from city-data.com for New Orleans, LA, and manually enter it into the Merged City Data sheet.
Go to the Step 6 in the Prompts sheet and answer Questions 1-2. You will need to reference the Merged City Data sheet to answer these questions.TIPIn Step 7, Question 1 in your Prompts sheet you will have to document all the data cleaning steps you have completed in your analysis. It’s recommended you continue noting them down now.
NOTEAll of your work for Step 6 should be in the Prompts (Step 6), Restaurants Hotels, and Merged City Data sheets.
STEP 7: Summarize Data Wrangling
Suppose a teammate on the analytics team is onboarding on the project and asked about your data wrangling steps thus far. The goal is for your teammate to have a smooth onboarding process given the clear organization and in-depth detail of your response.
INSTRUCTIONS
- Go to the Step 7 in the Prompts sheet and answer Question 1. Use numbered steps to document your wrangling process.
NOTEAll of your work for Step 7 should be in the Prompts (Step 7) sheet.
That’s a wrap! π―
Hooray!!! π You’ve completed Part 2 of the final project. That was a big lift, but very important work you will complete as an analyst.