Tut #2 - Data Cleaning
By the end of this tutorial, you will be able to:
Describe what data cleaning is and explain why it’s so important for data analysis
Clean a dataset using a variety of methods, preparing it for analysis in the next tutorial
Here’s how the tutorial will be structured:
What is data cleaning and why does it matter?
What are some key steps in the data cleaning process?
Practical exercise: Cleaning your dataset
Key takeaways and further reading
1. What is data cleaning and why is it so important?
As a data analyst, you’ll receive data from a variety of sources. This data will come in all different formats and, more often than not, it will comprise what’s known as “dirty” data. In other words, it won’t be ready for analysis straight off the bat—you’ll need to clean it first.
What is dirty data?
Incomplete data—for example, a spreadsheet with missing values that would be relevant for your analysis. If you’re looking at the relationship between customer age and number of monthly purchases, you’ll need data for both of these variables. If some customer ages are missing, you’re dealing with incomplete data.
Duplicate data—for example, records that appear twice (or multiple times) throughout the same dataset. This can occur if you’re combining data from multiple sources or databases.
Inconsistent or inaccurate data—data that is outdated or contains structural errors such as typos, inconsistent capitalization, and irregular naming conventions. Say you have a dataset containing student test scores, with some categorized as “Pass” or “Fail” and others categorized as “P” or “F.” Both labels mean the same thing, but the naming convention is inconsistent, leaving the data rather messy.
Garbage in, garbage out: The importance of data cleaning
Have you heard of the saying “Garbage in, garbage out”—otherwise known as GIGO? That’s why data experts spend a good 60% of their time on data cleaning.
Working with dirty data is not only bad practice; it can be extremely costly in the long run.
2. What are some key steps in the data cleaning process?
Delete unnecessary columns. Chances are, your dataset will contain some values that aren’t relevant to your analysis. For example, in an analysis of students’ test scores compared to hours spent studying, things like student ID number and date of birth aren’t relevant. You could simply delete the columns containing this data.
Identify and remove duplicates. Duplicate data tends to occur during the data collection phase, so it’s important to filter them out. You’ll be removing duplicates from your dataset as part of today’s practical exercise, so we’ll show you how to do that shortly!
Deal with missing data. In the case of missing data, you can either delete the entire entry associated with it (i.e. delete the whole row which contains the empty cell), impute the missing value based on other data, or flag all missing data as such by entering “0” or “missing” in the respective cell. Each method for handling missing data has implications for your analysis, so you’ll need to choose your approach carefully. You can learn more about how to handle missing data here.
Remove unwanted outliers. Outliers are values that differ significantly from other values in your data. For example, if you see that most student test scores fall between 50 and 80, but that one student has scored a 2, this might be considered an outlier. Outliers may be the result of an error, but that’s not always the case, so approach with caution when deciding whether or not to remove them. You can learn more about how to detect and handle outliers here.
Fix inconsistencies. As already mentioned, inconsistencies in data include things like typos and irregular naming conventions. You can fix these manually (for example, using the “Find and replace” function in Google Sheets or Microsoft Excel to locate one spelling or convention and replace it with another) or by creating a filter. You can read about how to find inconsistencies in your data in Excel in this guide.
Links:
3. Practical exercise: Cleaning your dataset
Now let’s take what you’ve learned about data cleaning and apply it to your Citi Bike dataset. We’ll focus on:
Identifying and removing duplicates
Identifying and handling missing data points
Task 1: Identify and remove duplicates
Select the entire dataset by clicking on the gray rectangle to the left of columnA and above row1
With your dataset highlighted, click on “Data” in the toolbar and select “Remove duplicates” from the dropdown menu
Leave all checkboxes selected and click “Remove duplicates.” The dataset contained over 3,500 duplicate rows!
Go to the “Data” menu in the toolbar and select “Create a filter” from the dropdown menu.
select (Blanks). A tick should appear next to it. Then press “Ok”
You’ll now see only the blank, empty rows that you want to delete (rows 16846-20401).
To delete these rows: Click inside cell A:16846. Press and hold “Shift + Control”, then press the down arrow on your keyboard. This will highlight the entire column of empty cells you want to delete.
With your selection highlighted, right click and then select “Delete selected rows” from the drop-down menu:
Click “Data” in the top toolbar and select “Turn off filter” from the drop-down menu
Task 2: Identify and remove missing data points
As we learned earlier, there are different ways to handle missing data. For our Citi Bike dataset, we’ll look at just one option: deleting all rows in your spreadsheet that contain missing values. It’s important to bear in mind that this is not the only solution for handling missing data—nor is it necessarily always the best approach. For now, though, it will give us a clean dataset without needing to get into anything too complicated. This is a gentle introduction, after all! If you’re curious, you can learn about alternative methods for handling missing data here.
First, we need to find the rows in our dataset that contain missing data (i.e. blank cells). There are a few ways to do this but we’re going to use the filter method.
Usually, you would need to filter each column individually to check for blanks. For the purpose of this exercise (and to save you a bit of time), we can tell you that the missing data points are located in column F (End Station Name).
Filter column F, “End Station Name.” (Data ⇒ Create a filter)
Delete the row
remove the filter (“Data” ⇒ “Remove filter”)
Note: There are are few different ways to remove blanks. Another way is to apply conditional formatting to “color” blank cells so they’re more visually noticeable, or you can apply a filter to each column in your dataset and filter for “blanks.”
3. Key takeaways and further reading
Ready for some exploratory analysis. we’ll get descriptive statistics from our dataset in order to uncover some initial insights into how people are using the NY Citi Bike service
You can learn more about the fascinating world of data in these posts:
Comments