1. Case study EDA & imputation
The best way to solidify knowledge is to apply it to a real world business problem. Let's use your new SQL skills in EDA, functions, and stored procedures to solve a business problem with the TaxiRide dataset.
2. Taxi ride business problem
An overseas private equity firm is looking to invest in US Transportation. The sharp decline in the New York City taxi medallion price may represent a buying opportunity.
The firm wants to understand the average fare per trip distance, ride count, and total ride time for each weekday and NYC borough.
They also want to dig deeper and identify the pickup locations within the boroughs that should be scheduled for each driver shift.
3. Essential EDA
We always begin with EDA, it's essential. You need to expect data anomalies and be in a position to explain how they affect your query results. Your EDA conclusions could greatly affect how business decisions are implemented using your queries.
When working with distributed transactional datasets it's helpful to identify data collection errors. If a distributed device is not calibrated properly, the data it collects could contain errors.
Here is a query to identify transactions in the BikeShare dataset where the StartDate or EndDate is in the future, or if the StartDate happens before the EndDate. This could help identify a potential distributed equipment or sensor issue.
4. Data imputation
As you start to calculate average fare per TripDistance you encounter a divide by zero error.
You then discover there are hundreds of records where the Trip Distance is incorrectly set to zero.
You have to decide which data imputation method you will use to resolve this data integrity issue. There are many ways to resolve the missing data problem, but we will just cover 3 in this lesson; Mean, Hot Deck, and Omission.
5. Mean imputation
Mean imputation is a technique that involves replacing any corrupt value with the column's mean value.
This is beneficial because it doesn't change the mean value.
However, it also increases correlations involving the column that is imputed.
Here is a stored procedure that will update the records which have an incorrect duration of zero and set them to the value of the mean. You could also introduce more complexity and calculate a value that evaluates start and end time.
6. Hot Deck imputation
Hot Deck imputation is a technique where a missing value is set to a randomly selected value from the dataset.
The TABLESAMPLE clause can be used for this.
The SQL code here creates a function that will return a randomly selected Duration value using the TABLESAMPLE clause of the FROM clause. The function is selecting the first record from the TABLESAMPLE of 1,000 random records where the Duration is more than 0.
In the second code block, we implement Hot Deck imputation by calling the function each time we encounter a Duration value of zero via CASE statement.
In the previous example, we updated the missing values directly in the table with a stored procedure. In this example, we are getting a hot deck value from the function each time a zero is encountered, but not affecting the underlying data.
7. Impute via Omission
Sometimes the best way to resolve corrupt data is simply to exclude those records from your analysis. Here we calculate the average duration per DayofWeek excluding records where Duration is zero.
When selecting an imputation technique, you should consider dataset size, the analysis goals, data distribution, as well as the corrupt data's relationship to other columns. All imputation techniques inherently inject bias, and you need to be comfortable with the effect of that bias.
8. Your turn!
Let's get started on the business case!