Dropping data

1. Dropping Data

More data is better, right? Not if it's bad data! The saying "garbage in, garbage out" is doubly true in data science!

2. Where can data go bad?

Data can go bad in a lot of places. It may be recorded incorrectly or contain extreme events. Inconsistent formatting, such as mixing numeric and text data, can make a field hard to use. Duplicates add redundant observations. Missing data can leave blind spots in your analysis. Lastly, sometimes the data is simply not relevant to the analysis. Failing to account for bad data can set your analysis up to fail. Please be careful!

3. Dropping Columns

Inspecting our data, we can see there are some columns that are not worth including in our analysis: NO is just the record number, UNITNUMBER is the apartment or house number, and CLASS is completely constant.

4. Dropping Columns

Dropping columns requires passing either a single column name or a list of columns to drop. Here we build a list called cols_to_drop and pass it to the function to drop them. Please note the star, which tells the function to unpack the list and feed the column names to the function one by one.
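
A minimal sketch of this pattern, assuming a Spark DataFrame named df that contains the three columns mentioned above:

```python
# Assumption: df is a Spark DataFrame with the NO, UNITNUMBER, and CLASS columns.
cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']

# The star unpacks the list so drop() receives each name as a separate argument,
# equivalent to df.drop('NO', 'UNITNUMBER', 'CLASS').
df = df.drop(*cols_to_drop)
```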

5. Text Filtering

A common task in cleaning your data is filtering it. Here we will filter out records that contain specific text values. where applies the filter to the DataFrame records, like creates a True/False condition for each record, and the tilde takes the opposite, a logical NOT. The PySpark code reads: filter the DataFrame where the POTENTIALSHORTSALE field is NOT like the string 'Not Disclosed'.
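
A sketch of that filter, again assuming the DataFrame is named df:

```python
# like() builds a True/False column, ~ negates it, and where() keeps only
# the records for which the condition is True.
df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))
```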

6. Outlier Filtering

For an initial pass at the model, it might be helpful to remove large outliers. One definition of an outlier for nearly normally distributed data is a value that falls more than 3 standard deviations from the mean; under this rule, only about 0-point-3 percent of the data should be filtered out. Remember that outliers occur on both sides, so filter on both sides of the mean!

7. Value Filtering Example

Here we will filter extreme values from the LISTPRICE column. To start, we use the aggregate functions standard deviation and mean. Then we use collect to force the calculation to run and use the zero, zero index to access the values. Lastly, we create a multi-conditional filter, which is just two boolean statements ANDed together. It reads: filter where the LISTPRICE is less than hi_bound and more than low_bound.
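
A sketch of the full walkthrough, assuming df and the LISTPRICE column; the variable names mirror the ones described above:

```python
from pyspark.sql.functions import mean, stddev

# Aggregate to a one-row result, collect() to force the calculation to run,
# and use the zero, zero index to pull out the Python values.
mean_val = df.agg(mean('LISTPRICE')).collect()[0][0]
stddev_val = df.agg(stddev('LISTPRICE')).collect()[0][0]

# Bounds at 3 standard deviations on either side of the mean.
hi_bound = mean_val + 3 * stddev_val
low_bound = mean_val - 3 * stddev_val

# Multi-conditional filter: two boolean statements ANDed together with &.
df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))
```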

8. Dropping NA's or NULLs

Dealing with missing data is something we will cover later, as dropping data is usually a naive approach. Nevertheless, it is sometimes important to take shortcuts to quickly prove out the basis for further work. dropna does what you'd expect it to do: in its basic form, it removes a record where there is a null value in any column. You can get more specific, as we'll see on the next slide.

9. Dropping NA's or NULLs

In the first example, we drop any record that contains a NULL value. In the second example, we look at only two columns and remove the record if both are NULL. Lastly, we can apply a threshold across all the columns and remove the entire record if more than two columns contain NULL values.
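
Sketches of the three variants, assuming df; the two columns named in the subset are hypothetical placeholders:

```python
# 1. Drop a record if any column contains a NULL value.
df = df.dropna()

# 2. Look at only two (hypothetical) columns and drop a record when both are NULL.
df = df.dropna(how='all', subset=['LISTPRICE', 'POTENTIALSHORTSALE'])

# 3. Keep a record only if it has at least len(df.columns) - 2 non-NULL values,
#    i.e. remove it when more than two columns contain NULLs.
df = df.dropna(thresh=len(df.columns) - 2)
```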

10. Dropping Duplicates

Duplicates occur when two or more records contain exactly the same information. This often happens after you drop columns or join datasets. dropDuplicates keeps the first record it finds and drops any subsequent copies. Since Spark is distributed, which record it finds first may or may not match the order in which your file was loaded. If you want to be pickier about where you look for duplicates, you can specify a list of column names and duplicates will be checked only within those columns.
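
A sketch of both forms, assuming df; the column list is a hypothetical example:

```python
# Drop records that are exact duplicates across every column.
df = df.dropDuplicates()

# Check for duplicates only within the listed (hypothetical) columns.
df = df.dropDuplicates(['LISTPRICE', 'POTENTIALSHORTSALE'])
```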

11. Let's practice!

In this video, we learned why we might have bad data and several ways to remove it so it doesn't adversely impact our analysis. Let's see you take a shot at removing some data in the exercises!