Cleaning Data

1. Cleaning Data

External data is rarely in the format we want it to be. We will now see how to modify DataFrames to prepare data for analysis.

2. Show column information

One of the first things we look at when analyzing a new dataset is the column names. We usually want them to be legible and concise, with no spaces. Here, we have already loaded the DataFrame stock_data. To inspect its column names, we use the first function, which returns the first row of a DataFrame. The output shows the column names and their associated data types, along with the first row of data, which gives us a look at the format of our data.
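As a minimal sketch of this step, assuming the DataFrames.jl package is installed: the small stock_data table below is a made-up stand-in for the one loaded in the lesson, and its values are purely illustrative.

```julia
using DataFrames

# Hypothetical stand-in for stock_data; the values are invented for illustration.
stock_data = DataFrame(
    "Date" => ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Close" => [129.41, 131.01, 126.60],
    "Adj Close" => [128.62, 130.21, 125.83],
)

# Column names as a vector of strings
println(names(stock_data))

# First row as a one-row DataFrame, shown with column names and element types
first_row = first(stock_data, 1)
show(first_row)
```

Calling first(stock_data) without the second argument returns the same row as a single DataFrameRow rather than a one-row DataFrame.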

3. Rename a column

Can you spot any problems? One column, 'Adj Close', which stands for Adjusted Close, has a space. Spaces should generally be avoided in column names. One reason is that we can't access a column using plain dot notation if its name has a space in it. Let's replace the space with an underscore. To do this, we use the rename function. We pass in the DataFrame whose column we want to rename, in this case stock_data, and then a dictionary mapping the old column name to the new column name. Here we pass 'Adj Close' with the space, and then 'Adj_Close' with the underscore. Using the first function again gives us the expected result - the 'Adj_Close' column now has an underscore.
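The renaming step might look like the sketch below, again on a made-up stand-in for stock_data. It assumes the result is assigned back to the same name, since rename returns a new DataFrame and leaves the original untouched.

```julia
using DataFrames

# Hypothetical stand-in for stock_data; the values are invented for illustration.
stock_data = DataFrame(
    "Close" => [129.41, 131.01],
    "Adj Close" => [128.62, 130.21],
)

# rename returns a new DataFrame, so we assign the result back.
# A plain pair, "Adj Close" => "Adj_Close", works in place of the Dict.
stock_data = rename(stock_data, Dict("Adj Close" => "Adj_Close"))

println(names(stock_data))

# The renamed column is now reachable with dot notation
println(stock_data.Adj_Close)
```

The in-place variant rename! takes the same arguments and modifies the DataFrame directly.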

4. Describe and find missing data

One common issue with data is missing values. Some data may have been mismeasured. Some manually compiled data may have missing values due to transcription errors, and some data may be intentionally missing. An easy way to find missing values is the describe function.

5. Describe missing data

As we can see from the output of our describe call, we have a lot of information here! Let's focus on the nmissing column, showing the number of missing values in each column. We can see that the Close column, representing the stock's closing price, has four missing values. Let's remove any rows that have missing values.
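A small sketch of this check, using an invented table with two missing closing prices rather than the four in the lesson's data:

```julia
using DataFrames

# Hypothetical stand-in for stock_data; the values are invented for illustration.
stock_data = DataFrame(
    "Close" => [129.41, missing, 126.60, missing],
    "Volume" => [143301900, 97664900, 155088000, 120529500],
)

# Full summary: one row per column, including an nmissing count
stats = describe(stock_data)
println(stats)

# Restrict the summary to the missing-value counts only
println(describe(stock_data, :nmissing))

# Look up the count for the Close column (variable names are stored as Symbols)
close_missing = stats[stats.variable .== :Close, :nmissing][1]
println(close_missing)
```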

6. Remove missing data

To remove rows with missing values, we use the dropmissing function. We pass in the DataFrame name and the column from which we wish to drop missing values. Before we do this, we use the nrow function to print the number of rows in our DataFrame. We can see that we have 252 rows, including the rows with missing values. After using dropmissing, we again print the number of rows. We now have 248 rows - four fewer than before. This confirms that the missing values are gone. We could also have called the describe function again.
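The before-and-after row count can be sketched as follows. The four-row table is a made-up stand-in, so the counts are 4 and 2 rather than the 252 and 248 of the lesson's data, and the name cleaned is chosen here for illustration.

```julia
using DataFrames

# Hypothetical stand-in for stock_data; the values are invented for illustration.
stock_data = DataFrame(
    "Date" => ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"],
    "Close" => [129.41, missing, 126.60, missing],
)

# Row count before dropping anything
println(nrow(stock_data))

# Drop only the rows where Close is missing; returns a new DataFrame
cleaned = dropmissing(stock_data, :Close)

# Row count after dropping
println(nrow(cleaned))
```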

7. Replace missing data

So far, we have simply dropped the rows with missing data, but this isn't always appropriate. In our stock_data example, we can't simply ignore a day. In this situation, we want to replace the missing values instead. To do so, we can use the replace function. We pass in the column containing missing values. Then we pass the value to be replaced (in this case, missing) followed by the replacement value. Here, we have chosen to replace the missing values with a closing price of 130. Printing the rows that previously had missing values, we can see that the closing price is now 130 for all those rows. Picking an arbitrary value is not the best approach and might make the data worse than if the values were left missing. In the upcoming exercises you'll figure out a way to replace values more accurately.
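A minimal sketch of the replacement, on an invented three-row table. Writing the result back into the Close column and the helper name was_missing are assumptions made here for illustration.

```julia
using DataFrames

# Hypothetical stand-in for stock_data; the values are invented for illustration.
stock_data = DataFrame(
    "Date" => ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Close" => [129.41, missing, 126.60],
)

# Remember which rows are missing so we can inspect them afterwards
was_missing = ismissing.(stock_data.Close)

# replace takes the column, then a pair: value to replace => replacement
stock_data.Close = replace(stock_data.Close, missing => 130.0)

# The previously missing rows now hold the placeholder price
println(stock_data[was_missing, :])
```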

8. Let's practice!

You've seen a lot about how to work with DataFrames. Now let's practice!
