
Customizing your pandas import

The pandas package also handles many of the issues you will encounter when importing data as a data scientist, such as comments in flat files, empty lines and missing values. Missing values are commonly referred to as NA or NaN. To wrap up this chapter, you will now import a slightly corrupted copy of the Titanic dataset, titanic_corrupt.txt, which

  • contains comments after the character '#';
  • is tab-delimited.
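
To see how these two properties are handled, here is a minimal sketch that parses a small in-memory sample instead of the real file. The sample text, column names and values are invented for illustration and are not taken from titanic_corrupt.txt.

```python
import io

import pandas as pd

# Hypothetical sample (NOT the real Titanic data): tab-delimited,
# with a full-line comment and a trailing comment, both starting at '#'
sample = (
    "# this whole line is a comment\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25#trailing comment\n"
    "Bob\t41\t71.5\n"
)

# sep='\t' splits fields on tabs; comment='#' drops everything from '#'
# to the end of the line, and skips lines that are entirely comments
df = pd.read_csv(io.StringIO(sample), sep="\t", comment="#")
print(df)
```

The same two arguments should carry over to a file on disk by passing its path in place of the StringIO object.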

This exercise is part of the course

Importing Data in Python


Exercise instructions

  • Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(). comment takes the character after which comments occur in the file; na_values takes a list of strings to recognize as NA/NaN. There is one such string in this corrupted file: to figure out what it is, print the first few lines of titanic_corrupt.txt in the shell.
  • Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
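
The workflow in these instructions (peek at the raw text, then tell pandas which string marks a missing value) can be sketched as follows. This is only an illustration under stated assumptions: the temporary file, its contents and the marker 'unknown' are all made up, and the marker used in titanic_corrupt.txt has to be found by inspecting that file yourself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in file; the NA marker 'unknown' is invented for this sketch
content = (
    "Name\tAge\n"
    "Alice\t29\n"
    "# a comment line\n"
    "Bob\tunknown\n"
    "Carol\t35\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write(content)

    # Peek at the raw text first to spot the delimiter, comments and NA marker
    with open(path) as f:
        first_lines = [next(f) for _ in range(3)]
    print("".join(first_lines))

    # Strings listed in na_values are read as NaN
    df = pd.read_csv(path, sep="\t", comment="#", na_values=["unknown"])

print(df)
print(df["Age"].isna().sum())  # number of missing ages
```

Because one 'Age' entry becomes NaN, pandas stores the column as floats rather than integers, which is worth keeping in mind before plotting or summarizing it.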

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep=____, comment=____, na_values=____)

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()