Get startedGet started for free

Customizing your pandas import

The pandas package is great at dealing with many of the issues you will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted copy of the Titanic dataset titanic_corrupt.txt, which contains comments after the character '#', and is tab-delimited.

Key arguments for pd.read_csv() include:

  • sep sets the expected delimiter.
    • You can use ',' for comma-delimited.
    • You can use '\t' for tab-delimited.
  • comment takes characters that comments occur after in the file, indicating that any text starting with these characters should be ignored.
  • na_values takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.

This exercise is part of the course

Introduction to Importing Data in Python

View Course

Exercise instructions

  • Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
    • sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
    • comment takes characters that comments occur after in the file, which in this case is '#'.
    • na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
  • Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Edit and Run Code