Customizing your pandas import

The pandas package handles many of the issues you will encounter when importing data as a data scientist, such as comments in flat files, empty lines, and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted copy of the Titanic dataset, titanic_corrupt.txt, which is tab-delimited and contains comments after the character '#'.

Key arguments for pd.read_csv() include:

- sep sets the expected delimiter.
  - Use ',' for comma-delimited files.
  - Use '\t' for tab-delimited files.
- comment takes the single character that marks the start of a comment; pandas ignores the rest of the line after it.
- na_values takes a list of strings to identify as NA/NaN. Some values are already recognized as NA/NaN by default; providing this argument adds to them.
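
As a minimal sketch of how these three arguments work together, the snippet below parses a small in-memory sample instead of a real file. The sample rows, the column names, and the use of io.StringIO are assumptions made for illustration; they are not taken from titanic_corrupt.txt.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample: one full-line comment, one trailing
# comment, and the placeholder string 'Nothing' standing in for a missing age.
sample = (
    "# made-up rows for illustration\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25\n"
    "Bob\tNothing\t71.28#age unknown\n"
    "Carol\t35\t8.05\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",               # fields are separated by tabs
    comment="#",            # ignore everything from '#' to the end of the line
    na_values=["Nothing"],  # treat 'Nothing' as NaN, in addition to the defaults
)

print(df)
print(df["Age"].isna().sum())  # number of ages read as missing
```

Because 'Nothing' is read as NaN, pandas can store the Age column as a numeric column rather than as strings.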

Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
- sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
- comment takes the character that marks the start of a comment, which in this case is '#'.
- na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
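
One way the completed import might look is sketched below. It assumes titanic_corrupt.txt sits in the working directory and has an 'Age' column, as the instructions imply. The helper names load_titanic and plot_age_histogram are hypothetical, and using matplotlib for the histogram is also an assumption.

```python
from pathlib import Path

import pandas as pd


def load_titanic(path):
    """Read a tab-delimited file, skipping '#' comments and mapping 'Nothing' to NaN."""
    return pd.read_csv(path, sep="\t", comment="#", na_values=["Nothing"])


def plot_age_histogram(df):
    """Plot a histogram of the 'Age' column (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt

    df["Age"].plot(kind="hist")  # missing ages (NaN) are left out of the histogram
    plt.xlabel("Age (years)")
    plt.ylabel("count")
    plt.show()


# Assumed location of the exercise file; nothing runs if it is absent.
file = Path("titanic_corrupt.txt")
if file.exists():
    data = load_titanic(file)
    print(data.head())
    plot_age_histogram(data)
```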
