Customizing your pandas import
The pandas package also handles many of the issues you will encounter when
importing data as a data scientist, such as comments in flat files, empty
lines, and missing values. Missing values are commonly referred to as NA or
NaN. To wrap up this chapter, you're going to import a slightly corrupted
copy of the Titanic dataset, titanic_corrupt.txt, which
- contains comments after the character '#';
- is tab-delimited.
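To make these three issues concrete, here is a minimal sketch of how pd.read_csv() can cope with them. It does not use titanic_corrupt.txt: the in-memory text, the column names and the 'unknown' missing-value marker are all hypothetical stand-ins chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text standing in for a corrupted flat file.
# The data and the 'unknown' marker are assumptions, not the real dataset.
raw = (
    "# hypothetical passenger sample\n"
    "Name\tAge\tFare\n"
    "Ann\t29\t7.25\n"
    "\n"
    "Bob\tunknown\t71.28#checked by hand\n"
    "Cid\t35\tunknown\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",               # columns are separated by tabs
    comment="#",            # drop everything after '#' on a line
    na_values=["unknown"],  # treat this string as NaN
)

print(df)
print(df.isna().sum())
```

The fully commented first line and the blank line are skipped, the trailing comment on Bob's row is cut off before parsing, and each 'unknown' entry is read as NaN, so the Age and Fare columns stay numeric.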
This exercise is part of the course
Importing Data in Python
Exercise instructions
- Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(). comment takes characters that comments occur after in the file; na_values takes a list of strings to recognize as NA/NaN. There is one such string in this corrupted file: to figure out what it is, print the first few lines of titanic_corrupt.txt to shell.
- Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
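One way to inspect the first few lines of a file from Python is sketched below. The peek helper and the demo file are hypothetical and written only for this illustration; the demo content is made up and does not reproduce titanic_corrupt.txt.

```python
import itertools
import os
import tempfile


def peek(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Demo on a hypothetical stand-in file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_corrupt.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("# hypothetical sample\nName\tAge\nAnn\t29\nBob\tunknown\n")

    first_lines = peek(demo_path, n=3)

# repr() shows tabs as \t, which makes the delimiter easy to spot.
for line in first_lines:
    print(repr(line))
```

Calling the same helper on titanic_corrupt.txt, for example peek('titanic_corrupt.txt'), should be enough to spot the comment character, the delimiter and the string used for missing values.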
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep=____, comment=____, na_values=____)
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()