Customizing your pandas import
The pandas package handles many of the issues you will encounter when
importing data as a data scientist, such as comments in flat files,
empty lines, and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted
copy of the Titanic dataset, titanic_corrupt.txt, which is tab-delimited and contains comments after the character '#'.
Key arguments for pd.read_csv() include:
- sep sets the expected delimiter.
  - You can use ',' for comma-delimited.
  - You can use '\t' for tab-delimited.
- comment takes the character that comments occur after in the file, indicating that any text from this character to the end of the line should be ignored.
- na_values takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.
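A minimal sketch of these three arguments working together, using an in-memory string in place of a real file. The sample rows and column names are made up for illustration and are not taken from titanic_corrupt.txt.

```python
import io

import pandas as pd

# Hypothetical stand-in for a corrupted flat file: tab-delimited,
# with '#' comments and the string 'Nothing' marking a missing value.
raw = (
    "# made-up passenger extract\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25# inline comment\n"
    "Bob\tNothing\t71.28\n"
    "Carol\tNA\t8.05\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",               # fields are separated by tabs
    comment="#",            # ignore everything from '#' to the end of the line
    na_values=["Nothing"],  # treat 'Nothing' as missing, in addition to the defaults
)

print(df)
print("Missing ages:", df["Age"].isna().sum())
```

Here 'Nothing' becomes NaN because it was passed in na_values, while 'NA' becomes NaN because pandas already recognizes it by default.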
This exercise is part of the course
Introduction to Importing Data in Python
Exercise instructions
- Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
  - sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
  - comment takes characters that comments occur after in the file, which in this case is '#'.
  - na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
- Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
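To illustrate how sep in pd.read_csv() mirrors delimiter in np.loadtxt(), here is a small sketch that reads the same tab-delimited text with both functions. The numeric data is made up, and header=None is passed only because this sample has no header row.

```python
import io

import numpy as np
import pandas as pd

# Made-up tab-delimited numeric data with no header row
raw = "1.0\t2.0\n3.0\t4.0\n"

# NumPy: the delimiter argument names the field separator
arr = np.loadtxt(io.StringIO(raw), delimiter="\t")

# pandas: the sep argument plays the same role
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# Both readers should recover the same 2x2 block of values
same = np.array_equal(arr, df.to_numpy())
print(same)
```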
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()