Customizing your pandas import
The pandas package is great at dealing with many of the issues you will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted copy of the Titanic dataset, titanic_corrupt.txt, which contains comments after the character '#' and is tab-delimited.
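To make the default behavior concrete, here is a minimal sketch using a small in-memory, comma-delimited sample. The data is made up for illustration and is not taken from titanic_corrupt.txt. It shows that pd.read_csv() skips empty lines and treats tokens such as 'NA' and empty fields as missing without any extra arguments.

```python
import io

import pandas as pd

# Hypothetical sample: one blank line, one 'NA' token, one empty field.
raw = "Name,Age\nAlice,29\n\nBob,NA\nCara,\n"

# No extra arguments: rely on the pandas defaults.
df = pd.read_csv(io.StringIO(raw))

print(df)
print(len(df))                 # 3: the blank line is skipped
print(df['Age'].isna().sum())  # 2: 'NA' and the empty field become NaN
```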
Key arguments for pd.read_csv() include:
- sep sets the expected delimiter.
  - You can use ',' for comma-delimited.
  - You can use '\t' for tab-delimited.
- comment takes the character that comments occur after in the file, indicating that any text on a line from that character onward should be ignored.
- na_values takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.
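The three arguments above can be sketched together on a small in-memory sample. The rows below are invented for illustration and only mimic the kind of corruption described (tab delimiters, '#' comments, a custom missing-value string); the real file's contents may differ.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample, not the real titanic_corrupt.txt.
raw = (
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25# paid in cash\n"
    "Bob\tNothing\t71.28\n"
    "# this whole line is a comment\n"
    "Cara\t35\tNothing\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',               # fields are separated by tabs
    comment='#',            # ignore everything from '#' to the end of the line
    na_values=['Nothing'],  # treat 'Nothing' as NaN, in addition to the defaults
)

print(df)
print(df.isna().sum())  # one missing value each in 'Age' and 'Fare'
```

The fully commented line produces no row at all, while the inline comment is simply cut off the end of its row.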
This exercise is part of the course Introduction to Importing Data in Python.
Exercise instructions
- Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
  - sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
  - comment takes the character that comments occur after in the file, which in this case is '#'.
  - na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
- Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
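As a rough illustration of how sep mirrors np.loadtxt()'s delimiter argument, the sketch below parses the same made-up, purely numeric, tab-delimited text with both functions and compares the results. The sample data and the header=None choice are assumptions for the demonstration only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical numeric sample with tab-separated columns and no header row.
raw = "1.0\t2.0\n3.0\t4.0\n"

# NumPy: delimiter names the character between fields.
arr = np.loadtxt(io.StringIO(raw), delimiter='\t')

# pandas: sep plays the same role; header=None because the sample has no header.
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None)

print(np.array_equal(arr, df.to_numpy()))  # True: both parsers see the same grid
```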
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import matplotlib.pyplot as plt and pandas as pd
import matplotlib.pyplot as plt
import pandas as pd
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()