Customizing your pandas import
The pandas package handles many of the issues you will encounter when
importing data as a data scientist, such as comments in flat files,
empty lines, and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted
copy of the Titanic dataset, titanic_corrupt.txt, which is tab-delimited and contains comments after the character '#'.
Key arguments for pd.read_csv() include:
- sep sets the expected delimiter.
  - You can use ',' for comma-delimited.
  - You can use '\t' for tab-delimited.
- comment takes characters that comments occur after in the file, indicating that any text starting with these characters should be ignored.
- na_values takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.
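The effect of these three arguments can be sketched on a small in-memory sample. The rows below are made up for illustration and are not taken from the Titanic file; the variable names raw, df and no_na are likewise assumptions.

```python
import io

import pandas as pd

# Made-up, tab-delimited sample text (NOT the real titanic_corrupt.txt).
raw = (
    "# made-up rows for illustration\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25#trailing comment\n"
    "Bob\tNothing\t8.05\n"
    "Carol\t41\t13.00\n"
)

# sep='\t' splits fields on tabs, comment='#' drops everything from '#'
# to the end of the line, and na_values adds 'Nothing' to the strings
# that pandas treats as missing.
df = pd.read_csv(io.StringIO(raw), sep='\t', comment='#', na_values=['Nothing'])
print(df)
print(df.dtypes)

# For comparison: without na_values, 'Nothing' stays a string, so the
# 'Age' column cannot be parsed as a numeric column.
no_na = pd.read_csv(io.StringIO(raw), sep='\t', comment='#')
print(no_na.dtypes)
```

Reading from io.StringIO keeps the sketch self-contained; passing a filename to pd.read_csv() works the same way.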
This exercise is part of the course
Introduction to Importing Data in Python
Exercise instructions
- Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
  - sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
  - comment takes characters that comments occur after in the file, which in this case is '#'.
  - na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
- Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
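As a rough illustration of how sep relates to np.loadtxt()'s delimiter argument, the sketch below reads the same made-up, purely numeric sample with both functions. The sample text and the names raw, arr and df are assumptions for this example only; note that np.loadtxt() needs skiprows=1 to skip the header line, whereas pd.read_csv() uses it for the column names.

```python
import io

import numpy as np
import pandas as pd

# Made-up numeric sample, tab-delimited, with one full-line comment.
raw = "Age\tFare\n29\t7.25\n# a comment line\n41\t13.0\n"

# NumPy: delimiter sets the field separator, comments marks comment text,
# and skiprows=1 skips the header line.
arr = np.loadtxt(io.StringIO(raw), delimiter='\t', comments='#', skiprows=1)

# pandas: sep plays the role of delimiter, comment the role of comments,
# and the header line becomes the column names.
df = pd.read_csv(io.StringIO(raw), sep='\t', comment='#')

print(arr)
print(df)
```

Both calls yield the same two rows of numbers; pandas additionally keeps the column labels and a per-column dtype.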
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()