Aan de slagGa gratis aan de slag

Customizing your pandas import

The pandas package is great at dealing with many of the issues you will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values (NA or NaN). To wrap up this chapter, you're going to import a corrupted copy of the Titanic dataset titanic_corrupt.txt, which contains comments after the character '#', and is tab-delimited.

Key arguments for pd.read_csv() include:

  • sep sets the expected delimiter.
    • You can use ',' for comma-delimited.
    • You can use '\t' for tab-delimited.
  • comment takes characters that comments occur after in the file, indicating that any text starting with these characters should be ignored.
  • na_values takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.

Deze oefening maakt deel uit van de cursus

Introduction to Importing Data in Python

Cursus bekijken

Oefeninstructies

  • Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
    • sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
    • comment takes characters that comments occur after in the file, which in this case is '#'.
    • na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
  • Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Code bewerken en uitvoeren