
Customizing your pandas import

The pandas package handles many of the issues you will encounter when importing data as a data scientist, such as comments in flat files, empty lines, and missing values (NA or NaN). To wrap up this chapter, you will import a corrupted copy of the Titanic dataset, titanic_corrupt.txt, which is tab-delimited and contains comments introduced by the character '#'.

Key arguments for pd.read_csv() include:

  • sep sets the expected delimiter.
    • You can use ',' for comma-delimited.
    • You can use '\t' for tab-delimited.
  • comment takes the character that marks the start of a comment; the rest of the line after that character is ignored.
  • na_values takes a list of strings to identify as NA/NaN. Some strings are already recognized as NA/NaN by default; the strings you pass here are added to those defaults.
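As a rough illustration of how these three arguments combine, the sketch below parses a small in-memory sample rather than a real file. The sample data, the column names, and the placeholder string 'missing' are all made up for this example and are not taken from titanic_corrupt.txt.

```python
import io

import pandas as pd

# Hypothetical in-memory sample: tab-delimited, with '#' comments and the
# made-up placeholder 'missing' standing in for absent values.
raw = (
    "# passenger sample (illustrative only)\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25\n"
    "Bob\tmissing\t71.28# fare unverified\n"
    "Carol\t41\tmissing\n"
    "Dan\tNaN\t8.05\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                # fields are separated by tabs
    comment="#",             # ignore everything from '#' to the end of the line
    na_values=["missing"],   # treat 'missing' as NA/NaN, on top of the defaults
)

print(df)
print(df.isna().sum())  # number of missing values per column
```

Note that the row containing the literal string NaN is also parsed as missing, since the default NA strings remain active when na_values is supplied.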

This exercise is part of the course Introduction to Importing Data in Python.

Exercise instructions

  • Complete the arguments of pd.read_csv() to import titanic_corrupt.txt correctly using pandas:
    • sep sets the delimiter to use, and works the same way as np.loadtxt()'s delimiter argument. Note that the file you're importing is tab-delimited.
    • comment takes the character that introduces comments in the file, which in this case is '#'.
    • na_values takes a list of strings to be treated as NA/NaN, in this case the string 'Nothing'.
  • Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
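To make the parallel between sep and np.loadtxt()'s delimiter argument concrete, here is a minimal sketch that reads the same tab-delimited text with both functions. The numeric sample is hypothetical and unrelated to the Titanic file; header=None is used only because this sample has no header row.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical numeric, tab-delimited sample (not the Titanic file).
raw = "1.0\t2.0\n3.0\t4.0\n"

# NumPy: the field separator is passed as `delimiter`.
arr = np.loadtxt(io.StringIO(raw), delimiter="\t")

# pandas: the same separator is passed as `sep`.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

print(arr)
print(df.to_numpy())
```

Both calls should yield the same 2x2 grid of floats; the difference is that pandas returns a DataFrame, which can also hold mixed column types.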

Hands-on interactive exercise

Complete this sample code to finish the exercise.

# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='____', comment='____', na_values=[____])

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()