
Removing invalid rows

Now that you've successfully removed the commented rows, you have some information about the general format of the data: each row should contain at least 5 tab-separated columns. Remember that your original DataFrame has only a single column, so you'll need to split the data on the tab character (\t).

The DataFrame annotations_df is already available, with the commented rows removed. The pyspark.sql.functions module is available under the alias F. The initial number of rows in the DataFrame is stored in the variable initial_count.

This exercise is part of the course

Cleaning Data with PySpark


Exercise instructions

  • Create a new variable tmp_fields by splitting the annotations_df column '_c0' on the tab character.
  • Create a new column in annotations_df named 'colcount' representing the number of fields defined in the previous step.
  • Filter out any rows from annotations_df containing fewer than 5 fields, storing the result in annotations_df_filtered.
  • Count the number of rows in the DataFrame and compare to the initial_count.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Split _c0 on the tab character and store the split column in a variable
tmp_fields = ____(annotations_df['_c0'], ____)

# Create the colcount column on the DataFrame
annotations_df = annotations_df.____('____', ____(____))

# Remove any rows containing fewer than 5 fields
annotations_df_filtered = annotations_df.____(~ (____))

# Count the number of rows
final_count = ____
print("Initial count: %d\nFinal count: %d" % (initial_count, final_count))