Removing invalid rows
Now that you've successfully removed the commented rows, you have some information about the general format of the data: each valid row should contain at least 5 tab-separated columns. Remember that your original DataFrame has only a single column, so you'll need to split the data on the tab (\t) character.
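As a quick plain-Python illustration of the idea (using a made-up sample line, not the course data), splitting a raw string on the tab character yields the individual fields:

```python
# A made-up sample line (not from the course dataset)
line = "02110627\tn02110627\taffenpinscher,Affenpinscher\t0\t9"

# Splitting on the tab character yields the individual fields
fields = line.split("\t")
print(len(fields))  # this row has 5 fields, so it is valid
```

In the exercise you will apply the same split, but as a column operation on the DataFrame rather than on a single string.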
The DataFrame annotations_df is already available, with the commented rows removed. The pyspark.sql.functions library is available under the alias F. The initial number of rows in the DataFrame is stored in the variable initial_count.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Create a new variable tmp_fields using the annotations_df DataFrame column '_c0', splitting it on the tab character.
- Create a new column in annotations_df named 'colcount' representing the number of fields defined in the previous step.
- Filter out any rows from annotations_df containing fewer than 5 fields.
- Count the number of rows in the DataFrame and compare it to initial_count.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Split _c0 on the tab character and store the list in a variable
tmp_fields = ____(annotations_df['_c0'], ____)
# Create the colcount column on the DataFrame
annotations_df = annotations_df.____('____', ____(____))
# Remove any rows containing fewer than 5 fields
annotations_df_filtered = annotations_df.____(~ (____))
# Count the number of rows
final_count = ____
print("Initial count: %d\nFinal count: %d" % (initial_count, final_count))
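To see the filtering logic end to end without a Spark session, here is a plain-Python analogue using hypothetical rows; in the exercise itself the same steps are expressed as PySpark column operations (F.split to produce the field list, F.size inside withColumn for the colcount column, and filter with a negated condition):

```python
# Hypothetical sample rows standing in for annotations_df (plain Python, no Spark)
rows = [
    "id1\tbreed\tdog\t10\t2.0",    # 5 fields: valid
    "id2\tbreed\tcat",             # 3 fields: invalid
    "id3\tbreed\tdog\t7\t1.5\tX",  # 6 fields: valid
]
initial_count = len(rows)

# Split each row on the tab character and keep rows with at least 5 fields
filtered = [r for r in rows if len(r.split("\t")) >= 5]
final_count = len(filtered)

print("Initial count: %d\nFinal count: %d" % (initial_count, final_count))
```

Running this prints an initial count of 3 and a final count of 2, mirroring the before/after comparison the exercise asks for.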