Splitting into columns
You've cleaned up your data considerably by removing the invalid rows from the DataFrame. Now you want to perform some further transformations by generating specific meaningful columns based on the DataFrame content.
You have the spark context and the latest version of the annotations_df DataFrame. pyspark.sql.functions is available under the alias F.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Split the content of the '_c0' column on the tab character and store in a variable called split_cols.
- Add the following columns based on the first four entries in the variable above: folder, filename, width, height on a DataFrame named split_df.
- Add the split_cols variable as a column.
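As a rough illustration of what these steps do to a single row, here is a plain-Python sketch that does not use Spark. The sample value is hypothetical (the real contents of '_c0' are not shown in this exercise), and the dictionary stands in for one row of split_df. In PySpark, F.split and getItem apply the same idea to every row of the column.

```python
# Hypothetical sample value for '_c0'; the real data is not shown in the exercise.
sample_c0 = "02110627\tn02110627_12938\t200\t300\taffenpinscher,0,9,173,298"

# Split on the tab character, analogous to F.split(annotations_df['_c0'], '\t') for one row.
split_cols = sample_c0.split("\t")

# The first four entries map to the new columns, in order (like getItem(0) .. getItem(3)).
names = ["folder", "filename", "width", "height"]
row = dict(zip(names, split_cols[:4]))

# Keep the full list as well, mirroring the extra split_cols column.
row["split_cols"] = split_cols

print(row["folder"], row["filename"], row["width"], row["height"])
```

Note that the extracted values are still strings at this point; width and height would need an explicit conversion before any numeric use.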
Hands-on interactive exercise
Complete this sample code to finish the exercise.
# Split the content of _c0 on the tab character (aka, '\t')
split_cols = ____(annotations_df['____'], '\t')
# Add the columns folder, filename, width, and height
split_df = annotations_df.withColumn('folder', split_cols.getItem(____))
split_df = split_df.withColumn('filename', ____
split_df = split_df.____
____
# Add split_cols as a column
split_df = split_df.____