1. Data Cleansing and Filtering
In this video, we’ll talk about implementing data cleansing in Fabric. We'll cover issues like duplicates, missing data, and inconsistent data types.
2. Data cleansing tools
There are various tools and languages you can use to clean your data. These include SQL scripts, Spark notebooks, and Dataflows.
3. Handling blank rows
Sometimes the source data has gaps that result in rows where all the columns are blank. A cleansing process should remove these blank rows.
4. Handling blank rows with Dataflows
In a Dataflow, this can be achieved by adding the "Remove blank rows" transformation to the query.
5. Handling blank rows with Spark
In PySpark, the dropna function drops rows containing nulls.
By default, the function drops a row if any column in the row is null, but the how='all' parameter can be added so it only drops a row if the entire row is null.
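As a quick illustration, here is a minimal sketch of both behaviors; the DataFrame and its column names are made up for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one row is completely blank, one is partially blank.
df = spark.createDataFrame(
    [("A101", 5), (None, None), ("A102", None)],
    ["orderID", "quantity"],
)

# Default: drops any row that contains at least one null ("A102" goes too).
df.dropna().show()

# how="all": drops a row only when every column is null.
df.dropna(how="all").show()
```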
6. Handling duplicate rows
Another common cleansing scenario involves removing duplicate records.
7. Handling duplicate rows with Dataflows
In a Dataflow, this can be achieved by adding the "Remove duplicates" transformation to the query.
8. Handling duplicate rows with Spark
In PySpark, the dropDuplicates function removes duplicate rows from a DataFrame.
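A minimal sketch, using a hypothetical DataFrame with a repeated record:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", 5), ("A101", 5), ("A102", 3)],
    ["orderID", "quantity"],
)

# Keeps one copy of each fully identical row.
df.dropDuplicates().show()

# A subset of columns can also be passed to deduplicate on those columns only.
df.dropDuplicates(["orderID"]).show()
```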
9. Replacing values
In some cases, you might need to change the values in a column, or to populate empty cells with a specific value.
10. Replacing values with Dataflows
In a Dataflow, select a column, add the "Replace values" transformation, and specify the value to replace and the new value. In this example, the transformation changes null values to the text '(Unknown)'.
11. Replacing values with Spark
In PySpark, you can use the DataFrame 'replace' function to change the values stored in a column. In this example, the transformation replaces the text 'N/A' with '(Unknown)'.
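A minimal sketch; the column names here are hypothetical. Note that replace matches stored values, while fillna is the counterpart for filling actual nulls, as in the Dataflow example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", "Seattle"), ("A102", "N/A")],
    ["orderID", "city"],
)

# Replace the placeholder text 'N/A' with '(Unknown)' in the city column.
df.replace("N/A", "(Unknown)", subset=["city"]).show()

# To populate actual nulls with a value, use fillna instead.
df.fillna("(Unknown)", subset=["city"]).show()
```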
12. Correcting data types
Each column in a dataset has a data type that all rows should conform to. Having the correct data type ensures data consistency in the model and facilitates performing comparisons and other operations.
13. Correcting data types
For example, if a column is defined as numeric, Power Query will offer arithmetic operations like Sum and Maximum in the list of transformations.
14. Correcting data types with Dataflows
There are several places where you can change the data type for a column in a Dataflow query. The quickest way is by clicking on the data type icon on the left side of the column heading.
15. Correcting data types with Dataflows
This displays a list of data types, from which you can select a new one.
16. Correcting data types with Dataflows
Another way is by right-clicking on the column heading. It will display a large context menu, where Change type is one of the available options.
17. Correcting data types with Dataflows
Finally, you can add a Change type transform in the diagram view.
18. Correcting data types with Spark
In PySpark, you can use the 'cast' function to convert a column to another data type. In this example, the orderID column is converted to the integer data type.
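A minimal sketch of the cast; the sample values are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# orderID arrives as text.
df = spark.createDataFrame([("1001",), ("1002",)], ["orderID"])

# Cast the column to integer; withColumn replaces the original column.
df = df.withColumn("orderID", col("orderID").cast("int"))
df.printSchema()  # orderID: integer
```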
19. Data Filtering
Filtering involves including or excluding rows from a dataset based on a condition. This is what the WHERE clause does in a SQL SELECT statement.
20. Data filtering with Dataflows
In a Dataflow, the filter transformation on a column allows you to specify a condition or select specific values.
21. Data filtering with Spark
The filter or where function in PySpark achieves the same result, taking the condition as a parameter.
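A minimal sketch of both spellings, with a hypothetical quantity column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", 5), ("A102", 0), ("A103", 3)],
    ["orderID", "quantity"],
)

# filter and where are aliases; both keep the rows that satisfy the condition.
df.filter(col("quantity") > 0).show()
df.where(col("quantity") > 0).show()
```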
22. Let's practice!
Now, let's do a couple of exercises to practice data cleansing and filtering.