1. Data Cleansing and Filtering
In this video, we’ll talk about implementing data cleansing in Fabric. We'll cover issues like duplicates, missing data, and inconsistent data types.
2. Data cleansing tools
There are various tools and languages you can use to clean your data. These include SQL scripts, Spark notebooks, and Dataflows.
3. Handling blank rows
Sometimes the source data has gaps that result in rows where all the columns are blank. A cleansing process should remove these blank rows.
4. Handling blank rows with Dataflows
In a Dataflow, this can be achieved by adding the "Remove blank rows" transformation to the query.
5. Handling blank rows with Spark
In PySpark, the dropna function drops rows containing nulls.
By default, the function drops a row if any column in the row is null, but the how='all' parameter can be added so it only drops a row if the entire row is null.
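As a quick illustration, here is a minimal sketch of both behaviors; the DataFrame and its column names are made up for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one row is completely blank, one is partially blank.
df = spark.createDataFrame(
    [("A101", 5), (None, None), ("A102", None)],
    ["orderID", "quantity"],
)

# Default: drops any row that contains at least one null ("A102" goes too).
df.dropna().show()

# how="all": drops a row only when every column is null.
df.dropna(how="all").show()
```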
6. Handling duplicate rows
Another common cleansing scenario involves removing duplicate records.
7. Handling duplicate rows with Dataflows
In a Dataflow, this can be achieved by adding the "Remove duplicates" transformation to the query.
8. Handling duplicate rows with Spark
In PySpark, the dropDuplicates function removes duplicate rows from a DataFrame.
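A minimal sketch, using a hypothetical DataFrame with a repeated record:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", 5), ("A101", 5), ("A102", 3)],
    ["orderID", "quantity"],
)

# Keeps one copy of each fully identical row.
df.dropDuplicates().show()

# A subset of columns can also be passed to deduplicate on those columns only.
df.dropDuplicates(["orderID"]).show()
```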
9. Replacing values
In some cases, you might need to change the values in a column, or to populate empty cells with a specific value.
10. Replacing values with Dataflows
In a Dataflow, select a column, add the "Replace values" transformation, and specify the value to replace and the new value. In this example, the transformation changes null values to the text '(Unknown)'.
11. Replacing values with Spark
In PySpark, you can use the DataFrame 'replace' function to change the values stored in a column. In this example, the transformation replaces the text 'N/A' with '(Unknown)'.
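A minimal sketch; the column names here are hypothetical. Note that replace matches stored values, while fillna is the counterpart for filling actual nulls, as in the Dataflow example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", "Seattle"), ("A102", "N/A")],
    ["orderID", "city"],
)

# Replace the placeholder text 'N/A' with '(Unknown)' in the city column.
df.replace("N/A", "(Unknown)", subset=["city"]).show()

# To populate actual nulls with a value, use fillna instead.
df.fillna("(Unknown)", subset=["city"]).show()
```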
12. Correcting data types
Each column in a dataset has a data type that all rows should conform to. Having the correct data type ensures data consistency in the model and facilitates performing comparisons and other operations.
13. Correcting data types
For example, if a column is defined as numeric, Power Query will offer arithmetic operations like Sum and Maximum in the list of transformations.
14. Correcting data types with Dataflows
There are several places where you can change the data type for a column in a Dataflow query. The quickest way is by clicking on the data type icon on the left side of the column heading.
15. Correcting data types with Dataflows
This displays a list of data types, from which you can select a new one.
16. Correcting data types with Dataflows
Another way is by right-clicking on the column heading. It will display a large context menu, where Change type is one of the available options.
17. Correcting data types with Dataflows
Finally, you can add a Change type transform in the diagram view.
18. Correcting data types with Spark
In PySpark, you can use the 'cast' function to convert a column to another data type. In this example, the orderID column is converted to the integer data type.
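A minimal sketch of the cast; the sample values are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# orderID arrives as text.
df = spark.createDataFrame([("1001",), ("1002",)], ["orderID"])

# Cast the column to integer; withColumn replaces the original column.
df = df.withColumn("orderID", col("orderID").cast("int"))
df.printSchema()  # orderID: integer
```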
19. Data Filtering
Filtering involves including or excluding rows from a dataset based on a condition. This is what the WHERE clause does in a SQL SELECT statement.
20. Data filtering with Dataflows
In a Dataflow, the filter transformation on a column allows you to specify a condition or select specific values.
21. Data filtering with Spark
The filter or where function in PySpark achieves the same result, taking the condition as a parameter.
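A minimal sketch of both spellings, with a hypothetical quantity column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A101", 5), ("A102", 0), ("A103", 3)],
    ["orderID", "quantity"],
)

# filter and where are aliases; both keep the rows that satisfy the condition.
df.filter(col("quantity") > 0).show()
df.where(col("quantity") > 0).show()
```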
22. Let's practice!
Now, let's do a couple of exercises to practice data cleansing and filtering.