Data Exploration and Transformation
In this video, we'll introduce different methods to transform data in a lakehouse or warehouse.
There are many reasons why data must be transformed after ingesting it from source systems. Sometimes the source data has entries that would be considered invalid for data analysis, for example, empty fields or duplicate values.
When combining data from multiple systems, it is also possible that data types between systems don't match. For example, one system might store dates as text, which might affect how dates display and behave when querying data.
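The date-as-text issue above can be sketched with pandas; the table and column names here are made up for illustration.

```python
import pandas as pd

# Hypothetical extract: one source system stored order dates as text.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2024-01-05", "2024-02-17", "2024-03-09"],  # strings, not dates
})

# Convert the text column to a proper datetime type so dates
# sort, compare, and display correctly when querying.
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")

print(orders["order_date"].dtype)  # datetime64[ns]
```

Once converted, date arithmetic and range filters behave as expected instead of comparing strings character by character.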
Sometimes we also need to remove data that is not relevant. For example, the source data might have rows at the top or the bottom of the dataset that are used to provide tracking information like the number of rows in the dataset. Those rows could be discarded.
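Discarding such tracking rows can be sketched as follows; the header/trailer layout is an assumption for illustration.

```python
import pandas as pd

# Hypothetical raw extract: the first and last rows carry tracking
# metadata (a header and a row count), not actual records.
raw = pd.DataFrame({
    "value": ["HEADER: extract 2024-03-01", "100", "200", "300", "TRAILER: 3 rows"],
})

# Discard the first and last rows, keeping only the data records.
records = raw.iloc[1:-1].reset_index(drop=True)
print(records["value"].tolist())  # ['100', '200', '300']
```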
Sometimes we also need to combine data from multiple systems into a single dataset, or to create new columns to add calculations to the dataset that might be useful for analysis.
Finally, we might need to transform the data so it is stored in a format more suitable for analysis and reporting. Converting data to a star or snowflake schema is an example of this scenario.
The main purpose of cleansing is to remove invalid or non-relevant data. This can include removing the top or bottom rows, removing rows with duplicates, blanks or errors, and converting data types.
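The cleansing steps just listed, with duplicates, blanks, and a type conversion, might look like this in pandas (sample data is invented):

```python
import pandas as pd

# Hypothetical customer extract with a duplicate row, a blank name,
# and ages stored as text.
customers = pd.DataFrame({
    "name": ["Ana", "Ana", "Bo", None],
    "age": ["34", "34", "28", "41"],
})

cleaned = (
    customers
    .drop_duplicates()         # remove duplicate rows
    .dropna(subset=["name"])   # remove rows with blank names
    .astype({"age": "int64"})  # convert age from text to integer
)
print(len(cleaned))  # 2
```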
You may want to filter your data to remove rows that do not meet certain conditions. The WHERE clause is a common example of filtering.
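A minimal sketch of the same idea in pandas, using a made-up sales table — the boolean mask plays the role of SQL's WHERE clause:

```python
import pandas as pd

# Hypothetical sales table.
sales = pd.DataFrame({
    "region": ["East", "West", "East"],
    "amount": [120, 80, 45],
})

# Keep only rows meeting a condition -- the DataFrame equivalent
# of SQL's: SELECT * FROM sales WHERE amount > 100
large_sales = sales[sales["amount"] > 100]
print(len(large_sales))  # 1
```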
Merge transformations allow you to join two or more datasets into one. This is useful when combining data from multiple sources, or when shaping data into star schemas.
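Joining two datasets on a shared key can be sketched as follows; the two small tables stand in for data from different source systems:

```python
import pandas as pd

# Hypothetical datasets from two source systems.
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 20]})
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Ana", "Bo"]})

# Join the two datasets on their shared key into one combined dataset.
combined = orders.merge(customers, on="customer_id", how="inner")
print(list(combined.columns))  # ['order_id', 'customer_id', 'name']
```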
Sometimes it is desirable to summarize data. For example, you may want to create a table with totals so reports can run faster. This aggregation results in a dataset with fewer records. On the other hand, sometimes it is desirable to split a single column into multiple columns to facilitate certain queries. This is called de-aggregation.
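Both directions can be sketched in pandas; the combined `region_city` column is a made-up example of a field worth splitting:

```python
import pandas as pd

sales = pd.DataFrame({
    "region_city": ["East-Boston", "East-Albany", "West-Denver"],
    "amount": [100, 50, 75],
})

# De-aggregation: split one column into two to ease querying.
sales[["region", "city"]] = sales["region_city"].str.split("-", expand=True)

# Aggregation: summarize to one total row per region (fewer records).
totals = sales.groupby("region", as_index=False)["amount"].sum()
print(totals["amount"].tolist())  # [150, 75]
```

The aggregated `totals` table has two rows instead of three, which is what lets reports that only need regional totals run faster.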
There are various tools that support data exploration and transformations. Non-developers can use low-code solutions like Dataflows and Data Wrangler, which provide several built-in data transformation functions. Spark or T-SQL developers might prefer Spark notebooks or SQL scripts to query and explore data, and to develop custom data transformation routines. Let's look at these tools in more detail.
Dataflows have built-in data cleansing tasks such as removing blanks and duplicates, data filtering tasks, merge tasks like merging queries, and data aggregation tasks.
Data Wrangler is a tool built on Fabric notebooks that offers data exploration and preprocessing tasks. It provides a grid interface for displaying data, dynamic summary statistics, built-in visualizations, and a library of common data cleaning operations.
Notebooks support multiple languages and data manipulation libraries like PySpark and Pandas.
Finally, SQL scripts allow developers to use T-SQL data manipulation statements such as SELECT, INSERT, UPDATE and DELETE to craft custom data cleansing routines.
Now, let's do a couple of exercises to get more familiar with the different data transformation scenarios.