1. What is data cleaning and preparation?
Hi! I’m Deanna Sanchez, and I’m excited to welcome you to this Alteryx course. In the following chapters, we will learn various ways to clean and prepare data for analysis.
2. Why is clean data important?
A very important part of data analysis is the concept of GIGO, or “Garbage In, Garbage Out”. It means that if your data has errors, your results may have errors too. Cleaning your data helps you catch mistakes before they carry through to your results. Properly prepared data also allows you to standardize your datasets, ensuring consistent formatting, naming conventions, and other guidelines. Clean data also increases productivity and speed to insight by streamlining your analytics and processes!
3. Cleaning your data is like...
You can think of cleaning your data as tuning up your car: clean spark plugs and other parts make everything run smoother and faster, and fixing "dirty data" does the same for your analysis!
4. Examples of Dirty Data
Sometimes, you may receive datasets that need cleaning and preparation before performing analyses. There are many kinds of "dirty data"; a few examples are missing or incomplete data,
5. Examples of Dirty Data
unstandardized or inconsistent data,
6. Examples of Dirty Data
and even data entry errors that can be overlooked.
7. Examples of Dirty Data
Leading and trailing whitespace can cause issues with filtering and joining data, and extra characters, such as currency signs on financial data, can prevent numeric formulas from being applied.
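To make those two problems concrete, here is a minimal Python sketch rather than anything Alteryx-specific. The key values, the amount string, and the helper name `parse_amount` are all hypothetical and chosen only for illustration.

```python
# Hypothetical values, chosen only to illustrate the two problems.
left_key = "Brooklyn"
right_key = " Brooklyn "  # same text, padded with spaces

# An exact comparison (which is what a join relies on) fails
# until the padding is stripped.
matches_raw = left_key == right_key
matches_clean = left_key == right_key.strip()


def parse_amount(text: str) -> float:
    """Drop a dollar sign and thousands separators, then convert to a number."""
    return float(text.strip().replace("$", "").replace(",", ""))


# float("$1,250,000") would raise ValueError; the cleaned text converts fine.
amount = parse_amount("$1,250,000")
```

The same idea applies in any tool: trim before you filter or join, and strip symbols before you do arithmetic.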
8. Clean data techniques
There are many aspects of cleaning data at the start of your analysis, and one goal is to have data that is free from missing values. You can handle missing data by imputing blanks for null string fields and zeroes for null numeric fields, or by filtering out records with missing data.
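As a rough illustration of those two options, imputing versus filtering, here is a small Python sketch. The rows and the field names `neighborhood` and `sale_amount` are made up for the example, and `None` stands in for a null.

```python
# Hypothetical rows; None stands in for a null value.
rows = [
    {"neighborhood": "Chelsea", "sale_amount": 950000},
    {"neighborhood": None, "sale_amount": 720000},
    {"neighborhood": "Astoria", "sale_amount": None},
]


def impute(row: dict) -> dict:
    """Replace a null string with a blank and a null number with zero."""
    return {
        "neighborhood": "" if row["neighborhood"] is None else row["neighborhood"],
        "sale_amount": 0 if row["sale_amount"] is None else row["sale_amount"],
    }


# Option 1: keep every row, filling in the gaps.
imputed = [impute(r) for r in rows]

# Option 2: keep only rows that have no missing values.
complete = [r for r in rows if all(v is not None for v in r.values())]
```

Which option fits depends on the analysis: a zero sale amount would distort an average, so filtering may be the safer choice there.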
9. Clean data techniques
Applying standards to your dataset can be essential, such as ensuring data matches your existing formatting requirements. Examples include adding dollar signs to currency fields or ensuring IDs have leading zeros. Naming conventions can include appending a date to an output filename. Converting text to a single case, such as Upper Case, is also useful so that values match in case-sensitive joins. Data types play a major role in analytics, and modifying data types, such as converting string to numeric values, is easily achieved in Alteryx.
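Each of those standards can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the six-character ID width, the two-decimal currency format, the filename pattern, and all function names are hypothetical, not Alteryx settings.

```python
from datetime import date


def pad_id(raw_id: str, width: int = 6) -> str:
    """Left-pad an ID with zeros to a fixed width (the width is an assumed standard)."""
    return raw_id.zfill(width)


def format_currency(amount: float) -> str:
    """Render a number with a dollar sign, thousands separators, and two decimals."""
    return f"${amount:,.2f}"


def dated_filename(stem: str, day: date, ext: str = "csv") -> str:
    """Append an ISO date to an output filename."""
    return f"{stem}_{day.isoformat()}.{ext}"


def join_key(text: str) -> str:
    """Normalize case (and padding) so a case-sensitive comparison matches."""
    return text.strip().upper()


# Converting a string to a numeric value so arithmetic can be applied.
units = int("42") + 1
```

For example, `join_key("queens")` and `join_key("QUEENS")` produce the same key, so the two records would match in a case-sensitive join.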
10. Clean data techniques
Removing unneeded items such as leading and trailing whitespace, along with tabs and line breaks, and even entire rows and columns, can be part of the data cleansing process. Alteryx allows you to remove each of these quickly, as well as punctuation and other characters.
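Outside of Alteryx, the same kind of cleanup might look like the Python sketch below. The sample address and both helper names are hypothetical, and the punctuation set is limited to ASCII characters.

```python
import re
import string


def squeeze_whitespace(text: str) -> str:
    """Trim the ends and collapse tabs, line breaks, and repeated spaces into one space."""
    return re.sub(r"\s+", " ", text).strip()


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical messy value with padding, a tab, a line break, and punctuation.
messy = "  123\tMain St.,\nApt #4  "
clean = strip_punctuation(squeeze_whitespace(messy))
# clean is "123 Main St Apt 4"
```

Whitespace is collapsed before punctuation is removed so that words separated only by a tab or line break do not run together.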
11. Profile with color-coding
Alteryx helps you identify when to clean data with color coding in the Results window and Profile view. Green equals "OK";
12. Profile with color-coding
White is a count of "Unique" records;
13. Profile with color-coding
Yellow equals "Null";
14. Profile with color-coding
Red equals "Not OK";
15. Profile with color-coding
Grey equals "Empty" values.
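To get a feel for what such a profile counts, here is a simplified Python sketch. It is not how Alteryx computes its profile: it assumes that "Not OK" means a value with leading or trailing whitespace, and the sample column is hypothetical.

```python
from collections import Counter


def classify(value):
    """Assign a value to one of four quality buckets (simplified assumption)."""
    if value is None:
        return "Null"
    if value == "":
        return "Empty"
    if value != value.strip():
        return "Not OK"  # assumed rule: leading or trailing whitespace
    return "OK"


# Hypothetical column of string values; None stands in for a null.
column = ["Bronx", None, "", " Queens", "Bronx"]
profile = Counter(classify(v) for v in column)
```

Here `profile` holds two "OK" values and one each of "Null", "Empty", and "Not OK", which is the kind of breakdown the colored bar summarizes at a glance.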
16. Data types in Alteryx
In data preparation, it is important to understand the various data types, along with how and when they should be used. The first of the five main categories of data types utilized in Alteryx is Boolean, which signifies a binary format such as zero or one, true or false, and can be used to flag data.
17. Data types in Alteryx
Numeric, which includes Integer, Fixed Decimal, and Double. The Integer data type is a number without decimals, and includes Integer 16, 32, and 64, which differ in storage size and therefore in the range of values they can hold. Double is a double-precision floating point value, and is a good default for numeric data since it can hold values with varying numbers of decimal places.
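The difference between these numeric types can be illustrated in Python. This is a sketch by analogy only: the ranges assume signed two's-complement integers, Python's `float` is a double-precision value, and `Decimal` is used here as a stand-in for a fixed-decimal type rather than as Alteryx's implementation.

```python
from decimal import Decimal


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest values of a signed integer with the given bit width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Wider integers hold a larger range of whole numbers.
ranges = {bits: signed_range(bits) for bits in (16, 32, 64)}

# A double stores binary approximations, so some sums are slightly off.
double_sum = 0.1 + 0.2  # not exactly 0.3

# A decimal type keeps the stated number of decimal places exactly.
decimal_sum = Decimal("0.10") + Decimal("0.20")  # exactly 0.30
```

A 16-bit integer tops out at 32,767, so a field of sale amounts in the millions would need a wider integer or a Double.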
18. Data types in Alteryx
String, which stores text values.
19. Data types in Alteryx
DateTime, such as dates, times, and a combined datetime in an ISO standard format.
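As an illustration of the ISO format, the Python sketch below parses a date, a time, and a combined datetime, then converts a differently formatted date into ISO form. The sample values and the assumption that the incoming date is in month/day/year order are hypothetical.

```python
from datetime import date, datetime, time

# ISO-style text: year-month-day, and hours:minutes:seconds.
sale_date = date.fromisoformat("2023-06-15")
sale_time = time.fromisoformat("14:30:00")
sale_moment = datetime.fromisoformat("2023-06-15 14:30:00")

# A date that arrives in another layout (assumed month/day/year)
# has to be parsed with an explicit pattern before it can be standardized.
us_style = datetime.strptime("06/15/2023", "%m/%d/%Y").date()
iso_text = us_style.isoformat()  # "2023-06-15"
```

Because ISO text sorts the same way the dates do, standardizing to it also makes filtering and ordering by date more predictable.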
20. Data types in Alteryx
And Spatial, which utilizes spatial objects such as points, lines and polygons.
21. Dataset details
In this and the following chapters of the course, the hands-on exercises will feature a New York Property Sales dataset, which lists location, sale date, and sale amount, and we will discover the top 10 highest-selling properties.
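The end goal, a top 10 list by sale amount, can be sketched in Python once the data is clean. The records below are generated placeholders, not rows from the actual dataset, and the field names `location`, `sale_date`, and `sale_amount` are assumptions.

```python
import heapq


def top_n_sales(rows, n=10):
    """Return the n rows with the largest sale_amount, highest first."""
    return heapq.nlargest(n, rows, key=lambda r: r["sale_amount"])


# Placeholder records for illustration only (not real sales data).
sales = [
    {"location": f"Property {i}", "sale_date": "2023-01-01", "sale_amount": i * 100_000}
    for i in range(1, 16)
]

top_10 = top_n_sales(sales)
```

This only works if `sale_amount` is numeric, which is why stripping currency symbols and converting data types come first.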
22. Let's practice!
Now, let’s explore Data Preparation in Alteryx!