Get startedGet started for free

Preparing and cleansing fields

1. Preparing and cleansing fields

Before we can start our data preparation journey in Alteryx, we need some data. Let’s load the FIFA World Cup Matches dataset, containing games from 1930 to 2014, team names, number of goals, and cities. Once we’ve input the data, we can start reviewing the data by adding a Browse tool and clicking Run. In this and other chapter screencasts, we will explore FIFA World Cup data to see the highest attended game, and learn more about home vs. away scores. We can immediately see in the Results window for the Browse tool that there are three records at the top of the dataset that appear to be subheaders. We won’t need those, so we will be removing them. Also, looking at the City field, you can see by the red color-coding the records are “Not OK” due to trailing whitespace, which is seen when you hover over the red triangles. As we scroll past, we notice the fields of Assistant 1 and Assistant 2. These are not needed, so we will remove those from the data. Lastly, we’ll inspect the fields of Home Team Goals, Away Team Goals, and other fields containing numeric data. Select the MetaData button and we can see that the fields are currently in a string data type. Click on the Data button to return to the data view in the Results window. To start the data cleansing process, we will add a Select tool from the Favorites toolbar. Within this tool, we can rename, change data types and sizes, and remove fields. To rename the column of City to Location, click the Rename cell next to City, and type “Location”. We want to remove the Assistant 1 and Assistant 2 fields from the data stream, so we easily uncheck the box next to each. Lastly, we want to change many of the numeric fields such as “Home Team Goals”, “Away Team Goals”, “Attendance”, and “Current Year” to a Double data type to perform numeric functions. We will modify the DateTime field to a true DateTime format as well. At the bottom, there is an option for “Unknown” values. Keeping this checked will allow new fields added upstream to be included in the data stream, making the workflow truly dynamic. If the box is unchecked, the fields are locked and new fields added upstream won’t pass through. The data type drop-down options start with Boolean, then numeric types Byte to Double. String types are String to V_WString, then DateTime, and lastly, Spatial. To learn more, click in the search bar and type “data types” to bring up the Alteryx Help file for that topic. Now add a Browse Tool, click Run, and view our updated data. Next, clean the trailing whitespace in the newly renamed Location field by adding a Data Cleansing tool. In this case, we’ll select all fields by clicking All. Replacing nulls with blanks for string fields and zeroes for numeric fields is also a key default feature. The Stage field has a leading asterisk, so we will clean the punctuation as well. Here is where you remove all whitespace and you can even modify case. We want to remove those first three subheader records, so we choose the Select Records tool from the Preparation toolset, allowing us to choose the number of records to return. In this workflow, we want row 4 onwards, so we would type “4” and then a plus “+”. To quickly add a unique identifier, we can utilize the Record ID tool, keeping all default settings, which will add a “Record ID” field as the first column. Now that we’ve cleaned the data, we want to discover which match had the highest attendance. To do this, we add a Sort Tool and sort on the Attendance field in a Descending manner. Now add a Browse tool and click Run. The highest attendance was an amazing 173,850, played in 1950 between Uruguay and Brazil in Rio de Janeiro! Let’s practice with some exercises using New York City property data!

2. Let's practice!

Congratulations on creating a workflow based on the Data Preparation toolset, and now let’s practice with some exercises using New York City property data!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.