Verifying Data Load
Suppose you receive a new file each month and know how many records and columns to expect. In this exercise, you will create a function that validates the loaded file against those expected counts.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a data validation function check_load() with parameters df (a dataframe), num_records (the number of records), and num_columns (the number of columns).
- Using num_records, create a check to see whether the input dataframe df has the same number of records, using count().
- Compare the number of columns the input dataframe has with num_columns by using len() on columns.
- If both of these checks return True, then print Validation Passed.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def ____(____, ____, ____):
    # Takes a dataframe and compares record and column counts to input
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.____():
        # Check number of columns
        if num_columns == ____(df.____):
            # Success message
            message = ____
    return message

# Print the data validation message
print(check_load(df, 5000, 74))
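
For reference, here is one possible completion of the template, offered as a minimal sketch rather than the course's official solution. So that it runs without a Spark session, it is exercised against FakeFrame, a hypothetical stand-in class (not part of PySpark) that mimics only the two members the function touches: the count() method and the columns attribute. The sample rows and column names are made up for illustration.

```python
def check_load(df, num_records, num_columns):
    """Compare a dataframe's record and column counts to expected values."""
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.count():
        # Check number of columns
        if num_columns == len(df.columns):
            # Success message
            message = 'Validation Passed'
    return message


class FakeFrame:
    """Hypothetical stand-in for a PySpark DataFrame (illustration only)."""

    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns  # list of column names, like DataFrame.columns

    def count(self):
        # Number of rows, like DataFrame.count()
        return len(self._rows)


# Made-up data: 3 records, 2 columns
demo = FakeFrame(rows=[(1, 'a'), (2, 'b'), (3, 'c')], columns=['id', 'label'])

print(check_load(demo, 3, 2))       # counts match
print(check_load(demo, 5000, 74))   # counts do not match
```

With a real PySpark dataframe the function body stays the same. Note that count() is an action that triggers a Spark job, so on large files this check does real work, whereas reading columns only inspects the schema.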