Verifying Data Load
Suppose you receive a new file each month and you know how many records and columns to expect. In this exercise, you will create a function that validates the loaded file.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a data validation function check_load() with parameters df, a dataframe; num_records, the number of records; and num_columns, the number of columns.
- Using num_records, create a check to see if the input dataframe df has the same number of records, using count().
- Compare num_columns with the number of columns the input dataframe has, by using len() on columns.
- If both of these return True, then print Validation Passed.
Hands-on interactive exercise
Try this exercise by completing this sample code.
def ____(____, ____, ____):
    # Takes a dataframe and compares record and column counts to input
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.____():
        # Check number of columns
        if num_columns == ____(df.____):
            # Success message
            message = ____
    return message

# Print the data validation message
print(check_load(df, 5000, 74))
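
One possible way to fill in the blanks is sketched below, following the instructions above. So that it runs without a Spark session, it uses StubFrame, a hypothetical stand-in class that is not part of the exercise and only mimics the two members the check relies on: count() and columns. With a real PySpark DataFrame the function body would be used unchanged.

```python
class StubFrame:
    """Hypothetical stand-in (an assumption, not from the exercise).

    It exposes count() and columns, the only two members of a
    PySpark DataFrame that check_load() touches.
    """

    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def count(self):
        # Number of records held by the stub
        return len(self._rows)


def check_load(df, num_records, num_columns):
    # Takes a dataframe and compares record and column counts to input
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.count():
        # Check number of columns
        if num_columns == len(df.columns):
            # Success message
            message = 'Validation Passed'
    return message


# Small illustrative data: 3 records, 2 columns
stub = StubFrame(rows=[(1, 'a'), (2, 'b'), (3, 'c')], columns=['id', 'label'])

print(check_load(stub, 3, 2))      # counts match
print(check_load(stub, 5000, 74))  # counts do not match
```

The first call prints Validation Passed because both counts match the stub; the second prints Validation Failed because the expected 5000 records and 74 columns do not.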