1. Logging
In this lesson we learn the basics of logging,
as well as how to avoid a stealth loss of CPU when logging Spark actions.
2. Logging primer
If you are creating Spark jobs that will be deployed to run as production batch jobs, then logging is a best practice to know. In PySpark it is especially important if you are using log statements to inspect dataframes. First, a brief primer on logging is in order. Here is a basic logging setup. The level argument of the basicConfig() function sets the logging level to INFO, and the format argument has each log message print the time, the log level of the statement, and the message itself.
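Here is a minimal sketch of such a setup; the message text is illustrative rather than the slide's exact wording:

```python
import logging

# Show INFO and above; print the time, the level, and the message for each record.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

logging.info("Hello %s!", "world")    # INFO meets the threshold, so this prints
logging.debug("Hello, take %d!", 2)   # DEBUG is below the threshold, so this does not
```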
Note that the first statement printed, whereas the second one did not. This is because the second log statement is at DEBUG level, which falls below the INFO threshold we configured.
3. Logging with DEBUG level
This time, we set the log level to DEBUG.
This means that the second log statement will display.
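A sketch of the same setup with the threshold lowered; note that `basicConfig()` only takes effect once per process, so run this in a fresh interpreter:

```python
import logging

# Show DEBUG and above this time.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

logging.info("Hello %s!", "world")    # still prints
logging.debug("Hello, take %d!", 2)   # now prints as well
```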
4. Debugging lazy evaluation
Due to the combination of Spark's lazy evaluation and distributed computation, debugging a complex application can be challenging. This is because an erroneous transformation might not be evaluated until an action triggers it, well downstream of where the transformation is coded. One way to simplify the debugging of an application is to force actions to be triggered earlier. However, a naive approach to doing this can cause a stealth loss of CPU.
5. A simple timer
We'll use a simple timer to demonstrate the stealth loss of CPU. Each time it reports, it prints a sequence number and the elapsed time, then resets its clock.
The timer provides a method `elapsed()`, which prints the elapsed time. Here is an example of using this simple timer. Immediately after creating it, it shows an elapsed time of 0. After performing an operation that takes two seconds, it shows an elapsed time of two seconds. We can zero the timer using `reset()`.
6. class timer
For concreteness, here is the code for the simple timer we are using.
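The following is a sketch that matches the behavior described above; the exact print format is an assumption:

```python
import time

class timer:
    """A simple stopwatch for revealing hidden Spark work."""

    def __init__(self):
        self.step = 0
        self.start_time = time.time()

    def elapsed(self):
        # Print a sequence number and the time since the last reset, then restart.
        self.step += 1
        print("%d. elapsed: %.1f sec" % (self.step, time.time() - self.start_time))
        self.reset()

    def reset(self):
        # Zero the timer.
        self.start_time = time.time()

# Example usage matching the narration:
t = timer()
t.elapsed()      # 1. elapsed: 0.0 sec
time.sleep(2)    # stand-in for an operation that takes two seconds
t.elapsed()      # 2. elapsed: 2.0 sec
t.reset()        # zero the timer explicitly
```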
7. Stealth CPU wastage
Here is a naive approach to logging the size of a dataframe only when the DEBUG log level is enabled. Carefully inspect what is happening here. First, we import the logging module. The next line configures the logger, setting the log level to INFO. Next, we create a dataframe; at this point no action has been performed on it yet. Then we create a timer. Next, we print an info log statement and print the elapsed time. Then we print a debug log statement that contains a dataframe action, and immediately afterward we print the elapsed time again.
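A sketch of that sequence, reusing the `timer` class from the previous slide; the CSV path is a placeholder:

```python
import logging
from pyspark.sql import SparkSession

# Configure logging at INFO level, so debug messages will not be displayed.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s - %(message)s")

spark = SparkSession.builder.getOrCreate()

# Create a dataframe -- no action has been performed on it yet.
df = spark.read.csv("/path/to/large_dataset.csv", header=True)  # placeholder path

t = timer()
logging.info("No action here.")
t.elapsed()   # cheap: nothing forces Spark to do any work

# The message is suppressed, but df.count() is evaluated anyway, because the
# arguments are computed before logging decides whether to emit the record.
logging.debug("df has %d rows.", df.count())
t.elapsed()   # reveals the cost of the hidden count()
```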
The info log statement prints, as desired. The debug log statement does not print, also as desired. However, note that the elapsed time measured just after the debug statement is 2.0 seconds -- that is not desired. This is because the debug log statement contains a dataframe action, df.count(), and that action is triggered even though the debug statement is not printed. Although the debug statement is below the log level threshold, its arguments are always evaluated before logging decides whether to emit the message. As a result, the dataframe action is performed even when the log message is not displayed. For most applications, the cost of evaluating the expressions passed into a logging statement is insignificant. However, a count operation on a dataframe can be extremely costly. Fortunately, it is easy to correct.
8. Disable actions
Here is a simple solution: provide a means to disable log statements that trigger an action.
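One possible shape for that guard, continuing the previous sketch; the ENABLED flag follows the next slide's naming:

```python
# Flip to True only when you are willing to pay for dataframe actions.
ENABLED = False

t = timer()
logging.info("No action here.")
t.elapsed()

if ENABLED:
    logging.info("df has %d rows.", df.count())
t.elapsed()   # stays near zero: the count() is never triggered
```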
A better solution may exist, but this one illustrates the key idea: allow the dataframe actions you need during development or debugging, while being confident that they can be disabled in production and will not run silently.
9. Enabling actions
Setting ENABLED = True enables the guarded log statements again. The takeaway is that actions on large dataframes can be costly to compute. We need to be careful about including Spark operations that trigger an action in log statements.
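Continuing the sketch, toggling the flag re-enables the guarded statement, and the timer now shows the cost of the action:

```python
ENABLED = True   # e.g., while developing or debugging

t.reset()
if ENABLED:
    logging.info("df has %d rows.", df.count())   # the count() now runs
t.elapsed()   # elapsed time reflects the cost of the action
```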
10. Let's practice!
Let's practice what we just learned about logging.