1. Filtering cases
In the previous chapter, we saw several tools to get insights into our process. However, process data in practice rarely lends itself to analysis right away.
2. Theory versus practice
There might be too many activities, or there might be cases of different types and from different time periods that we do not want to look at simultaneously. Or certain data attributes might be missing, or not stored in the right format to be used in our analysis.
3. Filter
To resolve these issues, we will look into three event data preprocessing techniques in this chapter. Firstly, we will look at different ways to filter event data, allowing us to focus the analysis.
4. Aggregate
Secondly, we will look at aggregation of events: this is useful when activity types are specified at too detailed a level, with unnecessary distinctions.
5. Enrich
Thirdly, we will look at ways to enrich event data. In the last chapter we saw how we can use data attributes in the analysis. In this chapter, we will find out how we can create these data attributes, starting from other attributes or based on specific process-related characteristics.
6. Filter dimensions
In this lesson, we'll start with filtering.
There are two dimensions along which you can take a subset of process data.
7. Case filter
The first one is to filter cases: select process instances based on some attribute, or based on a process characteristic.
8. Event filter time period
A second dimension along which to subset process data is the events themselves: which events do we want to analyze? For instance, we can look at a specific time period.
9. Event filter activity types
But you could also select specific activity types.
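To make the two dimensions concrete, here is a minimal sketch in plain Python. It is a library-independent illustration, not the tooling used in this course, and the toy event log and variable names are hypothetical.

```python
from datetime import datetime

# Hypothetical event log: one (case_id, activity, timestamp) tuple per event.
events = [
    ("c1", "register", datetime(2024, 1, 1)),
    ("c1", "check",    datetime(2024, 1, 3)),
    ("c1", "close",    datetime(2024, 2, 2)),
    ("c2", "register", datetime(2024, 1, 20)),
    ("c2", "close",    datetime(2024, 2, 10)),
]

# Case filter: keep whole cases (here, every case that contains a "check").
cases_with_check = {case for case, activity, _ in events if activity == "check"}
case_filtered = [e for e in events if e[0] in cases_with_check]

# Event filter on a time period: keep only the events recorded in January.
january_events = [
    e for e in events
    if datetime(2024, 1, 1) <= e[2] < datetime(2024, 2, 1)
]

# Event filter on activity type: keep only the "close" events.
close_events = [e for e in events if e[1] == "close"]

print(len(case_filtered), len(january_events), len(close_events))  # 3 3 2
```

Note the difference: a case filter keeps or drops complete cases, whereas an event filter can leave a case with only some of its events.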
10. Categories of Case Filters
We will first look at case filters.
From a high-level perspective, there are three categories of case filters: cases with a specific performance, cases with a specific control-flow characteristic, and cases related to a specific time frame.
11. Performance filters
Let’s start with performance. We can consider four types of criteria: the throughput time, the processing time, the idle time, but also the trace length.
Performance filters are very useful for examining the long-lasting cases to check what went wrong, or for learning from the short, well-performing cases.
12. Performance filters
Filtering cases for each of these criteria happens in the same way. The first decision we need to make is whether we want to filter in an absolute or in a relative way.
13. Filter by absolute interval
Filtering in an absolute way means configuring a time duration or trace length interval. For instance: give me all the cases with a throughput time between 5 and 10 days.
14. Filter by absolute interval
This will lead to this selection of cases.
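As a rough illustration of what such an absolute filter does, the sketch below computes the throughput time of each case and keeps the cases that fall between 5 and 10 days. It is written in plain Python under stated assumptions: the toy event log and the function names `throughput_times` and `filter_throughput_interval` are hypothetical, not the course's own functions.

```python
from datetime import datetime, timedelta

# Hypothetical event log: one (case_id, activity, timestamp) tuple per event.
events = [
    ("c1", "register", datetime(2024, 1, 1)),
    ("c1", "close",    datetime(2024, 1, 4)),   # throughput time: 3 days
    ("c2", "register", datetime(2024, 1, 2)),
    ("c2", "close",    datetime(2024, 1, 9)),   # throughput time: 7 days
    ("c3", "register", datetime(2024, 1, 3)),
    ("c3", "close",    datetime(2024, 1, 15)),  # throughput time: 12 days
]

def throughput_times(events):
    """Map each case to the time between its first and its last event."""
    first, last = {}, {}
    for case, _, ts in events:
        first[case] = min(first.get(case, ts), ts)
        last[case] = max(last.get(case, ts), ts)
    return {case: last[case] - first[case] for case in first}

def filter_throughput_interval(events, lower, upper):
    """Keep all events of cases whose throughput time lies in [lower, upper]."""
    durations = throughput_times(events)
    keep = {case for case, d in durations.items() if lower <= d <= upper}
    return [e for e in events if e[0] in keep]

selected = filter_throughput_interval(events, timedelta(days=5), timedelta(days=10))
print(sorted({case for case, _, _ in selected}))  # ['c2']
```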
15. Filter by Relative Percentage
Filtering in a relative way means configuring a percentage threshold. For instance: give me the shortest cases, until at least 50% of the event data is covered.
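The relative variant can be sketched as follows, again in plain Python with hypothetical names and data. One assumption to flag: coverage is measured here as the share of cases; measuring the share of events instead would only change how the running total is computed.

```python
# Hypothetical throughput times in days, one entry per case.
throughput_days = {"c1": 3, "c2": 7, "c3": 12, "c4": 1, "c5": 30}

def shortest_cases(durations, percentage):
    """Add cases from shortest to longest until the target share is reached."""
    ordered = sorted(durations, key=durations.get)
    selected = []
    for case in ordered:
        # Stop as soon as the cases selected so far cover the target share.
        if len(selected) / len(durations) >= percentage:
            break
        selected.append(case)
    return selected

print(shortest_cases(throughput_days, 0.5))  # ['c4', 'c1', 'c2']
```

With five cases, two cases cover only 40%, so a third case is added to reach at least 50%.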
16. Adjusting filter configurations
We can then further play with the configurations: adding reverse = TRUE will negate the filter condition. Changing one of the interval boundaries to NA will create a half-open interval.
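The effect of these two options can be mimicked in a small plain-Python sketch. The function `filter_interval` and its data are hypothetical; `None` stands in for `NA`, and the `reverse` argument mirrors the `reverse = TRUE` setting mentioned above.

```python
# Hypothetical throughput times in days, one entry per case.
throughput_days = {"c1": 3, "c2": 7, "c3": 12, "c4": 1, "c5": 30}

def filter_interval(durations, lower=None, upper=None, reverse=False):
    """Select cases with lower <= duration <= upper.

    A bound of None (standing in for NA) leaves that side of the interval
    open; reverse=True returns the cases that do NOT match instead.
    """
    def matches(duration):
        above = lower is None or duration >= lower
        below = upper is None or duration <= upper
        return above and below

    return sorted(c for c, d in durations.items() if matches(d) != reverse)

print(filter_interval(throughput_days, 5, 10))                # ['c2']
print(filter_interval(throughput_days, 5, 10, reverse=True))  # ['c1', 'c3', 'c4', 'c5']
print(filter_interval(throughput_days, 5, None))              # ['c2', 'c3', 'c5']
```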
17. Control-flow filters
Filtering on control-flow characteristics requires a different set of configurations.
Here, we can distinguish four filters. The first is based on the presence or absence of one or more activities. The second is based on the presence or absence of a set of succession, or precedence, constraints. The third is based on the end points: the first and/or last activity in the case. Finally, we can also filter on the frequency of the trace, again in both absolute and relative ways.
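The four control-flow criteria can be sketched on a handful of traces. This is an illustrative plain-Python version with hypothetical function names and data; only the simplest form of each filter is shown (a single activity, a single directly-follows constraint, an absolute frequency threshold).

```python
from collections import Counter

# Hypothetical traces: the ordered activities of each case.
traces = {
    "c1": ["register", "check", "approve", "close"],
    "c2": ["register", "check", "reject", "close"],
    "c3": ["register", "approve"],
    "c4": ["register", "check", "approve", "close"],
}

def with_activity(traces, activity):
    """Cases in which the activity occurs at least once."""
    return sorted(c for c, t in traces.items() if activity in t)

def directly_follows(traces, antecedent, consequent):
    """Cases in which consequent occurs immediately after antecedent."""
    return sorted(c for c, t in traces.items()
                  if (antecedent, consequent) in zip(t, t[1:]))

def with_endpoints(traces, start, end):
    """Cases that begin with start and finish with end."""
    return sorted(c for c, t in traces.items() if t[0] == start and t[-1] == end)

def with_trace_frequency(traces, min_count):
    """Cases whose exact activity sequence occurs at least min_count times."""
    counts = Counter(tuple(t) for t in traces.values())
    return sorted(c for c, t in traces.items() if counts[tuple(t)] >= min_count)

print(with_activity(traces, "reject"))               # ['c2']
print(directly_follows(traces, "check", "approve"))  # ['c1', 'c4']
print(with_endpoints(traces, "register", "close"))   # ['c1', 'c2', 'c4']
print(with_trace_frequency(traces, 2))               # ['c1', 'c4']
```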
18. Time filters
Finally, filtering on time can be done by specifying a time interval and a filter method. This method determines whether to select cases that started in the interval, completed in it, were fully contained in it, or merely intersected with it.
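The difference between these methods is easiest to see side by side. Below is a minimal plain-Python sketch; the function `filter_time_period`, the method labels, and the case data are hypothetical and chosen only to mirror the four options described above.

```python
from datetime import date

# Hypothetical cases with the dates of their first and last event.
case_spans = {
    "c1": (date(2024, 1, 5),   date(2024, 1, 20)),  # entirely inside January
    "c2": (date(2023, 12, 20), date(2024, 1, 10)),  # starts before, ends inside
    "c3": (date(2024, 1, 25),  date(2024, 2, 5)),   # starts inside, ends after
    "c4": (date(2024, 2, 10),  date(2024, 2, 12)),  # entirely outside
}

def filter_time_period(spans, lower, upper, method):
    """Select cases relative to the interval [lower, upper]."""
    def keep(start, end):
        if method == "start":
            return lower <= start <= upper
        if method == "complete":
            return lower <= end <= upper
        if method == "contained":
            return lower <= start and end <= upper
        if method == "intersecting":
            return start <= upper and end >= lower
        raise ValueError(f"unknown method: {method}")

    return sorted(c for c, (s, e) in spans.items() if keep(s, e))

jan_start, jan_end = date(2024, 1, 1), date(2024, 1, 31)
for method in ("start", "complete", "contained", "intersecting"):
    print(method, filter_time_period(case_spans, jan_start, jan_end, method))
# start ['c1', 'c3']
# complete ['c1', 'c2']
# contained ['c1']
# intersecting ['c1', 'c2', 'c3']
```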
19. Let's practice!
Let's try some examples!