1. Filtering events
Sometimes, we do not want to select cases as a whole, but rather parts of them. In these situations, we need event filters.
2. Categories of event filters
We will discuss four types of event filters:
- Trim filters, which trim the heads and tails of cases,
- Frequency filters, which filter events based on how often their activity type or resource occurs,
- Label filters, which use the activity or resource label to select events, and
- General attribute filters, which select events using any condition on the data attributes.
3. Trim to time period
A first possibility is to focus the analysis on a part of the process flow by trimming the heads and tails of cases. Trimming can be done either by defining a time period or by defining the desired endpoints of cases.
Trimming by time period is done in the same way as the time period filter for cases, except that we now use the trim filter method. This retains only the parts of the cases that fall within the time period, if any, and discards everything outside of it.
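To make the effect of trimming to a time period concrete, here is a minimal sketch in plain Python. It is an illustration of the logic only, not the filter used in the course; the toy event log, the `trim_to_period` helper, and the dates are all hypothetical.

```python
from datetime import datetime

# Hypothetical toy event log: (case id, activity, timestamp)
events = [
    ("c1", "register", datetime(2018, 1, 5)),
    ("c1", "review",   datetime(2018, 1, 20)),
    ("c1", "approve",  datetime(2018, 2, 3)),
    ("c2", "register", datetime(2018, 2, 10)),
    ("c2", "review",   datetime(2018, 2, 15)),
]

def trim_to_period(log, start, end):
    """Keep only the events whose timestamp lies within [start, end]."""
    return [event for event in log if start <= event[2] <= end]

# Trim every case to January 2018
trimmed = trim_to_period(events, datetime(2018, 1, 1), datetime(2018, 1, 31))
print([(case, activity) for case, activity, _ in trimmed])
```

Case c1 keeps only the part that falls in January, while case c2 has no events in the period and therefore disappears from the result.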
4. Trim to start and end points
Instead of using time, we can also trim by activities. The trim filter does just this. You specify one or more desired start and/or end points of cases, and the filter makes sure that each case in the result adheres to these criteria.
For instance, we can configure a set of viable start activities. Let's say we want cases to start with a blue activity.
5. Trim to start and end points
The filter will go through each case from the beginning and discard every event until it finds one of the start activities we listed.
Then, it will do the same for the tail of the case. Let's say we want the green activities as end points.
6. Trim to start and end points
If it cannot find a viable start or end point, the whole case is discarded.
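The trimming procedure just described can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the actual trim filter; the `trim_case` helper and the colored example cases are hypothetical, and we assume here that the end point must not lie before the start point.

```python
def trim_case(activities, start_activities, end_activities):
    """Return the trimmed part of one case, or None if it has no viable endpoints."""
    # Walk from the beginning until we hit one of the allowed start activities
    start = next(
        (i for i, a in enumerate(activities) if a in start_activities), None
    )
    if start is None:
        return None
    # Walk backwards from the end until we hit one of the allowed end activities
    # (assumption: the end point may not come before the start point)
    end = next(
        (i for i in range(len(activities) - 1, start - 1, -1)
         if activities[i] in end_activities),
        None,
    )
    if end is None:
        return None
    return activities[start:end + 1]

# Hypothetical cases, with activities named after colors
cases = {
    "c1": ["red", "blue", "orange", "green", "purple"],
    "c2": ["blue", "red", "green", "blue", "green", "red"],
    "c3": ["red", "orange", "purple"],
}

trimmed = {}
for case_id, activities in cases.items():
    result = trim_case(activities, {"blue"}, {"green"})
    if result is not None:  # cases without viable endpoints are dropped
        trimmed[case_id] = result

print(trimmed)
```

Case c3 contains neither a blue nor a green activity, so it is dropped entirely.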
Trimming cases like this is very useful to focus on specific stages in the process. For instance, in a hospital, we can see the journey patients go through between two specific treatments.
For both trim filters, we can again set the reverse argument, in which case the returned data includes just the heads and tails of the cases, that is, the parts that would otherwise be discarded.
7. Filter by frequencies
Secondly, we can also decide to look only at the most common parts of the process. Here, we can define “common” in terms of the frequency of activity labels, but also the frequency of actors.
In both cases, we can use the same configurations we saw before when looking at the performance filter: we can choose between an absolute frequency interval and a percentage threshold.
For example, we can filter the event log by retaining all activities which occur between 50 and 100 times. Or we can filter the event log by selecting the most frequent activity types until we have covered at least 80% of the event log.
Here too, we can adjust the configuration using half-open intervals or the reverse argument.
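The two configurations can be sketched as follows in plain Python. This is a simplified illustration, not the frequency filter itself; the helper names `by_interval` and `by_percentage` and the small log are hypothetical, and `None` stands in for the open side of a half-open interval.

```python
from collections import Counter

# Hypothetical log of (case id, activity): A occurs 5 times, B 3 times, C and D once
log = [("c1", "A"), ("c1", "B"), ("c1", "A"), ("c2", "A"), ("c2", "B"),
       ("c2", "C"), ("c3", "A"), ("c3", "B"), ("c3", "A"), ("c3", "D")]

def by_interval(log, low=None, high=None):
    """Keep events of activities whose absolute frequency lies in [low, high]."""
    counts = Counter(activity for _, activity in log)
    keep = {a for a, n in counts.items()
            if (low is None or n >= low) and (high is None or n <= high)}
    return [event for event in log if event[1] in keep]

def by_percentage(log, threshold):
    """Keep the most frequent activities until `threshold` of all events is covered."""
    counts = Counter(activity for _, activity in log)
    total = sum(counts.values())
    keep, covered = set(), 0
    for activity, n in counts.most_common():
        if covered / total >= threshold:
            break
        keep.add(activity)
        covered += n
    return [event for event in log if event[1] in keep]

print(by_interval(log, 2, 4))    # activities occurring between 2 and 4 times
print(by_interval(log, low=3))   # half-open interval: at least 3 times
print(by_percentage(log, 0.8))   # most frequent activities covering 80% of events
```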
8. Filter by labels
Instead of defining a filter on frequencies, we can also filter directly on the exact labels of activities and resources. For example, suppose we only want the red, orange, and purple activities.
The reverse argument here allows us to very easily deselect a set of labels.
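As a small illustration of selecting and deselecting labels, consider the plain Python sketch below. It only mimics the idea of a label filter with a reverse option; the `filter_labels` helper and the colored activities are hypothetical.

```python
# Hypothetical log of (case id, activity), with activities named after colors
log = [("c1", "red"), ("c1", "blue"), ("c1", "orange"),
       ("c2", "green"), ("c2", "purple"), ("c2", "red")]

def filter_labels(log, labels, reverse=False):
    """Keep events whose activity is in `labels`; with reverse=True, drop them instead."""
    labels = set(labels)
    return [event for event in log if (event[1] in labels) != reverse]

selected = filter_labels(log, ["red", "orange", "purple"])
deselected = filter_labels(log, ["red", "orange", "purple"], reverse=True)

print(selected)
print(deselected)
```

With the reverse option set, only the blue and green events remain.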
9. Filter by conditions
In an even more general fashion, the well-known dplyr filter can be used to filter the event data based on any of the attributes which are available in the data. In this case, any logical condition or combination of logical conditions can be used for subsetting.
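The idea of subsetting on arbitrary conditions can be sketched in plain Python as well. This is not dplyr; the `filter_events` helper and the `resource` and `cost` attributes are hypothetical and only serve to show how logical conditions on attributes can be combined.

```python
# Hypothetical events carrying extra data attributes
events = [
    {"case": "c1", "activity": "register", "resource": "r1", "cost": 50},
    {"case": "c1", "activity": "review",   "resource": "r2", "cost": 200},
    {"case": "c2", "activity": "review",   "resource": "r1", "cost": 150},
    {"case": "c2", "activity": "approve",  "resource": "r1", "cost": 80},
]

def filter_events(log, condition):
    """Keep the events for which the logical condition holds."""
    return [event for event in log if condition(event)]

# Combine two conditions: handled by resource r1 and costing more than 60
selected = filter_events(
    events, lambda e: e["resource"] == "r1" and e["cost"] > 60
)
print([(e["case"], e["activity"]) for e in selected])
```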
10. Let's practice!
In the following exercises, we'll see how to use these filters in our HR process.