The data set of the PDC 2020 is generated from a single base model that allows for the following characteristics to be configured:
A: Dependent tasks
If Yes, then all transitions that bypass the dependent tasks are disabled.
B: Loops
0=No, 1=Simple, 2=Complex.
If No, then all transitions that start a loop are disabled. If Simple, then all transitions that are a shortcut between the loop and the main flow are disabled.
C: OR constructs
If No, then all transitions that only take some inputs for an OR-join and all transitions that generate only some outputs for an OR-split are disabled.
D: Routing constructs
If Yes, then some transitions are made invisible.
E: Optional tasks
If Yes, then some invisible transitions are added to allow skipping of some (visible) transitions.
F: Duplicate tasks
If Yes, then some transitions are relabeled to existing labels.
G: Noise
If Yes, then noise is introduced in approximately 1 out of 5 traces. Noise is introduced by deleting one event (40%), moving one event in the trace (20%), or copying one event in the trace (40%).
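The noise procedure can be sketched in a few lines of Python. This is only an illustration, not the generator used for the contest: the function names `add_noise` and `noisy_log` are hypothetical, and the uniform choice of the affected event and of the target position is an assumption, since the description above gives only the three operations and their proportions.

```python
import random

def add_noise(trace, rng):
    """Return a copy of `trace` with one noise operation applied."""
    noisy = list(trace)
    if not noisy:
        return noisy
    # Proportions from the description: delete 40%, move 20%, copy 40%.
    op = rng.choices(["delete", "move", "copy"], weights=[40, 20, 40])[0]
    i = rng.randrange(len(noisy))  # assumption: event picked uniformly
    if op == "delete":
        del noisy[i]
    elif op == "move":
        event = noisy.pop(i)
        # Assumption: uniform target position (may equal the old one).
        noisy.insert(rng.randrange(len(noisy) + 1), event)
    else:  # copy
        noisy.insert(rng.randrange(len(noisy) + 1), noisy[i])
    return noisy

def noisy_log(traces, noise_rate=0.2, seed=0):
    """Apply noise to roughly `noise_rate` of the traces (1 out of 5 by default)."""
    rng = random.Random(seed)
    return [add_noise(t, rng) if rng.random() < noise_rate else list(t)
            for t in traces]
```

Calling `noisy_log(traces, noise_rate=0.5)` would correspond to the rate of approximately 1 out of 2 traces.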
Every training event log (pdc_2020_ABCDEFG.xes) is generated randomly from the (configured) base model, and contains 1000 traces. As an example, the event log pdc_2020_1211111.xes is the most complex training log for the model with dependent tasks, with complex loops, with OR constructs, with routing constructs, with optional tasks, with duplicate tasks, and with noise.
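As a small sketch of the naming scheme, the following Python decodes the seven digits of a log name into the configuration and enumerates all names. The helper names are hypothetical. Only the values for loops (0=No, 1=Simple, 2=Complex) are stated explicitly; 0=No and 1=Yes for the other characteristics is an assumption, though it agrees with the example name and with the total of 2 × 3 × 2⁵ = 192 logs.

```python
from itertools import product

# (letter, name, {digit: meaning}), in the order ABCDEFG of the file name.
# Assumption: every characteristic except B uses 0=No, 1=Yes.
CHARACTERISTICS = [
    ("A", "Dependent tasks", {0: "No", 1: "Yes"}),
    ("B", "Loops", {0: "No", 1: "Simple", 2: "Complex"}),
    ("C", "OR constructs", {0: "No", 1: "Yes"}),
    ("D", "Routing constructs", {0: "No", 1: "Yes"}),
    ("E", "Optional tasks", {0: "No", 1: "Yes"}),
    ("F", "Duplicate tasks", {0: "No", 1: "Yes"}),
    ("G", "Noise", {0: "No", 1: "Yes"}),
]

def decode_log_name(filename):
    """Map e.g. 'pdc_2020_1211111.xes' to {letter: (name, value)}."""
    code = filename.rsplit(".", 1)[0].rsplit("_", 1)[1]
    if len(code) != len(CHARACTERISTICS):
        raise ValueError(f"expected 7 digits, got {code!r}")
    config = {}
    for digit, (letter, name, values) in zip(code, CHARACTERISTICS):
        value = int(digit)
        if value not in values:
            raise ValueError(f"invalid value {value} for {letter}")
        config[letter] = (name, values[value])
    return config

def all_log_names():
    """Enumerate the file names of all configurations."""
    ranges = [sorted(values) for _, _, values in CHARACTERISTICS]
    return ["pdc_2020_" + "".join(map(str, combo)) + ".xes"
            for combo in product(*ranges)]
```

For example, `decode_log_name("pdc_2020_1211111.xes")["B"]` gives `("Loops", "Complex")`, and `len(all_log_names())` is 192.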
In addition to these 192 training event logs, 192 matching test event logs are generated and classified using the correct models into 192 score logs. These event logs also contain 1000 traces each, of which approximately 1 out of 2 contains noise; additional noise may be introduced to check whether the discovered model has correctly discovered dependent tasks. In the score log, the additional boolean “pdc:isPos” attribute denotes whether the trace is positive (fitting, true) or negative (non-fitting, false).
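A minimal sketch of how the labels in a score log could be read and compared with one's own classification, using only the Python standard library. It assumes that “pdc:isPos” is stored as a trace-level XES boolean attribute (`<boolean key="pdc:isPos" value="true"/>`); the function names and the inline sample log are hypothetical, and plain accuracy is used purely for illustration, not as the contest's scoring rule.

```python
import io
import xml.etree.ElementTree as ET

def _local(tag):
    """Strip an XML namespace prefix such as '{http://...}trace'."""
    return tag.rsplit("}", 1)[-1]

def read_labels(xes_source):
    """Return one bool per trace (None if the attribute is missing).

    `xes_source` is a file name or a binary file object.
    """
    labels = []
    for _, elem in ET.iterparse(xes_source, events=("end",)):
        if _local(elem.tag) != "trace":
            continue
        is_pos = None
        for child in elem:  # direct children only: trace-level attributes
            if _local(child.tag) == "boolean" and child.get("key") == "pdc:isPos":
                is_pos = child.get("value") == "true"
        labels.append(is_pos)
        elem.clear()  # free memory for large logs
    return labels

def accuracy(predicted, labels):
    """Fraction of traces whose predicted label matches the score log."""
    if len(predicted) != len(labels):
        raise ValueError("length mismatch")
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)

# Tiny hypothetical score log with one positive and one negative trace.
SAMPLE = b"""<log xmlns="http://www.xes-standard.org/">
  <trace>
    <boolean key="pdc:isPos" value="true"/>
    <event><string key="concept:name" value="a"/></event>
  </trace>
  <trace>
    <boolean key="pdc:isPos" value="false"/>
    <event><string key="concept:name" value="b"/></event>
  </trace>
</log>"""

sample_labels = read_labels(io.BytesIO(SAMPLE))
```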
No event logs are disclosed for the automated contest.
For the manual contest, only the most complex event log (pdc_2020_1211111.xes) is disclosed. After the contest concluded, all event logs were disclosed.
Disclosed data set
The data set of the PDC 2020 has now been disclosed and is available for download. It contains the following four folders (in ZIP archives):
- Ground Truth Logs: The test logs (.xes format) as classified by the corresponding models.
- Models: The original workflow nets (.pnml format) used to generate the logs.
- Test Logs: The logs (.xes format) to be classified using the models that the submitted algorithm discovered from the training logs.
- Training Logs: The logs (.xes format) from which the submitted algorithm discovers the models.