Call for Discovery Algorithms for PDC 2021

PDC 2021 is the 2021 edition of the Process Discovery Contest.

Do you believe you have implemented a good discovery algorithm? Then submit it to the PDC 2021 to put it to the test!

This test contains 480 event logs, for which a model needs to be discovered (training logs), and 96 event logs that need to be classified using the discovered models (test logs).

The 96 test logs are all generated from the same configurable model, which was inspired by a model discovered from a real-life event log and has the following configurable options:

  • Long-term dependencies: yes/no
  • Loops: no/simple/complex
    • A simple loop has a single point of entry and a single point of exit.
    • A complex loop has multiple points of entry and/or multiple points of exit.
  • OR constructs: yes/no
  • Routing constructs: yes/no
  • Optional tasks: yes/no
  • Duplicate tasks: yes/no
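
The options above allow 2 × 3 × 2 × 2 × 2 × 2 = 96 combinations, which equals the number of test logs and suggests one test log per combination. A minimal Python sketch that enumerates the combinations follows; the option keys are illustrative names, not identifiers used by the contest.

```python
from itertools import product

# Option names are illustrative; the values mirror the list above.
OPTIONS = {
    "long_term_dependencies": ["yes", "no"],
    "loops": ["no", "simple", "complex"],
    "or_constructs": ["yes", "no"],
    "routing_constructs": ["yes", "no"],
    "optional_tasks": ["yes", "no"],
    "duplicate_tasks": ["yes", "no"],
}


def configurations():
    """Return every combination of option values as a list of dicts."""
    keys = list(OPTIONS)
    return [dict(zip(keys, values)) for values in product(*OPTIONS.values())]


configs = configurations()
print(len(configs))  # number of distinct configurations
```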

Each test log is matched by five training logs:

  1. A training log without noise.
  2. A training log where in every trace with probability 20% one random event is removed.
  3. A training log where in every trace with probability 20% one random event is moved to a random position.
  4. A training log where in every trace with probability 20% one random event is copied to a random position.
  5. A training log where in every trace with probability 20% either one random event is removed (40%), moved (20%), or copied (40%).
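
The noise types above can be sketched in Python as follows. This is an illustration only, not the generator used for the contest: a trace is assumed to be a plain list of activity names, the function name `add_noise` is hypothetical, and the way random positions are drawn (for example, whether a moved event may land back on its original position) is an assumption.

```python
import random


def add_noise(trace, rng, kind="mixed", probability=0.2):
    """Return a copy of trace that, with the given probability, has one
    random event removed, moved, or copied.

    kind is "remove", "move", "copy", or "mixed"; "mixed" picks
    remove/move/copy with weights 40/20/40.
    """
    noisy = list(trace)
    if not noisy or rng.random() >= probability:
        return noisy  # trace left untouched
    if kind == "mixed":
        kind = rng.choices(["remove", "move", "copy"], weights=[40, 20, 40])[0]
    index = rng.randrange(len(noisy))
    if kind == "remove":
        del noisy[index]
    elif kind == "move":
        event = noisy.pop(index)
        noisy.insert(rng.randrange(len(noisy) + 1), event)
    elif kind == "copy":
        noisy.insert(rng.randrange(len(noisy) + 1), noisy[index])
    else:
        raise ValueError("unknown noise kind: " + kind)
    return noisy


rng = random.Random(2021)
print(add_noise(["a", "b", "c", "d"], rng, kind="mixed"))
```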

Each training log contains 1000 traces that result from random walks through the configured model. Each test log contains 250 unique traces, of which 125 are positive (fit the configured model) and 125 are negative (do not fit the configured model).

Unlike the PDC 2020, the PDC 2021 only contains an automated contest.

How, what and when to submit?

Please let Eric Verbeek know that you want to submit your implemented discovery algorithm. Eric will then provide you with a link where you can upload your submission.

You should submit a working discovery algorithm, which can be called using a “Discover.bat” Windows batch file which takes two parameters:

  1. The full path to the training log file, excluding the “.xes” extension.
  2. The full path to the model file where the discovered model should be stored, excluding any extension like “.pnml”, “.bpmn”, or “.lsk”.

As an example, assuming that the miner discovers a Petri net, “Discover.bat logs\discovery\discovery-log models\discovered-model” will discover a model from the training log file “logs\discovery\discovery-log.xes” and will export the discovered Petri net to the file “models\discovered-model.pnml”.
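
How the two parameters map to file names can be sketched in Python. This only illustrates the naming convention: the function name `resolve_paths` and the default “.pnml” extension are assumptions, and an actual submission may be written in any language that the batch file can call.

```python
def resolve_paths(log_base, model_base, model_extension=".pnml"):
    """Append the extensions that the two batch-file parameters omit.

    Plain string concatenation is used so that dots inside a base name
    are left untouched.
    """
    return log_base + ".xes", model_base + model_extension


log_file, model_file = resolve_paths(
    r"logs\discovery\discovery-log", r"models\discovered-model"
)
print(log_file)
print(model_file)
```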

If the result of calling your “Discover.bat” file as described above is a PNML file (Petri net), a BPMN file (BPMN diagram), or an LSK file (log skeleton), then you’re done. If not, the discovery algorithm needs to come with its own working classifier algorithm, that is, a “Classify.bat” Windows batch file, which takes three parameters:

  1. The full path to the test log file, excluding the “.xes” extension.
  2. The full path to the model file which should be used to classify the test log, excluding any extension like “.pnml”, “.bpmn”, or “.lsk”.
  3. The full path to the log file where the classified test log should be stored, excluding the “.xes” extension.

As an example, assuming that the miner discovered a Petri net, “Classify.bat logs\test\test-log models\discovered-model logs\classified\test-log” will classify every trace from the test log “logs\test\test-log.xes” using the Petri net from “models\discovered-model.pnml” and will export the classified traces (by adding the “pdc:isPos” attribute to every trace) as an event log in “logs\classified\test-log.xes”.

Classification of a trace is done by adding the boolean “pdc:isPos” attribute to the trace, which should be true if the trace is classified positive (fits your model) and false if the trace is classified negative (does not fit your model).
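
A minimal Python sketch of this step is shown below, using only the standard library. It assumes the usual XES layout, where a trace is a “trace” element, its events are “event” children, and the activity name is stored under the “concept:name” key. The function names and the `fits` callback are hypothetical, and the callback stands in for whatever conformance check your own model needs.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def classify_log(root, fits):
    """Set the boolean pdc:isPos attribute on every trace below root.

    fits is a callable that takes the list of activity names of a trace
    and returns True if the trace fits the model.
    """
    traces = [e for e in root.iter() if local_name(e.tag) == "trace"]
    for trace in traces:
        activities = [
            attribute.get("value")
            for event in trace
            if local_name(event.tag) == "event"
            for attribute in event
            if attribute.get("key") == "concept:name"
        ]
        # Reuse an existing pdc:isPos attribute if the trace already has one.
        verdict = next(
            (child for child in trace if child.get("key") == "pdc:isPos"), None
        )
        if verdict is None:
            namespace = trace.tag[: -len("trace")]  # empty if no namespace
            verdict = ET.SubElement(trace, namespace + "boolean")
            verdict.set("key", "pdc:isPos")
        verdict.set("value", "true" if fits(activities) else "false")
    return root


example = (
    "<log>"
    '<trace><event><string key="concept:name" value="a"/></event>'
    '<event><string key="concept:name" value="b"/></event></trace>'
    '<trace><event><string key="concept:name" value="b"/></event></trace>'
    "</log>"
)
# Toy model for illustration: only the trace <a, b> fits.
classified = classify_log(ET.fromstring(example), lambda acts: acts == ["a", "b"])
print(ET.tostring(classified, encoding="unicode"))
```

Reading the test log and writing the classified log (for example with `ET.parse` and `ElementTree.write`) is left out of the sketch.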

All implemented classification algorithms (BPMN, DCR, LSK, and PNML) are available for download.

You can submit your discovery algorithm until September 30, but you can already submit it today. You can also submit multiple discovery algorithms, or multiple configurations of the same discovery algorithm. The earlier you submit your configured discovery algorithm, the earlier you will have its results, and the earlier you can decide to perhaps submit another configuration.

Score and winners

For each training log, the “Discover.bat” file is used to discover a model from the training log. Next, the “Classify.bat” file is used to classify every trace in the matching test log using the discovered model. This results in a positive accuracy rate P and a negative accuracy rate N for this training log. From these, its F-score F is computed as 2*(P*N)/(P+N). The end score for the discovery algorithm is the average F-score over all 480 training logs.
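
The scoring can be sketched in Python as follows. The function names are illustrative, and returning 0 when P and N are both 0 is an assumption, since the formula itself is undefined in that case.

```python
def f_score(p, n):
    """F-score 2*(P*N)/(P+N) of a positive and a negative accuracy rate."""
    if p + n == 0:
        return 0.0  # assumption: the formula is undefined here
    return 2 * (p * n) / (p + n)


def end_score(rates):
    """Average F-score over a list of (P, N) pairs, one per training log."""
    scores = [f_score(p, n) for p, n in rates]
    return sum(scores) / len(scores)


# Example with made-up rates for two training logs.
print(f_score(0.9, 0.8))
print(end_score([(0.9, 0.8), (1.0, 1.0)]))
```

Because F is the harmonic mean of P and N, a model that accepts everything (P = 1, N = 0) or rejects everything (P = 0, N = 1) scores 0.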

The first winner is the submission with the best end score. The second winner is the imperative submission with the best end score, provided that it beats the Directly Follows miner on its end score (89.0%). Tests have shown that such submissions are possible.

Key Dates

  • Discovery Algorithm Submission: September 30, 2021
  • Disclosure of the Data Set: October 1, 2021
  • Winner(s) Notification: October 15, 2021
  • Winner(s) Announcement: During ICPM 2021

The winners will be formally announced during the ICPM 2021 conference. Because the winners may get the opportunity to present their discovery algorithm during the conference, they will be notified informally beforehand, so that they have time to prepare their presentation.

Example discovery algorithms

To help you, we have implemented the following 13 existing discovery algorithms, which are available for download as well:

  • Base miners:
    • The Directly Follows miner.
    • The Flower miner.
    • The Trace (or Sequence) miner.
  • Submitted to the PDC 2020:
    • The Directly Follows Model miner, which is the winning submission of the PDC 2020.
    • The DisCoveR miner using the Closed-World assumption.
    • The DisCoveR Light miner using the Closed-World assumption.
    • The Inductive IMfa miner (Inductive miner using the IMfa configuration).
    • The Kokos 2 T5 miner (Kokos 2 miner using a 5 minute timeout).
  • Examples used for the PDC 2020:
    • The Alpha miner.
    • The Fodina miner.
    • The Hybrid ILP miner.
    • The Log Skeleton N3 miner (Log Skeleton miner using a 3% noise threshold).
    • The Split miner.

We did not include the DisCoveR miners using the Open-World assumption (which were also submitted to the PDC 2020) here, as we are using the Closed-World assumption. We included the Log Skeleton miner using a 3% noise threshold here, as we know that for the PDC 2020 this is the configuration of the Log Skeleton miner that scored best (85%).

Results

The figure above shows that the three declarative discovery algorithms each outperform the base Directly Follows miner, and that the 7 imperative miners are each outperformed by it. The figure below breaks the score down into an average positive accuracy rate and an average negative accuracy rate. This shows that the imperative miners have good negative accuracy rates, but fall well short on their positive accuracy rates.

Downloads