Selecting Representative Sample Traces from Large Event Logs

Gaël Bernard and Periklis Andritsos


When event logs are large, the time needed to analyze them with process mining techniques can become prohibitive. In this paper, we use sampling to reduce an event log to p traces while minimizing the Earth Mover's Distance (EMD) from the original, unsampled event log. We contribute by formalizing log sampling in a canonical form and showing its link with the EMD, a metric increasingly used in process mining. Next, we propose three log-sampling algorithms, which we evaluate on a collection of 18 event logs from industry. We show that our approach substantially reduces the EMD compared to existing sampling strategies. Moreover, we show that sampled event logs with low EMDs tend to have better behavioural quality, which supports the generality of our work.
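To make the objective concrete, the following minimal sketch computes the EMD between an event log and a candidate sample, treating each log as a probability distribution over its trace variants. It is an illustration under stated assumptions, not one of the three algorithms proposed in the paper: the ground distance is assumed to be the normalized Levenshtein distance between traces, the transport problem is solved with a small successive-shortest-path min-cost flow, and all names (`norm_lev`, `distribution`, `emd`, the toy logs) are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def levenshtein(a, b):
    """Edit distance between two traces (sequences of activity labels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def norm_lev(a, b):
    """Assumed ground distance: Levenshtein divided by the longer trace length."""
    longest = max(len(a), len(b))
    return Fraction(levenshtein(a, b), longest) if longest else Fraction(0)


def distribution(log):
    """Relative frequency of each trace variant in a log (a list of traces)."""
    counts = Counter(log)
    total = sum(counts.values())
    return {variant: Fraction(c, total) for variant, c in counts.items()}


def emd(p, q, dist=norm_lev):
    """Exact EMD between two variant distributions via min-cost flow."""
    src, dst = list(p), list(q)
    n, m = len(src), len(dst)
    S, T = n + m, n + m + 1
    edges = []  # [target, residual capacity, cost]; edges k and k ^ 1 are paired
    graph = [[] for _ in range(n + m + 2)]

    def add(u, v, cap, cost):
        graph[u].append(len(edges)); edges.append([v, cap, cost])
        graph[v].append(len(edges)); edges.append([u, Fraction(0), -cost])

    for i, a in enumerate(src):
        add(S, i, p[a], Fraction(0))
        for j, b in enumerate(dst):
            add(i, n + j, Fraction(1), dist(a, b))  # total mass is 1, so 1 is unbounded
    for j, b in enumerate(dst):
        add(n + j, T, q[b], Fraction(0))

    total = Fraction(0)
    while True:
        # Bellman-Ford on the residual graph (residual costs can be negative).
        best, via = {S: Fraction(0)}, {}
        for _ in range(n + m + 1):
            changed = False
            for u in list(best):
                for k in graph[u]:
                    v, cap, cost = edges[k]
                    if cap > 0 and (v not in best or best[u] + cost < best[v]):
                        best[v], via[v], changed = best[u] + cost, k, True
            if not changed:
                break
        if T not in best:
            return total  # no augmenting path left: all mass has been moved
        path, v = [], T
        while v != S:
            path.append(via[v])
            v = edges[via[v] ^ 1][0]  # the paired edge points back to the tail
        flow = min(edges[k][1] for k in path)
        for k in path:
            edges[k][1] -= flow
            edges[k ^ 1][1] += flow
        total += flow * sum(edges[k][2] for k in path)


# Toy log with three variants (frequencies 6, 3, 1) and two candidate samples.
abc, acb, abbc = ("a", "b", "c"), ("a", "c", "b"), ("a", "b", "b", "c")
original = [abc] * 6 + [acb] * 3 + [abbc]
sample_a = [abc, acb]   # p = 2 traces, two distinct variants
sample_b = [abc]        # p = 1 trace, most frequent variant only

emd_a = emd(distribution(original), distribution(sample_a))  # Fraction(7, 60)
emd_b = emd(distribution(original), distribution(sample_b))  # Fraction(9, 40)
```

In this toy case, `sample_a` has the lower EMD (7/60, about 0.117) than `sample_b` (9/40 = 0.225), so under the assumptions above it would be the more representative sample. Exact fractions are used only to keep the small example free of rounding; a practical implementation on large logs would need a faster solver.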