BPI challenge: Making data from industry and organizations available to advance the field of Process Analytics
Boudewijn van Dongen is a full Professor in Computer Science and chair of the Process Analytics group at Eindhoven University of Technology (TU/e). Process analytics is an area of Data Science that can be considered a combination of industrial engineering and data mining, in which researchers study business and organizational processes through data.
In his field, a process is any activity performed by individuals to reach a certain goal. For example, insurance companies need administrative processes to handle insurance claims, and airports need logistics processes to handle a bag received at the check-in counter and ensure it reaches the right plane or the right connecting flight. Data from all these processes is recorded as event data.
In the example of an airport, receiving and delivering a bag results in the recording of data such as the time when the bag was received (time stamp), who or which machine received it and for which flight it was checked in (context). All these data are collected in logs which can be used to investigate the underlying process, for example, to determine if this process is efficient enough to deliver the bags to the right plane in time.
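The airport example can be sketched as a tiny event log. The snippet below is only an illustration: the field names (`case`, `activity`, `resource`, `flight`, `timestamp`), the bag and flight identifiers, and the helper `handling_minutes` are all assumptions made for this sketch, not part of any BPI challenge dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: one record per event, with a time stamp,
# the resource that handled the bag, and the flight as context.
events = [
    {"case": "bag-001", "activity": "check-in", "resource": "counter-12",
     "flight": "XX100", "timestamp": "2019-05-01T08:00:00"},
    {"case": "bag-001", "activity": "loaded", "resource": "loader-3",
     "flight": "XX100", "timestamp": "2019-05-01T08:42:00"},
    {"case": "bag-002", "activity": "check-in", "resource": "counter-07",
     "flight": "XX100", "timestamp": "2019-05-01T08:10:00"},
    {"case": "bag-002", "activity": "loaded", "resource": "loader-3",
     "flight": "XX100", "timestamp": "2019-05-01T09:25:00"},
]


def handling_minutes(log):
    """Return, per case, the minutes between its first and last event."""
    stamps = defaultdict(list)
    for event in log:
        stamps[event["case"]].append(datetime.fromisoformat(event["timestamp"]))
    return {
        case: (max(ts) - min(ts)).total_seconds() / 60
        for case, ts in stamps.items()
    }


durations = handling_minutes(events)
for case, minutes in durations.items():
    print(f"{case}: {minutes:.0f} minutes from check-in to loading")
```

Comparing such durations with the time available before departure is one simple way to ask whether the process delivers bags to the plane in time.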
Boudewijn van Dongen
Boudewijn is also one of the organizers of the Business Process Intelligence challenge (BPI challenge; 2019 event version) where datasets from companies/organizations are made available to researchers or young consultancy companies to compete in solving a process problem through data mining. The datasets used for this challenge are published every year on 4TU.ResearchData in the collection IEEE Task Force on Process Mining – Event Logs.
We talked to Boudewijn about the data which is part of this competition, the involvement of industry and the motivations for running this challenge.
Everything started when Boudewijn was still part of Wil van der Aalst’s group. At that time, companies would often provide the group with data for creating case studies. The researchers would use the data to create algorithms that solved a particular problem within a process or optimized the process itself, or to test an algorithm they had already created. After a case study was published, other researchers would contact them asking for its data. They would share the data whenever the company that owned it agreed.
“When we were able to publish the data from our case studies, we often got back the response from some researchers saying ‘your data are wrong’. Our data was not wrong! Probably the techniques and tools they developed were not working yet on this real data. This went on for a couple of years and we decided that we should change this around.”
Boudewijn and his colleagues decided that once a year they would publish a dataset from a company (or a public organization) and pose a challenge to answer a particular question, sometimes one coming from the company itself. This would give researchers and practitioners the opportunity to develop new algorithms, test them on real data to prove their applicability, and win a prize provided by the sponsors of the challenge.
“The challenge forces researchers to make sure that new techniques, new developments or new ideas are relevant on practical examples. I’m also very much interested in pure theoretical and foundational computation science for example, but when you work in an applied field then you should show applicability to real life cases.”
Boudewijn’s group decided to use 4TU.ResearchData to make the data publicly available, providing each dataset with a DOI that enables proper citation.
“For us it was very important to be able to assign a DOI, but also to reserve a DOI before publishing the data because we include it in the metadata (the description of the data) of the dataset. 4TU.ResearchData also offers the opportunity to preserve the datasets in a sustainable way and for the long term, so the data are available for those researchers that would like to use them, even after the challenge.”
The collection is constantly growing, and not only with the datasets collected for the challenge. The IEEE Task Force on Process Mining – Event Logs has become a place where researchers from around the world want to make their event logs available, whether they are real-life datasets or artificial datasets created to test their algorithms.
“The collection is often used by other colleagues as a benchmarking set or to prove that their algorithms work slightly better than others. There are also datasets that address specific issues or that researchers use to evaluate different algorithms and explain why they would not work on a specific dataset, etc. It is a growing collection, it is very widely known in our research community and it is also very widely used. Unfortunately, it is not always properly cited with the corresponding DOI.”
There is a lot of work behind organizing the BPI challenge, which involves intense communication with the companies and organizations that provide the datasets, as well as preparing the datasets themselves. One advantage in the discipline of Process Analytics is that there is a well-established standard file format, IEEE eXtensible Event Stream (XES), which contains standard metadata. This makes the semantics of the data understandable and the data interoperable, so they can be analyzed (mined) with the different tools used in the field. But much more than that has to happen before the datasets get published.
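To give an impression of what the XES format looks like, here is a minimal sketch that reads a hand-written XES-like fragment with Python's standard library. The keys `concept:name`, `org:resource` and `time:timestamp` come from the standard XES extensions; the sample values and the helper `read_traces` are assumptions for illustration. Real XES files usually carry more header information (extensions, classifiers, sometimes an XML namespace) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified XES-like fragment (illustrative values only).
XES_SAMPLE = """<log xes.version="1.0">
  <trace>
    <string key="concept:name" value="bag-001"/>
    <event>
      <string key="concept:name" value="check-in"/>
      <string key="org:resource" value="counter-12"/>
      <date key="time:timestamp" value="2019-05-01T08:00:00.000+02:00"/>
    </event>
    <event>
      <string key="concept:name" value="loaded"/>
      <string key="org:resource" value="loader-3"/>
      <date key="time:timestamp" value="2019-05-01T08:42:00.000+02:00"/>
    </event>
  </trace>
</log>"""


def read_traces(xes_text):
    """Map each trace name to its list of events (dicts of key -> value)."""
    root = ET.fromstring(xes_text)
    result = {}
    for trace in root.findall("trace"):
        # Trace-level attributes are the direct children that are not events.
        trace_attrs = {
            child.get("key"): child.get("value")
            for child in trace
            if child.tag != "event"
        }
        events = [
            {attr.get("key"): attr.get("value") for attr in event}
            for event in trace.findall("event")
        ]
        result[trace_attrs.get("concept:name")] = events
    return result


traces = read_traces(XES_SAMPLE)
for name, events in traces.items():
    print(name, [event["concept:name"] for event in events])
```

Because every attribute is a typed key/value pair with an agreed meaning, different tools can interpret the same log in the same way.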
Boudewijn tells us that three main ingredients are needed:
- “You need somebody inside the company that sees the benefits they can get out by making the data available for this challenge. Companies get a solution for their questions regarding a specific process for free.”
- “You need a manager high up in the organization, at the CEO level, who also sees the value of participating in the challenge and who is willing to push it forward.”
- “And, you need very good anonymization of the data. The data that we publish is not the raw data that comes from the systems/machines in the company; that is all proprietary data and sometimes contains personal data. You cannot leave personal IDs in the file, for example. You need to relabel these IDs to anonymize them in a way that you can still make a distinction between operators and administrators, or between administrators and automated resources, so the data still provide relevant information. There are guidelines from Statistics Netherlands (CBS) on how to do this. We ask the company to follow those guidelines to prepare the data. Sometimes they ask for our help, but most of the time they know how to do that themselves.”
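The relabeling idea from the last point can be sketched as follows. This is a simplified illustration and not the CBS guidelines: the helper `relabel_resources`, the role table and the sample records are hypothetical, and a real anonymization effort has to consider much more than resource IDs (for example time stamps and free-text fields).

```python
def relabel_resources(log, role_of):
    """Replace each resource ID by a role-based pseudonym such as 'operator_1'.

    The role is kept so that operators, administrators and automated
    resources can still be told apart; the original ID is dropped.
    """
    mapping = {}   # original ID -> pseudonym
    counters = {}  # role -> number of pseudonyms issued so far
    relabeled = []
    for event in log:
        original = event["resource"]
        if original not in mapping:
            role = role_of[original]
            counters[role] = counters.get(role, 0) + 1
            mapping[original] = f"{role}_{counters[role]}"
        relabeled.append({**event, "resource": mapping[original]})
    return relabeled, mapping


# Hypothetical raw records and role table.
raw_log = [
    {"activity": "register claim", "resource": "j.smith"},
    {"activity": "approve claim", "resource": "a.jones"},
    {"activity": "send letter", "resource": "batch-server"},
    {"activity": "register claim", "resource": "j.smith"},
]
roles = {
    "j.smith": "operator",
    "a.jones": "administrator",
    "batch-server": "automated",
}

anonymized, mapping = relabel_resources(raw_log, roles)
for event in anonymized:
    print(event["activity"], "->", event["resource"])
```

The mapping itself must stay with the data owner (or be discarded), since anyone holding it could reverse the relabeling.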
From the first conversations with a company until the dataset is published for the challenge, approximately 9 months pass. What motivates Boudewijn to invest his time in this?
“It is a community effort! If you keep all these data to yourself you might be able to write nice case studies, but nobody will be able to do the same. And the whole point of doing research is to work together in a research area and try to advance it. That means that some people need to invest time and effort organizing conferences, or as editors of journals, and some others will need to invest the time and effort to publish data and organize challenges to advance the field. It is part of the research process.”
Note at the end of the article
If you are looking for event logs to use in your Process Analytics project, visit the collection IEEE Task Force on Process Mining – Event Logs on 4TU.ResearchData. If you use one of the datasets, make sure you cite it following the instructions found in “How to cite this item” using the corresponding DOI.
van Dongen, B.F. (Boudewijn); Borchert, F. (Florian) (2018) BPI Challenge 2018. Eindhoven University of Technology. Dataset. https://doi.org/10.4121/uuid:3301445f-95e8-4ff0-98a4-901f1f204972
Author: Paula Martinez Lavanchy
Cover image by Pete Linforth via Pixabay