BPI challenge: Making available data from industry and organizations to advance in the field of Process Analytics


Boudewijn van Dongen is a full Professor in Computer Science and chair of the Process Analytics group at Eindhoven University of Technology (TU/e). Process analytics is an area of Data Science that can be considered a combination of industrial engineering and data mining, in which researchers look at business and organizational processes through data.

In his field, a process is considered any activity performed by individuals to reach a certain goal. For example, insurance companies need administrative processes to handle insurance claims, and airports need logistics processes to handle a bag received at the check-in counter and ensure it reaches the right plane or the right flight connection. Data from all these processes is recorded as event data.

In the exam­ple of an air­port, receiv­ing and deliv­er­ing a bag results in the record­ing of data such as the time when the bag was received (time stamp), who or which machine received it and for which flight it was checked in (con­text). All these data are col­lect­ed in logs which can be used to inves­ti­gate the under­ly­ing process, for exam­ple, to deter­mine if this process is effi­cient enough to deliv­er the bags to the right plane in time.
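To make the idea of an event log concrete, here is a minimal sketch in Python. The field names, bag IDs, resources and flight codes are all invented for illustration and do not come from a real airport system. It records a timestamp, a resource and some context for each event, and then computes how long each bag spent between its first and last recorded step.

```python
from datetime import datetime

# Hypothetical baggage-handling events: one record per recorded step.
# All names and values are illustrative, not taken from a real log.
events = [
    {"case": "bag-001", "activity": "received at check-in",
     "resource": "counter-12", "flight": "FL100",
     "timestamp": "2019-03-01T08:05:00"},
    {"case": "bag-002", "activity": "received at check-in",
     "resource": "counter-07", "flight": "FL200",
     "timestamp": "2019-03-01T08:10:00"},
    {"case": "bag-001", "activity": "loaded on plane",
     "resource": "loader-3", "flight": "FL100",
     "timestamp": "2019-03-01T08:47:00"},
    {"case": "bag-002", "activity": "loaded on plane",
     "resource": "loader-1", "flight": "FL200",
     "timestamp": "2019-03-01T09:40:00"},
]


def throughput_minutes(log):
    """Return the minutes between the first and last event of each case."""
    bounds = {}
    for event in log:
        ts = datetime.fromisoformat(event["timestamp"])
        first, last = bounds.get(event["case"], (ts, ts))
        bounds[event["case"]] = (min(first, ts), max(last, ts))
    return {
        case: (last - first).total_seconds() / 60
        for case, (first, last) in bounds.items()
    }


durations = throughput_minutes(events)
for case, minutes in durations.items():
    print(f"{case}: {minutes:.0f} minutes from check-in to plane")
```

Comparing these durations with the time available before departure is one simple way to ask whether the process delivers bags in time. Real process mining tools go much further, for example by reconstructing the full process model from the log.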

Boudewi­jn van Don­gen

Boudewi­jn is also one of the orga­niz­ers of the Busi­ness Process Intel­li­gence chal­lenge (BPI chal­lenge; 2019 event ver­sion) where datasets from companies/organizations are made avail­able to researchers or young con­sul­tan­cy com­pa­nies to com­pete in solv­ing a process prob­lem through data min­ing. The datasets used for this chal­lenge are pub­lished every year on 4TU.ResearchData in the col­lec­tion IEEE Task Force on Process Min­ing — Event Logs.

We talked to Boudewi­jn about the data which is part of this com­pe­ti­tion, the involve­ment of indus­try and the moti­va­tions for run­ning this chal­lenge.

Everything started when Boudewijn was still part of Wil van der Aalst’s group. At that time, companies would often provide the group with data to be used in case studies. The researchers would use the data to create algorithms that solved a particular problem within a process or optimized the process itself, or to test an algorithm they had already created. After the case studies were published, other researchers would contact them asking for the underlying data. They would share the data whenever the company that owned it agreed.

“When we were able to pub­lish the data from our case stud­ies, we often got back the response from some researchers say­ing ‘your data are wrong’. Our data was not wrong! Prob­a­bly the tech­niques and tools they devel­oped were not work­ing yet on this real data. This went on for a cou­ple of years and we decid­ed that we should change this around.” 

Boudewijn and his colleagues decided that once a year they would publish a dataset coming from a company (or a public organization) and pose a challenge to answer a particular question, sometimes coming from the company itself. This would give researchers and practitioners the opportunity to develop new algorithms, test them on real data to prove their applicability, and win a prize provided by the sponsors of the challenge.

“The challenge forces researchers to make sure that new techniques, new developments or new ideas are relevant on practical examples. I’m also very much interested in purely theoretical and foundational computation science, for example, but when you work in an applied field you should show applicability to real-life cases.”

Boudewijn’s group decided to use 4TU.ResearchData to make the data publicly available, providing each dataset with a DOI that enables proper citation.

“For us it was very important to be able to assign a DOI, but also to reserve a DOI before publishing the data because we include it in the metadata (the description of the data) of the dataset. 4TU.ResearchData also offers the opportunity to preserve the datasets in a sustainable way and for the long term, so the data are available for those researchers that would like to use them, even after the challenge.”

The collection is constantly growing, and not only with the datasets collected for the challenge. The IEEE Task Force on Process Mining — Event Logs collection has become a place where researchers from around the world make their event logs available, whether they are real-life datasets or artificial datasets created to test their algorithms.

“The collection is often used by other colleagues as a benchmarking set or to prove that their algorithms work slightly better than others. There are also datasets that address specific issues or that researchers use to evaluate different algorithms and explain why they would not work on a specific dataset, etc. It is a growing collection, it is very widely known in our research community and it is also very widely used. Unfortunately, it is not always properly cited with the corresponding DOI.”

Organizing the BPI challenge takes a lot of work, including intense communication with the companies and organizations that provide the datasets and the preparation of the datasets themselves. One advantage in the discipline of Process Analytics is that there is a well-established standard file format, IEEE eXtensible Event Stream (XES). It contains standard metadata that makes the semantics of the data understandable and makes the data interoperable, so they can be analyzed (mined) with the different tools used in the field. But much more than that has to happen before the datasets get published.
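As a rough illustration of what such a file looks like, the sketch below parses a small hand-written XES-style fragment with Python's standard library. The fragment is an assumption made for this example: it uses the standard attribute keys `concept:name`, `org:resource` and `time:timestamp`, but leaves out the XML namespace and extension declarations that published XES files normally carry, and the bag and resource names are invented.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified XES-style fragment (illustrative only).
# Published logs usually also declare a namespace and extensions.
XES_SNIPPET = """<log xes.version="1.0">
  <trace>
    <string key="concept:name" value="bag-001"/>
    <event>
      <string key="concept:name" value="received at check-in"/>
      <string key="org:resource" value="counter-12"/>
      <date key="time:timestamp" value="2019-03-01T08:05:00.000+01:00"/>
    </event>
    <event>
      <string key="concept:name" value="loaded on plane"/>
      <string key="org:resource" value="loader-3"/>
      <date key="time:timestamp" value="2019-03-01T08:47:00.000+01:00"/>
    </event>
  </trace>
</log>"""


def read_events(xes_text):
    """Flatten traces into one dict per event, tagged with its case ID."""
    root = ET.fromstring(xes_text)
    rows = []
    for trace in root.iter("trace"):
        # The trace-level concept:name identifies the case.
        case_id = next(
            (attr.get("value") for attr in trace.findall("string")
             if attr.get("key") == "concept:name"),
            None,
        )
        for event in trace.findall("event"):
            # Each child element of an event is a typed key/value attribute.
            attributes = {attr.get("key"): attr.get("value") for attr in event}
            rows.append({"case": case_id, **attributes})
    return rows


rows = read_events(XES_SNIPPET)
for row in rows:
    print(row["case"], row["concept:name"], row["time:timestamp"])
```

Because every tool agrees on what keys such as `time:timestamp` mean, the same file can be loaded by different process mining tools without a custom conversion step. For real logs, a dedicated XES reader is a better choice than this hand-rolled parser.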

Boudewi­jn tells us that three main ingre­di­ents are need­ed:

  • “You need somebody inside the company that sees the benefits they can get by making the data available for this challenge. Companies get a solution for their questions regarding a specific process for free.”
  • “You need a man­ag­er high up in the orga­ni­za­tion, at the CEO lev­el, who also sees the val­ue of par­tic­i­pat­ing in the chal­lenge and who is will­ing to push it for­ward.”
  • “And, you need very good anonymization of the data. The data that we publish is not the raw data that comes from the systems/machines in the company; that is all proprietary data and sometimes contains personal data. You cannot leave personal IDs in the file, for example. You need to relabel these IDs to anonymize them in a way that you still can make a distinction between operators and administrators, or between administrators and automated resources, so the data still provide relevant information. There are guidelines from Statistics Netherlands (CBS) on how to do this. We ask the company to follow those guidelines to prepare the data. Sometimes they ask for our help, but most of the time they know how to do that themselves.”
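The relabeling idea in the last point can be sketched as follows. This is only an illustration under stated assumptions and not the CBS guidelines or the procedure the organizers use: the resource IDs, the role table and the function name `relabel_resources` are all hypothetical. Each original ID is replaced by a pseudonym that keeps the role (operator, administrator, automated) but nothing else.

```python
import secrets


def relabel_resources(log, role_of):
    """Replace resource IDs with role-based pseudonyms such as 'operator_1'.

    `role_of` maps each original ID to its role (an assumed input).
    Returns the relabeled log and the mapping that was applied.
    """
    # Shuffle the distinct IDs so the pseudonym numbers reveal neither
    # alphabetical order nor order of first appearance in the log.
    ids = sorted({event["resource"] for event in log})
    secrets.SystemRandom().shuffle(ids)

    mapping, counters = {}, {}
    for original in ids:
        role = role_of[original]
        counters[role] = counters.get(role, 0) + 1
        mapping[original] = f"{role}_{counters[role]}"

    relabeled = [{**event, "resource": mapping[event["resource"]]}
                 for event in log]
    return relabeled, mapping


# Invented example data: two operators, one administrator, one system.
raw_log = [
    {"case": "claim-1", "activity": "register", "resource": "emp-4711"},
    {"case": "claim-1", "activity": "check", "resource": "emp-0815"},
    {"case": "claim-1", "activity": "approve", "resource": "emp-2001"},
    {"case": "claim-1", "activity": "notify", "resource": "batch-server"},
]
roles = {
    "emp-4711": "operator",
    "emp-0815": "operator",
    "emp-2001": "administrator",
    "batch-server": "automated",
}

anonymized_log, applied_mapping = relabel_resources(raw_log, roles)
for event in anonymized_log:
    print(event["activity"], "->", event["resource"])
```

The mapping itself has to be kept secret or discarded, since anyone holding it can reverse the relabeling. Relabeling IDs is also only one part of anonymization; timestamps and free-text fields can still identify people and need their own treatment.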

From the first conversations with a company until the dataset is published for the challenge, it takes approximately 9 months. What is the motivation for Boudewijn to invest his time in this?

“It is a com­mu­ni­ty effort! If you keep all these data to your­self you might be able to write nice case stud­ies, but nobody will be able to do the same. And the whole point of doing research is to work togeth­er in a research area and try to advance it. That means that some peo­ple need to invest time and effort orga­niz­ing con­fer­ences, or as edi­tors of jour­nals, and some oth­ers will need to invest the time and effort to pub­lish data and orga­nize chal­lenges to advance the field. It is part of the research process.”

Note at the end of the arti­cle

If you are looking for event logs to use in your Process Analytics project, visit the IEEE eXtensible Event Stream (XES). If you use one of the datasets, make sure you cite it following the instructions found in “How to cite this item” using the corresponding DOI.

For exam­ple: 

van Don­gen, B.F. (Boudewi­jn); Borchert, F. (Flo­ri­an) (2018) BPI Chal­lenge 2018. Eind­hoven Uni­ver­si­ty of Tech­nol­o­gy. Dataset. https://doi.org/10.4121/uuid:3301445f-95e8-4ff0-98a4-901f1f204972

Author: Paula Mar­tinez Lavanchy
Cov­er image by Pete Lin­forth via Pix­abay 
