Sharing large simulation datasets effectively

Gior­dano Lipari is a con­sul­tant and researcher in com­pu­ta­tion­al and envi­ron­men­tal hydro­dy­nam­ics at Water­mo­tion | Water­be­weg­ing. After hear­ing that his data col­lec­tion in 4TU.ResearchData has already been viewed more than 2,000 times, our com­mu­ni­ty man­ag­er, Con­nie Clare, sat down with Gior­dano to dis­cuss his expe­ri­ence.

Tell us something about your data collection. The datasets within have been downloaded 3,500 times since their deposit last April.

We published a large, 735GB collection, ‘High-resolution SPH simulations of a 2D dam-break flow against a vertical wall’, which comprises 1,650 data files from high-resolution simulations of fluid flows. This data was shared at the tail end of a postdoctoral project at TU Delft, where I worked with Kees Vuik at the Delft Institute of Applied Mathematics to audit computer programs that simulate such flows.

What are simulations of fluid flows and what are their real-world applications? 

A computer simulation basically consists of four ingredients: going backwards in the workflow, you give a piece of hardware a piece of software that implements the mathematics that expresses, in our case, some physics. I was studying a methodology called smoothed particle hydrodynamics and was interested in the impact of water on objects, or vice versa. These simulations model real-world problems, such as waves hitting ships, dykes or breakwaters, where water could damage structures.

This sin­gle image is made of 82 mil­lion points tracked indi­vid­u­al­ly. Its cor­re­spond­ing file is 3.3GB.
The ani­ma­tion of the full flow with a descrip­tion can be found here.

Why do these simulations produce massive data? 

Researchers study­ing hydro­dy­nam­ics ide­al­ly want to track the motion of water down to the very small­est par­ti­cles of flu­id the flow con­sists of. Whilst this lev­el of descrip­tion is unachiev­able for gen­er­al pur­pos­es, the quest for it is a moti­va­tor for researchers since fin­er details pro­vide more insight into flu­id flow behav­iour. The more refined the sim­u­la­tion, the longer the com­put­ing time and the larg­er the data files gen­er­at­ed. 

How did you run the simulations? 

We had fast Graph­ics Pro­cess­ing Units (GPUs) at our dis­pos­al. We could describe the motion of water impacts with up to 100 mil­lion par­ti­cles and pro­duce 1TB worth of data in just five days. Col­lect­ing this large amount of data means that all tasks in an ordi­nary research work­flow scale up with the prob­lem size accord­ing­ly, with new chal­lenges aris­ing. For exam­ple, data stor­age can fill up fair­ly quick­ly before you’ve had time to analyse the data. This is a typ­i­cal issue asso­ci­at­ed with high-per­for­mance com­put­ing. 
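As a rough, back-of-the-envelope sketch of what these figures imply, the snippet below converts "1TB in five days" into an average output rate and estimates how long a disk would last at that pace. Decimal units are assumed (1 TB = 10^12 bytes), and the 4 TB disk capacity is a hypothetical value chosen only for illustration; neither comes from the interview.

```python
# Back-of-the-envelope sketch; decimal units assumed (1 TB = 10**12 bytes).
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(total_bytes: float, days: float) -> float:
    """Average output rate in MB/s for a run that writes total_bytes over days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


def days_until_full(disk_bytes: float, total_bytes: float, days: float) -> float:
    """Days until a disk fills if data keeps arriving at the same average rate."""
    return disk_bytes / (total_bytes / days)


# 1 TB in 5 days is the figure quoted in the interview.
rate = sustained_rate_mb_per_s(1 * TB, 5)
# The 4 TB disk is a hypothetical capacity, not a value from the interview.
fill = days_until_full(4 * TB, 1 * TB, 5)

print(f"average output rate: {rate:.2f} MB/s")
print(f"a 4 TB disk fills in about {fill:.0f} days")
```

The average rate is modest, so the difficulty is less about write speed than about accumulation: storage is consumed continuously while analysis lags behind.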

Why did you decide to publish your data and software code?

The datasets are not associated with a published paper but we had started by sharing animations of high-resolution flow simulations via our YouTube channel. Then, we thought we should share the underlying data and code as it offers others the opportunity to look into precomputed high-resolution simulations. The simulation data we produced is highly valuable for many scholars and practitioners since the hardware we had is hard to afford for most. We also used an open-source flow solver and could include the code in the collection, so that anyone can reproduce the results if they wish. In this way, we demonstrate a prime example of open science practice.

Each of these GPUs could produce 1TB of data within 5 days, carrying out floating-point operations in double precision at a rate of 4 TFLOPS.

Having become a data champion at TU Delft, I was also aware of the university’s research data framework policy. This has been an inspiration to publish our simulations in a way that makes them FAIR (Findable, Accessible, Interoperable and Reusable). As researchers from a partner organisation we could deposit up to 1TB free of charge in the 4TU.ResearchData repository. My faculty data steward, Santosh Ilamparuthi, provided useful tips for how to get started and the curators from 4TU.ResearchData, Jan van der Heul and Egbert Gramsbergen, supported our initiative throughout the upload stage.

Who do you expect to be interested in your data collection?

The flow we select­ed for deposit has a bench­mark val­ue for our spe­cial­ist com­mu­ni­ty. I envis­age that researchers using the same method­ol­o­gy as ours are inter­est­ed in explor­ing our data col­lec­tion. Researchers using oth­er method­olo­gies may also be inter­est­ed in com­par­ing results. In addi­tion, look­ing into pre­com­put­ed data can have edu­ca­tion­al val­ue for stu­dents. 

What’s more, high-per­for­mance com­put­ing and large datasets bring chal­lenges beyond the dis­ci­pline that stim­u­lates a study. For exam­ple, due to the gran­u­lar­i­ty of the high-res­o­lu­tion images, the datasets may also appeal to researchers and devel­op­ers from com­put­er graph­ics to test and improve their tools for ren­der­ing com­plex visu­al­i­sa­tions.

You have published 1,650 data files. How have you organised these in the repository to ensure they invite exploration? 

The files are many and large. Without organisation, navigating more than a thousand large descriptions of fluid flows can easily turn a user’s initial curiosity into discouragement and disinterest. You cannot expect people to download such large amounts of data in bulk and sift through the numerous files. This would be too time- and labour-intensive. Regarding the FAIR data principles, ‘accessibility’ does not scale well with the deposit size. It was, therefore, important to prepare the data and organise it appropriately before publication.

The best way to do this was by creating a data collection comprising four datasets containing the core information, plus an entry dataset which serves as a one-stop catalogue of the collection’s contents, including visualisations of the raw data. The entry dataset allows users to form their idea of the simulations, navigate the collection and select data files of choice. The data files have not been bundled in archives and are in a highly compressed format, so they can be downloaded individually in a tolerable transfer time using a moderate bandwidth.
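To give a feel for why individual downloads matter, here is a minimal sketch that compares transfer times for a single file and for the whole collection. The collection size (735GB), file count (1,650) and the 3.3GB file are the figures quoted in this article; the 100 Mbit/s bandwidth is a hypothetical value, decimal units are assumed, and the mean file size is only an average since actual file sizes vary.

```python
# Rough transfer-time estimate; decimal units assumed (1 GB = 10**9 bytes).
GB = 10**9


def transfer_seconds(size_bytes: float, mbit_per_s: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead and latency."""
    return size_bytes * 8 / (mbit_per_s * 10**6)


COLLECTION_BYTES = 735 * GB   # collection size quoted in the article
N_FILES = 1650                # number of data files quoted in the article
LARGE_FILE_BYTES = 3.3 * GB   # file behind the 82-million-point image
BANDWIDTH = 100.0             # Mbit/s, hypothetical "moderate" connection

mean_file_bytes = COLLECTION_BYTES / N_FILES

mean_s = transfer_seconds(mean_file_bytes, BANDWIDTH)
large_s = transfer_seconds(LARGE_FILE_BYTES, BANDWIDTH)
bulk_h = transfer_seconds(COLLECTION_BYTES, BANDWIDTH) / 3600

print(f"mean file ({mean_file_bytes / 10**6:.0f} MB): {mean_s:.0f} s")
print(f"3.3 GB file: {large_s / 60:.1f} min")
print(f"whole collection: {bulk_h:.1f} h")
```

Under these assumptions a typical file arrives in well under a minute and the largest in a few minutes, whereas a bulk download would occupy the connection for most of a day.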

Finally, besides the standard README files, we also wrote a commentary that caters for a variety of approaches and levels of expertise. It has been important to inform viewers about our rationale and methodology with additional descriptive documentation. The commentary explains why and how we created so much data and where to find data files, and suggests, for inspiration, how the data can be used. The commentary should allow viewers to decide which datasets are valuable to them. Organising the voluminous data collection in this way provides a well-marked path for viewers to confidently dive into the details.

Giordano, thank you for taking the time to share your considerations for publishing large datasets. We’re pleased that your data collection has become a highly accessed resource. Do you have any final thoughts to share with us? 

Thank you, Connie. Accessibility is important for any dataset. However, when depositing large datasets in particular, depositors need to compensate for the hurdles that arise from their sheer size. Facilitating the access itself does not guarantee many data downloads, but it is still a necessary part of effective research data management. All the preparation requires careful design and some extra effort but is worth it! The items within our collection have been downloaded an average of 11 times a day in the last eleven months.

Such usage met­rics are a great suc­cess, demon­strat­ing that view­ers have engaged with the col­lec­tion, large as it is, and decid­ed to explore the parts rel­e­vant for them. This alone makes the effort of shar­ing invalu­able.

Co-authored and edit­ed by: Gior­dano Lipari and Con­nie Clare
