NHR Data Lakes Workshop



Freja Nordsiek (GWDG)

Data-driven science requires not only fast storage systems but also strategies to manage this data efficiently within and across data centers. Big data tools can satisfy the need to search data based on user-specific metadata; however, there is a zoo of tools available, and no single tool can meet all the requirements of an HPC system in a data center. Data lakes, for example, are a reasonable approach, but there are alternative concepts and tools that also need to be considered. A uniform and consistent view of the millions of scientific data files on HPC systems, together with their efficient processing, is required to maximize exploitability and prevent segmented data silos between users or projects.

In this workshop we will discuss different topics related to large-scale data management on HPC systems, ranging from the use of dedicated data management systems to secure and efficient data transfer strategies and storage challenges caused by data-intensive computations.

    • 1:00 PM 1:15 PM
      Welcome (Julian Kunkel, GWDG) 15m
      Speaker: Julian Kunkel (GWDG)
    • 1:15 PM 1:35 PM
      XNAT 20m
      Speaker: James Bowden
    • 1:35 PM 1:50 PM
      Navigating muddy Waters: iRODS as a Data Lake 15m

      Together, we will discover the powers of the Integrated Rule-Oriented Data System (iRODS) and shed light on its strengths, weaknesses, and use cases, supported by insightful usage statistics. We will delve into the concept of federations and explore how iRODS facilitates seamless collaboration across HPC sites. We also provide a glimpse into the future as we discuss the evolving prospects of iRODS within the NHR.

      Speakers: Dr Christian Meesters (Johannes Gutenberg-Universität Mainz), Dr Jörg Steinkamp
    • 1:50 PM 2:10 PM
      Streamlining Data Sharing in Machine Learning with Dataverse and Customized Metadata 20m
      Speaker: Sherpa Lincoln
    • 2:10 PM 2:30 PM
      Coscine 20m
      Speaker: Marcel Nellesen (RWTH)
    • 2:30 PM 2:50 PM
      WP 1: Conclusion about Data Management Systems and HPC (Marcel Nellesen, RWTH) 20m

      The amount of research data generated within research projects is growing rapidly; however, metadata that describes the research data or the environment in which it was created is often missing. Without this information it is often impossible to reproduce or reuse the results in further research projects. Data management describes the process of storing, organizing, and maintaining data. Effective handling of data is important for generating long-term value from it. In this talk we will present the results of our work package and discuss open challenges.

      Speaker: Marcel Nellesen (RWTH)
    • 2:50 PM 3:05 PM
      Coffee Break 15m
    • 3:05 PM 3:20 PM
      trappedssh — Sandboxing rsync and SFTP on transfer nodes 15m

      Many HPC centers place their frontend nodes behind a VPN or a jumphost for extra protection, and quite reasonably so, since the internet is a dangerous place. Because these methods can make it harder to get data into the HPC center and because data transfers can take a long time, many HPC centers also have dedicated transfer nodes. For transfer nodes exposed directly to the internet or just behind a jumphost, it is common to lock them down by restricting incoming access to SFTP only, forbidding the more dangerous shell access. Unfortunately, this makes rsync and other transfer methods that need shell access impossible. Enter trappedssh, an attempt to provide sandboxed rsync, SFTP, etc. with better protections than the current SFTP-only methods (trappedssh can also do SFTP-only), using low-level techniques like namespaces rather than trying to sanitize SSH_ORIGINAL_COMMAND. The presentation will discuss the idea behind it, the first proof of concept and its problems, progress on the second proof of concept, and future directions.

      Speaker: Dr Freja Nordsiek (GWDG)
    • 3:20 PM 3:40 PM
      S3 data transfer investigation 20m

      A detailed test of data transfer via the S3 protocol between the RWTH cluster and different S3 storage solutions.

      Speaker: Fabian Dünzer (RWTH)
    • 3:40 PM 4:00 PM
      WP 2: Data Transfer 20m

      Users often need to transfer large amounts of data into and out of HPC centers. Most attention goes to transfers between users' workstations and the HPC center. But users often must also transfer data between centers, for example because they are moving to a different HPC center or need to make a local copy of a dataset hosted elsewhere. We will present our investigations into transferring data between NHR centers via rsync, SFTP-based tools, and S3-based tools, including the difficulties of doing so securely and without a third host (e.g. the user's workstation). Such difficulties include the VPNs many centers use to improve access security, SSH credentials, etc.

      Speaker: Dr Freja Nordsiek (GWDG)
    • 4:00 PM 4:10 PM
      Coffee Break 10m
    • 4:15 PM 4:30 PM
      CEPH 15m
      Speaker: Johannes Veh (FAU)
    • 4:30 PM 4:50 PM
      Bespoke Data Management Concepts for the Storage Tiers @ GWDG 20m
      Speaker: Hendrik Nolte (GWDG)
    • 4:50 PM 5:10 PM
      Concluding Discussion 20m