NHR Data Lakes Workshop
Tuesday 23 January 2024
13:00 - 13:15
Welcome
Julian Kunkel (GWDG)
13:15 - 13:35
XNAT
James Bowden
13:35 - 13:50
Navigating muddy Waters: iRODS as a Data Lake
Christian Meesters (Johannes Gutenberg-Universität Mainz), Jörg Steinkamp
Together, we will discover the powers of the Integrated Rule-Oriented Data System (iRODS) and shed light on its strengths, weaknesses, and use cases, supported by insightful usage statistics. We will delve into the concept of federations and explore how iRODS facilitates seamless collaboration across HPC sites. We will also offer a glimpse into the future as we discuss the evolving prospects of iRODS within the NHR.
13:50 - 14:10
Streamlining Data Sharing in Machine Learning with Dataverse and Customized Metadata
Sherpa Lincoln
14:10 - 14:30
Coscine
Marcel Nellesen (RWTH)
14:30 - 14:50
WP 1: Conclusion about Data Management Systems and HPC
Marcel Nellesen (RWTH)
The amount of research data generated within research projects is growing rapidly, but the metadata that describes the data or the environment in which it was created is often missing. Without this information, it is often impossible to reproduce the results or reuse them in further research projects. Data management is the process of storing, organizing, and maintaining data, and handling data effectively is important for generating long-term value from it. In this talk we will present the results of our work package and discuss open challenges.
14:50 - 15:05
Coffee Break
15:05 - 15:20
trappedssh: Sandboxing rsync and SFTP on transfer nodes
Freja Nordsiek (GWDG)
Many HPC centers place their frontend nodes behind a VPN or a jumphost for extra protection, and quite reasonably so, since the internet is a dangerous place. Because these methods can make it harder to get data into the HPC center, and because data transfers can take a long time, many HPC centers also have dedicated transfer nodes. For transfer nodes on the bare internet or just behind a jumphost, it is common to lock them down by restricting incoming access to SFTP only, forbidding the more dangerous shell access. Unfortunately, this makes rsync and other transfer methods that need shell access impossible. Enter trappedssh, an attempt to provide sandboxed rsync, SFTP, etc. with better protections than the current SFTP-only methods (trappedssh can also do SFTP-only). It uses low-level techniques like namespaces rather than trying to sanitize SSH_ORIGINAL_COMMAND. The presentation will discuss the idea behind it, the first proof of concept and its problems, progress on the second proof of concept, and future directions.
15:20 - 15:40
S3 data transfer investigation
Fabian Dünzer (RWTH)
A detailed test of data transfer over the S3 protocol between the RWTH cluster and different S3 storage solutions.
15:40 - 16:00
WP 2: Data Transfer
Freja Nordsiek (GWDG)
Users often need to transfer large amounts of data into and out of HPC centers. Most attention goes to transfers between users' workstations and the HPC center. However, users often must also transfer data between centers, for example because they are moving to a different HPC center or need to make a local copy of a dataset hosted elsewhere. We will present our investigations into transferring data between NHR centers via rsync, SFTP-based tools, and S3-based tools, including the difficulties of doing so securely and without needing a third host (e.g. the user's workstation). Such difficulties include the VPNs many centers use to improve access security, SSH credentials, etc.
16:00 - 16:10
Coffee Break
16:15 - 16:30
CEPH
Johannes Veh (FAU)
16:30 - 16:50
Bespoke Data Management Concepts for the Storage Tiers @ GWDG
Hendrik Nolte (GWDG)
16:50 - 17:10
Concluding Discussion