Data Lake Admin Workshop

Europe/Berlin
BigBlueButton (Online)

https://meet.gwdg.de/b/ale-o0p-mti-fmi
Julian Kunkel (GWDG), Hendrik Nolte (GWDG), Andreas Knuepfer (TU Dresden)
Description

In recent years, classic HPC users have shown an ever-increasing interest in using the public cloud as part of traditional HPC workflows. There are many reasons for this; for example, special hardware components such as TPUs or specialized GPUs often become available in the cloud earlier than in a local data center. In addition, users need to store data for analysis with AI methods across different data silos and to access it flexibly from both HPC and cloud systems. Flexible data migration and provisioning in a data lake plays a central role in such data analytics workflows. For this purpose, highly scalable object storage, mostly accessed via an S3 interface, has long been established in the cloud. A further advantage of a consistent data management strategy, as offered by a data lake, is the uniform and consistent view it provides across the individual data silos.
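
As an illustration of the S3 access pattern mentioned above, here is a minimal Python sketch that uploads and lists objects in an S3-compatible object store using boto3. The endpoint URL, bucket name, and credentials are placeholders, not details of any specific deployment.

    import boto3

    # Connect to an S3-compatible object store; endpoint and credentials
    # are placeholders for whatever the local data lake deployment provides.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.org",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Upload a raw data file into a bucket acting as the data lake's raw zone.
    s3.upload_file("measurements.csv", "raw-zone", "project-a/measurements.csv")

    # List what is stored under the project prefix.
    response = s3.list_objects_v2(Bucket="raw-zone", Prefix="project-a/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])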

This ongoing shift in the usage model of HPC systems requires admins to extend their consulting, software, and hardware offerings. This admin workshop is split into two distinct parts. In the first part, three talks will present three different data management tools from an HPC perspective. The session will conclude with an honest and critical discussion of the shortcomings of each tool. The goal is not to promote one, or all, of these tools, but to identify common challenges and unique solutions, as a first step towards developing an NHR-wide strategy for data management.

In the second part, related topics are presented and discussed, ranging from object storage from an HPC perspective and accessing an HPC system through a REST API to securely processing sensitive (medical) data on a shared HPC system. The session concludes with a joint discussion about high-performance data analytics (HPDA), big data analytics (BDA), and scientific data management in general, fostering further collaboration on these topics across the different NHR centers.

Important Information

Date and Time: Thursday, September 29th, 2022, 13:00 - 17:00
Venue: Virtual / Room: Data Lake Admin Workshop
Organizers: Julian Kunkel (Uni Göttingen/GWDG), julian.kunkel@gwdg.de
            Hendrik Nolte (GWDG), hendrik.nolte@gwdg.de
            Andreas Knuepfer (TU Dresden), andreas.knuepfer@tu-dresden.de
            Alexander Goldmann (GWDG), alexander.goldmann@gwdg.de

Funding

This workshop is funded by the GWDG and supported by the NHR.


Timetable
    • 13:00 - 13:10
      Welcome and Introduction

      Prof. Dr. Julian Kunkel will welcome the attendees.

      Convener: Julian Kunkel (GWDG)
    • 13:40 - 14:10
      Data Lake/Management 30m

      Across various domains, data lakes are successfully used to centrally store all of an organization's data in its raw format. This promises high reusability of the stored data, since the schema is only applied on read, which prevents information loss due to ETL (Extract, Transform, Load) processes. Despite this schema-on-read approach, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality. These data models are maintained within a central data catalog which can be queried. To further organize the data in the data lake, different architectures have been proposed, such as the widely known zone architecture, where data is assigned to different zones according to its degree of processing. In this talk, a novel data lake architecture based on FAIR (Findable, Accessible, Interoperable, Reusable) Digital Objects (FDOs) with (high-performance) processing capabilities is presented. These FDOs abstract away the handling of the underlying mass storage and databases, thereby enforcing a homogeneous state while offering a flat yet easily comprehensible research data management. The FDOs are connected by a provenance-centered graph. Users can define generic workflows, which are reproducible by design, making this data lake implementation ideally suited for science. (A minimal sketch of such an FDO record follows below.)

      Speaker: Hendrik Nolte (GWDG)
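
      What such an FDO record might look like is sketched below in Python; the field names, identifiers, and storage paths are hypothetical illustrations of the concept, not the speaker's actual implementation.

        from dataclasses import dataclass, field

        # Hypothetical sketch of a FAIR Digital Object (FDO): a persistent
        # identifier, descriptive metadata, a reference into mass storage,
        # and provenance edges linking it to the FDOs it was derived from.
        @dataclass
        class FDO:
            pid: str                   # persistent identifier
            metadata: dict             # descriptive metadata, queryable via a catalog
            storage_ref: str           # e.g. an object key on the mass storage
            derived_from: list = field(default_factory=list)  # provenance edges

        raw = FDO(
            pid="fdo/0001",
            metadata={"instrument": "sensor-42", "format": "csv"},
            storage_ref="s3://raw-zone/project-a/measurements.csv",
        )

        # A processing step yields a new FDO whose provenance records its
        # input, so the derivation chain stays reproducible by construction.
        cleaned = FDO(
            pid="fdo/0002",
            metadata={"step": "outlier-removal"},
            storage_ref="s3://curated-zone/project-a/measurements-clean.parquet",
            derived_from=[raw.pid],
        )
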
    • 14:10 - 14:40
      Coscine 30m

      For many researchers, engagement with the FAIR principles (findable, accessible, interoperable, reusable) does not begin until the publication of an article and the sometimes obligatory transfer of the research data to a repository. At this point, a significant amount of valuable information about the research project is often already lost. One solution to make research data FAIR from the very beginning of its life cycle is to use a storage environment that implicitly implements the FAIR principles. To create such a storage environment, the research data management platform Coscine was developed as open-source software at RWTH Aachen University. Coscine provides an integrated concept for research (meta)data management in addition to the storage, management, and archiving of research data. Furthermore, Coscine features open interfaces for automating data flows and unlimited collaboration capabilities for cross-institutional projects. The talk will introduce the main features of Coscine and show how the platform can support researchers in implementing good scientific practice in day-to-day research data management.

      Speaker: Dr Ilona Lang (RWTH Aachen University)
    • 14:40 - 15:00
      Break and Joint (Critical) Discussion 20m
    • 15:00 - 15:30
      DAOS 30m

      The Distributed Asynchronous Object Storage (DAOS) is a new, purely user-space HPC storage software solution developed by Intel. It is built upon recent hardware technologies and, being a key-value store, breaks with traditional POSIX-like HPC file systems. Despite the discontinuation of Optane persistent memory, a key technology DAOS currently relies on, development of the software stack continues in order to support more common hardware setups, which still makes DAOS worth a look. In the talk, we will outline basic DAOS concepts and demonstrate basic administrative operations, both to set up a DAOS system from scratch and to create the infrastructure that makes this non-POSIX system usable for regular users (a sketch of such steps is shown below). This includes a short discussion of security aspects. We will also briefly cover how users and applications can interact with DAOS without breaking established workflows, and highlight which new opportunities become available when working with the system.

      Speaker: Dr Steffen Christgau (Zuse Institute Berlin)
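
      As a rough illustration of the administrative steps mentioned in the abstract, the Python sketch below drives the DAOS command-line tools (dmg, daos, dfuse). It assumes a DAOS 2.x installation; flag syntax varies between versions, and the pool/container labels and mount point are made-up example values.

        import subprocess

        def run(cmd):
            """Run an administrative command and echo it for the audit trail."""
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        # Create a storage pool (label and size are example values).
        run(["dmg", "pool", "create", "--size=1TB", "tank"])

        # Create a POSIX container in the pool so that regular users can
        # access DAOS through a familiar file-system view.
        run(["daos", "container", "create", "tank", "mycont", "--type=POSIX"])

        # Mount the container via dfuse; unmodified applications can then
        # read and write under /mnt/daos without DAOS-specific code.
        run(["dfuse", "--pool=tank", "--container=mycont", "--mountpoint=/mnt/daos"])
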
    • 15:30 - 16:00
      HPCSerA 30m

      The usual mode of accessing High Performance Computing (HPC) resources involves interactively connecting to the command-line interface and submitting job scripts to a job scheduler. However, services that provide a user interface themselves (e.g. for working with graphical data), or services that simply require HPC resources as a backend compute engine, can benefit from integrating HPC resources via an API. In our prototype of such a solution, external services authenticate against this backend via a RESTful API, which then submits jobs to the HPC system's job scheduler on behalf of the user. Moreover, the job status and outcome are tracked, can be queried by the service, and are reflected back to the user. We showcase our architecture, the prototype implementation, and recent advancements in our work on the security model, as well as various usage scenarios. (A hypothetical client interaction is sketched below.)

      Speaker: Christian Köhler (GWDG)
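
      To make the interaction pattern concrete, here is a minimal client sketch against a hypothetical REST endpoint; the URL, paths, token handling, and JSON fields are invented for illustration and are not HPCSerA's actual API.

        import time
        import requests

        API = "https://hpc-api.example.org"           # hypothetical endpoint
        HEADERS = {"Authorization": "Bearer <token>"} # placeholder credential

        # Submit a batch script on behalf of the user; the API forwards it
        # to the HPC system's job scheduler.
        job = requests.post(
            f"{API}/jobs",
            headers=HEADERS,
            json={"script": "#!/bin/bash\n#SBATCH -n 1\nsrun hostname\n"},
        ).json()

        # Poll the tracked job status until the scheduler reports completion.
        while True:
            status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS).json()
            if status["state"] in ("COMPLETED", "FAILED"):
                break
            time.sleep(10)

        print("Job finished with state:", status["state"])
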
    • 16:00 - 16:30
      Secure HPC 30m

      Driven by the progress of data- and compute-intensive methods in various scientific domains, there is an increasing demand from researchers working with highly sensitive data for access to the computational resources they need to adopt those methods in their respective fields. To satisfy the computing needs of those researchers cost-effectively, it remains an open challenge to integrate reliable security measures into existing High Performance Computing (HPC) clusters. The fundamental problem with working securely with sensitive data is that HPC systems are shared systems that are typically tuned for the highest performance, not for high security. For instance, no additional virtualization techniques are commonly employed, so users typically have access to the host operating system. Since new vulnerabilities are continuously being discovered, relying solely on traditional Unix permissions is not secure enough. In this talk, we discuss Secure HPC, a workflow allowing users to transfer, store, and analyze data with the highest privacy requirements. (An illustrative client-side encryption step is sketched below.)

      Speaker: Mr Trevor Khwam Tabougua (GWDG)
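
      To illustrate the general idea of protecting data before it reaches a shared system, the sketch below encrypts a file client-side prior to transfer. It uses Fernet from the Python cryptography package purely as an illustrative stand-in; it does not reproduce the actual Secure HPC mechanism, and the file names are made up.

        from cryptography.fernet import Fernet

        # Generate a symmetric key; in a real workflow the key would be
        # managed by the data owner and never stored on the shared system.
        key = Fernet.generate_key()
        fernet = Fernet(key)

        # Encrypt the sensitive input locally, before any transfer to HPC.
        with open("patient_data.csv", "rb") as f:
            ciphertext = fernet.encrypt(f.read())
        with open("patient_data.enc", "wb") as f:
            f.write(ciphertext)

        # Only holders of the key can recover the plaintext afterwards.
        plaintext = fernet.decrypt(ciphertext)
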
    • 16:30 - 17:00
      Concluding Discussion