Name: Data Lake Admin Workshop
Start: 2022-09-29T13:00:00+02:00
End: 2022-09-29T17:00:00+02:00
Location: Online

Data Lake Admin Workshop

Thursday 29 September 2022 - 13:00

13:00 - 13:10
Room: BigBlueButton
13:10 StrongLink - Andreas Knüpfer (Technische Universität Dresden)
13:10 - 13:40
Room: BigBlueButton
TU Dresden aims to improve its services for Research Data Management (RDM) to better support scientists in managing their valuable data. Among the challenges are handling large data sets, keeping track of one or many data sets across various storage technologies and storage tiers, and managing the many versions of data sets over their lifetimes. Metadata adds to the challenge because it must be managed as well. Then again, metadata is also part of the solution, so integrated metadata handling is needed. The talk presents insights from TU Dresden's recent evaluation of the StrongLink software (https://www.stronglink.de/ and https://stronglink.com/). This includes its approach to unifying many storage services, from file systems and object storage all the way to archival tape, with the ability to connect to HPC file systems as well. The talk also covers data versioning and the integrated metadata functionality.
13:40 Data Lake/Management - Hendrik Nolte (GWDG)
13:40 - 14:10
Room: BigBlueButton
Across various domains, data lakes are successfully used to centrally store all data of an organization in its raw format. This promises high reusability of the stored data since a schema is applied only on read, which prevents information loss due to ETL (Extract, Transform, Load) processes. Despite this schema-on-read approach, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality. These data models are maintained within a central data catalog which can be queried. To further organize the data in the data lake, different architectures have been proposed, such as the widely known zone architecture, where data is assigned to different zones according to its degree of processing. In this talk, a novel data lake architecture based on FAIR (Findable, Accessible, Interoperable, Reusable) Digital Objects (FDOs) with (high-performance) processing capabilities is presented. These FDOs abstract away the handling of the underlying mass storage and databases, thereby enforcing a homogeneous state while offering flat yet easily comprehensible research data management. The FDOs are connected by a provenance-centered graph. Users can define generic workflows, which are reproducible by design, making this data lake implementation ideally suited for science.
14:10 Coscine - Ilona Lang (RWTH Aachen University)
14:10 - 14:40
Room: BigBlueButton
For many researchers, engagement with the FAIR principles (findable, accessible, interoperable, reusable) does not begin until the publication of an article and the sometimes obligatory transfer of the research data to a repository. At this point, a significant amount of valuable information about the research project is often already lost. One way to make research data FAIR from the very beginning of its life cycle is to use a storage environment that implicitly implements the FAIR principles. To create such a storage environment, the research data management platform Coscine was developed as open source software at RWTH Aachen University. Coscine provides an integrated concept for research (meta)data management in addition to storage, management and archiving of research data. Coscine also features open interfaces for automating data flows and unlimited collaboration capabilities for cross-institutional projects. The talk will introduce the main features of Coscine and show how the platform can support researchers in implementing good scientific practice in day-to-day research data management.
14:40 Break and Joint (Critical) Discussion
14:40 - 15:00
Room: BigBlueButton
15:00 DAOS - Steffen Christgau (Zuse Institut Berlin)
15:00 - 15:30
Room: BigBlueButton
The Distributed Asynchronous Object Storage (DAOS) is a new, purely user-space HPC storage software solution developed by Intel. It is built on recent hardware technologies and, being a key-value store, breaks with traditional POSIX-like HPC file systems. Despite the discontinuation of Optane persistent memory, a key technology DAOS currently relies on, development of the software stack continues in order to support more common hardware setups, which still makes DAOS worth a look. In the talk, we will outline basic DAOS concepts and demonstrate basic administrative operations, both to set up a DAOS system from scratch and to create the infrastructure that makes the non-POSIX system usable for regular users. This includes a short discussion of security aspects. We will also briefly cover how users and applications can interact with DAOS without breaking established workflows and highlight the new opportunities available when working with the system.
15:30 HPCSerA - Christian Köhler (GWDG)
15:30 - 16:00
Room: BigBlueButton
The usual mode of accessing High Performance Computing (HPC) resources involves interactively connecting to the command-line interface and submitting job scripts to a job scheduler. However, services that provide a user interface of their own (e.g. when working with graphical data), or that simply require HPC resources as a backend compute engine, can benefit from the integration of HPC resources via an API. In our prototype for such a solution, external services authenticate against this backend via a RESTful API, which then submits jobs to the HPC system's job scheduler on behalf of the user. Moreover, the job status and outcome are tracked and can be queried by the service and reflected back to the user. We showcase our architecture, our prototype implementation and recent advancements in our work on the security model, as well as various usage scenarios.
16:00 Secure HPC - Trevor Khwam Tabougua (GWDG)
16:00 - 16:30
Room: BigBlueButton
Driven by the progress of data- and compute-intensive methods in various scientific domains, there is an increasing demand from researchers working with highly sensitive data for access to the computational resources needed to adapt those methods in their respective fields. To satisfy the computing needs of those researchers cost-effectively, integrating reliable security measures on existing High Performance Computing (HPC) clusters remains an open challenge. The fundamental problem with securely working with sensitive data is that HPC systems are shared systems that are typically tuned for the highest performance, not for high security. For instance, additional virtualization techniques are commonly not employed, so users typically have access to the host operating system. Since new vulnerabilities are continuously being discovered, relying solely on traditional Unix permissions is not secure enough. In this paper, we discuss Secure HPC, a workflow allowing users to transfer, store and analyze data with the highest privacy requirements.
16:30 Concluding Discussion
16:30 - 17:00
Room: BigBlueButton