Reduced Precision and Data Compression in Scientific Applications

Timezone: Europe/Berlin
Venue: Hörsaal, MPCDF

    • 09:30–09:40
      Welcome 10m
      Speaker: Erwin Laure (MPCDF)
    • 09:40–10:00
      Finding the Sweet Spot: Balancing Efficiency and Accuracy in Fusion Simulations 20m
      Speaker: Frank Jenko (IPP)
    • 10:00–10:30
      Mixed Feelings about Mixed Precision: Can we adapt Numerical Algorithms to AI Hardware? 30m

      The rise of AI has driven the development of special-purpose hardware and floating point
      formats that are suitable for executing AI applications. Generally speaking, the trend
      is towards less complex (shorter) floating point formats, simple operations like matrix
      multiplication, and over-provisioning processors for floating point performance. Most
      scientific applications, on the other hand, rely on long (high-precision) floating
      point formats and linear algebra operations whose performance is often bound by
      communication and memory speed rather than by arithmetic throughput. This motivates
      the investigation of sophisticated techniques to avoid, reduce, and/or hide data
      transfers between processors and between
      processors and main memory. One promising strategy is to decouple the memory
      precision from the arithmetic precision and compress the data before invoking communication
      operations. While this generally comes with a loss of information, the strategy
      can be reasonable when operating with approximate objects like preconditioners used in
      iterative methods. We will present a memory accessor separating the arithmetic precision
      from the memory precision, as well as mixed-precision algorithms based on the
      strategy of employing lower precision formats for communication and memory access
      without impacting the final accuracy.
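
      The accessor idea described above can be illustrated in a few lines. The following is a hedged toy model in NumPy, not the interface presented in the talk; the class and method names are invented for illustration:

```python
import numpy as np

# Hedged sketch of the "memory accessor" idea: values are stored in a short
# format (here float32) but every arithmetic operation is carried out in
# float64 after an on-the-fly conversion. Illustrative only.
class Accessor:
    def __init__(self, data, storage_dtype=np.float32):
        self._store = np.asarray(data, dtype=storage_dtype)  # compact in memory

    def read(self):
        # "decompress" on load: promote to the arithmetic precision
        return self._store.astype(np.float64)

    def write(self, values):
        # "compress" on store: demote to the storage precision
        self._store = np.asarray(values, dtype=np.float32)

# Example: an axpy carried out in float64 on float32 storage.
x = Accessor(np.linspace(0.0, 1.0, 8))
y = Accessor(np.ones(8))
y.write(2.0 * x.read() + y.read())  # arithmetic in float64, stored in float32
```

      Memory traffic halves relative to float64 storage, while all floating-point operations still run at the higher arithmetic precision.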

      Speaker: Hartwig Anzt (TUM)
    • 10:30–11:00
      Reduced precision in GENE: Challenges, benefits, and a novel mixed-precision strategy 30m

      The recent influence of the AI community on hardware vendors offers scientific simulations a new opportunity: modern GPUs natively support 16-bit half-precision floating point data types. As GENE is a bandwidth-bound stencil code, its performance can profit roughly proportionally from the reduced memory consumption of half precision. We briefly discuss the technical challenges and workarounds of porting GENE and its library dependencies to half precision, and present the achieved single-node multi-GPU performance improvements.
      In extreme-scale simulations, however, when running on hundreds or even thousands of GPU nodes, inter-node communication can become the dominant bottleneck of a GENE simulation [Germaschewski et al., Physics of Plasmas, 28, 062501 (2021)]. An essential part of this communication is a reduction of the moments needed to solve for the electromagnetic fields. While it is tempting to reduce the amount of communicated data via a low-precision Allreduce, this leads to catastrophic cancellation due to the plasma's quasi-neutrality. Instead, we employ a novel mixed-precision strategy that avoids the numerical cancellation while exploiting the reduced data-transfer requirements of a low-precision Allreduce.
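
      The cancellation problem can be illustrated with a small toy reduction. This is only a numerically analogous sketch, not GENE's actual scheme:

```python
import numpy as np

# Toy illustration (not the GENE implementation) of why a naive low-precision
# Allreduce fails under quasi-neutrality. Each "rank" holds O(1) ion and
# electron density moments whose sum, the physically relevant net charge, is tiny.
n_ranks = 8
ions = np.array([1.0 + 2e-4 * k for k in range(n_ranks)])
electrons = np.array([-1.0 + 1e-4 * k for k in range(n_ranks)])

exact = np.sum(ions + electrons)  # reference: full float64 reduction

# Naive: cast each O(1) moment to float16 before reducing; the rounding error
# of the large densities swamps the small net charge.
naive = np.sum(ions.astype(np.float16).astype(np.float64)
               + electrons.astype(np.float16).astype(np.float64))

# Mixed precision: form the nearly cancelling sum locally in float64 first,
# then communicate only the small residual in float16.
mixed = np.sum((ions + electrons).astype(np.float16).astype(np.float64))

err_naive = abs(naive - exact)  # large: comparable to the signal itself
err_mixed = abs(mixed - exact)  # only float16 rounding of the tiny residual
```

      The naive variant loses most of the net charge, while the mixed-precision variant transmits the same number of half-precision values yet stays accurate.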

      Speaker: Carl-Martin Pfeiler (IPP)
    • 11:00–11:30
      Coffee Break 30m
    • 11:30–12:00
      Single Precision in Climate Modelling with ICON 30m

      Earth System Models (ESMs) are already taking advantage of the performance gains of reduced-precision floating-point arithmetic, with models utilizing single- or mixed-precision modes in operational weather forecasting. Current targets for climate modelling with the ICON model aim for 365 simulated days per day (SDPD) for a full ESM at 1.25 km horizontal resolution, with recent flagship runs achieving 82.5 SDPD. Research into running the model in single precision may bring us closer to achieving the 365 SDPD target. Starting with evaluating the atmospheric dynamical core in isolation allows debugging and the application of tools on a minimal subset of the code. Preliminary results show a 2x speedup on CPUs and a 1.5x speedup on GPUs. The output evaluation is progressing but largely requires manual review by domain experts. Precision-sensitive variables and arithmetic operations have been identified in the dynamical core using floating-point error analysis tools. We are investigating solutions to deal with precision-sensitive variables, and emphasize the importance of a robust testing infrastructure for development and validation.
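
      One classic precision-sensitive pattern that such error-analysis tools flag, accumulating many small tendencies into a large state variable, can be sketched as follows (illustrative only, not ICON code):

```python
import numpy as np

# Illustrative sketch (not ICON code) of a precision-sensitive operation:
# accumulating many small physics tendencies into a large state variable.
# In float32 the per-step increment is smaller than half an ulp of the state
# and is silently absorbed; compensated (Kahan) summation recovers it.
state64 = np.float64(300.0)   # e.g. a temperature in kelvin, float64 reference
state32 = np.float32(300.0)   # naive float32 accumulator
kahan = np.float32(300.0)     # compensated float32 accumulator
comp = np.float32(0.0)        # running compensation term
tend = np.float32(1e-5)       # small per-step tendency

for _ in range(100_000):
    state64 = state64 + np.float64(tend)
    state32 = state32 + tend  # increment lost: 1e-5 < ulp(300)/2 in float32
    y = tend - comp           # Kahan: re-inject what was lost so far
    t = kahan + y
    comp = (t - kahan) - y
    kahan = t

err_naive = abs(float(state32) - float(state64))  # ~1.0: all increments lost
err_kahan = abs(float(kahan) - float(state64))    # near float32 rounding level
```

      Fixes of this kind (compensation, or keeping only the sensitive accumulator in double precision) are among the solutions one might consider for variables the tools flag.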

      Speaker: Dylan Kierans (DKRZ)
    • 12:00–12:30
      Discussion 30m
    • 12:30–13:30
      Lunch 1h
    • 13:30–14:00
      Lower FP precision in turbulence simulations 30m

      Modern computing clusters offer specialized hardware with reduced-precision arithmetic that can speed up the time to solution significantly, mainly due to less data movement and increased arithmetic performance. However, for high-fidelity simulations of turbulence, separation, and transition, the impact of lower floating-point precision on the computed solution, and the uncertainty it introduces, has not been explored in sufficient detail. This limits the optimal utilization of new and upcoming exascale machines. In this work, the effect of reduced precision on the numerical solution of the Navier-Stokes equations is considered across different spatial and temporal discretization approaches. We compare four solvers, two compressible and two incompressible, across three test cases: K-type transition in a channel, turbulent channel flow up to Re_τ = 2000, and flow over a cylinder at Re_D = 3900. Different terms of the Navier-Stokes equations are perturbed to lower floating-point precision, ranging from conventional 64-bit IEEE double precision down to recent 8-bit formats, highlighting the opportunities and drawbacks of low-precision arithmetic in high-fidelity computational fluid dynamics.
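
      The term-by-term perturbation methodology can be sketched with a toy 1-D viscous Burgers step, emulating a reduced significand on the convective term only. This is a hedged illustration; the solvers, test cases, and formats in the actual study differ:

```python
import numpy as np

def round_mantissa(x, bits):
    """Emulate reduced precision by rounding the significand to `bits`
    fractional bits (exponent range and subnormals are ignored)."""
    m, e = np.frexp(x)
    scale = 2.0 ** bits
    return np.ldexp(np.round(m * scale) / scale, e)

# Toy periodic 1-D viscous Burgers step: only the convective term is
# perturbed to emulated low precision, mimicking a term-by-term study.
n, nu, dx, dt = 64, 0.1, 1.0 / 64, 1e-3
u = np.sin(2 * np.pi * np.arange(n) * dx)

def step(u, conv_bits=None):
    conv = -u * (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    if conv_bits is not None:
        conv = round_mantissa(conv, conv_bits)  # the perturbed term
    diff = nu * (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    return u + dt * (conv + diff)

u_ref = step(u)               # full float64
u_lp = step(u, conv_bits=7)   # ~bfloat16-like significand on one term
rel_err = np.max(np.abs(u_lp - u_ref)) / np.max(np.abs(u_ref))
```

      Sweeping `conv_bits` and the perturbed term lets one map out which terms tolerate short formats before the solution degrades.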

      Speaker: Philipp Schlatter (FAU)
    • 14:00–14:20
      Mitigating Data Movement Bottlenecks in Fusion Plasma Simulations with Lossy Compression: Less Data, Same Science? 20m

      The performance gap between computing power and data-movement bandwidth across the HPC hardware stack is one of the primary obstacles to application scaling at exascale. Fusion plasma simulations are particularly affected due to their high-dimensional phase-space representation and communication requirements. This talk will explore the use of application-agnostic lossy compressors inline with MPI communication as a way to reduce network contention. By striking a balance between compression throughput, compression ratio, and fidelity preservation, fusion plasma research could potentially benefit from this data reduction technique. This is highlighted by research done in two fusion plasma simulation applications: GENE and bsl6d.
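
      The throughput/ratio/fidelity trade-off can be illustrated with a minimal error-bounded quantizer standing in for a real lossy compressor such as ZFP or SZ (whose actual APIs differ); no MPI calls are made here:

```python
import numpy as np

# Minimal error-bounded lossy "compressor": uniform quantization of float64
# samples to 16-bit integers. A stand-in for ZFP/SZ-style compression of a
# message payload before it is handed to MPI_Send; illustrative only.
def compress(buf, abs_tol):
    # grid spacing 2*abs_tol guarantees |error| <= abs_tol after rounding
    return np.round(buf / (2 * abs_tol)).astype(np.int16)

def decompress(q, abs_tol):
    return q.astype(np.float64) * (2 * abs_tol)

field = np.sin(np.linspace(0.0, 10.0, 1000))  # smooth field in [-1, 1]
tol = 1e-3
payload = compress(field, tol)                # what would go over the wire
restored = decompress(payload, tol)

ratio = field.nbytes / payload.nbytes         # 4x: float64 -> int16
max_err = np.max(np.abs(restored - field))    # bounded by tol
```

      A real compressor trades some of this simplicity for better ratios and higher throughput, which is exactly the balance the talk discusses.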

      Speaker: Diego Jimenez (MPCDF)
    • 14:20–14:30
      MPI extensions for payload compression — a proposal and a prototype 10m

      In this presentation, we propose an API extension to the Message-Passing Interface (MPI) standard that allows applications to register compression algorithms, tailored to their needs, with the MPI library. Corresponding parameters can be passed to the compression algorithms via a likewise extended interface. The effectiveness of these interface extensions has been demonstrated in DaREXA-F using a prototype implementation in ParaStation MPI, where the procedures for MPI partitioned communication are complemented by the ability to utilize the registered compression algorithms. This prototype implementation, preliminary results, and possible future improvements are discussed in the presentation.
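
      What such a registration interface could look like from the application side might be sketched as follows. This is a purely hypothetical Python mock; the actual C-level API extension proposed for ParaStation MPI is not reproduced here and may differ in every detail:

```python
import zlib

# Hypothetical mock: a registry maps a handle to user-supplied
# (de)compression callbacks plus their parameters; the send/recv paths then
# apply the registered algorithm transparently to the message payload.
_registry = {}

def register_compression(name, compress, decompress, **params):
    _registry[name] = (compress, decompress, params)
    return name                       # handle used in later communication calls

def send_with_compression(buf, handle):
    compress, _, params = _registry[handle]
    return compress(buf, **params)    # the payload MPI would actually transmit

def recv_with_compression(payload, handle):
    _, decompress, params = _registry[handle]
    return decompress(payload, **params)

# Example: register zlib as the payload compressor with one tunable parameter.
h = register_compression("zlib",
                         lambda b, level: zlib.compress(b, level),
                         lambda p, level: zlib.decompress(p),
                         level=6)
msg = bytes(1024)                     # highly compressible payload
wire = send_with_compression(msg, h)
out = recv_with_compression(wire, h)
```

      The key design point is that the library, not the application, decides when to invoke the registered callbacks, so existing communication calls gain compression transparently.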

      Speaker: Carsten Clauss (Par-Tec)
    • 14:30–15:00
      Discussion 30m
    • 15:00–15:30
      Coffee Break 30m
    • 15:30–16:00
      Performance Modeling in DAREXA-F 30m

      In this talk we will summarize the efforts on performance modeling and performance analysis in the DAREXA-F project. We will focus on GENE, compression libraries, and general performance modeling for lower precision data for state-of-the-art CPU and GPU models.

      Speaker: Jan Laukemann (FAU)
    • 16:00–16:30
      Performance Evaluation of Current and Future HPC Hardware for Physical Simulations 30m

      Especially with the use of specialized accelerators like GPGPUs, the compute performance of single nodes in HPC systems has been increasing faster than the network bandwidth between the nodes. Hence, network bandwidth is increasingly becoming a major bottleneck for the performance of large-scale HPC applications. We discuss in which cases data compression can help mitigate this issue, also considering accelerators.
      Implementing data compression in a real-world code involves significant development effort.
      To address this, we present a prototypical implementation of transparent compression in MPI with a field-programmable gate array (FPGA) backend, using a ZFP variant as the compression algorithm.
      We also demonstrate the performance of FPGA devices for two isolated routines related to the GENE plasma physics simulation:
      1. An implementation of a Fast Fourier Transform (FFT) on an FPGA device, and
      2. an isolated stencil kernel from GENE ported to an FPGA device.
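
      When compression pays off on a bandwidth-limited link can be captured by a simple back-of-the-envelope model; the numbers below are illustrative assumptions, not measurements from the talk:

```python
# Toy cost model: compression helps only when the time spent compressing
# (on both ends) is smaller than the transmission time it saves.
def transfer_time(size_bytes, net_bw, ratio=1.0, comp_tp=float("inf")):
    """Compress at comp_tp bytes/s on sender and receiver, then transmit
    size_bytes/ratio over a link sustaining net_bw bytes/s."""
    return 2 * size_bytes / comp_tp + (size_bytes / ratio) / net_bw

size = 1 << 30        # 1 GiB message (assumed)
net_bw = 25e9         # ~200 Gb/s interconnect (assumed)

t_raw = transfer_time(size, net_bw)                              # no compression
t_fast = transfer_time(size, net_bw, ratio=4.0, comp_tp=100e9)   # fast, FPGA-like
t_slow = transfer_time(size, net_bw, ratio=4.0, comp_tp=5e9)     # slow, CPU-like
```

      With these assumed numbers, the fast compressor wins while the slow one loses despite an identical 4x ratio, which is why offloading the compressor to an FPGA is attractive.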

      Speaker: Felix Jung (TUM)
    • 16:30–17:00
      Discussion 30m