# Peer-Reviewed Full Papers

The following papers have been selected for publishing in the PASC17 Conference Proceedings.

### Towards the Virtual Rheometer: High Performance Computing for the Red Blood Cell Microstructure

#### Eva Athena Economides (ETH Zurich), Lucas Amoudruz (ETH Zurich), Sergey Litvinov (ETH Zurich), Dmitry Alexeev (ETH Zurich), Sara Nizzero (Houston Methodist Research Institute), Diego Rossinelli (Lucid Concepts AG), Panagiotis E. Hadjidoukas (ETH Zurich), Petros Koumoutsakos (ETH Zurich)

Recent advances in medical research and bio-engineering have led to the development of devices capable of handling fluids and biological matter at the microscale. The operating conditions of medical devices are constrained in order to ensure that characteristic properties of blood flow such as mechanical properties and local hemodynamics are not altered during operation. These properties are a consequence of the red blood cell (RBC) microstructure, which changes dynamically according to the device geometry. The understanding of the mechanics and dynamics that govern the interactions between the RBCs is crucial for the quantitative characterization of blood flow, a stepping stone towards the design of medical devices specialized to the patient, in the context of personalized medicine. This can be achieved by analyzing the microstuctural characteristics of the RBCs and study their dynamics. In this work we focus on the quantification of the microstructure of high and low hematocrit blood flows, in wall bounded geometries. We present distributions of the RBCs according to selected deformation criteria and dynamic charasteristics, and elaborate on mechanisms that control their collective behavior, focusing on the interplay between cells and wall induced confinement effects.

### Load Balancing and Patch-Based Parallel Adaptive Mesh Refinement for Tsunami Simulation on Heterogeneous Platforms Using Xeon Phi Coprocessors

#### Chaulio Resende Ferreira (Technical University of Munich), Michael Bader (Technical University of Munich)

We present a patch-based approach for tsunami simulation with parallel adaptive mesh refinement on the Salomon supercomputer. The special architecture of Salomon, with two Intel Xeon CPUs (Haswell architecture) and two Intel Xeon Phi coprocessors (Knights Corner) per compute node, suggests truly heterogeneous load balancing instead of offload approaches, because host and accelerator achieve comparable performance. We use a tree-structured mesh refinement strategy resulting from the newest vertex bisection refinement of triangular grid cells, but introduce small uniform grid patches into the leaves of the tree to allow vectorisation of the Finite Volume solver over grid cells. Hence, patches increase computational performance, but at the same time restrict mesh adaptivity and thus increase the overall number of grid cells. We studied the time-to-solution for a scenario to simulate the 2011 Tohoku tsunami and found the smallest solution time for comparably small patches of 64 cells. We use the Xeon Phis in symmetric mode and implemented heterogeneous load balancing between host and coprocessors, identifying the relative load distribution either from on-the-fly runtime measurements or from a-priori exhaustive testing. Both approaches performed better than homogeneous load balancing and better than using only the CPUs or only the Xeon Phi coprocessors in native mode. In all set-ups, however, the absolute speedups were impeded by slow MPI communication between the Xeon Phi coprocessors.

### A Histogram-Free Multicanonical Monte Carlo Algorithm for the Basis Expansion of Density of States

#### Ying Wai Li (Oak Ridge National Laboratory), Markus Eisenbach (Oak Ridge National Laboratory)

We report a new multicanonical Monte Carlo (MC) algorithm to obtain the density of states (DOS) for physical systems with continuous state variables in statistical mechanics. Our algorithm is able to obtain an analytical form for the DOS expressed in a chosen basis set, instead of a numerical array of finite resolution as in previous variants of this class of MC methods such as the multicanonical (MUCA) sampling and Wang-Landau (WL) sampling. This is enabled by storing the visited states directly in a data set and avoiding the explicit collection of a histogram. This practice also has the advantage of avoiding undesirable artificial errors caused by the discretization and binning of continuous state variables. Our results show that this scheme is capable of obtaining converged results with a much reduced number of Monte Carlo steps, leading to a significant speedup over existing algorithms.

### Fast and Scalable Low-order Implicit Unstructured Finite-element Solver for Earth's Crust Deformation Problem

#### Kohei Fujita (RIKEN), Tsuyoshi Ichimura (University of Tokyo), Kentaro Koyama (Fujitsu Ltd.), Hikaru Inoue (Fujitsu Ltd.), Muneo Hori (University of Tokyo), Lalith Maddegedara (University of Tokyo)

A low-order implicit unstructured finite-element solver for Earth's crust deformation problem is proposed. The solver scales up to 294,912 CPU cores of K computer with 94.8% size-up efficiency, enabling solving a 504 billion degrees-of-freedom crust deformation problem with 126 billion second-order tetrahedral elements in 146.5 s at 0.970 PFLOPS (20.6% of peak) on 294,912 CPU cores. This is 11.1 times faster than a highly-tuned preconditioned conjugate gradient solver and 2.18 times faster than the state-of-the-art SC14 Gordon Bell Prize Finalist solver. With the proposed solver, we solved a practical 381 billion degrees-of-freedom Eastern Japan crust deformation problem (576 km x 800 km x 400 km area; resolution of 62.5 m) in 252.2 s using 294,912 CPU cores. Such high-resolution analysis assures the convergence of stress and strain required to evaluate nonlinear material constitutive models at plate boundaries and is expected to become a key tool for physics-based earthquake forecasting.

### A Computational Framework to Assess the Influence of Changes in Vascular Geometry on Blood Flow

#### John Gounley (Duke University), Madhurima Vardhan (Duke University), Amanda Randles (Duke University)

Many vascular abnormalities, such as aneurysms or stenoses, develop gradually over time. In the early stages of their development, they require monitoring but do not pose sufficient risk to the patient for a clinician to recommend invasive treatment. With a better understanding of the interplay between hemodynamic factors and changes in blood vessel geometry, there is an opportunity to improve clinical care by earlier identification of aneurysms or stenoses with significant potential for further development. Computational fluid dynamics has shown great promise for investigating this interplay and identifying the associated underlying mechanisms, by using patient-derived geometries and modifying them to represent potential evolution of the vascular disease. However, a general, extensible framework for comparing simulation results from different vascular geometries in a direct, quantitative manner does not currently exist. As a first step toward the development of such a framework, we present a method for quantifying the relationship between changes in vascular geometry and hemodynamic factors, such as wall shear stress. We apply this framework to study the correlation between wall shear stress and geometric changes in two opposite settings: when flow properties are associated with consequent changes in the vascular geometry, as in a thoracic aortic aneurysm, and when geometric changes alter the flow, as in a worsening aortic stenosis.

### Evaluation of a Directive-Based GPU Programming Approach for High-Order Unstructured Mesh Computational Fluid Dynamics

#### Kunal Puri (Israel Institute of Technology), Vikram Singh (Israel Institute of Technology), Steven Frankel (Israel Institute of Technology)

In this work we evaluate the effectiveness of using OpenACC as a paradigm for the auto-prallelization of a high-order unstructured CFD code on Graphics Processing Units (GPUs). This is in lieu of hand-written CUDA or OpenCL code for the algorithms that have to be separately maintained and tuned to available hardware. Specifically, we compare the performance of using OpenACC-2.5 for Fortran with the commercial PGI compiler suite and OpenCL code running on the Nvidia Kepler series (K20, K40) GPU accelerators. Our results show that the (double precision) GPU accelerated code for both approaches is 2-3 times faster than the optimized counterpart on the CPU (running with an OpenMP model). We find that sparse matrix vector multiplication with OpenCL is faster than using OpenACC with CUBLAS. While it is in general possible to write an optimized code using OpenCL (or CUDA) that outperforms OpenACC, we find that the directive based approach offered by OpenACC results in a flexible, unified and hence smaller code-base that is easier to maintain, is readily portable and promotes algorithm development.

### Pareto Optimal Swimmers

#### Siddhartha Verma (ETH Zurich), Panagiotis Hadjidoukas (ETH Zurich), Philipp Wirth (ETH Zurich), Diego Rossinelli (ETH Zurich), Petros Koumoutsakos (ETH Zurich)

A fundamental understanding of how various biological traits and features provide organisms with a competitive advantage can help us improve the design of a number of mechanical systems. Numerical optimization can play an invaluable role for this purpose, by allowing us to scrutinize the evolution of specific biological adaptations in nature. Importantly, the use of numerical optimization can help us overcome limiting constraints that restrict the evolutionary capability of biological species. We capitalize on these advantages by coupling high-fidelity simulations of self-propelled swimmers with evolutionary optimization algorithms, to examine peculiar swimming patterns observed in a number of fish species. More specifically, we investigate the intermittent form of locomotion referred to as `burst-and-coast' swimming, which involves a few quick flicks of the fish's tail followed by a prolonged unpowered glide. This mode of swimming is believed to confer energetic benefits, in addition to several other advantages. We discover a range of intermittent-swimming patterns, the most efficient of which resembles the swimming-behaviour observed in live fish. We also discover patterns which lead to a marked increase in swimming-speed, albeit with a significant increase in energy expenditure. Notably, the use of multi-objective optimization reveals locomotion patterns that strike the perfect balance between speed and efficiency, which can be invaluable for use in robotic applications. As an additional goal of the paper, we highlight the ease with which disparate codes can be coupled via the software framework used, without encumbering the user with the details of efficient parallelization and machine-learning based task-scheduling.

### A Scalable Object Store for Meteorological and Climate Data

#### Tiago Quintino (ECMWF), Simon Smart (ECMWF), Baudouin Raoult (ECMWF)

Numerical Weather Prediction (NWP) and Climate simulations sit in the intersection between classically understood High Performance Computing (HPC) and the Big Data / High Performance Data Analytics (HPDA) communities. Driven by ever more ambitions scientific goals, both the size and number of output data elements generated as part of NWP operations has grown by several orders of magnitude, and will continue to grow into the future. The total amount of data is expected to grow exponentially with time, and over the last 30 years this increase has been approximately 40% per year. This poses significant scalability challenges for the data processing pipeline, and the movement of data through and between stages is one of the most significant factors in this. At ECMWF, meteorological data within the HPC facility is stored in an indexed data store for retrieval according to a well defined schema of meteorological metadata. This paper discusses the design and implementation of the next version (5th) of this indexed data store, which aims to increase the range of contexts within the operational workflow in which it can be used, and to increase its tolerance to failure. Further, it aims to pre-emptively head off some upcoming scalability bottlenecks present in the previous versions.

### ABCpy: A User-Friendly, Extensible, and Parallel Library for Approximate Bayesian Computation

#### Ritabrata Dutta (Università della Svizzera italiana), Marcel Schoengens (ETH Zurich / CSCS), Jukka-Pekka Onnela (Harvard University), Antonietta Mira (Università della Svizzera italiana)

ABCpy is a highly modular, scientific library for Approximate Bayesian Computation (ABC) written in Python. The main contribution of this paper is to document a software engineering effort that enables domain scientists to easily apply ABC to their research without being ABC experts; using ABCpy they can easily run large parallel simulations without much knowledge about parallelization, even without much additional effort to parallelize their code. Further, ABCpy enables ABC experts to easily develop new inference schemes and evaluate them in a standardized environment, and to extend the library with new algorithms. These benefits come mainly from the modularity of ABCpy. We give an overview of the design of ABCpy, and we provide a performance evaluation concentrating on parallelization.

### Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

#### Alfio Lazzaro (University of Zurich), Joost VandeVondele (ETH Zurich / CSCS), Juerg Hutter (University of Zurich), Ole Schuett (University of Zurich)

Matrix-matrix multiplication is a basic operation in linear algebra and an essential building block for a wide range of algorithms in various scientific fields. Theory and implementation for the dense, square matrix case are well-developed. If matrices are sparse, with application-specific sparsity patterns, the optimal implementation remains an open question. Here, we explore the performance of communication reducing 2.5D algorithms and one-sided MPI communication in the context of linear scaling electronic structure theory. In particular, we extend the DBCSR sparse matrix library, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. The library is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation. Here, we compare the performance of the original implementation based on Cannon's algorithm and MPI point-to-point communication, with an implementation based on MPI one-sided communications (RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and auxiliary operations for reduced communication, which can lead to a speedup if communication is dominant. The 2.5D algorithm is somewhat easier to implement with one-sided communications. A detailed description of the implementation is provided, also for non ideal processor topologies, since this is important for actual applications. Given the importance of the precise sparsity pattern, and even the actual matrix data, which decides the effective fill-in upon multiplication, the tests are performed within the CP2K package with application benchmarks. Results show a substantial boost in performance for the RMA based 2.5D algorithm, up to 1.80x, which is observed to increase with the number of involved processes in the parallelization.

### Parallelized Dimensional Decomposition for Large-Scale Dynamic Stochastic Economic Models

#### Aryan Eftekhari (Università della Svizzera italiana), Simon Scheidegger (University of Zurich), Olaf Schenk (Università della Svizzera italiana)

We introduce and deploy a generic, highly scalable computational method to solve high-dimensional dynamic stochastic economic models on high-performance computing platforms. Within an MPI-TBB parallel, nonlinear time iteration framework, we approximate economic policy functions using an adaptive sparse grid algorithm with d-linear basis functions that is combined with a dimensional decomposition scheme. Numerical experiments on ìPizDaintî (Cray XC30) at the Swiss National Supercomputing Center show that our framework scales nicely to at least 1,000 compute nodes. As an economic application, we compute global solutions to international real business cycle models up to 200 continuous dimensions with significant speedup values over state-of-the-art techniques.

### Scheduling Finite Difference Approximations for DAG-Modeled Large Scale Applications

#### Xavier Meyer (University of Lausanne), Bastien Chopard (University of Geneva), Nicolas Salamin (University of Lausanne)

An increasing number of scientific domains are confronted with the arduous task of managing large scale applications. For such applications, gradient estimations come at a large computational cost. Despite notable advances in automatic differentiation during the last years, its use in this context may reveal too costly in memory, inadequate for parallel architecture or require expert knowledge. For these reasons, we investigate an alternative approach that uses the finite difference method to evaluate the gradient of functions modeled as a directed acyclic graph. This approach enables the reuse of partial results from previous partial derivatives evaluations and thus reduces the computational cost. We identify a discrete optimization problem arising in the limited-memory context of large scale applications that aims to maximize the computational efficiency of the gradient approximation by scheduling the partial derivatives. This optimization problem is extended to consider the partitioning of the computations on multiple processors. We further derive some properties of these optimization problems, such as their upper bound on performance gains. Following a brief description of algorithms designed to obtain sensible solutions for both problems, we study the increase in performance resulting from sequential and parallel schedules obtained for synthetic DAGs. Finally, we employ this approach to accelerate the gradient evaluation of DAGs representing real evolutionary biology models. One one of these large scale applications, our approach is shown to be nearly 400 times faster than a state-of-the-art software in sequential, and more than 11,000 times faster when using 256 processors.

### Asynchronous Task-Based Parallelization of Algebraic Multigrid

#### Amani Alonazi (King Abdullah University of Science and Technology), George Markomanolis (King Abdullah University of Science and Technology), David Keyes (King Abdullah University of Science and Technology)

As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputer Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.