Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops

Alexander Supalov, Andrey Semin, Christopher Dahnken, Michael Klemm

Language: English

Pages: 300

ISBN: 1430264969

Format: PDF / Kindle (mobi) / ePub


Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed-memory and shared-memory programming models, using the Message Passing Interface (MPI) for inter-process communication and OpenMP for multi-threading, to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.

The book focuses on optimization for clusters consisting of the Intel® Xeon processor, but the optimization methodologies also apply to the Intel® Xeon Phi™ coprocessor and heterogeneous clusters mixing both architectures. Besides the tutorial and reference content, the authors address and refute many myths and misconceptions surrounding the topic. The text is augmented and enriched by descriptions of real-life situations.

What you’ll learn

  • How to make clusters and workstations based on Intel® Xeon processors and Intel® Xeon Phi™ coprocessors "sing" in Linux environments, through practical, hands-on examples

  • How to master the synergy of Intel® Parallel Studio XE 2015 Cluster Edition, which includes Intel® Composer XE, Intel® MPI Library, Intel® Trace Analyzer and Collector, Intel® VTune™ Amplifier XE, and many other useful tools

  • How to achieve immediate and tangible optimization results while refining your understanding of software design principles

Who this book is for

Software professionals will use this book to design, develop, and optimize their parallel programs on Intel platforms. Students of computer science and engineering will value the book as a comprehensive reader, suitable for many optimization courses offered around the world. The novice reader will enjoy a thorough grounding in the exciting world of parallel computing.

Table of Contents

Foreword by Bronis de Supinski, CTO, Livermore Computing, LLNL

Introduction

Chapter 1: No Time to Read this Book?

Chapter 2: Overview of Platform Architectures

Chapter 3: Top-Down Software Optimization

Chapter 4: Addressing System Bottlenecks

Chapter 5: Addressing Application Bottlenecks: Distributed Memory

Chapter 6: Addressing Application Bottlenecks: Shared Memory

Chapter 7: Addressing Application Bottlenecks: Microarchitecture

Chapter 8: Application Design Considerations

Sample Excerpts

…analysis to a degree and provides an easy path to correct, performant code. Array notation (AN) introduces an array section syntax that allows the specification of particular elements, compact or regularly strided: [start:length[:stride]]. The syntax resembles Fortran, but Fortran programmers beware: the semantics require start:length, not start:end! Examples of the array section notation:

a[:]         // the whole array
a[0:10]      // elements 0 through 9

…First Steps: Loading and Storing. The first thing we want to do is get data into a vector register and back into main memory. There are two ways of loading, aligned and unaligned:

__m256d a = _mm256_load_pd(double* memptr): loads the four packed double-precision numbers contained in the 256 bits starting at memptr; memptr must be 32-byte aligned.

__m256d a = _mm256_loadu_pd(double* memptr): loads the four packed double-precision numbers contained in the 256 bits starting at memptr; memptr does not need to be aligned.

Index (excerpt):

…synchronization and locking: definition, hotspots profiles, MiniMD benchmark, MiniMD vs. OpenMP atomic construct, VTune Amplifier XE
Array of structures (AoS)

C
Compact policy
ComputeSYMGS_ref function
Control abstraction

D, E
Data abstraction: datatype, virtual memory
Data layout: AoS definition, BQCD makefiles, definition, SIMD, SoA definition, standard vector
Data organization

F
Floating point operations per second (FLOPS)

G
Graphics Double Data Rate, version 5 (GDDR5)

H

…consumption stays within the specification and the cooling system can keep the processor package below its critical temperature. An Intel Xeon E5-2697 v2 processor can run at up to 300 MHz higher clock speed in Turbo Boost mode, thus reaching up to 3 GHz. When Turbo Boost is disabled in the BIOS settings, though, the processor clock frequency cannot exceed the nominal 2.7 GHz, and consequently the performance reported by nodeperf is lower, while still above 90 percent of peak performance.

…respectively. The default pinning takes into account the platform affinity setting (cf. the cpuset command) and the locality of the InfiniBand networking cards (called host channel adapters, or HCAs). It also prescribes targeting the virtual cores (unit) and compact domain ordering (compact) in the absence of the respective qualifiers in the values of the I_MPI_PIN_PROCESSOR_LIST and I_MPI_PIN_DOMAIN environment variables. There may be small deviations between the description given and the realities of…
