Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops
Alexander Supalov, Andrey Semin, Christopher Dahnken, Michael Klemm
Format: PDF / Kindle (mobi) / ePub
Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interface (MPI) and OpenMP for multi-threading to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.
The book focuses on optimization for clusters consisting of the Intel® Xeon processor, but the optimization methodologies also apply to the Intel® Xeon Phi™ coprocessor and heterogeneous clusters mixing both architectures. Besides the tutorial and reference content, the authors address and refute many myths and misconceptions surrounding the topic. The text is augmented and enriched by descriptions of real-life situations.
What you’ll learn
- How to make clusters and workstations based on Intel® Xeon processors and Intel® Xeon Phi™ coprocessors "sing" in Linux environments, through practical, hands-on examples
- How to master the synergy of Intel® Parallel Studio XE 2015 Cluster Edition, which includes Intel® Composer XE, Intel® MPI Library, Intel® Trace Analyzer and Collector, Intel® VTune™ Amplifier XE, and many other useful tools
- How to achieve immediate and tangible optimization results while refining your understanding of software design principles
Who this book is for
Software professionals will use this book to design, develop, and optimize their parallel programs on Intel platforms. Students of computer science and engineering will value the book as a comprehensive reader, suitable to many optimization courses offered around the world. The novice reader will enjoy a thorough grounding in the exciting world of parallel computing.
Table of Contents
Foreword by Bronis de Supinski, CTO, Livermore Computing, LLNL
Chapter 1: No Time to Read this Book?
Chapter 2: Overview of Platform Architectures
Chapter 3: Top-Down Software Optimization
Chapter 4: Addressing System Bottlenecks
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
Chapter 6: Addressing Application Bottlenecks: Shared Memory
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
Chapter 8: Application Design Considerations
analysis to a degree and provides an easy way to correct, well-performing code. Array notation (AN) introduces an array section notation that allows the specification of particular elements, either compact or regularly strided ones:
First Steps: Loading and Storing

The first thing we want to do is get data into a vector register and back into main memory. There are two ways of loading, aligned and unaligned:

__m256d a = _mm256_load_pd(double* memptr): Loads the four packed double-precision numbers contained in the 256 bits starting at memptr; memptr must be 32-byte aligned.

__m256d a = _mm256_loadu_pd(double* memptr): Loads the four packed double-precision numbers contained in the 256 bits starting at memptr; memptr does not
consumption stays within the specification and the cooling system can keep the processor package below its critical temperature. An Intel Xeon E5-2697 v2 processor can run at a clock speed up to 300 MHz higher in Turbo Boost mode, thus reaching up to 3 GHz. When Turbo Boost is disabled in the BIOS settings, though, the processor clock frequency cannot exceed the nominal 2.7 GHz, and consequently the performance reported by nodeperf is lower, while still above 90 percent of the peak performance, as
respectively. 5. The default pinning takes into account the platform affinity setting (cf. the cpuset command) and the locality of the InfiniBand network cards (called host channel adapters, or HCAs). It also prescribes targeting the virtual cores (unit) and compact domain ordering (compact) in the absence of the respective qualifiers in the values of the I_MPI_PIN_PROCESSOR_LIST and I_MPI_PIN_DOMAIN environment variables. There may be small deviations between the description given and the realities of
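These defaults can be overridden explicitly. A hedged configuration sketch (the environment variables are those of the Intel MPI Library; the rank count and program name are invented):

```shell
# Pin each MPI rank to its own physical-core domain
export I_MPI_PIN_DOMAIN=core
# Alternatively, restrict ranks to an explicit list of logical CPUs
export I_MPI_PIN_PROCESSOR_LIST=0-11
mpirun -np 12 ./my_hpc_app
```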