Load balancing is critical to achieving high performance in MPI applications that run on parallel and distributed systems. The goal is to distribute work evenly across all processes so that no process is overloaded while others sit idle, which maximizes resource utilization and minimizes overall runtime. Because most MPI programs synchronize regularly, the slowest process sets the pace for all the others. The MPI standard does not balance application work itself; it supplies the communication, topology, and process-placement facilities on which applications and libraries build their own schemes. The main techniques are as follows:
Static load balancing is decided before the computation starts, at compile time or during initialization, and does not change at runtime. The developer or application analyzes the problem and divides the work among processes beforehand. This approach adds no runtime overhead and performs well when the cost of each piece of work is predictable, but it lacks flexibility: imbalances that emerge during execution cannot be corrected. MPI supports static schemes through derived datatypes and variable-count collectives such as MPI_Scatterv for custom data decompositions, and the launchers shipped with most implementations let the user control how processes are mapped to hardware.
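As a minimal sketch of a static decomposition, the following pure-Python function splits a number of items into contiguous blocks whose sizes differ by at most one. The name block_partition is hypothetical; the two lists it returns have the same meaning as the counts and displacements that a variable-count collective such as MPI_Scatterv expects, but no MPI call is made here.

```python
def block_partition(n_items, n_procs):
    """Return (counts, displs) splitting n_items into n_procs contiguous blocks.

    The first n_items % n_procs ranks receive one extra item, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_procs)
    counts = [base + 1 if rank < extra else base for rank in range(n_procs)]
    # Each displacement is the running sum of the counts before it.
    displs = [0] * n_procs
    for rank in range(1, n_procs):
        displs[rank] = displs[rank - 1] + counts[rank - 1]
    return counts, displs


counts, displs = block_partition(10, 4)
print(counts)  # [3, 3, 2, 2]
print(displs)  # [0, 3, 6, 8]
```

The split balances the number of items, not their cost, so it is only as good as the assumption that every item takes about the same time.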
Dynamic load balancing redistributes work at runtime in response to load imbalances. Periodic reactive methods monitor process load over time and move data or tasks between processes as needed. In work-stealing algorithms, idle processes take tasks from overloaded ones; in the complementary work-sharing (sender-initiated) approach, overloaded processes push work to idle ones. Randomized techniques choose the source or destination of a transfer at random, which tends to even out load without requiring global knowledge. Threshold-based schemes trigger rebalancing only when the gap between the most and least loaded processes exceeds a threshold. Dynamic strategies improve flexibility but add runtime overhead for monitoring and for moving work.
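A threshold-based trigger can be sketched in a few lines. The code below is an assumption-laden model, not an MPI program: loads are plain integers in a list, all of them are visible in one place, and a "transfer" just moves half the gap from the heaviest to the lightest entry. The names imbalance and rebalance_if_needed are hypothetical.

```python
def imbalance(loads):
    """Gap between the most and least loaded process."""
    return max(loads) - min(loads)


def rebalance_if_needed(loads, threshold):
    """Move work from the heaviest to the lightest process while the gap
    exceeds threshold. Returns the new loads and the number of transfers."""
    loads = list(loads)  # work on a copy
    transfers = 0
    while imbalance(loads) > threshold:
        src = loads.index(max(loads))
        dst = loads.index(min(loads))
        amount = (loads[src] - loads[dst]) // 2
        if amount == 0:
            break  # a gap of one unit cannot be split further
        loads[src] -= amount
        loads[dst] += amount
        transfers += 1
    return loads, transfers


print(rebalance_if_needed([10, 2, 6, 6], threshold=2))  # ([6, 6, 6, 6], 1)
print(rebalance_if_needed([5, 4, 6], threshold=3))      # ([5, 4, 6], 0)
```

In the second call the gap is below the threshold, so no transfer happens and no overhead is paid. In a real MPI code the loads would first have to be gathered or reduced across ranks, which is itself part of the cost of checking.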
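Many MPI applications combine static partitioning with limited dynamic adjustment, for example a static initial decomposition followed by periodic load checks and reactive transfers. MPI implementations contribute mainly on the static side. Open MPI's launcher can map processes hierarchically, for instance across sockets and then onto cores within each socket, which gives a locality-aware static layout; the default mapping and binding policy differs between releases, and placement stays fixed once the job starts, so any intra-node rebalancing is left to the application or a threading runtime. MPICH, like any conforming implementation, supports the standard virtual topology routines such as MPI_Cart_create and MPI_Dist_graph_create, which let a static partition follow the problem geometry; dynamic balancing logic is written by the application or a library on top of ordinary MPI communication.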
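The idea of a two-level, socket-then-core layout can be illustrated with a small pure function. This is a hypothetical model of a single node, not the algorithm any particular launcher uses: map_rank and its parameters are invented for the example, and ranks are simply dealt round-robin across sockets and then stacked onto successive cores.

```python
def map_rank(rank, n_sockets, cores_per_socket):
    """Place a rank on (socket, core): round-robin across sockets first,
    then onto successive cores within each socket."""
    slots = n_sockets * cores_per_socket
    if not 0 <= rank < slots:
        raise ValueError("rank does not fit on this node")
    socket = rank % n_sockets
    core = rank // n_sockets
    return socket, core


layout = [map_rank(r, n_sockets=2, cores_per_socket=4) for r in range(8)]
print(layout)
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```

Because the mapping depends only on the rank, it is fixed for the lifetime of the job, which is what makes it a static layout.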
Decentralized and hierarchical load balancing algorithms avoid the bottleneck of a central coordinator. In distributed work stealing, an idle process requests tasks directly from a busy one without involving a master. Hierarchical schemes group processes into clusters, often one per node, that balance internally and exchange load between clusters only when a whole cluster is overloaded or underloaded. These techniques scale better to large process counts but require more sophisticated heuristics, such as victim selection and distributed termination detection.
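The following single-process simulation sketches how work stealing drains an unevenly filled set of queues without a master. It rests on several simplifying assumptions: workers run in lockstep rounds, every task costs one round, and the thief deterministically picks the longest queue as its victim (real implementations usually pick a victim at random and communicate by messages). The function name run_work_stealing is hypothetical.

```python
from collections import deque


def run_work_stealing(initial_tasks):
    """Simulate workers draining per-worker deques in lockstep rounds.

    initial_tasks: one list of task ids per worker.
    Returns (rounds, executed), where executed[w] lists the tasks worker w ran.
    """
    queues = [deque(tasks) for tasks in initial_tasks]
    executed = [[] for _ in queues]
    rounds = 0
    while any(queues):
        rounds += 1
        for me, queue in enumerate(queues):
            if not queue:
                # Idle worker: steal one task from the head of the longest
                # queue, leaving the victim at least one task of its own.
                victim = max(range(len(queues)), key=lambda w: len(queues[w]))
                if len(queues[victim]) > 1:
                    queue.append(queues[victim].popleft())
            if queue:
                # Owner works from the tail of its own deque.
                executed[me].append(queue.pop())
    return rounds, executed


rounds, executed = run_work_stealing([list(range(8)), [], [], []])
print(rounds)  # 3
```

With all eight tasks starting on worker 0, a run without stealing would take eight rounds; here the idle workers pull tasks over and the queues drain in three. Owners and thieves operate on opposite ends of the deque, which is the usual way to keep them from contending for the same task.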
Data decomposition strategies such as block (or block-block in two dimensions) and cyclic distributions also affect load balance. Block distributions assign each process one contiguous chunk of the data, preserving locality but risking imbalance when the work per element is non-uniform. Cyclic distributions deal elements out round-robin, so that element i goes to process i mod P; this evens out workloads that vary gradually across the index space but harms locality. Block-cyclic distributions deal out fixed-size blocks round-robin as a compromise between the two. Many applications combine techniques, for example static partitioning for coarse-grained tasks across nodes with dynamic work stealing within each shared-memory node.
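The difference between the two distributions shows up clearly on a workload whose cost grows with the index, such as a triangular loop. The sketch below is illustrative only: the function names are hypothetical, and the cost model (item i costs i + 1 units) is an assumption chosen to make the imbalance visible.

```python
def block_owner(i, n_items, n_procs):
    """Owner of item i when items are split into contiguous blocks."""
    block = -(-n_items // n_procs)  # ceiling division
    return i // block


def cyclic_owner(i, n_procs):
    """Owner of item i when items are dealt out round-robin."""
    return i % n_procs


def triangular_cost(i):
    """Assumed cost model: later items are more expensive."""
    return i + 1


def loads(owner_of, n_items, n_procs, cost):
    """Total cost assigned to each process under a given ownership rule."""
    totals = [0] * n_procs
    for i in range(n_items):
        totals[owner_of(i)] += cost(i)
    return totals


n, p = 16, 4
print(loads(lambda i: block_owner(i, n, p), n, p, triangular_cost))
# [10, 26, 42, 58]
print(loads(lambda i: cyclic_owner(i, p), n, p, triangular_cost))
# [28, 32, 36, 40]
```

The total work is 136 units in both cases, so a perfect split would give each process 34. The block layout leaves the last process with 58 units, while the cyclic layout caps the heaviest process at 40, at the price of each process touching items scattered across the whole index range.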
Runtime systems and thread-level speculation allow finer-grained adjustment by moving tasks between threads within a process rather than between processes. Thread schedulers can hand tasks queued on busy threads to idle ones. Speculative parallelization runs code that may or may not be independent in parallel on otherwise idle threads and discards the results when a dependence violation is detected. These fine-grained strategies complement process-level load balancing in hybrid MPI-plus-threads programs.
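Thread-level dynamic scheduling can be sketched with the standard library thread pool. Note the assumptions: this uses a single shared task queue from which idle threads pull the next task, rather than per-thread queues with stealing, and in standard CPython builds the global interpreter lock means it demonstrates the scheduling pattern rather than a speedup. The function run_tasks and the stand-in computation are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_sizes, n_threads):
    """Run uneven tasks on a shared-queue thread pool.

    Returns the result of every task (in submission order) and the number
    of tasks each worker thread ended up running.
    """
    per_thread = {}
    lock = threading.Lock()

    def work(size):
        result = sum(range(size))  # stand-in for real computation
        name = threading.current_thread().name
        with lock:
            per_thread[name] = per_thread.get(name, 0) + 1
        return result

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(work, task_sizes))
    return results, per_thread


sizes = [200_000, 10, 10, 10, 10, 10, 10, 10]
results, per_thread = run_tasks(sizes, n_threads=4)
print(sum(per_thread.values()))  # 8
```

How the eight tasks are split among the threads varies from run to run, which is the point: a thread that finishes a small task early picks up the next one instead of waiting for a fixed share.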
Modern MPI applications, and the libraries built on MPI, therefore combine static partitioning, dynamic load balancing, decentralized coordination, and runtime load monitoring and migration to distribute parallel work across computing resources. The right balance of static analysis and dynamic adaptation depends on application characteristics, problem sizes, and system architectures. Continued improvements to load balancing algorithms will help applications scale on future extreme-scale systems composed of billions of distributed heterogeneous devices.