The Highly Efficient AllReduce in MPICH2

The MPI specification defines a collective operation called AllReduce. In MPICH2, one of the most widely used implementations of MPI, AllReduce is implemented using a very smart algorithm. Thanks to Wenyen, who explained this algorithm to me clearly in a few minutes. To the right is the figure Wenyen drew on a whiteboard.

Suppose we have n computers, and each computer holds a data block with n cells. Denote the j-th data cell on the i-th computer by Dij, and all data cells on computer i by Di={Di1,…,Din}. The goal of AllReduce is to make every computer hold R(D)=\sum_i Di, the element-wise sum of all the blocks.
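To make the goal concrete, here is a toy sketch in plain Python (not MPI code) of what the final state looks like, using a hypothetical 3-computer example:

```python
# D[i][j] is the j-th data cell on the i-th computer (here n = 3).
D = [
    [1, 2, 3],   # computer 0
    [4, 5, 6],   # computer 1
    [7, 8, 9],   # computer 2
]

# After AllReduce with the sum operation, every computer holds the
# element-wise sum of all blocks: R(D)[j] = sum over i of D[i][j].
R = [sum(D[i][j] for i in range(len(D))) for j in range(len(D[0]))]
print(R)  # [12, 15, 18]
```

A naive implementation would gather all blocks on one computer, reduce them there, and broadcast the result; the two-stage algorithm below spreads both the reduction work and the communication across all computers.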

The AllReduce algorithm implemented in MPICH2 consists of two stages: (1) reduce-scatter and (2) all-gather. Reduce-scatter leaves computer j holding the j-th cell of the result, R(D)_j = \sum_i Dij. In the all-gather stage, the computers exchange these partial results so that each of them ends up with the complete R(D)={R(D)_1, …, R(D)_n}.
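The Thakur et al. paper below describes doing the reduce-scatter by recursive halving and the all-gather by recursive doubling. The following is a single-process Python simulation of that pattern, not real MPI code; it assumes the number of computers p is a power of two and the block length n is divisible by p, and the function name `allreduce_sim` is made up for illustration:

```python
def allreduce_sim(data):
    """Simulate two-stage allreduce: reduce-scatter by recursive
    halving, then all-gather by recursive doubling."""
    p = len(data)            # number of computers
    n = len(data[0])         # cells per block
    assert p & (p - 1) == 0 and n % p == 0
    buf = [list(block) for block in data]   # each computer's local buffer
    lo = [0] * p             # active range [lo, hi) per computer
    hi = [n] * p

    # Stage 1: reduce-scatter via recursive halving. In each step a
    # computer keeps half of its active range, adding in its partner's
    # copy of that half, until it owns one fully reduced block.
    d = p // 2
    while d >= 1:
        snap = [list(b) for b in buf]       # values before this step
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            if r & d == 0:
                hi[r] = mid                  # keep the lower half
            else:
                lo[r] = mid                  # keep the upper half
            for j in range(lo[r], hi[r]):
                buf[r][j] = snap[r][j] + snap[partner][j]
        d //= 2
    # Now computer j holds the j-th block of the reduced result.

    # Stage 2: all-gather via recursive doubling. Partners swap the
    # blocks gathered so far; active ranges double at each step.
    d = 1
    while d < p:
        snap = [list(b) for b in buf]
        slo, shi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ d
            for j in range(slo[partner], shi[partner]):
                buf[r][j] = snap[partner][j]
            lo[r] = min(slo[r], slo[partner])
            hi[r] = max(shi[r], shi[partner])
        d *= 2
    return buf

result = allreduce_sim([[1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]])
print(result[0])  # every row is the column sum: [28, 32, 36, 40]
```

Both stages take log2(p) steps, and each computer reduces only n/p cells per step of the first stage, which is why this beats the gather-reduce-broadcast approach for large blocks.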

References:

  1. Rajeev Thakur, Rolf Rabenseifner, and William Gropp, Optimization of Collective Communication Operations in MPICH, Int'l Journal of High Performance Computing Applications, 19(1):49-66, Spring 2005.