Turbo codes [1] and LDPC codes [2] are the two best-known codes capable of achieving low bit error rates (BER) at code rates approaching Shannon's capacity limit. However, to meet the power and throughput targets of current applications (e.g., > 1 Mbps in 3G wireless systems, > 1 Gbps in magnetic recording systems), fully parallel and pipelined iterative decoder architectures are needed. Compared to turbo codes, LDPC codes enjoy a significant advantage in computational complexity and are known to have a large amount of inherent parallelism [3]. However, the randomness of LDPC codes results in stringent memory requirements that amount to an order-of-magnitude increase in complexity over turbo codes.

A direct approach to implementing a parallel decoder architecture would be to allocate, for each node or cluster of nodes in the graph defining the LDPC code, a function unit that computes the reliability messages, and to employ an interconnection network to route messages between function units (see Fig. 1). A major problem with this approach is that the interconnection network requires complex wiring to perform global routing of messages and hence must be deeply pipelined (e.g., bidirectional multilayered networks in [4] and 4096-input multiplexers per function unit in [5]). Moreover, the randomness in the pattern of communicated messages leads to routing and congestion problems on the network, which require extensive buffering to resolve.
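To make the message-passing structure concrete, the sketch below implements standard flooding-schedule min-sum decoding in software: each row of the parity-check matrix H acts as a check-node function unit, each column as a variable-node unit, and the nonzero entries of H play the role of the interconnection network. The toy matrix H and channel LLRs here are illustrative choices of ours, not taken from the paper or from the architectures of [4] and [5].

```python
import numpy as np

# Hypothetical toy parity-check matrix (illustrative, not from the paper).
H = np.array([
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
])

def min_sum_decode(llr, H, iterations=10):
    """Flooding-schedule min-sum decoding.

    In each iteration every check node and every variable node
    updates in parallel; the edge messages exchanged via H are what
    a fully parallel decoder maps onto function units and wires.
    """
    m, n = H.shape
    # Variable-to-check messages, one per edge (nonzero entry) of H,
    # initialized with the channel LLRs.
    v2c = H * llr
    for _ in range(iterations):
        # Check-node update: for each edge, take the sign product and
        # minimum magnitude of the *other* incoming messages
        # (the min-sum approximation of the sum-product rule).
        c2v = np.zeros_like(v2c, dtype=float)
        for i in range(m):
            idx = np.flatnonzero(H[i])
            for j in idx:
                others = [v2c[i, k] for k in idx if k != j]
                sign = np.prod(np.sign(others))
                c2v[i, j] = sign * min(abs(x) for x in others)
        # Variable-node update: channel LLR plus all incoming check
        # messages; each outgoing edge excludes its own check's message.
        total = llr + c2v.sum(axis=0)
        v2c = H * (total - c2v)
        hard = (total < 0).astype(int)
        # Early exit once all parity checks are satisfied.
        if not np.any(H @ hard % 2):
            break
    return hard

# Example: all-zero codeword over a noisy channel; bit 0 is received
# in error but with low confidence, so decoding corrects it.
llr = np.array([-1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0])
decoded = min_sum_decode(llr, H)
```

Note that the two nested loops in the check-node update are exactly the computation a hardware design unrolls into parallel function units; the irregular column indices `idx` are the source of the random global routing discussed above.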