The PEM model2 is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of P {\displaystyle P} processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size N {\displaystyle N} and P {\displaystyle P} small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size M {\displaystyle M} which is partitioned in blocks of size B {\displaystyle B} . The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size B {\displaystyle B} .
The complexity measure of the PEM model is the I/O complexity,3 which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if P {\displaystyle P} processors load parallelly a data block of size B {\displaystyle B} form the main memory into their caches, it is considered as an I/O complexity of O ( 1 ) {\displaystyle O(1)} not O ( P ) {\displaystyle O(P)} . A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts4 occur. Like in the PRAM model, three different variations of this problem are considered:
The following two algorithms5 solve the CREW and EREW problem if P ≤ B {\displaystyle P\leq B} processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of P {\displaystyle P} parallel block transfers. A second approach needs O ( log ( P ) ) {\displaystyle O(\log(P))} parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round P {\displaystyle P} processors combine their blocks into P / 2 {\displaystyle P/2} blocks. Then P / 2 {\displaystyle P/2} processors combine the P / 2 {\displaystyle P/2} blocks into P / 4 {\displaystyle P/4} . This procedure is continued until all the data is combined in one block.
Let M = { m 1 , . . . , m d − 1 } {\displaystyle M=\{m_{1},...,m_{d-1}\}} be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition6 of A is a set Π = { A 1 , . . . , A d } {\displaystyle \Pi =\{A_{1},...,A_{d}\}} , where ∪ i = 1 d A i = A {\displaystyle \cup _{i=1}^{d}A_{i}=A} and A i ∩ A j = ∅ {\displaystyle A_{i}\cap A_{j}=\emptyset } for 1 ≤ i < j ≤ d {\displaystyle 1\leq i<j\leq d} . A i {\displaystyle A_{i}} is called the i-th bucket. The number of elements in A i {\displaystyle A_{i}} is greater than m i − 1 {\displaystyle m_{i-1}} and smaller than m i 2 {\displaystyle m_{i}^{2}} . In the following algorithm7 the input is partitioned into N/P-sized contiguous segments S 1 , . . . , S P {\displaystyle S_{1},...,S_{P}} in main memory. The processor i primarily works on the segment S i {\displaystyle S_{i}} . The multiway partitioning algorithm (PEM_DIST_SORT8) uses a PEM prefix sum algorithm9 to calculate the prefix sum with the optimal O ( N P B + log P ) {\displaystyle O\left({\frac {N}{PB}}+\log P\right)} I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.
If the vector of d = O ( M B ) {\displaystyle d=O\left({\frac {M}{B}}\right)} pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with O ( N P B + ⌈ d B ⌉ > log ( P ) + d log ( B ) ) {\displaystyle O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)} I/O complexity. The content of the final buckets have to be located in contiguous memory.
The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code10 makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in O ( log N ) {\displaystyle O(\log N)} , and SELECT, which is a cache optimal single-processor selection algorithm.
Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:
Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.
If P = 1 {\displaystyle P=1} the task is delegated to a cache-optimal single-processor sorting algorithm.
Otherwise the following algorithm11 is used:
The I/O complexity of PEMDISTSORT is:
where
If the number of processors is chosen that f ( N , P , d ) = O ( ⌈ N P B ⌉ ) {\displaystyle f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)} and M < B O ( 1 ) {\displaystyle M<B^{O(1)}} the I/O complexity is then:
O ( N P B log M / B N B ) {\displaystyle O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)}
Where sort P ( N ) {\displaystyle {\textrm {sort}}_{P}(N)} is the time it takes to sort N items with P processors in the PEM model.
Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. 9781595939739 ↩
Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572. 9781424464425 ↩