Parallel external memory

<h2 id="model">Model</h2>
<h3>Definition</h3>
<p>The PEM model<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a> is a combination of the EM model and the PRAM model. It is a computation model that consists of \(P\) processors and a two-level <a href="/facts/Memory_hierarchy/ttwjvRyd">memory hierarchy</a>. This memory hierarchy consists of a large <a href="/facts/External_memory_algorithm/sCcj8M8o">external memory</a> (main memory) of size \(N\) and \(P\) small <a href="/facts/Cache_(computing)/gmJYN8wm">internal memories (caches)</a>. The processors share the main memory. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size \(M\) and is partitioned into blocks of size \(B\). The processors can only perform operations on data that is in their cache. Data is transferred between the main memory and a cache in blocks of size \(B\).
</p>
<h3>I/O complexity</h3>
<p>The <a href="/facts/Programming_complexity/HuVncxNS">complexity measure</a> of the PEM model is the I/O complexity,<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a> which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if \(P\) processors each load a data block of size \(B\) from the main memory into their caches in parallel, this counts as an I/O complexity of \(O(1)\), not \(O(P)\). A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
</p>
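<p>As a rough illustration of this counting rule, the following Python sketch compares the number of block transfers needed to scan \(N\) items with one processor and with \(P\) processors. The function names and the assumption that the input is split into equal contiguous segments are illustrative choices, not part of the model's definition.</p>

```python
import math


def sequential_block_transfers(n: int, b: int) -> int:
    """Block transfers for one processor to read n items in blocks of b."""
    return math.ceil(n / b)


def parallel_block_transfers(n: int, p: int, b: int) -> int:
    """Parallel block transfers for p processors to read n items.

    Assumption: the input is split into p contiguous segments of about
    n/p items, and every processor moves one block per parallel step.
    """
    segment = math.ceil(n / p)      # items handled by one processor
    return math.ceil(segment / b)   # one parallel I/O moves p blocks at once
```

<p>Under these assumptions, four processors that each load a single block perform one parallel block transfer, which matches the \(O(1)\) cost stated above.</p>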
<h3>Read/write conflicts</h3>
<p>In the PEM model, there is no <a href="/facts/Computer_network/3w5RM99p">direct communication network</a> between the \(P\) processors. The processors have to communicate indirectly via the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a> occur. As in the PRAM model, three variations of this problem are considered:
</p>
<ul><li>Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.</li>
<li>Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.</li>
<li>Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.</li></ul>
<p>The following two algorithms<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a> solve the CREW and EREW problem when \(P\leq B\) processors write to the same block simultaneously.
The first approach serializes the write operations: the processors write to the block one after another. This results in a total of \(P\) parallel block transfers. The second approach needs \(O(\log P)\) parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a <a href="/facts/Reduce_(parallel_pattern)/91nN9z5n">binary tree fashion</a> and gradually combine the data into a single block. In the first round, \(P\) processors combine their blocks into \(P/2\) blocks. Then \(P/2\) processors combine the \(P/2\) blocks into \(P/4\) blocks. This procedure continues until all the data is combined in one block.
</p>
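<p>The difference between the two approaches can be sketched in Python by counting rounds. This is a sequential simulation under simplifying assumptions: each processor's pending write is modeled as a dictionary from a position inside the block to a value, and each combining round is assumed to cost a constant number of parallel block transfers. The names <code>serialized_rounds</code> and <code>tree_combine</code> are hypothetical.</p>

```python
def serialized_rounds(p: int) -> int:
    """First approach: p processors write one after another."""
    return p


def tree_combine(blocks):
    """Second approach: combine per-processor blocks pairwise.

    blocks is a non-empty list with one dict per processor, mapping a
    position inside the shared block to the value written there.
    Returns the combined block and the number of combining rounds.
    """
    current = [dict(block) for block in blocks]
    rounds = 0
    while len(current) > 1:
        merged = []
        # In one round, every second "processor" merges its block with
        # the block of its right neighbour (if it has one).
        for i in range(0, len(current), 2):
            combined = dict(current[i])
            if i + 1 < len(current):
                combined.update(current[i + 1])
            merged.append(combined)
        current = merged
        rounds += 1
    return current[0], rounds
```

<p>For eight processors the simulation needs three combining rounds, compared with eight serialized writes.</p>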
<h3>Comparison to other models</h3>
<table><tbody><tr><th>Model</th><th>Multi-core</th><th>Cache-aware</th></tr><tr><td><a href="/facts/Random-access_machine/9NrNtVSd">Random-access machine</a> (RAM)</td><td>No</td><td>No</td></tr><tr><td><a href="/facts/Parallel_random-access_machine/G1pCFgWP">Parallel random-access machine</a> (PRAM)</td><td>Yes</td><td>No</td></tr><tr><td><a href="/facts/External_memory_algorithm/sCcj8M8o">External memory</a> (EM)</td><td>No</td><td>Yes</td></tr><tr><td>Parallel external memory (PEM)</td><td>Yes</td><td>Yes</td></tr></tbody></table>
<h2 id="examples">Examples</h2>
<h3>Multiway partitioning</h3>
<p>Let \(M=\{m_{1},\dots ,m_{d-1}\}\) be a vector of \(d-1\) pivots sorted in increasing order. Let \(A\) be an unordered set of \(N\) elements. A \(d\)-way partition<a class="footnote-ref" id="fnref:6" href="#fn:6"><sup>6</sup></a> of \(A\) is a set \(\Pi =\{A_{1},\dots ,A_{d}\}\), where \(\cup _{i=1}^{d}A_{i}=A\) and \(A_{i}\cap A_{j}=\emptyset\) for \(1\leq i &lt; j\leq d\). \(A_{i}\) is called the \(i\)-th bucket. Every element in \(A_{i}\) is greater than \(m_{i-1}\) and smaller than \(m_{i}\) (with the conventions \(m_{0}=-\infty\) and \(m_{d}=+\infty\)). In the following algorithm<a class="footnote-ref" id="fnref:7" href="#fn:7"><sup>7</sup></a> the input is partitioned into \(N/P\)-sized contiguous segments \(S_{1},\dots ,S_{P}\) in main memory. Processor \(i\) primarily works on the segment \(S_{i}\). The multiway partitioning algorithm (PEM_DIST_SORT<a class="footnote-ref" id="fnref:8" href="#fn:8"><sup>8</sup></a>) uses a PEM <a href="/facts/Prefix_sum/HAacSHLW">prefix sum</a> algorithm<a class="footnote-ref" id="fnref:9" href="#fn:9"><sup>9</sup></a> to calculate the prefix sum with the optimal \(O\left(\frac{N}{PB}+\log P\right)\) I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.
</p>
<pre>
// Compute in parallel a d-way partition on the data segments S_i
for each processor i in parallel do
    Read the vector of pivots M into the cache.
    Partition S_i into d buckets and let vector M_i = {j_1^i, ..., j_d^i} be the number of items in each bucket.
end for

Run PEM prefix sum on the set of vectors {M_1, ..., M_P} simultaneously.

// Use the prefix sum vector to compute the final partition
for each processor i in parallel do
    Write elements of S_i into memory locations offset appropriately by M_{i-1} and M_i.
end for

Using the prefix sums stored in M_P, the last processor P calculates the vector B of bucket sizes and returns it.
</pre>

<p>If the vector of \(d=O\left(\frac{M}{B}\right)\) pivots \(M\) and the input set \(A\) are located in contiguous memory, then the \(d\)-way partitioning problem can be solved in the PEM model with \(O\left(\frac{N}{PB}+\left\lceil \frac{d}{B}\right\rceil \log P+d\log B\right)\) I/O complexity. The contents of the final buckets must be located in contiguous memory.
</p>
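<p>The three phases of the multiway partitioning algorithm (local bucketing, prefix sums over the bucket counts, and offset writes) can be simulated sequentially. The Python sketch below is only an illustration: the function name <code>multiway_partition</code> is hypothetical, the processors are simulated by a loop, the prefix sums are computed with a plain running total rather than a PEM prefix sum algorithm, and elements equal to a pivot are assumed to go into the lower bucket.</p>

```python
from bisect import bisect_left
from itertools import accumulate


def multiway_partition(a, pivots, p):
    """Simulate a d-way partition of list a with p virtual processors.

    pivots must be sorted in increasing order. Returns the partitioned
    list and the list of bucket sizes.
    """
    n, d = len(a), len(pivots) + 1
    seg = -(-n // p)  # ceil(n / p): segment length per processor
    segments = [a[i * seg:(i + 1) * seg] for i in range(p)]

    # Phase 1: each processor buckets its own segment and counts items.
    local, counts = [], []
    for s in segments:
        buckets = [[] for _ in range(d)]
        for x in s:
            buckets[bisect_left(pivots, x)].append(x)
        local.append(buckets)
        counts.append([len(b) for b in buckets])

    # Phase 2: prefix sums over the count vectors give, for every
    # processor i and bucket j, the position where i starts writing.
    bucket_sizes = [sum(c[j] for c in counts) for j in range(d)]
    running = [0] + list(accumulate(bucket_sizes))[:-1]
    offsets = []
    for c in counts:
        offsets.append(running[:])
        running = [r + k for r, k in zip(running, c)]

    # Phase 3: each processor writes its items at its own offsets.
    out = [None] * n
    for i in range(p):
        for j in range(d):
            pos = offsets[i][j]
            out[pos:pos + len(local[i][j])] = local[i][j]
    return out, bucket_sizes
```

<p>Because every processor receives disjoint output ranges from the prefix sums, the writes in the last phase do not conflict, and each bucket ends up in contiguous memory.</p>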
<h3>Selection</h3>
<p>The <a href="/facts/Selection_problem/zyxLnxNf">selection problem</a> is about finding the \(k\)-th smallest item in an unordered list \(A\) of size \(N\).
The following code<a class="footnote-ref" id="fnref:10" href="#fn:10"><sup>10</sup></a> makes use of PRAMSORT, a PRAM-optimal sorting algorithm that runs in \(O(\log N)\), and SELECT, a cache-optimal single-processor selection algorithm.
</p>
<pre>
if N ≤ P then
    PRAMSORT(A, P)
    return A[k]
end if

// Find median of each S_i
for each processor i in parallel do
    m_i = SELECT(S_i, N/(2P))
end for

// Sort medians
PRAMSORT({m_1, …, m_P}, P)

// Partition around median of medians
t = PEMPARTITION(A, m_{P/2}, P)

if k ≤ t then
    return PEMSELECT(A[1:t], P, k)
else
    return PEMSELECT(A[t+1:N], P, k-t)
end if
</pre>

<p>Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:
</p>

<p>\[O\left(\frac{N}{PB}+\log(PB)\cdot \log \frac{N}{P}\right)\]</p>
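<p>The recursion behind PEMSELECT can be sketched as a sequential Python simulation. This is a sketch under stated assumptions and not the cited implementation: <code>pem_select</code> is a hypothetical name, Python's <code>sorted</code> stands in for both PRAMSORT and SELECT, the processors are simulated by iterating over segments, and a three-way split around the pivot is used so that the recursion also terminates when the input contains duplicates.</p>

```python
def pem_select(a, p, k):
    """Return the k-th smallest item of a (k is 1-indexed).

    p is the number of simulated processors and must be at least 1.
    """
    n = len(a)
    if n <= p:
        return sorted(a)[k - 1]  # stand-in for PRAMSORT

    # Each simulated processor finds the median of its own segment.
    seg = -(-n // p)  # ceil(n / p)
    segments = [a[i:i + seg] for i in range(0, n, seg)]
    medians = sorted(sorted(s)[(len(s) - 1) // 2] for s in segments)

    # Partition around the median of the segment medians.
    pivot = medians[(len(medians) - 1) // 2]
    less = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    equal = n - len(less) - len(greater)

    if k <= len(less):
        return pem_select(less, p, k)
    if k <= len(less) + equal:
        return pivot
    return pem_select(greater, p, k - len(less) - equal)
```

<p>Since the pivot is always an element of the input, both recursive calls receive strictly fewer items than the current call.</p>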

<h3>Distribution sort</h3>
<p><a href="/facts/Distribution_sort/PwHpj1Gs">Distribution sort</a> partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.
</p><p>If \(P=1\), the task is delegated to a cache-optimal single-processor sorting algorithm.
</p><p>Otherwise the following algorithm<a class="footnote-ref" id="fnref:11" href="#fn:11"><sup>11</sup></a> is used:
</p>
<pre>
// Sample 4N/√d elements from A
for each processor i in parallel do
    if M &lt; |S_i| then
        d = M/B
        Load S_i in M-sized pages and sort pages individually
    else
        d = |S_i|
        Load and sort S_i as single page
    end if
    Pick every (√d/4)-th element from each sorted memory page into contiguous vector R^i of samples
end for

in parallel do
    Combine vectors R^1 … R^P into a single contiguous vector ℛ
    Make √d copies of ℛ: ℛ_1 … ℛ_√d
end do

// Find √d pivots ℳ[j]
for j = 1 to √d in parallel do
    ℳ[j] = PEMSELECT(ℛ_j, P/√d, j·4N/d)
end for

Pack pivots in contiguous array ℳ

// Partition A around pivots into buckets ℬ
ℬ = PEMMULTIPARTITION(A[1:N], ℳ, √d, P)

// Recursively sort buckets
for j = 1 to √d + 1 in parallel do
    recursively call PEMDISTSORT on bucket j of size ℬ[j]
    using O(⌈ℬ[j] / (N/P)⌉) processors responsible for elements in bucket j
end for
</pre>

<p>The I/O complexity of PEMDISTSORT is:
</p>

<p>\[O\left(\left\lceil \frac{N}{PB}\right\rceil \left(\log _{d}P+\log _{M/B}\frac{N}{PB}\right)+f(N,P,d)\cdot \log _{d}P\right)\]</p>

<p>where
</p>

<p>\[f(N,P,d)=O\left(\log \frac{PB}{\sqrt{d}}\log \frac{N}{P}+\left\lceil \frac{\sqrt{d}}{B}\log P+\sqrt{d}\log B\right\rceil \right)\]</p>

<p>If the number of processors is chosen such that \(f(N,P,d)=O\left(\left\lceil \tfrac{N}{PB}\right\rceil \right)\) and \(M &lt; B^{O(1)}\), the I/O complexity is then:
</p><p>\[O\left(\frac{N}{PB}\log _{M/B}\frac{N}{B}\right)\]
</p>
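<p>The underlying distribution-sort idea (sample, choose pivots, partition into buckets, recurse) can be illustrated with a short sequential Python sketch. It does not reproduce the parallel schedule or the I/O behaviour of PEMDISTSORT. The function name <code>distribution_sort</code>, the parameters <code>fanout</code> and <code>base</code>, and the sampling rule are assumptions made for this example only.</p>

```python
from bisect import bisect_left


def distribution_sort(a, fanout=4, base=8):
    """Sort list a by recursively distributing it into buckets."""
    n = len(a)
    if n <= base:
        return sorted(a)  # stand-in for the single-processor base case

    # Take a regular sample and pick evenly spaced pivots from it.
    step = max(1, n // (4 * fanout))
    sample = sorted(a[::step])
    pivots = sorted({sample[(j * len(sample)) // fanout]
                     for j in range(1, fanout)})

    # Distribute: bucket j holds items x with pivots[j-1] < x <= pivots[j].
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect_left(pivots, x)].append(x)

    # Degenerate split (for example, all keys equal): sort directly.
    if max(len(b) for b in buckets) == n:
        return sorted(a)

    # Recursively sort the buckets and concatenate them in pivot order.
    out = []
    for b in buckets:
        out.extend(distribution_sort(b, fanout, base))
    return out
```

<p>Because the buckets cover disjoint, increasing key ranges, concatenating the sorted buckets yields a sorted list without a separate merge step.</p>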
<h3>Other PEM algorithms</h3>
<table><tbody><tr><th>PEM Algorithm</th><th>I/O complexity</th><th>Constraints</th></tr>
<tr><th><a href="/facts/Merge_sort/A5jLYfzS">Mergesort</a><a class="footnote-ref" id="fnref:12" href="#fn:12"><sup>12</sup></a></th><td>\(O\left(\frac{N}{PB}\log _{\frac{M}{B}}\frac{N}{B}\right)=\textrm{sort}_{P}(N)\)</td><td>\(P\leq \frac{N}{B^{2}},\ M=B^{O(1)}\)</td></tr>
<tr><th><a href="/facts/List_ranking/nlFHGmPy">List ranking</a><a class="footnote-ref" id="fnref:13" href="#fn:13"><sup>13</sup></a></th><td>\(O\left(\textrm{sort}_{P}(N)\right)\)</td><td>\(P\leq \frac{N/B^{2}}{\log B\cdot \log ^{O(1)}N},\ M=B^{O(1)}\)</td></tr>
<tr><th><a href="/facts/Euler_tour/4l2G67c7">Euler tour</a><a class="footnote-ref" id="fnref:14" href="#fn:14"><sup>14</sup></a></th><td>\(O\left(\textrm{sort}_{P}(N)\right)\)</td><td>\(P\leq \frac{N}{B^{2}},\ M=B^{O(1)}\)</td></tr>
<tr><th><a href="/facts/Expression_tree/HqbKYU43">Expression tree</a> evaluation<a class="footnote-ref" id="fnref:15" href="#fn:15"><sup>15</sup></a></th><td>\(O\left(\textrm{sort}_{P}(N)\right)\)</td><td>\(P\leq \frac{N}{B^{2}\log B\cdot \log ^{O(1)}N},\ M=B^{O(1)}\)</td></tr>
<tr><th>Finding an <a href="/facts/Minimum_spanning_tree/kgkGmY9s">MST</a><a class="footnote-ref" id="fnref:16" href="#fn:16"><sup>16</sup></a></th><td>\(O\left(\textrm{sort}_{P}(|V|)+\textrm{sort}_{P}(|E|)\log \tfrac{|V|}{PB}\right)\)</td><td>\(P\leq \frac{|V|+|E|}{B^{2}\log B\cdot \log ^{O(1)}N},\ M=B^{O(1)}\)</td></tr></tbody></table>
<p>Here \(\textrm{sort}_{P}(N)\) denotes the I/O complexity of sorting \(N\) items with \(P\) processors in the PEM model.
</p>
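<p>The bounds in the table can be evaluated numerically to get a feel for how they scale. The following is a minimal sketch, not a definitive implementation. It assumes the PEM sorting bound \(\textrm{sort}_{P}(N) = O\left(\tfrac{N}{PB} \log_{M/B} \tfrac{N}{B}\right)\) from Arge et al. (2008), drops all constant factors, and takes the extra logarithm in the MST bound to base 2. The function names <code>sort_pem</code> and <code>mst_pem</code> are hypothetical helpers introduced only for this illustration.</p>

```python
import math


def sort_pem(n, p, m, b):
    """Assumed sorting bound (N / (P*B)) * log_{M/B}(N / B), constants dropped."""
    # Clamp the logarithm at 1 so that tiny inputs still cost one pass.
    passes = max(1.0, math.log(n / b, m / b))
    return n / (p * b) * passes


def mst_pem(v, e, p, m, b):
    """MST bound from the table: sort_P(|V|) + sort_P(|E|) * log(|V| / (P*B))."""
    # Base-2 logarithm is an assumption; the base only changes a constant factor.
    rounds = max(1.0, math.log2(v / (p * b)))
    return sort_pem(v, p, m, b) + sort_pem(e, p, m, b) * rounds


if __name__ == "__main__":
    # Illustrative parameters only: N items, P processors, cache size M, block size B.
    n, p, m, b = 2**20, 4, 2**10, 2**4
    print("sort_P(N) estimate:", sort_pem(n, p, m, b))
    print("MST estimate:", mst_pem(2**16, 2**20, p, m, b))
```

<p>Because the number of merge passes does not depend on \(P\), doubling the number of processors halves the sorting estimate, as long as the constraint on \(P\) in the table's last column still holds.</p>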
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Parallel_random-access_machine/G1pCFgWP">Parallel random-access machine</a> (PRAM)</li>
<li><a href="/facts/Random-access_machine/9NrNtVSd">Random-access machine</a> (RAM)</li>
<li><a href="/facts/External_memory_algorithm/sCcj8M8o">External memory</a> (EM)</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
<li id="fn:6"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:6" class="footnote-back-ref">↩</a></p></li>
<li id="fn:7"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:7" class="footnote-back-ref">↩</a></p></li>
<li id="fn:8"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:8" class="footnote-back-ref">↩</a></p></li>
<li id="fn:9"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:9" class="footnote-back-ref">↩</a></p></li>
<li id="fn:10"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:10" class="footnote-back-ref">↩</a></p></li>
<li id="fn:11"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:11" class="footnote-back-ref">↩</a></p></li>
<li id="fn:12"><p>Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041. <a href="#fnref:12" class="footnote-back-ref">↩</a></p></li>
<li id="fn:13"><p>Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel &amp; Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572. <a href="#fnref:13" class="footnote-back-ref">↩</a></p></li>
<li id="fn:14"><p>Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel &amp; Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572. <a href="#fnref:14" class="footnote-back-ref">↩</a></p></li>
<li id="fn:15"><p>Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel &amp; Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572. <a href="#fnref:15" class="footnote-back-ref">↩</a></p></li>
<li id="fn:16"><p>Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel &amp; Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572. <a href="#fnref:16" class="footnote-back-ref">↩</a></p></li>
</ol>
