Sort-merge join

<h2 id="complexity">Complexity</h2>
<p>Let \(R\) and \(S\) be relations where \(|R| &lt; |S|\). Suppose \(R\) fits in \(P_{r}\) pages of memory and \(S\) fits in \(P_{s}\) pages of memory. If both relations are already sorted, a sort-merge join will run in \(O(P_{r}+P_{s})\) I/O operations in the worst case. If \(R\) and \(S\) are not ordered, the worst-case time cost contains additional sorting terms: \(O(P_{r}+P_{s}+P_{r}\log(P_{r})+P_{s}\log(P_{s}))\), which equals \(O(P_{r}\log(P_{r})+P_{s}\log(P_{s}))\) (as <a href="/facts/Linearithmic_time/77T62gmf">linearithmic</a> terms outweigh the linear terms, see <a href="/facts/Big_O_notation/weFFjSWg">Big O notation – Orders of common functions</a>).
</p>
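<p>The step that absorbs the linear terms can be written out explicitly. The following is a short sketch, assuming base-2 logarithms and \(P_{r}, P_{s} \ge 2\), so that \(\log P \ge 1\) and therefore \(P \le P\log P\) for both page counts:
</p>

```latex
\begin{align*}
P_{r} + P_{s} + P_{r}\log(P_{r}) + P_{s}\log(P_{s})
  &\le P_{r}\log(P_{r}) + P_{s}\log(P_{s}) + P_{r}\log(P_{r}) + P_{s}\log(P_{s}) \\
  &= 2\,\bigl(P_{r}\log(P_{r}) + P_{s}\log(P_{s})\bigr)
\end{align*}
```

<p>The constant factor 2 is dropped in Big O notation, which gives the simplified bound.
</p>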
<h2 id="pseudocode">Pseudocode</h2>
<p>For simplicity, the algorithm is described for an <a href="/facts/Join_(SQL)/Cxl5lxcR">inner join</a> of two relations <i>left</i> and <i>right</i>. Generalization to other join types is straightforward. The output contains only rows that have a match in both the <i>left</i> and the <i>right</i> relation; when several rows on each side share the same join key, the matching rows form a <a href="/facts/Cartesian_product/BfqBWzdL">Cartesian product</a>.
</p>
function sortMergeJoin(left: Relation, right: Relation, comparator: Comparator): Relation {
    result = new Relation()
    
    // Ensure that at least one element is present
    if (!left.hasNext() || !right.hasNext()) {
        return result
    }
    
    // Sort left and right relation with comparator
    left.sort(comparator)
    right.sort(comparator)
    
    // Start Merge Join algorithm
    leftRow = left.next()
    rightRow = right.next()
    
    outerForeverLoop:
    while (true) {
        while (comparator.compare(leftRow, rightRow) != 0) {
            if (comparator.compare(leftRow, rightRow) < 0) {
                // Left row is less than right row
                if (left.hasNext()) {
                    // Advance to next left row
                    leftRow = left.next()
                } else {
                    break outerForeverLoop
                }
            } else {
                // Left row is greater than right row
                if (right.hasNext()) {
                    // Advance to next right row
                    rightRow = right.next()
                } else {
                    break outerForeverLoop
                }
            }
        }
        
        // Mark position of left row and keep copy of current left row
        left.mark()
        markedLeftRow = leftRow
        
        while (true) {
            while (comparator.compare(leftRow, rightRow) == 0) {
                // Left row and right row are equal
                // Add rows to result
                result.add(leftRow, rightRow)
                
                // Advance to next left row
                leftRow = left.next()
                
                // Check if left row exists
                if (!leftRow) {
                    // Continue with inner forever loop
                    break
                }
            }
            
            if (right.hasNext()) {
                // Advance to next right row
                rightRow = right.next()
            } else {
                break outerForeverLoop
            }
            
            if (comparator.compare(markedLeftRow, rightRow) == 0) {
                // Restore left to stored mark
                left.restoreMark()
                leftRow = markedLeftRow
            } else {
                // Check if left row exists
                if (!leftRow) {
                    break outerForeverLoop
                } else {
                    // Continue with outer forever loop
                    break
                }
            }
        }
    }
    
    return result
}
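
<p>As a runnable illustration of the same idea, here is a minimal Python sketch. It is not a line-by-line translation of the pseudocode: instead of <code>mark()</code> and <code>restoreMark()</code> it works on in-memory lists and finds the run of equal keys on each side by index, then emits the Cartesian product of the two runs. The function name <code>sort_merge_join</code>, the <code>key</code> parameter, and the sample data are assumptions made for this example.
</p>

```python
from typing import Any, Callable, List, Sequence, Tuple


def sort_merge_join(
    left: Sequence[Any],
    right: Sequence[Any],
    key: Callable[[Any], Any],
) -> List[Tuple[Any, Any]]:
    """Inner join of two row sequences on key(row); keys must be orderable."""
    # Sort phase: both inputs are ordered by the join key.
    left_sorted = sorted(left, key=key)
    right_sorted = sorted(right, key=key)

    result: List[Tuple[Any, Any]] = []
    i = j = 0
    # Merge phase: advance whichever side currently has the smaller key.
    while i < len(left_sorted) and j < len(right_sorted):
        left_key = key(left_sorted[i])
        right_key = key(right_sorted[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left_sorted) and key(left_sorted[i_end]) == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and key(right_sorted[j_end]) == left_key:
                j_end += 1
            # Emit the Cartesian product of the two runs.
            for left_row in left_sorted[i:i_end]:
                for right_row in right_sorted[j:j_end]:
                    result.append((left_row, right_row))
            i, j = i_end, j_end
    return result


# Hypothetical sample data: (department_id, name) rows on both sides.
employees = [(2, "Bo"), (1, "Al"), (2, "Cy")]
departments = [(2, "Ops"), (3, "Lab"), (2, "Dev")]
pairs = sort_merge_join(employees, departments, key=lambda row: row[0])
```

<p>With this sample data, key 2 occurs twice on each side, so <code>pairs</code> holds the four combinations of those rows, while keys 1 and 3 have no partner and produce no output.
</p>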

<p>Since the comparison logic is not the central aspect of this algorithm, it is hidden behind a generic comparator, which may combine several comparison criteria (e.g. multiple columns). The compare function should indicate whether a row is <i>less than</i> (-1), <i>equal to</i> (0), or <i>greater than</i> (1) another row:
</p>
function compare(leftRow: RelationRow, rightRow: RelationRow): number {
	// Return -1 if leftRow is less than rightRow
	// Return 0 if leftRow is equal to rightRow
	// Return 1 if leftRow is greater than rightRow
}
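
<p>One possible way to fill in such a comparator for several columns is sketched below in Python. The row layout <code>(last_name, birth_year)</code> and the sample rows are hypothetical; the comparator walks the columns in order and stops at the first one that differs. <code>functools.cmp_to_key</code> from the standard library adapts it for use with <code>sorted</code>.
</p>

```python
from functools import cmp_to_key
from typing import Tuple

# Hypothetical row layout used only for this example: (last_name, birth_year)
Row = Tuple[str, int]


def compare(left_row: Row, right_row: Row) -> int:
    """Return -1, 0 or 1, comparing by last_name first, then by birth_year."""
    for left_value, right_value in zip(left_row, right_row):
        if left_value < right_value:
            return -1
        if left_value > right_value:
            return 1
    # Every compared column was equal.
    return 0


rows = [("Smith", 1990), ("Jones", 1985), ("Smith", 1970)]
rows_sorted = sorted(rows, key=cmp_to_key(compare))
```

<p>Here the two "Smith" rows tie on the first column, so their order is decided by the second column.
</p>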

<p>In this pseudocode, a relation is assumed to support the following basic operations:
</p>
interface Relation {
    // Returns true if relation has a next row (otherwise false)
    hasNext(): boolean
    
    // Returns the next row of the relation, or a null value if there is none
    next(): RelationRow
    
    // Sorts the relation with the given comparator
    sort(comparator: Comparator): void
    
    // Marks the current row index
    mark(): void
    
    // Restores the current row index to the marked row index
    restoreMark(): void
}
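
<p>A minimal in-memory version of this interface could look like the following Python sketch. The class name <code>ListRelation</code> and the snake_case method names are assumptions for this example. The mark is interpreted here as the cursor position, that is, the index of the row that <code>next()</code> would return next, which matches how the pseudocode keeps its own copy of the marked row.
</p>

```python
from functools import cmp_to_key
from typing import Any, Callable, List, Optional


class ListRelation:
    """In-memory relation backed by a list (hypothetical stand-in for Relation)."""

    def __init__(self, rows: List[Any]) -> None:
        self._rows = list(rows)
        self._index = 0  # index of the row that next() will return
        self._mark = 0   # cursor position saved by mark()

    def has_next(self) -> bool:
        return self._index < len(self._rows)

    def next(self) -> Optional[Any]:
        # Return None when the relation is exhausted.
        if not self.has_next():
            return None
        row = self._rows[self._index]
        self._index += 1
        return row

    def sort(self, comparator: Callable[[Any, Any], int]) -> None:
        # Sorting invalidates the cursor and the mark, so both are reset.
        self._rows.sort(key=cmp_to_key(comparator))
        self._index = 0
        self._mark = 0

    def mark(self) -> None:
        self._mark = self._index

    def restore_mark(self) -> None:
        self._index = self._mark


relation = ListRelation([3, 1, 2])
relation.sort(lambda a, b: (a > b) - (a < b))
first = relation.next()
relation.mark()
second = relation.next()
relation.restore_mark()
replayed = relation.next()  # same row as `second`, read again after the restore
```

<p>Restoring the mark lets the join re-read a run of equal rows on one side for each matching row on the other side.
</p>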

<h2 id="simple-c-implementation">Simple C# implementation</h2>
<p>Note that this implementation assumes the join attributes are unique, i.e., there is no need to output multiple tuples for a given value of the key.
</p>
using System;
using System.Collections.Generic;

public class MergeJoin
{
    // Assume that left and right are already sorted
    public static Relation Merge(Relation left, Relation right)
    {
        Relation output = new Relation();
        while (!left.IsPastEnd && !right.IsPastEnd)
        {
            if (left.Key == right.Key)
            {
                output.Add(left.Key);
                left.Advance();
                right.Advance();
            }
            else if (left.Key < right.Key)
                left.Advance();
            else // if (left.Key > right.Key)
                right.Advance();
        }
        return output;
    }
}
 
public class Relation
{
    private const int ENDPOS = -1;
    private List<int> list;
    private int position = 0;

    public Relation()
    {
        this.list = new List<int>();
    }

    public Relation(List<int> list)
    {
        this.list = list;
    }

    public int Position => position;

    // Key at the current position; only valid while IsPastEnd is false
    public int Key => list[position];

    // True once the cursor has moved past the last key, or if the relation is empty
    public bool IsPastEnd => position == ENDPOS || list.Count == 0;

    // Moves to the next key; returns false if there is none
    public bool Advance()
    {
        if (position == list.Count - 1 || position == ENDPOS)
        {
            position = ENDPOS;
            return false;
        }
        position++;
        return true;
    }

    public void Add(int key)
    {
        list.Add(key);
    }

    public void Print()
    {
        foreach (int key in list)
            Console.WriteLine(key);
    }
}

<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Hash_join/WryEJk9d">Hash join</a></li>
<li><a href="/facts/Nested_loop_join/LV246CsC">Nested loop join</a></li></ul>

<h2 id="external-links">External links</h2>
<p><a href="http://www.necessaryandsufficient.net/2010/02/join-algorithms-illustrated/">C# Implementations of Various Join Algorithms</a>
</p>

