Union find

The union-find, or disjoint-set, data structure is used to model a collection of non-overlapping sets of elements. The three supported operations are as follows.

  1. make_set(x)

    Adds a single element x that is a member of a new singleton set containing x.

  2. find(x)

    Finds the set that x is a part of. This function returns a representative element of that set.

  3. union(x, y)

    Merges the sets containing x and y into a single set.

Note that with our definition of find we can determine if x and y belong to the same set by comparing their representative elements, i.e. find(x) == find(y).

The union find data structure can be used to model the connected components of a graph. This is useful, for example, when implementing Kruskal’s algorithm for computing minimum spanning trees.

Linked list implementation

A straightforward implementation of the union-find data structure represents each set with a linked list of elements. The representative of a set is the head of the list. We could implement the operations as follows. Here we assume all elements are pointers to a node that have a parent, child, and head pointers pointing to the previous, next, and first elements of the list, if any.

MAKESET(x)
  create a new linked list containing x

FIND(x)
  if x has a head
    return x.head
  else
    return x

UNION(x, y)
  while x has child
    x = x.child

  rootY = FIND(Y)
  rootY.parent = x
  rootY.head = x.head

  while rootY has child
    rootY.child.head = x.head
    rootY = rootY.child

Here we use the head element of a list as its representative, and we append linked lists together to merge them.

Note that this implementation has the following runtime characteristics, where throughout \(n\) is the total number of elements being tracked.

  1. make_set(x) is \(O(1)\) time
  2. find(x) is \(O(1)\) time
  3. union(x, y) is \(O(n)\) time

If we always append the smaller list to the larger list, then performing \(m\) MAKESET, FIND, or UNION operations on \(n\) elements takes \(O(m + n \log n)\) time. This helps because we only need to update the head pointer on the smaller of the lists, and so we always have to update at most half of a set.

Disjoint forests structure

We can improve the runtime by using a more interesting data structure, along with some optimizations. We now represent each set as a rooted tree. The representative of a set is the root of the tree. MAKESET is implemented by creating a tree with a single root node. FIND returns the root of the tree, found by traversing the tree upwards until a node is found with no parent. UNION is implemented by attaching the root of one tree as the child of the root of another.

Union by rank

The first optimization is similar to the one we employed with linked lists, i.e. always appending to the larger list. Here we always merge the smaller tree into the larger tree. Note that this increases the depth of the tree being merged into only if the height of the two trees were equal beforehand. Both UNION and FIND can be implemented in \(O(\log n)\) time with this optimization.

Path compression

In path compression, whenever we traverse the tree for a FIND operation we attach all nodes traversed in getting to the root directly to the root; this maintains the validity of the structure, since the root is the representative for all these nodes. The intuition is that any subsequent calls to FIND on the same node or any recursive parent of the node will be faster.

With both union-by-rank and path compression, the amortized time per FIND or UNION operation is \(O(\alpha(n))\), where \(\alpha(n)\) is defined as the inverse of \(A(n, n)\), where

\[\begin{split}A(m, n) = \begin{cases} n + 1, & m = 0 \\ A(m - 1, 1), & n = 0 \\ A(m - 1, A(m, n - 1)), & \text{otherwise} \end{cases}\end{split}\]

is the Ackermann function. It can be safely assumed that \(\alpha(n) \le 5\) for any real-world inputs, since the Ackermann function is increasing in each of its arguments and \(A(4, 4) = 2^{2^{2^{65536}}} - 3\).