Sets and Associative Arrays

Data structures can act as containers, keeping track of many elements at once. Allowing the ability to perform actions on any element within the data structure requires an efficient implementation. Below, we discuss two prevalent examples of such data structures: sets and associative arrays.

Sets

Sets are data structures that implement operations from their mathematical definition: they contain unique elements in no particular order. The following are routines common to set implementations.

  • insert(a)

    Inserts element a into the set.

  • has(a)

    Returns True if the set contains a, and False otherwise.

  • remove(a)

    Removes element a from the set.

  • union(A,B)

    Returns a set that contains elements either in set A or set B.

  • intersection(A,B)

    Returns a set that contains elements in both set A and set B.

  • set_minus(A,B)

    Returns a set that contains elements in set A that are not in set B.

Binary search tree implementation

If there is an ordering among elements, then one of the best implementations for a set is a binary tree. This means that insert, has and remove would all take \(O(\log n)\) time, which is very efficient (typically, it is difficult for one of these operations to be performed in constant time without having another require linear time).

The last three functions are operations performed on sets themselves. Because binary trees can be merged in an efficient manner, these operations can be done very efficiently. For example, it is relatively simple to generate an ordered array of elements in the union by performing a merge operation on the in-order tree traversals of both trees. From there, initializing a new binary search tree from a sorted array takes linear time. All three operations can be done in \(O(m+n)\) time using a similar technique, where \(m\) and \(n\) indicate the sizes of the two sets.

Associative arrays

Sometimes, there are elements that are expressed as key-value pairs. A key is a unique identifier for the element. If the elements being added are complex objects, the key is used to search for the element inside a container data structure. This is the premise behind associative arrays. The following are the operations that an associative array would efficiently perform.

  • insert(key,val)

    Inserts an element using key as a key and val as the implicit value.

  • get(key)

    Returns the value associated with key. Throws an exception if there is no key-value pair in the associative array.

  • remove(key)

    Removes the associated key-value pair from the associative array. Throws an exception if no such pair exists.

Hash table implementation

One of the most common ways to implement this is by using a hash table. Some hash function so that keys are mapped to integers called table indices. Usually, the domain of possible keys is larger than the range of possible indices, so there would be entire sets of keys that would map to the same index. The index corresponds to some location in memory with a smaller dynamic array.

Suppose we want to store the key-value pair ('Steele','red'). We find the correct index by evaluating hash_function('Steele'), and then append the key-value pair to the end of the smaller array. The functions get('Steele') and remove('Steele') would work in a similar way.

Analyzing the complexity of this implementation requires some nuance. At the end of the day, we are performing search and remove operations on an unordered array, which takes a linear amount of time with respect to its size. However, the goal is to choose a hash function so that these hash collisions are minimal, making the smaller arrays sufficiently small. In the worst-case scenario, the added key-value pairs have keys that all map to the same index. Therefore, the worst case scenario has get(key) and remove(key) taking \(O(n)\) time.

But this is a pessimistic analysis. The hash functions are strategically chosen to minimize the likelihood of this occurrence. In the average-case scenario, the lengths of the smaller arrays are independent of the total number of key-value pairs in the hash table. Therefore, in an average-case setting, all of the desired operations take \(O(1)\) time.

Binary search tree implementation

If the keys can be ordered, another approach is to use a self-balancing binary search tree. There, each key-value pair would be a node in the tree. The worst-case scenario analysis indicates that all three desired operations would take \(O(\log n)\) time (this is true for the average-case scenario as well).