2012/10/07

A literate union-find data structure

I have a working closure conversion done using the purely functional CPS data structure presented in my earlier post, but it is somewhat hackish. Thus I am trying to improve it, following Andrew Kennedy's excellent paper "Compiling with continuations, continued".

Kennedy's CPS structure requires a union-find data structure, used to merge the occurences of a variable, and find the binding site of an occurence. I already had a union-find data structure, used to implement first-order unification in type inference, but as is usual a module becomes good on the second time you write it (when you have more experience about its implementation and usage).

There are many possible variations in the interface of a union-find module. The particularities of this one is explicit support for attaching description to sets, and a "partition" type separate from the "element" type. Also, the interface is functorial, and I provided two versions: a Safe one that checks that usage of the structure is correct, and a Fast one with no check. It is easy to shoot yourself in the foot by using this module incorrectly, so using the Safe one is probably a better bet.

So here they are.

Interface for module Union_find

1.  A union-find data structure maintains a partition of elements into disjoint sets.

It allows to add new elements in new partitions, perform the union of two partitions, and retrieve the partition in which is an element. Moreover it allows to attach a description to a partition, which is generally the point of using such a structure.

This module has side effects: adding an element to a union-find data structure changes that element, and the union operation merges the partitions destructively. This make it easy to use this module incorrectly. To that end, a number of protections (using types and dynamic checks) are set that detect such incorrect uses of the module.

Note on the name: there are other data structures that maintain disjoint sets with other operations, such as partition refinement, so "union-find" is a more accurate name for this data structure than "disjoint set".

module type S =
sig

   The t type represents the whole union-find data structure. A partition always belong to some t; elements belong to a t once there has been a "singleton" operation on them.

All the functions (except create) take a t argument; in their safe version this argument is used to checks that other element and partition arguments indeed belong to the t argument.

   type t
   type partition
   type element
   type description

   create() returns a new empty union find data structure.
   val create : unit → t

   singleton t e d adds a new element e to t, and create and returns a new partition p in t, such that e is the only element of p. It also attach the description d to p.

The safe version checks that e was not previously added to another union-find data structure (with the same link).

   val singleton : t → element → description → partition

   find t e returns the partition p of t that contains e.
   val find : t → element → partition

   union t p1 p2 d creates a new partition p3, with description d, that contains the union of all the elements in p1 and p2. The p1 and p2 arguments are consumed, i.e. must not be used after they were passed to union. p1 and p2 must be different partitions.
   val union : t → partition → partition → description → partition

   description t p returns the description associated to p.
   val description : t → partition → description

   description t p changes the description associated to p.
   val set_description : t → partition → description → unit
end
2.  We defined two types of "union-find makers": Fast and Safe. Both propose a link type, and each element of a union-find structure must be "associated" to one different link (generally the link is a mutable field in the element type). Initially, the link value is empty_link.

The Make functor, once told how access the link of an element, returns a module complying to S. Below we given an exemple of usage.

Note: It is possible for an element to be present in two different union-find data structures; it must just have different links.

If the link in an element must be re-used for another union-find data structure, then it must be set to empty_link, and one must stop using the union-find data structure that contained the element (even with other elements).

module type UNION_FIND = sig
   type (α, β) link
   val empty_link:(α,β) link

   module type LINK =
   sig
     type element
     type description
     val get : element → (description, element) link
     val set : element → (description, element) link → unit
   end

   module Make(Link : LINK):S with type description = Link.description and type element = Link.element
end

The difference between the fast and safe version is that safe performs additional checks. The performance difference is small, so the Safe version should be prefered.
module Fast:UNION_FIND

module Safe:UNION_FIND

3.  Exemple of usage:

type test = { x:intmutable z:(string, test) Union_find.Safe.link };;

module Test = struct  type description = string  type element = test  let get_link t = t.z  let set_link t z = t.z ← z  end

module A = Union_find.Safe.Make(Test)

let uf = A.create() in  let elt1 = {x=1; z=Safe.empty_link} in  let part1 = A.singleton t elt1 "1" in  assert(A.description t (A.find t elt1) = "1")

Module Union_find

1.  We represent each disjoint set by a tree : elements are in the same set than the element that they point to.

The root of the tree is the representative of the set, and corresponds to elements of type partition. It points to a "partition descriptor".

type (α,β) baselink = 
   ∣ Partition_descriptor of α partition_descriptor
   ∣ Parent of β

The partition descriptor contains the user-accessible description, and a rank, used to optimize the union operation.

Note that the partition descriptor is not accessible by the users of the module, and the interface make it so that there can be only one link to the partition descriptor (from the representative). This allows to update the partition descriptor destructively.

and α partition_descriptor = { mutable rank:rank; mutable desc:α }

The rank of a partition is is a majorant of the distance of its elements to the root (path compression makes so that the height of the tree can be lower than the rank). The union operation minimizes the rank, and thus the height of the tree.
and rank = int
2.  The implementation is parametrized by the safety checks that we perform (which differs between the Fast and Safe modules).

The safe module identifies all union-find data structures by a unique id, embed that in the links, and checks for all operation that they are equal. It also checks initialization of the link.

module type SAFETY = sig
   type t
   val create: unit → t
   type (α,β) link

   (∗ Create a safe link from a baselink. ∗)
   val securize: t → (α,β) baselink → (α,β) link

   (∗ Returns the base_link from the safe link. ∗)
   val get_base: (α,β) link → (α,β) baselink

   (∗ Check the safe link withat the element (and the safe link) belong to t. ∗)
   val check_membership: t → (α,β) link → unit

   (∗ Check that the element is not yet part of any union find. ∗)
   val check_unused: (α,β) link → unit

   (∗ Initial link. ∗)
   val empty_link: (α,β) link
end

module No_safety:SAFETY = struct
   type t = unit
   let create() = ()
   type (α,β) link = (α,β) baselink

   let securize () l = l

   let check_membership () l = ()

   let check_unused l = ()

   let get_base l = l

   (∗ Note: This cast can make the execution fail without notice. ∗)
   let empty_link = Obj.magic 0 
end

type unique = int

module Unique = Unique.Make(struct end)

module Safety:SAFETY = struct
   type t = Unique.t
   let create() = Unique.fresh()

   type (α,β) link = t option × (α,β) baselink

   let securize u l = (Some u,l)

   let check_membership t (u,_) = 
     (match u with
       ∣ Some(a) → assert (t ≡ a) (∗ The element is in another union-find structure. ∗)
       ∣ None → assert false); () (∗ The element is in no union-find structure. ∗)

   let check_unused (u,_) = 
     (match u with
       ∣ Some(_) → assert false (∗ The element is already in a union-find structure. ∗)
       ∣ None → ())

   let get_base (_,l) = l

   (∗ Note: The cast is not dangerous, because the left-hand part is checked first. ∗)
   let empty_link = (NoneObj.magic 0) 
end

3.  The goal of the below "double functor" is to produce a module with the following signature. In it, partition and element are actually the same underlying type; the difference is that elements returned with type partition are the root of the tree). Hiding this in the interface provides some guarantee that arguments of type partition are the representative of their partition.

Unfortunately, after calling union on two partitions p1 and p2, one of them will stop being the root; that is why the partition arguments of union must not be re-used. Thus, defining the partition type only guarantees that the argument has been a root in the past, and we ensure that by a dynamic test.

module type S = sig
       type t
       type partition
       type element
       type description
       val create: unit → t
       val singleton : t → element → description → partition
       val find : t → element → partition
       val union: t → partition → partition → description → partition
       val description: t → partition → description
       val set_description : t → partition → description → unit
end

This is a double functor with two arguments; Saf allows to differenciate the "Fast" and "Safe" modules, while Link is used to find and change the link.
module Make(Saf:SAFETY):UNION_FIND = 
struct

   type (α,β) link = (α,β) Saf.link

   let empty_link = Saf.empty_link

   module type LINK = sig
     type element
     type description
     val get: element → (description, element) link 
     val set: element → (description, element) link → unit
   end

   module Make(LinkLINK) =
   struct
     type t = Saf.t
     type element = Link.element
     type description = Link.description
     type partition = Link.element

     let create = Saf.create

4.  singleton is the only way to add new elements to the union-find structure, and is the place where we check that the element is not part of another structure.
     let singleton t elt desc = 
       let l = (Link.get elt) in
       Saf.check_unused l;
       Link.set elt (Saf.securize t (Partition_descriptor {rank=0;desc=desc}));
       elt 
5.  Basically, find just walks the tree until it finds the root.

But performance is increased if the length of the path is diminished: traversed nodes are linked to nodes that are closer to the roof. The possibility we have implemented is path compression: when the root is found, the elements are changed to link to the it, so that subsequent calls are faster. We implemented a tail-recursive version of this algorithm (which still requires two pass).

Note: there are alternatives to path compression, such that halving; but in Tarjan’s structure the root is linked to itself, which is not the case here, so halving would require more checks than in Tarjan’s version. Thus we stick with path compression.

Note: we could perform a lighter check in the safe version by checking only the argument, and not all recursive calls; this is probably not worth implementing it, and the heavy check has its uses.

     let find t x = 
       (∗ Tail-recursive function to find the root of the algorithm. ∗)
       let rec find x = 
         let l = (Link.get x) in
         Saf.check_membership t l;
         match Saf.get_base l with
           ∣ Partition_descriptor(s) → x
           ∣ Parent(y) → find y in
       (∗ This is also tail-recursive, but we do not perform the checks the second time. ∗)
       let rec compress x r = 
         let l = (Link.get x) in
         match Saf.get_base l with
           ∣ Partition_descriptor(s) → ()
           ∣ Parent(y) → Link.set x (Saf.securize t (Parent r)) in
       let root = find x in
       compress x root;
       root
6.  The following functions work only when the given element is the root of a partition, but check that.
     let get_partition_descriptor t p = 
       let l = (Link.get p) in
       Saf.check_membership t l;
       match Saf.get_base l with
         ∣ Partition_descriptor(s) → s
         ∣ _ → assert(false(∗ The element is not a partition. ∗)

     let description t x = (get_partition_descriptor t x).desc

     let set_description t x desc = 
       let pd = get_partition_descriptor t x in
       pd.desc ← desc

7.  This function performs the union of two partitions. We use rank to find which should be the root : we attach the smaller tree to the root of the larger tree, so as not to increase the maximum height (i.e. path length) of the resulting tree.

The last argument allows to update the set descriptor along with this operation.

Note that this function takes partitions as argument; one could have instead taken any element, and performed the find inside the function; in particular some efficient algorithms interleave the find and union operations. The reason why we take partition arguments is that it avoids a find when we know that the argument is a partition (for instance when merging with a just-created singleton), and the user needs to perform a find to retrieve and merge the description in the algorithms we use (such as unification).

     let union t p1 p2 newdesc =

       This function also checks that p1 and p2 are partitions.
       let d1 = get_partition_descriptor t p1 in
       let d2 = get_partition_descriptor t p2 in

       Alternatively, the check that p1 and p2 are different could have been done here.
       assert (p1 ≠ p2);
       if( d1.rank < d2.rank) then 
         begin 
           (∗ Keep d2_repr as root. Height of the merge is max(d1_height +1, d2_height) so does not change. ∗)
           Link.set p1 (Saf.securize t (Parent p2));
           d2.desc ← newdesc;
           p2
         end
       else if (d1.rank > d2.rank) then
         begin 
           (∗ Keep d1_repr as root. Height of the merge is max(d2_height +1, d1_height) so does not change. ∗)
           Link.set p2 (Saf.securize t (Parent p1));
           d1.desc ← newdesc;
           p1
         end
       else 
         begin
           (∗ We choose arbitrarily p1 to be the root. The height may have changed, as all elements in the subset with root p2 are 1 step further to the root. ∗)
           Link.set p2 (Saf.securize t (Parent p1));
           d1.rank ← d1.rank + 1; d1.desc ← newdesc;
           p1
         end
   end

end

8.  The double-functor is not shown in the exposed interface, and we only export the following, simpler modules.
module Fast=Make(No_safety)

module Safe=Make(Safety)

9.  For a survey of the implementations of union-find algorithms, one should read "Worst-Case Analysis of Set Union Algorithms", by Tarjan and Van Leeuwen.

Recent performance comparison of these algorithms (and modern enhancements) can be found in "Experiments on Union-Find Algorithms for the Disjoint-Set Data Structure", by Md. Mostofa Ali Patwary, Jean Blair, Fredrik Manne.