Hilbert Curve in Liquid Clustering

A color-coded Hilbert curve of order 6 going from the top-left corner to the top-right corner

There are so many amazingly beautiful applications of mathematics in technology that I sometimes wonder how anyone ever thought of them. For example, instead of using long antennas for cell phones, we use small fractal-inspired antennas. Another beautiful application is the use of the Hilbert Curve in Databricks Liquid Clustering.

There is a wonderful video from 3Blue1Brown that explains the Hilbert Curve pretty well:

Unfortunately, the video ties it to a rather contrived application, which downplays its very real usefulness for data.

The Hilbert Curve is actually a well-defined continuous function H from the unit interval to the unit cube in n-dimensional space. We will stick with the 2-dimensional unit square in this post for simplicity. It turns out that H is an onto (surjective) function, which is why it is space-filling, and that it preserves locality: points that are close in the unit interval remain close in the unit square. This is because the function H is uniformly continuous. The proofs of these claims are not difficult, but they are definitely not trivial; they are probably standard exercises for beginning graduate students.
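The locality claim can be made quantitative. At stage n, the interval [0,1] is cut into 4^n subintervals, each of which is mapped into a dyadic square of side 2^{-n}. Points s, t in the same (or adjacent) subintervals therefore land in the same (or adjacent) cells, and comparing the two scales, 4^{-n} on the interval versus 2^{-n} = \sqrt{4^{-n}} on the square, yields a Hölder-type estimate (the constant C is a crude bound from counting adjacent cells, not an optimal one):

\displaystyle |s - t| \le 4^{-n} \implies \|H(s) - H(t)\| \le C \cdot 2^{-n} = C \sqrt{4^{-n}}

which, letting n vary, gives \|H(s) - H(t)\| \le C\sqrt{|s-t|} for all s, t in [0,1].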

As we will see later in the video below, we start with the finite construction H_n and take the limit to obtain H. In the 2-dimensional case, each H_n is a curve of order n:

H_n: [0,1] \rightarrow [0,1]^2
\displaystyle H(x) = \lim_{n \rightarrow \infty} H_n(x)

Each H_n is just a bijective map between two finite sets (subintervals and cells), yet after taking the limit, H is continuous and surjective; hence it is a space-filling curve. H is uniformly continuous because [0,1] is a compact metric space, and a continuous function on a compact set is uniformly continuous. H is not injective, as injectivity would imply that [0,1] is homeomorphic to [0,1]^2. We have to be careful here: H does not inherit properties like injectivity from the H_n. However, the surjectivity of H does follow from the construction via nested cells and intervals.

In practice, such as in Liquid Clustering, we use a finite H_n to map between cells and indices. The uniform continuity of H ensures that, as n gets large, locality is preserved.
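To make the cell-to-index mapping concrete, here is a small Python sketch of the standard bit-manipulation algorithm for computing the Hilbert index of a grid cell (the version commonly given on Wikipedia). It illustrates how H_n works on a finite grid; it is not Databricks' actual implementation:

```python
def xy2d(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of 2) to its
    index along the order-log2(n) Hilbert curve."""
    d = 0
    s = n // 2  # size of the quadrant being examined
    while s > 0:
        rx = 1 if (x & s) else 0  # which half horizontally?
        ry = 1 if (y & s) else 0  # which half vertically?
        # The four quadrants are visited in the order (3*rx) ^ ry,
        # i.e. 0, 1, 2, 3; each contributes s*s cells to the index.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the coordinates so the sub-curve inside this
        # quadrant is oriented like the top-level curve.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Every cell of the 4x4 grid receives a distinct index 0..15:
indices = {(x, y): xy2d(4, x, y) for x in range(4) for y in range(4)}
assert sorted(indices.values()) == list(range(16))
```

On the 4×4 grid, the curve starts at cell (0, 0) with index 0 and ends at (3, 0) with index 15, and consecutive indices always correspond to adjacent cells, which is exactly the locality property a clustering scheme wants.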

The idea of using the Hilbert Curve to cluster data for optimized queries goes back at least 30 years, to this paper:

I. Kamel, C. Faloutsos, Hilbert R-tree: An improved R-tree using fractals, in: Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 500–509.

Once you have a Delta table, it’s very easy to turn it on:

ALTER TABLE catalog.schema.table CLUSTER BY (col1, col2);
OPTIMIZE catalog.schema.table;

Note that if you cluster by specific columns, you must run OPTIMIZE regularly yourself as the table changes. Fortunately, you can now use CLUSTER BY AUTO and everything will be taken care of for you. You can turn clustering off with CLUSTER BY NONE.

In the following video, I will test a simple query on a table with and without Liquid Clustering. The source CSV is heavily skewed and about 1 GB in size; with clustering enabled, the query time goes down tremendously. Afterwards, I will show how the Hilbert function works.
