Showing changes from revision #26 to #27:
Added | Removed | Changed
This is not yet suitable for final editing. Review comments on the forum are welcome.
Mathematicians and computer scientists often focus on different aspects of a task: in a mathematical view the primary thing is whether objects with particular properties exist while computer scientists are obsessed with methods for obtaining smaller concrete representations faster. As an example, a mathematician may focus on the fact that all symmetric matrices are guaranteed to have a full set of orthogonal eigenvectors, while a computer scientist may be focused on ways to find them given a particular matrix. This is particuarly true here, where we consider the problem of finding more compact “functions” for labelling incoming data faster than the obvious way. We’ll concentrate on the core of the real-world task, which is to incrementally maintain a mapping of randomly distributed vectors to compact labels in a way that the mapping produced is computationally efficient. By incrementally we mean that the vectors to be labelled are revealed to us in groups and at all times we have to have a mapping which will label all the values produced so far with the same labels they were given earlier.
As a concrete example of this, consider a book publisher. They produce books with arbitrarily long names, and for all sorts of purposes such as warehouse storage of newly printed copies, labelling pallets to go to bookshops, processing returned books, etc, it helps to have a compact, uniform length label to use. This label needs to be produced from the book title, but there’s a key aspect to the task here: new book titles are only created occasionally in batches (compared to the incredibly frequent operation of looking up the label corresponding to a given name) and on occasions when we know that new titles to be labelled are being produced. This sets it apart from other tasks which involve “producing labels” (such as within a programming language compiler), where we can encounter a new title at any time and which use very different techniques.
There’s an important but non-obvious point here: suppose each time we create some new book titles we send the labelling to, say, every bookshop that stocks our books on a sale-or-return basis. When they need to send some books back, they can compute the short label to be attach to each crate themselves without contacting the publisher with the original titles asking for the corresponding labels, and the shipment can be taken in and placed in the correct bins by the warehouse directly. (In actual computer applications this high degree of independence is makes it easy to use this scheme in a distributed system.)
Real-world applications for are not restricted to books but occur in many situations: consider for example the addition of new companies to a stock exchange listing, the addition of names of newly discovered species, the addition of new users to a social media site, etc. You could even look at it as the quaint task of deciding how to assign pigeon-holes as people appear. This is the core task in the area of perfect hashing, a very active field in computer science. However, for most of the remainder of this article we’ll concentrate on the core mathematical problem, with a very brief description of the larger computer science setting at the end for experts.
This also shows how mathematical discourse is changing: the development here was inspired by a blog article (((1))) that described a clever technique devised by Prasad Raghavendra for solving linear systems over prime fields differently to standard Gaussian elimination which has some attractive properties for some applications. To the best of my knowledge this has never been published in a journal, or even a preprint on the arxiv. The real core of this article is an interesting mathematical relation that holds in spaces over finite fields of characteristic-2 – and how this can be integrated into Raghavendra’s technique – which is interesting in its own right.
A couple of notes on notation: As we’ll be choosing short, snappy names for various quantities, we’ll use $a := b$ as an abbreviation for “let us denote by $a$ the value of expression $b$” (keeping $a=b$ for cases where $a$ pre-exists). Secondly given vectors $a := (a_1,\dots,a_n)$ and $b := (b_1,\dots,b_n)$ the dot product $a\cdot b$ is the usual $a_1 b_1 + \dots + a_n b_n$.
We want to progressively and incrementally associate input vectors to distinct label values in some set which (in anticipation of later) we’ll call $\mathbb{F}$. More precisely, the fundamental problem being tackled is:
We’re first given a set of random input vectors $S_a := \{a_1, \dots, a_{k_a} \}$ and must find a vector $x_a$ such that the $\alpha_i := a_i \cdot x_a$ “label values under $x_a$” for each $a_i \in S_a$ are all distinct.
We then get an set of additional random vectors $S_b := \{b_1, \dots, b_{k_b} \}$ and determine a vector $x_b$ such that we still have $a_i \cdot x_b = \alpha_i$ (again for each $a_i \in S_a$) and also that the $\beta_i := b_i \cdot x_b$ values (over all the $b_i \in S_b$) combined with the $\alpha_i$s are still all distinct.
Then we get yet another set $S_c := \{c_1, \dots, c_{k_c} \}$ and find $x_c$ maintaining $a_i \cdot x_c = \alpha_i$ and $b_i \cdot x_c = \beta_i$ and that the $\gamma_i := c_i \cdot x_c$ combined with the $\alpha_i$s and $\beta_i$s are all distinct.
Next we get a further set of vectors …
This whole process ends when we’ve received enough $S_i$s that we fail to find a new solution vector which keeps all the dot products distinct. One way to look at this is as “dotting with $x$” as specifying a mechanism that converts from the an input vector into a box location, as shown in the sketch here:
Note that a key aspect of the problem is that once we’ve associated an input vector $d$ with a label $\delta$ at some stage, at all future steps the new solution vector $x$ must still satisfy $d \cdot x = \delta$.
You should also note that the problem description refers to the task of “finding vectors” rather than saying “there exists a vector”: this highlights that we’re working in the computer science realm where we’re limited in the amount of work we’re prepared to do. Indeed, in terms of mathematical existence there’s a textbook solution if we’re prepared to wait until all the sets of vectors have been received before outputting a labelling: choose to associate each incoming vector with a consecutive label in $\mathbb{F}$ and stack all the vectors into a big matrix $A$ and concatenate the labels into a vector to get the matrix equation
(where $N$ is the number of vectors to label) which has solution
providing that matrix $A$ is of full rank, which amongst other things implies $x$ must have the same number of components as the number of random vectors we’re labelling. (We won’t go in to the solutions that still exist in some cases when $A$ isn’t of full rank here, concentrating on the alternative below.)
But as we want to keep the number of components in $x$ low, we’ll try a different tack.
Before we do that, we should observe that we’d expect this labelling task to be non-trivial; we can’t just grab a random mapping vector $x$ and expect it to work. This is for the same reason as the well known phenomenon that if you take a random group of 23 people, there’s a probability of over $50$ percent that two of them share a birthday. This is referred to as a paradox because of your intuitive guess that a subset would need to be around half the size of the number of possiblities in order to get a probability $\ge 0.5$ of a collision, not around its square root. This is illustrated in the following plot (taken from Wikipedia article (((2))), a good source of more information):
This is a particular example of “given a set $\mathbb{F}$, if you take $n$ uniformly random choices the likelihood that there are at least two identical elements is better than evens when $|\mathbb{F}| \times \dots \times (|\mathbb{F}|-n+1) \le {|\mathbb{F}|^n} / 2$”. The relevance here is that if we just construct a mapping at random when $|\mathbb{F}|=2^8$, on average we’ll only be able to label 18 elements before we have an already-used label is attached to a new vector! Since this just isn’t good enough, it looks like generating a mapping entirely at random is out.
Since our goal is to obtain labels $\alpha_i$s, $\beta_i$s, etc, all lying within a small set discrete values, we don’t really want to work over the real numbers $\mathbb{R}$ or the integers $\mathbb{Z}$, both of which are infinite. Instead we’d like to work in a finite field $\mathbb{F}$, an “alternative set of numbers” which has only a small (and finite!) number of elements. What sort of options do we have for the size of $\mathbb{F}$? It turns out the restriction of being a field means that the sizes follow a strict pattern: it was shown in the nineteenth century that a finite field must have size $p^n$ for some prime $p$ and some natural number $n \ge 1$. These fields are known as Galois fields, frustratingly denoted $GF(p^n)$ rather than the more uniform $\mathbb{F}_{p^n}$. They’re very convenient since if we pick size $2^n$ (and in practice $2^8$) it fits well with the way computer hardware is built around processing entities with $2^n$ possibilities (such as bytes, i.e., $2^8$ possibilities).
The mathematics of constructing concrete representations of finite fields is fascinating and well worth reading (a good introduction is (((3)))). However we won’t go into details here as not only would it be impossible to do it justice within the space limitations, but we won’t be using the detailed structure of $GF(2^n)$. The key bits are the features of being a field, that word I snuck into the previous paragraph, which basically means that various usual algebraic identities familiar from the real numbers still hold, namely,
associativity: $(a+b)+c=a+(b+c)$ and $(a b)c=a(b c)$.
commutativity: $a+b=b+a$ and $a b=b a$.
distributivity: $a(b+c)=a b+a c$.
identity elements: $a+0=a$ and $1 a=a$.
inverse elements: $-a+a=0$ and, providing $a \ne 0$, $a^{-1} a=1$.
and an additional rule that is a new feature of $GF(2^n)$: $GF(2^n)$ is what’s known as a field of characteristic 2. What this means is that $x+x$ is always $0$ no matter what $x$ is. (In general, a finite field $\mathbb{F}$ with $p^n$ elements has characteristic $p$, i.e., adding $p$ copies of the same thing together always gives $0$.) This is a new feature that crops up with finite fields ($\mathbb{R}$ doesn’t have it) and is a very convenient property we’ll make extensive use of later on.
However, it’s important to bear in mind that a finite field like $GF(2^8)$ has some differences to normal numbers: for example, the animation below shows various lines in a vector space over $\mathbb{R}$.
Here, we’re looking at the line $(a,b)^T \cdot (x,y)^T = c$. In $\mathbb{R}^2$ keeping $a$ and $b$ fixed while increasing $c$ generates parallel lines; if in an outer process we keep $b$ fixed while increasing $a$ then we’d be continuously changing the gradient of the sets of parallel lines as seen above. By contrast, the animation below shows various lines in a vector space over $GF(2^8)$:
For both animations the precise procedure being followed in the $\mathbb{R}^2$ or $GF(2^8)^2$ animations above is::
I picked a fixed value for $b$ and 3 values of $c$ that “are equally spaced apart” which are associated with colour red, green or blue respectively.
Each frame $a$ is moved onto its “successor”.
In that frame, I found all the $(x, y)$ pairs which satisfy the equation for the current $a$ for each $c$ and plotted them in the associated colour.
When $a$ has reaches its maximum value it wraps around and the animation repeats.
(Several terms are in “vague quotes” since the notions are a bit more awkward over $GF(p^n)$ than over $\mathbb{R}$.) What this shows is that the notion of a “straight line” – and thus also the dot-product operation – in a vector space over $GF(2^n)$ is a far more “dispersed”, disconnected structure than the straight, connected lines produced in $\mathbb{R}$.
Puzzle 1: The ring of integers $\mathbb{Z}/n\mathbb{Z}$ generally has some elements with inverses; e.g., $13 \times 197 = 1 \pmod{2^8}$ and hence $13$ has inverse $197$ working over $\mathbb{Z}/2^8\mathbb{Z}$. By contrast there’s simply no element in $\mathbb{Z}/2^8\mathbb{Z}$ which works as an inverse for $8$, but so what? Why can’t we work over $\mathbb{Z}/2^8\mathbb{Z}$ rather than the more complicated $GF(2^8)$? Why is it so important that every non-zero element has an inverse?
Now we can look at the key step in the solution: We’ve found two solution vectors $u$ and $v$ which both satisfy a set of equations $a_i \cdot x = \alpha_i$ equations, and we want to get a solution $w$ which satisfies a new equation $b \cdot x = \beta$ while still satisfying the old equations. It turns out there’s an especially simple way we can combine $u$ and $v$ to get $w$, namely
We can see the “strangeness” of $GF(2^n)$ already if we look at what happens when we suppose $\beta = b \cdot v$, i.e., when $v$ already satisfies the new equation:
where we’ve used just the field identities above and that $b \cdot v + b \cdot v$ is always 0. Likewise if you investigate the case $\beta = b \cdot u$ you’ll get $w = u$. So $w$ is always a point on the “line” that passes through $u$ and $v$. In theory what’s happening is that given two solution vectors (marked with blue plus-signs in the 2-D visualisation below) we “conceptually” construct the line between them (shown as red crosses).
(Of course as a “line” in $GF(2^n)$ it’s not the straight connected line we’re used to.) We then find the point where it intersects with the solution space of the new equation (which in this 2-D case is another line, shown as green plus-signs; in $N$ dimensions it’s an $N-1$ dimensional hyper-plane). You can see that the red and green markers fall on the same point about a fifth of the way from the bottom of the image just in the half nearer the right hand edge. However, obviously we don’t really do that; just as in the case in $\mathbb{R}$ we can find the point of intersection purely algebraically without constructing the full sets of points.
Now if we calculate what $a_i \cdot w$ is (for an arbitrary $i$) using the fact that the work we did previously has ensured that $a_i \cdot u = a_i \cdot v = \alpha_i$, we find the term involving $\beta$ “cancels” because we end up with a sum involving two copies of it and we end up with
which by the usual rules for cancelling in fractions is just $\alpha_i$. (Try writing it out: the patterns of simplification that occur are very pretty.) You can also straightforwardly compute $b \cdot w$ and this time as there are two copies of the non-$\beta$ terms (so again they vanish), ending up with
which obviously simplifies to just $\beta$.
(This is a slight generalization of Raghavendra’s original technique, where he cleverly observed that in a finite field $\mathbb{F}_p$ – which has characteristic $p$ – if you’ve got some solution vectors $v_1$, …, $v_{p+1}$ that each individually satisfy a set of equations $a_i \cdot v_j = \alpha_i$, then
where the final step is because $p$ $\alpha_i$s sum to $0$ in $\mathbb{F}_p$. As this holds for all $i$, this gives a “new” solution to the set of equations generated from the existing ones. One drawback to Raghavendra’s technique is that this doesn’t give you any control of what value the new solution gives for a new equation. His procedure relies on it satisfying the new equation “by chance” and is applied over small fields like $\mathbb{F}_2$ where this chance is good.)
This is the core technique that we’ll use, but we need to specify what to do in various unusual cases for a complete recipe. Without loss of generality, we assume that the first set of vectors to classify contains just a single vector; if it doesn’t we just split the first actual set into first a single vector, then the remainder as the next set. Doing this makes describing the initialisation easier.
So here are the gory details (where all arithmetic is done over $GF(2^n)$):
For initial vector $a$, set labelling $l := \{ 1 \}$, vectors $V := \{ a_i^{-1} e_i | 1 \le i \le 2^n \}$ and $N:=1$.
For each new set of vectors $S$ to extend the labelling $(V,l)$ by:
Increase $N$ by $|S|$ and compute “trial labels” $l_{x} := \{ a \cdot x | a \in S \}$ for each $x \in V$.
Pick the vector $z$ from $V$ such that $|l_{z} \cup l| = N$ and add the values in $l_{z}$ to $l$.
Initialise new solution set $V' := \{ z \}$. (If any other vectors from $V$ give the same labelling, also add them to $V'$.)
Choose a vector $y$ from $V \slash V'$.
For each other vector $x$ use the update equation to compute a new solution vector $w$ from $x$ and $y$ and add it to $V'$.
If any step 1–5 isn’t possible, signal that the mapping is “full up” and exit.
Otherwise rename $V'$ to $V$ and pick the sparsest vector $x$ from $V$ as the current labeller.
TODO: point out how we’re choosing the RHS elements, not being given them
An important point here is that we need to preserve as much variety in the vectors in $V$ as possible; that’s why we only reduce the size of $V$ by 1 each iteration rather than by $|S|$.
There’s no essential reason for choosing $p^n := 2^8$, it just happens to fit well with hardware and is low enough to be tractable in terms of memory and time usage, but another potential choice is $p^n := 2^{16}$. In complexity terms, when we’ve labelled $N$ vectors the above algorithm has done $O(N p^{2n})$ operations, the same order as solution using conventional Gaussian elimination. However, as with other alogrithms such as (((5))) the fact that we generating each new vector from two randomly chosen existing solutions only with no dependency on anything else means that a parallelised version runs in $O(N p^n)$ timesteps.
So let’s have a look at this in practice:
EXPLANATION OF HOW NUMBER OF COMPONENTS INCREASES IN SPURTS
Puzzle 2: There’s an interesting aspect to the algorithm above. It inherently uses the fact that $\mathbb{F}$ allows division, but we don’t anywhere use the fact that $\mathbb{F}$ also allows subtraction. Mathematicians, and category theorists in particular, are interested in trying to move mathematical results from using fields to using rings – which allow subtraction but not division by all elements – and if possible further so that they only use rigs – where neither subtraction nor division are defined for all elements. So there’s a sense that division is an easier to avoid operation than subtraction, but in the above there are divisions but no subtractions. So is the opposite is the case, or is there another explanation?
The problem above is one way to tackle the core issue of perfect hashing. In general however the restriction to very small sets is not practical: we need to be able to deal with arbitrarily large sets. As discussed in the introduction, we’ve also dealt with the task is to map from names to uniform labels in such a way that there are no collisions. Since the names are likely not to be uniform length, uniformly-randomly-distributed vectors it’s common to build a two stage process for doing the labelling to reduce the general problem to this instance by the following means:
First use a simple, fixed uniform hashing function to generate a uniform key $u$ for each original key. A small number of bits from this key are used to place the key in one of several buckets, and the remaining bits are used as a uniform random vector “intermediate key”.
Within each bucket use a more sophisticated procedure to map the intermediate keys to a set of more compact labels, and produce the final key by concatenating the bucket number and the compact label.
This is shown in this diagram:
If the number of keys to be processed is large, since the first step is essentially uniformly random assignment into buckets, it’s very highly likely that close to the same number of keys have landed in each one and this forms a reliable way to ensure the sophisticated procedure is only applied to a tractably small set of keys. One way to implement the second step is what we’ve looked at here.
This wider context is discussed in greater depth in, for example, (((4))).
The only paper I’ve found taking the ideas from (((1))) further is (((5))), and indeed if you consider the corresponding constructions between characteristc-2 $GF(2^n)$ and characteristic-0 $\mathbb{R}$, some of the formulae are related. Source code implementing the algorithm in this post can be found at (((6))).
R J Lipton, A New Way To Solve Linear Equations . ’Gödel’s Lost Letter and P=NP’ blog.
Birthday problem , Wikipedia.
Joerg Fliege, A Randomized Parallel Algorithm with Run Time $O(n^2)$ for Solving an $n \times n$ System of Linear Equations , arxiv.org.
Github link