The Azimuth Project
Blog - Linear maps that fill pigeon holes

This is not yet suitable for final editing. Review comments on the forum are welcome.

Introduction

Mathematicians and computer scientists often focus on different aspects of a task: in the mathematical view the primary question is whether objects with particular properties exist, while computer scientists are obsessed with methods for obtaining smaller concrete representations faster. As an example, a mathematician may focus on the fact that all symmetric matrices are guaranteed to have a full set of orthogonal eigenvectors, while a computer scientist may focus on ways to find them given a particular matrix. This is particularly true here, where we consider the problem of finding more compact “functions” for labelling incoming data faster than the obvious way. We’ll concentrate on the core of the real-world task, which is to incrementally maintain a mapping of randomly distributed vectors to compact labels in such a way that the mapping produced is computationally efficient. By incrementally we mean that the vectors to be labelled are revealed to us in groups, and at all times we must have a mapping which labels all the values received so far with the same labels they were given earlier.

As a concrete example of this, consider a book publisher. They produce books with arbitrarily long names, and for all sorts of purposes (warehouse storage of newly printed copies, labelling pallets to go to bookshops, processing returned books, etc.) it helps to have a compact, uniform-length label to use. This label needs to be produced from the book title, but there’s a key aspect to the task here: new book titles are only created occasionally, in batches (compared to the incredibly frequent operation of looking up the label corresponding to a given title), and at times when we know that new titles are being produced. This sets it apart from other tasks which involve “producing labels” (such as within a programming language compiler), where we can encounter a new title at any time and which use very different techniques.

There’s an important but non-obvious point here: suppose each time we create some new book titles we send the labelling to, say, every bookshop that stocks our books on a sale-or-return basis. When they need to send some books back, they can compute the short label to attach to each crate themselves, without contacting the publisher with the original titles to ask for the corresponding labels, and the shipment can be taken in and placed in the correct bins by the warehouse directly. (In actual computer applications this high degree of independence makes it easy to use this scheme in a distributed system.)

Real-world applications are not restricted to books but occur in many situations: consider for example the addition of new companies to a stock exchange listing, the addition of names of newly discovered species, the addition of new users to a social media site, and so on. You could even look at it as the quaint task of deciding how to assign pigeon-holes as people appear. This is the core task in the area of perfect hashing, a very active field in computer science. However, for most of the remainder of this article we’ll concentrate on the core mathematical problem, with a very brief description of the larger computer science setting at the end for experts.

This also shows how mathematical discourse is changing: the development here was inspired by a blog article (((1))) that described a clever technique devised by Prasad Raghavendra for solving linear systems over prime fields, an alternative to standard Gaussian elimination with attractive properties for some applications. To the best of my knowledge this has never been published in a journal, or even as a preprint on the arXiv. The real core of this article is an interesting mathematical relation that holds in spaces over finite fields of characteristic 2, and how this can be integrated into Raghavendra’s technique, which is interesting in its own right.

A place for everything, and everything in its place

A couple of notes on notation: as we’ll be choosing short, snappy names for various quantities, we’ll use a := b as an abbreviation for “let us denote by a the value of expression b” (keeping a = b for cases where a pre-exists). Secondly, given vectors a := (a_1, \dots, a_n) and b := (b_1, \dots, b_n), the dot product a \cdot b is the usual a_1 b_1 + \dots + a_n b_n.

We want to progressively and incrementally associate input vectors to distinct label values in some set which (in anticipation of later) we’ll call \mathbb{F}. More precisely, the fundamental problem being tackled is:

  • We’re first given a set of random input vectors S_a := \{a_1, \dots, a_{k_a}\} and must find a vector x_a such that the “label values under x_a”, \alpha_i := a_i \cdot x_a, are all distinct over the a_i \in S_a.

  • We then get a set of additional random vectors S_b := \{b_1, \dots, b_{k_b}\} and determine a vector x_b such that we still have a_i \cdot x_b = \alpha_i (again for each a_i \in S_a) and also that the \beta_i := b_i \cdot x_b values (over all the b_i \in S_b) combined with the \alpha_i’s are still all distinct.

  • Then we get yet another set S_c := \{c_1, \dots, c_{k_c}\} and find x_c maintaining a_i \cdot x_c = \alpha_i and b_i \cdot x_c = \beta_i and such that the \gamma_i := c_i \cdot x_c combined with the \alpha_i’s and \beta_i’s are all distinct.

  • Next we get a further set of vectors …

This whole process ends when we’ve received enough S_i’s that we fail to find a new solution vector which keeps all the dot products distinct. One way to look at this is that “dotting with x” specifies a mechanism converting an input vector into a box location, as shown in the sketch here:

picture

Note that a key aspect of the problem is that once we’ve associated an input vector d with a label \delta at some stage, at all future steps the new solution vector x must still satisfy d \cdot x = \delta.

You should also note that the problem description refers to the task of “finding vectors” rather than saying “there exists a vector”: this highlights that we’re working in the computer science realm, where we’re limited in the amount of work we’re prepared to do. Indeed, in terms of mathematical existence there’s a textbook solution if we’re prepared to wait until all the sets of vectors have been received before outputting a labelling: associate each incoming vector with a consecutive label in \mathbb{F}, stack all the vectors into a big matrix A and concatenate the labels into a vector to get the matrix equation

A x = (1, 2, \dots, N)^T

(where N is the number of vectors to label) which has solution

x = A^{-1} (1, 2, \dots, N)^T

providing that matrix A is of full rank, which amongst other things implies x must have the same number of components as the number of random vectors we’re labelling. (We won’t go into the solutions that still exist in some cases when A isn’t of full rank here, concentrating on the alternative below.)
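To make the textbook route concrete, here’s a minimal sketch of the batch solution over the reals (using numpy purely for illustration; the setting we actually adopt below is a finite field):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = rng.normal(size=(N, N))     # the N random input vectors, stacked as rows
labels = np.arange(1.0, N + 1)  # assign the consecutive labels 1..N
x = np.linalg.solve(A, labels)  # needs A to be square and of full rank
assert np.allclose(A @ x, labels)  # every vector now dots with x to its label
```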

But as we want to keep the number of components in x low, we’ll try a different tack.

Birthday paradoxes

Before we do that, we should observe that we’d expect this labelling task to be non-trivial; we can’t just grab a random mapping vector x and expect it to work. This is for the same reason as the well-known phenomenon that if you take a random group of 23 people, there’s a probability of over 50 percent that two of them share a birthday. This is referred to as a paradox because intuition suggests the group would need to be around half the size of the number of possibilities in order to get a probability \ge 0.5 of a collision, not around its square root. This is illustrated in the following plot (taken from the Wikipedia article (((2))), a good source of more information):

picture

This is a particular example of “given a set \mathbb{F}, if you take n uniformly random choices the likelihood that there are at least two identical elements is better than evens when |\mathbb{F}| \times \dots \times (|\mathbb{F}| - n + 1) \le |\mathbb{F}|^n / 2”. The relevance here is that if we just construct a mapping at random when |\mathbb{F}| = 2^8, on average we’ll only be able to label around 18 elements before an already-used label gets attached to a new vector! Since this just isn’t good enough, it looks like generating a mapping entirely at random is out.
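As a quick sanity check of that square-root scaling, here’s a small computation (the even-odds group size is a slightly different statistic from the average run length quoted above, but both scale like the square root of |\mathbb{F}|):

```python
def p_all_distinct(size, n):
    # Probability that n uniformly random labels drawn from a set of
    # the given size are all distinct.
    p = 1.0
    for i in range(n):
        p *= (size - i) / size
    return p

n = 1
while p_all_distinct(256, n) > 0.5:
    n += 1
print(n)  # 20: for |F| = 2^8, a clash is already more likely than not
```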

Finite fields of size 2^n

Since our goal is to obtain labels \alpha_i, \beta_i, etc., all lying within a small set of discrete values, we don’t really want to work over the real numbers \mathbb{R} or the integers \mathbb{Z}, both of which are infinite. Instead we’d like to work in a finite field \mathbb{F}, an “alternative set of numbers” which has only a small (and finite!) number of elements. What sort of options do we have for the size of \mathbb{F}? It turns out the restriction of being a field means that the sizes follow a strict pattern: it was shown in the nineteenth century that a finite field must have size p^n for some prime p and some natural number n \ge 1. These fields are known as Galois fields, frustratingly denoted GF(p^n) rather than the more uniform \mathbb{F}_{p^n}. They’re very convenient since if we pick size 2^n (and in practice 2^8) it fits well with the way computer hardware is built around processing entities with 2^n possibilities (such as bytes, i.e., 2^8 possibilities).

The mathematics of constructing concrete representations of finite fields is fascinating and well worth reading about (a good introduction is (((3)))). However we won’t go into details here: not only would it be impossible to do the subject justice within the space limitations, but we won’t be using the detailed structure of GF(2^n). The key bits are the features of being a field, that word I snuck into the previous paragraph, which basically means that various usual algebraic identities familiar from the real numbers still hold, namely:

  • associativity: (a + b) + c = a + (b + c) and (a b) c = a (b c).

  • commutativity: a + b = b + a and a b = b a.

  • distributivity: a (b + c) = a b + a c.

  • identity elements: a + 0 = a and 1 a = a.

  • inverse elements: -a + a = 0 and, providing a \ne 0, a^{-1} a = 1.

and an additional rule that is a new feature of GF(2^n): GF(2^n) is what’s known as a field of characteristic 2. What this means is that x + x is always 0 no matter what x is. (In general, a finite field \mathbb{F} with p^n elements has characteristic p, i.e., adding p copies of the same thing together always gives 0.) This is a new feature that crops up with finite fields (\mathbb{R} doesn’t have it) and is a very convenient property we’ll make extensive use of later on.
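For readers who like to experiment, here’s a minimal sketch of GF(2^8) arithmetic. The reduction polynomial x^8 + x^4 + x^3 + x + 1 (the one AES uses) is just an illustrative choice; any irreducible degree-8 polynomial gives an equally valid representation:

```python
def gf_add(a, b):
    # Addition in characteristic 2 is bitwise XOR, so a + a == 0 always.
    return a ^ b

def gf_mul(a, b):
    # Carry-less multiplication, reduced modulo 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a):
    # The multiplicative group has order 255, so a^254 is a's inverse.
    assert a != 0  # 0 has no inverse
    r, e = 1, a
    for _ in range(7):  # builds a^(2 + 4 + ... + 128) = a^254
        e = gf_mul(e, e)
        r = gf_mul(r, e)
    return r

assert gf_add(0x57, 0x57) == 0          # x + x = 0: characteristic 2
assert gf_mul(0x57, gf_inv(0x57)) == 1  # every non-zero element has an inverse
```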

However, it’s important to bear in mind that a finite field like GF(2^8) has some differences from normal numbers: for example, the animation below shows various lines in a vector space over \mathbb{R}.

Example of lines in normal space

Here, we’re looking at the line (a, b)^T \cdot (x, y)^T = c. In \mathbb{R}^2, keeping a and b fixed while increasing c generates parallel lines; if in an outer process we keep b fixed while increasing a then we’d be continuously changing the gradient of the sets of parallel lines, as seen above. By contrast, the animation below shows various lines in a vector space over GF(2^8):

picture

For both animations the precise procedure being followed in \mathbb{R}^2 or GF(2^8)^2 is:

  1. I picked a fixed value for b and 3 values of c that “are equally spaced apart”, associated with the colours red, green and blue respectively.

  2. Each frame, a is moved onto its “successor”.

  3. In that frame, I found all the (x, y) pairs which satisfy the equation for the current a for each c and plotted them in the associated colour.

  4. When a reaches its maximum value it wraps around and the animation repeats.

(Several terms are in “vague quotes” since the notions are a bit more awkward over GF(p^n) than over \mathbb{R}.) What this shows is that the notion of a “straight line” (and thus also the dot-product operation) in a vector space over GF(2^n) is a far more dispersed, disconnected structure than the straight, connected lines produced in \mathbb{R}.

Puzzle 1: The ring of integers \mathbb{Z}/n\mathbb{Z} generally has some elements with inverses; e.g., 13 \times 197 = 1 \pmod{2^8} and hence 13 has inverse 197 working over \mathbb{Z}/2^8\mathbb{Z}. By contrast there’s simply no element in \mathbb{Z}/2^8\mathbb{Z} which works as an inverse for 8. But so what? Why can’t we work over \mathbb{Z}/2^8\mathbb{Z} rather than the more complicated GF(2^8)? Why is it so important that every non-zero element has an inverse?

Extending solutions to satisfy more equations

Now we can look at the key step in the solution: we’ve found two solution vectors u and v which both satisfy a set of equations a_i \cdot x = \alpha_i, and we want to get a solution w which satisfies a new equation b \cdot x = \beta while still satisfying the old equations. It turns out there’s an especially simple way we can combine u and v to get w, namely

w := \frac{\beta + b \cdot v}{b \cdot v + b \cdot u} u + \frac{\beta + b \cdot u}{b \cdot v + b \cdot u} v.

We can see the “strangeness” of GF(2^n) already if we look at what happens when we suppose \beta = b \cdot v, i.e., when v already satisfies the new equation:

w = \frac{b \cdot v + b \cdot v}{b \cdot v + b \cdot u} u + \frac{b \cdot v + b \cdot u}{b \cdot v + b \cdot u} v = \frac{0}{b \cdot v + b \cdot u} u + 1 v = v

where we’ve used just the field identities above and the fact that b \cdot v + b \cdot v is always 0. Likewise if you investigate the case \beta = b \cdot u you’ll get w = u. So w is always a point on the “line” that passes through u and v. In theory what’s happening is that given two solution vectors (marked with blue plus-signs in the 2-D visualisation below) we “conceptually” construct the line between them (shown as red crosses).

intersection in GF(256)

(Of course as a “line” in GF(2^n) it’s not the straight connected line we’re used to.) We then find the point where it intersects the solution space of the new equation (which in this 2-D case is another line, shown as green plus-signs; in N dimensions it’s an (N-1)-dimensional hyperplane). You can see that the red and green markers fall on the same point about a fifth of the way up from the bottom of the image, in the half nearer the right-hand edge. However, we obviously don’t really do that: just as in the case of \mathbb{R}, we can find the point of intersection purely algebraically, without constructing the full sets of points.

Now if we calculate what a_i \cdot w is (for an arbitrary i), using the fact that the work we did previously ensured a_i \cdot u = a_i \cdot v = \alpha_i, we find the terms involving \beta cancel, because the sum contains two copies of \beta \alpha_i, and we’re left with

\frac{b \cdot v + b \cdot u}{b \cdot v + b \cdot u} \alpha_i,

which by the usual rules for cancelling in fractions is just \alpha_i. (Try writing it out: the patterns of simplification that occur are very pretty.) You can also straightforwardly compute b \cdot w: this time it’s the non-\beta terms that appear in two copies (so again they vanish), and we end up with

\frac{b \cdot v + b \cdot u}{b \cdot v + b \cdot u} \beta,

which obviously simplifies to just \beta.
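Here’s a minimal sketch of this combination step, reusing the gf_mul and gf_inv helpers sketched earlier and representing vectors as plain lists of field elements:

```python
def gf_dot(a, x):
    # Dot product over GF(2^8); addition is XOR in characteristic 2.
    r = 0
    for ai, xi in zip(a, x):
        r ^= gf_mul(ai, xi)
    return r

def extend_solution(u, v, b, beta):
    # w := ((beta + b.v) / (b.u + b.v)) u + ((beta + b.u) / (b.u + b.v)) v
    bu, bv = gf_dot(b, u), gf_dot(b, v)
    d = gf_inv(bu ^ bv)  # undefined when b.u = b.v: the "line" through
                         # u and v then never meets the new solution space
    cu = gf_mul(bv ^ beta, d)
    cv = gf_mul(bu ^ beta, d)
    return [gf_mul(cu, ui) ^ gf_mul(cv, vi) for ui, vi in zip(u, v)]
```

Note that the two coefficients sum to (\beta + b \cdot v + \beta + b \cdot u) / (b \cdot u + b \cdot v) = 1, which is exactly the cancellation that preserves all the old labels.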

(This is a slight generalization of Raghavendra’s original technique, where he cleverly observed that in a finite field \mathbb{F}_p (which has characteristic p), if you’ve got some solution vectors v_1, …, v_{p+1} that each individually satisfy a set of equations a_i \cdot v_j = \alpha_i, then

a_i \cdot \left( \sum_{j=1}^{p+1} v_j \right) = \sum_{j=1}^{p+1} a_i \cdot v_j = \sum_{j=1}^{p+1} \alpha_i = \alpha_i

where the final step is because p copies of \alpha_i sum to 0 in \mathbb{F}_p. As this holds for all i, this gives a “new” solution to the set of equations, generated from the existing ones. One drawback to Raghavendra’s technique is that this doesn’t give you any control over what value the new solution gives for a new equation. His procedure relies on it satisfying the new equation “by chance” and is applied over small fields like \mathbb{F}_2 where this chance is good.)
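As a toy check of that observation over \mathbb{F}_2, where p = 2 so we sum p + 1 = 3 solutions (the particular vectors below are hand-picked purely for illustration):

```python
def dot2(a, x):
    # Dot product over F_2: parity of the coordinate-wise ANDs.
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

a = (1, 1, 0)
sols = [(1, 0, 0), (0, 1, 0), (1, 0, 1)]  # three solutions of a . x = 1
assert all(dot2(a, s) == 1 for s in sols)
combo = tuple(x ^ y ^ z for x, y, z in zip(*sols))  # their sum over F_2
assert dot2(a, combo) == 1                          # ...is again a solution
```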

A concrete procedure

This is the core technique that we’ll use, but for a complete recipe we need to specify what to do in various unusual cases. Without loss of generality, we assume that the first set of vectors to classify contains just a single vector; if it doesn’t, we just split the first actual set into a single vector followed by the remainder as the next set. Doing this makes describing the initialisation easier.

So here are the gory details (where all arithmetic is done over GF(2^n)):

  • For the initial vector a, set the labelling l := \{ 1 \}, the solution vectors V := \{ a_i^{-1} e_i \mid 1 \le i \le 2^n \} and N := 1.

  • For each new set of vectors S to extend the labelling (V, l) by:

    1. Increase N by |S| and compute the “trial labels” l_x := \{ a \cdot x \mid a \in S \} for each x \in V.

    2. Pick a vector z from V such that |l_z \cup l| = N and add the values in l_z to l.

    3. Initialise the new solution set V' := \{ z \}. (If any other vectors from V give the same labelling, also add them to V'.)

    4. Choose a vector y from V \setminus V'.

    5. For each other vector x, use the update equation to compute a new solution vector w from x and y and add it to V'.

    6. If any of steps 1–5 isn’t possible, signal that the mapping is “full up” and exit.

    7. Otherwise rename V' to V and pick the sparsest vector x from V as the current labeller.

TODO: point out how we’re choosing the RHS elements, not being given them

An important point here is that we need to preserve as much variety in the vectors in V as possible; that’s why we only reduce the size of V by 1 each iteration rather than by |S|.
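To make the recipe concrete, here’s a rough sketch of one extension round for a batch containing a single new vector b, reusing gf_dot and extend_solution from above. It’s deliberately simplified: batches of several vectors would chain this step once per new vector, and vectors that can’t be repaired by the update equation are simply dropped:

```python
def extend_labelling(V, l, b):
    trials = [(x, gf_dot(b, x)) for x in V]    # step 1: trial labels
    fresh = [(x, t) for x, t in trials if t not in l]
    if not fresh:                              # step 6: no fresh label left
        raise RuntimeError("mapping is full up")
    z, beta = fresh[0]                         # step 2: adopt z's new label
    Vp = [x for x, t in trials if t == beta]   # step 3: keep agreeing vectors
    rest = [(x, t) for x, t in trials if t != beta]
    if rest:
        y, ty = rest[0]                        # step 4: one disagreeing vector
        for x, t in rest[1:]:                  # step 5: repair the others
            if t != ty:                        # the update needs b.x != b.y
                Vp.append(extend_solution(x, y, b, beta))
    return Vp, l | {beta}                      # step 7: V' becomes V
```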

There’s no essential reason for choosing p^n := 2^8; it just happens to fit well with hardware and is low enough to be tractable in terms of memory and time usage, but another potential choice is p^n := 2^{16}. In complexity terms, when we’ve labelled N vectors the above algorithm has done O(N p^{2n}) operations, the same order as a solution using conventional Gaussian elimination. However, as with other algorithms such as (((5))), the fact that we generate each new vector from just two randomly chosen existing solutions, with no dependency on anything else, means that a parallelised version runs in O(N p^n) timesteps.

Practical results

So let’s have a look at this in practice:

placeholder for final probability graph

EXPLANATION OF HOW NUMBER OF COMPONENTS INCREASES IN SPURTS

Puzzle 2: There’s an interesting aspect to the algorithm above. It inherently uses the fact that \mathbb{F} allows division, but we don’t anywhere use the fact that \mathbb{F} also allows subtraction. Mathematicians, and category theorists in particular, are interested in trying to move mathematical results from using fields to using rings (which allow subtraction but not division by all elements) and, if possible, further so that they only use rigs (where neither subtraction nor division is defined for all elements). So there’s a sense in which division is an easier operation to avoid than subtraction, yet in the above there are divisions but no subtractions. Is the opposite the case here, or is there another explanation?

Remind me why we’re doing this?

The problem above is one way to tackle the core issue of perfect hashing. In general, however, the restriction to very small sets is not practical: we need to be able to deal with arbitrarily large sets. As discussed in the introduction, the real task is to map names to uniform labels in such a way that there are no collisions. Since the names are unlikely to be uniform-length, uniformly-randomly-distributed vectors, it’s common to build a two-stage labelling process which reduces the general problem to this instance by the following means:

  • First use a simple, fixed uniform hashing function to generate a uniform key u for each original key. A small number of bits from this key are used to place the key in one of several buckets, and the remaining bits are used as a uniformly random “intermediate key” vector.

  • Within each bucket use a more sophisticated procedure to map the intermediate keys to a set of more compact labels, and produce the final key by concatenating the bucket number and the compact label.

This is shown in this diagram:

picture

If the number of keys to be processed is large, then since the first step is essentially uniformly random assignment into buckets, it’s very likely that close to the same number of keys land in each one, and this forms a reliable way to ensure the sophisticated procedure is only applied to a tractably small set of keys. One way to implement the second step is what we’ve looked at here.
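As an illustration of the first stage, here’s a minimal sketch; the choices of SHA-256, 8 bucket bits and an 8-component intermediate key are assumptions made purely for illustration:

```python
import hashlib

def split_key(name):
    h = hashlib.sha256(name.encode()).digest()  # fixed uniform hash of the name
    bucket = h[0]                # 8 bits select one of 256 buckets
    intermediate = list(h[1:9])  # 8 further bytes, read as a vector of
    return bucket, intermediate  # GF(2^8) components: the intermediate key
```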

This wider context is discussed in greater depth in, for example, (((4))).

For more

The only paper I’ve found taking the ideas from (((1))) further is (((5))), and indeed if you compare the corresponding constructions between characteristic-2 GF(2^n) and characteristic-0 \mathbb{R}, some of the formulae are related. Source code implementing the algorithm in this post can be found at (((6))).

  1. R J Lipton, A New Way To Solve Linear Equations, ‘Gödel’s Lost Letter and P=NP’ blog.

  2. Birthday problem, Wikipedia.

  3. Introduction to finite fields.

  4. Universal and perfect hashing.

  5. Joerg Fliege, A Randomized Parallel Algorithm with Run Time O(n^2) for Solving an n \times n System of Linear Equations, arXiv.

  6. GitHub link