The Azimuth Project
Experiments in entropy as a functor


There are many theorems saying what’s special about entropy, both for Shannon entropy and its generalization called Rényi entropy (or, not quite the same but differing only by a change of variables, Tsallis entropy). A modern approach based on category theory can be seen here:

While the use of category theory is kept to a bare minimum here, the idea for this paper arose in a series of blog conversations which used a lot more:

A simple summary can be found here:

The paper was written up on the nLab here:

and there is also a draft of a paper that never got finished, here:

The following are notes by John Baez, Tobias Fritz and Tom Leinster. The goal was to develop a deeper understanding of entropy, free energy, the partition function and related ideas in probability theory and statistical mechanics using the tools of modern algebra: categories, monads, algebraic theories, operads and the like.


Entropy as a functor

In our paper A characterization of entropy in terms of information loss we gave a characterization of Shannon and more generally Tsallis entropy, not just for probability measures on finite sets, but for arbitrary (nonnegative) measures on these sets. This characterization uses a certain category:


Let FinMeas\Fin\Meas be the category whose objects are finite sets equipped with measures and whose morphisms are measure-preserving functions.

We noted that one can take arbitrary nonnegative linear combinations of objects and morphisms in FinMeas\Fin\Meas. These can be built up from direct sums and scalar multiplication by nonnegative real numbers, defined as follows.

Tsallis entropy can be seen as the unique invariant of morphisms in FinMeas\Fin\Meas that is additive under composition, additive under direct sums, homogeneous with respect to scalar multiplication and also continuous:


Let α(0,)\alpha \in (0, \infty). Suppose FF is any map sending morphisms in FinMeas\Fin\Meas to numbers in [0,)[0,\infty), and obeying these four properties:

  1. Functoriality:

    F(fg)=F(f)+F(g) F(f \circ g) = F(f) + F(g)

    whenever f,gf,g are composable morphisms.

  2. Additivity:

    F(fg)=F(f)+F(g) F(f \oplus g) = F(f) + F(g)

    for all morphisms f,gf,g.

  3. Homogeneity of degree α\alpha:

    F(λf)=λ αF(f) F(\lambda f) = \lambda^\alpha F(f)

    for all morphisms ff and all λ[0,)\lambda \in [0,\infty).

  4. Continuity: FF is continuous.

Then there exists a constant c0c \ge 0 such that for any morphism f:pqf : p \to q in FinMeas\Fin\Meas,

F(f)=c(H α(p)H α(q)) F(f) = c(H_\alpha(p) - H_\alpha(q))

where H αH_\alpha is the Tsallis entropy of order α\alpha. Conversely, for any constant c0c \ge 0, this formula determines a map FF obeying conditions 1-4.

In the simplest case, namely homogeneity of degree α=1\alpha = 1, the Tsallis entropy H αH_\alpha reduces to the Shannon entropy.

In fact, such an FF really defines a functor from FinMeas\Fin\Meas to a category with one object and with nonnegative real numbers as morphisms. So we now go ahead and formulate our results in the language of category theory. For this we need to pick a way to view the nonnegative real numbers as a category:


Let +\mathbb{R}_+ be the category with a single object *\ast and [0,)[0,\infty) as the set of morphisms from this object to itself, with addition as composition.

The categories +\mathbb{R}_+ and FinMeas\Fin\Meas have much in common, which is why it is reasonable to assign an entropy change, namely a morphism in +\mathbb{R}_+, to each morphism in FinMeas\Fin\Meas. Here we focus on three similarities between these categories.

First, both +\mathbb{R}_+ and FinMeasFin\Meas are topological categories. Recall that a topological category is a category where the set of objects and the set of morphisms form topological spaces, and all the category operations are continuous. A continuous functor is a functor between topological categories that maps objects to objects and morphisms to morphisms in a continuous way.

To make +\mathbb{R}_+ into a topological category, we use the only possible topology on the set of objects, and the standard topology on the set of morphisms [0,)[0,\infty). To make FinMeas\Fin\Meas into a topological category, we should say not only when a sequence of objects or morphisms converges, but also when any net converges. A net of objects p βFinMeasp_\beta \in \Fin\Meas converges to pFinMeasp \in \Fin\Meas if eventually all the p βp_\beta have the same underlying finite set XX, and then (p β) ip i(p_\beta)_i \to p_i in the usual topology on [0,)[0,\infty) for all iXi \in X. We say a net of morphisms f β:p βq βf_\beta: \p_\beta \to q_\beta converges if p βpp_\beta \to p and q βqq_\beta \to q for some p,qFinMeasp,q\in \Fin\Meas and for large enough β\beta the underlying function of f βf_\beta equals a fixed function f:XYf: X \to Y. The observant reader will note that FinMeas\Fin\Meas is a ‘large’ topological category, meaning that its set of objects lives in a larger Grothendieck universe than that containing the finite sets we use here. This will not present a problem.

Second, both +\mathbb{R}_+ and FinMeasFin\Meas come equipped with a notion of ‘addition’ making them into symmetric monoidal categories. For +\mathbb{R}_+ this addition is trivial on objects, and the usual addition of obvious nonnegative real numbers on morphisms. The resulting functor

+: +× + + + : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+

makes +\mathbb{R}_+ into a symmetric monoidal category. On the other hand, we have already seen how disjoint union of finite sets equipped with measures defines a ‘direct sum’ operation on FinMeas\Fin\Meas. This gives a functor

:FinMeas×FinMeasFinMeas \oplus : \Fin\Meas \times \Fin\Meas \to \Fin\Meas

which makes FinMeas\Fin\Meas into a symmetric monoidal category.

Third, both +\mathbb{R}_+ and FinMeasFin\Meas can be equipped with a notion of ‘scalar multiplication’ by any number λ[0,)\lambda \in [0,\infty). This gives an action of the multiplicative monoid [0,)[0,\infty) on these categories, by which we mean monoid homomorphisms

[0,)End( +),[0,)End(FinMeas) [0,\infty) \rightarrow \End(\mathbb{R}_+) \: , \qquad [0,\infty) \rightarrow \End(\Fin\Meas)

where for a category CC, the monoid End(C)\End(C) consists of all functors CCC\rightarrow C with functor composition as monoid multiplication.

Now let us think a bit about how the topological, additive and multiplicative structures on FinMeas\Fin\Meas and +\mathbb{R}_+ interact with each other. Note that while in general on FinMeas\Fin\Meas,

(λ+λ)(p)λpλp (\lambda + \lambda')(p) \ncong \lambda p \oplus \lambda' p


0p, 0 p \ncong \emptyset,

we do have canonical isomorphisms

λ(pq)λpλq \lambda (p \oplus q) \cong \lambda p \oplus \lambda q


λ. \lambda \emptyset \cong \emptyset .

So [0,)[0,\infty) acts not just as endofunctors on FinMeas\Fin\Meas, but as symmetric monoidal endofunctors. Writing End SymMonCatEnd_{\Sym\Mon\Cat} for the monoid of symmetric monoidal endofunctors of a symmetric monoidal category CC, we hence have a monoid homomorphism

[0,)End SymMonCat(FinMeas) [0,\infty) \rightarrow \End_{\Sym\Mon\Cat}(\Fin\Meas)

So, given a topological monoid MM, let us call a topological symmetric monoidal category (X,,0)(X, \oplus, 0) equipped with a monoidal functor

MEnd SymmMonCat(X)M \to End_{\Symm\Mon\Cat}(X)

providing a continuous action of MM on XX an MM-module category. We can summarize the previous two paragraphs by saying that FinMeas\Fin\Meas is a [0,)[0,\infty)-module category, and for each α>0\alpha \gt 0 there is an [0,)[0,\infty)-module category + α\mathbb{R}_+^\alpha.

By a morphism of MM-module categories, we mean a symmetric monoidal functor which is compatible with the action of MM. In this way, we obtain a category of MM-module categories; this is the coslice category over MM in the category of symmetric monoidal categories with strict symmetric monoidal functors.

Using these concepts, we see that Theorem 2 implies a nice characterization of Shannon entropy:


Suppose F:FinMeas +F : \Fin\Meas \to \mathbb{R}_+ is a morphism of [0,)[0,\infty)-module categories. Then for any morphism f:pqf : p \to q in FinMeas\Fin\Meas,

F(f)=c(H(p)H(q)) F(f) = c(H(p) - H(q))

for some c0c \ge 0, where HH is Shannon entropy. Conversely, for any constant c0c \ge 0 this equation defines a morphism of [0,)[0,\infty)-module categories.

What about the Tsallis entropy for α1\alpha \ne 1? In fact, for each number α>0\alpha \gt 0 there is a way to make +\mathbb{R}_+ into a [0,)[0,\infty)-module category, by letting λ[0,)\lambda \in [0,\infty) act trivially on objects and as multiplication by λ α\lambda^\alpha on morphisms. We write + α\mathbb{R}_+^\alpha for +\mathbb{R}_+ equipped with this action of [0,)[0,\infty).

The action of [0,)[0,\infty) on the category + α\mathbb{R}_+^\alpha has properties already mentioned in the case α=1\alpha = 1. Namely, there are equations

λ α(x+x)=λ αx+λ αx \lambda^\alpha(x+x') = \lambda^\alpha x + \lambda'^\alpha x


0 αx=0, 0^\alpha x = 0 \:,

and so the action actually also operates in terms of symmetric monoidal endofunctors,

A α:[0,)End SymMonCat( + α) A_\alpha : [0,\infty) \to End_{\Sym\Mon\Cat}(\mathbb{R}_+^\alpha)

This homomorphism A αA_\alpha is also continuous. So, for each α>0\alpha \gt 0 we see that + α\mathbb{R}_+^\alpha is a [0,)[0,\infty)-module. Theorem Theorem 2 implies:


Let α(0,)\alpha\in(0,\infty). Suppose F:FinMeas + αF : \Fin\Meas \to \mathbb{R}_+^\alpha is a morphism of [0,)[0,\infty)-module categories. Then for any morphism f:pqf : p \to q in FinMeas\Fin\Meas,

F(f)=c(H α(p)H α(q)) F(f) = c(H_\alpha(p) - H_\alpha(q))

for some c0c \ge 0, where H αH_\alpha is the Tsallis entropy of order α\alpha. Conversely, for any constant c0c \ge 0 this equation defines a morphism of [0,)[0,\infty)-module categories.

The partition function


Let FinMeas >0\Fin\Meas_{\gt 0} be the category of strictly positive finite measure spaces, where:

• an object is a finite set SS equipped with a strictly positive measure: that is, a measure μ\mu for which every set has nonnegative measure, and the only set of measure zero is the empty set.

• a morphism f:(S,μ)(S,μ)f: (S,\mu) \to (S',\mu') is a function such that the pushforward of μ\mu is μ\mu'.

We define a topological category is a category internal to TopTop, and a continuous functor is a functor internal to TopTop. There is a category Cat(Top)Cat(Top) with topological categories as objects, and continuous functors as morphisms.

We shall make FinMeas >0\Fin\Meas_{\gt 0} into a topological category as follows. The topology on objects and morphisms is the obvious one except that I’m tempted to do the following: when a sequence of measures μ j \mu_j on a set S S is trying to converge (in the obvious sense!) to a measure μ \mu for which only points in some subset TS T \subseteq S have nonzero measure, we say that

(S,μ j)(T,μ| T) (S,\mu_j) \to (T,\mu|_T)

In simple heuristic terms: call the points in a finite set ‘states’. Then when the probability of certain possible states goes to zero, these states effectively ‘go away’.

Note that one interesting feature of this definition is that the space of objects of FinMeas >0\Fin\Meas_{\gt 0} is connected.


Let \mathbb{R} be the topological category where objects are real numbers and there is a unique morphism f:xyf: x \to y whenever xyx \ge y (and none otherwise).

Let us classify all continuous functors

f:FinMeas >0 f: \Fin\Meas_{\gt 0} \to \mathbb{R}

with the property that

f(XY)=f(X)+f(Y) f(X \oplus Y) = f(X) + f(Y)

Here \oplus at left denotes the ‘direct sum’ of measure spaces, while at right it denotes the usual sum of real numbers.

Every functor

F:FinMeas >0 F: \Fin\Meas_{\gt 0} \to \mathbb{R}

restricts to a continuous functor

G:core(FinMeas >0) G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

Here core(FinMeas >0)core(\Fin\Meas_{\gt 0}) is the core of the category FinMeas >0\Fin\Meas_{\gt 0}, meaning that we keep the same objects but include only the isomorphisms. Note that this restriction loses no information. So, if we classify all continuous functors

G: G: \to \mathbb{R}

obeying the equation G(XY)=G(X)+G(Y)G(X \oplus Y) = G(X) + G(Y), the answer to our problem will be easy.


Continuous functors

G:core(FinMeas >0) G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

obeying the equation G(XY)=G(X)+G(Y)G(X \oplus Y) = G(X) + G(Y) correspond to continuous functions

g:[0,) g: [0,\infty) \to \mathbb{R}

with the property that g(0)=0g(0) = 0.

Proof Sketch - Since every object in FinMeas >0\Fin\Meas_{\gt 0} is a direc sum of one-element measure spaces, the functor gg is determined by its value on those. A one-element measure space is just a nonnegative real number, so GG is determined by a function

g:[0,) g: [0,\infty) \to \mathbb{R}

Continuity of the functor GG implies that gg is continuous. Note that GG applied to the empty set must give 0. Thanks to the funny topology on FinMeas\Fin\Meas and the continuity of GG, this forces g(0)=0g(0) = 0. Conversely, given any such function gg we can define GG on the finite measure space (S,μ)(S,\mu) by

G(S,μ)= iSg(μ i) G(S, \mu) = \sum_{i \in S} g(\mu_i)

where μ i\mu_i is μ\mu of the point iSi \in S.


Continuous functors

F:FinMeas >0 F: \Fin\Meas_{\gt 0} \to \mathbb{R}

obeying the equation F(XY)=F(X)+F(Y)F(X \oplus Y) = F(X) + F(Y) correspond to continuous functions

f:[0,) f: [0,\infty) \to \mathbb{R}

such that f(0)=0f(0) = 0 and and f(x+y)f(x)+f(y)f(x+y) \le f(x) + f(y).

Proof Sketch - Use the previous proposition and show that for G:FinMeas 0G : \Fin\Meas_0 \to \mathbb{R} to extend to some (necessarily unique) F:FinMeas 0F: \Fin\Meas_0 \to \mathbb{R}, it is necessary and sufficient that the corresponding function gg obeys g(x+y)g(x)+g(y)g(x+y) \ge g(x) + g(y). We need this condition so that any map from the 2-point measure space to a 1-point measure space gets sent to a morphism in \mathbb{R}. Every map in FinMeas\Fin\Meas is surjective and every surjection is a composite of coproducts of copies of the identity and these 2-points-to-1 maps, so this is all we need.


Continuous functors

G:core(FinMeas >0) G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

obeying the equations G(XY)=G(X)+G(Y)G(X \oplus Y) = G(X) + G(Y) and G(X×Y)=G(X)×G(Y)G(X \times Y) = G(X) \times G(Y) correspond to continuous functions

g:[0,) g: [0,\infty) \to \mathbb{R}

of the form g(x)=x αg(x) = x^\alpha with α0\alpha \ge 0.

Proof - This should follow from our earlier results.


Continuous functors

F:FinMeas >0 F: \Fin\Meas_{\gt 0} \to \mathbb{R}

obeying the equation F(XY)=F(X)+F(Y)F(X \oplus Y) = F(X) + F(Y) and F(X×Y)=F(X)×F(Y)F(X \times Y) = F(X) \times F(Y) correspond to continuous functions

f:[0,) f: [0,\infty) \to \mathbb{R}

of the form f(x)=x αf(x) = x^\alpha with α1\alpha \ge 1.

Proof - This should follow from our earlier results.

Suppose (S,μ)(S,\mu) is an object in FinMeas\Fin\Meas. The continuous functor

G:core(FinMeas >0) G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

corresponding to the function g(x)=x αg(x) = x^\alpha is then equal to

G(S,μ)= iSμ i α G(S,\mu) = \sum_{i \in S} \mu_i^\alpha

We may fix (S,μ)(S,\mu) and think of G(S,μ)G(S,\mu) as a function of α[0,)\alpha \in [0,\infty). This function plays an important role in physics, where it is called the partition function of X=(S,μ)X = (S,\mu). We denote this by

Z(X):[0,)[0,),Z(X) : [0,\infty) \to [0,\infty),

so that

Z(X)(α)= iSμ i α Z(X)(\alpha) = \sum_{i \in S} \mu_i^\alpha

The following proposition says the partition function is a ‘complete invariant’ of positive finite measure spaces:


Two objects of FinMeas >0\Fin\Meas_{\gt 0} have the same partition function if and only if they are isomorphic.

Proof - It is obvious that isomorphic objects of FinMeas >0\Fin\Meas_{\gt 0} have the same partition function. So, we need to show that we can recover X=(S,μ)X = (S,\mu) from its partition function. To see this, note that Z(X)(α)Z(X)(\alpha) is the Laplace transform of this measure on the real line:

ν= iSP lnμ i \nu = \sum_{i \in S} \mathbf{P}_{\ln \mu_i}


0 e αsdν(s) = iS 0 e αsP lnμ i(s)ds = iSe klnμ i = iSμ i α = Z(X)(α) \begin{array}{ccl} \int_0^\infty e^{\alpha s} d\nu(s) &=& \sum_{i \in S} \int_0^\infty e^{\alpha s} \mathbf{P}_{\ln \mu_i}(s) \, d s \\ &=& \sum_{i \in S} e^{k \ln \mu_i} \\ &=& \sum_{i \in S} \mu_i^\alpha \\ &=& Z(X)(\alpha) \end{array}

So, we can recover the measure ν\nu from the partition function Z(X)Z(X) by taking its inverse Laplace transform. Furthermore, we can recover the multiset of numbers μ i\mu_i from the measure ν\nu.

We can restate this proposition in an even more complicated way. For any (essentially small) category CC let C̲\underline{C} be the decategorification of CC: that is, the set of isomorphism classes of objects of CC. If CC is a symmetric rig category then C̲\underline{C} will naturally become a commutative rig.


Let \mathcal{R}, the rig of partition functions, be the set of functions f:[0,)[0,)f : [0,\infty) \to [0,\infty) of the form

Z(α)= iSp i α Z(\alpha) = \sum_{i \in S} p_i^\alpha

where SS is an arbitrary finite set and p ip_i are arbitrary positive real numbers. \mathcal{R} is a closed under pointwise addition and multiplication, making it into commutative rig.


The map Z:FinMeas >0̲Z: \underline{\Fin\Meas_{\gt 0}} \to \mathcal{R} sending each each isomorphism class of strictly positive finite measure space to its partition function is an isomorphism of commutative rigs.

Proof - When you penetrate the jargon, the only nonobvious part here is that ZZ is a bijection, which was shown in the previous proposition.

Convex algebras


The monad for convex sets is a monad on SetSet sending any set XX to the set of finitely-supported probability distributions on XX.

For example, this monad sends {1,,n}\{1, \ldots, n\} to a set P n\mathbf{P}_n, which can be identified with the (n1)(n - 1)-simplex. This monad is a finitary monad, so can be thought about in a few different ways.

A finitary monad is the same thing as a finitary algebraic theory. This one can be presented by a family (* t) t[0,1](*_t)_{t \in [0, 1]} of binary operations, subject to some equations. They’re probably exactly those equations that hold in a convex subset of n\mathbb{R}^n if we put x* ty=(1t)x+tyx *_t y = (1 - t)x + t y.

(I suspect there’s a theorem here: that for n1n \ge 1, n\mathbb{R}^n “satisfies no laws”. This means there are no equations between the various operations (x,y)(1t)x+ty(x, y) \mapsto (1 - t)x + t y beyond those forced by its being an algebra for this theory.)

A finitary algebraic theory can also be thought of as a kind of souped-up operad. In a symmetric operad PP, one has for each bijection σ:{1,,n}{1,,n}\sigma: \{1, \ldots, n\} \to \{1, \ldots, n\} an induced map σ *:P nP n\sigma_*: P_n \to P_n. In a finitary algebraic theory, one has the same thing for arbitrary functions between finite sets, not just bijections. In other words, a finitary algebraic theory amounts to a non-symmetric operad PP together with, for each function θ:{1,,m}{1,,n}\theta: \{1, \ldots, m\} \to \{1, \ldots, n\} between finite sets, an induced map θ *:P mP n\theta_*: P_m \to P_n, satisfying suitable axioms.


The underlying symmetric operad for the convex monad is called the operad for convex algebras, and denote P\mathbf{P}. An algebra of P\mathbf{P} is called a convex algebra.

The space of nn-ary operations for this operad is P n\mathbf{P}_n, the space of probability distributions on {1,,n}\{1, \ldots, n\}. The composition of operations is supposed to be obvious, but we should probably write down a formula for it. The maps “θ *:P nP m\theta_*: \mathbf{P}_n \to \mathbf{P}_m” can be defined by pushforward of measures. An algebra for the algebraic theory of convex algebras is an algebra XX for the operad with the further property that the square

P m×X n 1×θ * P m×X m θ *×1 P n×X n X \begin{matrix} \mathbf{P}_m \times X^n &\stackrel{1 \times \theta^*}{\to} &\mathbf{P}_m \times X^m \\ \theta_*\times 1 \downarrow & &\downarrow \\ \mathbf{P}_n \times X^n &\to &X \end{matrix}

commutes for all θ:{1,,m}{1,,n}\theta: \{1, \ldots, m\} \to \{1, \ldots, n\}.

Note that P\mathbf{P} is naturally a topological operad, where the topology on P n\mathbf{P}_n is the usual topology on the (n1)(n-1)-simplex. In our applications, we focus on algebras of P\mathbf{P} in topological categories with finite products. We call these convex algebras in \mathcal{E}. Here are some examples:

Shannon entropy

Fadeev’s characterization

Rényi presents a nice characterization of Shannon entropy due to Fadeev here:

It is the first result mentioned in the paper. To translate into our framework, it helps to write a probability measure on the finite set {1,,n}\{1,\dots,n\} as an nn-tuple of numbers p i[0,1]p_i \in [0,1] with i=1 np i=1 \sum_{i = 1}^n p_i = 1.


(Fadeev) Suppose F:FinProb 0[0,) F : \Fin\Prob_0 \to [0,\infty) is a continuous functor obeying the magical property:

F(tp 1,(1t)p 1,p 2,,p n)=F(p 1,,p n)+p 1F(t,1t)F(t p_1, (1-t)p_1, p_2, \dots, p_n) = F(p_1, \dots, p_n) + p_1 F(t, 1-t)

for all t[0,1]t \in [0,1]. Then FF is a nonnegative multiple of the Shannon entropy H 1H_1.

An operadic formulation

One may reformulate Fadeev’s theorem in terms of operads. The details are written out here:

What follows is a sketch.

Let PP be an operad in a finite product category \mathcal{E}. We may speak of (strict) PP-algebras in Cat()Cat(\mathcal{E}), the category of internal categories in \mathcal{E}. There is a notion of lax map between PP-algebras in Cat()Cat(\mathcal{E}).

Briefly put: if XX and YY are PP-algebras, a lax map from XX to YY is a functor XYX \to Y together with, for each nn \in \mathbb{N}, a natural transformation

P n×X n P n×Y n X Y \begin{matrix} P_n \times X^n &\to &P_n \times Y^n \\ \downarrow &\Leftarrow &\downarrow \\ X &\to &Y \end{matrix}

satisfying one axiom involving composition in PP, one involving the unit of PP, and, if PP is symmetric, one involving the symmetric group action on PP.

Now take X=1X = 1. Let us call a lax map 1Y1 \to Y a lax point of YY. Batanin has called them “internal PP-algebras” in YY, and maybe we’ll end up preferring that name.


In what follows we apply this idea to the case where =Top\mathcal{E} = Top and P\mathbf{P} is the symmetric, topological operad in which P n\mathbf{P}_n is the space of probability distributions on {1,,n}\{1, \ldots, n\} (otherwise known as the (n1)(n - 1)-simplex).


The lax points of the P\mathbf{P}-algebra (,+,0)(\mathbb{R}, +, 0) are precisely the scalar multiples of Shannon entropy.

Sketch of proof - Write out the axioms in the definition of lax map and unwind them in this particular case. One concerns composition, and this becomes the most substantial of the Fadeev axioms on Shannon entropy: the so-called ‘magical property’ mentioned above. One concerns units, and is redundant. One concerns symmetry. Then because we’re entirely in TopTop, you’ve got one concerning continuity. So we’ve got all the Fadeev axioms except for normalization, and the result follows from (Rényi’s statement of) Fadeev’s theorem.

There’s a very similar characterization of Shannon extropy, or rather the scalar (real) powers of Shannon extropy. There we simply use the P\mathbf{P}-algebra ((0,),,1)((0, \infty), \cdot, 1), which is isomorphic to (,+,0)(\mathbb{R}, +, 0) by taking exponentials.

We might also want to get rid of the “scalar multiples/powers” part and hit it on the nose. This seems more feasible with extropy rather than entropy, since there’s no choice of base. Still, I don’t see a nice way to do it.

Another operadic formulation

This section is in a more informal style…

A math puzzle

We begin with a sadly familiar problem:

Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $P$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $Q_i$. How surprising will your choice be, on average?

Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.

This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.

Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability pp is defined to be ln(p)- \ln(p). Then, if there are a bunch of possible outcomes distributed according to some probability distribution PP, the ‘expected’ surprise or Shannon entropy is:

(1)S(P)= ip iln(p i) S(P) = - \sum_i p_i \ln(p_i)

where p ip_i is the probability of the iith outcome. This is a weighted average where each surprise is weighted by its probability of happening.

So, we really do have a well-defined math puzzle, and it’s not even very hard.

Glomming together probabilities

But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution PP on the set of restaurants. Then you choose a meal according to some probability distribution Q iQ_i. If there are nn restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call

P(Q 1,,Q n) P \circ (Q_1, \dots, Q_n )

A bit more formally:

Suppose PP is a probability distribution on the set {1,,n}\{1,\dots, n\} and Q iQ_i are probability distributions on finite sets X iX_i, where i=1,,ni = 1, \dots, n. Suppose the probability distribution PP assigns a probability p ip_i to each element i{1,,n}i \in \{1,\dots, n\}, and suppose the distribution Q iQ_i assigns a probability q ijq_{i j} to each element jX ij \in X_i.

Then we can glom together the probability distributions Q iQ_i using PP and get a new probability distribution P(Q 1,,Q n)P \circ (Q_1, \dots, Q_n) on the disjoint union of all the sets X iX_i. The resulting probability for each element (i,j)(i,j) in the disjoint union i=1 nX i\sqcup_{i = 1}^n X_i is just p iq ijp_i q_{i j}.

These probabilities sum to 1 as they should:

i,jp iq ij= i(p i jq ij)= ip i=1 \sum_{i, j} p_i q_{i j} = \sum_i \left( p_i \sum_j q_{i j} \right) = \sum_i p_i = 1

Note: what jj ranges over depends on ii! But it’s all kosher.

Glomming together numbers

There’s something even simpler than glomming together probability measures: we can glom together numbers!

Suppose we have a probability measure PP on the set {1,,n}\{1, \dots, n\} and for each element iSi \in S we have a number x ix_i. Then we can define a new number P(x 1,,x n)P(x_1, \dots, x_n) by

P(x 1,,x n)= i=1 np ix i P(x_1, \dots, x_n) = \sum_{i = 1}^n p_i x_i

This is just the ‘weighted average’ of the numbers x ix_i. We have already seen a weighted average in the definition of Shannon entropy, equation (2).

Glomming together entropies

Now we’re ready to answer the math puzzle:

(2)S(P(Q 1,,Q n))=S(P)+P(S(Q 1),,S(Q n)) S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

On the left we’re glomming together a bunch of probability distributions using PP and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using PP. But there’s also an extra term: the entropy of PP!

In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely P(S(Q 1),,S(Q n))P(S(Q_1), \ldots, S(Q_n)). It’ll be that plus the expected surprise of your choice of restaurant, namely S(P)S(P).

You might not have expected such a simple formula. But it’s easy to prove:

S(P(Q 1,,Q n)) = i,jp iq ijln(p iq ij) = i,jp iq ijln(p i) i,jp iq ijln(q ij) = ip iln(p i) ip iS(Q i) = S(P)+P(S(Q 1),,S(Q n)) \begin{aligned} S(P\circ(Q_1, \ldots, Q_n)) &=& -\sum_{i, j} p_i q_{i j} \ln(p_i q_{i j}) \\ &=& - \sum_{i, j} p_i q_{i j} \ln(p_i) - \sum_{i, j} p_i q_{i j} \ln(q_{i j}) \\ &=& - \sum_{i} p_i \ln(p_i) - \sum_{i} p_i S(Q_i) \\ &=& S(P) + P(S(Q_1), \dots, S(Q_n)) \end{aligned}

Fadeev’s theorem

The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! We can restate Fadeev’s theorem (mentioned above) as follows:


(Fadeev) Suppose FF is a function sending probability measures on finite sets to real numbers. Suppose that:

  1. FF is invariant under bijections.

  2. FF is continuous.

  3. F(P(Q 1,,Q n))=F(P)+P(F(Q 1),,F(Q n))F(P\circ(Q_1, \ldots, Q_n)) = F(P) + P(F(Q_1), \ldots, F(Q_n)).

Then FF is a real multiple of Shannon entropy.

In item 1 we’re using the fact that a bijection f:XXf: X \to X' between finite sets together with a probability distribution on XX gives a probability distribution on XX'; we want these to have the same entropy. In item 2 we’re using the standard topology on n\mathbb{R}^n to put a topology on the set of probability distributions on any nn-element set. For a proof of this theorem, see the beginning of Rényi’s 1961 paper.

What’s going on?

While this equation is cute:

S(P(Q 1,,Q n))=S(P)+P(S(Q 1),,S(Q n)) S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution PP can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map SS from probability distributions to numbers. So, if you’re algebraically inclined, you would want SS to be a ‘homomorphism’, obeying a law like this:

S(P(Q 1,,Q n))=P(S(Q 1),,S(Q n))(sorry,nottrue) S(P\circ(Q_1, \ldots, Q_n)) = P(S(Q_1), \ldots, S(Q_n)) \qquad (sorry, \; not \; true)

We see laws of this sort all over math. But the true law has an extra term. What’s going on?

The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.

Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.

For starters, consider another law that has an ‘extra term’:

ddx(fg)=(ddxf)g+f(ddxg) \frac{d} {d x} (f g) = (\frac{d}{d x} f )g + f (\frac{d}{d x} g)

In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.

Let’s see how.

Glomming together partition functions

There’s something more fundamental than the Shannon entropy of a probability distribution: namely, it’s partition function. At least that’s how physicists think. Let me explain.

Suppose PP is a probability measure on a finite set XX. Let p ip_i be the probability of the element iXi \in X. We can always write

p i=e H i p_i = e^{-H_i}


H:X(,] H : X \to (-\infty, \infty]

is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy ++\infty to account for zero probabilities.

But physicists always go further and introduce a parameter β(0,)\beta \in (0,\infty) which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make p ip_i into a function of β\beta:

p i(β)=e βH i p_i(\beta) = e^{-\beta H_i}

or if you prefer, simply

p i(β)=p i β p_i(\beta) = p_i^\beta

When β=1\beta = 1, these numbers are just the probabilities p ip_i. But when β1\beta \ne 1, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:

Z(P)= iXe βH i Z(P) = \sum_{i \in X} e^{-\beta H_i}

But I won’t do that just now: I’ll let P(β)P(\beta) be the measure on the set XX that assigns the measure p i(β)p_i(\beta) to the point iXi \in X.

Now everything is a function of β\beta! But everything reduces to something familar at β=1\beta = 1, so we can stretch our old notation without breaking it. We now have functions p ip_i sending numbers β(0,)\beta \in (0,\infty) to numbers, and a function PP sending numbers β\beta to measures on XX. These reduce to our original p ip_i and PP at β=1\beta = 1. The partition function is also a function of β\beta, and this equals 1 at β=1\beta = 1.

But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose PP is a probability distribution on the set {1,,n}\{1,\dots, n\} and Q iQ_i are probability distributions on finite sets X iX_i, where i=1,,ni = 1, \dots, n. Then

(3)Z(P(Q 1,,Q n))=P(Z(Q 1),,Z(Q n)) Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

So, in fancy jargon, ZZ is an operad algebra homomorphism!

But I need to explain what this equation means. First of all, everything is a function of β\beta. Second, while PP and Q iQ_i are only probability measures at β=1\beta = 1, they’re perfectly fine measures at other values of β\beta, so all our ‘glomming’ formulas still make sense.

Let’s check to make sure we know what’s going on. An expression like P(Q 1,,Q n)P \circ (Q_1, \dots, Q_n) could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers β(0,)\beta \in (0,\infty) to measures on that set. So, we can extend PP and Q iQ_i this way and then ‘glom’ them to get a measure-valued function of β\beta. On the other hand, we can glom them and then extend them. Luckily, the results agree.

Let’s see why. Suppose PP assigns the point iXi \in X the probability p ip_i. When we extend this to a function of β\beta we get p i βp_i^\beta. Similarly, suppose Q iQ_i assigns the point jX ij \in X_i the probability q ijq_{ i j}. When we extend this to a function of β\beta we get q ij βq_{i j}^\beta. When we glom these, the measure of the point (i,j)X i(i,j) \in \sqcup X_i will be this function of β\beta:

p i βq ij βp_i^\beta q_{i j}^\beta

But this equals

(p iq ij) β (p_i q_{i j})^\beta

which is the result of glomming and then extending.

The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,

Z(P(Q 1,,Q n))= i,jp i βq ij β= ip i βZ(Q i)=P(Z(Q 1),,Z(Q n)) Z(P \circ (Q_1, \dots, Q_n)) = \sum_{i, j} p_i^\beta q_{i j}^\beta = \sum_i p_i^\beta Z(Q_i) = P (Z(Q_1), \dots, Z(Q_n))

From partition function to entropy

How is the Shannon entropy of a probability distribution related to its partition function? Simple:

(4)S(P)=ddβZ(P)| β=1 S(P) = - \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1}

Here I must emphasize that I’m only talking about the Shannon entropy S(P)S(P) of the original probability distribution, ‘at β=1\beta = 1’. Physicists extend S(P)S(P) to a function of β\beta, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating SS and ZZ a bit more complicated, so let’s not.

Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.

We start with the rule for glomming partition functions:

Z(P(Q 1,,Q n))=P(Z(Q 1),,Z(Q n)) Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

We differentiate with respect to β\beta and use the product rule, taking advantage of the fact that P(Z(Q 1),,Z(Q n))P (Z(Q_1), \dots, Z(Q_n)) is linear in PP but also linear in each of the functions Z(Q i)Z(Q_i):

ddβZ(P(Q 1,,Q n))=dPdβ(Z(Q 1),,Z(Q n))+P(ddβZ(Q 1),,ddβZ(Q n)) \frac{d}{d \beta} Z(P \circ (Q_1, \dots, Q_n)) = \frac{d P}{d \beta} (Z(Q_1), \dots, Z(Q_n)) + P (\frac{d}{d \beta} Z(Q_1), \dots, \frac{d}{d \beta} Z(Q_n))

Now set β=1\beta = 1. We can use equation (4) and the fact that all our partition functions equal 1 at β=1\beta = 1:

S(P(Q 1,,Q n))=dPdβ(1,,1)| β=1P(S(Q 1),,S(Q n)) - S(P \circ (Q_1, \dots, Q_n)) = \left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} - P(S(Q_1), \dots, S(Q_n))

Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that SS is like a ‘derivation’. We just to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:

dPdβ(1,,1)| β=1= idp i(β)dβ| β=1 \left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i \left. \frac{d p_i(\beta)}{d \beta} \right|_{\beta = 1}

and we have

dp i(β)dβ=ddβe βH i=H ie βH i \frac{d p_i(\beta)} {d\beta} = \frac{d}{d\beta} e^{-\beta H_i} = -H_i e^{-\beta H_i}


dp i(β)dβ| β=1=H ie H i=ln(p i)p i \left. \frac{d p_i(\beta)} {d\beta} \right|_{\beta = 1} = - H_i e^{-H_i} = \ln(p_i) p_i


dPdβ(1,,1)| β=1= ip iln(p i)=S(P) \left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)

So, we’ve got it!

S(P(Q 1,,Q n))=S(P)+P(S(Q 1),,S(Q n)) S(P \circ (Q_1, \dots, Q_n)) = S(P) + P(S(Q_1), \dots, S(Q_n))

The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!

But before we go further, let’s check rule (4). For starters,

ddβZ(P)| β=1=ddβ ie βH i| β=1 \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \left. \frac{d}{d \beta} \sum_i e^{-\beta H_i} \right|_{\beta = 1}

But we’ve just seen that

ddβe βH i| β=1=p iln(p i) \left. \frac{d}{d\beta} e^{-\beta H_i} \right|_{\beta = 1} = p_i \ln(p_i)


ddβZ(P)| β=1= ip iln(p i)=S(P) \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)

The upshot

As we’ve seen already, there is deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!

Let FinProb 0\Fin\Prob_0 be the core of FinProb\Fin\Prob, as defined above. Then

Z:FinProb 0R Z : FinProb_0 \to R

can be thought of as a functor from FinProb 0Fin\Prob_0 to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, ZZ assigns to each object PFinProb 0P \in Fin\Prob_0 its partition function Z(P)Z(P)… but we can recover PP up to isomorphism from Z(P)Z(P).

Now, FinProb 0\Fin\Prob_0 is an algebra of a certain operad P\mathbf{P} whose nn-ary operations are the probability measures on the set {1,,n}\{1, \dots, n\}. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a P\mathbf{P}-algebra. Furthermore, ZZ becomes a homomorphism of P\mathbf{P}-algebras. This last statement mostly just says that

Z(P(Q 1,,Q n))=P(Z(Q 1),,Z(Q n)) Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

But then we can take ZZ, differentiate it with respect to β\beta, and set β=1\beta = 1. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the ConvConv-algebra FinProb 0\Fin\Prob_0. Concretely, we’ve seen this mostly says that

S(P(Q 1,,Q n))=S(P)+P(S(Q 1),,S(Q n)) S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

But there should also be an abstract-nonsense theory of derivations of operad algebras. (This was discussed this a bit back in week299, but only a little). So, an interesting question presents itself:

How does the ‘derivation’ way of thinking about the law S(P(Q 1,,Q n))=S(P\circ(Q_1, \ldots, Q_n)) = S(P)+P(S(Q 1),,S(Q n))S(P) + P(S(Q_1), \ldots, S(Q_n)) relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?

If we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of q q -derivative of minus the logarithm of the partition function. The qq-derivative is a kind of ‘twisted derivation’, but a bunch of the same calculations may work in some modified form.

Power means

There is a categorical characterization of the power means (of order 1\ge 1), similar in spirit to some of the other results on this page. It follows from the characterization in

The real half-line as a convex algebra

Write +=[0,)\mathbb{R}_+ = [0, \infty). For each t0t \ge 0, there is a convex algebra structure on the set +\mathbb{R}_+, given by taking the power mean of order tt. The action

M t:P n× + n + M_t: \mathbf{P}_n \times \mathbb{R}_+^n \to \mathbb{R}_+

is given by M t(p,x)=( ip ix i t) 1/tM_t(p, x) = (\sum_i p_i x_i^t)^{1/t} if 0<t<0 \lt t \lt \infty, or x i p i\prod x_i^{p_i} if t=0t = 0, or max i:p i>0x i\max_{i: p_i \gt 0} x_i if t=t = \infty.

Now put the order \ge on +\mathbb{R}_+ to make it into a category. Since M t(p,)M_t(p, -) is increasing (for each tt and pp), power mean of any order t0t \geq 0 makes +\mathbb{R}_+ into a convex algebra in CatCat.

Now put the additive structure (+,0)(+, 0) on +\mathbb{R}_+, to make it into a monoidal category. For t1t \geq 1 we have

M t(p,x)+M t(p,y)M t(p,x+y) M_t(p, x) + M_t(p, y) \geq M_t(p, x + y)

and M t(p,0)=0M_t(p, 0) = 0, so M t(p,)M_t(p, -) is a lax monoidal functor. So power mean of any order t1t \geq 1 makes +\mathbb{R}_+ into a convex algebra in MonCat \Mon\Cat_\ell, the category of monoidal categories and lax monoidal functors.

This structure is, moreover, homogeneous:

M t(p,cx)=cM t(p,x) M_t(p, c x) = c M_t(p, x)

whenever c +c \in \mathbb{R}_+ and (p,x)P n× + n(p, x) \in \mathbf{P}_n \times \mathbb{R}_+^n.

The characterization


The convex algebra structures on +MonCat \mathbb{R}_+ \in \Mon\Cat_\ell satisfying the homogeneity axiom are precisely the power means M tM_t (1t1 \leq t \leq \infty).

Sketch proof - We show that any homogeneous convex algebra structure on +\mathbb{R}_+ satisfies the hypotheses of the characterization theorem in arXiv:1103.2574, and is therefore one of the power means. This is elementary manipulation. The axiom called “functoriality” in that paper is exactly the commutativity of the square drawn above. “Consistency” is the unit axiom for an operad action. “Multiplicativity” follows from the composition axiom for an operad action, together with homogeneity.

I wouldn’t call this result a rephrasing of the result in my paper; it’s probably rather weaker (i.e. easier to prove), because the all-important multiplicativity axiom involves only a very restricted kind of operadic composition.

Remarks on homogeneity

The proposition would be nicer if the homogeneity axiom could be either dropped or turned into something more categorical. I don’t know what to do about this.

One possibility is to observe that the monoid ( +,,1)(\mathbb{R}_+, \cdot, 1) acts on the monoidal category ( +,,+,0)(\mathbb{R}_+, \geq, +, 0), and that homogeneity amounts to preservation of that action. So we could think about convex algebras in +MonCat \mathbb{R}_{+}- \Mon\Cat_\ell, the category of monoidal categories equipped with an action of the multiplicative monoid +\mathbb{R}_+. That should work, but seems clumsy.

Another possibility is to use the following strange kind of monoidal functor. Let’s say that a lax monoidal functor F:ABF: A \to B is homogeneous if for all aAa \in A and nn \in \mathbb{N}, the coherence map

F(a) nF(a n) F(a)^{\otimes n} \to F(a^{\otimes n})

is invertible. The composite of homogeneous functors is again homogeneous. Let MonCat h\Mon\Cat_h be the category of monoidal categories and homogeneous lax functors. Then the convex algebra structures on +MonCat h\mathbb{R}_+ \in \Mon\Cat_h are exactly the power means of order t[1,]t \in [1, \infty]. (Homogeneity of this kind only tells us that M(p,cx)=cM(p,x)M(p, c x) = cM(p, x) when cc is a natural number, but we can deduce it for all positive reals.) That’s a reasonably clean statement, but the concept of homogeneous lax functor seems strange, and was invented only for this purpose. I’ve never seen it before.

I don’t know whether the homogeneity axiom is simply redundant. I’ve tried to prove that it is, and I’ve tried to prove that it isn’t, without success.

Pairing measures with functions

This is Tom, still.

To study quadratic forms, you really need to study bilinear forms. Similarly, I wonder whether here, it would help to look not just at expressions such as ip i α\sum_i p_i^\alpha, but also at the associated expressions ip ix i α1\sum_i p_i x_i^{\alpha - 1}. (You see such expressions in the definition of power mean; I have in the back of my mind the kind of thing in arXiv:1103.2574, and closely related things in my work on diversity.)

In any case, here’s a theorem in the same spirit as some of those above.

Let FinMeasFun\Fin\Meas\Fun be the topological category in which:

Since the base space is a mere finite set, μ\mu and ϕ\phi are actually of the same type. The fancy terminology hides this, and maybe we’re embarrassed to admit it – we’re probably dreaming of higher things. Nevertheless, we’ll take advantage of it later.

FinMeasFun\Fin\Meas\Fun is a topological rig category, in an obvious way. So too is 0\mathbb{R}_0, the topological rig of all real numbers regarded as a rig category with no maps other than identities. (The 00 subscript is supposed to mean “the only maps are isos”, as in what John wrote.)

A rig functor

FinMeasFun 0 \Fin\Meas\Fun \to \mathbb{R}_0

is determined by its effect on objects, since 0\mathbb{R}_0 is categorically discrete. It’s the same thing as a homomorphism from Π 0(FinMeasFun)\Pi_0(\Fin\Meas\Fun), the rig of connected-components of FinMeasFun\Fin\Meas\Fun, to \mathbb{R}. For each t0t \geq 0, there’s a continuous rig functor J t:FinMeasFun 0J_t: \Fin\Meas\Fun \to \mathbb{R}_0 given by

J t(S,μ,ϕ)= sSμ sϕ(s) t. J_t(S, \mu, \phi) = \sum_{s \in S} \mu_s \phi(s)^t.

(Or more fancily: J t(S,μ,ϕ)= Sϕ tdμJ_t(S, \mu, \phi) = \int_S \phi^t d\mu. In the case t=0t = 0 and ϕ(s)=0\phi(s) = 0, continuity forces us to put 0 0=10^0 = 1.)


The continuous rig functors FinMeasFun 0\Fin\Meas\Fun \to \mathbb{R}_0 are precisely the functors J tJ_t (t0t \geq 0).

Sketch proof - Let JJ be a continuous rig functor. As usual, JJ is determined by what it does to singletons. Multiplicativity and continuity imply that J({},μ,ϕ)=μ uϕ() tJ(\{\star\}, \mu, \phi) = \mu_\star^u \phi(\star)^t for some constants uu and tt. Then functoriality (i.e. constancy of JJ on connected-components) implies that u=1u = 1.

Remark: I know 2×2=42 \times 2 = 4 versions of this proposition, including the one just stated. The first choice is whether to “allow zero” or not. By “allowing zero” I mean that in the definition of FinMeasFun\Fin\Meas\Fun, μ s0\mu_s \geq 0 and ϕ(s)0\phi(s) \geq 0 (as above); by “not allowing zero” I mean that we demand μ s>0\mu_s \gt 0 and ϕ(s)>0\phi(s) \gt 0.

The second choice is whether to work internally to TopTop or PosetPoset. (The point here is that in order to exclude the non-obvious solutions to the Cauchy functional equation ψ(a+b)=ψ(a)+ψ(b)\psi(a + b) = \psi(a) + \psi(b), you need some extra hypothesis such as continuity or order-preservation.) The order on the objects of FinMeasFun\Fin\Meas\Fun is: (S,μ,ϕ)(S,μ,ϕ)(S, \mu, \phi) \leq (S', \mu', \phi') iff S=SS = S', μ sμ s\mu_s \leq \mu'_s for all sSs \in S, and ϕ(s)ϕ(s)\phi(s) \leq \phi'(s) for all sSs \in S. The order on the maps of FinMeasFun\Fin\Meas\Fun is: fgf \leq g iff dom(f)dom(g)dom(f) \leq dom(g), cod(f)cod(g)cod(f) \leq cod(g), and f=gf = g as functions.

Here are the four versions:

  1. Allowing 0 and working in TopTop. This is the proposition above, where the continuous rig functors are J tJ_t for t0t \geq 0.

  2. Not allowing 0 and working in TopTop. Then we get J tJ_t for all tt \in \mathbb{R}.

  3. Allowing 0 and working in PosetPoset. Then we get J tJ_t for t0t \geq 0, together with another rig functor J εJ_\varepsilon. This new functor J εJ_\varepsilon is almost the same as J 0J_0, but whereas in the definition of J 0J_0 we take 0 0=10^0 = 1, in the definition of J εJ_\varepsilon we take 0 0=00^0 = 0. Explicitly, J ε(S,μ,ϕ)=lim t0+J t(S,μ,ϕ)=μ(suppϕ)J_\varepsilon(S, \mu, \phi) = \lim_{t \to 0+} J_t(S, \mu, \phi) = \mu(supp \phi) (whereas J 0(S,μ,ϕ)=μ(S)J_0(S, \mu, \phi) = \mu(S)).

  4. Not allowing 0 and working in PosetPoset. Then we get J tJ_t for t0t \geq 0 (but not J εJ_\varepsilon; or rather, we don’t see the difference between J εJ_\varepsilon and J 0J_0).

: I’ll continue to use the same setup as in the Proposition (allowing 00 and working in TopTop), for no other reason than that it’s the one I first thought of.

Now I’ll take advantage of the fact that a measure on a finite set is the same thing as a function: if (S,μ)FinMeas(S, \mu) \in \Fin\Meas then (S,μ,μ)FinMeasFun(S, \mu, \mu) \in \Fin\Meas\Fun. (I’m going to use FinMeas\Fin\Meas in a slightly different sense to John, dropping his assumption that when (S,μ)(S, \mu) is an object of FinMeas\Fin\Meas, the only subset of SS to have measure zero is the empty set. So an object of my FinMeas\Fin\Meas is just a finite set equipped with a measure.)

Digression: Actually, what I’m really taking advantage of is the fact that any measure on a finite set gives rise to a function. This is, if you like, its Radon-Nikodym derivative with respect to counting measure. So counting measure is playing a silent role as the canonical reference measure on a finite set.

I’ll invent some terminology. A map f:(S,μ)(S,μ)f: (S, \mu) \to (S', \mu') in FinMeas\Fin\Meas is an equivalence if there exist subsets ASA \subseteq S and ASA' \subseteq S' such that f 1A=Af^{-1} A' = A, the induced map AAA \to A' is a bijection, and μ(SA)=0=μ(SA)\mu(S \setminus A) = 0 = \mu'(S' \setminus A'). I think this is a standard concept in measure theory, but I forget the standard name; maybe just “measure-isomorphism”?


The maps (S,μ,μ)(S,μ,μ)(S, \mu, \mu) \to (S', \mu', \mu') in FinMeasFun\Fin\Meas\Fun are the equivalences (S,μ)(S,μ)(S, \mu) \to (S', \mu').

Proof - straightforward manipulation of the definitions.

Let FinMeas 1\Fin\Meas_1 be the category whose objects are those of FinMeas\Fin\Meas and whose maps are the equivalences. Thus, FinMeas 1\Fin\Meas_1 embeds as a full subcategory of FinMeasFun\Fin\Meas\Fun:

FinMeas 1FinMeasFun \Fin\Meas_1 \hookrightarrow \Fin\Meas\Fun

via (S,μ)(S,μ,μ)(S, \mu) \mapsto (S, \mu, \mu). Also, FinMeas 1\Fin\Meas_1 is closed under the rig category operations on FinMeasFun\Fin\Meas\Fun, so it’s a sub-rig-category.

John’s remarks on convergence in FinMeas\Fin\Meas suggest that perhaps he really wants to be using FinMeas 1\Fin\Meas_1 instead of FinMeas 0\Fin\Meas_0…? In any case, FinMeas 0\Fin\Meas_0 is a sub-rig-category of FinMeas 1\Fin\Meas_1, and it’s easy to see that the continuous rig functors FinMeas 1 0\Fin\Meas_1 \to \mathbb{R}_0 are exactly those of the form G αG_\alpha (α\alpha \in \mathbb{R}), where

G α(S,μ)= sSμ s α. G_\alpha (S, \mu) = \sum_{s \in S} \mu_s^\alpha.

This is basically Corollary 1 above.

We now have a commutative triangle, or we would if I knew how to draw one. The composite

FinMeas 1FinMeasFunJ α1 0 \Fin\Meas_1 \hookrightarrow \Fin\Meas\Fun \stackrel{J_{\alpha - 1}}{\to} \mathbb{R}_0

is equal to

FinMeas 1G α 0, \Fin\Meas_1 \stackrel{G_\alpha}{\to} \mathbb{R}_0,

for every α\alpha \in \mathbb{R}. Algebraically, this just says that

sμ sμ s α1= sμ s α. \sum_s \mu_s \mu_s^{\alpha - 1} = \sum_s \mu_s^\alpha.

But it seems to me to be noteworthy, because we naturally see the appearance of α1\alpha - 1 (as opposed to α\alpha), which is a recurrent player in this drama.

The basic inequalities of information theory

For random variables XX, YY, let us write (X,Y)(X,Y) for the ‘joint random variable’ whose distribution is the joint distribution of XX and YY. Its entropy is the joint entropy H((X,Y))H((X,Y)), which we also write as H(X,Y)H(X,Y).

The basic inequalities of information theory are the following inequalities for Shannon entropy:

Some of these inequalities can be generalized to relative Shannon entropy, Rényi entropy, or (combining both generalizations) relative Rényi entropy, also known as ‘Rényi divergence’. For a fairly thorough introduction to inequalities obeyed by Rényi divergence, see:

• T. A. L. van Even, When Data Compression and Statistics Disagree: Two Frequentist Challenges For the Minimum Description Length Principle, Chap. 6: Rényi Divergence, Ph.D. thesis, Leiden University, 2010.

Here however we will start by studying these inequalities for the Shannon entropy of probability measures on finite sets.

The first three types of inequalities can be seen as degenerate cases of the fourth, when taking one of the variables being deterministic.

We have already seen how to formulate the first two properties categorically: for the Shannon entropy functor

H:FinProb( +,) op, H : \Fin\Prob \rightarrow \left(\mathbb{R}_+,\leq\right)^{\mathrm{op}} \:,

the non-negativity of entropy is clear by the definition of the target category, while the non-negativity of conditional entropy follows by considering HH on the projection map (X,Y)Y(X,Y)\rightarrow Y.

But what about the other two types of basic inequalities?

Coslice categories and mutual information

Here’s how to do it for the non-negativity of mutual information. First of all, we need to make sense of the concept of joint distribution. Regarding random variables XX and YY as objects of FinProb\Fin\Prob does not yet specify a joint distribution; in general, there will be many joint ddistributions (X,Y)(X,Y) compatible with the given distributions of XX and YY. What we need is to assume the existence of a sample space on which XX and YY live.

So consider an object ΩFinProb\Omega\in \Fin\Prob; we think of it as the sample space conditional to which all other random variables will be considered: XX is a random variable on Ω\Omega if it comes equipped with a morphism ΩX\Omega\rightarrow X. In other words, a random variable on Ω\Omega is an object in the coslice category Ω/FinProb\Omega/\Fin\Prob.


If XX and YY are random variables over Ω\Omega, then their joint distribution (X,Y)(X,Y) is equal to the categorical product X× ΩYX\times_{\Omega}Y in Ω/FinProb\Omega/\Fin\Prob.

Proof - Straightforward.(?)

As a side remark, this also shows the following:


The category FinProb\Fin\Prob does not have products.

Proof - If FinProb\Fin\Prob would have products, then any coslice category Ω/FinProb\Omega/\Fin\Prob would inherit these products. However since the joint distribution of two random variables with given distribution does in general depend on the sample space on which these random variables live, the product in Ω/FinProb\Omega/\Fin\Prob does depend on Ω\Omega.

We resume the main line of thought. By composing the functor HH with the natural functor Ω/FinProbFinProb\Omega/\Fin\Prob\rightarrow\Fin\Prob, we can also consider Shannon entropy as a functor on Ω/FinProb\Omega/\Fin\Prob, which is, by abuse of notation, also denoted by HH.


Regarding the categorical product as the monoidal structure on Ω/FinProb\Omega/\Fin\Prob, Shannon entropy becomes a lax monoidal functor

H:Ω/FinProb( +,) op H:\Omega/\Fin\Prob\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

Proof - The monoidal unit of FinProb\Fin\Prob is given by the terminal object, which is any single-outcome random variable *\ast, while in 0\mathbb{R}_{\geq 0} the monoidal unit is 00. Thus the compatibility of HH with the monoidal unit states that

0H(*) 0\geq H(\ast)

which is equivalent to the condition that the one-outcome random variable *\ast has vanishing entropy, H(*)=0H(\ast)=0.

The compatibility of HH with the monoidal product requires

H(X)+H(Y)H(X× ΩY) H(X) + H(Y) \geq H(X\times_{\Omega}Y)

which is precisely the non-negativity of mutual information since H(X× ΩY)=H(X,Y)H(X\times_{\Omega}Y)=H(X,Y).

Slice categories and conditional entropy

We have just seen that coslice categories under objects in FinProb\Fin\Prob are relevant for defining joint distributions of random variables. Surprisingly, also slice categories over objects in FinProb\Fin\Prob play an important role. Let us fix an object ZFinProbZ\in \Fin\Prob and consider the slice category FinProb/Z\Fin\Prob/Z. An object (X,f X)FinProb/Z(X,f_X)\in \Fin\Prob/Z is a finite probability space XX together with a measure-preserving function f X:XZf_X:X\rightarrow Z. We may think of f Xf_X as defining a partitioning of the domain of XX, or as a fibration with base ZZ.

Given an object (X,f X)FinProb/Z(X,f_X)\in\Fin\Prob/Z, we define the conditional entropy by

H(X/Z)H(X)H(Z) H(X/Z)\equiv H(X)-H(Z)

(Since this is not exactly the usual definition of conditional entropy, we use a slightly different notation.) This quantity measures the information content of XX conditional to ZZ: given that we know the value of ZZ, how much more information do we expect to get when learning the value of XX?

This reduces to the usual concept of conditional Shannon entropy as follows: given random variables AA and BB, the conditional entropy is given by

H(A|B)=H(A,B)H(B) H(A|B)=H(A,B)-H(B)

which is of the above form if we take Z=BZ=B, X=(A,B)X=(A,B) and f:XZf:X\rightarrow Z to be the projection onto the second component.

On the category FinProb/Z\Fin\Prob/Z, there is a monoidal product defined as follows: for objects (X,f X)(X,f_X) and (Y,f Y)(Y,f_Y) the product ‘over ZZ’ is given by the pullback

X× ZY={(x,y)|f X(x)=f Y(y)} X\times^Z Y = \left\{(x,y)\:|\:f_X(x)=f_Y(y)\right\}

together with the measure defined as

P X× ZY(x,y)=P X(x)P Y(y)P Z(f X(x)) P_{X\times^Z Y}(x,y) = \frac{P_X(x)\cdot P_Y(y)}{P_Z(f_X(x))}

which just states that XX and YY are taken to be ‘fiberwise independent’ random variables. (When the denominator of this expression vanishes, its numerator also vanishes automatically, and we take the whole expression to be 00.)

It should be clear how this monoidal product operates on morphisms. Its unit is given by the object (Z,id Z)(Z,\mathrm{id}_Z).


With these definitions, the conditional Shannon entropy H(/Z)H(\cdot/Z) is a monoidal functor

H(/Z):FinProb/Z( +,) op H(\cdot/Z):\Fin\Prob/Z\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

Proof - Functoriality is clear by the previous results. That H(/Z)H(\cdot/Z) preserves the monoidal unit means that

H(Z/Z)=H(Z)H(Z)=0. H(Z/Z) = H(Z) - H(Z) = 0\:.

So it remains to check that H(/Z)H(\cdot/Z) preserves the monoidal product. This follows from a straightforward calculation:

H(X× ZY/Z)= zZ (x,y)X× ZYP X(x)P Y(y)P Z(z)logP X(x)P Y(y)P Z(z)H(Z)= xXP X(x)logP X(x) yYP Y(y)logP Y(y)2H(Z)=H(X/Z)+H(Y/Z) H(X\times^Z Y/Z)=-\sum_{z\in Z}\sum_{(x,y)\in X\times^Z Y}\frac{P_X(x)P_Y(y)}{P_Z(z)}\log\frac{P_X(x)P_Y(y)}{P_Z(z)}-H(Z)=-\sum_{x\in X}P_X(x)\log P_X(x)-\sum_{y\in Y}P_Y(y)\log P_Y(y) - 2H(Z)=H(X/Z)+H(Y/Z)

Bislice categories and conditional mutual information

Now it is time to combine these observations in order to find a categorical formulation of the non-negativity of conditional mutual information, which subsumes all the other basic information inequalities as degenerate cases.

So let us fix a sample space ΩFinProb\Omega\in\Fin\Prob and a ‘cosample space’ ZFinProbZ\in\Fin\Prob. We also need to fix a measure-preserving function g:ΩZg:\Omega\rightarrow Z which anchors ZZ as a random variable on Ω\Omega. We now consider the category of probability spaces ‘between’ Ω\Omega and ZZ. More precisely, we take Ω/FinProb/Z\Omega/\Fin\Prob/Z to be the category where an object is a triple (X,i X,f X)(X,i_X,f_X) consisting of a probability space XFinProbX\in\Fin\Prob and measure-preserving functions i X:ΩXi_X:\Omega\rightarrow X and f X:XZf_X:X\rightarrow Z such that f Xi X=gf_X\circ i_X=g. As morphisms of Ω/FinProb/Z\Omega/\Fin\Prob/Z, we take those measure-preserving functions which are compatible with the given data i i_{\cdot} and f f_{\cdot}.

This is not a comma category, is it? Should it still be called ‘bislice category’?


If (X,i X,f X),(Y,i Y,f Y)Ω/FinProb/Z(X,i_X,f_X), (Y,i_Y,f_Y)\in\Omega/\Fin\Prob/Z, then the categorical product, denoted by X× Ω ZYX\times^Z_\Omega Y, is the joint distribution of XX and YY on the sample space Ω\Omega.

Proof - Straightforward.(?)

As above, this gives the same consequence:


The category Ω/FinProb\Omega/\Fin\Prob does not have products.

The object (Z,g,id Z)(Z,g,\mathrm{id}_Z) is the unit of this monoidal structure.


With respect to this monoidal structure, the conditional entropy H(/Z)H(\cdot/Z) is a lax monoidal functor

H(/Z):Ω/FinProb/Z( +,) op H(\cdot/Z):\Omega/\Fin\Prob/Z\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

The compatibility with the monoidal product is equivalent to the non-negativity of conditional mutual information.

Proof - Again functoriality is clear. For the unit, being lax monoidal means that

H(Z/Z)0 H(Z/Z)\leq 0

which holds true by H(Z/Z)=H(Z)H(Z)=0H(Z/Z)=H(Z)-H(Z)=0. The compatibility of the monoidal products states that

(5)H(X× Ω ZY/Z)H(X/Z)+H(Y/Z) H(X\times^Z_\Omega Y/Z)\leq H(X/Z)+H(Y/Z)

That this inequality holds for Shannon entropy follows from second the assertion, stating that it is equivalent to the non-negativity of conditional mutual information. This equivalence still remains to be proven. To see this, note that XX is isomorphic to the joint variable (X,Z)(X,Z), and similarly YY is isomorphic to (Y,Z)(Y,Z). With this, (5) reads

H(X,Y,Z|Z)H(X,Y|Z)+H(Y,Z|Z) H(X,Y,Z|Z)\leq H(X,Y|Z)+H(Y,Z|Z)

Upon expanding the conditional entropies in terms of joint entropies, this becomes

(6)H(X,Y,Z)+H(Z)H(X,Z)+H(Y,Z),i.e.I(X:Y|Z)0. H(X,Y,Z)+H(Z)\leq H(X,Z)+H(Y,Z) ,\quad i.e. \quad I(X:Y|Z) \ge 0\;.

Therefore (5) follows from the non-negativity of conditional mutual information.

The converse direction works similarly: given a triple of random variables XX, YY, ZZ on Ω\Omega, we can consider (X,Z)(X,Z) and (Y,Z)(Y,Z), which are also random variables on Ω\Omega and come equipped with the projection to ZZ, as objects in Ω/FinProb/Z\Omega/\Fin\Prob/Z. Then again, (5) is equivalent to (6).

Rényi entropy

The logarithm of ZZ has the physical meaning of free energy. The logarithm of ZZ divided by 1α1 - \alpha is called the Rényi entropy:

H α(S,μ)=11αln iSμ i α H_\alpha(S,\mu) = \frac{1}{1-\alpha} \ln \sum_{i \in S} {\mu_i}^\alpha

This can be seen as roughly a qq-derivative of the free energy; see

for details.

For α>0\alpha \gt 0 with α1\alpha \ne 1, H αH_\alpha defines a continuous functor

H α:FinMeas 0 H_\alpha : \Fin\Meas_0 \to \mathbb{R}

The case α=1\alpha = 1 can probably be taken care of via a limit; for now let’s assume so.

Thanks to

Z(X×Y)=Z(X)×Z(Y) Z(X \times Y) = Z(X) \times Z(Y)

the Rényi entropy obeys

H α(X×Y)=H α(X)+H α(Y) H_\alpha(X \times Y) = H_\alpha(X) + H_\alpha(Y)

For probability measures the Rényi entropy also obeys

H α(X)0 H_\alpha(X) \ge 0

Thus, if we treat [0,)[0,\infty) as a topological subcategory of \mathbb{R}, the Rényi entropy restricts to a continuous functor

H α:FinProb 0[0,) H_\alpha : \Fin\Prob_0 \to [0,\infty)

where we treat probability measure spaces where the only set of measure zero is the empty set as forming a topological subcategory FinProb\Fin\Prob of FinMeas\Fin\Meas.


H α(p 1,,p k)H α(p 1,,p k1+p k) H_\alpha(p_1, \dots, p_k) \ge H_\alpha(p_1, \dots, p_{k-1} + p_k)

where we write a probability measure on a set of the form {1,,}\{1,\dots, \ell \} as a list of nonnegative numbers that sum to 1. Thus we have:


For α1\alpha \ge 1, the Rényi entropy H α:FinProb 0[0,) H_\alpha : \Fin\Prob_0 \to [0,\infty) extends to a continuous functor

H α:FinProb[0,) H_\alpha : \Fin\Prob \to [0,\infty)

Relative Rényi entropy

This is Tobias. I just want to write things down in little detail to at least have a record. So this section needs much more care. In particular, if we decide to only consider cases where are probalities are strictly positive, then one can apply continuity arguments.

Let p=(p 1,,p n)p=(p_1,\ldots,p_n) and r=(r 1,,r n)r=(r_1,\ldots,r_n) be two probability distributions on an nn-element set. Then the relative Rényi entropy of order qq is given by

D q(p||r)=1q1log ip i qr i q1 D_q(p||r) = \frac{1}{q-1} \log \sum_i \frac{p_i^q}{r_i^{q-1}}

Here, we set D q(p||r)D_q(p||r) to be \infty whenever there is an index ii with r i=0r_i=0, but p i>0p_i\gt 0. In terms of power means,

D q(p||r)=logM q1(p,pr) D_q(p||r) = \log M_{q-1} \left( p,\frac{p}{r} \right)

where pr(i)=p(i)r(i)\frac{p}{r}(i)=\frac{p(i)}{r(i)} is the Radon-Nikodym derivative (we set 0/0=00/0=0 here). D q(q||r)D_q(q||r) is non-negative for q[0,]q\in[0,\infty].

ToDo: see whether the notation is good and add some more introductory bla-bla

What are good properties of Rényi entropy? It is clearly invariant under permutations of the of the indices when these are applied to pp and qq at the same time. Furthermore, it is also invariant under adding additional randomness: we can add additional randomness to pp and qq by choosing some λ[0,1]\lambda\in[0,1] and considering the (n+1)(n+1)-outcome distributions p=(λp 1,(1λ)p 1,p 2,,p n)p'=(\lambda p_1,(1-\lambda)p_1,p_2,\ldots,p_n) and r=(λr 1,(1λ)r 1,r 2,,r n)r'=(\lambda r_1,(1-\lambda)r_1,r_2,\ldots,r_n). Then,

D q(p||r)=1q1log(λp 1 qr 1 q1+(1λ)p 1 qr 1 q1+ i=2 np i qr i q1)=D q(p||r) D_q(p'||r') = \frac{1}{q-1} \log \left( \lambda\frac{p_1^q}{r_1^{q-1}}+(1-\lambda)\frac{p_1^q}{r_1^{q-1}} + \sum_{i=2}^n \frac{p_i^q}{r_i^{q-1}} \right) = D_q(p||r)

So under which kind of operation does D qD_q change at all? One thing we have not considered so far is coarse-graining, i.e. the identification of some outcomes to a single one. Again applying this to both distributions at the same time, we obtain e.g. the distributions p^=(p 1+p 2,p 3,,p n)\hat{p}=(p_1+p_2,p_3,\ldots,p_n) and r^=(r 1+r 2,r 3,,r n)\hat{r}=(r_1+r_2,r_3,\ldots,r_n).


In this notation,

D q(p^||r^)D q(p||r) D_q(\hat{p}||\hat{r})\leq D_q(p||r)

for all q[0,]q\in[0,\infty].

Proof - For q1q\geq 1, this inequality states that

(p 1+p 2)(p 1+p 2r 1+r 2) q1p 1(p 1r 1 q1)+p 2(p 2r 2 q1) (p_1 + p_2) \left( \frac{p_1+p_2}{r_1+r_2} \right)^{q-1} \leq p_1 \left( \frac{p_1}{r_1}^{q-1} \right) + p_2 \left( \frac{p_2}{r_2}^{q-1} \right)

which is equivalent to

M 1(w,x) q1=M q1(w,x) q1 M_{-1}(w,x)^{q-1} = M_{q-1}(w,x)^{q-1}

with w=(p 1p 1+p 2,p 2p 1+p 2)w=\left(\frac{p_1}{p_1+p_2},\frac{p_2}{p_1+p_2}\right) and x i=p i/r ix_i=p_i/r_i. This in turn always holds since the power means M tM_t are increasing functions of the parameter tt. Allowing the second argument of the power means to also take on the value \infty (which means that, due to the particular form of the arguments, that the corresponding weight will not be 00), this also still holds in the degenerate case where some r ir_i vanishes.

(Under construction…)