10. Entropy and Information
The concept of entropy is one of the more difficult concepts in physics. Historically, it emerged as a consequence of the second law of thermodynamics, as in (3.16). Later, Boltzmann gave a general definition for it in terms of the number of ways of distributing a given number of particles, as in (7.11). But a clearer understanding of entropy is related to its interpretation in terms of information. We will briefly discuss this point of view here.
10.1 Information
We want to give a quantification of the idea of information. This is originally due to C. Shannon.
Consider a random variable $x$ with a probability distribution $p(x)$. For simplicity, initially, we take $x$ to be a discrete random variable, with possible values $x_i$, $i = 1, 2, \ldots, n$, with $p_i$ being the probability for $x_i$. We may think of an experiment for which the outcomes are the $x_i$, with the probability for $x_i$ being $p_i$ in a trial run of the experiment. We want to define a concept of information $I(p)$ associated with $x$. The key idea is to note that if an outcome has probability 1, the occurrence of that outcome carries no information, since it was clear that it would definitely happen. If an outcome has a probability less than 1, then its occurrence can carry information. If the probability is very small, and the outcome occurs, it is unlikely to be a random event and so it makes sense to consider it as carrying information. Based on this intuitive idea, we expect information to be a function of the probability. By convention, we choose $I(p)$ to be positive. Further, from what we said, $I(1) = 0$. Now consider two completely independent events, with probabilities $p_1$ and $p_2$. The probability for both to occur is $p_1 p_2$, and this will carry information $I(p_1 p_2)$. Since the occurrence of each event separately carries information $I(p_1)$ and $I(p_2)$, we expect
$$I(p_1 p_2) = I(p_1) + I(p_2) \qquad (10.1)$$
Finally, if the probability of some event is changed by a small amount, we expect the information for the event to be changed by a small amount as well. This means that we would like $I(p)$ to be a continuous and differentiable function of $p$. Thus we need a continuous and differentiable function $I(p)$ obeying the requirements $I(p) \geq 0$, $I(1) = 0$ and $I(p_1 p_2) = I(p_1) + I(p_2)$. The only function which obeys these conditions is given by
$$I(p) = -\log p \qquad (10.2)$$
This is basically Shannon's definition of information. The base used for this logarithm is not specified by what has been said so far; it is a matter of choosing a unit for information. Conventionally, for systems using binary codes, we use logarithms to base 2, while for most statistical systems we use natural logarithms.
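As a small numerical illustration (not part of the text; the function name is ours), one can check that $I(p) = -\log_2 p$ satisfies the additivity requirement (10.1):

```python
import math

def info_bits(p):
    """Shannon information of an outcome with probability p, in bits."""
    return -math.log2(p)

p1, p2 = 0.5, 0.125
# For two independent events, the information of the joint outcome
# equals the sum of the individual informations: I(p1*p2) = I(p1) + I(p2).
print(info_bits(p1 * p2))             # 4.0 bits
print(info_bits(p1) + info_bits(p2))  # 4.0 bits
```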
Consider now the outcome $x_i$ which has a probability $p_i$. The amount of information for $x_i$ is $I(p_i) = -\log p_i$. Suppose now that we do $N$ trials of the experiment, where $N$ is very large. Then the number of times $x_i$ will be realized is $N p_i$. Thus it makes sense to define an average or expectation value for information as
$$S = \sum_i p_i\, I(p_i) = -\sum_i p_i \log p_i \qquad (10.3)$$
This expected value for information is Shannon's definition of entropy.
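As a quick numerical illustration of (10.3) (our own example, not from the text), the entropy of a biased coin is the expected information per trial:

```python
import math

def entropy_bits(probs):
    """Shannon entropy S = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.9, 0.1]))  # about 0.469 bits per trial
print(entropy_bits([0.5, 0.5]))  # exactly 1 bit per trial
```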
This definition of entropy requires some clarification. It stands for the amount of information which can be coded using the available outcomes. This can be made clearer by considering an example, say, of $N$ tosses of a coin, or equivalently a stream of 0s and 1s, $N$ units long. Each outcome is then a string of $N$ 0s and 1s; we will refer to this as a word, since we may think of it as the binary coding of a word. We take these to be ordered, so that permutations of 0s and 1s in a given word are counted as distinct. The total number of possibilities is $2^N$, and each occurs with equal probability $2^{-N}$. Thus the amount of information in realizing a particular outcome is $\log 2^N = N \log 2$, or $N$ bits if we use logarithms to base 2. The entropy of the distribution is
$$S = \sum_{\text{words}} 2^{-N}\, N \log 2 = N \log 2 \qquad (10.4)$$
Now consider a situation where we specify or fix some of the words. For example, let us say that all words start with 1. Then the probability of any word among this restricted set is now $2^{-(N-1)}$, and the entropy becomes $(N-1)\log 2$. Thus entropy has decreased because we have made a choice; we have used some information. Thus entropy is the amount of information which can potentially be coded using a probability distribution.
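The counting in (10.4), and the effect of fixing the first digit, can be checked directly for small $N$; this snippet is only an illustration of the argument above:

```python
import math
from itertools import product

N = 10
words = list(product("01", repeat=N))           # all 2^N equally likely words
p = 1.0 / len(words)
S_full = -sum(p * math.log2(p) for _ in words)  # = N bits

restricted = [w for w in words if w[0] == "1"]  # fix the first digit to 1
q = 1.0 / len(restricted)
S_restricted = -sum(q * math.log2(q) for _ in restricted)  # = N - 1 bits

print(S_full, S_restricted)  # 10.0  9.0
```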
This definition of entropy is essentially the same as Boltzmann's definition, or what we have used in arriving at various distribution functions for particles. For this, consider the formula for entropy which we used in chapter 7, equation (7.13),
$$S = k\,\Bigl( N \log N - \sum_i n_i \log n_i \Bigr) \qquad (10.5)$$
Here $n_i$ is the occupation number for the state $i$. In the limit of large $N$, $n_i/N$ may be interpreted as the probability for the state $i$. Using the symbol $p_i$ for this, we can rewrite (10.5) as
$$S = -k\, N \sum_i p_i \log p_i \qquad (10.6)$$
showing that the entropy as defined by Boltzmann in statistical physics is the same as Shannon's information-theoretic definition. (In thermodynamics, we measure entropy in joules per kelvin; we can regard Boltzmann's constant $k$ as a unit conversion factor. Thus $S/k$ from thermodynamics is the quantity to be compared to the Shannon definition.) The states in thermodynamics are specified by the values of positions and momenta for the particles, so the outcomes are continuous. A continuum generalization of (10.6) is then
$$S = -k \int d\mu\; p \log p \qquad (10.7)$$
where $d\mu$ is an appropriate measure, like the phase space measure in (7.55).
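One can also check numerically that the combinatorial counting of ways to distribute $N$ particles over states, which underlies (10.5), reproduces the Shannon form for large $N$; here we compare $\log\bigl(N!/\prod_i n_i!\bigr)$ with $-N\sum_i p_i \log p_i$ (a rough sketch, with an arbitrary three-state example):

```python
import math

N = 100000
p = [0.5, 0.3, 0.2]
n = [int(round(N * pi)) for pi in p]   # occupation numbers n_i = N p_i

# log W = log( N! / (n_1! n_2! n_3!) ), computed via the log-gamma function
logW = math.lgamma(N + 1) - sum(math.lgamma(ni + 1) for ni in n)

shannon = -N * sum(pi * math.log(pi) for pi in p)
print(logW, shannon)  # both about 1.03e5; they agree to leading order in N
```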
Normally, we maximize entropy subject to certain averages, such as the average energy and the average number of particles, being specified. This means that the observer has, by observations, determined these values, and hence the number of available states is restricted. Only those states which are compatible with the given average energy and number of particles are allowed. This constrains the probability distribution which maximizes the entropy. If we specify more averages, then the maximal entropy is lower. The argument is similar to what was given after (10.4), but we can see this more directly as well. Let $\mathcal{O}_\alpha$, $\alpha = 1, 2, \ldots, n$, be a set of observables. The maximization of entropy subject to specifying the average values of these is given by maximizing
$$\tilde{S} = -k \int d\mu\; p \log p \;-\; k \sum_{\alpha=1}^{n} \lambda_\alpha \Bigl[ \int d\mu\; p\, \mathcal{O}_\alpha - \langle \mathcal{O}_\alpha \rangle \Bigr] \;-\; k\, \lambda_0 \Bigl[ \int d\mu\; p - 1 \Bigr] \qquad (10.8)$$
Here $\langle \mathcal{O}_\alpha \rangle$ are the average values which have been specified, and $\lambda_0$, $\lambda_\alpha$ are Lagrange multipliers; $\lambda_0$ enforces the normalization of $p$. Variation with respect to the $\lambda_\alpha$ gives the required constraints
$$\int d\mu\; p\, \mathcal{O}_\alpha = \langle \mathcal{O}_\alpha \rangle \qquad (10.9)$$
The distribution $p$ which extremizes (10.8) is given by
$$p_n = \frac{1}{Z_n} \exp\Bigl( -\sum_{\alpha=1}^{n} \lambda_\alpha \mathcal{O}_\alpha \Bigr), \qquad Z_n = \int d\mu\; \exp\Bigl( -\sum_{\alpha=1}^{n} \lambda_\alpha \mathcal{O}_\alpha \Bigr) \qquad (10.10)$$
where the normalization condition has been used to fix $\lambda_0$, and the subscript $n$ indicates that $n$ averages have been specified. The corresponding entropy is given by
$$S_n = k \log Z_n + k \sum_{\alpha=1}^{n} \lambda_\alpha \langle \mathcal{O}_\alpha \rangle \qquad (10.11)$$
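As a concrete check of (10.10) and (10.11) in a discrete setting (sums replacing the integrals; the single observable, the value of $\lambda$, and setting $k = 1$ are our own choices for illustration):

```python
import math

x = [1, 2, 3, 4, 5, 6]   # outcomes; take the observable O to be x itself
lam = 0.7                # an arbitrary Lagrange multiplier (k set to 1)

Z = sum(math.exp(-lam * xi) for xi in x)
p = [math.exp(-lam * xi) / Z for xi in x]      # the distribution (10.10)
O_avg = sum(pi * xi for pi, xi in zip(p, x))   # the average it reproduces

S_direct = -sum(pi * math.log(pi) for pi in p)  # -sum_i p_i log p_i
S_formula = math.log(Z) + lam * O_avg           # log Z + lambda <O>, as in (10.11)
print(S_direct, S_formula)                      # equal, about 1.30
```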
Now let us consider specifying one more average, say $\langle \mathcal{O}_{n+1} \rangle$. In this case, we have
$$p_{n+1} = \frac{1}{Z_{n+1}} \exp\Bigl( -\sum_{\alpha=1}^{n+1} \lambda_\alpha \mathcal{O}_\alpha \Bigr), \qquad Z_{n+1} = \int d\mu\; \exp\Bigl( -\sum_{\alpha=1}^{n+1} \lambda_\alpha \mathcal{O}_\alpha \Bigr) \qquad (10.12)$$
This distribution reverts to $p_n$, and likewise $S_{n+1}$ to $S_n$, if we set $\lambda_{n+1}$ to zero.
If we calculate $\langle \mathcal{O}_{n+1} \rangle$ using the distribution $p_n$ and the answer comes out to be the specified value, then there is no information in going to $p_{n+1}$. Thus it is only if the distribution which realizes the specified value $\langle \mathcal{O}_{n+1} \rangle$ differs from $p_n$ that there is additional information in the choice of $p_{n+1}$. This happens if $\lambda_{n+1} \neq 0$. It is therefore useful to consider how $S$ changes with the $\lambda_\alpha$. We find, directly from (10.11),
$$\frac{\partial S}{\partial \lambda_\beta} = k \sum_\alpha \lambda_\alpha \frac{\partial \langle \mathcal{O}_\alpha \rangle}{\partial \lambda_\beta} = -k \sum_\alpha \lambda_\alpha M_{\alpha\beta}, \qquad M_{\alpha\beta} = \langle \mathcal{O}_\alpha \mathcal{O}_\beta \rangle - \langle \mathcal{O}_\alpha \rangle \langle \mathcal{O}_\beta \rangle \qquad (10.13)$$
The change of the maximal entropy with the $\lambda_\alpha$ is thus given by the set of correlation functions $M_{\alpha\beta}$. We can easily see that this matrix is positive semi-definite. For this we use the Schwarz inequality
$$\Bigl( \int d\mu\; f^* f \Bigr) \Bigl( \int d\mu\; g^* g \Bigr) \;\geq\; \Bigl| \int d\mu\; f^* g \Bigr|^2 \qquad (10.14)$$
For any set of complex numbers $c_\alpha$, we take $f = \sum_\alpha c_\alpha \mathcal{O}_\alpha \sqrt{p}$ and $g = \sqrt{p}$. We then see from (10.14) that $\sum_{\alpha\beta} c^*_\alpha c_\beta M_{\alpha\beta} \geq 0$. (The integrals in (10.14) should be finite for the inequality to make sense. We will assume that at least one of the $\mathcal{O}_\alpha$, say the one corresponding to the Hamiltonian, is always included, so that the averages are finite.) Equation (10.13) then tells us that $S$ decreases as more and more of the $\lambda_\alpha$ pick up nonzero values. Thus we must interpret entropy as a measure of the information in the states which are still freely available for coding after the constraints imposed by the averages of the observables already measured. This also means that the increase of entropy in a system left to itself shows that the system tends towards the probability distribution which is completely random except for the specified values of the conserved quantities. The averages of all other observables tend towards the values given by such a random distribution. In such a state, the observer has minimum knowledge about observables other than the conserved quantities.
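We can also check numerically that specifying an average lowers the maximal entropy, as (10.13) implies. Below we compare the entropy of the unconstrained (uniform) distribution on six outcomes with the maximum-entropy distribution once the mean is fixed; the bisection for the Lagrange multiplier is a simple sketch of our own, with $k$ set to 1:

```python
import math

x = [1, 2, 3, 4, 5, 6]

def maxent_given_mean(target):
    # p_i ~ exp(-lam * x_i); find lam by bisection so that <x> = target
    lo, hi = -5.0, 5.0
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        w = [math.exp(-lam * xi) for xi in x]
        mean = sum(xi * wi for xi, wi in zip(x, w)) / sum(w)
        if mean > target:
            lo = lam
        else:
            hi = lam
    w = [math.exp(-lam * xi) for xi in x]
    Z = sum(w)
    return [wi / Z for wi in w]

def entropy(p):
    # Shannon entropy in natural-log units
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([1.0 / 6] * 6))           # no average specified: log 6, about 1.79
print(entropy(maxent_given_mean(4.5)))  # mean specified: about 1.61, i.e. lower
```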
10.2 Maxwell's demon
There is a very interesting thought experiment due to Maxwell which is perhaps best phrased as a potential violation of the second law of thermodynamics. The resolution of this problem highlights the role of entropy as information.
We consider a gas of particles in equilibrium in a box at some temperature $T_0$. The velocities of the particles follow the Maxwell distribution (7.44),
$$f(v) = \left( \frac{m}{2\pi k T_0} \right)^{3/2} \exp\left( -\frac{m v^2}{2 k T_0} \right) \qquad (10.15)$$
The mean square speed,
$$\langle v^2 \rangle = \frac{3 k T_0}{m} \qquad (10.16)$$
may be used as a measure of the temperature. Now we consider a partition which divides the box into two parts. Further, we consider a small gate in the partition which can be opened and closed; this requires a very small amount of energy, which can be taken to be zero in an idealized limit. Further, we assume there is a creature ("Maxwell's demon") with a large brain capacity to store a lot of data sitting next to this box. Now the demon is supposed to do the following. Every time he sees a molecule of high speed coming towards the gate from the left side, he opens the gate and lets it through to the right side. If he sees a slowly moving molecule coming towards the gate from the right side, he opens the gate and lets the molecule through to the left side. If he sees a slow moving molecule on the left side, or a fast moving one on the right side, he does nothing. After a while, the mean square speed on the left will be smaller than what it was originally, showing that the temperature on the left side is lower than $T_0$. Correspondingly, the mean square speed on the right side is higher, and so the temperature there is larger than $T_0$. Effectively, heat is being transferred from a cold body (the left side of the box) to a hot body (the right side of the box). Since the demon imparts essentially zero energy to the system via opening and closing the gate, this transfer is done with no other change, thus seemingly providing a violation of the second law. This is the problem.
We can rephrase this in terms of entropy change. To illustrate the point, it is sufficient to consider the simple case of $N$ particles in a volume $V$ forming an ideal gas, with the demon separating them into two groups of $N/2$ particles in volume $V/2$ each. If the initial temperature is $T_0$ and the final temperatures are $T_1$ and $T_2$, then the conservation of energy gives $T_1 + T_2 = 2 T_0$. Further, we can use the Sackur-Tetrode formula (7.32) for the entropies,
$$S_{\rm initial} = N k \left[ \log\!\left( \frac{V}{N} \left( \frac{m k T_0}{2\pi\hbar^2} \right)^{3/2} \right) + \frac{5}{2} \right], \qquad
S_{\rm final} = \frac{N k}{2} \left[ \log\!\left( \frac{V}{N} \left( \frac{m k T_1}{2\pi\hbar^2} \right)^{3/2} \right) + \log\!\left( \frac{V}{N} \left( \frac{m k T_2}{2\pi\hbar^2} \right)^{3/2} \right) + 5 \right] \qquad (10.17)$$
(Note that the volume per particle, $(V/2)/(N/2) = V/N$, is unchanged by the partition.) The change in entropy when the demon separates the molecules is then obtained as
$$\Delta S = S_{\rm final} - S_{\rm initial} = \frac{3}{4}\, N k \log\!\left( \frac{T_1 T_2}{T_0^2} \right) \qquad (10.18)$$
Since $T_1 + T_2 = 2 T_0$ with $T_1 \neq T_2$ implies
$$T_1 T_2 = T_0^2 - \frac{(T_1 - T_2)^2}{4} < T_0^2 \qquad (10.19)$$
we see that $\Delta S < 0$. Thus the process ends up decreasing the entropy, in contradiction to the second law.
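A quick numerical check of (10.18); the particle number and temperatures below are arbitrary choices of ours satisfying $T_1 + T_2 = 2T_0$:

```python
import math

k = 1.380649e-23       # Boltzmann constant, J/K
N = 6.022e23           # number of molecules (one mole, as an example)
T0 = 300.0             # initial temperature, K
T1, T2 = 280.0, 320.0  # final temperatures; note T1 + T2 = 2*T0

dS = 0.75 * N * k * math.log(T1 * T2 / T0**2)
print(dS)  # about -0.03 J/K: negative, as claimed
```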
The resolution of this problem is in the fact that the demon must have information about the speeds of the molecules to be able to let the fast ones to the right side and the slow ones to the left side. This means that using the Sackur-Tetrode formula for the entropy of the gas in the initial state is not right. We are starting off with a state of entropy (of gas and demon combined) which is less than what is given by (10.17), once we include the information carried by (or obtained via the observation of velocities by) the demon, since the specification of more observables decreases the entropy, as we have seen in the last section. While it is difficult to estimate this entropy quantitatively, the expectation is that with this smaller value to begin with, $\Delta S$ will come out to be positive and there will be no contradiction with the second law. Of course, this means that we are considering a generalization of the second law, namely that the entropy of an isolated system does not decrease over time, provided all sources of entropy in the information-theoretic sense are taken into account.
10.3 Entropy and gravity
There is something deep about the concept of entropy which is related to gravity. This is far from being well understood, and is a topic of ongoing research, but there are good reasons to think that the Einstein field equations for gravity may actually emerge as some sort of entropy maximization condition. A point of contact between gravity and entropy is for spacetimes with a horizon, an example being a black hole. In an ultimate theory of quantum gravity, spacetimes with a horizon may turn out to be nothing special, but for now, they may be the only window to the connection between entropy and gravity. To see something of the connection, we look at a spherical solution to the Einstein equations, corresponding to the metric around a point mass (or a spherical distribution of mass). This is the Schwarzschild metric given as
$$ds^2 = \left( 1 - \frac{2GM}{c^2 r} \right) c^2 dt^2 - \left( 1 - \frac{2GM}{c^2 r} \right)^{-1} dr^2 - r^2 \left( d\theta^2 + \sin^2\theta\, d\varphi^2 \right) \qquad (10.20)$$
We are writing this in the usual spherical coordinates $(r, \theta, \varphi)$ for the spatial dimensions. $G$ is Newton's gravitational constant and $c$ is the speed of light in vacuum. We can immediately see that there are two singularities in this expression. The first is obviously at $r = 0$, similar to what occurs in Newton's theory for the gravitational potential, and the second is at $r = 2GM/c^2$. This second singularity is a two-sphere since it occurs at finite radius. Now, one can show that $r = 0$ is a genuine singularity of the theory, in the sense that it cannot be removed by a coordinate transformation. The singularity at $r = 2GM/c^2$ is a coordinate singularity. It is like the singularity at $\theta = 0, \pi$ when we use spherical coordinates, and can be eliminated by choosing a different set of coordinates. Nevertheless, the radius $r = 2GM/c^2$ does have an important role. The propagation of light, in the ray optics approximation, is described by $ds^2 = 0$. As a result, one can see that nothing can escape from $r < 2GM/c^2$ to larger values of the radius, to be detected by observers far away. An observer far away who is watching an object falling to the center will see the light coming from it being redshifted due to the $\bigl(1 - 2GM/c^2 r\bigr)$ factor, eventually being redshifted to zero frequency as it crosses $r = 2GM/c^2$; the object fades out. For this reason, and because it is not a real singularity, we say that the sphere at $r = 2GM/c^2$ is a horizon. Because nothing can escape from inside the horizon, the region inside is a black hole. The value $R_s = 2GM/c^2$ is called the Schwarzschild radius.
Are there examples of black holes in nature? The metric (10.20) can be used to describe the spacetime outside of a nearly spherical matter distribution such as a star or the Sun. For the Sun, with a mass of about $2 \times 10^{30}$ kg, the Schwarzschild radius is about 3 km. The form of the metric in (10.20) ceases to be valid once we pass inside the surface of the Sun, and so there is no horizon physically realized for the Sun (and for most stars). (Outside of the gravitating mass, one can use (10.20), which is how observable predictions of Einstein's theory, such as the precession of the perihelion of Mercury, are obtained.) But consider a star which is more massive than the Chandrasekhar and Tolman-Oppenheimer-Volkoff limits. If it is massive enough to contract gravitationally, overcoming even the quark degeneracy pressure, its radius can shrink below its Schwarzschild radius and we get a black hole. The belief is that there is such a black hole at the center of our galaxy, and of most other galaxies as well.
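The numbers quoted above are easy to check (the constants are standard values; this is just an arithmetic check):

```python
G = 6.674e-11    # m^3 kg^-1 s^-2
c = 2.998e8      # m/s
M_sun = 1.989e30 # kg

R_s = 2 * G * M_sun / c**2
print(R_s)       # about 2.95e3 m, i.e. roughly 3 km
```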
Returning to the physical properties of black holes, although classical theory tells us that nothing can escape a black hole, a most interesting effect is that black holes radiate. This is a quantum process. A full calculation of this process cannot be done without a quantum theory of gravity (which we do not yet have). So, while the fact that black holes must radiate can be argued in generality, the nature of the radiation can only be calculated in a semiclassical way. The result of such a semiclassical calculation is that irrespective of the nature of the matter which went into the formation of the black hole, the radiation which comes out is thermal, following the Planck spectrum, corresponding to a certain temperature
$$T = \frac{\hbar c^3}{8 \pi G M k} \qquad (10.21)$$
Although related processes were understood by many scientists, the general argument for radiation from black holes was due to Hawking and hence the radiation from any spacetime horizon and the corresponding temperature are referred to as the Hawking radiation and Hawking temperature, respectively.
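For a solar-mass black hole, (10.21) gives an extremely small temperature; a quick numerical evaluation of our own:

```python
import math

hbar = 1.055e-34  # J s
c = 2.998e8       # m/s
G = 6.674e-11     # m^3 kg^-1 s^-2
k = 1.381e-23     # J/K
M = 1.989e30      # kg (one solar mass)

T_H = hbar * c**3 / (8 * math.pi * G * M * k)
print(T_H)        # about 6e-8 K
```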
Because there is a temperature associated with a black hole, we can think of it as a thermodynamic system obeying
$$dU = T\, dS \qquad (10.22)$$
The internal energy can be taken as $U = M c^2$, following the Einstein mass-energy equivalence. We can then use (10.22) to calculate the entropy of a black hole as
$$S = \frac{k\, c^3}{4 G \hbar}\, A \qquad (10.23)$$
(This formula for the entropy is known as the Bekenstein-Hawking formula.) Here $A = 4\pi R_s^2$ is the area of the horizon, $R_s = 2GM/c^2$ being the Schwarzschild radius.
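Similarly, (10.23) gives an enormous entropy for a solar-mass black hole (again just a numerical check, with standard values of the constants):

```python
import math

hbar, c, G, k = 1.055e-34, 2.998e8, 6.674e-11, 1.381e-23
M = 1.989e30                 # kg (one solar mass)
R_s = 2 * G * M / c**2       # Schwarzschild radius
A = 4 * math.pi * R_s**2     # horizon area

S = k * c**3 * A / (4 * G * hbar)
print(S, S / k)              # S/k is about 1e77, an enormous number
```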
These results immediately bring up a number of puzzles.
A priori, there is nothing thermodynamic about the Schwarzschild metric or the radiation process. The radiation can be obtained from the quantized version of the Maxwell equations in the background spacetime (10.20). So how do thermodynamic concepts arise in this case?
One could envisage forming a black hole from a very ordered state of very low entropy. Yet once the black hole forms, the entropy is given by (10.23). There is nothing wrong with generating more entropy, but how did we lose the information coded into the low entropy state? Further, the radiation coming out is thermal and hence carries no information. So is there any way to understand what happened to it?
These questions can be sharpened further. First of all, we can see that the Schwarzschild black hole can evaporate away by Hawking radiation in a finite time. This is because the radiation follows the Planck spectrum and so we can use the Stefan-Boltzmann law (8.38) to calculate the rate of energy loss. Then from
(10.24)
we can obtain the evaporation time. Now, there is a problem with the radiation being thermal. Time-evolution in the quantum theory is by unitary transformations and these do not generate any entropy. So if we make a black hole from a very low entropy state and then it evaporates into thermal radiation which is a high entropy state, how is this compatible with unitary time-evolution? Do we need to modify quantum theory, or do we need to modify the theory of gravity?Usually, when we have nonzero entropy, we can understand that in terms of microscopic counting of states. Are the number of states of a black hole proportional to
? Is there a quantitative way to show this?
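As an aside, (10.24) can be integrated to estimate the evaporation time mentioned above. A rough numerical estimate, treating the horizon as a perfect blackbody of area $4\pi R_s^2$ at the temperature (10.21); graybody factors and the species content of the radiation, which modify the numerical coefficient, are ignored in this sketch:

```python
import math

hbar, c, G = 1.055e-34, 2.998e8, 6.674e-11
M = 1.989e30  # kg, one solar mass

# Integrating c^2 dM/dt = -sigma T^4 (4 pi R_s^2) with T from (10.21)
# gives dM/dt proportional to -1/M^2, and hence a finite lifetime
t_evap = 5120 * math.pi * G**2 * M**3 / (hbar * c**4)
print(t_evap, "s, about", t_evap / 3.15e7, "years")  # roughly 7e74 s, i.e. 2e67 years
```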
The entropy is proportional to the area of the horizon. Usually, entropy is extensive, scaling with the volume, and the number of states grows correspondingly (via factors such as $e^{S/k}$ with $S \propto V$). How can all the states needed for a system be realized in terms of a lower dimensional surface?
There are some tentative answers to some of these questions. Although seemingly there is a problem with unitary time-evolution, this may be because we cannot do a full calculation. The semiclassical approximation breaks down for very small black holes. So we cannot reliably calculate the late stages of black hole evaporation. Example calculations with black holes in lower dimensions can be done using string theory and this suggests that time-evolution is indeed unitary and that information is recovered in the correlations in the radiation which develop in later stages.
For most black hole solutions, there is no reliable counting of microstates which leads to the formula (10.23). But there are some supersymmetric black holes in string theory for which such a counting can be done using techniques special to string theory. For those cases, one does indeed get the formula (10.23). This suggests that string theory could provide a consistent quantum theory of black holes and, more generally, of spacetimes with horizons. It could also be that the formula (10.23) has such universality (as many things in thermodynamics do) that the microscopic theory may not matter, and that if we learn to do the counting of states correctly, any theory which has quantum gravity will lead to (10.23), with, perhaps, calculable additional corrections (which are subleading, i.e., less extensive than the area term).
The idea that a lower dimensional surface can encode enough information to reconstruct dynamics in a higher dimensional space is similar to what happens in a hologram. So perhaps to understand the entropy formula (10.23), one needs a holographic formulation of physical laws. Such a formulation is realized, at least for a restricted class of theories, in the so-called AdS/CFT correspondence (or holographic correspondence) and its later developments. The original conjecture for this is due to J. Maldacena and states that string theory on an anti-de Sitter (AdS) spacetime background in five dimensions (with an additional 5-sphere) is dual to the maximally supersymmetric Yang-Mills gauge theory (which is a conformal field theory (CFT)) on the boundary of the AdS space. One can, in principle, go back and forth, calculating quantities in one using the other. Although still a conjecture, this does seem to hold for all cases where calculations have been possible.
It is clear that this is far from a finished story. But from what has been said so far, there is good reason to believe that research over the next few years will discover some deep connection between gravity and entropy.