Information theory Past Paper

Information and Entropy

When we resolve disorder, we gain information.

Shannon Information: measure of surprise / uncertainty for an event $x$ with probability $P(x)$ (continuous, additive and symmetric),

$$h(x) = - \log_2 P(x) \quad [\text{bits}]$$

Entropy: measure of the average level of disorder / information for a random variable. Given $X = \{x_1, x_2, \dots, x_n\}$ with probability distribution $P(X)$,

$$H(X) = \sum_{i=1}^{n} P(x_i) \log_2 \frac{1}{P(x_i)} = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i).$$

Discussions,

For a binary random variable with $P(x_1) = p$, the binary entropy function is

$$H_2(p) \equiv H(X) = -p \log_2 p - (1-p) \log_2 (1-p).$$
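
As a quick numerical check, a minimal Python sketch (function names are my own) computing $H(X)$ for a discrete distribution and the binary entropy $H_2(p)$:

```python
import math

def entropy(probs):
    """H(X) in bits for a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """H_2(p) for a Bernoulli(p) random variable."""
    return entropy([p, 1 - p])

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(binary_entropy(0.5))                 # 1.0 bit: maximum uncertainty at p = 0.5
```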

Past papers

Noiseless channels

The Source Coding Theorem

$N$ i.i.d. random variables, each with entropy $H(X)$, can be compressed into slightly more than $N H(X)$ bits with negligible risk of information loss as $N$ tends to infinity. Conversely, compressing to fewer than $N H(X)$ bits makes information loss virtually certain.

Symbol codes $C$

Binary symbol code $C$ for an ensemble $X$ is a mapping $\mathcal{A}_X = \{x_1, x_2, \dots, x_n\} \rightarrow \{0, 1\}^+$. The extended code $C^+$ maps $\mathcal{A}_X^+ \rightarrow \{0, 1\}^+$,

$$c^+(x_1, x_2, \dots, x_N) = c(x_1)\, c(x_2) \dots c(x_N).$$

The expected (encoded character) length of a symbol code $C$ for an ensemble $X$ is,

$$L(C, X) = \sum_{x \in \mathcal{A}_X} P(x)\, l(x) = \sum_{i=1}^{n} P(x_i)\, l_i.$$

Symbol code source coding theorem for an ensemble $X$: there exists an encoding $C$ such that the expected encoded character length $L(C, X)$ satisfies,

$$H(X) \leq L(C, X) < H(X) + 1.$$

The minimal expected length is achieved only if the codeword lengths equal the Shannon information contents, $l_i = - \log_2 P(x_i)$.

Unique decodability (Prefix codes)

$$\forall x, y \in \mathcal{A}_X^+, \quad x \neq y \implies c^+(x) \neq c^+(y).$$

Any uniquely decodable code with codeword lengths $l_1, l_2, \dots, l_n$ over the binary alphabet $\{0, 1\}$ must satisfy the Kraft inequality,

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

For example, a prefix code (no codeword is a prefix of another) can be read off a binary tree whose leaves are the codewords:

    root
    |-- 0 : codeword 0
    `-- 1
        |-- 0 : codeword 10
        `-- 1
            |-- 0 : codeword 110
            `-- 1 : ...
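
A small sketch tying together the Shannon code lengths and the Kraft inequality above (the example distribution is assumed): with $l_i = \lceil -\log_2 P(x_i) \rceil$ the Kraft inequality holds and $H(X) \leq L(C, X) < H(X) + 1$.

```python
import math

P = [0.5, 0.25, 0.125, 0.125]                    # assumed source distribution
lengths = [math.ceil(-math.log2(p)) for p in P]  # Shannon code lengths l_i
H = -sum(p * math.log2(p) for p in P)            # entropy H(X)
L = sum(p * l for p, l in zip(P, lengths))       # expected length L(C, X)

assert sum(2 ** -l for l in lengths) <= 1        # Kraft inequality holds
assert H <= L < H + 1                            # symbol-code source coding bound
print(lengths, H, L)                             # [1, 2, 3, 3] 1.75 1.75 (dyadic case: L = H)
```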

Huffman coding

Build a binary tree from the leaves up to the root (a minimal sketch follows the list):

  1. Take the two least probable symbols in the alphabet. They will be given the longest codewords, which will have equal length, and differ only in the last digit.
  2. Combine these two symbols into a single symbol, and repeat.
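
A minimal heap-based sketch of this procedure (my own function names, not a library API):

```python
import heapq

def huffman_code(probabilities):
    """Build a Huffman code for a {symbol: probability} mapping.

    Repeatedly merge the two least probable nodes; the symbols in the two
    merged nodes get codewords extended by '0' and '1' respectively.
    """
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # least probable node
        p1, _, code1 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(P)
print(code)                                  # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(sum(P[s] * len(code[s]) for s in P))   # expected length 1.75 = H(X) for this P
```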

Limitations: it assumes a constant (known) source distribution, hence a fixed code; the up-to-one extra bit per symbol is problematic when $H(X) \approx 1$.

Stream codes

Designed for live data streams; the coding adapts as symbols arrive.

Arithmetic coding

Outputs a single floating-point number with high precision in the range $[0, 1)$, which represents the entire message.
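
A toy floating-point sketch of the interval-narrowing idea, under an assumed fixed symbol distribution (real implementations use integer arithmetic and emit bits incrementally to avoid precision limits):

```python
def build_intervals(probs):
    """Map each symbol to its sub-interval [low, high) of [0, 1)."""
    intervals, c = {}, 0.0
    for sym, p in probs.items():
        intervals[sym] = (c, c + p)
        c += p
    return intervals

def encode(message, probs):
    """Narrow [0, 1) once per symbol; any number in the final interval encodes the message."""
    intervals = build_intervals(probs)
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        lo_s, hi_s = intervals[sym]
        low, high = low + width * lo_s, low + width * hi_s
    return (low + high) / 2

def decode(value, probs, length):
    """Replay the narrowing, picking the sub-interval containing `value` at each step."""
    intervals = build_intervals(probs)
    low, high, out = 0.0, 1.0, []
    for _ in range(length):
        width = high - low
        for sym, (lo_s, hi_s) in intervals.items():
            if low + width * lo_s <= value < low + width * hi_s:
                out.append(sym)
                low, high = low + width * lo_s, low + width * hi_s
                break
    return "".join(out)

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
x = encode("abca", probs)
print(x, decode(x, probs, length=4))   # a number in [0.34375, 0.359375) -> 'abca'
```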

Lempel-Ziv coding

Given a string of symbols $str$, the Lempel-Ziv complexity $c(str)$ is the number of phrases produced when the string is parsed sequentially, closing a phrase as soon as it has not been seen earlier in the parsing, e.g. $str = \text{ATGTG}$ parses as $\text{A | T | G | TG}$, so $c(str) = 4$.

Normalized compression distance is a measure of the similarity between two strings $x$ and $y$ based on their Lempel-Ziv complexity,

$$\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
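
A sketch of one common parsing variant (LZ78-style phrases, which reproduces the example above) and the normalized compression distance built on it; the function names are my own:

```python
def lz_complexity(s):
    """Count phrases: scan left to right and close a phrase as soon as it is new."""
    phrases, current, count = set(), "", 0
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)   # count a trailing, already-seen phrase

def ncd(x, y):
    """Normalized compression distance based on Lempel-Ziv complexity."""
    cx, cy, cxy = lz_complexity(x), lz_complexity(y), lz_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(lz_complexity("ATGTG"))    # 4, matching A | T | G | TG
print(ncd("ATGTG", "ATGTG"))     # 0.75 -- crude for such short strings; NCD is
                                 # more meaningful for long sequences
```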

Noisy discrete channels

Error Correcting Codes (ECCs)

Repetition code $R_z$, where $z$ is the number of times each bit is repeated.

Block codes $(N, k)$, where $N > k$: each block of $k$ data bits is encoded into $N$ bits.

Hamming codes $(7, 4)$: detect and correct single-bit errors efficiently.

The 7-bit codeword $c$ for a 4-bit data word $d$ is defined by the generator matrix $G$,

$$c^{7 \times n} = G^{7 \times 4} d^{4 \times n} \mod 2 = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix} \mod 2.$$

The relationship is,

$$\begin{aligned} c_1 &= d_1 \oplus d_2 \oplus d_4 \\ c_2 &= d_1 \oplus d_3 \oplus d_4 \\ c_4 &= d_2 \oplus d_3 \oplus d_4 \end{aligned}$$

with the remaining positions carrying the data bits directly: $c_3 = d_1$, $c_5 = d_2$, $c_6 = d_3$, $c_7 = d_4$.

The parity check matrix $H$ is determined by the generator matrix $G$; a received codeword $c$ is valid if all bits of the syndrome $p$ are 0.

$$p^{3 \times n} = H^{3 \times 7} c^{7 \times n} \mod 2 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7 \end{bmatrix} \mod 2.$$

The relationship is,

$$\begin{aligned} p_1 &= c_1 \oplus c_3 \oplus c_5 \oplus c_7 = 2 (d_1 \oplus d_2 \oplus d_4) \equiv 0 \pmod 2 \\ p_2 &= c_2 \oplus c_3 \oplus c_6 \oplus c_7 = 2 (d_1 \oplus d_3 \oplus d_4) \equiv 0 \pmod 2 \\ p_3 &= c_4 \oplus c_5 \oplus c_6 \oplus c_7 = 2 (d_2 \oplus d_3 \oplus d_4) \equiv 0 \pmod 2 \end{aligned}$$

so an error-free codeword always gives a zero syndrome.
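
A compact numpy sketch of encoding and syndrome decoding with the matrices above (function names are my own); the syndrome, read as a binary number, points at the position of a single flipped bit:

```python
import numpy as np

# Generator and parity-check matrices of the (7,4) Hamming code as given above
# (parity bits at positions 1, 2, 4; data bits at positions 3, 5, 6, 7).
G = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def encode(d):
    """Encode a length-4 data word into a length-7 codeword (mod 2)."""
    return G @ np.asarray(d) % 2

def decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = np.asarray(c).copy()
    syndrome = H @ c % 2
    pos = int(syndrome @ [1, 2, 4])   # syndrome read as a bit position (1..7), 0 if clean
    if pos:
        c[pos - 1] ^= 1               # flip the erroneous bit back
    return c[[2, 4, 5, 6]]            # data bits sit at positions 3, 5, 6, 7

d = np.array([1, 0, 1, 1])
c = encode(d)
c[4] ^= 1                             # inject a single-bit error at position 5
print(decode(c), d)                   # both [1 0 1 1]
```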

Bayes' rule

Bayes' rule for conditional probability,

$$P(X|Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(Y|X)\, P(X)}{\sum_{x} P(Y|x)\, P(x)}$$

For conditional entropy, Bayes' rule states,

$$H(X|Y) = H(Y|X) + H(X) - H(Y).$$

Conditional entropy

measure of uncertainty in random variable $Y$ given the event $X = x$,

$$h(Y|X = x) = - \sum_{y} P(y|x) \log_2 P(y|x)$$

conditional entropy of $Y$ given $X$, i.e., the weighted average of $h(Y|X = x)$ over all values $x$ of $X$,

$$\begin{aligned} H(Y|X) = \sum_{x} P(x)\, h(Y|X = x) &= - \sum_{x} \sum_{y} P(x, y) \log_2 P(y|x) \\ &= - \sum_{x} \sum_{y} P(x, y) \log_2 \frac{P(x, y)}{P(x)} \end{aligned}$$

Discussions,

Joint entropy

measure of uncertainty in two random variables $X$ and $Y$ taken together.

$$\begin{aligned} H(X, Y) &= - \sum_{x} \sum_{y} P(x, y) \log_2 P(x, y) \\ &= - \sum_{x} \sum_{y} P(x, y) \log_2 \big( P(y|x)\, P(x) \big) \quad \text{by the chain rule} \\ &= - \sum_{x} \sum_{y} P(x, y) \log_2 P(y|x) - \sum_{x} \Big( \sum_{y} P(x, y) \Big) \log_2 P(x) \\ &= - \sum_{x} \sum_{y} P(x, y) \log_2 P(y|x) - \sum_{x} P(x) \log_2 P(x) \quad \text{by marginalization} \\ &= H(Y|X) + H(X) = H(X|Y) + H(Y) \end{aligned}$$

Symmetric property of joint entropy: $H(X, Y) = H(Y, X)$.

Chain rule for the joint distribution of multiple RVs,

$$\begin{aligned} P(X_1, X_2, \dots, X_n) &= P(X_1)\, P(X_2|X_1) \dots P(X_n|X_1, X_2, \dots, X_{n-1}) \\ &= \prod_{i=1}^{n} P(X_i|X_1, X_2, \dots, X_{i-1}) \end{aligned}$$

Joint entropy extended to multiple random variables $X_1, X_2, \dots, X_n$,

$$\begin{aligned} H(X_1, X_2, \dots, X_n) &= H(X_1) + H(X_2|X_1) + \dots + H(X_n|X_1, X_2, \dots, X_{n-1}) \\ &= \sum_{i=1}^{n} H(X_i|X_1, X_2, \dots, X_{i-1}) \end{aligned}$$

Mutual information

measures the information common to two RVs, i.e., how much information one RV conveys about (propagates to) the other: the information gained about $Y$ when we know $X$.

$$\begin{aligned} I(X;Y) &= H(X) + H(Y) - H(X, Y) \\ &= H(X) - H(X|Y) = H(Y) - H(Y|X) \end{aligned}$$
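
A small numerical sketch over an assumed $2 \times 2$ joint distribution, checking that the expressions above agree:

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (zero entries contribute nothing)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Assumed joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

I = H(P_x) + H(P_y) - H(P_xy)      # I(X;Y) from the joint entropy
H_x_given_y = H(P_xy) - H(P_y)     # H(X|Y) via the chain rule
print(I, H(P_x) - H_x_given_y)     # both ≈ 0.278 bits
```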

Channel capacity: maximum mutual information achievable between input and output random variables of a channel.

$$C = \max_{p(x)} I(X;Y) \quad [\text{bits/symbol}]$$

(Symmetric) Channel capacity $C$ with additive noise independent of the input $X$, i.e. $Y = X + \eta$,

$$\begin{aligned} C &= \max_{p(x)} I(X;Y) \\ &= \max_{p(x)} \{ H(Y) - H(Y|X) \} \\ &= \max_{p(x)} \{ H(Y) \} - H(\eta) \quad \text{since } H(Y|X) = H(\eta). \end{aligned}$$
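
To make the definition concrete, a sketch that grid-searches the input distribution for an assumed binary symmetric channel with flip probability 0.1; the maximum matches $1 - H_2(0.1) \approx 0.531$ bits/symbol:

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel matrix P(y|x)."""
    joint = p_x[:, None] * channel                   # P(x, y)
    p_y = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (p_x[:, None] * p_y))
    return float(np.nansum(terms))                   # 0 * log 0 terms dropped

f = 0.1                                              # assumed flip probability
channel = np.array([[1 - f, f],
                    [f, 1 - f]])                     # rows give P(y | x)
capacity = max(mutual_information(np.array([p, 1 - p]), channel)
               for p in np.linspace(0.01, 0.99, 99)) # crude search over p(x)
print(capacity)                                      # ≈ 0.531, achieved at p = 0.5
```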

Past papers

Correlation coefficient: measure of the strength of the linear relationship between two random variables.

$$\begin{aligned} \text{Corr}(X, Y) &= \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}} = \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2] \cdot \mathbb{E}[(Y - \mathbb{E}[Y])^2]}} \\ \mathbb{E}[X] &= \sum_{x} x\, P(x) \\ \mathbb{E}[X^2] &= \sum_{x} x^2 P(x) \\ \text{Var}(X) &= \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\ \text{Cov}(X, Y) &= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \end{aligned}$$
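
A brief sketch of these definitions over an assumed small joint distribution (for sample data, `np.corrcoef` computes the empirical analogue):

```python
import numpy as np

# Assumed joint distribution P(x, y) on values x, y in {0, 1} (rows = x, columns = y).
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
xs = ys = np.array([0.0, 1.0])

P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
E_x, E_y = (xs * P_x).sum(), (ys * P_y).sum()
var_x = ((xs - E_x) ** 2 * P_x).sum()
var_y = ((ys - E_y) ** 2 * P_y).sum()
cov = ((xs[:, None] - E_x) * (ys[None, :] - E_y) * P_xy).sum()

print(cov / np.sqrt(var_x * var_y))   # Corr(X, Y) = 0.6 for this joint distribution
```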

Continuous entropy for signals

Differential entropy $h(X)$ of a continuous random variable $X$ with probability density function $p(x)$ is defined as,

$$h(X) = \mathbb{E}[-\log_2 p(X)] = - \int_{\mathcal{X}} p(x) \log_2 p(x)\, dx.$$

The differential entropy itself has no fundamental physical meaning, but it occurs often enough to have a name. The differential entropy is not the limiting case of the entropy; the entropy of a continuous distribution is infinite. It is not invariant under coordinate transformations.

Quantizing $X$ into bins of width $\Delta x$ gives a discrete variable $X^{\Delta}$ with $P_i \approx P(x_i) \Delta x$, whose entropy diverges as the bins shrink:

$$\begin{aligned} H(X^{\Delta}) &= - \sum_i P_i \log_2 P_i \\ &= - \sum_i P(x_i) \Delta x \log_2 \big( P(x_i) \Delta x \big) \\ &= - \sum_i P(x_i) \Delta x \log_2 P(x_i) - \sum_i P(x_i) \Delta x \log_2 \Delta x \\ &= - \sum_i P(x_i) \Delta x \log_2 P(x_i) + \log_2 \frac{1}{\Delta x} \sum_i P(x_i) \Delta x \\ &\rightarrow h(X) + \infty \quad \text{as } \Delta x \rightarrow 0, \end{aligned}$$

since $\sum_i P(x_i) \Delta x \rightarrow 1$.

Continuous mutual information $I(X;Y)$ is defined as,

$$\begin{aligned} I(X;Y) &= H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) \\ &= H(Y) - H(Y|X) \\ &= h(Y) + \infty - h(Y|X) - \infty \\ &= h(Y) - h(Y|X) \end{aligned}$$

The divergent $\log_2 \frac{1}{\Delta x}$ terms cancel, so mutual information remains well defined for continuous variables.

Maximum entropy via Lagrange multipliers and normal distribution,

$$L = \int P(x) \log_2 \frac{1}{P(x)}\, dx + \lambda \left( \int P(x)\, dx - 1 \right) - \mu \left( \sigma^2 - \int (x - \bar{x})^2 P(x)\, dx \right),$$

where the variance $\sigma^2 = \int (x - \bar{x})^2 P(x)\, dx$ (empirically $\approx \frac{1}{N} \sum_i (x_i - \bar{x})^2$) is a constraint; setting $\partial L / \partial P(x) = 0$ shows that the entropy-maximizing density for a fixed variance is the normal distribution. For a communication channel, the power of the signal $P \propto \text{signal}^2 \propto \sigma^2$. We assume the transmitter is power-limited, i.e., the average power of the signal is limited to $P$.

[The standard electrical power $P = \frac{V^2}{R} = I^2 R$.]

Discussions,

The Gaussian channel $Y = X + \eta$ models the relationship between the transmitted signal $X$ and the received signal $Y$ with additive white noise $\eta$, which has equal intensity at all frequencies. Usage: satellite and deep-space communication links, radio transmission.

$$X \sim \mathcal{N}(\mu_x, \sigma_x^2), \quad \eta \sim \mathcal{N}(\mu_{\eta}, \sigma_{\eta}^2), \quad Y \sim \mathcal{N}(\mu_y, \sigma_y^2).$$

The differential entropy of a Gaussian distribution $Z \sim \mathcal{N}(\mu, \sigma^2)$, with density $P(z) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left(-\frac{(z - \mu)^2}{2 \sigma^2}\right)$, is given by $\boxed{h(Z) = \frac{1}{2} \log_2 (2 \pi e \sigma^2)}$. Proof,

$$\begin{aligned} h(Z \sim \mathcal{N}(\mu, \sigma^2)) &= - \int P(z) \log_2 P(z)\, dz \\ &= - \int \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}} \log_2\!\left( \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}} \right) dz \\ &= - \int \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}} \left[ \log_2 (2 \pi \sigma^2)^{-\frac{1}{2}} + \log_2 e^{-\frac{(z - \mu)^2}{2 \sigma^2}} \right] dz \\ &= - \int \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}} \left[ - \frac{1}{2} \log_2 (2 \pi \sigma^2) - \frac{(z - \mu)^2}{2 \sigma^2} \log_2 e \right] dz \\ &= \frac{1}{2} \log_2 (2 \pi \sigma^2) \int \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}}\, dz + \frac{\log_2 e}{2 \sigma^2} \int \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(z - \mu)^2}{2 \sigma^2}} (z - \mu)^2\, dz \\ &= \frac{1}{2} \log_2 (2 \pi \sigma^2) \times 1 + \frac{\log_2 e}{2 \sigma^2} \cdot \sigma^2 \\ &= \frac{1}{2} \log_2 (2 \pi e \sigma^2). \end{aligned}$$
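
A quick Monte Carlo sanity check of the boxed result, with an arbitrarily chosen $\sigma$:

```python
import numpy as np

sigma = 2.0
z = np.random.default_rng(0).normal(0.0, sigma, size=1_000_000)
# log2 of the Gaussian density evaluated at the samples.
log2_p = -0.5 * np.log2(2 * np.pi * sigma**2) - (z**2 / (2 * sigma**2)) * np.log2(np.e)
estimate = -log2_p.mean()                           # Monte Carlo E[-log2 p(Z)]
exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # (1/2) log2(2 pi e sigma^2)
print(estimate, exact)                              # both ≈ 3.05 bits
```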

The per-symbol capacity of the Gaussian channel is given by,

$$\begin{aligned} C &= \max_{p(x)} \{ h(Y) - h(Y|X) \} \\ &= \max_{p(x)} \{ h(Y) \} - h(\eta) \\ &= \frac{1}{2} \log_2 (2 \pi e \sigma_y^2) - \frac{1}{2} \log_2 (2 \pi e \sigma_{\eta}^2) \\ &= \frac{1}{2} \log_2 \left( \frac{\sigma_y^2}{\sigma_{\eta}^2} \right) = \frac{1}{2} \log_2 \left( \frac{\sigma_x^2 + \sigma_{\eta}^2}{\sigma_{\eta}^2} \right) = \frac{1}{2} \log_2 (1 + \text{SNR}) \quad [\text{bits/symbol}] \end{aligned}$$

where $\sigma_y^2 = \sigma_x^2 + \sigma_{\eta}^2$ since the signal and noise are independent, $N_0$ is the noise power spectral density, and $\text{SNR} = \frac{S}{N} = \frac{\sigma_x^2}{\sigma_{\eta}^2} = \frac{P}{N_0 B}$ is the signal-to-noise ratio (a linear power ratio; in dB it is $10 \log_{10} \text{SNR}$).

Bandwidth $B$ (Hz, Hertz) of the channel is the difference between the upper and lower frequencies in a continuous band of frequencies, i.e. the frequency range available is $[f_{low}, f_{low} + B]$ Hz.

The Nyquist theorem states that the minimum sampling rate required to reconstruct a band-limited signal is the Nyquist rate $f_s \geq 2B$, i.e. twice the maximum frequency of the signal.

The Shannon-Hartley theorem, for bandwidth-limited channels, gives the theoretical tightest upper bound on the rate of information (per second) that can be communicated at an arbitrarily low error rate, using an average received signal power $S$, through an analog communication channel subject to additive white Gaussian noise (AWGN) of power $N$,

$$C = B \log_2 (1 + \text{SNR}) \quad [\text{bits/s}]$$
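
A worked example with assumed telephone-line numbers ($B = 3\,\text{kHz}$, SNR of 30 dB, i.e. a linear power ratio of 1000):

```python
import math

B = 3000.0                  # assumed bandwidth in Hz
snr_db = 30.0               # assumed SNR in dB
snr = 10 ** (snr_db / 10)   # convert dB to a linear power ratio (1000)
C = B * math.log2(1 + snr)  # Shannon-Hartley capacity
print(C)                    # ≈ 29,900 bits/s
```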

Limitations:

An Ultra-WideBand (UWB) communication system (bandwidths ~ GHz) can avoid interference with other non-UWB users of the same part of the radio spectrum by using a different range of frequencies / transmitting below the ambient noise floor.

Past papers

Prob. distributions comparison and ML

Entropy of a distribution $p(x)$, i.e., the average number of bits needed to encode data with distribution $p(x)$ using a code optimised for $p(x)$.

Cross entropy of $p(x)$ and $q(x)$ measures the average number of bits needed if a code optimised for distribution $q(x)$ is used to encode data with distribution $p(x)$. It is defined as,

$$H(p, q) = \sum_{x} p(x) \log_2 \frac{1}{q(x)}.$$

Kullback-Leibler divergence / relative entropy of $p(x)$ from $q(x)$ is the average number of additional bits needed when a code optimised for distribution $q(x)$ is used instead to encode data with distribution $p(x)$. It is defined as,

$$\begin{aligned} D_{KL}(p \| q) &= \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} \\ &= \sum_{x} p(x) \log_2 \frac{1}{q(x)} - \sum_{x} p(x) \log_2 \frac{1}{p(x)} \\ &= H(p, q) - H(p) \end{aligned}$$

Its minimum, $D_{KL}(p \| q) = 0$, is achieved when $p(x) = q(x)$.
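
A short sketch computing $H(p)$, $H(p, q)$ and $D_{KL}(p \| q)$ for two assumed distributions; it also confirms the boxed relationship stated below:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-(p[mask] * np.log2(q[mask])).sum())

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p) in bits."""
    return cross_entropy(p, q) - cross_entropy(p, p)

p = np.array([0.5, 0.25, 0.25])   # assumed true distribution
q = np.array([1/3, 1/3, 1/3])     # assumed model distribution
print(cross_entropy(p, p))         # H(p)     = 1.5
print(cross_entropy(p, q))         # H(p, q) ≈ 1.585
print(kl_divergence(p, q))         # D_KL    ≈ 0.085, and 0 iff p == q
```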

Both measure the divergence / inefficiency of using a predicted or approximate distribution $q(x)$ instead of the true distribution $p(x)$. In machine learning they are used as loss functions measuring the difference between the predicted and true distributions, and in variational inference.

They are both asymmetric and thus not distance metrics. Instead, the entropy distance is defined as,

$$D_{H}(X, Y) \equiv H(X, Y) - I(X; Y) = H(X|Y) + H(Y|X).$$

Relationship between cross entropy and KL divergence:

$$\boxed{ H(p, q) = D_{KL}(p \| q) + H(p) }$$

$$\begin{aligned} \text{LHS} = \sum_{x} p(x) \log_2 \frac{1}{q(x)} &= \sum_{x} p(x) \log_2 \left( \frac{p(x)}{q(x)} \cdot \frac{1}{p(x)} \right) \quad \text{multiply and divide by } p(x) \\ &= \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} + \sum_{x} p(x) \log_2 \frac{1}{p(x)} \\ &= D_{KL}(p \| q) + H(p) = \text{RHS}. \end{aligned}$$

Past papers

Applications

(a) Classical thermodynamics (“entropy”, developed by Clausius, etc.)

express the direction or outcome of spontaneous changes in the system, with an increase representing energy that becomes unavailable for work.

$T$ is the uniform temperature of the system, and $dQ$ is the heat energy transferred to the system. The entropy change $dS$ of the system is given by $dS = \frac{dQ}{T}$.

(b) Statistical mechanics (“statistical entropy”, developed by Boltzmann, etc.)

The macroscopic state of a system is described by the Gibbs probability distribution over its $N$ microscopic states. $p_i$ is the probability of the system being in state $i$, and $S$ is the entropy of the system,

$$S = - k_B \sum_{i} p_i \ln p_i$$

When the states are equiprobable, $p_i = \frac{1}{N}$, the entropy is maximized at $S = k_B \ln N$, and the system is in a state of maximum disorder. It expresses entropy as the logarithm of the number of accessible microstates.

(c) Information theory (“information entropy”, developed by Shannon)

measure of disorder in a random variable $X$ with probability distribution $p_X(x)$. When we resolve disorder, we gain information.

$$H(X) = - \sum_{x} p_X(x) \log_2 p_X(x)$$