For a Bernoulli RV $X$ with $P(X) = \begin{cases} p & \text{when } X = 0, \\ 1-p & \text{when } X = 1, \end{cases}$ the binary entropy is defined as,
$H_2(p) \equiv H(X) = -p\log_2 p - (1-p)\log_2(1-p).$
The entropy is 0 if the distribution is deterministic, i.e. taking a single value with probability 1.
The entropy is higher for equiprobable distributions since they are more unpredictable.
Maximal entropy is achieved when $p = 0.5$, i.e., $H_2(0.5) = 1$.
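A minimal numerical check of these properties (a sketch in Python/NumPy; the function name binary_entropy is ours, not from the notes):

```python
import numpy as np

def binary_entropy(p):
    """H2(p) = -p*log2(p) - (1-p)*log2(1-p), with H2(0) = H2(1) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)          # deterministic cases contribute 0 bits
    q = p[mask]
    out[mask] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

print(binary_entropy([0.0, 0.1, 0.5, 0.9, 1.0]))
# ~ [0, 0.47, 1, 0.47, 0]: zero for deterministic p, maximal 1 bit at p = 0.5
```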
In the general case, to find the maximum entropy, form the Lagrangian $\mathcal{L}(H(p_1, p_2, \ldots, p_n), \lambda)$ with the constraint $\sum_{i=1}^{n} p_i = 1$ and set $\frac{\partial \mathcal{L}}{\partial p_i} = 0$.
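A standard sketch of this maximisation (stated here for reference; not worked out in the original notes):

$$\mathcal{L} = -\sum_{i=1}^{n} p_i \log_2 p_i + \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = -\log_2 p_i - \frac{1}{\ln 2} + \lambda = 0,$$

so every $p_i$ takes the same value, giving $p_i = \frac{1}{n}$ and $H_{\max} = \log_2 n$ (for the Bernoulli case, $n = 2$ and $H_2(0.5) = 1$ bit).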
N i.i.d. random variables, each with entropy H(X), can be compressed into more than $NH(X)$ bits with negligible risk of information loss as $N \to \infty$. Conversely, if they are compressed into fewer than $NH(X)$ bits, it is virtually certain that information will be lost.
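A rough empirical illustration (a sketch only; it uses zlib as a stand-in for a near-optimal compressor, which the notes do not mention):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
p, N = 0.1, 100_000                          # Bernoulli(p) source, N i.i.d. symbols
H = -p*np.log2(p) - (1 - p)*np.log2(1 - p)   # entropy per symbol, ~0.469 bits

bits = rng.random(N) < p                     # i.i.d. Bernoulli(p) samples
packed = np.packbits(bits).tobytes()         # raw encoding: 1 bit per symbol
compressed = zlib.compress(packed, 9)

print(f"N*H(X)      ~ {N * H / 8:.0f} bytes")      # theoretical lower bound
print(f"zlib output ~ {len(compressed)} bytes")    # expected to land above that bound
```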
Symbol codes C
A binary symbol code $C$ for an ensemble $X$ is a mapping $A_X = \{x_1, x_2, \ldots, x_n\} \to \{0,1\}^+$. The extended code $C^+$ is the mapping $A_X^+ \to \{0,1\}^+$,
$c^+(x_1 x_2 \ldots x_N) = c(x_1)\,c(x_2) \ldots c(x_N).$
The expected (encoded character) length of a symbol code $C$ for an ensemble $X$ is,
$L(C, X) = \sum_{x \in A_X} P(x)\, l(x) = \sum_{i=1}^{n} P(x_i)\, l_i.$
Source coding theorem for symbol codes: for an ensemble X,
there exists an encoding C such that the expected encoded character length L(C, X) satisfies,
$H(X) \le L(C, X) < H(X) + 1.$
The minimal expected length is achieved only if the codeword lengths equal the Shannon information contents, $l_i = -\log_2 P(x_i)$.
Unique decodability (Prefix codes)
$\forall x, y \in A_X^+,\; x \neq y \implies c^+(x) \neq c^+(y).$
A uniquely decodable code with codeword lengths $l_1, l_2, \ldots, l_n$ over the binary alphabet $\{0,1\}$ must satisfy the Kraft inequality,
$\sum_{i=1}^{n} 2^{-l_i} \le 1.$
Example prefix-code tree (read from the root): at each depth, the 0-branch terminates a codeword and the 1-branch descends one level deeper, giving the codewords 0, 10, 110, 1110, …
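A quick check of the Kraft inequality for this code (a small sketch, not from the notes):

```python
# Codeword lengths for the prefix code {0, 10, 110, 1110, 11110}.
lengths = [1, 2, 3, 4, 5]
kraft_sum = sum(2 ** -l for l in lengths)
print(kraft_sum)   # 0.96875 <= 1, so a prefix code with these lengths exists
```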
Huffman coding
Build a binary tree from the leaves to the root,
Take the two least probable symbols in the alphabet. They will be given the longest codewords, which will have equal length, and differ only in the last digit.
Combine these two symbols into a single symbol, and repeat.
Limitations: it assumes a constant data distribution, hence a fixed code; the overhead of up to one extra bit per symbol is problematic when H(X) ≈ 1.
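A minimal Huffman-coding sketch (assumptions: a known symbol-probability table; the helper name huffman_code is ours, not from the notes):

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}; returns {symbol: bitstring}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                     # degenerate one-symbol alphabet
        return {s: "0" for s in probs}
    tie = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)                    # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}       # prepend a distinguishing bit
        merged.update({s: "1" + w for s, w in c1.items()}) # to each subtree's codewords
        heapq.heappush(heap, (p0 + p1, tie, merged))       # combine and repeat
        tie += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
print(code)   # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(sum(probs[s] * len(w) for s, w in code.items()))   # expected length 1.75 bits
```

For these dyadic probabilities the expected length equals H(X) = 1.75 bits, the best case of the symbol-code bound above.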
Stream codes
Live data stream, adaptive coding.
Arithmetic coding
Outputs a single high-precision number in the range [0,1) that represents the entire message.
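A toy sketch of the interval-narrowing idea (our own illustration; it uses plain floats, so it only works for short messages, whereas practical coders use integer arithmetic with renormalisation):

```python
def arithmetic_encode(message, probs):
    """Toy arithmetic encoder: narrow the interval [low, high) once per symbol."""
    # Cumulative sub-intervals of [0, 1), e.g. {'a': (0.0, 0.5), 'b': (0.5, 0.75), ...}.
    cum, lo = {}, 0.0
    for s, p in probs.items():
        cum[s] = (lo, lo + p)
        lo += p
    low, high = 0.0, 1.0
    for s in message:
        span = high - low
        c_lo, c_hi = cum[s]
        low, high = low + span * c_lo, low + span * c_hi
    return (low + high) / 2        # any number in [low, high) identifies the message

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
print(arithmetic_encode("aaba", probs))   # a single number in [0, 1), here 0.140625
```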
Lempel-Ziv coding
Given a string of symbols str, the Lempel-Ziv complexity c(str) is the number of phrases obtained by parsing the string from the beginning into the shortest consecutive substrings not seen before, e.g. ATGTG → A | T | G | TG ⟹ c(str) = 4.
Normalized compression distance is a measure of the similarity between two strings x and y based on their Lempel-Ziv complexity.
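A small sketch of both ideas (the parsing rule follows the example above; using the LZ phrase count as the complexity measure $C$ in the standard NCD formula $NCD(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$ is our assumption):

```python
def lz_complexity(s):
    """LZ complexity: parse s into the shortest phrases not seen before."""
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:        # shortest new phrase found
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)   # count a trailing partial phrase

def ncd(x, y, c=lz_complexity):
    """Normalized compression distance with complexity measure c."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(lz_complexity("ATGTG"))   # 4  ->  A | T | G | TG
print(ncd("ATGTG", "ATGAG"))    # dissimilarity score; lower means more similar
                                # (crude on very short strings like these)
```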
Mutual information measures the common information between two RVs, i.e., how much information one RV conveys about (propagates to) the other: the information gained about Y when we know X.
$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
Channel capacity: maximum mutual information achievable between input and output random variables of a channel.
$C = \max_{p(x)} I(X;Y) \quad [\text{bits/symbol}]$
(Symmetric) channel capacity C with additive noise $\eta$ independent of the input X, i.e. $Y = X + \eta$.
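The binary symmetric channel is a simple example of this form ($Y = X \oplus \eta$ with $\eta \sim \text{Bernoulli}(f)$). A small sketch (our own, not from the notes) computing its mutual information and confirming that the uniform input attains the capacity $1 - H_2(f)$:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint probability table p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    return H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy.ravel())

f = 0.1                                          # flip probability of the BSC
p_x = np.array([0.5, 0.5])                       # uniform input distribution
p_y_given_x = np.array([[1 - f, f], [f, 1 - f]])
joint = p_x[:, None] * p_y_given_x               # p(x, y) = p(x) p(y|x)
print(mutual_information(joint))                 # ~0.531 = 1 - H2(0.1), the BSC capacity
```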
Differential entropy h(X) of a continuous random variable X with probability density function p(x) is defined as,
$h(X) = \mathbb{E}[-\log_2 p(X)] = -\int_{\mathcal{X}} p(x)\log_2 p(x)\,dx.$
The differential entropy itself has no fundamental physical meaning, but it occurs often enough to have a name. The differential entropy is not the limiting case of the entropy; the entropy of a continuous distribution is infinite. It is not invariant under coordinate transformations.
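For reference, a standard worked result (not stated explicitly in these notes): among all densities with a given variance $\sigma^2$, the differential entropy is maximised by the Gaussian, for which
$$h(X) = \frac{1}{2}\log_2\!\big(2\pi e\,\sigma^2\big) \ \text{bits}, \qquad X \sim \mathcal{N}(\mu, \sigma^2).$$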
The variance $\sigma^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$ acts as the constraint in this maximisation. For a communication channel, the signal power satisfies $P \propto \text{signal}^2 \propto \sigma^2$. We assume the transmitter is power-limited, i.e., the average power of the signal is limited to $P$.
[The standard electrical power $P = V^2/R = I^2 R$.]
Discussion:
What P(X) gives the P(Y) we want?
What P(Y) makes h(Y) maximum?
Answer: Gaussian distributions.
The Gaussian channel $Y = X + \eta$ models the relationship between the transmitted signal X and the received signal Y with additive white noise η, which has equal intensity at all frequencies. Usage: satellite and deep-space communication links, radio transmission.
where $N_0$ is the noise power spectral density, and $\mathrm{SNR} = \frac{S}{N} = \frac{P}{N_0 B}$ is the signal-to-noise ratio (often quoted in dB).
Bandwidth B (Hz, Hertz) of the channel is the difference between the upper and lower frequencies in a continuous band of frequencies, i.e. the available frequency range is $[f_{\text{low}}, f_{\text{low}} + B]$ Hz.
The Nyquist theorem states that, to reconstruct a band-limited signal, the sampling rate must satisfy $f_s \ge 2B$, i.e. at least twice the maximum frequency of the signal (the Nyquist rate).
The Shannon-Hartley theorem, for bandwidth-limited channels, gives the tightest theoretical upper bound on the information rate (per second) at which data can be communicated at an arbitrarily low error rate, using an average received signal power S, through an analog communication channel subject to additive white Gaussian noise (AWGN) of power N,
$C = B \log_2(1 + \mathrm{SNR}) \quad [\text{bits/s}]$
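A quick numerical sketch of this formula (the bandwidth, power, and noise-density values below are hypothetical, chosen only to show the scaling):

```python
import numpy as np

def shannon_capacity(bandwidth_hz, signal_power, noise_density):
    """C = B * log2(1 + P / (N0 * B)) in bits/s, for an AWGN channel."""
    snr = signal_power / (noise_density * bandwidth_hz)
    return bandwidth_hz * np.log2(1.0 + snr)

# Each doubling of the signal power adds roughly the same ~1 Mbit/s here,
# illustrating the diminishing returns as SNR increases.
B, N0 = 1e6, 1e-9                        # hypothetical 1 MHz channel, N0 in W/Hz
for P in [1e-3, 2e-3, 4e-3, 8e-3]:       # hypothetical signal powers in W
    print(f"P = {P:.0e} W -> C ~ {shannon_capacity(B, P, N0) / 1e6:.2f} Mbit/s")
```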
Limitations:
A larger bandwidth brings more noise, $N = B N_0$. The available spectrum is limited, and higher bandwidths require larger physical infrastructure.
Diminishing returns as SNR increases.
Ultra-WideBand (UWB) communication systems (bandwidths on the order of GHz) can avoid interference with other non-UWB users of the same part of the radio spectrum by using a different range of frequencies or by transmitting below the ambient noise floor.
Entropy of a distribution p(x), i.e., the average number of bits needed to encode data with distribution p(x) using a code optimised for p(x).
Cross entropy of p(x) and q(x) measures the average number of bits needed if a code optimised for distribution q(x) is used to encode data with distribution p(x). It is defined as,
$H(p, q) = \sum_x p(x) \log_2 \frac{1}{q(x)}.$
Kullback-Leibler divergence / relative entropy of p(x) from q(x) measures the average number of additional bits needed if a code optimised for distribution q(x) is used instead to encode data with distribution p(x). It is defined as,
$D_{KL}(p\,\|\,q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}.$
Its minimum is achieved when $p(x) = q(x)$, i.e., $D_{KL}(p\,\|\,q) = 0$.
They both measure the divergence / inefficiency of using a predicted or approximated distribution q(x) instead of the true distribution p(x). In machine learning, they are used as loss functions measuring the difference between the predicted and true distributions, or in variational inference.
They are both asymmetric and thus not distances. Instead, the entropy distance is defined as,
$D_H(X, Y) \equiv H(X, Y) - I(X;Y) = H(X|Y) + H(Y|X).$
Relationship between cross entropy and KL divergence: $H(p, q) = H(p) + D_{KL}(p\,\|\,q)$.
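A quick numerical check of this identity (a sketch; the two distributions below are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])    # true distribution
q = np.array([0.4, 0.4, 0.2])      # approximating distribution

H_p  = -(p * np.log2(p)).sum()     # entropy H(p)
H_pq = -(p * np.log2(q)).sum()     # cross entropy H(p, q)
D_kl = (p * np.log2(p / q)).sum()  # KL divergence D_KL(p || q)

print(H_p, H_pq, D_kl)
print(np.isclose(H_pq, H_p + D_kl))   # True: H(p,q) = H(p) + D_KL(p||q)
```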
(a) Classical thermodynamics (“entropy”, developed by Clausius, etc.)
Entropy expresses the direction or outcome of spontaneous changes in the system, with an increase representing energy that becomes unavailable for work.
T is the uniform temperature of the system, and dQ is the heat energy transferred to the system. The change in the entropy S of the system is given by $dS = \frac{dQ}{T}$.
(b) Statistical mechanics (“statistical entropy”, developed by Boltzmann, etc.)
The macroscopic state of a system is described by the Gibbs probability distribution over its N microscopic states. $p_i$ is the probability of the system being in state i, and S is the entropy of the system,
$S = -k_B \sum_i p_i \ln p_i.$
When the states are equiprobable, $p_i = \frac{1}{N}$, the entropy is maximised at $S = k_B \ln N$, and the system is in a state of maximum disorder. This expresses entropy as the logarithm of the number of accessible microstates.
(c) Information theory (“information entropy”, developed by Shannon)
Entropy is a measure of disorder in a random variable X with probability distribution $p_X(x)$. When we resolve disorder, we gain information.