For a Bernoulli RV $X$ with $P(X) = \begin{cases} p, & \text{when } X=0 \\ 1-p, & \text{when } X=1 \end{cases}$, the binary entropy is defined as,
$$H_2(p) \equiv H(X) = -p\log_2 p - (1-p)\log_2(1-p).$$
The entropy is 0 if the distribution is deterministic, i.e. taking a single value with probability 1.
The entropy is higher for equiprobable distributions since they are more unpredictable.
Maximal entropy is achieved when $p = 0.5$, i.e., $H_2(0.5) = 1$.
For the general case, form the Lagrange function $L(H(p_1, p_2, \dots, p_n), \lambda)$ from $H(X)$ with the constraint $\sum_{i=1}^{n} p_i = 1$, and set $\frac{\partial L}{\partial p_i} = 0$ to find the maximum entropy.
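A brief worked version of this Lagrange-multiplier argument (same notation as above; the algebra is standard):
$$L = -\sum_{i=1}^{n} p_i \log_2 p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right), \qquad \frac{\partial L}{\partial p_i} = -\log_2 p_i - \frac{1}{\ln 2} + \lambda = 0.$$
The stationarity condition does not depend on $i$, so all $p_i$ are equal; with the constraint this gives $p_i = \frac{1}{n}$ and the maximum entropy $H_{\max} = \log_2 n$.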
N i.i.d. random variables, each with entropy $H(X)$, can be compressed into little more than $NH(X)$ bits with a negligible risk of information loss as $N$ tends to infinity. Conversely, if they are compressed into fewer than $NH(X)$ bits, information is almost certain to be lost.
For an ensemble $X$, there exists an encoding $C$ such that the expected codeword length $L(C, X)$ satisfies,
$$H(X) \le L(C, X) < H(X) + 1.$$
The minimal expected length is achieved only if the code lengths equal the Shannon information contents $l_i = -\log_2 P(x_i)$.
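A small numerical check of this bound in Python (the source distribution is hypothetical, chosen so that all probabilities are powers of 2):

```python
import math

# A toy source distribution (hypothetical values for illustration).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Entropy H(X) = sum_i p_i * log2(1 / p_i)
H = sum(pi * math.log2(1 / pi) for pi in p.values())

# Shannon code lengths l_i = ceil(-log2 p_i); expected length L = sum_i p_i * l_i
lengths = {x: math.ceil(-math.log2(pi)) for x, pi in p.items()}
L = sum(p[x] * lengths[x] for x in p)

print(f"H(X) = {H:.3f} bits, L = {L:.3f} bits")  # here H = L = 1.75, since every p_i is a power of 2
assert H <= L < H + 1
```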
Unique decodability (Prefix codes)
$$\forall x, y \in \mathcal{A}_X,\; x \neq y \implies c^+(x) \neq c^+(y).$$
A uniquely decodable code over the binary alphabet $\{0,1\}$, with codeword lengths $l_1, l_2, \dots, l_n$, must satisfy the Kraft-McMillan inequality,
$$\sum_{i=1}^{n} 2^{-l_i} \le 1.$$
[Code-tree sketch: following the 1-branch repeatedly and terminating with a 0 gives the prefix codewords 0, 10, 110, ...]
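A quick check of the Kraft-McMillan inequality for the first four codewords of the prefix code sketched above (0, 10, 110, 1110):

```python
# Codeword lengths of the prefix code {0, 10, 110, 1110}
lengths = [1, 2, 3, 4]
kraft_sum = sum(2 ** -l for l in lengths)
print(kraft_sum)   # 0.9375 <= 1, so a prefix code with these lengths exists
assert kraft_sum <= 1
```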
Huffman coding
Build a binary tree from the leaves to the root,
Take the two least probable symbols in the alphabet. They will be given the longest codewords, which will have equal length, and differ only in the last digit.
Combine these two symbols into a single symbol, and repeat.
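A minimal sketch of this construction in Python, using the standard-library heapq module (the symbol probabilities in the example are hypothetical):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build Huffman codewords from a {symbol: probability} dict."""
    tiebreak = count()  # avoids comparing dicts when probabilities are equal
    # Each heap entry: (probability, tiebreak, {symbol: partial codeword})
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least probable nodes; prefix their codewords with '0' and '1'.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        # Combine them into a single node and repeat.
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Example (hypothetical probabilities)
print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# e.g. {'a': '0', 'b': '10', 'c': '111', 'd': '110'} (labels may differ, codeword lengths will not)
```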
Limitations: it assumes a constant data distribution, hence a fixed code; the extra bit (up to one per symbol) is problematic when $H(X) \ll 1$.
Arithmetic coding
Outputs a single high-precision number in the range $[0, 1)$ that represents the entire message.
Lempel-Ziv coding
Given a string of symbols str, the Lempel-Ziv complexity $c(\text{str})$ is the number of phrases obtained by scanning the string from left to right and starting a new phrase whenever the current substring has not appeared in the preceding text, e.g. str = A | T | G | TG $\implies c(\text{str}) = 4$.
Normalized compression distance (NCD) is a measure of the similarity between two strings $x$ and $y$ based on their compressibility (e.g. their Lempel-Ziv complexity): $\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$, where $C(\cdot)$ is the compressed length and $xy$ is the concatenation.
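A small Python sketch of one common way to compute the phrase count (a simple LZ76-style parse, consistent with the ATGTG example above):

```python
def lz_phrases(s):
    """Greedy LZ76-style parse: extend the current phrase while it already
    occurs somewhere in the text before its last character."""
    phrases, i, n = [], 0, len(s)
    while i < n:
        j = i + 1
        while j <= n and s[i:j] in s[:j - 1]:
            j += 1
        phrases.append(s[i:min(j, n)])
        i = min(j, n)
    return phrases

print(lz_phrases("ATGTG"))       # ['A', 'T', 'G', 'TG']
print(len(lz_phrases("ATGTG")))  # 4, i.e. c(str) = 4
```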
Differential entropy h(X) of a continuous random variable X with probability density function p(x) is defined as,
$$h(X) = \mathbb{E}[-\log_2 p(X)] = -\int_{\mathcal{X}} p(x) \log_2 p(x)\, dx.$$
The differential entropy itself has no fundamental physical meaning, but it occurs often enough to have a name. The differential entropy is not the limiting case of the entropy; the entropy of a continuous distribution is infinite. It is not invariant under coordinate transformations.
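As a standard worked example (not from the notes above): for $X \sim \mathrm{Uniform}(0, a)$ with density $p(x) = \frac{1}{a}$ on $[0, a]$,
$$h(X) = -\int_0^a \frac{1}{a} \log_2 \frac{1}{a}\, dx = \log_2 a,$$
which is negative whenever $a < 1$, illustrating that differential entropy is not an absolute measure of information content.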
Subject to a fixed variance $\sigma^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$ as the constraint, the differential entropy is maximised by the Gaussian distribution (see the discussion below). For a communication channel, the power of the signal $P \propto \text{signal}^2 \propto \sigma^2$. We assume that the transmitter is usually power-limited, i.e., the average power of the signal is limited to $P$.
[The standard electrical power $P = V^2/R = I^2 R$.]
Discussion:
What P(X) gives the P(Y) we want?
What P(Y) makes h(Y) maximum?
Answer: Gaussian distributions.
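For reference, a standard result consistent with this answer: a Gaussian with variance $\sigma^2$ attains the maximum differential entropy among all densities with that variance,
$$h(X) = \tfrac{1}{2} \log_2\!\left(2 \pi e \sigma^2\right) \ \text{bits}.$$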
The Gaussian channel $Y = X + \eta$ models the relationship between transmitted signal $X$ and received signal $Y$ with additive white noise $\eta$, which has equal intensity at all frequencies. Usage: satellite and deep-space communication links, radio transmission.
Here $N_0$ is the noise power spectral density, the noise power is $N = N_0 B$, and $\mathrm{SNR} = \frac{S}{N} = \frac{P}{N_0 B}$ is the signal-to-noise ratio (often quoted in dB).
Bandwidth $B$ (Hz, hertz) of the channel is the difference between the upper and lower frequencies in a continuous band of frequencies, i.e. the frequency range available is $[f_{\text{low}}, f_{\text{low}} + B]$ Hz.
The Nyquist theorem states that the minimum sampling rate required to reconstruct a signal is twice its maximum frequency $B$, i.e. $f_s \ge 2B$, where $2B$ is the Nyquist rate.
To prevent aliasing when it is not possible to increase the sampling rate $f_s$,
the signal should first be low-pass filtered before it is sampled,
reducing its frequency composition to be within ±B Hz,
such that the condition fs≥2B is satisfied.
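A small numerical illustration of aliasing (the frequencies are hypothetical): a 7 Hz cosine sampled at only 10 Hz is indistinguishable from a 3 Hz cosine at the sample instants, so the signal would need to be low-pass filtered to below 5 Hz before sampling.

```python
import numpy as np

fs = 10.0                # sampling rate (Hz), below the Nyquist rate for a 7 Hz tone
t = np.arange(32) / fs   # sample instants

x_high = np.cos(2 * np.pi * 7 * t)   # 7 Hz tone, undersampled
x_alias = np.cos(2 * np.pi * 3 * t)  # 3 Hz alias (|7 - fs| = 3 Hz)

print(np.allclose(x_high, x_alias))  # True: the two sets of samples are identical
```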
The Shannon-Hartley theorem, for bandwidth-limited channels, gives the tightest theoretical upper bound on the rate of information (per second) that can be communicated at an arbitrarily low error rate, using an average received signal power $S$, through an analog communication channel subject to additive white Gaussian noise (AWGN) of power $N$:
$$C = B \log_2(1 + \mathrm{SNR}) \quad [\text{bits/s}]$$
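As a worked example with standard textbook numbers (not from the notes above): a telephone channel with $B = 3\,\mathrm{kHz}$ and $\mathrm{SNR} = 30\,\mathrm{dB}$ (a linear power ratio of 1000) has
$$C = 3000 \times \log_2(1 + 1000) \approx 3000 \times 9.97 \approx 3 \times 10^4 \ \text{bits/s}.$$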
Limitations:
Larger bandwidth brings more noise, $N = B N_0$.
The spectrum available is limited, and higher bandwidths require larger physical infrastructure.
Increasing the SNR inside the logarithm without bound lets the channel capacity increase monotonically and without limit.
Diminishing returns as SNR increases.
Increasing the bandwidth $B$ alone causes a monotonic increase in capacity, but only up to an asymptotic limit.
As $\alpha = \mathrm{SNR} = \frac{S}{N_0 B} \to 0$ (i.e. $B \to \infty$), $\ln(1 + \alpha) \to \alpha$. Thus
$$C = B \log_2(1 + \alpha) = \frac{B \alpha}{\ln 2} \to \frac{S}{N_0} \log_2 e,$$
so there are vanishing returns from endlessly increasing the bandwidth.
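A short numerical check of this asymptote (the power and noise-density values are hypothetical):

```python
import numpy as np

S = 1.0     # average received signal power (W), hypothetical
N0 = 1e-3   # noise power spectral density (W/Hz), hypothetical

def capacity(B):
    """Shannon-Hartley capacity C = B * log2(1 + S / (N0 * B)) in bits/s."""
    return B * np.log2(1 + S / (N0 * B))

for B in [1e2, 1e3, 1e4, 1e5, 1e6]:
    print(f"B = {B:8.0f} Hz  ->  C = {capacity(B):10.1f} bits/s")

# Capacity saturates near (S / N0) * log2(e) ~ 1442.7 bits/s as B grows
print("asymptotic limit:", (S / N0) * np.log2(np.e))
```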
Ultra-wideband (UWB) communication systems (bandwidths ~ GHz) can avoid interference with other non-UWB users of the same part of the radio spectrum by using a different range of frequencies or by transmitting below the ambient noise floor.
Entropy of a distribution $p(x)$, i.e., the average number of bits needed to encode data with distribution $p(x)$ using a code optimised for $p(x)$.
Cross entropy of p(x) and q(x) measures the average number of bits needed if a code optimised for distribution q(x) is used to encode data with distribution p(x). It is defined as,
$$H(p, q) = \sum_x p(x) \log_2 \frac{1}{q(x)}.$$
Kullback-Leibler divergence / relative entropy of $p(x)$ from $q(x)$ gives the average number of additional bits needed if a code optimised for distribution $q(x)$ is used instead to encode data with distribution $p(x)$. It is defined as,
$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}.$$
Its minimum is achieved when $p(x) = q(x)$, i.e., $D_{KL}(p \,\|\, q) = 0$.
They both measure the divergence / inefficiency of using a predicted or approximate distribution $q(x)$ instead of the true distribution $p(x)$. In machine learning, they are used as loss functions to measure the difference between the predicted and true distributions, or in variational inference.
They are both asymmetric and thus not a distance. Instead, the entropy distance is defined as,
$$D_H(X, Y) \equiv H(X, Y) - I(X; Y) = H(X|Y) + H(Y|X).$$
Relationship between cross entropy and KL divergence: $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$.
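A quick numerical check of these definitions and of this identity (the two distributions are hypothetical):

```python
import math

p = [0.7, 0.2, 0.1]   # "true" distribution (hypothetical)
q = [0.5, 0.3, 0.2]   # "approximate" distribution (hypothetical)

H_p = sum(pi * math.log2(1 / pi) for pi in p)                # entropy H(p)
H_pq = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))   # cross entropy H(p, q)
D_kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # KL divergence D_KL(p || q)

print(H_p, H_pq, D_kl)
assert math.isclose(H_pq, H_p + D_kl)  # H(p, q) = H(p) + D_KL(p || q)
assert D_kl >= 0
```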
(a) Classical thermodynamics (“entropy”, developed by Clausius, etc.)
Entropy expresses the direction or outcome of spontaneous changes in the system, with an increase representing energy that becomes unavailable for work.
The entropy change of the system is given by $dS = \frac{dQ}{T}$, where $T$ is the uniform temperature of the system and $dQ$ is the heat energy transferred to the system.
(b) Statistical mechanics (“statistical entropy”, developed by Boltzmann, etc.)
The macroscopic state of a system is described by the Gibbs probability distribution over its $N$ microscopic states. $p_i$ is the probability of the system being in state $i$, and $S$ is the entropy of the system,
$$S = -k_B \sum_i p_i \ln p_i$$
When states are equiprobable, $p_i = \frac{1}{N}$, the entropy is maximised at $S = k_B \ln N$, and the system is in a state of maximum disorder. This expresses entropy as the logarithm of the number of accessible microstates.
(c) Information theory (“information entropy”, developed by Shannon)
A measure of disorder in a random variable $X$ with probability distribution $p_X(x)$. When we resolve disorder, we gain information.