Taken from the paper by G.J. Stephens and W. Bialek.
There are $26^4 = 456{,}976$ possible four-letter words, so the total entropy of a uniform distribution over them is
$\log_2 26^4 \approx 18.8$ bits. Out of this large number of possibilities only a small fraction of
is in use in the English language. By looking at a large database of American literature,
words and their frequencies of occurrence are found (the database had
words; all four-letter words that appear more than
times are considered). The entropy of this set is approximately
. The paper tries to regenerate legal words from the statistical properties of their building blocks (the letters).
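The two kinds of entropy mentioned above can be sketched as follows: the entropy of a uniform distribution over all four-letter strings, and the entropy of an empirical word distribution. The `counts` table is a made-up toy sample used only for illustration, not data from the paper.

```python
import math
from collections import Counter

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over all four-letter strings of a 26-letter alphabet.
n_strings = 26 ** 4
uniform_entropy = math.log2(n_strings)

# Hypothetical toy counts (illustration only, not the paper's corpus).
counts = Counter({"that": 4, "with": 2, "have": 1, "this": 1})
total = sum(counts.values())
empirical_entropy = entropy_bits(c / total for c in counts.values())

print(n_strings, round(uniform_entropy, 2), empirical_entropy)
```

The empirical entropy is always at most the uniform value, since real words occupy only a small part of the space of strings.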
Independent Letters Approximation:
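Not all letters appear with the same frequency. Taking this into account, the probability of a four-letter word is approximated by the product of the letter probabilities at each position, $P^{(1)}(\ell_1,\ell_2,\ell_3,\ell_4) = P_1(\ell_1)\,P_2(\ell_2)\,P_3(\ell_3)\,P_4(\ell_4)$. The entropy of such an ensemble is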
bits.
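A minimal sketch of the independent-letter construction, assuming the letter frequencies are measured separately at each of the four positions. The word counts are hypothetical toy values, and the helper names (`marginals`, `p_independent`) are introduced here for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical toy counts (illustration only).
counts = {"that": 4, "than": 2, "when": 1, "whet": 1}
total = sum(counts.values())

# marginals[i][letter] = probability of seeing `letter` at position i.
marginals = [{} for _ in range(4)]
for word, c in counts.items():
    for i, letter in enumerate(word):
        marginals[i][letter] = marginals[i].get(letter, 0.0) + c / total

def p_independent(word):
    """Probability of a word when letters are drawn independently."""
    return math.prod(marginals[i].get(ch, 0.0) for i, ch in enumerate(word))

# For independent letters the entropies of the positions simply add.
s_independent = sum(entropy_bits(m.values()) for m in marginals)
s_observed = entropy_bits(c / total for c in counts.values())

print(p_independent("that"), p_independent("whan"), s_independent, s_observed)
```

In this toy sample the independent model gives nonzero probability to "whan", which never occurs in the counts, and its entropy exceeds the entropy of the observed words.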
Pairwise Interaction Approximation:
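We assume that the word probability $P^{(2)}$ can be written as a Boltzmann factor with pairwise interactions between the letters at different positions,
$P^{(2)}(\ell_1,\ell_2,\ell_3,\ell_4) = \frac{1}{Z}\exp\left[-\sum_{i<j} V_{ij}(\ell_i,\ell_j)\right]$. The $V_{ij}$ are the pairwise potentials; with six pairs of positions and a $26\times 26$ table for each pair, they have
$6\times 26^2 = 4056$ free parameters. They can be determined by solving the same number of
coupled equations, which require the model to reproduce the observed pairwise marginal distributions.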
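One standard way to solve such pairwise-marginal constraints is iterative proportional fitting; the sketch below uses it on a toy problem and is not necessarily the procedure used in the paper. The two-letter alphabet, the `weights` table and all names are hypothetical, chosen so that the full state space has only 16 words.

```python
import itertools
import math

ALPHABET = "ab"                     # toy alphabet (the real problem has 26 letters)
N_POS = 4
PAIRS = list(itertools.combinations(range(N_POS), 2))   # the 6 position pairs
STATES = ["".join(t) for t in itertools.product(ALPHABET, repeat=N_POS)]

# Hypothetical toy "corpus": a few favoured words on a flat background.
weights = {s: 1.0 for s in STATES}
weights.update({"aaaa": 8.0, "abab": 6.0, "bbaa": 4.0, "baba": 2.0})
z_obs = sum(weights.values())
observed = {s: w / z_obs for s, w in weights.items()}

def pair_marginal(dist, i, j):
    """Marginal distribution of the letters at positions i and j."""
    out = {}
    for s, p in dist.items():
        key = (s[i], s[j])
        out[key] = out.get(key, 0.0) + p
    return out

target = {(i, j): pair_marginal(observed, i, j) for i, j in PAIRS}

# V[(i, j)][(a, b)] plays the role of a pairwise potential.
# All zeros corresponds to the uniform starting model.
V = {pair: {key: 0.0 for key in target[pair]} for pair in PAIRS}
model = {s: 1.0 / len(STATES) for s in STATES}

for sweep in range(500):
    for i, j in PAIRS:
        current = pair_marginal(model, i, j)
        for key in current:
            # Rescale so that this pair marginal matches its target exactly.
            V[(i, j)][key] -= math.log(target[(i, j)][key] / current[key])
        model = {s: p * target[(i, j)][(s[i], s[j])] / current[(s[i], s[j])]
                 for s, p in model.items()}

def energy(word):
    """Sum of pairwise potentials for a word."""
    return sum(V[(i, j)][(word[i], word[j])] for i, j in PAIRS)

max_error = max(abs(pair_marginal(model, i, j)[k] - target[(i, j)][k])
                for i, j in PAIRS for k in target[(i, j)])
print(max_error)
```

By construction the fitted model has the Boltzmann form: each word's probability equals `exp(-energy(word))` divided by the constant 16, because every update multiplies the uniform starting model by a factor that depends on only two letters.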
The ensemble built in this construction has an entropy of bits. The pairwise potentials
define an energy landscape. There are 136 local minima in this landscape (words for which changing any single letter increases the energy); 118 of them are real English words, which capture
of the probability distribution.
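The notion of a local minimum used here can be made concrete with a small sketch. The energy function below is a hypothetical toy (a three-letter alphabet and two "template" words that lower the energy of matching letter pairs), not the potentials fitted in the paper.

```python
import itertools

ALPHABET = "abc"                    # toy alphabet
N_POS = 4
PAIRS = list(itertools.combinations(range(N_POS), 2))
TEMPLATES = ["abca", "cabb"]        # hypothetical low-energy words

def energy(word):
    """Toy pairwise energy: each letter pair shared with a template lowers it by 1."""
    e = 0.0
    for i, j in PAIRS:
        for t in TEMPLATES:
            if word[i] == t[i] and word[j] == t[j]:
                e -= 1.0
    return e

def neighbours(word):
    """All words that differ from `word` in exactly one letter."""
    for i in range(len(word)):
        for letter in ALPHABET:
            if letter != word[i]:
                yield word[:i] + letter + word[i + 1:]

def is_local_minimum(word):
    """True if every single-letter change strictly increases the energy."""
    e = energy(word)
    return all(energy(n) > e for n in neighbours(word))

all_words = ("".join(t) for t in itertools.product(ALPHABET, repeat=N_POS))
local_minima = [w for w in all_words if is_local_minimum(w)]
print(local_minima)
```

With 26 letters each word has 100 single-letter neighbours, so the same test over all $26^4$ strings is still cheap.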
The probability that the pairwise model assigns to each real word reproduces the observed probability of that word,
especially for frequent words (the plot of the model probability $P^{(2)}$ vs the observed probability $P$ is almost diagonal).
The independent-letter approximation reduces the number of possibilities by a factor of , the pairwise interactions give a further reduction by a factor of
, and the higher-order interactions contribute only a factor of
.
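One way to read these factors, assuming each is defined from an entropy difference: if a model lowers the entropy from $S_a$ to $S_b$ bits, the effective number of possibilities shrinks by $2^{S_a - S_b}$. The symbols $S_0$ (uniform), $S_1$ (independent letters), $S_2$ (pairwise) and $S$ (observed words) are labels introduced here for illustration.

$$
2^{S_0 - S} = 2^{S_0 - S_1}\cdot 2^{S_1 - S_2}\cdot 2^{S_2 - S}
$$

The three factors on the right correspond to the independent-letter step, the pairwise step, and the remaining higher-order interactions.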
Zipf’s Law:
Plotting the probability vs the rank of different words shows an approximate power law. The distribution of four-letter words has a cut-off tail. The pairwise model removes some weight from the bulk of the distribution and reassigns it to the tail.
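A rank-probability curve and its log-log slope can be computed as in the sketch below. The input is a synthetic sample with probability proportional to 1/rank, chosen so that the fitted exponent is known in advance; it is not data from the paper, and the function names are illustrative.

```python
import math

def rank_frequency(probabilities):
    """Sort probabilities in decreasing order and attach ranks starting at 1."""
    ordered = sorted(probabilities, reverse=True)
    return list(enumerate(ordered, start=1))

def loglog_slope(points):
    """Least-squares slope of log(probability) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(p) for _, p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic Zipf-like sample: 50 "words" with probability proportional to 1/rank.
raw = [1.0 / r for r in range(1, 51)]
z = sum(raw)
probs = [x / z for x in raw]

slope = loglog_slope(rank_frequency(probs))
print(slope)
```

On real word counts the fit would only be approximate, and a cut-off tail would show up as points falling below the fitted line at high rank.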