Statistical Mechanics of Four Letter Words

By Peyman Khorsand

Taken from the paper by G.J. Stephens and W. Bialek.

There are 26^4=456976 possible four letter words, the total entropy for such a sample is S_{0}= 4 \log_2 (26)=18.802 bits. Out of this large number of possibilities only a small fraction of 1/3700 is in use in English language. By looking at large database of American literature, \sim 800 words and their frequency of occurrence is found (the database had \sim 2\times 10^7 words all the four letter words which they have appeared more than 100 times are considered.). The entropy of this set is approximately S_{full} = 6.92\pm 0.003. The paper is trying to regenerate legal words using statistical properties of their building blocks (the letters).

Independent Letters Approximation:

Not all the letters appear with the same frequencies. Considering this point the probability of having a four letter words is P(l_1, l_2, l_3, l_4)= \prod_i l_i. The entropy of a such an ensemble is S_{ind} = 14.083 \pm 0.001 bits.

Pairwise Interaction Approximation:

We assume that P(l_1,l_2,l_3,l_4) can be written as a Boltzmann factor of pairwise interacting between different letters

P^{(2)}(l_1,l_2,l_3,l_4)= {1 \over Z} \exp \left[ - \sum_{i,j} V_{i,j}(l_i,l_j) \right]

V_{i,j}(l_i,l_j) are the pairwise potential and have 6 \times (26^2 -1)=4050 free parameters. They can be determined by solving 6 \times 26^2 coupled equations (pairwise marginal distribution).

\rho_{i_1,i_2}(l_1,l_2) = \sum^{\prime} P^{(2)}(k_1,k_2,k_3,k_4)

The ensemble built in this construction has an entropy of 7.48 bits. The pairwise potentials V_{i,j} define an energy landscape. There are 136 local minima in this landscape (changing any letter increases the energy) out of them 118 are real English words, that capture 63.5 of the probability distribution.

The induced probability for each real word using pairwise interactions P_2 reproduce the observed probability of that word P_{full} especially for frequent words (The plot of P_2 vs P_{full} is almost diagonal).

The independent letter approximation reduced the number of possibilites by a factor of \sim 1/26, a further reduction factor of \sim 1/100, the higher order interaction only contribute for a factor of 1/1.5.

Zipf’s Law:

Plotting the probability vs the rank of different words shows an approximate power law.The four letter words have a cut off tail. The pairwise model removes some weight from the bulk of the distribution and reassigns it to the tail.

Leave a Reply