Archive for the ‘Information Theory’ Category

Information Maximization

October 21, 2008

Taken from: “An Information Maximization Approach to Blind Separation and Blind Deconvolution” by A.J. Bell and T.J. Sejnowski

The idea is to maximize the mutual information between the output Y of a neural network and its input X. Mutual information is defined as I(Y,X)= H(Y) - H(Y | X).

There are divergences due to the continuous variables. As a result it is better to work with gradient of mutual information with respect to some parameter, w. They considered the systems that have additive noise, N,  where H(Y | X)= H(N). Therefore to maximize the mutual information it is sufficient to maximize entropy of the output. Assuming that the input probability distribution is f_{x}(x) then probability distribution of the output is

f_{y}(y)= {f_{x}(x) \over |\partial y /\partial x |}

the entropy in the output can be written as

H(y) = E \left[ \ln \left| {\partial y \over \partial x} \right| \right] -E \left[ \ln f_{x} (x) \right]

A stochastic gradient descent learning rule can be implemented as

\Delta w\propto {\partial H(y)\over \partial w}= \left({\partial y\over \partial x} \right)^{-1}

In the case of logistic transfer function y= 1/ (1+ e^{-(wx+w_0)} or tanh function y= \tanh(wx + w_0). The learning rule has two component an Anti-Hebbian, \Delta w \propto -2 x y, and an Anti-Decay component, \Delta w \propto 1/w.

Multi-component network:

This result can be generalized to multi-component network, where {\mathbf y}= g({\mathbf W x +w_{0}}). The result will be

\Delta w_{ij} \propto {\mathrm {cof} \, w_{ij} \over \det \bf{W} } + x_i (1 - 2 y_j)

Casual Filters:

Furthermore it can be generalized to casual filters where y(t)=g[u(t)]=g[w(t) * x(t) ], or equivalently {\mathbf Y} =g({\mathbf W X} ). Now {\mathbf W} is lower triangular matrix.

W=\left(\begin{array}{lccccc} w_L & 0 &\cdots & 0 & & 0\\ w_{L-1} & w_L &0& \cdots& &0\\ \vdots & & & & &\vdots\\ w_1&\cdots & w_L & &\cdots &0\\ 0 & w_1 & \cdots & w_L & \cdots &0\\ \vdots & & & & & \vdots\\ 0 & \cdots & & w_1& \cdots& w_L\\ \end{array}\right)

The probability distribution function of X and Y are f_{Y}(Y)= f_{X}(X)/|J|

J=\det \left[ \partial y(t_i) \over \partial x(t_{j}) \right]=(\det W) \prod_{t=1}^{M} {\partial y(t) \over \partial u(t)}

The gradient descend learning rule is

\Delta w_L \propto \sum_{t=1}^M \left( {1 \over w_L} - 2 x_t y_{t} \right)
\Delta w_{L-j} \propto \sum_{t=1}^M \left( - 2 x_{t-j} y_{t}\right)

The delay weights w_{t-j} keep shrinking and try to de-correlate the past signal from future. It is very interesting to see what kind of w will be created depending on the input statistics.

Time Delay:

y(t)=g[w\, x(t-d)]

following the same set of steps we find

\Delta d \propto 2 w {\partial x \over \partial t} y

Generalizing the transfer function:

{dy\over d u}= y^p (1 - y)^r

\Delta w\propto {1\over w} +x[p(1-y) - ry]

\Delta w_0 \propto p(1-y )-ry

Blind Separation: A set of source signals, s_1(t), s_2 (t), \ldots, s_n(t) is mixed together linearly, with a matrix {\mathbf A}. We ought to find a square matrix {\mathbf W} that is inverse of {\mathbf A} up to a permutation. The problem reduces to minimize the mutual information between the inputs(Independent Component Analysis). Obviously, if the matrix {\mathbf A} is singular one can not solve the problem.

Blind De-convolution: A signal s(t) is corrupted with a linear filter, a(t)x(t)= [a(t)* s(t)]. We have to find a filter w_1, w_2 , \ldots , w_L to recover s(t) from x(t). The problem reduces to remove statistical dependency across time (Whitening of x(t)).

The fundamental question is how can we reduce mutual information between two outputs by information maximization. The joint entropy of two output signals is

H(y_1,y_2)=H(y_1) + H(y_2) -I(y_1, y_2)

The algorithm discussed above maximize H(y_1,y_2) and this is mostly done by minimizing I(y_1,y_2). When I(y_1,y_2) is zero the probability distribution of y’s is separable. Can we introduce a constraint and keep the total \sum H(y_i) conserved?

It was briefed up to section 5.

Accuracy vs. Correlated Noise

October 16, 2008

Taken from: “The Effect of Correlated Variability on the Accuracy of a Population Code” by L.F. Abbott and P. Dayan

They answer to the following questions for three different class of models: “(1) Does correlation increase or decrease the accuracy with which the value of an encoded quantity can be extracted from a population of N neurons? (2) Does this accuracy approach a fixed limit as N increases?”

The average firing rate is noted as f_i(s)=\langle r_i(s) \rangle. The correlation matrix Q_{ij}(s)= \langle (r_i (s)- \langle r_i (s)\rangle)(r_j(s) - \langle r_j (s)\rangle ) \rangle is used to calculate Fisher information I_{\mathrm F}.  The Fisher information can be calculated assuming Gaussian character of correlation

P[\mathbf {r}|s] = {\cal N} \exp \left[-{1\over 2}\sum_{i,j}(r_i - f_i)Q_{ij}^{-1} (r_j - f_j) \right]

I_{\mathrm F}(s)=\sum_{i,j} {d f_{i}\over ds} Q^{-1}_{ij} {d f_{j}\over ds} + {1\over 2}\sum_{i,j,k,l} {d Q_{ij} \over ds} Q_{jk}^{-1} {dQ_{kl} \over ds } Q_{li}^{-1}
I_{\mathrm F}(s)= {d \mathbf{f}^{\mathrm T} \over ds} \mathbf{Q}^{-1} {d\mathbf{f} \over ds}+ {1\over 2} \mathbf {Tr} \left[{d\mathbf{Q} \over ds} \mathbf{Q}^{-1}{d \mathbf{Q}\over ds}\mathbf{Q}^{-1} \right]

The models being studied are:

Additive Noise Model:

Q_{ij}= \sigma^2 \left[ \delta_{ij} + c(1 - \delta_{ij})\right]

A nice example of collective quantities R= {1\over N} \sum r_i and \tilde R={1\over N} \sum (-1)^i r_i and the N dependence of their respective variance, \sigma^2_R and \sigma^2_{\tilde R}, is presented.

\lim _{N\rightarrow \infty} I_{\mathrm F}= {N [F_1(s)- F_{2}(s)]\over \sigma^2 (1-c)}

where F_{1}(s)= {1\over N} \sum_i \left(d f_i(s) \over ds\right)^2 and F_{2}(s)=\left( {1\over N} \sum_i {d f_i(s) \over ds} \right)^2. It worths mentioning that when F_1=F_2 the Fisher information fails to grow linearly. This will put a constraint on the individual neurons tuning curves among the population

F_1=F_2 \Rightarrow f_i(s)=p(s)+q_i

Multiplicative Noise Model:

Q_{ij}(s)= \sigma^2 \left[ \delta_{ij} + c(1-\delta_{ij})\right] f_i (s) f_{j}(s)

\lim_{N \rightarrow \infty}I_{\mathrm F}= {N [G_1(s)- G_{2}(s)] \over \sigma^2 (1-c)} +{N [(2-c)G_1(s)-c G_{2}(s)]\over (1-c)}

Limited-Range Correlation Model:

Q_{ij}=\sigma^2 \rho^{|i-j|}

I_{\mathrm F}={N(1-\rho)F_{1}(s)\over\sigma^2 (1+\rho)} +O(N^{1-2/D})

D is the number of encoded variables. “The main conclusion of this paper is that neurons should have different selectivity to the quantities they are encoding. In particular their tuning curves should not be additively or multiplicatively separable”.

Optimal Sampling of Natural Images

October 10, 2008

This post is taken from the following paper,

“Optimal sampling of Natural Images: A design Principle for the Visual Systems?” by W. Bialek, D.L. Ruderman and A. Zee

If the natural scene images can be characterized by \phi(\mathbf {x}) and the cells are indexed by integers n and positioned respectively at \mathbf{x}_n. The output of each cell, Y_n, is a linear filter (receptive field) with an additional noisy component

Y_n=\int d^2 \mathbf{x}F( \mathbf{x}-\mathbf{x}_n)\phi(\mathbf{x})+\eta_n

The goal is to find a kernel F that maximze the information content of output, Y_n, about the input, \phi. The information content of Y_n assuming that \phi(\mathbf{x}) is Gaussian is

I={1\over 2\ln 2}Tr\ln\left[ \delta_{nm} + {1 \over (2 \pi)^2 \sigma^2} \int d^2\mathbf{k} e^{i \mathbf{k}(\mathbf{x}_n -\mathbf{x}_m)} |\tilde{F}(\mathbf{k})|^2 S(\mathbf{k})\right]

here S(\mathbf{k}) is the power spectrum of the signal. We can approximate I in large noise regime as

I \approx {N \over 2 (2\pi)^2\sigma^2 \ln 2} \int d^2\mathbf{k} |\tilde{F}(\mathbf{k})|^2 S(\mathbf{k})

We need to put extra constraints to get solutions that are physically realistic

1) We should fix the filters gain,

\int d^2 \mathbf{k} |\tilde{F} (\mathbf{k}) |^2 =1

2) There should be a cost for “long-range interactions in spatial space” or “sharp fluctuations in momentum space”,

C = \alpha \int d^2 \mathbf{k} \mathbf{k}^2 |\tilde{F}(\mathbf{k}) |^2

Using variational methods they found that filter should satisfy the schrodinger like equation in \mathbf{k}-space

-{\alpha\over 2}{\nabla_k}^2\tilde{F}(k)-{1\over 2 \sigma^2\ln 2} S(k)\tilde{F}(k)=\Lambda \tilde{F}(k)

the {h}^2/\alpha playes the role of the mass M and - S(k)/ ( 2 \sigma^2 \ln 2 ) of potential V in the schrodinger equation.

In the case that they are interested the power spectrum of natural images has a power law S(k)\sim 1/k^2 and as a result an accidental symmetry. This accidental symmetry allows them to recombine various angular momentum eigenstates and create orientation selective eigenstates (filters).

What interests me the most is that for any linear filter this formalism holds and, the power spectrum of the input signal appears as the potential of the schrodinger equation.

Statistical Mechanics of Four Letter Words

October 6, 2008

Taken from the paper by G.J. Stephens and W. Bialek.

There are 26^4=456976 possible four letter words, the total entropy for such a sample is S_{0}= 4 \log_2 (26)=18.802 bits. Out of this large number of possibilities only a small fraction of 1/3700 is in use in English language. By looking at large database of American literature, \sim 800 words and their frequency of occurrence is found (the database had \sim 2\times 10^7 words all the four letter words which they have appeared more than 100 times are considered.). The entropy of this set is approximately S_{full} = 6.92\pm 0.003. The paper is trying to regenerate legal words using statistical properties of their building blocks (the letters).

Independent Letters Approximation:

Not all the letters appear with the same frequencies. Considering this point the probability of having a four letter words is P(l_1, l_2, l_3, l_4)= \prod_i l_i. The entropy of a such an ensemble is S_{ind} = 14.083 \pm 0.001 bits.

Pairwise Interaction Approximation:

We assume that P(l_1,l_2,l_3,l_4) can be written as a Boltzmann factor of pairwise interacting between different letters

P^{(2)}(l_1,l_2,l_3,l_4)= {1 \over Z} \exp \left[ - \sum_{i,j} V_{i,j}(l_i,l_j) \right]

V_{i,j}(l_i,l_j) are the pairwise potential and have 6 \times (26^2 -1)=4050 free parameters. They can be determined by solving 6 \times 26^2 coupled equations (pairwise marginal distribution).

\rho_{i_1,i_2}(l_1,l_2) = \sum^{\prime} P^{(2)}(k_1,k_2,k_3,k_4)

The ensemble built in this construction has an entropy of 7.48 bits. The pairwise potentials V_{i,j} define an energy landscape. There are 136 local minima in this landscape (changing any letter increases the energy) out of them 118 are real English words, that capture 63.5 of the probability distribution.

The induced probability for each real word using pairwise interactions P_2 reproduce the observed probability of that word P_{full} especially for frequent words (The plot of P_2 vs P_{full} is almost diagonal).

The independent letter approximation reduced the number of possibilites by a factor of \sim 1/26, a further reduction factor of \sim 1/100, the higher order interaction only contribute for a factor of 1/1.5.

Zipf’s Law:

Plotting the probability vs the rank of different words shows an approximate power law.The four letter words have a cut off tail. The pairwise model removes some weight from the bulk of the distribution and reassigns it to the tail.

Thermodynamics of Natural Images

September 25, 2008

This is a brief extraction from Mora et.al. paper.

The idea originates from an observation by Field that the spatial spectrum of images of natural scenes follow a 1/f^2 behavior. Now if this behavior was observed in a statistical system, the first conclusion was that the system is in its critical point.

Here the authors to show the same result assume that the probability of an image \sigma is related to an energy function E(\sigma) through Boltzmann relation

P_T(\sigma)= {1 \over  Z} e^{-E(\sigma)/T} where Z = \sum_{\sigma} P(\sigma)

An L\times L pixel black and white image has 2^{L^2} possible configurations. In order to make this number they studied the system for L=2 , 3 and 4 using coarse graining ideas and block averaged the natural images.

Using their set of images they

1) assumed T=1 in natural images and found P(\sigma)

2) the probability at any other temperature can be found by

P_T(\sigma) = {P(\sigma)^{1/T}     \over  \sum_{\sigma} P(\sigma)^{1/T} }

3) the entropy at any other temperature can be derived by

S(T) = - \sum_{\sigma} P_T(\sigma)  \log  P_T(\sigma)

4) calculated the specific heat

C(T)=T {\partial S(T) \over \partial T}

5) repeated all these steps for L=2,3 and 4.

They observed that there is a peak in normalized specific heat C(T,L)/L^2 and as L increase the peak becomes sharper and approaches T=1. This is a clear indication for critical behavior in natural scenes (divergence of the specific heat at T=1).

In the macrocanonical ensemble, we study the statistical system under the fixed energy constraint. Under this conditions we can define the partition function as an integral over the density of states with constant energy \rho (E),

Z(T) = \int dE \rho(E) e^{-E/T}

Entropy will be defined as logarithm of density of states i.e. \rho(E)=e^{S(E)} ,

since both entropy and energy are extensive it is more suitable to work with \epsilon = E/N and s(\epsilon)=S(\epsilon)/N. In the large N limit,

C={N \over T} \left[  - {d^2 s(\epsilon)  \over d\epsilon^2}  \right].

When the S/N vs E/N curve for a system is become linear we should expect critical behavior. For natural images this can be done by looking at the probability of occurrence of each configuration, P(\sigma), then calculating P=\exp (-  E /T) of each configuration up to a constant. The density of states \rho (E) is the histogram of the configurations’ energy. Finally, the entropy is logarithm of density of states.

This calculation has been done for 8 to 50 pixel sample and they all shown a linear S , E behavior at low energy limit, where the sampling methods are more reliable (more occurrence).

They also prove that in a system which follows generalized Zipf’s law entropy has a linear dependence on energy.

Note: Generalized Zipf’s law i.e. p_r \propto 1/r^{\alpha}, where it  relates the ranking of a state r to its probability.

Energy Landscape:

Finally they studied the energy landscape of the natural images by looking at the local minima of the energy. They examined all the 4 \times 4 pixel images. The local minima are defined as configurations that flipping any pixel will increase the energy of the configuration. They found approximately 100 of these minima. The local minima are important in answering the question that if the natural images are at the critical point of T=1 what will happen to system at T=0?, and what order parameter can be defined for this phase transition?

Most of the local minima are representing an edge between black and white regions or a strip. This is consistent with the visual system that the response triggered averages of neurons in response to natural scenes has a prominent edge detection component.