Archive for October, 2008

Information Maximization

October 21, 2008

Taken from: “An Information Maximization Approach to Blind Separation and Blind Deconvolution” by A.J. Bell and T.J. Sejnowski

The idea is to maximize the mutual information between the output Y of a neural network and its input X. Mutual information is defined as I(Y,X)= H(Y) - H(Y | X).

There are divergences due to the continuous variables. As a result it is better to work with gradient of mutual information with respect to some parameter, w. They considered the systems that have additive noise, N,  where H(Y | X)= H(N). Therefore to maximize the mutual information it is sufficient to maximize entropy of the output. Assuming that the input probability distribution is f_{x}(x) then probability distribution of the output is

f_{y}(y)= {f_{x}(x) \over |\partial y /\partial x |}

the entropy in the output can be written as

H(y) = E \left[ \ln \left| {\partial y \over \partial x} \right| \right] -E \left[ \ln f_{x} (x) \right]

A stochastic gradient descent learning rule can be implemented as

\Delta w\propto {\partial H(y)\over \partial w}= \left({\partial y\over \partial x} \right)^{-1}

In the case of logistic transfer function y= 1/ (1+ e^{-(wx+w_0)} or tanh function y= \tanh(wx + w_0). The learning rule has two component an Anti-Hebbian, \Delta w \propto -2 x y, and an Anti-Decay component, \Delta w \propto 1/w.

Multi-component network:

This result can be generalized to multi-component network, where {\mathbf y}= g({\mathbf W x +w_{0}}). The result will be

\Delta w_{ij} \propto {\mathrm {cof} \, w_{ij} \over \det \bf{W} } + x_i (1 - 2 y_j)

Casual Filters:

Furthermore it can be generalized to casual filters where y(t)=g[u(t)]=g[w(t) * x(t) ], or equivalently {\mathbf Y} =g({\mathbf W X} ). Now {\mathbf W} is lower triangular matrix.

W=\left(\begin{array}{lccccc} w_L & 0 &\cdots & 0 & & 0\\ w_{L-1} & w_L &0& \cdots& &0\\ \vdots & & & & &\vdots\\ w_1&\cdots & w_L & &\cdots &0\\ 0 & w_1 & \cdots & w_L & \cdots &0\\ \vdots & & & & & \vdots\\ 0 & \cdots & & w_1& \cdots& w_L\\ \end{array}\right)

The probability distribution function of X and Y are f_{Y}(Y)= f_{X}(X)/|J|

J=\det \left[ \partial y(t_i) \over \partial x(t_{j}) \right]=(\det W) \prod_{t=1}^{M} {\partial y(t) \over \partial u(t)}

The gradient descend learning rule is

\Delta w_L \propto \sum_{t=1}^M \left( {1 \over w_L} - 2 x_t y_{t} \right)
\Delta w_{L-j} \propto \sum_{t=1}^M \left( - 2 x_{t-j} y_{t}\right)

The delay weights w_{t-j} keep shrinking and try to de-correlate the past signal from future. It is very interesting to see what kind of w will be created depending on the input statistics.

Time Delay:

y(t)=g[w\, x(t-d)]

following the same set of steps we find

\Delta d \propto 2 w {\partial x \over \partial t} y

Generalizing the transfer function:

{dy\over d u}= y^p (1 - y)^r

\Delta w\propto {1\over w} +x[p(1-y) - ry]

\Delta w_0 \propto p(1-y )-ry

Blind Separation: A set of source signals, s_1(t), s_2 (t), \ldots, s_n(t) is mixed together linearly, with a matrix {\mathbf A}. We ought to find a square matrix {\mathbf W} that is inverse of {\mathbf A} up to a permutation. The problem reduces to minimize the mutual information between the inputs(Independent Component Analysis). Obviously, if the matrix {\mathbf A} is singular one can not solve the problem.

Blind De-convolution: A signal s(t) is corrupted with a linear filter, a(t)x(t)= [a(t)* s(t)]. We have to find a filter w_1, w_2 , \ldots , w_L to recover s(t) from x(t). The problem reduces to remove statistical dependency across time (Whitening of x(t)).

The fundamental question is how can we reduce mutual information between two outputs by information maximization. The joint entropy of two output signals is

H(y_1,y_2)=H(y_1) + H(y_2) -I(y_1, y_2)

The algorithm discussed above maximize H(y_1,y_2) and this is mostly done by minimizing I(y_1,y_2). When I(y_1,y_2) is zero the probability distribution of y’s is separable. Can we introduce a constraint and keep the total \sum H(y_i) conserved?

It was briefed up to section 5.

Accuracy vs. Correlated Noise

October 16, 2008

Taken from: “The Effect of Correlated Variability on the Accuracy of a Population Code” by L.F. Abbott and P. Dayan

They answer to the following questions for three different class of models: “(1) Does correlation increase or decrease the accuracy with which the value of an encoded quantity can be extracted from a population of N neurons? (2) Does this accuracy approach a fixed limit as N increases?”

The average firing rate is noted as f_i(s)=\langle r_i(s) \rangle. The correlation matrix Q_{ij}(s)= \langle (r_i (s)- \langle r_i (s)\rangle)(r_j(s) - \langle r_j (s)\rangle ) \rangle is used to calculate Fisher information I_{\mathrm F}.  The Fisher information can be calculated assuming Gaussian character of correlation

P[\mathbf {r}|s] = {\cal N} \exp \left[-{1\over 2}\sum_{i,j}(r_i - f_i)Q_{ij}^{-1} (r_j - f_j) \right]

I_{\mathrm F}(s)=\sum_{i,j} {d f_{i}\over ds} Q^{-1}_{ij} {d f_{j}\over ds} + {1\over 2}\sum_{i,j,k,l} {d Q_{ij} \over ds} Q_{jk}^{-1} {dQ_{kl} \over ds } Q_{li}^{-1}
I_{\mathrm F}(s)= {d \mathbf{f}^{\mathrm T} \over ds} \mathbf{Q}^{-1} {d\mathbf{f} \over ds}+ {1\over 2} \mathbf {Tr} \left[{d\mathbf{Q} \over ds} \mathbf{Q}^{-1}{d \mathbf{Q}\over ds}\mathbf{Q}^{-1} \right]

The models being studied are:

Additive Noise Model:

Q_{ij}= \sigma^2 \left[ \delta_{ij} + c(1 - \delta_{ij})\right]

A nice example of collective quantities R= {1\over N} \sum r_i and \tilde R={1\over N} \sum (-1)^i r_i and the N dependence of their respective variance, \sigma^2_R and \sigma^2_{\tilde R}, is presented.

\lim _{N\rightarrow \infty} I_{\mathrm F}= {N [F_1(s)- F_{2}(s)]\over \sigma^2 (1-c)}

where F_{1}(s)= {1\over N} \sum_i \left(d f_i(s) \over ds\right)^2 and F_{2}(s)=\left( {1\over N} \sum_i {d f_i(s) \over ds} \right)^2. It worths mentioning that when F_1=F_2 the Fisher information fails to grow linearly. This will put a constraint on the individual neurons tuning curves among the population

F_1=F_2 \Rightarrow f_i(s)=p(s)+q_i

Multiplicative Noise Model:

Q_{ij}(s)= \sigma^2 \left[ \delta_{ij} + c(1-\delta_{ij})\right] f_i (s) f_{j}(s)

\lim_{N \rightarrow \infty}I_{\mathrm F}= {N [G_1(s)- G_{2}(s)] \over \sigma^2 (1-c)} +{N [(2-c)G_1(s)-c G_{2}(s)]\over (1-c)}

Limited-Range Correlation Model:

Q_{ij}=\sigma^2 \rho^{|i-j|}

I_{\mathrm F}={N(1-\rho)F_{1}(s)\over\sigma^2 (1+\rho)} +O(N^{1-2/D})

D is the number of encoded variables. “The main conclusion of this paper is that neurons should have different selectivity to the quantities they are encoding. In particular their tuning curves should not be additively or multiplicatively separable”.

Potential Energy Landscapes

October 14, 2008

The configuration space of any system with conserved energy (time-independent Hamiltonian) can be broken into subspaces surrounding potential energy minima, which the minima will be reached by the steepest descent path of the potential. This hyper-volume around a potential minimum is called basin. The structure based on these neighboring basins is called as the “Inherent Structure”. The statistical behavior of a complex system can be studied by analyzing the characteristics of this structure. The partition function of a system of N particles can be written as

Z(\Omega,N,T) = {1\over N! \lambda^{3N}} \int_{\Omega} d^{3N}x e^{-\beta V(x)}

We characterize each basin by its minimum potential E_i. The total volume \Omega is partitioned by these basins and so does the above integral.

Z(\Omega,N,T) ={1\over N! \lambda^{3N}} \sum_i \int_{\omega_i} d^{3N}x e^{-\beta V(x)}

Defining an average free energy for basins f_{b}(E_i,T,\Omega)

-\beta f_{b}(E_i,T,\Omega)=\ln \langle {\int_{\omega_i} d^{3N}x e^{-\beta V(x)} \over \lambda^{3N}} \rangle

Now if we know the probability distribution of energy minima \rho(E) the partition function can be written easily as,

Z(\Omega,N,T)=\int dE \rho(E) e^{-\beta f_{b}(E,T,\Omega)}

General Properties of PEL’s:

The number of total minima is exponentially depends on the number of underlying degrees of freedom (e.g. number of atoms),

N_{min} \sim e^{N_{deg}}

The energy distribution of minima follows Guassian distribution. This can be understood through central limit theorem for a large system with independent subsystems.

P(E) = {\cal{N}} e^{-(E-E_{0})^2/2 \sigma_e^2}

Also on average an exponential relation between the minimas’ energy and their corresponding volume was observed

A \sim e^{-\alpha E}

The probability distributions, P(A), of the volume, A, of these basins for various dynamical systems (Binary Lennard-Jones, Dzugutov Liquids and Amorphous Silicon) show power law behavior

P(A) \sim A^{-2}

There is a similarity between this phenomenon and “Apollonian Packing” of an arbitrary volume, where in the limit of large dimensionality of the target space obeys the same power law. Moreover, the network consist of connecting neighboring minima is a scale free network, which means

There is a strong correlation (a power law) between the volume of the basin and its number of neighbors, k_i,

k_i \sim (A_i)^\beta

This relation subsequently dictates a relation between the minimum energy and number of neighbors.

As a reference look through the papers by J. P. K. Doye and C. P. Massen.

Fluctuations in Network Dynamics

October 11, 2008

Taken from: “Fluctuations in Network Dynamics”, by M. Argollo de Menezes and A.-L. Barabasi

The flow through a node in a network is time dependent. This time dependence can be partly be described by mean flow, f_{\mathrm i}(t) and the fluctuation around this mean \sigma_{\mathrm i}(t).
In most natural networks, random, scale-free or small-world there is a functional dependence between mean and fluctuation of the flow through the nodes. In this paper it is claimed that the relation between f and \sigma in different networks (only scale-free and random network) fall into two different categories and characterized by their \alpha-exponent

\sigma \sim \langle f \rangle^{\alpha}

In all the networks they studied \alpha is equal to 1/2 or 1. In the networks with \alpha=1/2 the internal noise is responsible for flow fluctuations, while it was claimed that in networks with \alpha=1 the external noise is responsible for the fluctuations. Two different models are proposed.

Model 1: At any time step, W number of walkers are placed randomly on the network nodes, the preform M step walks.

Model 2: At any time step W random pairs of nodes are selected and they are connected through the shortest path between them (degeneracy problem is not discussed).

In both cases we observe \alpha=1/2 if W is fixed and \alpha=1 if W has large fluctuations. In general the fluctuations on a given nodes can be decompose into internal and external components
\sigma_{\mathrm i}^2=(\sigma_{\mathrm i}^{\mathrm {int}})^2 +(\sigma_{\mathrm i}^{\mathrm {ext}})^2
\sigma_{\mathrm i}^2 = a_{\mathrm i}^2 \langle f_{\mathrm i} \rangle +\left[ {\sigma_{\mathrm {dr}}\over \langle W(t)\rangle } \langle f_{\mathrm i} \rangle \right]^2

where \sigma_{\mathrm{dr}}=\sigma_{\mathrm{dr}}(\Delta W), represent the external driving force in the noise magnitude. Moreover, by increasing \Delta W a transition form \alpha =1 /2 to \alpha=1 behavior can be seen (some issues regarding the fitting process should be considered.).

In conclusion: “The \alpha=1/2 captures an endogenous behavior determined by the system’s internal fluctuations” while “The \alpha=1 exponent describes driven systems, in which the fluctuations of individual nodes are dominated by the time dependent changes in the external driving forces.”

Optimal Sampling of Natural Images

October 10, 2008

This post is taken from the following paper,

“Optimal sampling of Natural Images: A design Principle for the Visual Systems?” by W. Bialek, D.L. Ruderman and A. Zee

If the natural scene images can be characterized by \phi(\mathbf {x}) and the cells are indexed by integers n and positioned respectively at \mathbf{x}_n. The output of each cell, Y_n, is a linear filter (receptive field) with an additional noisy component

Y_n=\int d^2 \mathbf{x}F( \mathbf{x}-\mathbf{x}_n)\phi(\mathbf{x})+\eta_n

The goal is to find a kernel F that maximze the information content of output, Y_n, about the input, \phi. The information content of Y_n assuming that \phi(\mathbf{x}) is Gaussian is

I={1\over 2\ln 2}Tr\ln\left[ \delta_{nm} + {1 \over (2 \pi)^2 \sigma^2} \int d^2\mathbf{k} e^{i \mathbf{k}(\mathbf{x}_n -\mathbf{x}_m)} |\tilde{F}(\mathbf{k})|^2 S(\mathbf{k})\right]

here S(\mathbf{k}) is the power spectrum of the signal. We can approximate I in large noise regime as

I \approx {N \over 2 (2\pi)^2\sigma^2 \ln 2} \int d^2\mathbf{k} |\tilde{F}(\mathbf{k})|^2 S(\mathbf{k})

We need to put extra constraints to get solutions that are physically realistic

1) We should fix the filters gain,

\int d^2 \mathbf{k} |\tilde{F} (\mathbf{k}) |^2 =1

2) There should be a cost for “long-range interactions in spatial space” or “sharp fluctuations in momentum space”,

C = \alpha \int d^2 \mathbf{k} \mathbf{k}^2 |\tilde{F}(\mathbf{k}) |^2

Using variational methods they found that filter should satisfy the schrodinger like equation in \mathbf{k}-space

-{\alpha\over 2}{\nabla_k}^2\tilde{F}(k)-{1\over 2 \sigma^2\ln 2} S(k)\tilde{F}(k)=\Lambda \tilde{F}(k)

the {h}^2/\alpha playes the role of the mass M and - S(k)/ ( 2 \sigma^2 \ln 2 ) of potential V in the schrodinger equation.

In the case that they are interested the power spectrum of natural images has a power law S(k)\sim 1/k^2 and as a result an accidental symmetry. This accidental symmetry allows them to recombine various angular momentum eigenstates and create orientation selective eigenstates (filters).

What interests me the most is that for any linear filter this formalism holds and, the power spectrum of the input signal appears as the potential of the schrodinger equation.

Statistical Mechanics of Four Letter Words

October 6, 2008

Taken from the paper by G.J. Stephens and W. Bialek.

There are 26^4=456976 possible four letter words, the total entropy for such a sample is S_{0}= 4 \log_2 (26)=18.802 bits. Out of this large number of possibilities only a small fraction of 1/3700 is in use in English language. By looking at large database of American literature, \sim 800 words and their frequency of occurrence is found (the database had \sim 2\times 10^7 words all the four letter words which they have appeared more than 100 times are considered.). The entropy of this set is approximately S_{full} = 6.92\pm 0.003. The paper is trying to regenerate legal words using statistical properties of their building blocks (the letters).

Independent Letters Approximation:

Not all the letters appear with the same frequencies. Considering this point the probability of having a four letter words is P(l_1, l_2, l_3, l_4)= \prod_i l_i. The entropy of a such an ensemble is S_{ind} = 14.083 \pm 0.001 bits.

Pairwise Interaction Approximation:

We assume that P(l_1,l_2,l_3,l_4) can be written as a Boltzmann factor of pairwise interacting between different letters

P^{(2)}(l_1,l_2,l_3,l_4)= {1 \over Z} \exp \left[ - \sum_{i,j} V_{i,j}(l_i,l_j) \right]

V_{i,j}(l_i,l_j) are the pairwise potential and have 6 \times (26^2 -1)=4050 free parameters. They can be determined by solving 6 \times 26^2 coupled equations (pairwise marginal distribution).

\rho_{i_1,i_2}(l_1,l_2) = \sum^{\prime} P^{(2)}(k_1,k_2,k_3,k_4)

The ensemble built in this construction has an entropy of 7.48 bits. The pairwise potentials V_{i,j} define an energy landscape. There are 136 local minima in this landscape (changing any letter increases the energy) out of them 118 are real English words, that capture 63.5 of the probability distribution.

The induced probability for each real word using pairwise interactions P_2 reproduce the observed probability of that word P_{full} especially for frequent words (The plot of P_2 vs P_{full} is almost diagonal).

The independent letter approximation reduced the number of possibilites by a factor of \sim 1/26, a further reduction factor of \sim 1/100, the higher order interaction only contribute for a factor of 1/1.5.

Zipf’s Law:

Plotting the probability vs the rank of different words shows an approximate power law.The four letter words have a cut off tail. The pairwise model removes some weight from the bulk of the distribution and reassigns it to the tail.