Information Maximization

By Peyman Khorsand

Taken from: “An Information Maximization Approach to Blind Separation and Blind Deconvolution” by A.J. Bell and T.J. Sejnowski

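The idea is to maximize the mutual information between the output Y of a neural network and its input X. Mutual information is defined as I(Y,X)= H(Y) - H(Y | X), where H(Y) is the entropy of the output and H(Y | X) is the part of that entropy that does not come from the input.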

For continuous variables these entropies diverge, so it is better to work with the gradient of the mutual information with respect to some parameter w. Bell and Sejnowski consider systems with additive noise N, for which H(Y | X)= H(N). Since H(N) does not depend on w, \partial I(Y,X) / \partial w = \partial H(Y) / \partial w, and to maximize the mutual information it is sufficient to maximize the entropy of the output. For an invertible map from x to y, if the input probability density is f_{x}(x), then the probability density of the output is

f_{y}(y)= {f_{x}(x) \over |\partial y /\partial x |}

The entropy of the output, H(y) = -E[\ln f_{y}(y)], can then be written as

H(y) = E \left[ \ln \left| {\partial y \over \partial x} \right| \right] -E \left[ \ln f_{x} (x) \right]
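
As a rough numerical illustration of this decomposition, the sketch below estimates H(y) by Monte Carlo for a logistic unit y = 1/(1+e^{-(wx+w_0)}) driven by a Gaussian input. The function name, the Gaussian input and the two weight values are assumptions made for the example, not taken from the paper.

```python
import math
import random

def output_entropy(w, w0, mean, std, n_samples=20000, seed=0):
    """Monte Carlo estimate of H(y) = E[ln|dy/dx|] - E[ln f_x(x)] for a logistic unit."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean, std)
        u = w * x + w0
        # ln|dy/dx| = ln|w| + ln y + ln(1 - y) = ln|w| - |u| - 2 ln(1 + e^{-|u|})
        total += math.log(abs(w)) - abs(u) - 2.0 * math.log1p(math.exp(-abs(u)))
    # -E[ln f_x(x)] for a Gaussian input is its differential entropy
    h_x = 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)
    return total / n_samples + h_x

# Same input samples (same seed), two different slopes (hypothetical values)
h_matched = output_entropy(w=1.75, w0=0.0, mean=0.0, std=1.0)
h_shallow = output_entropy(w=0.2, w0=0.0, mean=0.0, std=1.0)
print(h_matched, h_shallow)
```

A slope that spreads the outputs over most of the interval (0, 1) gives a clearly higher output entropy than a slope that squeezes them around 0.5, which is the quantity the learning rule climbs.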

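Only the first term depends on w, so a stochastic gradient ascent learning rule can be implemented as

\Delta w\propto {\partial H(y)\over \partial w}= {\partial \over \partial w} \ln \left| {\partial y\over \partial x} \right| = \left({\partial y\over \partial x} \right)^{-1} {\partial \over \partial w} \left({\partial y\over \partial x} \right)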

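For the logistic transfer function y= 1/ (1+ e^{-(wx+w_0)}) this gives \Delta w \propto 1/w + x(1-2y) and \Delta w_0 \propto 1-2y. For the tanh function y= \tanh(wx + w_0) it gives \Delta w \propto 1/w - 2 x y and \Delta w_0 \propto -2y. In both cases the learning rule has two components: an anti-Hebbian term, x(1-2y) or -2 x y, and an anti-decay term, 1/w, which keeps w away from zero.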
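
A minimal sketch of the single-unit rule for the logistic case, using the updates \Delta w \propto 1/w + x(1-2y) and \Delta w_0 \propto 1-2y from the paper. The learning rate, the Gaussian input and the function name are assumptions chosen for illustration.

```python
import math
import random

def train_logistic_unit(samples, lr=0.005, w=1.0, w0=0.0):
    """Online infomax updates for y = 1 / (1 + exp(-(w*x + w0)))."""
    for x in samples:
        y = 1.0 / (1.0 + math.exp(-(w * x + w0)))
        # anti-decay term 1/w plus anti-Hebbian term x(1 - 2y)
        dw = 1.0 / w + x * (1.0 - 2.0 * y)
        dw0 = 1.0 - 2.0 * y
        w += lr * dw
        w0 += lr * dw0
    return w, w0

rng = random.Random(0)
data = [rng.gauss(0.5, 1.0) for _ in range(40000)]  # hypothetical input: mean 0.5, std 1
w, w0 = train_logistic_unit(data)
threshold = -w0 / w  # input value at which the sigmoid is steepest
print(w, w0, threshold)
```

With a symmetric input density the steepest part of the sigmoid should settle near the input mean, and the slope should scale with the inverse of the input spread.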

Multi-component network:

This result can be generalized to a multi-component network, where {\mathbf y}= g({\mathbf W x +w_{0}}) and w_{ij} is the weight from input x_j to output y_i. For the logistic transfer function the result is

\Delta w_{ij} \propto {\mathrm {cof} \, w_{ij} \over \det \bf{W} } + x_j (1 - 2 y_i)
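
The update can be sketched for a 2x2 logistic network as follows, reading w_{ij} as the weight from input j to output i, so that the anti-Hebbian term is x_j (1 - 2 y_i). The function name and the bias update 1 - 2 y_i (the single-unit bias rule applied per output) are assumptions of this example.

```python
import math

def infomax_delta(W, w0, x):
    """Return (dW, dw0) for a 2x2 logistic network y = g(W x + w0)."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    # cofactors of a 2x2 matrix; cof[i][j] / det is entry (i, j) of (W^T)^{-1}
    cof = [[W[1][1], -W[1][0]],
           [-W[0][1], W[0][0]]]
    u = [W[i][0] * x[0] + W[i][1] * x[1] + w0[i] for i in range(2)]
    y = [1.0 / (1.0 + math.exp(-ui)) for ui in u]
    dW = [[cof[i][j] / det + x[j] * (1.0 - 2.0 * y[i]) for j in range(2)]
          for i in range(2)]
    dw0 = [1.0 - 2.0 * y[i] for i in range(2)]
    return dW, dw0

# With zero input and zero bias the anti-Hebbian term vanishes,
# leaving only the anti-decay term cof w_ij / det W.
dW, dw0 = infomax_delta([[2.0, 0.0], [0.0, 4.0]], [0.0, 0.0], [0.0, 0.0])
dW2, _ = infomax_delta([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], [0.0, 0.0])
print(dW, dw0, dW2)
```

The cofactor term is the matrix generalization of 1/w: it grows as \det {\mathbf W} approaches zero and so pushes the weight matrix away from singular configurations.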

Causal Filters:

Furthermore, it can be generalized to causal filters, where y(t)=g[u(t)]=g[w(t) * x(t) ], or equivalently {\mathbf Y} =g({\mathbf W X} ). Now {\mathbf W} is a lower triangular matrix:

W=\left(\begin{array}{lccccc} w_L & 0 &\cdots & 0 & & 0\\ w_{L-1} & w_L &0& \cdots& &0\\ \vdots & & & & &\vdots\\ w_1&\cdots & w_L & &\cdots &0\\ 0 & w_1 & \cdots & w_L & \cdots &0\\ \vdots & & & & & \vdots\\ 0 & \cdots & & w_1& \cdots& w_L\\ \end{array}\right)

The probability density functions of X and Y are related by f_{Y}(Y)= f_{X}(X)/|J|, where the Jacobian is

J=\det \left[ \partial y(t_i) \over \partial x(t_{j}) \right]=(\det W) \prod_{t=1}^{M} {\partial y(t) \over \partial u(t)}

The gradient ascent learning rule, written here for the tanh transfer function, is

\Delta w_L \propto \sum_{t=1}^M \left( {1 \over w_L} - 2 x_t y_{t} \right)
\Delta w_{L-j} \propto \sum_{t=1}^M \left( - 2 x_{t-j} y_{t}\right)

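Only the lead weight w_L has an anti-decay term. The delayed weights w_{L-j} are driven purely by anti-Hebbian terms, which try to de-correlate the current output y_t from the past inputs x_{t-j}. It is very interesting to see what kind of w will be created depending on the input statistics.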
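
The two batch sums can be sketched as below for a tanh unit. Storing the filter as a list `taps`, with taps[0] standing for w_L and taps[j] for w_{L-j}, and treating inputs before the start of the record as zero, are conventions assumed for this example.

```python
import math

def causal_filter_gradient(taps, x):
    """Batch infomax gradient for y_t = tanh(sum_j taps[j] * x[t - j]).

    taps[0] plays the role of the lead weight w_L and taps[j] of w_{L-j}.
    """
    grad = [0.0] * len(taps)
    for t in range(len(x)):
        # causal convolution: only the current and past inputs contribute
        u = sum(taps[j] * x[t - j] for j in range(len(taps)) if t - j >= 0)
        y = math.tanh(u)
        grad[0] += 1.0 / taps[0] - 2.0 * x[t] * y  # anti-decay + anti-Hebbian
        for j in range(1, len(taps)):
            if t - j >= 0:
                grad[j] += -2.0 * x[t - j] * y     # anti-Hebbian only
    return grad

g = causal_filter_gradient([1.0, 0.5], [1.0, 0.0, -1.0])
print(g)
```

A learning step would add a small multiple of this gradient to the taps; the step size and the number of passes over the data are left open here.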

Time Delay:

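A time delay d in the input can be learned in the same way. For

y(t)=g[w\, x(t-d)]

with the tanh transfer function, and with x and its time derivative evaluated at t-d, following the same set of steps we find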

\Delta d \propto 2 w {\partial x \over \partial t} y
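
The delay rule can be checked numerically against a finite difference of \ln|\partial y / \partial x| for y = \tanh(w x(t-d)). The test signal x(t) = \sin(t) and the parameter values below are arbitrary choices for the check.

```python
import math

def log_slope(d, w, t):
    """ln|dy/dx| for y = tanh(w * x(t - d)) with the test signal x(t) = sin(t)."""
    y = math.tanh(w * math.sin(t - d))
    return math.log(abs(w)) + math.log(1.0 - y * y)

def delay_rule(d, w, t):
    """Closed-form gradient 2 * w * (dx/dt) * y, with dx/dt taken at t - d."""
    y = math.tanh(w * math.sin(t - d))
    return 2.0 * w * math.cos(t - d) * y

w, t, d, eps = 0.8, 1.3, 0.4, 1e-6
# central finite difference of the log-slope with respect to the delay
numeric = (log_slope(d + eps, w, t) - log_slope(d - eps, w, t)) / (2.0 * eps)
analytic = delay_rule(d, w, t)
print(numeric, analytic)
```

The two values should agree to within the finite-difference error.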

Generalizing the transfer function:

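The transfer function y = g(u), with u = wx + w_0, can be defined more generally through its slope,

{dy\over d u}= y^p (1 - y)^r

for which the learning rules are given as

\Delta w\propto {1\over w} +x[p(1-y) - ry]

\Delta w_0 \propto p(1-y )-ry

For p = r = 1 these reduce to the logistic rules \Delta w \propto 1/w + x(1-2y) and \Delta w_0 \propto 1-2y.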
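
The chain-rule step behind these rules can be sketched as follows, assuming y = g(wx + w_0), so that \partial y / \partial x = w \, y^p (1-y)^r and \partial y / \partial w = x \, y^p (1-y)^r. This is a check worked out from the definitions, not a quotation from the paper.

\ln \left| {\partial y \over \partial x} \right| = \ln |w| + p \ln y + r \ln (1-y)

{\partial \over \partial w} \ln \left| {\partial y \over \partial x} \right| = {1 \over w} + \left( {p \over y} - {r \over 1-y} \right) {\partial y \over \partial w} = {1 \over w} + x \, y^{p-1} (1-y)^{r-1} [p(1-y) - ry]

Carried out this way, the bracketed term picks up the factor y^{p-1} (1-y)^{r-1}, which is positive for 0 < y < 1 and equals 1 when p = r = 1. In that case the expression is exactly the logistic rule 1/w + x(1-2y); for other exponents the factor rescales the anti-Hebbian term without changing its sign.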

Blind Separation: A set of source signals, s_1(t), s_2 (t), \ldots, s_n(t), is mixed together linearly by a matrix {\mathbf A}. We want to find a square matrix {\mathbf W} that is the inverse of {\mathbf A} up to a permutation and rescaling. The problem reduces to removing the mutual information that the mixing introduces between the inputs, so that the outputs are statistically independent (Independent Component Analysis). Obviously, if the matrix {\mathbf A} is singular, the problem cannot be solved.

Blind De-convolution: A signal s(t) is corrupted by a linear filter a(t), giving x(t)= a(t)* s(t). We have to find a filter w_1, w_2 , \ldots , w_L that recovers s(t) from x(t). The problem reduces to removing statistical dependencies across time (whitening of x(t)).

The fundamental question is: how can we reduce the mutual information between two outputs by information maximization? The joint entropy of two output signals is

H(y_1,y_2)=H(y_1) + H(y_2) -I(y_1, y_2)

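The algorithm discussed above maximizes H(y_1,y_2). In most cases this is achieved by minimizing I(y_1,y_2), but that is not guaranteed, because the individual entropies H(y_1) and H(y_2) can change at the same time. When I(y_1,y_2) is zero, the joint probability distribution of the y's factorizes. Can we introduce a constraint that keeps the total \sum H(y_i) conserved?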
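
The decomposition of the joint entropy can be illustrated with a discrete analogue, where all quantities are finite. The two 2x2 joint tables below are made-up examples with identical marginals, so \sum H(y_i) is the same in both and only the mutual information differs.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def joint_terms(joint):
    """Return H(y1, y2), H(y1), H(y2) and I(y1, y2) for a joint table."""
    p1 = [sum(row) for row in joint]        # marginal of y1
    p2 = [sum(col) for col in zip(*joint)]  # marginal of y2
    h12 = entropy([p for row in joint for p in row])
    h1, h2 = entropy(p1), entropy(p2)
    # mutual information computed directly from its definition
    mi = sum(p * math.log(p / (p1[i] * p2[j]))
             for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0.0)
    return h12, h1, h2, mi

dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]
print(joint_terms(dependent))
print(joint_terms(independent))
```

With the marginal entropies held fixed, the joint entropy is largest for the table whose mutual information is zero, which is the situation the closing question asks about.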

This summary covers the paper up to Section 5.
