Taken from: “An Information Maximization Approach to Blind Separation and Blind Deconvolution” by A.J. Bell and T.J. Sejnowski
The idea is to maximize the mutual information between the output $Y$ of a neural network and its input $X$. Mutual information is defined as
$$I(Y, X) = H(Y) - H(Y|X).$$
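Because the variables are continuous, the entropies involved can diverge. As a result it is better to work with the gradient of the mutual information with respect to some parameter, $w$. They considered systems that have additive noise,
$$y = G(x) + N,$$
where $H(Y|X) = H(N)$ does not depend on $w$, so that
$$\frac{\partial}{\partial w} I(Y, X) = \frac{\partial}{\partial w} H(Y).$$
Therefore, to maximize the mutual information it is sufficient to maximize the entropy of the output. Assuming that the input probability distribution is $f_x(x)$ and that $G$ is monotonic, the probability distribution of the output is
$$f_y(y) = \frac{f_x(x)}{|\partial y / \partial x|},$$
and the entropy of the output can be written as
$$H(y) = -E[\ln f_y(y)] = E\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] - E[\ln f_x(x)].$$
Only the first term depends on $w$, so a stochastic gradient ascent learning rule can be implemented as
$$\Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w} \ln\left|\frac{\partial y}{\partial x}\right| = \left(\frac{\partial y}{\partial x}\right)^{-1} \frac{\partial}{\partial w}\left(\frac{\partial y}{\partial x}\right).$$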
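In the case of a logistic transfer function, $y = 1/(1 + e^{-u})$ with $u = wx + w_0$, the rule becomes
$$\Delta w \propto \frac{1}{w} + x(1 - 2y), \qquad \Delta w_0 \propto 1 - 2y,$$
and for the tanh function, $y = \tanh(u)$,
$$\Delta w \propto \frac{1}{w} - 2xy, \qquad \Delta w_0 \propto -2y.$$
The learning rule has two components: an anti-Hebbian term, $x(1 - 2y)$ (or $-2xy$ for tanh), and an anti-decay term, $1/w$.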
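As a minimal sketch of the single-unit rule, the following Python snippet runs batch gradient ascent with $\Delta w \propto 1/w + x(1 - 2y)$ and $\Delta w_0 \propto 1 - 2y$ for a logistic unit $y = 1/(1 + e^{-(wx + w_0)})$. The Gaussian input sample, the learning rate, the number of steps, and the function name `train_single_unit` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def train_single_unit(x, w=1.0, w0=0.0, lr=0.1, n_steps=5000):
    """Batch gradient ascent on the output entropy of y = logistic(w*x + w0)."""
    for _ in range(n_steps):
        y = 1.0 / (1.0 + np.exp(-(w * x + w0)))
        # 1/w is the anti-decay term, x*(1 - 2y) the anti-Hebbian term.
        grad_w = np.mean(1.0 / w + x * (1.0 - 2.0 * y))
        grad_w0 = np.mean(1.0 - 2.0 * y)
        w += lr * grad_w
        w0 += lr * grad_w0
    return w, w0

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=2000)  # hypothetical input sample
w, w0 = train_single_unit(x)
y = 1.0 / (1.0 + np.exp(-(w * x + w0)))
print("w =", w, "w0 =", w0, "mean output =", y.mean())
```

At a stationary point of the bias rule the mean of $1 - 2y$ vanishes, so the mean output sits at 0.5, which is what one would expect if the sigmoid is centered on the input distribution.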
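Multi-component network:
This result can be generalized to a multi-component network, where $\mathbf{y} = g(\mathbf{W}\mathbf{x} + \mathbf{w}_0)$. For the logistic transfer function the result is
$$\Delta \mathbf{W} \propto [\mathbf{W}^T]^{-1} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T, \qquad \Delta \mathbf{w}_0 \propto \mathbf{1} - 2\mathbf{y}.$$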
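The multi-component rule for logistic units, $\Delta \mathbf{W} \propto [\mathbf{W}^T]^{-1} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T$, is the gradient of $\ln|\det \mathbf{W}| + \sum_i \ln(y_i(1 - y_i))$. The sketch below checks this against a finite-difference gradient. The random data, the matrix size, and the function names are hypothetical choices for illustration only.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_jacobian(W, w0, X):
    """Sample average of ln|J| = ln|det W| + sum_i ln(y_i * (1 - y_i))."""
    Y = logistic(X @ W.T + w0)
    return np.log(abs(np.linalg.det(W))) + np.mean(np.sum(np.log(Y * (1.0 - Y)), axis=1))

def infomax_gradient(W, w0, X):
    """Analytic gradient: inv(W^T) + <(1 - 2y) x^T> and <1 - 2y>."""
    Y = logistic(X @ W.T + w0)
    grad_W = np.linalg.inv(W).T + (1.0 - 2.0 * Y).T @ X / len(X)
    grad_w0 = np.mean(1.0 - 2.0 * Y, axis=0)
    return grad_W, grad_w0

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                  # rows are input vectors x
W = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
w0 = 0.1 * rng.normal(size=3)

grad_W, grad_w0 = infomax_gradient(W, w0, X)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (log_jacobian(W + dW, w0, X) - log_jacobian(W - dW, w0, X)) / (2 * eps)

max_error = np.max(np.abs(grad_W - numeric))
print("largest deviation between analytic and numeric gradient:", max_error)
```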
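Causal Filters:
Furthermore, it can be generalized to causal filters, where $y(t) = g(u(t)) = g(w(t) * x(t))$, or equivalently, for a time series of length $M$ written as a vector,
$$Y = g(U) = g(WX).$$
Now $W$ is a lower triangular matrix with the leading weight $w_L$ on its diagonal.
The probability distribution functions of $Y$ and $X$ are related by
$$f_Y(Y) = \frac{f_X(X)}{|J|}, \qquad |J| = (\det W) \prod_{t=1}^{M} |y'(t)|.$$
For the tanh transfer function, the gradient ascent learning rule is
$$\Delta w_L \propto \sum_{t=1}^{M} \left(\frac{1}{w_L} - 2 x_t y_t\right), \qquad \Delta w_{L-j} \propto \sum_{t=j+1}^{M} \left(-2 x_{t-j} y_t\right),$$
where $w_{L-j}$ with $j > 0$ are the delay weights.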
The delay weights keep shrinking and try to decorrelate the past signal from the future one. It is very interesting to see what kind of filter $W$ will be created depending on the input statistics.
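The causal-filter picture can be sketched numerically. The snippet below builds the lower triangular matrix for a short filter, confirms that multiplying by it matches a causal convolution and that $\ln|\det W| = M \ln|w_L|$, and then compares the tanh rule, $\Delta w_L \propto \sum_t (1/w_L - 2 x_t y_t)$ and $\Delta w_{L-j} \propto \sum_t (-2 x_{t-j} y_t)$, against a finite-difference gradient of $\ln|J|$. The array `h` holds the taps with `h[0]` standing for the leading weight $w_L$ and `h[j]` for $w_{L-j}$. The tap values, signal length, and function names are assumptions for illustration.

```python
import numpy as np

def filter_matrix(h, M):
    """Lower triangular Toeplitz matrix with tap h[j] on the j-th subdiagonal."""
    W = np.zeros((M, M))
    for j, tap in enumerate(h):
        W += tap * np.eye(M, k=-j)
    return W

def log_jacobian(h, x):
    """ln|J| = M ln|w_L| + sum_t ln(1 - y_t^2) for y = tanh(h * x)."""
    y = np.tanh(np.convolve(h, x)[:len(x)])
    return len(x) * np.log(abs(h[0])) + np.sum(np.log(1.0 - y ** 2))

def filter_gradient(h, x):
    """Analytic gradient of ln|J| with respect to each tap (tanh units)."""
    M = len(x)
    y = np.tanh(np.convolve(h, x)[:M])
    grad = np.empty(len(h))
    grad[0] = np.sum(1.0 / h[0] - 2.0 * x * y)          # leading weight
    for j in range(1, len(h)):
        grad[j] = np.sum(-2.0 * x[:M - j] * y[j:])      # delay weights
    return grad

rng = np.random.default_rng(2)
M = 30
x = rng.normal(size=M)                  # hypothetical input time series
h = np.array([0.8, 0.3, -0.2])          # hypothetical filter taps

W = filter_matrix(h, M)
same_output = np.allclose(W @ x, np.convolve(h, x)[:M])
sign, logabsdet = np.linalg.slogdet(W)

grad = filter_gradient(h, x)
eps = 1e-6
numeric = np.array([
    (log_jacobian(h + eps * e, x) - log_jacobian(h - eps * e, x)) / (2 * eps)
    for e in np.eye(len(h))
])
max_error = np.max(np.abs(grad - numeric))
print(same_output, logabsdet, max_error)
```

Only the leading weight receives the $1/w_L$ anti-decay term, because the determinant of a triangular matrix depends only on its diagonal.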
Time Delay:
Following the same set of steps, a learning rule for a time delay can be found.
Generalizing the transfer function:
Blind Separation: A set of source signals, $s_1(t), \ldots, s_N(t)$, is mixed together linearly with a matrix $\mathbf{A}$, giving $\mathbf{x} = \mathbf{A}\mathbf{s}$. We ought to find a square matrix $\mathbf{W}$ that is the inverse of $\mathbf{A}$ up to a permutation and rescaling of the rows. The problem reduces to minimizing the mutual information between the outputs (Independent Component Analysis). Obviously, if the matrix $\mathbf{A}$ is singular, one cannot solve the problem.
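A small blind separation experiment can be sketched as follows. It mixes two independent Laplace-distributed sources with a fixed matrix and applies the batch update $\Delta \mathbf{W} \propto [\mathbf{W}^T]^{-1} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T$ for logistic units. The choice of sources, the mixing matrix, the learning rate, the iteration count, and the omission of the bias term (the sources are zero-mean) are all assumptions made for this illustration. If separation succeeds, the product of the learned matrix and the mixing matrix should be close to a scaled permutation, so each row has one dominant entry.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 10000
S = rng.laplace(size=(n_samples, 2))        # independent super-Gaussian sources (assumed)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                  # hypothetical mixing matrix
X = S @ A.T                                 # each row is a mixture x = A s

W = np.eye(2)                               # initial unmixing matrix
lr = 0.1
for _ in range(5000):
    Y = 1.0 / (1.0 + np.exp(-(X @ W.T)))
    grad = np.linalg.inv(W).T + (1.0 - 2.0 * Y).T @ X / n_samples
    W += lr * grad

P = W @ A                                   # ideally a scaled permutation matrix
row_max = np.max(np.abs(P), axis=1)
# Ratio of the non-dominant entries to the dominant entry in each row.
crosstalk = np.max((np.sum(np.abs(P), axis=1) - row_max) / row_max)
print(np.round(P, 2))
print("worst-case crosstalk:", crosstalk)
```

The logistic nonlinearity suits sources with heavier tails than a Gaussian. With other source distributions this sketch may not separate the signals.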
Blind De-convolution: A signal $s(t)$ is corrupted by a linear filter $a(t)$, giving $x(t) = a(t) * s(t)$. We have to find a filter $w(t)$ to recover $s(t)$ from $x(t)$. The problem reduces to removing statistical dependencies across time (whitening of $x(t)$).
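To illustrate what a successful deconvolving filter achieves, the sketch below corrupts a temporally independent signal with a known two-tap filter and then applies a truncated causal inverse of that filter. This is not the blind algorithm, since the inverse is computed from the known filter. It only shows the target: the corrupted signal is correlated across time, and the recovered one is not. The filter taps, the signal, and the helper `lag1_autocorr` are hypothetical.

```python
import numpy as np

def lag1_autocorr(z):
    """Sample autocorrelation of z at a lag of one time step."""
    z = z - z.mean()
    return float(np.dot(z[:-1], z[1:]) / np.dot(z, z))

rng = np.random.default_rng(4)
n = 5000
s = rng.laplace(size=n)                 # temporally independent source (assumed)
a = np.array([1.0, 0.5])                # hypothetical corrupting filter a(t)
x = np.convolve(a, s)[:n]               # x(t) = a(t) * s(t), causal part

# Truncated causal inverse of a(t): taps (-0.5)^k for k = 0..29.
w = (-0.5) ** np.arange(30)
u = np.convolve(w, x)[:n]               # u(t) = w(t) * x(t)

print("lag-1 autocorrelation of x:", lag1_autocorr(x))
print("lag-1 autocorrelation of u:", lag1_autocorr(u))
print("largest recovery error:", np.max(np.abs(u - s)))
```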
The fundamental question is how we can reduce the mutual information between two outputs by information maximization. The joint entropy of two output signals is
$$H(y_1, y_2) = H(y_1) + H(y_2) - I(y_1, y_2).$$
The algorithm discussed above maximizes $H(y_1, y_2)$, and this is mostly done by minimizing $I(y_1, y_2)$. When $I(y_1, y_2)$ is zero, the probability distribution of the $y_i$'s is separable. Can we introduce a constraint and keep the total $H(y_1) + H(y_2)$ conserved?
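The decomposition of the joint entropy, $H(y_1, y_2) = H(y_1) + H(y_2) - I(y_1, y_2)$, can be checked on a small discrete example. The outputs in the notes are continuous, so the discrete tables below are only an assumed stand-in chosen to make the identity easy to verify; the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(y1, y2) as the KL divergence between the joint and the product of marginals."""
    joint = np.asarray(joint, dtype=float)
    p1 = joint.sum(axis=1, keepdims=True)
    p2 = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p1 * p2)[mask])))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])                  # correlated outputs
independent = np.outer([0.5, 0.5], [0.5, 0.5])      # separable distribution

for name, joint in (("dependent", dependent), ("independent", independent)):
    h1 = entropy(joint.sum(axis=1))
    h2 = entropy(joint.sum(axis=0))
    mi = mutual_information(joint)
    print(name, "H(y1,y2) =", entropy(joint), "H(y1)+H(y2)-I =", h1 + h2 - mi, "I =", mi)
```

Both tables have the same marginal entropies, so the separable one has the larger joint entropy. That is the situation the question about a conservation constraint is aiming at.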
These notes cover the paper up to Section 5.