Not only mixture models, but also a wide variety of other classical statistical models for density estimation are representable as simple networks with one or more layers of adaptive weights.
The following steps convert Bayes' rule into a logistic function:
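A standard two-class version of this derivation can be sketched as follows. The notation (classes C_1 and C_2, input x, log-odds a) is an assumption of this sketch and may differ from the original steps.

```latex
\begin{align*}
p(C_1 \mid x)
  &= \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} \\
  &= \frac{1}{1 + \dfrac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1)}} \\
  &= \frac{1}{1 + \exp(-a)},
  \qquad a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}.
\end{align*}
```

When the class-conditional densities are Gaussians with a shared covariance matrix, a is linear in x, so the posterior is a logistic function of a linear combination of the inputs.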
To achieve good generalization it is important to have more data points than adaptive parameters in the model.
It has been demonstrated that MLP models of this form (with one hidden layer) can approximate to arbitrary accuracy any continuous function, defined on a compact domain, provided the number M of hidden units is sufficiently large.
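For concreteness, a one-hidden-layer MLP of the kind referred to can be sketched as below. The tanh hidden units, linear output units, and all names (mlp_forward, W1, b1, W2, b2) are assumptions of this sketch rather than details given above. The code only shows where the number M of hidden units enters; it does not demonstrate the approximation result.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: linear outputs on top of M tanh hidden units.

    Shapes: x (N, d), W1 (d, M), b1 (M,), W2 (M, K), b2 (K,).
    """
    z = np.tanh(x @ W1 + b1)   # hidden activations, shape (N, M)
    return z @ W2 + b2         # network outputs, shape (N, K)

# Hypothetical sizes: d inputs, M hidden units, K outputs.
rng = np.random.default_rng(0)
d, M, K = 2, 10, 1
W1, b1 = rng.normal(size=(d, M)), np.zeros(M)
W2, b2 = rng.normal(size=(M, K)), np.zeros(K)
y = mlp_forward(rng.normal(size=(5, d)), W1, b1, W2, b2)
```

Increasing M adds columns to W1 and rows to W2, which is how the flexibility of the model grows.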
The linear, logistic, and softmax functions are (inverse) canonical links for the Gaussian, Bernoulli, and multinomial distributions, respectively.
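The three functions can be written down directly. This is a minimal sketch; the function names and the variable a (the summed input to an output unit) are assumptions made for illustration.

```python
import numpy as np

def linear(a):
    # Identity output: used with a Gaussian noise model.
    return a

def logistic(a):
    # Maps a real number to (0, 1): used with a Bernoulli model.
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Maps a vector to probabilities summing to one: used with a multinomial model.
    a = a - np.max(a, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

p = softmax(np.array([2.0, 0.0, -1.0]))
two_class = softmax(np.array([1.5, 0.0]))[0]  # equals logistic(1.5)
```

With two classes and one input fixed at zero, the softmax reduces to the logistic function, which is why the two are usually treated as one family.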
A variety of such pruning algorithms are available [cf. Bishop, 1995].
Some theoretical insight into the problem of overfitting can be obtained by decomposing the error into the sum of bias and variance terms. A model which is too inflexible is unable to represent the true structure in the underlying density function and this gives rise to a high bias. Conversely, a model which is too flexible becomes tuned to the specific details of the particular data set and gives a high variance. The best generalization is obtained from the optimum trade-off of bias against variance.
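The trade-off can be illustrated numerically. The sketch below assumes a regression setting rather than density estimation, a sine target, polynomial models of degree 1 and 6, and a fixed input grid; all of these choices, and the name bias_variance, are assumptions made for illustration only.

```python
import numpy as np

def bias_variance(degree, n_sets=200, n_points=25, noise=0.3, seed=0):
    """Estimate squared bias and variance of polynomial fits over many data sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)        # fixed training inputs
    x_test = np.linspace(0.05, 0.95, 50)
    f_test = np.sin(2 * np.pi * x_test)        # true function at the test points
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        # Each data set differs only in its noise realization.
        t = np.sin(2 * np.pi * x) + noise * rng.normal(size=n_points)
        preds[i] = np.polyval(np.polyfit(x, t, degree), x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f_test) ** 2)  # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))       # variance across data sets
    mse = np.mean((preds - f_test) ** 2)        # error against the true function
    return bias2, variance, mse

results = {degree: bias_variance(degree) for degree in (1, 6)}
for degree, (bias2, variance, mse) in results.items():
    print(f"degree {degree}: bias^2={bias2:.4f} variance={variance:.4f} mse={mse:.4f}")
```

In this setup the error against the true function equals the squared bias plus the variance. The straight line is the inflexible, high-bias model and the degree-6 polynomial is the more flexible, higher-variance one.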
Regularization methods can be justified within a general theoretical framework known as structural risk minimization. Structural risk minimization provides a quantitative measure of complexity known as the VC dimension. The theory shows that the VC dimension predicts the difference between performance on a training set and performance on a test set; thus, the sum of the log likelihood and (some function of) the VC dimension provides a measure of generalization performance. This motivates regularization methods and provides some insight into possible forms for the regularizer.
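The text does not name a specific regularizer. As a familiar example of a penalized objective, the sketch below uses a quadratic weight penalty (weight decay, here in the form of ridge regression). This choice, the function name ridge_fit, and the synthetic data are assumptions; the penalty is not derived from the VC dimension.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize ||X w - t||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Synthetic linear data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
t = X @ rng.normal(size=8) + 0.1 * rng.normal(size=30)

# A larger penalty coefficient shrinks the weight vector, i.e. reduces effective complexity.
lams = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_fit(X, t, lam)) for lam in lams]
```

The coefficient lam plays the role of the trade-off between fit to the training data and the complexity term.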
Pre-processing is important for neural networks. Two common forms are:
1) Normalizing the input to give it zero mean and unit variance;
2) Reducing the dimensionality of the input space.
A standard technique for dimensionality reduction is principal component analysis. Such methods, however, make use only of the input data and ignore the target values, and can sometimes be significantly sub-optimal.
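Both pre-processing steps can be sketched in a few lines of NumPy. The function names standardize and pca_project, the small constant eps, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Rescale each input variable to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std  # eps guards against constant columns

def pca_project(X, k):
    """Project centered data onto its k leading principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k], s

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1]) + 3.0
Xs, mean, std = standardize(X)
Z, components, s = pca_project(Xs, k=2)
```

Note that pca_project never sees the target values, which is exactly the limitation pointed out above. The mean and std returned by standardize should be reused unchanged on test data.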
One way of achieving such translation invariance in a neural network is to make use of the technique of shared weights. This involves a network architecture having many hidden layers in which each unit takes inputs only from a small patch, called a receptive field, of units in the previous layer. By constraining neighboring units to have common weights, the output of the network can be made insensitive to translations of the input image. A further benefit of weight sharing is that the number of independent parameters is much smaller than the number of weights, which assists with the problem of model complexity.
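A one-dimensional toy version of this idea is sketched below. The circular (wrap-around) receptive fields, the single hidden layer, the global max over hidden units, and the names shared_weight_layer and network_output are all assumptions chosen so that the invariance is exact and easy to check; they are not details taken from the text.

```python
import numpy as np

def shared_weight_layer(x, w, b):
    """Each hidden unit sees a patch of len(w) inputs; all units share w and b."""
    n, k = x.size, w.size
    # Row i lists the input indices in the receptive field of hidden unit i.
    idx = (np.arange(n)[:, None] + np.arange(k)[None, :]) % n
    return np.tanh(x[idx] @ w + b)   # shape (n,): one hidden unit per position

def network_output(x, w, b):
    # Taking the maximum over positions discards where a feature occurred.
    return shared_weight_layer(x, w, b).max()

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w, b = rng.normal(size=3), 0.1

h = shared_weight_layer(x, w, b)
h_shifted = shared_weight_layer(np.roll(x, 5), w, b)
y = network_output(x, w, b)
y_shifted = network_output(np.roll(x, 5), w, b)
```

Shifting the input only shifts the hidden activations, so the pooled output is unchanged. The layer has 16 x 3 = 48 connections but only 4 independent parameters (three shared weights and one bias).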
The relationship between probabilistic graphical models and neural networks is rather strong; indeed it is often possible to reduce one kind of model to the other.
The following figure shows that it is possible to treat an HMM as a special case of a Boltzmann machine.