Feedforward neural networks trained by error backpropagation are examples of nonparametric regression estimators.
The fundamental challenges in neural modeling are about representation rather than learning per se.
Learning, as it is represented in some current neural networks, can be formulated as a (nonlinear) regression problem.
The essence of the bias/variance dilemma lies in the fact that estimation error can be decomposed into two components, known as bias and variance; whereas incorrect models lead to high bias, truly model-free inference suffers from high variance. Thus, model-free (tabula rasa) approaches to complex inference tasks are slow to "converge", in the sense that large training samples are required to achieve acceptable performance. This is the effect of high variance, and is a consequence of the large number of parameters, indeed an infinite number in truly model-free inference, that need to be estimated. Prohibitively large training sets are then required to reduce the variance contribution to estimation error. Parallel architectures and fast hardware do not help here: this "convergence problem" has to do with training set size rather than implementation. The only way to control the variance in complex inference problems is to use model-based estimation. However, model-based inference is bias-prone: proper models are hard to identify for these more complex (and interesting) inference problems, and any model-based scheme is likely to be incorrect for the task at hand, that is, highly biased.
In other words, among all functions of x, the regression is the best predictor of y given x, in the mean-squared-error sense.
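The optimality claim can be made precise in one line. Writing $f^*(x) = E[y \mid x]$ for the regression (our notation), we have, for any candidate function $f$:

```latex
E\left[(y - f(x))^2\right]
  \;=\; E\left[(y - f^*(x))^2\right] \;+\; E\left[(f^*(x) - f(x))^2\right],
```

since the cross term $E[(y - f^*(x))(f^*(x) - f(x))]$ vanishes after conditioning on $x$. The second term is nonnegative and is zero exactly when $f = f^*$, so no function of $x$ predicts $y$ with smaller mean-squared error.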
This is indeed a reassuring property, but it comes at a high price: depending on the particular algorithm and the particular regression, nonparametric methods can be extremely slow to converge. That is, they may require very large numbers of examples to make relatively crude approximations of the target regression function. Indeed, with small samples the estimator may be too dependent on the particular samples observed, that is, on the particular realizations of (x, y) (we say that the variance of the estimator is high). Thus, for a fixed and finite training set, a parametric estimator may actually outperform a nonparametric estimator, even when the true regression is outside of the parameterized class.
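This last point can be checked with a small Monte-Carlo sketch (ours, not from the text; the sinusoidal target, noise level, and choice of k are arbitrary). A misspecified linear fit is compared with a k-nearest-neighbor estimator on repeated small training sets, measuring each against the true regression:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # A mildly nonlinear target; the linear model below is therefore biased.
    return np.sin(3 * x)

def knn_predict(x_train, y_train, x_query, k):
    # Plain k-nearest-neighbor regression: average the k closest responses.
    return np.array([y_train[np.argsort(np.abs(x_train - xq))[:k]].mean()
                     for xq in x_query])

def experiment(n, trials=200, k=3, noise=0.5):
    # Average squared error of each estimator against the true regression,
    # over many independent training sets of size n.
    x_test = np.linspace(0, 1, 50)
    err_lin, err_knn = 0.0, 0.0
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_regression(x) + noise * rng.normal(size=n)
        a, b = np.polyfit(x, y, 1)   # misspecified linear (parametric) fit
        err_lin += np.mean((a * x_test + b - true_regression(x_test)) ** 2)
        err_knn += np.mean((knn_predict(x, y, x_test, k)
                            - true_regression(x_test)) ** 2)
    return err_lin / trials, err_knn / trials

err_lin_small, err_knn_small = experiment(n=8)
```

At such small sample sizes the biased linear fit can come out ahead, because the k-NN estimate fluctuates strongly from one training set to the next; which estimator wins depends on the constants chosen, which is precisely the tradeoff at issue.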
The regression problem is to construct a function f(x) based on a "training set" for the purpose of approximating y at future observations of x. This is sometimes called "generalization", a term borrowed from psychology.
An unbiased estimator may still have a large mean-squared error if the variance is large. Thus either bias or variance can contribute to poor performance.
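In symbols (our notation, with $D$ the training set, $\hat f(x; D)$ the estimator, and $f^*(x) = E[y \mid x]$ the regression), the decomposition reads:

```latex
E_D\!\left[(\hat f(x; D) - f^*(x))^2\right]
  \;=\; \underbrace{\left(E_D[\hat f(x; D)] - f^*(x)\right)^2}_{\text{bias}^2}
  \;+\; \underbrace{E_D\!\left[\left(\hat f(x; D) - E_D[\hat f(x; D)]\right)^2\right]}_{\text{variance}}
```

An unbiased estimator makes the first term zero, but the second term alone can still dominate the mean-squared error.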
There is often a tradeoff between the bias and variance contributions to the estimation error, which makes for a kind of "uncertainty principle". Typically, variance is reduced through "smoothing", by combining, for example, the influences of samples that are nearby in the input space. This, however, will introduce bias, as details of the regression function will be lost; for example, sharp peaks and valleys will be blurred.
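The blurring effect is easy to exhibit numerically. The sketch below (ours, not from the text) uses a Nadaraya-Watson kernel smoother as a stand-in for the generic window estimator, with arbitrary constants; a wide bandwidth averages a sharp peak nearly away, even with noise-free data:

```python
import numpy as np

def nw_smooth(x_train, y_train, x_query, sigma):
    # Nadaraya-Watson kernel regression: a Gaussian-weighted local average.
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / sigma) ** 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

x = np.linspace(-1, 1, 201)
y = np.exp(-(x / 0.05) ** 2)   # noise-free sharp peak at x = 0, height 1
grid = np.linspace(-1, 1, 201)

peak_narrow = nw_smooth(x, y, grid, sigma=0.01).max()  # peak survives
peak_wide = nw_smooth(x, y, grid, sigma=0.3).max()     # peak is blurred flat
```

With real noisy data the narrow bandwidth would instead exhibit high variance, which is the other horn of the dilemma.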
The general recipe for obtaining consistency in nonparametric estimation: slowly remove bias.
Nonparametric estimators are generally indexed by one or more parameters which control bias and variance; these parameters must be properly adjusted, as functions of sample size, to ensure consistency, that is, convergence of mean-squared error to zero in the large-sample-size limit. The number of neighbors k, the kernel bandwidth σ, and the number of hidden units play these roles, respectively, in nearest-neighbor, Parzen-window, and feedforward-neural-network estimators. These "smoothing parameters" typically enforce a degree of regularity (hence bias), thereby "controlling" the variance. As we shall see in Section 4.2, consistency theorems specify asymptotic rates of growth or decay of these parameters to guarantee convergence to the unknown regression, or more generally, to the object of estimation. Thus, for example, a rate of growth of the number of neighbors or of the number of hidden units, or a rate of decay of the bandwidth, is specified as a function of the sample size N.
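As a toy illustration (ours; the sine target, noise level, and the particular rate k ≈ √N are our own choices, not a rate taken from the consistency theorems), the following sketch grows the number of neighbors with the sample size and measures k-NN error against the true regression:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_fit_error(n, trials=30, noise=0.3):
    # Mean squared error of k-NN regression against the true regression,
    # with the smoothing parameter k grown with sample size: k ~ sqrt(n).
    k = max(1, int(round(np.sqrt(n))))
    x_test = np.linspace(0.1, 0.9, 40)   # interior points, to limit edge effects
    total = 0.0
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = np.sin(2 * np.pi * x) + noise * rng.normal(size=n)
        preds = np.array([y[np.argsort(np.abs(x - xq))[:k]].mean()
                          for xq in x_test])
        total += np.mean((preds - np.sin(2 * np.pi * x_test)) ** 2)
    return total / trials

err_small = knn_fit_error(50)     # k = 7
err_large = knn_fit_error(5000)   # k = 71
```

Growing k sublinearly in n shrinks the variance (more neighbors averaged) while the neighborhoods still contract, so the bias shrinks too; a fixed k, by contrast, would leave a variance floor of order noise²/k.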
The most widely studied approach to automatic smoothing is "cross-validation". The idea of this technique is usually attributed to Stone (1974).
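A minimal sketch of the idea (ours; the Gaussian kernel smoother, the sine target, the noise level, and the candidate bandwidths are all arbitrary toy choices): each sample is predicted from all the others, and the bandwidth with the smallest average leave-one-out error is selected.

```python
import numpy as np

rng = np.random.default_rng(1)

def loo_cv_score(x, y, sigma):
    # Leave-one-out cross-validation score for a Gaussian kernel smoother:
    # predict each y[i] from the remaining samples, average squared errors.
    n = len(x)
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w = np.exp(-0.5 * ((x[i] - x[mask]) / sigma) ** 2)
        err += (y[i] - (w @ y[mask]) / w.sum()) ** 2
    return err / n

x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=len(x))

bandwidths = [0.01, 0.03, 0.1, 0.3, 1.0]
best = min(bandwidths, key=lambda s: loo_cv_score(x, y, s))
```

Very small bandwidths nearly interpolate the noise (high variance) and very large ones nearly return the global mean (high bias); cross-validation picks a bandwidth between the extremes without knowing the true regression.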
Nonparametric methods may be characterized by the property of consistency: in the appropriate asymptotic limit they achieve the best possible performance for any learning task given to them, however difficult this task may be.
We also saw that mean-squared error can be decomposed into a bias term and a variance term. Both have to be made small if we want to achieve good performance. The practical issue is then the following: Can we hope to make both bias and variance "small", with "reasonably" sized training sets, in "interesting" problems, using nonparametric inference algorithms such as nearest-neighbor, CART, feedforward networks, etc.?
Layered networks, Boltzmann Machines, and older methods like nearest neighbor or window estimators, can indeed form the basis of a trainable, "from scratch", speech recognition system, or a device for invariant object recognition. With enough examples and enough computing power, the performance of these machines will necessarily approximate the best possible for the task at hand. There would be no need for preprocessing or devising special representations: the "raw data" would do.
For many problems, the constrained structures can do better than general purpose structures.
The use of the graph-matching distance yields significantly better results on this task.
Adopting an appropriate data representation is an efficient means for designing the bias required for solving many hard problems in machine perception. This view is of course shared by many authors. As Anderson and Rosenfeld (1988) put it: "A good representation does most of the work".