The class notes can be found at http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
Cast machine learning problems as numerical optimization problems.
Quantify the quality of a solution with a scalar objective function, evaluated on sets of inputs and outputs.
The goal is to adjust model parameters to minimize the objective function given inputs and outputs.
The key to designing machine learning systems is to select the representation of inputs and outputs, and the mathematical formulation of the task as an objective function.
The mechanics: optimize objective function given observed data to find best parameters.
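A minimal sketch of these mechanics, assuming a linear model y ≈ w*x + b, a mean squared error objective, and plain gradient descent as the optimizer. The function name fit_linear, the synthetic data, the learning rate and the step count are illustrative choices, not taken from the lectures.

```python
import numpy as np

def fit_linear(x, y, lr=0.1, steps=500):
    """Minimize the mean squared error of y ≈ w*x + b by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = w * x + b - y
        # Gradients of the objective mean(residual**2) with respect to w and b.
        grad_w = 2.0 * np.mean(residual * x)
        grad_b = 2.0 * np.mean(residual)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic observed data (assumed for illustration): true w = 3.0, true b = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=200)

w, b = fit_linear(x, y)  # fitted parameters should land near 3.0 and 0.5
```

The representation (scalar x, scalar y) and the objective (squared error) are the design decisions; the loop itself is generic.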
Probabilistic models allow us to:
1) account for noisy sensors, actuators
2) make decisions based on partial information about the world
3) describe inherently stochastic aspects of natural world
4) consider data not well described by model
Supervised learning is conditional density estimation, P(Y|X).
Unsupervised learning is density estimation, P(X).
If we learn the joint distribution P(X,Y), we can marginalize to get P(X) by summing over all possible Y, and we can use Bayes' rule to compute P(Y|X). Thus the joint distribution is the primary object of interest. The central problems concern representing it compactly and learning its parameters robustly.
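A small sketch of the marginalization and conditioning steps, assuming discrete X and Y and a made-up joint table (the numbers are hypothetical and only need to sum to 1).

```python
import numpy as np

# Hypothetical joint table P(X, Y): rows index x in {0, 1}, columns index y in {0, 1, 2}.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])

# Marginalize: P(X) = sum over y of P(X, Y).
# Here this equals [0.4, 0.6] up to floating-point rounding.
p_x = joint.sum(axis=1)

# Condition: P(Y | X) = P(X, Y) / P(X); each row becomes a distribution over y.
p_y_given_x = joint / p_x[:, None]
```

A full table like this grows exponentially with the number of variables, which is why compact representation is a central problem.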
The key point about directed acyclic graphical models is that missing edges imply conditional independence.
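A numerical check of this on the simplest case, assuming a three-node chain A -> B -> C with binary variables. The conditional probability tables are hypothetical. The missing edge A -> C should make A and C independent given B, while they remain dependent marginally.

```python
import numpy as np

# Hypothetical conditional probability tables for the chain A -> B -> C.
p_a = np.array([0.6, 0.4])                 # P(A)
p_b_given_a = np.array([[0.7, 0.3],        # P(B | A=0)
                        [0.2, 0.8]])       # P(B | A=1)
p_c_given_b = np.array([[0.9, 0.1],        # P(C | B=0)
                        [0.5, 0.5]])       # P(C | B=1)

# Joint from the DAG factorization: P(a, b, c) = P(a) P(b|a) P(c|b), axes (a, b, c).
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]

# Condition on B and compare P(a, c | b) with P(a | b) P(c | b).
p_b = joint.sum(axis=(0, 2))
p_ac_given_b = joint / p_b[None, :, None]
p_a_given_b = p_ac_given_b.sum(axis=2)     # axes (a, b)
p_c_given_b_check = p_ac_given_b.sum(axis=0)  # axes (b, c)
cond_indep = np.allclose(
    p_ac_given_b,
    p_a_given_b[:, :, None] * p_c_given_b_check[None, :, :],
)

# Without conditioning on B, test whether P(a, c) factorizes as P(a) P(c).
p_ac = joint.sum(axis=1)
marg_indep = np.allclose(p_ac, np.outer(p_ac.sum(axis=1), p_ac.sum(axis=0)))
```

With these tables cond_indep comes out True and marg_indep comes out False, so the independence holds only once B is observed.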
Sum-Product algorithm: Message definitions
Learning via the likelihood:
Simple Monte Carlo:
Sums and integrals, often expectations, occur frequently in statistics.
Monte Carlo approximates expectations with a sample average.
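A minimal sketch of simple Monte Carlo, assuming we want E[x^2] under a standard normal (true value 1) and can sample from that distribution directly. The sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)   # x ~ N(0, 1)

values = samples ** 2                # f(x) = x^2 evaluated at each sample
estimate = values.mean()             # sample average approximates E[f(x)] = 1

# Standard error of the estimate shrinks as 1 / sqrt(number of samples).
std_error = values.std() / np.sqrt(values.size)
```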
Rejection sampling draws samples from complex distributions.
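A sketch of rejection sampling under assumed choices: the target is an unnormalized Beta(2, 5) density on [0, 1], the proposal is uniform on [0, 1], and the envelope is a constant equal to the target's maximum. The names target_unnorm and rejection_sample are illustrative.

```python
import numpy as np

def target_unnorm(x):
    """Unnormalized Beta(2, 5) density on [0, 1]."""
    return x * (1.0 - x) ** 4

def rejection_sample(n, rng):
    """Draw n samples from the target using a uniform proposal."""
    bound = 0.2 * 0.8 ** 4   # maximum of target_unnorm, attained at x = 0.2
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 1.0, size=n)      # candidates from the proposal
        u = rng.uniform(0.0, bound, size=n)    # heights under the envelope
        out.extend(x[u < target_unnorm(x)])    # keep points under the target curve
    return np.array(out[:n])

rng = np.random.default_rng(0)
samples = rejection_sample(20_000, rng)
# The sample mean should be close to the Beta(2, 5) mean, 2/7.
```

The normalizing constant of the target is never needed; only an upper bound on the unnormalized density is.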
Importance sampling applies Monte Carlo to any sum/integral.
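A sketch of importance sampling, assuming the target p is N(0, 1), the proposal q is the wider N(0, 2^2), and the quantity of interest is E_p[x^2] = 1. The helper normal_pdf is written out by hand to keep the example self-contained.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from the proposal q

# Importance weights p(x) / q(x) correct for sampling from q instead of p.
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)

estimate = np.mean(weights * x ** 2)   # approximates E_p[x^2] = 1
```

The proposal should have heavier tails than the target; otherwise a few huge weights can dominate the average.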
Markov chain Monte Carlo constructs a random walk that explores target distributions.
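A sketch of one MCMC method, random-walk Metropolis, assuming an unnormalized standard normal target and a Gaussian proposal. The step size and chain length are arbitrary illustrative values, and no burn-in or thinning is applied.

```python
import numpy as np

def log_target(x):
    """Log of an unnormalized standard normal density."""
    return -0.5 * x ** 2

def metropolis(n_steps, step_size, rng, x0=0.0):
    """Random-walk Metropolis chain targeting exp(log_target)."""
    chain = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        proposal = x + step_size * rng.normal()   # symmetric random-walk proposal
        # Accept with probability min(1, p(proposal) / p(x)), computed in log space.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        chain[i] = x                              # on rejection the old state repeats
    return chain

rng = np.random.default_rng(0)
chain = metropolis(20_000, step_size=1.0, rng=rng)
# chain.mean() and chain.var() should be near 0 and 1 for this target.
```

Successive states are correlated, so the chain carries fewer effective samples than its length suggests.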