For sequential problems, the sequences exhibit significant sequential correlation: nearby x and y values are likely to be related to each other.
To model these sequential correlations, we usually adopt the joint probability of the whole sequence as the objective function for learning.
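To make the joint-probability objective concrete, here is a minimal sketch of how the joint probability of a paired (x, y) sequence factors under a first-order Markov assumption. The probability tables below are made-up illustrative values, not taken from the paper:

```python
import math

# Hypothetical first-order model: the joint probability of a label
# sequence y and observation sequence x factors as
#   P(x, y) = P(y_1) P(x_1 | y_1) * prod_{t>1} P(y_t | y_{t-1}) P(x_t | y_t)
# All probability tables are illustrative.

initial = {"A": 0.6, "B": 0.4}                    # P(y_1)
transition = {("A", "A"): 0.7, ("A", "B"): 0.3,   # P(y_t | y_{t-1})
              ("B", "A"): 0.4, ("B", "B"): 0.6}
emission = {("A", 0): 0.9, ("A", 1): 0.1,         # P(x_t | y_t)
            ("B", 0): 0.2, ("B", 1): 0.8}

def joint_log_prob(x, y):
    """Log joint probability of paired sequences x and y."""
    logp = math.log(initial[y[0]]) + math.log(emission[(y[0], x[0])])
    for t in range(1, len(y)):
        logp += math.log(transition[(y[t - 1], y[t])])
        logp += math.log(emission[(y[t], x[t])])
    return logp
```

Learning then amounts to choosing the probability tables that maximize this quantity summed over the training sequences.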
Methods that analyze the entire sequence of x_t values before predicting the y_t labels typically perform better on sequential supervised learning problems.
Non-uniform loss functions are usually represented by a cost matrix C(i, j), which gives the cost of assigning label i to an example whose true label is j. In such cases, the goal is to find a classifier with minimum expected cost.
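The minimum-expected-cost decision rule can be sketched in a few lines: given a class posterior p and a cost matrix C, pick the label i minimizing sum_j C(i, j) p(j). The cost matrix and posteriors below are made-up illustrative values:

```python
# Illustrative 3-class cost matrix: C[i][j] is the cost of predicting
# label i when the true label is j (values are made up; note that
# confusing class 2 with class 0 is made especially expensive).
C = [[0.0, 1.0, 5.0],
     [1.0, 0.0, 1.0],
     [10.0, 1.0, 0.0]]

def min_expected_cost_label(p):
    """Return the label minimizing expected cost under posterior p.

    expected_cost[i] = sum_j C[i][j] * p[j]
    """
    expected = [sum(C[i][j] * p[j] for j in range(len(p)))
                for i in range(len(C))]
    return min(range(len(expected)), key=expected.__getitem__), expected

# Under 0/1 loss the most probable label here would be 0, but the
# cost-sensitive choice is 1, hedging against the costly 0-vs-2 error.
label, costs = min_expected_cost_label([0.5, 0.2, 0.3])
```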
How can these kinds of loss functions be incorporated into sequential supervised learning? One approach is to view the learning problem as the task of predicting the (conditional) joint probability of all the labels in the output sequence: P(y|x). If this joint distribution can be accurately predicted, then all of the various loss functions can be evaluated.
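A small sketch of why the joint distribution suffices: once P(y|x) is in hand, the expected value of any sequence-level loss can be computed, and different losses can favor different predictions. The distribution below is a made-up illustrative example over length-2 label sequences:

```python
# Hypothetical predicted joint distribution P(y | x) over label
# sequences (probabilities are made up and sum to 1).
joint = {("A", "A"): 0.3, ("A", "B"): 0.4, ("B", "A"): 0.3}

def expected_loss(prediction, loss):
    """Expected loss of a predicted sequence under the joint distribution."""
    return sum(p * loss(prediction, y) for y, p in joint.items())

def hamming(pred, true):
    # Number of positions where the labels disagree.
    return sum(a != b for a, b in zip(pred, true))

def zero_one(pred, true):
    # 1 unless the entire sequence is exactly right.
    return int(pred != true)

# The same joint distribution yields different optimal predictions
# depending on the loss being evaluated.
best_hamming = min(joint, key=lambda s: expected_loss(s, hamming))
best_zero_one = min(joint, key=lambda s: expected_loss(s, zero_one))
```

Here the zero-one-optimal prediction is the single most probable sequence, while the Hamming-optimal prediction follows the per-position marginals, and the two disagree.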
We would also like to find methods that can extract features capturing long-distance interactions.
Approaches reviewed in this paper:
1) The sliding window method;
2) Recurrent sliding windows;
3) Hidden Markov Models;
4) Maximum Entropy Markov Models;
5) Input-Output Markov Models;
6) Conditional Random Fields;
7) Graph Transformer Networks.
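As a concrete reference point for the first method, a sliding window classifier converts each sequence into independent per-position examples by pairing each label y_t with a window of surrounding inputs. A minimal sketch (the window half-width and padding value are illustrative choices, not prescribed by the paper):

```python
def make_windows(x, y, half_width=1, pad=None):
    """Turn a sequence pair (x, y) into per-position training examples.

    Each example's features are the inputs in the window
    x[t - half_width .. t + half_width]; its target is y[t].
    Positions past the sequence boundary are filled with `pad`.
    """
    padded = [pad] * half_width + list(x) + [pad] * half_width
    examples = []
    for t in range(len(x)):
        window = tuple(padded[t:t + 2 * half_width + 1])
        examples.append((window, y[t]))
    return examples

# Each windowed example can now be fed to any ordinary classifier.
examples = make_windows(["a", "b", "c"], [0, 1, 0])
```

The appeal of this reduction is that it reuses standard supervised learners unchanged; its weakness is that correlations among the y_t labels outside the window are ignored.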