Thomas G. Dietterich. Machine Learning for Sequential Data: A Review. In Structural, Syntactic, and Statistical Pattern Recognition; Lecture Notes in Computer Science, Vol. 2396, T. Caelli (Ed.), pp. 15–30, Springer-Verlag, 2002.
Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues.
Simon Lacoste-Julien. Combining SVM with graphical models for supervised classification: an introduction to Max-Margin Markov Networks. CS281A Project Report, UC Berkeley, 2003.
The goal of this paper is to present a survey of the concepts needed to understand the novel Max-Margin Markov Networks (M3-net) framework, a new formalism invented by Taskar, Guestrin and Koller which combines both the advantages of the graphical models and the Support Vector Machines (SVMs) to solve the problem of multi-label multi-class supervised classification. We will compare generative models, discriminative graphical models and SVMs for this task, introducing the basic concepts at the same time, leading at the end to a presentation of the M3-net paper.
Ben Taskar, Carlos Guestrin and Daphne Koller. Max-Margin Markov Networks. In Advances in Neural Information Processing Systems 16 (NIPS 2003), 2004.
In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.
Fuchun Peng and Andrew McCallum (2004). Accurate Information Extraction from Research Papers using Conditional Random Fields. In Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT/NAACL-04), 2004.
With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This paper employs Conditional Random Fields (CRFs) for the task of extracting various common fields from the headers and citation of research papers. The basic theory of CRFs is becoming well-understood, but best-practices for applying them to real-world data requires additional exploration. This paper makes an empirical exploration of several factors, including variations on Gaussian, exponential and hyperbolic priors for improved regularization, and several classes of features and Markov order. On a standard benchmark data set, we achieve new state-of-the-art performance, reducing error in average F1 by 36%, and word error rate by 78% in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs.
Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In Advances in Neural Information Processing Systems 17 (NIPS 2004), 2005.
We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs.