The fundamental frequency and energy trajectories are adopted to capture long-term speaker information for speaker recognition.
Prosodic information can effectively improve the performance and robustness of speaker recognition systems.
1) Global statistics of a prosodic feature are estimated and compared between two utterances, e.g. the mean and standard deviation of the fundamental frequency in the enrollment and test utterances;
2) Prosodic features are appended to standard spectral features and modeled with traditional distribution-modeling systems, an approach that may not capture temporal dynamics well.
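As an illustration of approach 1), the comparison of global f0 statistics could be sketched as follows. This is a minimal sketch, not the method of any particular system: the function names, the convention that unvoiced frames are marked with f0 = 0, and the Euclidean distance between (mean, standard deviation) pairs are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

def f0_global_stats(f0_hz):
    """Mean and standard deviation of f0 over voiced frames.

    Assumption: unvoiced frames are marked with f0 == 0 and are skipped.
    Needs at least two voiced frames.
    """
    voiced = [v for v in f0_hz if v > 0]
    return mean(voiced), stdev(voiced)

def stats_distance(enroll_f0, test_f0):
    """Euclidean distance between the (mean, std) pairs of two utterances.

    Hypothetical similarity measure: smaller values mean more similar.
    """
    m1, s1 = f0_global_stats(enroll_f0)
    m2, s2 = f0_global_stats(test_f0)
    return math.hypot(m1 - m2, s1 - s2)

# Toy f0 tracks in Hz (invented values, for illustration only).
enroll = [120, 0, 125, 130, 0, 118]
same_speaker = [122, 126, 0, 119, 129]
other_speaker = [210, 0, 220, 205, 215]
```

With these toy tracks, the distance between `enroll` and `same_speaker` is much smaller than the distance between `enroll` and `other_speaker`. Because the statistics are computed over the whole utterance, the ordering of the frames has no effect on the result.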
In this paper, the relation between dynamics of the fundamental frequency and energy trajectories is used to characterize the speaker's identity.
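One way to picture the joint dynamics of the two trajectories is to fit a slope to each contour over a short segment and record the pair of directions. The sketch below is only an assumption about how this could be done: the least-squares fit, the segment inputs, and the two-character symbol (`joint_dynamics_symbol`) are hypothetical and are not taken from the text above.

```python
def slope(values):
    """Least-squares slope of values against frame index 0..n-1 (n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def direction(s):
    """Map a slope to '+' (rising or flat) or '-' (falling)."""
    return "+" if s >= 0 else "-"

def joint_dynamics_symbol(f0_segment, energy_segment):
    """Two-character symbol: f0 direction followed by energy direction."""
    return direction(slope(f0_segment)) + direction(slope(energy_segment))
```

For example, a segment with rising f0 and falling energy maps to the symbol `"+-"`. A sequence of such symbols describes how the two contours move relative to each other, independently of their absolute values.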
Another proposed approach is explicit template matching of the f0 contours of a predefined set of words and phrases.
(How to explain EER:)
The performance measure used to evaluate the described systems is the equal error rate (EER). It is the error rate at the operating point where the false acceptance rate (accepting an impostor) equals the miss rate (rejecting a true speaker).
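The definition of the EER can be made concrete with a small threshold sweep. The sketch below is a minimal illustration under stated assumptions: higher scores mean "more likely the true speaker", a trial is accepted when its score is at or above the threshold, and on a finite score set the EER is approximated by averaging the two error rates at the threshold where they are closest. The function name and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # Miss: true-speaker trials scoring below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - miss)
        if gap < best_gap:
            best_gap, eer = gap, (far + miss) / 2
    return eer

# Toy scores (invented values, for illustration only).
targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(targets, impostors)
```

With these toy scores, a threshold of 0.6 accepts one of four impostors and rejects one of four true speakers, so both error rates are 25% and the EER is 0.25.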
With prosody, as with other aspects of spoken language, speaker information may be found in both static and dynamic forms and may originate from anatomical, physiological, or behavioral differences among individuals.
(Explain why system fusion is needed:)
Because the baseline system models the absolute f0 and energy values while the slope system models the relative dynamics of the f0 and energy contours, the two systems capture complementary information, and their fusion is expected to outperform either system alone.
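The text does not say how the two systems are fused. One common option is score-level fusion, sketched below under that assumption: each system's trial scores are z-normalized so they share a scale, then combined with a weighted sum. The function names and the `weight` parameter are hypothetical, and in practice the weight would be tuned on held-out data.

```python
from statistics import mean, pstdev

def z_normalize(scores):
    """Shift and scale scores to zero mean and unit variance.

    Assumes the scores are not all identical (standard deviation > 0).
    """
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]

def fuse_scores(baseline_scores, slope_scores, weight=0.5):
    """Weighted sum of the normalized scores of the two systems.

    Both lists must score the same trials in the same order.
    weight = 1.0 keeps only the baseline system, 0.0 only the slope system.
    """
    b = z_normalize(baseline_scores)
    s = z_normalize(slope_scores)
    return [weight * x + (1 - weight) * y for x, y in zip(b, s)]
```

Normalizing first matters because the two systems may produce scores on very different scales, in which case an unnormalized sum would be dominated by one of them.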