Tuesday, November 30, 2010

[Feature] SIFT

Attachment: tutSIFT04.pdf (3040 KB)

In spite of significant progress in automatic speech recognition over the years, robustness still appears to be a stumbling block. Current commercial products are quite sensitive to changes in recording device, to acoustic clutter in the form of additional speech signals, and so on. The goal of replicating human performance in a machine remains far out of reach.

Scale Invariant Feature Transform (SIFT) is an approach for detecting and extracting local feature descriptors that are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint.

Detection stages for SIFT features:
1) Scale-space extrema detection

Interest points for SIFT features correspond to local extrema of difference-of-Gaussian filters at different scales.

Interest points (called keypoints in the SIFT framework) are identified as local maxima or minima of the DoG (difference of Gaussian) images across scales. Each pixel in a DoG image is compared to its 8 neighbors at the same scale, plus the 9 corresponding neighbors at each of the two adjacent scales (26 neighbors in total). If the pixel is a local maximum or minimum, it is selected as a candidate keypoint.
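A minimal numpy/scipy sketch of this extrema search (the number of scales, the base sigma, and the scale step below are illustrative choices, not the values used in the SIFT paper):

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, base_sigma=1.6, k=2 ** 0.5, num_scales=5):
    """Return candidate keypoints as (scale_index, row, col) tuples."""
    # Stack of Gaussian-blurred images at increasing scales.
    sigmas = [base_sigma * k ** i for i in range(num_scales)]
    gaussians = np.stack([gaussian_filter(image, s) for s in sigmas])

    # Difference-of-Gaussian images: differences of adjacent blurred images.
    dog = gaussians[1:] - gaussians[:-1]

    # A pixel is a candidate if it equals the max (or min) of its 3x3x3
    # neighbourhood: 8 neighbours in its own DoG image plus 9 in each of
    # the two adjacent scales.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)

    # Skip the first and last DoG levels, which lack a neighbouring scale.
    candidates = np.argwhere(is_max | is_min)
    return [tuple(c) for c in candidates if 0 < c[0] < dog.shape[0] - 1]

# Example on a random image; real use would load a grayscale photograph.
candidates = dog_extrema(np.random.rand(128, 128).astype(np.float32))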

For each candidate keypoint:
- Interpolation of nearby data is used to accurately determine its position;
- Keypoints with low contrast are removed;
- Responses along edges are eliminated;
- The keypoint is assigned an orientation.

To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint (using the Gaussian image at the scale closest to the keypoint's scale). The contribution of each neighboring pixel is weighted by its gradient magnitude and by a Gaussian window whose sigma is 1.5 times the scale of the keypoint.

Peaks in the histogram correspond to dominant orientations. A separate keypoint is created for the direction corresponding to the histogram maximum, and any other direction within 80% of the maximum value.
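A rough sketch of this orientation assignment for a single keypoint, assuming the Gaussian-blurred image at the keypoint's scale is already available; the 36-bin histogram and the 80% rule follow the description above, while the patch radius is an arbitrary choice for the sketch:

import numpy as np
from scipy.ndimage import gaussian_filter

def keypoint_orientations(blurred, row, col, scale, num_bins=36):
    """Return the dominant gradient orientations (radians) at (row, col)."""
    sigma = 1.5 * scale                      # Gaussian weighting window
    radius = int(round(3 * sigma))           # illustrative patch radius
    patch = blurred[row - radius:row + radius + 1,
                    col - radius:col + radius + 1]

    # Gradient magnitude and orientation in the neighbourhood.
    dy, dx = np.gradient(patch)
    magnitude = np.hypot(dx, dy)
    orientation = np.arctan2(dy, dx)

    # Gaussian weight centred on the keypoint, sigma = 1.5 * scale.
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    weight = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))

    hist, edges = np.histogram(orientation, bins=num_bins,
                               range=(-np.pi, np.pi),
                               weights=magnitude * weight)

    # Keep the histogram maximum and any other peak within 80% of it.
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[hist >= 0.8 * hist.max()]

blurred = gaussian_filter(np.random.rand(128, 128), 1.6)
print(keypoint_orientations(blurred, row=64, col=64, scale=1.6))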

All the properties of the keypoint are measured relative to the keypoint orientation, which provides invariance to rotation.

2) Keypoint localization

3) Orientation assignment

4) Generation of keypoint descriptors


[Feature] Speech Recognition with localized time-frequency pattern detectors

Characteristics of the localized time-frequency features:
1) Locality in the frequency domain: unlike MFCCs, where each coefficient is affected by all frequencies, each feature depends only on a local frequency region;
2) Temporal dynamics: the features model long and variable time durations, whereas MFCC features are short-time and of fixed duration.

In this paper, the set of filters adopted is very simple: they are essentially basic edge detectors taking only the values +1 and -1. The selection includes vertical edges (of varying frequency span and temporal duration) for onsets and offsets; wide horizontal edges for frication cutoffs; and horizontal edges tilted at various slopes to model formant transitions. The ranges of the various parameters were chosen based on acoustic-phonetic knowledge, such as typical formant bandwidths, average phone durations, and typical rates of formant movement. [Book: Acoustic Phonetics]
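To make the filter shapes concrete, here is a small Python sketch of the three kinds of +1/-1 edge detectors described above; the frequency spans, durations, and slopes are arbitrary placeholders, not the parameter ranges used in the paper:

import numpy as np

# Vertical edge (onset/offset detector): -1 before the edge in time,
# +1 after, over some frequency span (rows) and duration (columns).
def vertical_edge(freq_span=8, duration=10):
    filt = np.ones((freq_span, duration))
    filt[:, :duration // 2] = -1
    return filt

# Horizontal edge (e.g. frication cutoff): -1 in the lower frequency
# half, +1 in the upper half.
def horizontal_edge(freq_span=8, duration=10):
    filt = np.ones((freq_span, duration))
    filt[:freq_span // 2, :] = -1
    return filt

# Tilted edge (formant transition): a horizontal edge whose boundary
# moves by `slope` frequency bins per time step.
def tilted_edge(freq_span=8, duration=10, slope=0.5):
    filt = -np.ones((freq_span, duration))
    for t in range(duration):
        boundary = int(freq_span // 2 + slope * (t - duration // 2))
        filt[np.clip(boundary, 0, freq_span):, t] = +1
    return filt

print(vertical_edge(4, 6))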

 

With these filters, the features are computed as follows.
For each filter:
1) center it at a particular frequency and convolve it with the whole spectrogram along the time axis for that specific frequency;
2) for each frequency value, this gives a time series of convolution sums;
3) to reduce the feature dimension, the convolution sums are downsampled over a 16×32 point grid;
4) the 32 frequency points are taken linearly between 0 and 4 kHz;
5) the 16 time points (which are specific to their task) are the centers of the 16 states (of the HMM word model) in the state alignment.

Thus, a feature refers to both the filter shape and the time-frequency point on the 16×32 grid.
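A rough sketch of this computation for one filter, assuming a spectrogram array S of shape (num_freq_bins, num_frames); note that the 16 time points are taken uniformly over the utterance here, instead of from the HMM state alignment used in the paper:

import numpy as np
from scipy.signal import correlate2d

def filter_features(S, filt, num_freq_points=32, num_time_points=16):
    """Correlate one +1/-1 edge filter with a spectrogram and sample a
    grid of num_freq_points x num_time_points responses."""
    # A 2-D correlation gives the filter response centred at every
    # (frequency, time) position; sliding the filter along time for a
    # fixed centre frequency is just one row of this output.
    response = correlate2d(S, filt, mode="same", boundary="fill")

    # 32 frequency points spaced linearly over the available bins
    # (0-4 kHz in the paper) and 16 time points, taken uniformly here.
    freq_idx = np.linspace(0, S.shape[0] - 1, num_freq_points).astype(int)
    time_idx = np.linspace(0, S.shape[1] - 1, num_time_points).astype(int)
    return response[np.ix_(freq_idx, time_idx)]   # shape (32, 16)

# Hypothetical example: a random "spectrogram" and a simple onset
# detector (-1 on the left half, +1 on the right half of the window).
S = np.abs(np.random.randn(128, 300))
onset_filter = np.ones((8, 10))
onset_filter[:, :5] = -1
features = filter_features(S, onset_filter)
print(features.shape)   # (32, 16): one response per grid point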

In this paper, the task is to classify isolated digits: each spectrogram is one single digit. The feature is computed per digit and is thus global to the target class.

###################

The problem with current frame-based features is their lack of localization, which makes it difficult to model speaker variability.
As shown in the following figure:



Monday, November 29, 2010

[Speech] Phonetic cues

While automatic speech recognition systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, and there is still a considerable gap between human and machine performance, particularly in adverse conditions.

How does human evolution make humans different from machines in perceiving speech signals?

What are the major differences between humans and machines when processing speech signals? And which are the crucial ones?

The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles.

Perhaps most importantly, ASR systems have benefited greatly from general improvements in computer technology. The availability of very large datasets and the ability to utilize them for training models has been very beneficial. Also, with ever increasing computing power, more powerful search techniques can be utilized during recognition.

Reaching the ultimate goal of ASR - human-level (or beyond) performance in all conditions and on all tasks - will require investigating other regions of this landscape, even if doing so results in back-tracking in progress in the short-term.

Utilize the knowledge of acoustic phonetics and human speech perception in speech recognition.

We will argue that well-known phonetic cues crucial to human speech perception are not modeled effectively in standard ASR systems, and that there is a benefit to modeling such cues explicitly rather than implicitly.

The "glimpsing" model of speech perception suggests that humans can robustly decode noise-corrupted speech by taking advantage of local T-F regions having high SNR, and is supported by empirical evidence and computational models.

Auditory neuroscience: tonotopic maps. Also, recent research seeking to characterize the behavior of individual neurons in the mammalian auditory cortex has resulted in models in which cortical neurons act as localized spectro-temporal pattern detectors (represented by their so-called spectro-temporal receptive field, or STRF).

+ Localized T-F pattern detectors
+ Explicit modeling of phonetic cues

Corpora:

acoustic ----------> phonetic cues  ------------------> phonemes


[DBN] Learning rate for RBM training

Thursday, November 25, 2010

[Speech] Spectrogram

From: http://www-3.unipv.it/cibra/edu_spectrogram_uk.html

To analyze sounds, you need an acoustic receiver (a microphone, a hydrophone, or a vibration transducer) and an analyzer suitable for the frequencies of the signals to be measured. In addition, a recorder makes it possible to store the sounds permanently for later analysis or playback.

A spectrograph transforms sounds into images to make "visible", and thus measurable and comparable, sound features that the human ear can't perceive. Spectrograms (also called sonograms or sonagrams) may show infrasounds, like those emitted by some large whales or by elephants, as well as ultrasounds, like those emitted by echolocating dolphins and bats, but also by insects and small rodents.

Spectrograms may reveal features, such as fast frequency or amplitude modulations, that we can't hear even when they lie within our hearing frequency limits (30 Hz - 16 kHz). Spectrograms are widely used to show the features of animal voices, of the human voice, and also of machinery noise.

A real-time spectrograph continuously displays the results of the analysis of incoming sounds with a very small, often imperceptible, delay. This kind of instrumentation is very useful in field research because it allows continuous monitoring of the sounds received by the sensors, immediate evaluation of their features, and classification of the received signals. A spectrograph can be a dedicated instrument or a normal computer equipped with suitable hardware for receiving and digitizing sounds and software to analyze the sounds and convert them into a graphical representation.

Normally, a spectrogram represents time on the x axis, frequency on the y axis, and the amplitude of the signals with a scale of grays or a scale of colours. In some applications, in particular military ones, the x and y axes are swapped.

The quality and features of a spectrogram are controlled by a set of parameters. A default set can be used for generic display, but some parameters can be changed to optimize the display of specific features of the signals.
Also, by modifying the colour scale it is possible to optimize the display of the amplitude range of interest.
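As a concrete example, here is a minimal Python sketch that computes and plots a spectrogram following these conventions (time on x, frequency on y, amplitude as a colour scale); the window length (nperseg) and overlap are the main parameters trading off time against frequency resolution, and the values below are only illustrative:

import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 16000                                   # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# nperseg controls the analysis window length: longer windows give finer
# frequency resolution but coarser time resolution, and vice versa.
f, times, Sxx = spectrogram(x, fs=fs, window="hann",
                            nperseg=512, noverlap=384)

plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.xlabel("Time (s)")                       # time on the x axis
plt.ylabel("Frequency (Hz)")                 # frequency on the y axis
plt.colorbar(label="Power (dB)")             # amplitude as a colour scale
plt.savefig("spectrogram.png")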



Tuesday, November 23, 2010

[News] ACMTech Nov.23

AT&T Ups the Ante in Speech Recognition
CNet (11/18/10) Marguerite Reardon

AT&T says it has devised technologies to boost the accuracy of speech and language recognition technology as well as broaden voice activation to other modes of communication.  AT&T's Watson technology platform is a cloud-based system of services that identifies words as well as interprets meaning and contexts to make results more accurate.  AT&T recently demonstrated various technologies such as the iRemote, an application that transforms smartphones into voice-activated TV remotes that let users speak natural sentences asking to search for specific programs, actors, or genres.  Most voice-activated remotes respond to prerecorded commands, but the iRemote not only recognizes words, but also employs other language precepts such as syntax and semantics to interpret and comprehend the request's meaning.  AT&T also is working on voice technology that mimics natural voices through its AT&T Natural Voices technology, which builds on text-to-speech technology to enable any message to be spoken in various languages, including English, French, Italian, German, or Spanish when text is processed via the AT&T cloud-based service.  The technology accesses a database of recorded sounds that, when combined by algorithms, generate spoken phrases.
http://news.cnet.com/8301-30686_3-20023189-266.html


What If We Used Poetry to Teach Computers to Speak Better?
McGill University (11/17/10)

McGill University linguistics researcher Michael Wagner is studying how English and French speakers use acoustic cues to stress new information over old information.  Finding evidence of a systematic difference in how the two languages use these cues could aid computer programmers in their effort to produce more realistic-sounding speech.  Wagner is working with Harvard University's Katherine McCurdy to gain a better understanding of how people decide where to put emphasis.  They recently published research that examined the use of identical rhymes in poetry in each language.  The study found that even when repeated words differ in meaning and sound the same, the repeated information should be acoustically reduced as otherwise it will sound odd.  "Voice synthesis has become quite impressive in terms of the pronunciation of individual words," Wagner says.  "But when a computer 'speaks,' whole sentences still sound artificial because of the complicated way we put emphasis on parts of them, depending on context and what we want to get across."  Wagner is now working on a model that better predicts where emphasis should fall in a sentence given the context of discourse.
http://www.eurekalert.org/pub_releases/2010-11/mu-wiw111710.php


Monday, November 22, 2010

Enabling Terminal's directory and file color highlighting in Mac

From: http://www.geekology.co.za/blog/2009/04/enabling-bash-terminal-directory-file-color-highlighting-mac-os-x/

By default Mac OS X’s Terminal application uses the Bash shell (Bourne Again SHell), but it doesn’t have directory and file color highlighting enabled to indicate resource types and permissions settings.

[Screenshot: Terminal directory and file color highlighting]

Enabling directory and file color highlighting requires that you open (or create) ~/.bash_profile in your favourite text editor and add these contents:

export CLICOLOR=1
export LSCOLORS=ExFxCxDxBxegedabagacad

… save the file and open a new Terminal window (shell session). Any variant of the “ls” command:

ls
ls -l
ls -la
ls -lah

… will then display its output in color.

More details on the LSCOLORS variable can be found by looking at the man page for “ls”:

man ls

LSCOLORS needs 11 pairs of letters, each pair giving a foreground and a background color, one pair for each of the following resource types:

  1. directory
  2. symbolic link
  3. socket
  4. pipe
  5. executable
  6. block special
  7. character special
  8. executable with setuid bit set
  9. executable with setgid bit set
  10. directory writable to others, with sticky bit
  11. directory writable to others, without sticky bit

The possible letters to use are:

a   black
b   red
c   green
d   brown
e   blue
f   magenta
g   cyan
h   light grey
A   bold black, usually shows up as dark grey
B   bold red
C   bold green
D   bold brown, usually shows up as yellow
E   bold blue
F   bold magenta
G   bold cyan
H   bold light grey; looks like bright white
x   default foreground or background

By referencing these values, the LSCOLORS value used above (ExFxCxDxBxegedabagacad) can be decoded pair by pair into the foreground and background colors for each resource type.
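As a quick illustration of how the 11 pairs are read (foreground letter followed by background letter, in the order of the list above), here is a small Python sketch that decodes the LSCOLORS value used earlier:

# Decode an LSCOLORS string into (resource type, foreground, background).
COLOURS = {
    "a": "black", "b": "red", "c": "green", "d": "brown",
    "e": "blue", "f": "magenta", "g": "cyan", "h": "light grey",
    "A": "bold black", "B": "bold red", "C": "bold green",
    "D": "bold brown", "E": "bold blue", "F": "bold magenta",
    "G": "bold cyan", "H": "bold light grey", "x": "default",
}

TYPES = [
    "directory", "symbolic link", "socket", "pipe", "executable",
    "block special", "character special",
    "executable with setuid bit set", "executable with setgid bit set",
    "directory writable to others, with sticky bit",
    "directory writable to others, without sticky bit",
]

def decode_lscolors(value):
    pairs = [value[i:i + 2] for i in range(0, len(value), 2)]
    return [(t, COLOURS[fg], COLOURS[bg])
            for t, (fg, bg) in zip(TYPES, pairs)]

for entry in decode_lscolors("ExFxCxDxBxegedabagacad"):
    print(entry)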

