Speech Processing -Anu
• Despite all these problems, adults recognize spoken words correctly.
• Most current models map the speech signal onto sub-lexical units, then onto words, and finally onto semantics.
• Recognition of a unit occurs when its activation exceeds either a threshold or some activation state relative to all other units at its level.
• The simplest way to study spoken word recognition is to measure ‘recognisability’, i.e., identification of words presented in noise or as truncated or filtered speech stimuli.
• However, these tasks fail to provide a reliable measure of reaction time because of their variability.
Speech periods (time cycles that define pitch or rhythm) are harder to
detect when masked by noise.
This creates challenges for tone analysis and pitch tracking.
The following methods have been used to deal with additive noise:
•Mimic how humans process sounds using our ears and brain.
•Identify speech features (pitch, formants, etc.) while ignoring irrelevant noise.
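The pitch-tracking problem above can be illustrated with a minimal autocorrelation pitch estimator; the signal and parameters below are synthetic and purely illustrative:

```python
import math

def autocorrelation_pitch(signal, sample_rate, fmin=50, fmax=500):
    """Estimate F0 by finding the lag (speech period, in samples)
    that maximises the frame's autocorrelation."""
    n = len(signal)
    lag_min = int(sample_rate / fmax)  # shortest period considered
    lag_max = int(sample_rate / fmin)  # longest period considered
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n - 1)):
        corr = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag

# A clean, synthetic 200 Hz 'voiced' frame sampled at 8 kHz:
sr = 8000
frame = [math.sin(2 * math.pi * 200 * t / sr) for t in range(400)]
print(autocorrelation_pitch(frame, sr))  # 200.0
```

With additive noise the autocorrelation peak flattens out, which is exactly why the noise-handling methods below are needed.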
Noise Reduction and Suppression:
•Algorithms that estimate and subtract noise from the speech signal.
•Example: Noise gates or spectral subtraction.
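Spectral subtraction, named above, can be sketched as follows; for illustration the noise spectrum is estimated from the noise itself, whereas real systems estimate it from speech-free frames (NumPy is assumed):

```python
import numpy as np

def spectral_subtraction(noisy, noise_frame, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from the noisy
    spectrum, keep the noisy phase, and resynthesise the waveform."""
    spec = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_frame))
    # Never let magnitudes go negative: keep a small spectral floor.
    clean_mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=len(noisy))

# Toy example: a 100 Hz tone buried in white noise.
rng = np.random.default_rng(0)
t = np.arange(1024) / 8000
tone = np.sin(2 * np.pi * 100 * t)
noise = 0.5 * rng.standard_normal(1024)
denoised = spectral_subtraction(tone + noise, noise)
```

The spectral floor is the standard guard against "musical noise": bins where subtraction would go negative are clamped to a small fraction of the original magnitude instead.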
Noise Masking:
•Add specific types of noise (like white noise) to “cover” the unpleasant or interfering noise.
•Used when noise cannot be removed entirely.
Adaptive Models:
•Filters that update their noise estimate continuously as conditions change (e.g., adaptive noise cancellation).
Participants progressively recognized cap, cat, cab more accurately as the vowel
duration increased.
At about halfway through the vowel, participants could correctly identify cat, cap,
and cab more often than by random chance.
For cash, participants needed the final consonant (/ʃ/) to correctly identify it.
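The progressive narrowing seen with cap, cat, cab, and cash can be sketched as a cohort filter over a toy lexicon (spelling stands in for phonemes here):

```python
def cohort(prefix, lexicon):
    """Return every word still consistent with the input heard so far."""
    return sorted(w for w in lexicon if w.startswith(prefix))

lexicon = {"cap", "cat", "cab", "cash", "dog"}
print(cohort("ca", lexicon))   # ['cab', 'cap', 'cash', 'cat']
print(cohort("cat", lexicon))  # ['cat']
```

Only when the input uniquely distinguishes a candidate (as the final /ʃ/ does for cash) does the cohort shrink to one word.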
Study 2: Grosjean (1980)
Spoken Word Recognition Processes and Gating Paradigm
Method:
•Words of varying lengths and frequencies were presented to participants in three contexts:
• In isolation: No extra context, just the word.
• In short context: Minimal surrounding linguistic context.
• In long context: A sentence or phrase providing substantial contextual
information.
•The presentation of each word was incremental (word duration increased gradually).
•After each increment:
• Participants wrote down their guess of the word.
• Indicated their confidence level in the guess.
Findings:
Context Helps:
•Words presented in longer contexts were identified after less acoustic input than words in short contexts or in isolation.
3. Lexical Decision
•A lexical decision task measures how participants process and recognize words in real time.
•Task: Participants decide whether a given stimulus is a real word (e.g., "umbrella") or a non-word (e.g., "umbrellir").
•Purpose: To study lexical access—how quickly and efficiently the brain retrieves information about words
from the mental lexicon (our internal "dictionary").
1.Online Processing:
•Word Length:
• Longer words generally take more time to process
•Word Frequency:
• High-frequency words (e.g., "car") are recognized faster than low-frequency words (e.g., "vial").
•Non-Words: Non-words that closely resemble real words (e.g., "umbrellir") take longer
to reject than completely nonsensical ones (e.g., "flobber").
4.Lexical Access:
Latency (response time) indicates how quickly the brain retrieves a word from the mental lexicon.
Faster RTs suggest easier or more automatic access.
4. Word spotting
•Word spotting is a task in speech processing where participants or systems identify specific
target words embedded within a continuous stream of speech.
•Unlike full sentence recognition, word spotting focuses only on detecting whether a word exists
in the input.
Why Use Word Spotting?
To study how listeners or machines identify key words in noisy or complex speech environments.
Example: A sentence like “I ate an apple pie” contains the target word “apple.”
Acoustic Salience:
•Words with distinct acoustic features (e.g., stress, intonation) are easier to spot.
•Example: “APPLE” in a loud, clear tone is easier to detect than “apple” in a monotone speech.
•Frequency: High-frequency words (e.g., “dog”) are detected more easily than low-frequency words (e.g.,
“lichen”)
•Length: Shorter words (e.g., “cat”) are harder to spot than longer words due to potential overlaps with
parts of other words
Speech Rate:
•Faster speech reduces word spotting accuracy.
•Slower speech gives more time for processing and increases accuracy.
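The basic word-spotting task (finding a target sequence inside a continuous stream with no boundary cues) can be sketched as a simple scan; letters stand in for phonemes:

```python
def spot_word(target, stream):
    """Return every index at which the target sequence occurs
    inside the continuous stream."""
    return [i for i in range(len(stream) - len(target) + 1)
            if stream[i:i + len(target)] == target]

print(spot_word(list("apple"), list("fapplesauce")))  # [1]
print(spot_word(list("dog"), list("fapplesauce")))    # []
```

Shorter targets match spuriously more often (e.g., cat occurs inside concatenate), which is one reason shorter words are harder to spot.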
McQueen and Cutler (1998): Word Spotting in Contexts
Study Design:
1. Participants were given nonsense speech stimuli containing real words randomly embedded.
2. Words were presented in different contexts:
1. Syllabic context: e.g., "vuffapple" (the word is preceded by a syllable, providing a likely word boundary).
2. Consonantal context: e.g., "fapple" (the word is preceded by a lone consonant, leaving no clear word boundary).
Findings:
1. Syllabic Context is Better: Words were easier to spot in longer syllabic contexts (e.g.,
"vuffapple") than shorter consonantal ones (e.g., "fapple").
2. Phonotactic Probability:
1. Detection improves when the structure of the nonsense speech (e.g., its
phonotactics) predicts where a word boundary should occur.
2. Example: "venlip" makes "lip" easier to spot than "veglip," where phonotactic rules
do not suggest a clear boundary.
Phonotactic Probabilities
Definition: Phonotactic rules determine the likelihood of certain sound sequences in a language.
• E.g., In English, "lip" in "venlip" is segmented easily because English phonotactics favors a
syllable break before "lip."
• Conversely, "lip" in "veglip" is harder to segment due to unnatural syllable boundaries.
Impact: When nonsense stimuli align with natural language phonotactics, the embedded word is
recognized more easily.
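One simple way to operationalise phonotactic probability is to count sound-pair (bigram) frequencies over a lexicon. The toy lexicon below is illustrative, so the scores reflect it rather than real English phonotactics:

```python
from collections import Counter

def bigram_probs(lexicon):
    """Estimate phonotactic (bigram) probabilities from a toy lexicon,
    padding with '#' to capture word edges."""
    bigrams = Counter()
    for word in lexicon:
        padded = "#" + word + "#"
        bigrams.update(zip(padded, padded[1:]))
    total = sum(bigrams.values())
    return {bg: n / total for bg, n in bigrams.items()}

def sequence_score(seq, probs):
    """Product of bigram probabilities; a low score flags a sequence
    that the phonotactics disfavours, hinting at a word boundary."""
    score = 1.0
    for bg in zip(seq, seq[1:]):
        score *= probs.get(bg, 1e-6)  # unseen bigram: tiny floor value
    return score

probs = bigram_probs(["lip", "lid", "venlip", "net", "pen"])
# 'nl' is attested in this toy lexicon while 'gl' is not, so:
print(sequence_score("venlip", probs) > sequence_score("veglip", probs))  # True
```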
Similarity Neighbourhoods
A word’s similarity neighbourhood is the set of words that sound like it.
Sparse Neighbourhoods:
• Recognized faster.
• Recognized with higher accuracy.
Dense Neighbourhoods:
• Recognition is slowed by competition among the many similar-sounding neighbours.
5. Phoneme-Triggered Lexical Decision
A phoneme-triggered lexical decision task is a variant of the lexical decision task designed to investigate how phoneme-level and word-level information interact during sentence processing.
The task focuses on lexical access: how quickly and efficiently participants recognize real words beginning with a specified phoneme.
Setup:
• Participants are presented with a set of sentences.
• Before hearing each sentence, they are given a target phoneme to listen for (e.g., /k/).
Task:
• Participants must:
• Identify real words beginning with the specified phoneme as they occur in the sentence.
• Ignore nonsense words (even if they contain the target phoneme).
• Example:
• Target phoneme: /k/.
• Sentence: "Bobby drove the car into the lake."
• Participant's Response: Press the button on hearing "car."
Manipulation
• The speed of lexical access is varied by altering the semantic predictability of the
target word.
• Semantically related context: The verb or preceding words strongly suggest the
target word (e.g., "drove the car").
• Semantically unrelated context: No strong association with the target word (e.g.,
"saw the car").
Findings (Blank, 1980)
Objective: The study explored how people perceive, process, and repeat words and non-words containing the fricatives /s/ or /ʃ/, and the factors influencing naming times and error rates in an auditory task.
Stimuli Used
Fricative: Whether the sound was /s/ (e.g., mess) or /ʃ/ (e.g., mesh).
Lexicality:
Whether the item was a real word (e.g., mess) or a non-word (e.g., ness).
Location:
Initial fricative: At the start of the word (e.g., sack).
Final fricative: At the end of the word (e.g., mess).
Changeability:
This describes whether changing the fricative identity changes the item’s lexical status:
Example: Mess → Mesh (both real words).
Example: Ness → Nesh (both non-words).
Experiment Design
•Task:
• Participants listened to all versions of the stimuli via headphones.
• They repeated what they heard into a microphone.
Results
When presented with a mismatched version of a stimulus, subjects perceived the correct form of the word, indicating that fricatives are important for word recognition.
7. Continuous Speech (Shadowing)
by Marslen-Wilson, 1985
What is Shadowing?
Definition:
Shadowing is a task where a subject listens to spoken language and repeats it back
immediately, word-for-word, with minimal delay.
Purpose:
The experiment is designed to study speech perception and how the brain
processes and repeats language in real time.
Chistovich's 1960 Findings
Two Types of Shadowers
1.Close Shadowers:
1. Delays: Very short, about 150–200 milliseconds (msec).
2. Characteristics: Speech is slurred and difficult to analyze for accuracy.
Demonstrates rapid and efficient speech perception, where the listener almost immediately
anticipates what they hear.
2.Distant Shadowers:
1. Delays: Longer, between 500–1500 msec.
2. Characteristics: Speech is clear and easy to understand.
Conclusion:
Participants:
•65 participants, including men and women
Key Observations:
1.Syntactic Prose:
1. Contains normal syntax (grammatical sentence structure), but is semantically
meaningless.
Example: The blue ideas sleep furiously.
3.Jabberwocky:
1. Maintains basic syntax, but the words are replaced with nonsense words by modifying
sounds.
2. Inspired by Lewis Carroll’s Jabberwocky poem.
Example: Twas brillig and the slithy toves.
Second Series of Experiments
Purpose: To determine if close and distant shadowers process syntactic and semantic information during
shadowing.
Findings:
•Both close and distant shadowers actively analyze the syntax and semantics of the material while
shadowing.
•Evidence:
• Spontaneous Errors: Their errors were constrained by the syntactic and semantic structure of
the prose, meaning their mistakes weren’t random but followed logical rules.
• Sensitivity to Disruptions: Both groups showed reduced performance when the syntactic or
semantic structure of the material was disrupted.
Third Series of Experiments
Findings:
Close Shadowers:
• Use on-line (real-time) speech analysis to drive their articulation.
• Begin repeating speech before they are fully aware of the material's meaning, relying on rapid
processing.
• Advantage: Close shadowing provides a more direct reflection of language comprehension,
with minimal interference from later (post-perceptual) processes.
Distant Shadowers:
• Use slower, more deliberate output strategies, requiring greater conscious awareness of the
material.
Vitevitch and Luce’s Shadowing Task
Studies (1998, 1999, 2005)
Research Focus:
They examined how phonotactic probabilities and neighborhood density influence the speed
of shadowing.
1.Phonotactic Probabilities:
1. Likelihood of a sound sequence occurring in a language (e.g., str in "street" is high
probability, whereas fsr is low probability).
2.Neighborhood Density:
1. Refers to how many words sound similar to a given word.
2. Example: Cat has a dense neighborhood (bat, mat, sat), whereas quirk has a sparse
neighborhood.
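Neighbourhood density is usually counted as the number of words differing from the target by one phoneme (a single substitution, addition, or deletion). A minimal sketch over a toy lexicon, again with letters standing in for phonemes:

```python
def is_neighbor(a, b):
    """True if b differs from a by one substitution, addition,
    or deletion of a single segment."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) == 1:
        short, long = sorted((a, b), key=len)
        return any(long[:i] + long[i + 1:] == short for i in range(len(long)))
    return False

def neighborhood(word, lexicon):
    return sorted(w for w in lexicon if is_neighbor(word, w))

lexicon = {"cat", "bat", "mat", "sat", "cut", "cart", "quirk"}
print(neighborhood("cat", lexicon))    # ['bat', 'cart', 'cut', 'mat', 'sat']
print(neighborhood("quirk", lexicon))  # []
```

Against this toy lexicon, cat has a dense neighbourhood and quirk a sparse one, mirroring the example above.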
Interpretation
Non-words:
• Non-words with high phonotactic probabilities (common sound
sequences) and dense neighborhoods were repeated faster than those
with low probabilities and sparse neighborhoods.
Words:
• The opposite was true—words with low phonotactic probabilities and
sparse neighborhoods were repeated faster than those with high
probabilities and dense neighborhoods.
8. Tokens embedded in Words and
Non-Words
Study by Zhang & Samuel (2015): Tokens Embedded in Words and Non-Words
Objective:
•Investigated how listeners process English words containing shorter words embedded within them (e.g.,
ham in hamster).
•Used auditory priming experiments to assess when embedded words become activated under varying
listening conditions.
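The embedding relation studied here (e.g., ham and am inside hamster) can be enumerated with a simple scan over a toy lexicon; real studies work over phonemic transcriptions:

```python
def embedded_words(carrier, lexicon, min_len=2):
    """List (word, position) for every shorter lexicon word
    contained in the carrier."""
    found = []
    for i in range(len(carrier)):
        for j in range(i + min_len, len(carrier) + 1):
            sub = carrier[i:j]
            if sub != carrier and sub in lexicon:
                found.append((sub, i))
    return found

lexicon = {"ham", "hamster", "am", "ha"}
print(embedded_words("hamster", lexicon))  # [('ha', 0), ('ham', 0), ('am', 1)]
```

Position matters for the priming results: word-initial embeddings like ham (position 0) behave differently from word-internal ones like am.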
Experiment 1: Optimal Listening Conditions
Findings:
• Isolated embedded words primed targets (ham → pig) in all conditions.
• In carrier words: Priming occurred only if the embedded word was at the beginning or
comprised a large proportion of the carrier word.
Experiment 2: Duration Change
Findings:
• Significant priming for isolated embedded words, even under duration changes.
• No priming when carrier words were compressed or expanded.
Experiment 3: Segment Loss
•Method: Replaced a segment of carrier words with noise (e.g., h_noise_mster).
•Findings: Priming was eliminated, indicating embedded word activation relies on intact speech signals.
Findings
Priming for embedded words persisted in isolation (ham), but not when embedded in carrier words (hamster).
Overall Findings:
1.Activation Factors:
1. Embedded words are activated if they are at the beginning of the carrier word.
2. Activation is stronger when embedded words constitute a large proportion of the carrier
word.
2.Listening Conditions:
1. Embedded word activation occurs only under optimal conditions (e.g., clear speech,
minimal distortion).
2. Under suboptimal conditions (e.g., noise, duration changes, cognitive load), activation is
impaired, especially in carrier words.
Study by Vroomen & de Gelder (1997): Embedded
Monosyllables
Objective:
•Explored cross-modal priming in Dutch to study when monosyllables embedded within other words or
non-words are activated.
Key Findings:
Context
1.Two-Syllable Words:
1. Example: framboos (strawberry) contains the embedded word boos (angry).
2. Finding: Embedded words (boos) produced significant priming for related words, showing
activation in two-syllable contexts.
2.Monosyllabic Words:
1. Example: swine contains wine.
2. Finding: No priming was found for embedded words (wine) in monosyllabic carriers.
Position Effects:
1. Initial Position:
1. Example: vel (skin) in velk (non-word) or velg (word).
2. Finding: Priming occurred when the carrier was a non-word (velk) but not when it was a
word (velg).
3. Longer words inhibit activation of shorter, embedded words due to lexical
competition.
2. Final Position:
1. Example: wine in swine or twine.
2. Finding: No evidence of embedded word activation.
Conclusion:
1.Lexical Competition:
Embedded word activation is influenced by lexical competition, where longer words inhibit
shorter embedded words in word contexts.
2.Syllable Onset:
Activation is stronger when the embedded word has a matching syllable onset with the
lexical representation.
9. Rhyme Monitoring (Marslen-
Wilson, 1980)
The subjects are presented auditorily with sentences, and a cue word rhyming with the target word is presented.
•In a variant of the task, instead of a rhyming cue, the cue word is semantically related to the target word.
• Example:
• Target word: lead
• Cue word: metal
•Subjects listen to a sentence and press a switch when they hear the target word.
•Nature of Sentences:
•Sentences can be:
• Meaningful: Contextually coherent.
• Nonsense: Lacking semantic coherence
Word Monitoring Paradigm:
2. Independent Variables:
1. Nature of the target word (e.g., semantically related or unrelated).
2. Position of the target word in the sentence.
3. Context of the sentence (e.g., meaningful or nonsense).
3. Dependent Variables:
1. Response Latency: Time taken to press the switch.
2. Error Rate: Missed or incorrect responses.
3. Brain Imaging Data: Neural activity during the task
Findings
1.Participants' Task:
1. Participants listen to a sentence played to them.
2. Before the sentence finishes, they are shown a visual stimulus
(either a word or a picture) on a screen.
3. The visual stimulus can either be:
1. Related (or identical) to a word they heard in the sentence
earlier (e.g., the word "dog" after hearing "animal").
2. Unrelated to the sentence they heard.
2.Response:
1. As soon as they see the word/picture, they are instructed to press a button as quickly as
possible.
1. For words: They decide whether the word is a valid word or a non-word (lexical
decision task).
2. For pictures: They classify the picture, such as determining if it depicts an animate
or inanimate object (e.g., animacy task).
3.Priming Effect:
1. When the visual stimulus is related (or identical) to a word heard earlier in the sentence, reaction times (RTs) are faster than when it is unrelated.
Summary and Implications
Spoken word recognition involves the activation of multiple word candidates on the basis
of the initial speech input—the “cohort”—and selection among these competitors.
Zhuang et al. (2011) examined the potential interaction of bottom-up and top-down processes in an fMRI study by presenting participants with words and pseudowords for lexical decision.
•In words with high competition cohorts, high imageability words generated
stronger activity than low imageability words, indicating a facilitatory role of
imageability in a highly competitive cohort context.
•These results support the behavioral data in showing that selection processes
do not rely solely on bottom-up acoustic-phonetic cues but rather that the
semantic properties of candidate words facilitate discrimination between
competitors.
• They found greater activity in the left inferior frontal
gyrus (BA 45, 47) and the right inferior frontal
gyrus (BA 47) with increased cohort competition
• An imageability effect in the left posterior middle
temporal gyrus/angular gyrus (BA 39)
• A significant interaction between imageability and
cohort competition in the left posterior superior
temporal gyrus/ middle temporal gyrus (BA 21, 22).
PREVIOUS YEAR QUESTIONS
Briefly describe the pros and cons of any one method used in SWR
research (5) 2021
1. Zhuang, J., Randall, B., Stamatakis, E. A., Marslen-Wilson, W. D., & Tyler, L. K. (2011). The interaction of
lexical semantics and cohort competition in spoken word recognition: an fMRI study. Journal of Cognitive
Neuroscience, 23(12), 3778-3790.
2. Kilborn, K., & Moss, H. (1996). Word monitoring. Language and Cognitive Processes, 11(6), 689-694. DOI: 10.1080/016909696387105
Thank you