Unit 4 NLP KCS-072
AKTU - UNIT 4
SPEECH SOUNDS
In Natural Language Processing (NLP), understanding how speech sounds are produced and classified is
essential for tasks like speech recognition, synthesis, and phonetic analysis.
Speech sounds are produced when air from the lungs passes through the vocal cords, which can either
vibrate or remain open. This airflow then interacts with the mouth, teeth, tongue, and lips to form different
sounds.
The basic steps involved in the production of speech sounds are:
Initiation: Air is pushed from the lungs through the trachea.
Phonation: The vocal cords either vibrate or remain open to control airflow, creating sound.
Articulation: The sound is shaped by the movement of the tongue, lips, and other parts of the vocal
tract.
APPLICATIONS OF SPEECH SOUNDS IN NLP
Speech Recognition:
Identifying and converting speech sounds (phonemes) into text.
Speech Synthesis:
Generating speech from text by accurately pronouncing phonemes.
Phonetic Transcription:
Writing down the speech sounds of a word using symbols like the International
Phonetic Alphabet (IPA).
CLASSIFICATION OF SPEECH SOUNDS
A. Consonants
Consonants are speech sounds where airflow is partially or fully obstructed in the vocal tract. They are classified based on the following
features:
Place of articulation: The location in the vocal tract where the sound is produced (e.g., lips, teeth, alveolar ridge).
Example: /p/ (bilabial), /t/ (alveolar).
Manner of articulation: The way in which the airflow is restricted or modified.
Example: /p/ (plosive), /s/ (fricative).
Voicing: Whether the vocal cords vibrate (voiced) or not (voiceless).
Example: /b/ (voiced), /p/ (voiceless).
B. Vowels
Vowels are produced without significant obstruction in the vocal tract. They are classified based on:
Height: How high the tongue is in the mouth (high, mid, low).
Example: /i/ (high), /a/ (low).
Backness: How far back the tongue is in the mouth (front, central, back).
Example: /i/ (front), /u/ (back).
Roundness: Whether the lips are rounded or unrounded.
Example: /u/ (rounded), /i/ (unrounded).
C. Semi-Vowels (Glides)
These are sounds that are produced similarly to vowels but function as consonants in certain contexts. They include
sounds like /j/ (as in "yes") and /w/ (as in "wet").
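These articulatory features lend themselves to a simple lookup-table representation, which is how many phonetic resources store them. Below is a minimal Python sketch of the classification above; the phoneme inventory, feature names, and the describe helper are illustrative choices rather than a standard API, and /a/ is treated as central only for the example:

```python
# Minimal phoneme feature table following the classification above.
# The inventory is illustrative, not a complete IPA chart.
CONSONANTS = {
    "p": {"place": "bilabial", "manner": "plosive",   "voiced": False},
    "b": {"place": "bilabial", "manner": "plosive",   "voiced": True},
    "t": {"place": "alveolar", "manner": "plosive",   "voiced": False},
    "s": {"place": "alveolar", "manner": "fricative", "voiced": False},
}

VOWELS = {
    "i": {"height": "high", "backness": "front",   "rounded": False},
    "a": {"height": "low",  "backness": "central", "rounded": False},  # backness assumed
    "u": {"height": "high", "backness": "back",    "rounded": True},
}

def describe(phoneme: str) -> str:
    """Return a one-line articulatory description of a phoneme."""
    if phoneme in CONSONANTS:
        f = CONSONANTS[phoneme]
        voicing = "voiced" if f["voiced"] else "voiceless"
        return f"/{phoneme}/ is a {voicing} {f['place']} {f['manner']}"
    if phoneme in VOWELS:
        f = VOWELS[phoneme]
        rounding = "rounded" if f["rounded"] else "unrounded"
        return f"/{phoneme}/ is a {f['height']} {f['backness']} {rounding} vowel"
    return f"/{phoneme}/ is not in this toy inventory"

print(describe("p"))  # /p/ is a voiceless bilabial plosive
print(describe("u"))  # /u/ is a high back rounded vowel
```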
ARTICULATORY PHONETICS vs ACOUSTIC PHONETICS
Articulatory Phonetics:
Focuses on how speech sounds are produced by the movement of speech organs.
Air is pushed from the lungs to create sound.
Vocal cords vibrate to produce voiced sounds or stay open for voiceless sounds.
Active articulators include the tongue, lips, and teeth.
Passive articulators are fixed parts like the teeth or palate.
Place of articulation refers to where the airflow is blocked (e.g., lips for /p/).
Manner of articulation involves how airflow is manipulated (e.g., plosives or fricatives).
Voicing indicates whether the vocal cords vibrate (voiced) or not (voiceless).
Acoustic Phonetics:
Studies the physical properties of sound waves, such as frequency and amplitude.
Examines how sound travels through the air.
Frequency affects pitch: higher frequency means higher pitch.
Amplitude affects loudness: greater amplitude means louder sound.
Formants define vowel sounds and their resonance.
A spectrogram visually shows frequency distribution over time.
A waveform represents sound pressure variation over time.
Acoustic features are used in speech recognition systems.
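The frequency/pitch and amplitude/loudness relationships above can be made concrete with a short numpy sketch; the sampling rate and tone frequencies here are illustrative values:

```python
import numpy as np

fs = 16000                          # sampling rate in Hz (assumed for the example)
t = np.arange(0, 0.5, 1 / fs)       # 0.5 s of time samples

# Frequency controls pitch: 440 Hz sounds an octave above 220 Hz.
high_pitch = 0.5 * np.sin(2 * np.pi * 440 * t)
low_pitch  = 0.5 * np.sin(2 * np.pi * 220 * t)

# Amplitude controls loudness: same pitch, quieter vs. louder.
quiet = 0.1 * np.sin(2 * np.pi * 440 * t)
loud  = 0.9 * np.sin(2 * np.pi * 440 * t)

# RMS amplitude is one simple proxy for perceived loudness.
for name, sig in [("quiet", quiet), ("loud", loud)]:
    print(name, "RMS =", round(float(np.sqrt(np.mean(sig ** 2))), 3))
```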
Short-Time Fourier Transform (STFT)
The Short-Time Fourier Transform (STFT) is a mathematical technique used to analyze
non-stationary signals, such as speech, by dividing the signal into small, overlapping
segments (called windows). The Fourier Transform is then applied to each segment,
allowing the representation of both time and frequency components simultaneously.
This results in a spectrogram—a visual representation of how the signal's frequency
content changes over time.
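In its standard discrete form (textbook notation, not taken from these notes), the STFT slides a window w over the signal x and applies an N-point DFT to each segment:

```latex
% x[n]: signal, w[n]: analysis window of length N,
% H: hop size, m: frame index, k: frequency bin
X[m, k] = \sum_{n=0}^{N-1} x[n + mH] \, w[n] \, e^{-j 2 \pi k n / N}
```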
STFT is important in speech processing because it enables time-localized analysis of speech, produces the spectrograms used in recognition, and supports both recognition accuracy and noise reduction.
Significance of STFT
Time-Frequency View: STFT lets us see both time and frequency of speech, which helps in analyzing
how speech sounds change over time.
Captures Speech Changes: It helps to capture dynamic speech changes, such as when different
sounds (vowels or consonants) are produced in speech.
Visual Representation: The STFT creates a spectrogram, a visual map that shows which frequencies
are present at each moment of speech, useful for recognizing speech patterns.
Improves Accuracy: By breaking speech into smaller segments, it enables accurate analysis of each
sound, improving tasks like speech recognition.
Noise Reduction: STFT is useful for isolating speech from noise in a recording, enhancing audio clarity
for better processing.
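A minimal numpy sketch of this windowed-FFT computation follows; the frame length, hop size, Hann window, and toy two-tone signal are all illustrative choices (libraries such as scipy.signal provide production implementations of the same idea):

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Compute a magnitude spectrogram by windowed FFTs.

    Returns an array of shape (num_frames, frame_len // 2 + 1):
    rows are time frames, columns are frequency bins.
    """
    window = np.hanning(frame_len)          # taper each segment to reduce leakage
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([
        x[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame

# Toy signal: a 300 Hz tone followed by a 1200 Hz tone, sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
x = np.concatenate([np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 1200 * t)])

spec = stft(x)
print(spec.shape)                       # (num_frames, num_bins)
print(spec.argmax(axis=1)[[0, -1]])     # dominant bin shifts between the two halves
```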
Digital Signal Processing (DSP)
Digital Signal Processing (DSP) involves manipulating signals that have been converted into a digital format.
Signals, such as audio, video, or sensor data, are first sampled and quantized to produce discrete values. DSP
techniques are then applied to process these digital signals for various purposes like filtering, analysis,
enhancement, compression, or transformation.
Sampling: The continuous (analog) signal is converted into discrete values by measuring it at regular
intervals determined by the sampling rate.
Quantization: After sampling, the continuous values are approximated to finite levels (quantized) to
convert the signal into a digital format.
Processing: Various operations are performed on the digital signal, such as filtering (removing noise),
transforming (Fourier Transform, for example), and extracting features.
Reconstruction: After processing, the digital signal can be converted back to an analog signal (using a
Digital-to-Analog Converter, DAC) for playback or further use.
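The sampling, quantization, and reconstruction steps can be sketched in a few lines of numpy; the 8 kHz rate, 8-bit depth, and [-1, 1) amplitude range below are assumptions made for the example:

```python
import numpy as np

fs = 8000                              # sampling rate: 8000 samples per second
bits = 8                               # quantizer resolution (assumed)
levels = 2 ** bits                     # 256 discrete amplitude levels

# Sampling: evaluate the "analog" signal at regular intervals 1/fs.
t = np.arange(0, 0.01, 1 / fs)         # 10 ms of samples
analog = 0.8 * np.sin(2 * np.pi * 440 * t)

# Quantization: map each continuous value in [-1, 1) to the nearest of the
# finite levels, producing integer codes suitable for digital storage.
codes = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)

# Reconstruction: map integer codes back to approximate amplitudes
# (a real system would then pass this through a DAC and smoothing filter).
restored = codes.astype(float) / (levels - 1) * 2.0 - 1.0

print("max quantization error:", float(np.max(np.abs(restored - analog))))
```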
Filter-Bank Method in Digital Signal Processing (DSP)
The Filter-Bank Method in Digital Signal Processing (DSP) involves breaking a signal into multiple frequency
components by passing it through a series of filters. Each filter is designed to capture a specific frequency range of the
signal. This method allows for detailed analysis of the signal by focusing on different frequency bands. It mimics the
way the human auditory system processes sound, breaking down complex signals into simpler frequency components.
The Filter-Bank method is particularly useful in speech recognition, where it helps extract features from speech signals,
and in audio compression, where it helps focus on the most important frequency components, reducing redundancy
and making the signal more efficient for storage or transmission.
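As a rough sketch of the idea, the toy Python filter bank below splits a signal into equal-width frequency bands using ideal FFT-domain masks and reports the energy in each band. Real speech front ends typically use mel-spaced, overlapping triangular filters; the equal-width bands and the filterbank_energies helper here are illustrative assumptions:

```python
import numpy as np

def filterbank_energies(x, fs, num_bands=8):
    """Split a signal into equal-width frequency bands and return the
    energy in each band (ideal bandpass filters applied in the FFT domain)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    edges = np.linspace(0, fs / 2, num_bands + 1)    # band boundaries in Hz
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)          # one ideal bandpass filter
        energies.append(float(np.sum(np.abs(spectrum[mask]) ** 2)))
    return energies

# Toy input: 700 Hz + 3300 Hz tones at 8 kHz; the energy should
# concentrate in the two bands containing those frequencies.
fs = 8000
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 3300 * t)

for i, e in enumerate(filterbank_energies(x, fs)):
    print(f"band {i}: {e:.1f}")
```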