A method for lexical tone classification in audio-visual speech

dc.creator: João Vítor Possamai de Menezes
dc.creator: Maria Mendes Cantoni
dc.creator: Denis Burnham
dc.creator: Adriano Vilela Barbosa
dc.date.accessioned: 2023-02-01T14:11:36Z
dc.date.accessioned: 2025-09-08T23:30:08Z
dc.date.available: 2023-02-01T14:11:36Z
dc.date.issued: 2020
dc.format.mimetype: pdf
dc.identifier.doi: https://doi.org/10.20396/joss.v9i00.14960
dc.identifier.issn: 2236-9740
dc.identifier.uri: https://hdl.handle.net/1843/49361
dc.language: eng
dc.publisher: Universidade Federal de Minas Gerais
dc.relation.ispartof: Journal of Speech Sciences
dc.rights: Open Access
dc.subject: Speech
dc.subject.other: Multimodal speech
dc.subject.other: Lexical tone
dc.subject.other: Cantonese language
dc.subject.other: Statistical learning
dc.subject.other: Linear discriminant analysis
dc.title: A method for lexical tone classification in audio-visual speech
dc.type: Journal article
local.citation.epage: 104
local.citation.spage: 93
local.citation.volume: 9
local.description.resumo: This work presents a method for lexical tone classification in audio-visual speech. The method is applied to a speech data set consisting of syllables and words produced by a female native speaker of Cantonese. The data were recorded in an audio-visual speech production experiment. The visual component of speech was measured by tracking the positions of active markers placed on the speaker's face, whereas the acoustic component was measured with an ordinary microphone. A pitch tracking algorithm is used to estimate F0 from the acoustic signal. A procedure for head motion compensation is applied to the tracked marker positions in order to separate the head and face motion components. The data are then organized into four signal groups: F0, Face, Head, Face+Head. The signals in each of these groups are parameterized by means of a polynomial approximation and then used to train an LDA (Linear Discriminant Analysis) classifier that maps the input signals into one of the output classes (the lexical tones of the language). One classifier is trained for each signal group. The ability of each signal group to predict the correct lexical tones was assessed by the accuracy of the corresponding LDA classifier. The accuracy of the classifiers was obtained by means of a k-fold cross validation method. The classifiers for all signal groups performed above chance, with F0 achieving the highest accuracy, followed by Face+Head, Face, and Head, respectively. The differences in performance between all signal groups were statistically significant.
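The classification pipeline described in the abstract (signals parameterized by a polynomial approximation, then fed to an LDA classifier whose accuracy is estimated with k-fold cross-validation) can be sketched as follows. This is a minimal illustration using synthetic stand-in contours, not the paper's data: the four "tone" shapes, the noise level, the polynomial degree, and the fold count are all assumptions made for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for F0 contours: four hypothetical "tones", each a
# distinct pitch trajectory over a normalized time axis, plus noise.
t = np.linspace(0.0, 1.0, 50)
shapes = [
    200 + 30 * t,                 # rising
    200 - 30 * t,                 # falling
    200 + 40 * 4 * t * (1 - t),   # rise-fall (peak mid-syllable)
    np.full_like(t, 200.0),       # level
]
X_raw, y = [], []
for label, shape in enumerate(shapes):
    for _ in range(30):
        X_raw.append(shape + rng.normal(0, 5, size=t.shape))
        y.append(label)

# Parameterize each contour by the coefficients of a low-order
# polynomial fit, as the paper does for its signal groups.
degree = 3
X = np.array([np.polyfit(t, contour, degree) for contour in X_raw])
y = np.array(y)

# One LDA classifier per signal group; accuracy estimated by k-fold CV.
clf = LinearDiscriminantAnalysis()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.2f} (chance = 0.25)")
```

With four tone classes, chance accuracy is 0.25, so the cross-validated score shows whether the polynomial coefficients carry tone information; the paper applies the same accuracy comparison across its F0, Face, Head, and Face+Head groups.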
local.identifier.orcid: http://orcid.org/0000-0002-7612-9754
local.identifier.orcid: https://orcid.org/0000-0001-9515-1802
local.identifier.orcid: http://orcid.org/0000-0002-1980-3458
local.identifier.orcid: http://orcid.org/0000-0003-1083-8256
local.publisher.country: Brazil
local.publisher.department: FALE - FACULDADE DE LETRAS
local.publisher.initials: UFMG
local.url.externa: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/14960

Files

Original bundle
Name: A method for lexical tone classification in audio-visual speech.pdf
Size: 394.82 KB
Format: Adobe Portable Document Format

License bundle
Name: License.txt
Size: 1.99 KB
Format: Plain Text