Uma análise audiovisual da produção de tons lexicais

João Vítor Possamai de Menezes

Please use this identifier to cite or link to this item: http://hdl.handle.net/1843/34183

Type:	Dissertation
Title:	Uma análise audiovisual da produção de tons lexicais
Other Titles:	An audiovisual analysis of lexical tone production
Authors:	João Vítor Possamai de Menezes
First Advisor:	Adriano Vilela Barbosa
First Co-advisor:	Maria Mendes Cantoni
First Referee:	Hani Camille Yehia
Second Referee:	Frederico Gualberto Ferreira Coelho
Third Referee:	Adriano Chaves Lisboa
Abstract:	Sabe-se que a fala se manifesta não só de forma acústica, mas também visual, por meio de movimentos faciais e gestos corporais, além de possuir correlatos fisiológicos como o movimento do trato vocal e a atividade neural. Este trabalho apresenta uma análise audiovisual da produção de tons lexicais, que são variações de graves e agudos que mudam o significado das palavras em línguas tonais. Tons lexicais são tradicionalmente estudados em termos de parâmetros acústicos, como a frequência fundamental (F0) do sinal de fala. Este trabalho, no entanto, adota uma abordagem integrada, investigando a contribuição, de forma isolada e conjunta, das componentes acústica e visual da fala para a diferenciação dos tons lexicais em três línguas tonais (cantonês, mandarim e tailandês). A abordagem adotada é tentar classificar os tons de cada língua a partir de cada componente tomada isoladamente e comparar seus desempenhos. Foram coletados dados em experimentos audiovisuais de produção de fala com sete falantes das três línguas. A componente visual da fala foi obtida por meio do rastreamento 3D de marcadores fixados à face e à cabeça das participantes, e a componente acústica foi obtida, de forma simultânea, por um microfone. Após o experimento, as posições dos marcadores foram submetidas a um procedimento de compensação do movimento da cabeça com o intuito de decompô-las em suas duas componentes: uma devida ao movimento da face e outra devida ao movimento de corpo rígido da cabeça. O sinal acústico teve sua F0 estimada por meio do método de autocorrelação. Neste ponto, a componente visual é representada por três tipos de sinais: Movimento Total (posições dos marcadores), Face e Cabeça (resultantes da decomposição); e a componente acústica é representada pelas curvas de F0. Todos os tipos de sinais foram parametrizados por meio de regressão polinomial, sendo representados por coeficientes que aproximam sua trajetória original. 
Os sinais parametrizados foram então utilizados para treinar classificadores lineares e não-lineares, com os tons de cada língua usados como rótulos das classes. A capacidade de cada tipo de sinal de classificar os diferentes tons lexicais foi medida por meio da acurácia de cada classificador, obtida com validação cruzada em K partes (K = 5). Os sinais visuais foram capazes de classificar tons lexicais, nas três línguas, com acurácia acima da aleatória. As maiores acurácias foram obtidas pelos sinais de F0. Entre os sinais visuais, as maiores acurácias foram obtidas, em ordem decrescente, pelos sinais Movimento Total e Face. Além disso, alguns tons lexicais de uma mesma língua foram classificados com acurácias acima da média, sugerindo que alguns tons são mais fáceis de serem classificados do que outros. Os resultados obtidos estão de acordo com a literatura e sugerem que tons lexicais podem ser preditos não só por F0, mas também, em menor grau, pelos movimentos da face e da cabeça.
Abstract:	Speech is known to manifest not only acoustically but also visually, through facial movements and body gestures, and to have physiological correlates such as vocal tract movement and neural activity. This work presents an audiovisual analysis of the production of lexical tones, which are pitch variations that change the meaning of words in tone languages. Lexical tones are traditionally studied in terms of acoustic parameters, such as the fundamental frequency (F0) of the speech signal. This work, however, adopts an integrated approach, investigating the contribution, in isolation and jointly, of the acoustic and visual components of speech to the differentiation of lexical tones in three tone languages (Cantonese, Mandarin and Thai). The approach consists of classifying the tones of each language from each component taken in isolation and comparing their performance. Data were collected in audiovisual speech production experiments with seven speakers of the three languages. The visual component of speech was obtained through 3D tracking of markers attached to the participants' faces and heads, and the acoustic component was recorded simultaneously with a microphone. After the experiment, the marker positions were subjected to a head movement compensation procedure that decomposes them into two components: one due to the movement of the face and the other due to the rigid-body movement of the head. The F0 of the acoustic signal was estimated with the autocorrelation method. At this point, the visual component is represented by three types of signals: Total Movement (marker positions), Face and Head (both resulting from the decomposition); the acoustic component is represented by the F0 curves. All signal types were parameterized by polynomial regression, so that each is represented by the coefficients that approximate its original trajectory.
The parameterized signals were then used to train linear and non-linear classifiers, with the tones of each language as class labels. The ability of each signal type to discriminate the lexical tones was measured by the accuracy of each classifier, obtained with K-fold cross-validation (K = 5). Visual signals classified lexical tones in all three languages with above-chance accuracy. The highest accuracies were obtained with the F0 signals. Among the visual signals, the highest accuracies were obtained, in decreasing order, with the Total Movement and Face signals. In addition, some lexical tones within a language were classified with above-average accuracy, suggesting that some tones are easier to classify than others. The results agree with the literature and suggest that lexical tones can be predicted not only from F0 but also, to a lesser extent, from the movements of the face and head.
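As a rough illustration of the autocorrelation method mentioned in the abstract, the sketch below estimates the F0 of a single voiced frame from the peak of its autocorrelation. It is not the dissertation's implementation: the function name `estimate_f0`, the search range (`f0_min`, `f0_max`), and the synthetic 200 Hz test tone are all assumptions made for this example, and practical pitch trackers add windowing, voicing decisions, and peak interpolation.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 in Hz for one voiced frame (hypothetical helper).

    The search range f0_min..f0_max is an assumption, not a value
    taken from the dissertation.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]        # remove any DC offset
    lag_min = int(sample_rate / f0_max)  # shortest period searched
    lag_max = int(sample_rate / f0_min)  # longest period searched
    if lag_max >= n:
        raise ValueError("frame is too short for the requested f0_min")
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation: fewer overlapping samples at
        # larger lags, so the first period wins over its multiples.
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame: 0.1 s of a 200 Hz sine at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * t / sample_rate) for t in range(800)]
f0 = estimate_f0(tone, sample_rate)
print(f"{f0:.1f}")  # prints 200.0
```

Applying such an estimator frame by frame would give an F0 curve of the kind the abstract describes as input to the polynomial parameterization.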
Subject:	Electrical engineering; Speech; Lexicology
language:	por
Publisher Country:	Brazil
Publisher:	Universidade Federal de Minas Gerais
Publisher Initials:	UFMG
Publisher Department:	ENG - DEPARTAMENTO DE ENGENHARIA ELÉTRICA
Publisher Program:	Programa de Pós-Graduação em Engenharia Elétrica
Rights:	Open Access
Rights URI:	http://creativecommons.org/licenses/by-nc-nd/3.0/pt/
URI:	http://hdl.handle.net/1843/34183
Issue Date:	31-Jul-2020
Appears in Collections:	Dissertações de Mestrado

Files in This Item:

File	Description	Size	Format
Dissertacao_MENEZES_JVP_final-pdfa.pdf		5.03 MB	Adobe PDF

This item is licensed under a Creative Commons License