Prosody analysis for speech synthesis 
Speech is a complex waveform containing verbal (e.g. phoneme, syllable, and word) and nonverbal (e.g. speaker identity, emotional state, and tone) information. Both the verbal and nonverbal aspects of speech are extremely important in interpersonal communication and human-machine interaction. However, work in machine perception of speech has focused primarily on the verbal, or content-oriented, goals of speech recognition, speech compression, and speech labeling. Usage of nonverbal information has been limited to speaker identification applications. While the success of research in these areas is well documented, this success is fundamentally limited by the effect of nonverbal information on the speech waveform. The extra-linguistic aspect of speech is considered a source of variability that theoretically can be minimized with an appropriate preprocessing technique; determination of such robust techniques is, however, far from trivial.
It is widely believed in the speech processing community that the nonverbal component of speech contains higher-level information that provides cues for auditory scene analysis, speech understanding, and the determination of a speaker's psychological state or conversational tone. We believe that the identification of such nonverbal cues can improve the performance of classic speech processing tasks and will be necessary for the realization of natural, robust human-computer speech interfaces.
Intonation evaluation 
We describe a system for evaluating the intonation of English utterances made by native Japanese speakers, using synthesized speech to enable the rapid development of a computer-assisted language learning (CALL) system. Evaluating the intonation of learners' utterances requires reference utterances, for which native English speakers' utterances should be used. However, it is costly to collect native speakers' utterances for all sentences in the system. We therefore examine an intonation evaluation method that uses synthesized speech generated by text-to-speech systems instead of real speech. The intonation evaluation system calculates scores between a learner's utterance and the corresponding utterances by the teachers. We first compare the reliability of intonation evaluation using native and synthesized utterances, and find that the reliability of evaluation using synthesized utterances can be improved by using the weighted Mahalanobis distance to calculate the evaluation score. Next, we investigate a method of combining multiple scores from different teachers. In addition, we incorporate a rhythm feature into the intonation evaluation. As a result, the correlation between scores assigned by human evaluators and by the system is improved. Furthermore, we analyze the system's evaluation tendencies by limiting the evaluation utterances, in order to identify causes of degradation in the system's performance.
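The text does not give the exact formulation of the weighted Mahalanobis score, so the following is only a minimal sketch of the general idea: a Mahalanobis-style distance between time-aligned F0 (pitch) features of a learner utterance and a reference utterance, with a per-dimension weight, mapped to a similarity score. The diagonal covariance, the alignment assumption, and the exponential score mapping are all assumptions for illustration, not the system's actual method.

```python
import numpy as np

def weighted_mahalanobis_score(learner_f0, reference_f0, variances, weights):
    """Hypothetical intonation score: weighted Mahalanobis distance between
    frame-level F0 features of a learner and a reference utterance.

    Assumes both contours are already time-aligned to the same length and
    that the covariance is diagonal (given per-dimension variances).
    """
    diff = np.asarray(learner_f0) - np.asarray(reference_f0)
    # Diagonal-covariance Mahalanobis distance with per-dimension weights
    d = np.sqrt(np.sum(weights * diff ** 2 / variances))
    # Map distance to a similarity score in (0, 1]: closer contours score higher
    return float(np.exp(-d))
```

Combining scores against multiple teachers' references could then be as simple as averaging the per-teacher scores, though the text leaves the actual combination method to the experiments.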