This thesis presents the development and evaluation of a multimodal dialogue system for social robots in office environments. The primary objective is to create a natural, responsive conversational interface that maintains the expected human response time of approximately 758 milliseconds while providing task-oriented assistance. The research addresses the challenge of balancing low-latency interactions with high-quality conversational capabilities through the implementation of various speech recognition, language processing, and speech synthesis components. The system architecture integrates multiple output modalities including speech, visual expressions, and LED lighting, creating a comprehensive human-robot interaction experience. Through a modular design approach utilizing the factory method pattern, the implementation allows for flexible component swapping between cloud-based and local processing solutions. Evaluation results demonstrate that local speech recognition using Vosk achieves processing times averaging 0.801 seconds compared to Google’s cloud service at 1.578 seconds. Similarly, Windows TTS significantly outperforms Google TTS with processing times of 0.074 seconds versus 0.419 seconds, representing a 5.67x speedup. However, user studies revealed a marked preference for Google TTS’s more natural voice quality despite its higher latency, indicating that users prioritize speech naturalness over the 345-millisecond difference in processing time. The recommended hybrid configuration, combining local speech processing with cloud-based language models and Google TTS for quality-critical responses, achieves an end-to-end latency of 3.043 seconds per interaction. While this is slightly slower than the fastest possible configuration, it provides a superior user experience by balancing speed with voice quality.
User studies confirm that the perceived naturalness of conversation depends more on voice quality and consistent performance than on raw speed, particularly for interactions requiring emotional nuance or extended dialogue. This research demonstrates that strategic optimization must consider both quantitative metrics and qualitative user preferences to create effective dialogue systems for social robots.
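
The abstract credits the factory method pattern for making cloud and local components interchangeable. The following is a minimal sketch of how such a design could look, not the thesis implementation: every class and method name here (`SpeechSynthesizer`, `LocalTTS`, `CloudTTS`, `DialoguePipeline`, and so on) is hypothetical, and the backends return placeholder strings instead of calling Windows TTS or Google TTS.

```python
from abc import ABC, abstractmethod


class SpeechSynthesizer(ABC):
    """Common interface that every text-to-speech backend implements (hypothetical)."""

    @abstractmethod
    def synthesize(self, text: str) -> str:
        """Return a placeholder describing the synthesized output."""


class LocalTTS(SpeechSynthesizer):
    """Stand-in for a local engine such as Windows TTS; makes no real call."""

    def synthesize(self, text: str) -> str:
        return f"[local] {text}"


class CloudTTS(SpeechSynthesizer):
    """Stand-in for a cloud engine such as Google TTS; makes no real call."""

    def synthesize(self, text: str) -> str:
        return f"[cloud] {text}"


class DialoguePipeline(ABC):
    """Creator class: subclasses decide which synthesizer gets built."""

    @abstractmethod
    def create_synthesizer(self) -> SpeechSynthesizer:
        """Factory method overridden by each concrete pipeline."""

    def respond(self, text: str) -> str:
        # The pipeline depends only on the SpeechSynthesizer interface,
        # so swapping backends requires no change to this method.
        return self.create_synthesizer().synthesize(text)


class LowLatencyPipeline(DialoguePipeline):
    """Assumed configuration that favors speed."""

    def create_synthesizer(self) -> SpeechSynthesizer:
        return LocalTTS()


class QualityPipeline(DialoguePipeline):
    """Assumed configuration that favors voice naturalness."""

    def create_synthesizer(self) -> SpeechSynthesizer:
        return CloudTTS()


if __name__ == "__main__":
    for pipeline in (LowLatencyPipeline(), QualityPipeline()):
        print(pipeline.respond("Good morning"))
```

In a design like this, choosing between a speed-oriented and a quality-oriented configuration comes down to selecting a different pipeline subclass, which is the kind of component swapping the abstract describes.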
Optimizing Multimodal Dialogue Systems for Social Robots in Office Environments
Dalkilic, M. I. (Author). 2025
Student thesis: Master's Thesis