Canonical Unveils Myna Speech Recognition System

Jean-Baptiste Lallement, Director of Engineering at Canonical, has presented the Myna project, which is developing a speech recognition application intended for voice input and natural-language command recognition in Ubuntu Desktop. The project is distributed under the GPLv3 license, but so far the repository contains only outlines describing the project's modular architecture and its integration with Ubuntu.

For the Ubuntu 26.10 release, they plan to make the application suitable for voice text input. A session consists of activating dictation with a keyboard shortcut, dictating aloud, and having the recognized text inserted into the current application as it is spoken, by simulating keyboard input. While the microphone is on, a dedicated indicator will be shown in the panel.
GNOME on Wayland is named as the primary test environment, but the application is being designed from the start so that it can be adapted to other desktop environments.

Recognition in Myna will use an AI model running locally. Requirements for the application include: the ability to work without an Internet connection; turning the microphone on only after dictation mode is explicitly activated with a hotkey; processing audio in memory, which is cleared after each use; and a ban on transferring audio recordings to external services.
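The requirements above (microphone off until explicitly activated, audio kept in memory and cleared after each use) can be illustrated with a minimal Python sketch. This is not Myna's code: the repository currently contains only design outlines, and the names `AudioSession` and `dictation` are hypothetical. Python also cannot guarantee that no transient copies of the data remain elsewhere in memory, so this shows only the intended lifecycle.

```python
from contextlib import contextmanager


class AudioSession:
    """Hypothetical in-memory audio buffer; nothing is written to disk."""

    def __init__(self):
        self._buffer = bytearray()
        self.active = False  # microphone is off by default

    def feed(self, chunk: bytes) -> None:
        # Refuse audio unless dictation mode was explicitly activated.
        if not self.active:
            raise RuntimeError("dictation mode is not active")
        self._buffer.extend(chunk)

    def size(self) -> int:
        return len(self._buffer)

    def wipe(self) -> None:
        # Overwrite the samples with zeros, then drop them.
        self._buffer[:] = bytes(len(self._buffer))
        self._buffer.clear()


@contextmanager
def dictation(session: AudioSession):
    """Stands in for the hotkey: audio is accepted only inside this block."""
    session.active = True
    try:
        yield session
    finally:
        # Runs even if recognition fails, so audio never outlives the session.
        session.active = False
        session.wipe()
```

Using a context manager ties the buffer's lifetime to a single dictation session, so the cleanup step cannot be skipped by an early return or an exception.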

The components for speech recognition, user interaction, dictation control and text insertion are being developed as separate modules.
The runtime for executing AI models will be packaged as a snap. Candidate recognition models include Whisper, Parakeet, Nemotron and Qwen3-ASR.
The dictation control service watches for the hotkey, activates the microphone, accesses the AI model in the snap package through its API, redirects the audio stream from the audio service to the model and coordinates the data flows.

The audio service accesses the audio device, either directly or through the PulseAudio or PipeWire sound servers, suppresses noise and normalizes the volume. The text produced by the model is passed to the post-processing module for cleanup, normalization, formatting and punctuation. The final text is inserted into the application through input substitution, for example via the Wayland input-method protocol or IBus.
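The flow described above (audio, then recognition, then post-processing, then insertion) can be sketched in Python as a chain of replaceable modules. Everything here is an assumption made for illustration: `postprocess`, `run_pipeline`, the filler-word list and the callback signatures are hypothetical and are not taken from the Myna repository. The sketch processes one utterance at a time, whereas the announcement describes text being inserted as it is spoken.

```python
from typing import Callable, Iterable

# Hypothetical filler words removed during cleanup.
FILLERS = {"um", "uh", "er"}


def postprocess(raw: str) -> str:
    """Toy cleanup: drop fillers, collapse whitespace, capitalize, punctuate."""
    words = [w for w in raw.split() if w.lower() not in FILLERS]
    text = " ".join(words)
    if not text:
        return ""
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


def run_pipeline(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], str],
    insert_text: Callable[[str], None],
) -> str:
    """Pass audio to the recognizer, clean the result and hand it to the inserter."""
    audio = b"".join(chunks)        # audio service output, kept in memory
    raw = recognize(audio)          # stands in for the model API in the snap
    final = postprocess(raw)        # post-processing module
    if final:
        insert_text(final)          # stands in for input-method or IBus insertion
    return final
```

Passing the recognizer and the inserter as callables mirrors the modular design: a different model or a different insertion backend can be swapped in without touching the rest of the chain.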
