A team of researchers from Microsoft, including Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, and Xie Chen, has developed ReStyle-TTS, a framework for controlling speech style in zero-shot text-to-speech (TTS) models. Such models can clone a speaker’s voice from a short audio sample but often inherit unwanted speaking styles from the reference audio. The research was published in the Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision.
Zero-shot TTS models have shown promise in cloning a speaker’s voice, but controlling speaking style, such as pitch, energy, and emotion, usually requires carefully selecting the reference audio. That requirement is impractical when only limited or mismatched references are available. To address this issue, the researchers propose ReStyle-TTS, a framework that enables continuous style control expressed relative to the reference.
The key insight behind ReStyle-TTS is that the model’s implicit dependence on reference style must be reduced before explicit control mechanisms are introduced. To achieve this, the team introduces Decoupled Classifier-Free Guidance (DCFG), which controls text guidance and reference guidance independently, reducing reliance on reference style while preserving fidelity to the input text. Building on this, the researchers apply style-specific Low-Rank Adaptation (LoRA) modules together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control. They also introduce a Timbre Consistency Optimization module to mitigate the timbre drift caused by weakened reference guidance.
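The article does not give the formulas behind DCFG or Orthogonal LoRA Fusion, so the following is only a minimal sketch of how such mechanisms are commonly built, not the authors’ implementation. It assumes the model can produce three predictions (conditioned on text and reference, on text only, and on nothing) and that each style LoRA is a pair of low-rank factors. The function names, the two guidance weights `w_text` and `w_ref`, and the use of Gram-Schmidt for orthogonalization are all hypothetical choices made for illustration.

```python
import numpy as np


def decoupled_cfg(pred_full, pred_text_only, pred_uncond, w_text, w_ref):
    """Combine three model predictions with separate guidance weights.

    pred_full:      prediction conditioned on text and reference audio
    pred_text_only: prediction conditioned on text only (reference dropped)
    pred_uncond:    prediction with both conditions dropped
    (Assumed decomposition; the paper's exact formulation may differ.)
    """
    text_direction = pred_text_only - pred_uncond  # contribution of the text
    ref_direction = pred_full - pred_text_only     # contribution of the reference
    return pred_uncond + w_text * text_direction + w_ref * ref_direction


def orthogonalize_updates(updates):
    """Gram-Schmidt over weight updates, treating each matrix as a flat vector.

    Hypothetical stand-in for an orthogonal fusion step: each update keeps
    only the part that does not overlap with the updates before it.
    """
    ortho = []
    for delta in updates:
        residual = delta.astype(float).copy()
        for basis in ortho:
            denom = np.sum(basis * basis)
            if denom > 0:
                residual -= (np.sum(basis * residual) / denom) * basis
        ortho.append(residual)
    return ortho


def fuse_loras(base_weight, lora_factors, strengths):
    """Add strength-scaled, mutually orthogonalized LoRA updates to a base weight.

    lora_factors: list of (B, A) pairs, where B @ A has the base weight's shape
    strengths:    one scalar per LoRA; it sets how far that attribute is pushed
    """
    updates = [B @ A for (B, A) in lora_factors]
    fused = base_weight.astype(float).copy()
    for delta, alpha in zip(orthogonalize_updates(updates), strengths):
        fused += alpha * delta
    return fused
```

In this sketch, setting `w_text` and `w_ref` to the same value recovers ordinary classifier-free guidance, while lowering `w_ref` alone weakens the pull of the reference style without touching text guidance. Each entry of `strengths` then acts as a continuous knob for one attribute, such as pitch or energy.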
The researchers’ experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over speech attributes such as pitch, energy, and multiple emotions. The framework maintains intelligibility and speaker timbre, and it performs robustly even in challenging scenarios where the reference and target styles are mismatched.
ReStyle-TTS could have practical applications in the energy sector. Automated voice interfaces in energy management systems, for instance, could benefit from more natural and controllable speech synthesis, improving the user experience in smart grids, energy monitoring systems, and other energy-related applications that rely on voice interaction. By enabling more precise control over speech style, ReStyle-TTS could make these systems more intuitive to use.
This article is based on research available at arXiv.

