Maria Arnal Dimas

Artistic research on modeling voices

The human larynx can produce an extraordinary range of sounds, far more than speech requires. From an evolutionary perspective, this versatility in vocalization, particularly evident in our singing abilities, likely served practical purposes in ancient times. The human voice is an extremely sophisticated tool: the oldest musical instrument and a meeting point of many social constructs, traditionally connected to the expression of feelings and identity. But what does the concept of a voice involve in the 21st century? How do technological challenges and societal shifts shape its role? And what does it mean for a voice to exist without a physical body, or to be associated with a synthetic form?

By training a physically based model of a vocal tract, combined with machine learning, we can identify and modify parameters attached to specific areas of the tract and to technical and expressive singing features. We can also extend the human voice into further dimensions and expand its possibilities, creating impossible vocal sounds. By exploring the latent space of a voice, both symbolically and acoustically, we hope to add a layer of understanding to AI voice generation models. By stretching the physical limits of a body through different trained models, we aim to develop a modular neural tool capable of real-time performance, paving the way for innovative approaches to vocal music production.

Integrating multiple voice processing models allows a deeper understanding of their different features and potential synergies. We believe that adding a physically based vocal tract model will provide a more realistic representation of vocal production and make it possible to modulate the specific areas that produce sound, making this technology more controllable. By better understanding the inner structures of these models, we can start tailoring one that suits our needs.
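To make the idea of modulating specific areas of a vocal tract more concrete, here is a minimal sketch in Python. It is not the model used in this project. It assumes the classic textbook view of the vocal tract as a chain of short tube sections (the Kelly-Lochbaum approach), where each junction between two sections reflects part of the sound wave depending on their cross-sectional areas. The function names, the area values, the 17 cm tract length and the 343 m/s speed of sound are illustrative assumptions only.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    `areas` lists cross-sectional areas from glottis to lips (hypothetical
    values, any consistent unit). Convention assumed here:
    k = (A_i - A_next) / (A_i + A_next); other texts flip the sign.
    """
    if len(areas) < 2:
        raise ValueError("need at least two tube sections")
    if any(a <= 0 for a in areas):
        raise ValueError("areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def uniform_tube_resonances(length_m, count=3, speed_of_sound=343.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength formula: F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]


# A neutral, uniform tract: no junction reflects anything.
uniform = [3.0] * 8

# Narrowing a single section (an assumed "tongue constriction") changes
# only the two junctions around it, which is what makes the area a
# localized, controllable parameter.
constricted = list(uniform)
constricted[4] = 0.5

if __name__ == "__main__":
    print(reflection_coefficients(uniform))
    print(reflection_coefficients(constricted))
    print(uniform_tube_resonances(0.17))
```

In this simplified picture, each area value acts like a dial on one place in the tract. A learned model could, under these assumptions, drive such dials instead of producing audio directly, which is one way to read the controllability described above.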

As part of this work in progress, we have created an evaluation protocol to compare the outcomes of voice conversions across different models and their respective variations. The protocol draws upon the rich and complex tradition of flamenco singing technique, which provides a robust framework for assessment. As inputs, we have curated a selection of a cappella songs from diverse traditional styles, chosen for the technical complexity of their performance.

We have also paid special attention to building a diverse yet personal dataset, recognizing its crucial role as the foundation of our trained voice model. Experimenting with various sounds and expressive features remains a constant practice in our ongoing work. Our research is also significantly shaped by our understanding of the relationship between body and voice, which influences how we approach voice processing systems and their connection to vocal training and performance techniques.

Developing a neural tool grounded in physical principles, and integrating features from various voice processing models, highlights the physical nature of voice production alongside the abstract concept of synthetic voices. Furthermore, voice processing technologies have the potential to enhance accessibility and inclusivity for people with speech impairments or disabilities who might need custom-made solutions.

As a singer and performer, I see the exploration of synthetic voice architectures as an artistic medium. The project bridges the gap between artistic expression and technological innovation, which I believe creates exciting opportunities for new forms of musical expression and creativity. These technologies can open up new sonic possibilities for listeners and singers, encouraging them to expand their auditory perception beyond their current creative practices.

Lastly, we plan to create a 3D visualization of a vocal tract that responds dynamically to the singing voice in real time, adding a visual component to the listening experience and creating a multimodal exploration of sound production. This integration of visual and auditory features might enhance listeners’ understanding of the relationship between vocal anatomy and sound, enriching their musical experience, engaging their vibrant singing bodies and igniting their listening imagination.