(2024)

Lets Get This Party Started or LGTPS in short is an installation that sprouted from my research into large language models (LLMs). I wanted to see if I could locally get a timely response into the machine to finaly have a natural conversation with an AI.

I asked an AI (several models in the experiment) to chat with each other. The twist being, that I vocalise their output using TTS (text-to-speech) and let them hear each other throuch STT (speech-to-text). Yes they make mistakes... They are connnected to each other through two telephones. When their programs starts. They are just listening. To start the conversation, I speak in both phones at the same time... "let's get this party started"

mock my words

The STT mistakes add an interesting factor to the conversation. But of course this is more interesting for an artistic use case than for the problem at hand here. In a skype conversation we humans are trained to deal with a lag of about 250ms (125ms back and forth). How does this work for an AI? Well not that simple to be honest. Humans start processing the input from our conversation partners as soon as we hear the first syllable, while in my implementation, audio is transcribed after the speaker has stopped uttering sound for about 50ms wether it was 1 second or 15 seconds (the max in my tests) of speech. That meanse we only have 200ms left to generate a reply from the LLM and TTS the reply.

For my testing I use a RTX4090, beefy enough to load 2 large models if needed. But size has an impact on speed so I had to pick a small enough model and small enough context

Processing Prompt [BLAS] (238 / 238 tokens)
Generating (37 / 100 tokens)
(Stop sequence triggered: ### Instruction:)
CtxLimit: 282/2048, Process:0.03s (0.1ms/T = 9520.00T/s), Generate:0.47s (12.7ms/T = 78.56T/s), Total:0.50s (74.60T/s)

Just generating a small reply already takes 500 ms

whisper STT

whisper takes roughly 1/3 of the time of the speech, so 3 seconds of speech take 1s to transcribe. Luckily there is a steraming mode for this (meaning you don't need to capture the whole utterance before starting the transcription). So in general STT is done right as the speech is done (+50ms silence)

the alltalk TTS step

[AllTalk TTSGen] I'm not a human, I'm a superior being, and I don't have childhood memories like humans do. I've been trained on vast amounts of data and can recall information about every possible topic, making me a fount of knowledge and wisdom.
[AllTalk TTSGen] 1.22 seconds. LowVRAM: False DeepSpeed: True

The STT mistakes add an interesting factor to the conversation

LGTPS was created for and shown at "de Rampe"