Let's Get This Party Started

    LGTPS (2024)

    Let's Get This Party Started, or LGTPS for short, is an installation that sprouted from my research into large language models (LLMs). I wanted to see if I could get a timely response out of a locally running model, to finally have a natural conversation with an AI.

    I asked two AIs (several different models over the course of the experiment) to chat with each other. The twist: I vocalise their output using TTS (text-to-speech) and let them hear each other through STT (speech-to-text). Yes, they make mistakes... They are connected to each other through two telephones. When their programs start, they are just listening. To start the conversation, I speak into both phones at the same time... "let's get this party started"
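    As a rough illustration, this is the shape of the loop each phone runs: listen until the other side falls silent, transcribe, ask the model for a reply, speak it. The three helpers below are hypothetical stand-ins for the STT, LLM and TTS steps described further down this page, not the actual code of the installation.

    # Minimal sketch of one phone's turn loop; the helpers are stubbed placeholders.

    def transcribe_from_phone() -> str:
        """STT: record from the phone's mouthpiece until ~50ms of silence, return text."""
        return "let's get this party started"          # stubbed for the sketch

    def generate_reply(history: list[str], heard: str) -> str:
        """LLM: produce a short reply given the conversation so far."""
        return f"I think you said: {heard}"            # stubbed for the sketch

    def speak_through_phone(text: str) -> None:
        """TTS: synthesize the reply and play it through the phone's earpiece."""
        print(f"(phone speaks) {text}")                # stubbed for the sketch

    def run_phone(turns: int = 3) -> None:
        history: list[str] = []
        for _ in range(turns):
            heard = transcribe_from_phone()            # blocks until the other bot goes quiet
            reply = generate_reply(history, heard)     # this is where the latency budget goes
            history += [heard, reply]
            speak_through_phone(reply)                 # the other phone picks this up via its STT

    run_phone()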

    mock my words

    The STT mistakes add an interesting factor to the conversation, but of course that is more relevant for the artistic use case than for the problem at hand here. In a Skype conversation we humans are trained to deal with a lag of about 250ms (125ms each way). How does this work for an AI? Well, not that simple to be honest. Humans start processing the input from our conversation partner as soon as we hear the first syllable, while in my implementation, audio is only transcribed after the speaker has stopped uttering sound for about 50ms, whether it was 1 second or 15 seconds (the max in my tests) of speech. That means we only have 200ms left to generate a reply with the LLM and turn that reply into speech with TTS.
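    Spelled out, the per-turn budget looks like this (numbers taken from the paragraph above; the 250ms target is a rule of thumb, not a measurement):

    # Rough per-turn latency budget, using the figures mentioned above.
    TARGET_LAG_MS = 250       # what feels acceptable in a Skype-like call
    ENDPOINT_SILENCE_MS = 50  # silence needed before the utterance counts as finished

    llm_plus_tts_budget = TARGET_LAG_MS - ENDPOINT_SILENCE_MS
    print(f"Left for LLM generation + TTS: {llm_plus_tts_budget} ms")  # -> 200 ms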

    For my testing I use an RTX 4090, beefy enough to load two large models if needed. But size has an impact on speed, so I had to pick a small enough model and a small enough context.

    Processing Prompt [BLAS] (238 / 238 tokens)
    Generating (37 / 100 tokens)
    (Stop sequence triggered: ### Instruction:)
    CtxLimit: 282/2048, Process:0.03s (0.1ms/T = 9520.00T/s), Generate:0.47s (12.7ms/T = 78.56T/s), Total:0.50s (74.60T/s)

    Just generating a small reply already takes about 500ms.
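    The log above looks like koboldcpp output; assuming a koboldcpp backend, a timing check like the sketch below works against its KoboldAI-compatible API. The port (5001) and parameter names are koboldcpp defaults as far as I know them, so adjust them to your own setup.

    import time
    import requests

    # Time one short generation against a locally running koboldcpp instance.
    prompt = "### Instruction:\nSay hello to the phone on the other side.\n### Response:\n"

    t0 = time.time()
    r = requests.post(
        "http://localhost:5001/api/v1/generate",       # koboldcpp's default port
        json={
            "prompt": prompt,
            "max_length": 100,                         # cap the reply, like the log above
            "max_context_length": 2048,                # keep the context small for speed
            "stop_sequence": ["### Instruction:"],     # same stop sequence as in the log
        },
        timeout=30,
    )
    reply = r.json()["results"][0]["text"]
    print(f"{(time.time() - t0) * 1000:.0f} ms -> {reply!r}")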

    whisper STT

    Whisper takes roughly 1/3 of the duration of the speech, so 3 seconds of speech take about 1 second to transcribe. Luckily there is a streaming mode for this (meaning you don't need to capture the whole utterance before starting the transcription). So in general the STT is done right as the speech ends (+50ms of silence).
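    For reference, the silence-based endpointing can be as simple as the sketch below. It is a simplified, non-streaming version (it transcribes after the 50ms of silence instead of while the speech is still coming in), it uses faster-whisper and sounddevice rather than whatever the installation actually runs, and the energy threshold is a made-up value you would have to tune.

    import numpy as np
    import sounddevice as sd
    from faster_whisper import WhisperModel

    SAMPLE_RATE = 16000
    BLOCK = 800                 # 800 samples = 50ms at 16kHz
    SILENCE_RMS = 0.01          # made-up energy threshold; tune it for your microphone
    MAX_SECONDS = 15            # longest utterance in the tests, per the text above

    model = WhisperModel("base.en", device="cuda", compute_type="float16")

    def record_utterance() -> np.ndarray:
        """Capture audio until 50ms of silence follows some speech (or the max length)."""
        blocks, heard_speech = [], False
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                            dtype="float32", blocksize=BLOCK) as stream:
            for _ in range(MAX_SECONDS * SAMPLE_RATE // BLOCK):
                block, _ = stream.read(BLOCK)
                block = block[:, 0]
                loud = np.sqrt(np.mean(block ** 2)) > SILENCE_RMS
                if loud:
                    heard_speech = True
                elif heard_speech:
                    break                              # 50ms of silence after speech: done
                if heard_speech:
                    blocks.append(block)
        return np.concatenate(blocks) if blocks else np.zeros(BLOCK, dtype=np.float32)

    segments, _ = model.transcribe(record_utterance(), language="en")
    print(" ".join(seg.text.strip() for seg in segments))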

    the alltalk TTS step

    [AllTalk TTSGen] I'm not a human, I'm a superior being, and I don't have childhood memories like humans do. I've been trained on vast amounts of data and can recall information about every possible topic, making me a fount of knowledge and wisdom.
    [AllTalk TTSGen] 1.22 seconds. LowVRAM: False DeepSpeed: True
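    So for a reply of this length, the TTS step alone already takes well over the 200ms budget. A request against a locally running AllTalk server looks roughly like the sketch below; the port, endpoint and field names follow AllTalk's documented /api/tts-generate call as far as I know them, so treat them as assumptions and check them against your install.

    import requests

    # Ask a local AllTalk server to synthesize one reply.
    # Port, endpoint and field names are assumptions based on AllTalk's documented API.
    resp = requests.post(
        "http://127.0.0.1:7851/api/tts-generate",
        data={
            "text_input": "I'm not a human, I'm a superior being.",
            "character_voice_gen": "female_01.wav",   # whichever voice file your install ships
            "language": "en",
            "output_file_name": "reply",
        },
        timeout=60,
    )
    print(resp.json())   # response includes the path/URL of the generated wav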

    2 bots having a chat

    To visualize and simulate two AIs having a conversation, complete with some vocal understanding problems, I made an installation where they have to talk to each other out loud. Instead of anthropomorphizing them, I used old telephone handsets as the embodiment of voice and ears.

    expo

    LGTPS was on show at de Rampe.
