How to record and transcribe speech and send it to ChatInterface

ahuang11 · July 22, 2024, 5:31pm

import panel as pn
import speech_recognition as sr

from marvin.ai.audio import transcribe
from marvin.audio import record_phrase
from openai import AsyncOpenAI
pn.extension()

def recognize_speech(instance, event):
    # Show a loading spinner on the input widget while recording and transcribing
    with instance.active_widget.param.update(loading=True):
        try:
            # Both steps sit inside the try block so failures are reported in the chat
            audio = record_phrase()
            transcription = transcribe(audio)
            instance.active_widget.value = transcription
        except sr.RequestError as e:
            instance.stream(f"Could not request results; {e}", user="System")
        except sr.UnknownValueError:
            instance.stream("Could not understand the audio", user="System")

async def callback(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = instance.serialize()
    response = await aclient.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
    )
    message = ""
    async for chunk in response:
        part = chunk.choices[0].delta.content
        if part is not None:
            message += part
            yield message

aclient = AsyncOpenAI()
chat = pn.chat.ChatInterface(
    callback=callback,
    button_properties={"speak": {"callback": recognize_speech, "icon": "microphone"}},
    show_rerun=False,
    show_undo=False,
    show_clear=False,
)
chat.servable()

ahuang11 · July 22, 2024, 5:51pm

You can also make it speak back!

import panel as pn
import speech_recognition as sr

from marvin.ai.audio import transcribe, speak_async
from marvin.audio import record_phrase
from openai import AsyncOpenAI
pn.extension()

def recognize_speech(instance, event):
    with instance.active_widget.param.update(loading=True), instance.param.update(loading=True):
        audio = record_phrase()
        transcription = transcribe(audio)
        instance.active_widget.value = transcription

async def callback(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = instance.serialize()
    response = await aclient.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
    )
    message = ""
    async for chunk in response:
        part = chunk.choices[0].delta.content
        if part is not None:
            message += part
            yield message
    await (await speak_async(message, voice="shimmer")).play_async()


aclient = AsyncOpenAI()
chat = pn.chat.ChatInterface(
    callback=callback,
    button_properties={"speak": {"callback": recognize_speech, "icon": "microphone"}},
    show_rerun=False,
    show_undo=False,
    show_clear=False,
)
chat.servable()

awesomebytes · July 26, 2024, 8:28am

Thank you @ahuang11 so much for your example!

I have 2 little issues with your proposed approach:

1. I understand this records audio on the machine running the code (note: I did some minor research on marvin and its dependencies, speech_recognition and pyaudio). I aim to deploy an app on a server that people will access through the browser, so marvin.audio.record_phrase will not work, as it defaults to capturing from the audio card of the computer where it is running. So I was aiming to use Panel’s SpeechToText, as it does the JS magic to capture the audio (and send it to the browser’s default STT implementation). I also tried to find a way to record audio via the browser in Panel, but I had no luck.
2. This uses OpenAI in the background, which requires an API key, which implies paying for it. Although I’m not against it (in the end I’ll be using some LLM that will most probably be paid), having the “free” option given by the browser for both STT and TTS is a great way to develop and test (I know that in the end it basically sends the audio to Google, and there’s a limited amount until getting blocked/charged for STT).

Overall, I would like to, again, thank you SO MUCH for spending your time and energy helping me (this convo started on GitHub). If you, or any other kind soul, know how to make a widget that just captures audio via the browser in Panel (so I can then redirect that audio to any STT service of my choosing), I’d love to learn about it. I’m not well versed in JS, and I’m just getting into learning Panel (which seems to use Bokeh in the background?), so I may be hitting my head against a wall for a while if I try to implement it myself.

Have a great day/night!
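
For the browser-side route described above, one possible sketch is to watch the SpeechToText widget's value and forward each transcript to the chat with send. This has not been tested against the ChatInterface integration being discussed; the names send_transcript and build_app are hypothetical, and the approach relies on the browser's own speech recognition, which is not available in every browser.

```python
def send_transcript(chat, event):
    """Forward a finished browser transcript to the chat as a user message."""
    text = (event.new or "").strip()
    if text:
        # respond=True triggers the ChatInterface callback, as if typed by hand
        chat.send(text, user="User", respond=True)


def build_app(callback):
    """Lay out a SpeechToText button above a ChatInterface (hypothetical helper)."""
    import panel as pn  # imported here so send_transcript has no Panel dependency

    pn.extension()
    # Recording and recognition happen in the visitor's browser, not on the server
    speech_to_text = pn.widgets.SpeechToText(button_type="light")
    chat = pn.chat.ChatInterface(callback=callback)
    speech_to_text.param.watch(lambda event: send_transcript(chat, event), "value")
    return pn.Column(speech_to_text, chat)
```

With the callback from the examples above, build_app(callback).servable() would serve the layout. Keeping the widget outside the ChatInterface sidesteps the question of how it behaves as an input widget, at the cost of a separate button.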

ahuang11 · July 26, 2024, 6:25pm

Thanks for the feedback. For SpeechToText, can you submit an issue on GitHub to make it workable with ChatInterface?

For point #2, on speech to text, I think you can use speech_recognition without marvin and call recognize_whisper, which will run on your own machine. (I recommend just asking ChatGPT/Claude how to migrate to vanilla speech_recognition.) You can also use ElevenLabs for text to speech under the free tier.
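
A minimal sketch of that migration, assuming the SpeechRecognition package is installed together with PyAudio (for microphone access) and openai-whisper (for local transcription). The helper name record_and_transcribe_locally and the transcriber parameter are hypothetical, introduced only for illustration; the parameter makes the callback easy to exercise without a microphone.

```python
def record_and_transcribe_locally(model: str = "base") -> str:
    """Record one phrase from the local microphone and transcribe it offline."""
    # Imported here so the rest of the app loads without the audio stack
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:  # requires PyAudio
        audio = recognizer.listen(source)
    # recognize_whisper runs the open-source Whisper model on this machine,
    # so no OpenAI API key is involved
    return recognizer.recognize_whisper(audio, model=model)


def recognize_speech(instance, event, transcriber=record_and_transcribe_locally):
    """ChatInterface button callback: put the transcription in the input box."""
    with instance.active_widget.param.update(loading=True):
        try:
            instance.active_widget.value = transcriber()
        except Exception as exc:  # e.g. no microphone found, model failed to load
            instance.stream(f"Could not transcribe audio: {exc}", user="System")
```

This recognize_speech can be passed to button_properties exactly as in the examples above. Like record_phrase, it still records on the machine running the server, so it only addresses the API-key concern, not browser-side capture.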

awesomebytes · July 26, 2024, 6:58pm

Done, @ahuang11: Integrate SpeechToText with ChatInterface · Issue #7021 · holoviz/panel · GitHub