How do I stream the response from RetrievalQA chain in holoviz panel application?

Panel newbie here. I have written the following Panel application for an LLM to query on a vector database:

import os, dotenv, openai, panel
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader


# Load the API key from the environment (.env file)
dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

def load_vectorstore():
    # If the vector embeddings of the documents have not been created
    if not os.path.isfile('chroma_db/chroma.sqlite3'):

        # Load the documents
        loader = DirectoryLoader('Docs/', glob="./*.pdf", loader_cls=PyPDFLoader)
        data = loader.load()

        # Split the docs into chunks (chunk sizes here are typical defaults)
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        docs = splitter.split_documents(data)

        # Embed the documents and store them in a Chroma DB
        embedding = OpenAIEmbeddings(openai_api_key=openai.api_key)
        vectorstore = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory="./chroma_db")
    else:
        # Load the existing ChromaDB from disk
        embedding = OpenAIEmbeddings(openai_api_key=openai.api_key)
        vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embedding)

    return vectorstore

def retrieval_qa_chain():

    # Define prompt template
    template = """
    Provide your answers to the best of your ability to the user's questions.

    ## Task Context and History

    - **Context**: {context}
    - **Chat History**: {history}
    - **User Question**: {question}

    ## Answer Template

    Keep your explanations concise and to the point.
    """

    prompt = PromptTemplate(
        input_variables=["history", "context", "question"],
        template=template,
    )

    memory = ConversationBufferMemory(
        memory_key="history",
        input_key="question",
    )

    llm = ChatOpenAI(temperature=0,
                     openai_api_key=openai.api_key)

    vectorstore = load_vectorstore()

    qa_chain = RetrievalQA.from_chain_type(llm,
                                           retriever=vectorstore.as_retriever(),
                                           chain_type_kwargs={
                                               "prompt": prompt,
                                               "memory": memory,
                                           })
    return qa_chain

async def respond(contents, user, chat_interface):
    qa = retrieval_qa_chain()
    response = qa({"query": contents})
    answers = panel.Column(response["result"])
    yield {"user": "Bot", "value": answers}

chat_interface = panel.chat.ChatInterface(
    callback=respond, sizing_mode="stretch_width", callback_exception='verbose'
)
chat_interface.send(
    {"user": "Bot", "value": '''Ask me any question.'''},
    respond=False,
)

template = panel.template.BootstrapTemplate(main=[chat_interface])
template.servable()


It works, but the response from the LLM is displayed all at once. I want to stream the response instead. How do I do that?

I tried using a callback handler, but that introduces other artefacts into the chat response, like the source documents, and it changes the name of the chatbot too. I don’t want any of that - I just want my LLM responses to be streamed instead of displayed all at once at the end. Is there a simple way to do that in Panel?

I think this could work:

Similar idea:

Thanks for your answer. I have actually seen these examples, but none of them are about RAG (retrieval-augmented generation). They all show how to stream the response when calling ChatOpenAI directly; none of them show how to stream the response when querying a vector database with ChatOpenAI. That’s where I am struggling at the moment.

When I change the respond function to the following:

async def respond(contents, user, chat_interface):
    qa = retrieval_qa_chain()
    callback_handler = panel.chat.langchain.PanelCallbackHandler(chat_interface)
    return await qa(contents, callbacks=[callback_handler])

it does stream the responses (along with showing the document sources, which I don’t need, and changing the name of the chatbot, which I don’t want), but also produces the following error when it finishes streaming the response:

Traceback (most recent call last):
  File "/Users/Admin/miniforge3/lib/python3.10/site-packages/panel/chat/", line 526, in _prepare_response
    await asyncio.gather(
  File "/Users/Admin/miniforge3/lib/python3.10/site-packages/panel/chat/", line 495, in _handle_callback
    response = await self.callback(*callback_args)
  File "/Users/Admin/Documents/CompactBot/PanelApp/", line 138, in respond
    return await qa(contents, callbacks=[callback_handler])
TypeError: object dict can't be used in 'await' expression
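The traceback makes sense once you notice that the old-style chain’s `__call__` returns a plain `dict`, and `await` only accepts awaitables (coroutines, Tasks, Futures). The failure can be reproduced without LangChain at all:

```python
# Minimal reproduction of the TypeError, independent of LangChain:
# awaiting a non-awaitable dict fails exactly like `await qa(...)` does,
# because the chain call already returned its result as a dict.
import asyncio

async def main():
    try:
        await {"result": "some answer"}  # awaiting a plain dict
    except TypeError as e:
        return str(e)

print(asyncio.run(main()))  # object dict can't be used in 'await' expression
```

Depending on your LangChain version, the chain’s async entry points (e.g. `ainvoke`) do return awaitables, which is the usual way around this.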

So I rewrote the respond function this way:

async def respond(contents, user, chat_interface):
    qa = retrieval_qa_chain()
    callback_handler = panel.chat.langchain.PanelCallbackHandler(chat_interface)
    yield qa(contents, callbacks=[callback_handler])['result']

And this does fix the exception at the end of the streaming, but it produces even more artefacts I don’t need:

  1. It shows the source documents
  2. It streams the response from the LLM
  3. And when the streaming from step 2 ends, it once again displays the final answer below it.
  4. Changes the name of the chatbot at each step above, depending on what function it is doing.

Any idea how to not show the source documents being retrieved, not display the answer again after streaming ends, and not change the name I gave to the chatbot?

You don’t need the callback_handler if you’re streaming manually.

Did you have a chance to check out the ChatFeed reference docs (Panel v1.3.8) yet? They elaborate on streaming.
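The manual-streaming pattern those docs describe boils down to yielding a growing string from the callback: `ChatInterface` replaces the same message in place on each yield, so no `PanelCallbackHandler` (and none of its extra messages or renamed users) is involved. Here is a sketch of just that pattern, where `fake_token_stream` is a placeholder standing in for however you obtain tokens from your chain:

```python
# Sketch of manual streaming with panel.chat.ChatInterface:
# yield the accumulated text on every token, and Panel updates the
# previous partial message in place, producing a streaming effect.
# `fake_token_stream` is a stub token source, not a real LLM call.
import asyncio

async def fake_token_stream(prompt):
    # Placeholder: a real app would stream tokens from the LLM here.
    for token in ["Retrieval", "-", "augmented ", "answer."]:
        await asyncio.sleep(0)  # let the event loop breathe between tokens
        yield token

async def respond(contents, user, chat_interface):
    message = ""
    async for token in fake_token_stream(contents):
        message += token
        yield message  # each yield overwrites the previous partial message

# Wired up the same way as before:
# chat_interface = panel.chat.ChatInterface(callback=respond)
```

Since you yield plain strings under your own callback, the message keeps the user/avatar configured on the `ChatInterface`, and nothing else (source documents, intermediate steps) gets posted.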