Defining failure modes and benchmarks for LLM use in the HoloViz ecosystem

We’ve recently started an experiment across the HoloViz ecosystem to better understand how our tools, documentation, and examples interact with large language models.

With support from a NumFOCUS Small Development Grant, the goal is to understand what actually makes an open-source project LLM-friendly in practice. In particular: when users ask an LLM to generate HoloViz code, what helps it succeed, and where does it consistently break down?

One early conclusion from our working sessions is that problem definition matters more than tooling at this stage. Before building integrations or servers, we need to step into the shoes of naïve users and observe how LLMs fail today when working with HoloViz, rather than assuming the failure modes in advance.

Our current focus is on documenting and categorizing those failures. Examples include hallucinated or outdated APIs, mixed old and new syntax across versions, missing imports that make otherwise plausible code non-runnable, and lower-quality outputs compared to ecosystems with more explicit examples and guidance.
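One of these failure modes, missing imports, is cheap to detect mechanically. As a rough illustration (a stdlib-only sketch, not part of any HoloViz tooling), a checker could parse generated code and flag names that are used but never imported or defined:

```python
import ast
import builtins

def undefined_names(source: str) -> set[str]:
    """Rough heuristic: names loaded but never imported, assigned, or defined."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                defined.add(alias.asname or alias.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                defined.add(node.id)
    return used - defined

# A typical LLM output that forgot the import:
snippet = "slider = pn.widgets.IntSlider(start=0, end=10)"
print(undefined_names(snippet))  # {'pn'}
```

A real checker would need to handle function arguments, comprehensions, and scoping, but even this crude version catches the "plausible but non-runnable" class of outputs.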

The next concrete step is to turn this into something testable: a small benchmark of realistic tasks and questions, grounded in real user queries (from Discourse, Discord, etc.), and inspired by similar efforts like LangChain’s evaluation workflows. The aim is to make these problems visible, measurable, and reusable so improvements can be driven by evidence rather than anecdotes.
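To make the benchmark idea concrete, here is a minimal sketch of what one case might look like. Everything here is hypothetical (the task id, prompt, and checks are invented for illustration); a real harness would also execute the generated code and score its output, not just pattern-match the source:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkCase:
    """One realistic task plus mechanical checks on the generated code."""
    task_id: str
    prompt: str
    checks: list[tuple[str, Callable[[str], bool]]]

def run_case(case: BenchmarkCase, generated_source: str) -> dict[str, bool]:
    """Apply every named check to the LLM's generated source."""
    return {name: check(generated_source) for name, check in case.checks}

# Hypothetical case in the spirit of real Discourse questions:
case = BenchmarkCase(
    task_id="hvplot-scatter-01",
    prompt="Plot a scatter of petal width vs length with hvplot, colored by species.",
    checks=[
        ("imports hvplot.pandas", lambda src: "import hvplot.pandas" in src),
        ("no deprecated %%opts magic", lambda src: "%%opts" not in src),
    ],
)

sample_completion = (
    "import hvplot.pandas\n"
    "df.hvplot.scatter(x='petal_length', y='petal_width', by='species')"
)
print(run_case(case, sample_completion))
# {'imports hvplot.pandas': True, 'no deprecated %%opts magic': True}
```

Keeping each case as data (prompt plus checks) makes the suite easy to grow from real user queries and easy to rerun as the docs and tools change.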

If you’re interested in following along or contributing examples, we’ll be sharing updates here and on our Discord channel.

2 Likes

Awesome topic! Thanks for starting it. I’ve just installed the recent Panel MCP server shared by @Marc in both VS Code and Antigravity, and have been experimenting with it successfully so far. I wonder whether open-source projects would be better off tackling the LLM integration problem preemptively, building the ecosystem to work sustainably with that in mind, instead of resisting it and running into financial problems like what happened in the Tailwind case.

2 Likes

High-level AI Agents like Claude and GitHub Copilot change users’ expectations of time, quality, and quantity dramatically. Previously, our HoloViz ecosystem would primarily have been compared to other Python dataviz tools and ecosystems. But now users, via AI Agents, are capable of creating complex, high-quality, data-driven applications in React in a very short time.

Basically I think AI Agents behave a lot like new human users would:

  • Bugs in the framework confuse them.
  • Poor defaults (like the zero margin of panel_material_ui.Paper) confuse them and give them a much harder problem to solve. The same can be said about hvPlot/HoloViews defaults, or lack thereof.
  • Multi-step setups make things much harder for them, for example having to install watchfiles separately.
  • Many ways to do one thing, without very clear guidance on when to use what, make them much less reliable (like param.depends, param.bind, .watch, param.rx, or Bokeh + Tabulator formatters; the Tabulator docs could be improved hugely with clearer guidance).
  • Lack of documentation, examples, and community means they are untrained and cannot use web fetch to learn.
  • They need to learn both what to do and what not to do (e.g. pn.extension("tabulator") is good, but pn.extension("bokeh") raises an exception, which is very hard for them to understand because it's not explained anywhere).
  • Building on a known framework like Material UI but then renaming or not supporting attributes/parameters confuses them; they will try to use their knowledge of Material UI.
  • Lack of type annotations or docstrings means they get no help from language servers.
  • Very long or poorly organized and formulated documentation is hard to consume.

It's actually a benefit that AI Agents start as new users every time, because then you can learn from all their problems.

Time:

  • AI Agents iterate much faster on TypeScript problems than Python problems. So do everything you can to speed up your framework (import panel, panel serve, pytest ..., ruff ...) and the associated workflow (UI testing etc.).
  • Claude, Copilot, etc. can display JavaScript/React apps with live updates integrated in the chat/canvas. It's a big challenge that they cannot do the same for Python viz, because users will use what is easy and possible.

Quality:

  • Every Python viz library will now be compared to React alternatives, because it's so easy and fast for users to create high-quality React visualizations. For me the consequence would be to drop the Plotly backend and replace it with ECharts. That would give users something comparable.

Quantity:

  • For AI Agents to be able to develop large, complex visualizations or applications, they need guidance on work processes (planning, development, testing, UI testing, …).
  • If your framework integrates with other libraries, you need to provide guidance on the integration. For example, to update ECharts charts dynamically you need to add the replaceMerge setting to the configuration, and you need to explain that the config dict must be JSON serializable and cannot contain Python functions 🙂
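The JSON-serializability point in particular is cheap to check up front. A small guard like the following (a stdlib sketch, not tied to any specific ECharts integration; exactly where replaceMerge belongs depends on how the pane forwards options) fails fast with a clear message instead of a confusing browser-side error:

```python
import json

def validate_chart_config(config: dict) -> dict:
    """Fail fast if a chart config can't be shipped to the browser as JSON."""
    try:
        json.dumps(config)
    except TypeError as exc:
        raise TypeError(f"Chart config must be JSON serializable: {exc}") from None
    return config

good = {
    "series": [{"type": "line", "data": [1, 2, 3]}],
    # As described above, dynamic updates may need a replaceMerge setting:
    "replaceMerge": ["series"],
}
validate_chart_config(good)  # passes, returns the config unchanged

bad = {"series": [{"type": "line", "itemStyle": {"color": lambda p: "red"}}]}
# validate_chart_config(bad) raises TypeError: a Python function snuck into the config
```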

The holoviz-mcp server provides skills for most frameworks, based on problems encountered in practice: see holoviz-mcp/src/holoviz_mcp/config/resources/skills

I’ve found that having a heavily commented reference example helps them much more than less specific guidelines they have to interpret:

# DO import panel as pn
import panel as pn
import param

# DO always run pn.extension
# DO remember to add any imports needed by panes, e.g. pn.extension("tabulator", "plotly", ...)
# DON'T add "bokeh" as an extension. It is not needed.
# DO use throttled=True when using sliders unless you have a specific reason not to
pn.extension(throttled=True)

# DO organize functions to extract data separately as your app grows. Eventually in a separate data.py file.
# DO use caching to speed up the app, e.g. for expensive data loading or processing that would return the same result given same input arguments.
# DO add a ttl (time to live argument) for expensive data loading that changes over time
@pn.cache(max_items=3)
def extract(n=5):
    return "Hello World" + "⭐" * n

text = extract()
text_len = len(text)

# DO organize functions to transform data separately as your app grows. Eventually in a separate transformations.py file
# DO add caching to speed up expensive data transformations
@pn.cache(max_items=3)
def transform(data: str, count: int=5)->str:
    count = min(count, len(data))
    return data[:count]

# DO organize functions to create plots separately as your app grows. Eventually in a separate plots.py file.
# DO organize custom components and views separately as your app grows. Eventually in separate components.py or views.py file(s).
# DO use param.Parameterized, pn.viewable.Viewer or similar approach to create new components and apps with state and reactivity
class HelloWorld(pn.viewable.Viewer):
    # DO define parameters to hold state and drive the reactivity
    characters = param.Integer(default=text_len, bounds=(0, text_len), doc="Number of characters to display")

    def __init__(self, **params):
        super().__init__(**params)

        # DO use sizing_mode="stretch_width" for components unless "fixed" or other sizing_mode is specifically needed
        with pn.config.set(sizing_mode="stretch_width"):
            # DO create widgets using `.from_param` method
            self._characters_input = pn.widgets.IntSlider.from_param(self.param.characters, margin=(10,20))

            # DO Collect input widgets into horizontal, columnar layout unless other layout is specifically needed
            self._inputs = pn.Column(self._characters_input, max_width=300)

            # CRITICAL: Create panes ONCE with reactive content
            # DON'T recreate panes and layouts in @param.depends methods - causes flickering!
            # DO bind reactive methods/functions to panes for smooth updates
            self._output_pane = pn.pane.Markdown(
                self.model,  # Reactive method reference
                sizing_mode="stretch_width"
            )

            # DO collect output components into some layout like Column, Row, FlexBox or Grid depending on use case
            self._outputs = pn.Column(self._output_pane)

            # DO collect all of your components into a combined layout useful for displaying in notebooks etc.
            self._panel = pn.Row(self._inputs, self._outputs)

    # DO use caching to speed up bound methods that are expensive to compute or load data and return the same result for a given state of the class.
    # DO add a ttl (time to live argument) for expensive data loading that changes over time.
    @pn.cache(max_items=3)
    # DO prefer .depends over .bind over .rx for reactivity methods on Parameterized classes as it can be typed and documented
    # DON'T use `watch=True` or `.watch(...)` methods to update UI components directly.
    # DO use `watch=True` or `.watch(...)` for updating the state parameters or triggering side effects like saving files or sending email.
    @param.depends("characters")
    def model(self):
        # CRITICAL: Return ONLY the content, NOT the layout/pane
        # The pane was created once in __init__, this just updates its content
        return transform(text, self.characters)

    # DO use `watch=True` or `.watch(...)` for updating the state parameters or triggering side effects like saving files or sending email.
    @param.depends("characters", watch=True)
    def _inform_user(self):
        print(f"User selected to show {self.characters} characters.")

    # DO provide a method for displaying the component in a notebook setting, i.e. without using a Template or other element that cannot be displayed in a notebook setting.
    def __panel__(self):
        return self._panel

    # DO provide a method to create a .servable app
    @classmethod
    def create_app(cls, **params):
        instance = cls(**params)
        # DO use a Template or similar page layout for served apps
        template = pn.template.FastListTemplate(
            # DO provide a title for the app
            title="Hello World App",
            # DO provide optional image, optional app description, optional navigation menu, input widgets, optional documentation and optional links in the sidebar
            # DO provide as list of components or a list of single horizontal layout like Column as the sidebar by default is 300 px wide
            sidebar=[instance._inputs],
            # DO provide a list of layouts and output components in the main area of the app.
            # DO use Grid or FlexBox layouts for complex dashboard layouts instead of combination Rows and Columns.
            main=[instance._outputs],
            # DO set main_layout=None for modern layout
            main_layout=None,
        )
        return template

# DON'T provide an `if __name__ == "__main__":` block to serve the app with `python`
# DO provide a method to serve the app with `panel serve`
if pn.state.served:
    # Mark components to be displayed in the app with .servable()
    HelloWorld.create_app().servable()

It's very clear to me that AI Agents need to be guided by opinions and great defaults.
It's also very clear that we need to enable AI Agents to be guided by skills and by easy access to great documentation. But we should also simplify the frameworks and do everything possible to provide great defaults and easy-to-digest documentation.

One benefit of our ecosystem is that `param.Parameterized` classes are easily testable. You cannot say the same about Streamlit, Dash, or Gradio. This means the AI can test and fix issues up front.

Another benefit is that our documentation follows the Diátaxis framework, which helps semantic search.

3 Likes

Welcome to the community @apachaves 🙂

1 Like

Current state: I get great results if I provide the LLM with an example and tell it to build code using the same structure.

We have some great advice on what to do and what to avoid in the documentation: we could make it easy to point the LLM at a relevant doc.

2 Likes

Pointing to documentation and examples is what makes holoviz-mcp so successful.

3 Likes

Wow! That post was very insightful. I wonder how we can start addressing these issues, e.g. for pn.extension("bokeh"), should we just warn instead of raising an error?
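The warn-instead-of-raise idea could look roughly like this (a hypothetical sketch of the behavior only, not Panel's actual pn.extension implementation; the extension names are illustrative):

```python
import warnings

KNOWN_EXTENSIONS = {"tabulator", "plotly", "katex"}  # illustrative subset

def load_extensions(*names: str) -> list[str]:
    """Load known extensions; warn (instead of raising) on unknown ones like 'bokeh'."""
    loaded = []
    for name in names:
        if name in KNOWN_EXTENSIONS:
            loaded.append(name)
        else:
            warnings.warn(
                f"{name!r} is not a Panel extension and is not needed; ignoring it.",
                UserWarning,
                stacklevel=2,
            )
    return loaded

print(load_extensions("tabulator", "bokeh"))  # ['tabulator'], plus a UserWarning
```

A warning keeps otherwise-correct generated apps runnable while still teaching both the LLM and the user what to drop, which seems like the friendlier failure mode.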

Also, that example looks very useful, not only to LLMs but also to human users. Perhaps we can add it as an example to the Apply best practices page in the Panel docs?

1 Like

This is a really good and informative comment. Thank you for this!

1 Like