Discussion on best practices for advanced caching - please join

Hi,

This is a discussion on caching best practices. I hope you will join. Please share your comments and suggestions.


I would like to start optimizing the way I use caching, because I want to improve both my development experience and my users' experience. I need things to be faster for me and my users.

I can use the simple cache dictionary in pn.state.cache, but it is just that: simple. I need something more powerful.
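For reference, the baseline I mean looks like this (a minimal sketch; load_df is a stand-in for a slow loader):

```python
import time

import pandas as pd
import panel as pn

def load_df():
    time.sleep(1)  # stand-in for a slow database call
    return pd.DataFrame({"x": range(5)})

# pn.state.cache is a plain dict shared by all sessions in the same server process
def get_shared_df():
    if "df" not in pn.state.cache:
        pn.state.cache["df"] = load_df()
    return pn.state.cache["df"]
```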

My Requirements

  • If possible I would like to avoid setting up additional infrastructure to depend on, like Redis or Memcached. I would like to keep things simple.
    • Maybe it is just me who doesn’t want to learn yet another thing and maintain it :slight_smile:
  • The caching should be easy and powerful to use, not just for me but also for my analyst and trader colleagues and the Panel community.
  • My data and usage are not so large that I believe I need to store them elsewhere. My laptop or the mounted drive in my Docker container would be fine.
  • My objects are DataFrames, HoloViews, Plotly and ECharts plots, some machine learning models, etc. They should be easy to cache.

Development Experience

My workflow is developing in VS Code and serving my Panel app with the --dev option: python -m panel serve 'app.py' --dev. When I change the code, the server reloads. But then it has to reload my data, plots, etc. again. That is slow, mostly bound by slow database calls.

  • I need the data to be persisted between server restarts.
    • I want the option to not persist my other objects, like plots, as they change.
    • Optimally I would like the cache to be clever enough to know whether the cached results of specific functions are still valid. Streamlit supports this (see the sketch after this list).
  • I would also like the caching to be easy to apply to functions via a @memoize decorator.
  • I would like it to be easy to clear the cache manually via a button or a script I can run.
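On the "still valid" point above, here is a minimal sketch of one possible invalidation scheme, inspired by how Streamlit hashes functions (code_key is a hypothetical helper, not an existing API):

```python
import hashlib
import inspect

def code_key(func):
    """Hash a function's source code; if the code changes, the key changes."""
    source = inspect.getsource(func)
    return hashlib.md5(source.encode()).hexdigest()

# A memoize decorator could include code_key(func) in every cache key, so
# editing a function automatically invalidates its previously cached results.
```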

User Experience

My apps should be fast and snappy, and the data should be up to date.

  • I need to be able to set the expiration of specific data or collections of data either relatively (in 5 minutes) or absolutely (at 16:00); see the sketch after this list.
  • The cache should be robust when I have many users and concurrent requests. It should work in a Panel/Bokeh/Tornado environment.
  • I would like to be able to add to and update the cache outside of my app by running scripts at regular intervals.
  • I should be able to control whether I cache globally, for a specific user, or for a specific session.
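On the expiration point above: most cache APIs take a relative TTL in seconds, so an absolute deadline like 16:00 has to be converted first. A small sketch of that conversion:

```python
from datetime import datetime, timedelta

def seconds_until(hour, minute=0):
    """Seconds from now until the next occurrence of hour:minute."""
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:  # already past today: roll over to tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()

# e.g. cache.set("eod_prices", df, expire=seconds_until(16))  # expire at 16:00
```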

Other

  • One question that I would also like an answer to: should I combine memory and disk caching to make things as fast as possible? And enable the option of keeping hard-to-serialize/pickle objects in memory only?

So my question is: how should I set up more advanced caching? Any suggestions?


I have been trying out DiskCache. It is very, very promising. As far as I can tell, the cache works between sessions and across --dev restarts. It also supports a lot of advanced caching strategies, like expiration.

And it’s simple to use. It could be even simpler if I could get the @cache.memoize decorator to work; see memoize not working in Panel.
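For reference, the DiskCache calls I am using look roughly like this (a sketch based on the documented set/memoize/evict API; the linked issue is exactly about getting the decorator to work under Panel):

```python
from diskcache import FanoutCache

cache = FanoutCache("cache")

@cache.memoize(expire=5 * 60, tag="data")  # relative expiration: 5 minutes
def get_data(value):
    return value * 2  # stand-in for a slow computation

cache.set("key", "value", expire=60)  # manual set with a 60 second expiration
cache.evict("data")                   # drop everything tagged "data"
cache.clear()                         # or clear the whole cache
```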

Video

[screen recording: diskcache2]

Code

```python
import time

import numpy as np
import pandas as pd
import panel as pn
import param
from diskcache import FanoutCache

pn.config.sizing_mode = "stretch_width"

EXPIRE = 5 * 60  # 5 minutes
CACHE_DIRECTORY = "cache"

cache = FanoutCache(directory=CACHE_DIRECTORY)
# pn.state.cache["cache"]=cache # Maybe I should cache the cache or something?

# restart

def _get_data(value):
    """Return the DataFrame for ``value``, computing and caching it on a miss."""
    value_str = str(value)
    if value_str in cache:
        return cache[value_str]
    else:
        print("loading data", value)
        time.sleep(1)

        df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
        cache[value_str] = df
        print("data loaded")
        return df


class MyApp(param.Parameterized):
    value = param.Integer(default=0, bounds=(0, 10))
    data = param.Integer()
    clear_cache = param.Action()

    def __init__(self, **params):
        super().__init__(**params)

        self.data_panel = pn.pane.Str()
        self.loading_spinner = pn.widgets.indicators.LoadingSpinner(
            width=25, height=25, sizing_mode="fixed"
        )
        self.clear_cache = self._clear_cache

        self.view = pn.Column(
            self.loading_spinner,
            self.param.value,
            self.data_panel,
            self.param.clear_cache,
            max_width=500,
        )

        self._update_data()

    @param.depends("value", watch=True)
    def _update_data(self):
        self.loading_spinner.value = True
        self.data_panel.object = f"Data: {_get_data(self.value)}"
        self.loading_spinner.value = False

    def _clear_cache(self, *events):
        cache.clear()


if __name__.startswith("bokeh"):
    MyApp().view.servable()
```

I have tried adding the cachetools TTLCache on top. So far this seems to speed things up even more.

So I updated _get_data from the example above with:

```python
from cachetools import cached, TTLCache

@cached(cache=TTLCache(maxsize=1024, ttl=600))
def _get_data(value):
    value_str = str(value)
    if value_str in cache:
        return cache[value_str]
    else:
        print("loading data", value)
        time.sleep(1)

        df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
        cache[value_str] = df
        print("data loaded")
        return df
```

I will need to figure out how to get things working globally. Right now I think the in-memory layer only works per session.
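One idea I might try (an untested sketch): keep the TTLCache in pn.state.cache, which Panel shares across sessions in the same server process, instead of creating it at module level in the app script:

```python
import panel as pn
from cachetools import TTLCache, cached

# Reuse one TTLCache for every session in this server process
memory_cache = pn.state.cache.setdefault("ttl_cache", TTLCache(maxsize=1024, ttl=600))

@cached(cache=memory_cache)
def _get_data(value):
    ...  # same body as in the example above
```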

Really good idea.
An adaptation from https://gist.github.com/GuiMarthe/8ebcc912fd9052ba64adc264900a9bb0

The bottleneck for me is essentially loading the data, but I like the idea of decorators. It keeps the pandas processing clean.

```python
from functools import wraps
from pathlib import Path

import pandas as pd

def cache_pandas_result(cache_dir, hard_reset: bool, geoformat=False):
    '''
    This decorator caches a pandas.DataFrame-returning function.
    It saves the pandas.DataFrame in a parquet file in the cache_dir.
    It uses the following naming scheme for the caching files:
        cache_dir / function_name + '.trc.pqt'
    Parameters:
    cache_dir: a pathlib.Path object
    hard_reset: bool; if True, recompute even if a cache file exists
    geoformat: bool; if True, read the cache back with geopandas
    '''
    def build_caching_function(func):
        @wraps(func)
        def cache_function(*args, **kwargs):
            if not isinstance(cache_dir, Path):
                raise TypeError('cache_dir should be a pathlib.Path object')

            cache_file = cache_dir / (func.__name__ + '.trc.pqt')

            if hard_reset or (not cache_file.exists()):
                result = func(*args, **kwargs)
                if not isinstance(result, pd.DataFrame):
                    raise TypeError(f"The result of computing {func.__name__} is not a DataFrame")
                result.to_parquet(cache_file)
                return result
            print("{} exists".format(cache_file.name))
            if geoformat:
                import geopandas as gpd
                result = gpd.read_parquet(cache_file)
            else:
                result = pd.read_parquet(cache_file)
            return result
        return cache_function
    return build_caching_function
```
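For anyone trying it, usage could look like this (my own example; the slow query is a stand-in):

```python
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

@cache_pandas_result(CACHE_DIR, hard_reset=False)
def load_sales():
    # stand-in for a slow database query
    return pd.DataFrame({"region": ["EU", "US"], "sales": [100, 200]})

df = load_sales()  # first call runs the "query" and writes cache/load_sales.trc.pqt
df = load_sales()  # second call just reads the parquet file back
```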

Thanks @slamer59 . You are thinking along the same lines as I am. :+1:

The DiskCache and cachetools approaches above are more general than DataFrame-only caching. For example, they can also cache plots and machine learning models. But the pandas caching function you are pointing to could be more performant, as it uses Parquet to store the DataFrames.

But I am in doubt whether the pandas cache function can handle an intensive multi-user scenario. And it does not support expiration, tags/namespaces, etc.

Just my two cents.

Thanks, Marc; it would be great to get better support for a workflow of editing a text file and iterating on the results. This issue doesn’t come up much for people (like myself) using Jupyter, because Jupyter makes it easy to execute one cell to load the data then have any number of other cells for iteratively developing visualizations and apps using that data. If using Jupyter isn’t appropriate, then yes, it would be great to have good on-disk caching to reduce the cost of getting feedback on your edits.

Thanks @jbednar

I think that caching could also help a Jupyter user. It could help in two ways.

  • During development in Jupyter, the user does not have to manually download and save data if that is a slow thing to repeat. She does not even have to spend time deciding whether she should serialize as JSON, CSV, Parquet, pickle, or something else. She can just use a cache decorator to annotate the relevant function, and the object will be persisted to disk. The data is then there the next day as well, when she reopens the notebook. And when testing the app, the parts that are not under development could use cached calculation results to speed things up too.
  • In production it should be the same improvement as for editor users. Their apps will be much, much faster.

I think this discussion will get very confusing if we don’t completely separate disk and memory caching, preferably into different issues or discourse topics because the implementation and use cases are so different. On-disk caching will definitely help development when using a text editor, and may also be of use for external data sources that might go down (though I would hope that’s not your only copy of such data!). But I don’t see how on-disk caching will improve the app-user experience, which is the province of in-memory caching, already supported by pn.state.

Maybe you are talking about disk caching for datasets so large they don’t fit in memory but with an original source so slow it makes sense to store a local disk copy instead of re-querying, but if they are that big, seems likely to me that they’d need special handling (e.g. storing in a user-selected fast file format appropriate for that particular data type), not just generic memoization. So I’m not seeing how generic disk-based caching is going to make an impact on the app-user experience in typical cases.

Unless I’m missing something or confused about that (always possible!), I think we’ll be able to make progress much better if we discuss possible improvements to how an app gets things in and out of pn.state (e.g. easily persisting to pn.state via a decorator) separately from how pn.state (or other state) might get persisted to disk. To me those are two very different topics with different but complementary solutions.
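To make the decorator idea concrete, here is a minimal sketch of persisting results to pn.state via a decorator (pn_memoize is a hypothetical name, not an existing Panel API; it assumes hashable arguments):

```python
import functools

import panel as pn

def pn_memoize(func):
    """Cache a function's results in pn.state.cache so all sessions share them."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # build a hashable key from the function name and its arguments
        key = (func.__name__, args, tuple(sorted(kwargs.items())))
        if key not in pn.state.cache:
            pn.state.cache[key] = func(*args, **kwargs)
        return pn.state.cache[key]
    return wrapper
```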