Discussion on best practices for advanced caching - please join

Hi

This is a DISCUSSION ON CACHING. I hope you will join. Please share your comments and suggestions.


I would like to start optimizing the way I use caching. I want to do this to improve both my development experience and my users' experience. I need things to be faster for me and my users.

I can use the cache dictionary in pn.state.cache, but it is too simple. I need something better.

My Requirements

  • If possible I would like to avoid setting up additional infrastructure to depend on, like Redis or Memcached. I would like to keep things simple.
    • Maybe it is just me who doesn’t want to learn yet another thing and maintain it :slight_smile:
  • The caching should be easy and powerful to use, not just for me but also for my analyst and trader colleagues and the Panel community.
  • My data and usage are not so large that I believe I need to store them elsewhere. My laptop and a mounted drive in my Docker container would be fine.
  • My objects are DataFrames, HoloViews, Plotly and ECharts plots, some machine learning models etc. They should be easy to cache.

Development Experience

My workflow is developing in VS Code and serving my Panel app with the --dev option: python -m panel serve 'app.py' --dev. When I change the code the server reloads. But then it has to reload my data, plots etc. again. That is slow, mostly because of slow database calls.

  • I need the data to be persisted between server restarts.
    • I want the option to not persist my other objects like plots etc. as they change.
    • Ideally I would like the cache to be clever enough to know whether the cached results of specific functions are still valid. Streamlit supports this.
  • I would also like the caching to be easy to apply to functions via a @memoize decorator.
  • I would like it to be easy to clear the cache manually via a button or script I can run.
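
To make the list above concrete, here is a minimal sketch of what a persistent @memoize decorator with a manual clear could look like. It uses only the standard library; `disk_memoize` and `clear_cache` are hypothetical names made up for illustration, not part of Panel or DiskCache, and it does not try to detect that a function's code has changed.

```python
import functools
import hashlib
import pickle
from pathlib import Path


def disk_memoize(cache_dir):
    """Cache pickled results as files under cache_dir (hypothetical helper)."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key on the function name plus the repr of its arguments.
            # Assumes the arguments have a stable repr (numbers, strings, tuples).
            raw = repr((func.__qualname__, args, sorted(kwargs.items())))
            path = cache_dir / (hashlib.sha256(raw.encode()).hexdigest() + ".pkl")
            if path.exists():
                # The file survives a server restart, so the slow call is skipped.
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    return decorator


def clear_cache(cache_dir):
    """Delete every cached result and return how many files were removed."""
    removed = 0
    for path in Path(cache_dir).glob("*.pkl"):
        path.unlink()
        removed += 1
    return removed
```

A button callback or a small script could call `clear_cache("cache")` to start from scratch. Using a separate directory per kind of object (data vs. plots) would be one way to persist the data while leaving the plots uncached.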

User Experience

My apps should be fast and snappy and the data up to date.

  • I need to be able to set the expiration of specific data or collections of data either relative (in 5 mins) or absolutely (at 16:00).
  • The cache should be robust when I have many users and concurrent requests. It should work in a Panel/ Bokeh/ Tornado environment.
  • I would like to be able to add to and update the cache outside of my app by running scripts on regular intervals.
  • I should be able to control whether I cache globally, for the specific user or the specific session.
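
On the relative vs. absolute expiration point: caches that take an expiration usually want it as a number of seconds, so an absolute deadline like "at 16:00" has to be converted first. A small sketch of that conversion, where `seconds_until` is a hypothetical helper name and local naive time is assumed:

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute=0, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The deadline has already passed today, so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


RELATIVE_EXPIRE = 5 * 60  # "in 5 mins"
ABSOLUTE_EXPIRE = seconds_until(16)  # "at 16:00", computed when the value is stored
```

The absolute value has to be recomputed each time something is stored, since the number of seconds left until 16:00 shrinks during the day.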

Other

  • One question that I would also like an answer to: should I combine memory and disk caching to make things as fast as possible? And enable the option of keeping hard-to-serialize/pickle objects in memory only?

So my question is: HOW SHOULD I SET UP MORE ADVANCED CACHING? ANY SUGGESTIONS?
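
To show what I mean by combining the two, here is a rough sketch of a two-tier cache: a dictionary in front of pickle files, with an option to keep an object in memory only. `TwoTierCache` is a made-up class for illustration, it has no expiration or locking, and it is not how any particular library does it.

```python
import hashlib
import pickle
from pathlib import Path


class TwoTierCache:
    """Memory in front of disk; hypothetical sketch without expiry or locking."""

    def __init__(self, directory):
        self._memory = {}
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        digest = hashlib.sha256(repr(key).encode()).hexdigest()
        return self._dir / (digest + ".pkl")

    def get(self, key, default=None):
        if key in self._memory:  # fast path: no disk access, no unpickling
            return self._memory[key]
        path = self._path(key)
        if path.exists():  # slow path: read from disk, then promote to memory
            value = pickle.loads(path.read_bytes())
            self._memory[key] = value
            return value
        return default

    def set(self, key, value, persist=True):
        self._memory[key] = value
        if persist:  # pass persist=False for objects that cannot be pickled
            self._path(key).write_bytes(pickle.dumps(value))
```

Values stored with `persist=False` are lost on restart, which is the trade-off for objects that are hard to serialize.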


I have been trying out DiskCache. It is very, very promising. I can get the cache working across sessions and across (--dev) restarts. It also supports a lot of advanced caching strategies like expiration.

And it’s simple to use. And it could be even simpler if I could get the @cache.memoize annotation to work. See memoize not working in Panel.


Code

```python
import time

import numpy as np
import pandas as pd
import panel as pn
import param
from diskcache import FanoutCache

pn.config.sizing_mode = "stretch_width"

EXPIRE = 5 * 60  # 5 minutes
CACHE_DIRECTORY = "cache"

cache = FanoutCache(directory=CACHE_DIRECTORY)
# pn.state.cache["cache"]=cache # Maybe I should cache the cache or something?

# restart

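def _get_data(value):
    key = str(value)
    # A single get avoids a KeyError if the entry expires between
    # a membership test and the read.
    df = cache.get(key)
    if df is not None:
        return df

    print("loading data", value)
    time.sleep(1)  # simulate a slow database call

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    cache.set(key, df, expire=EXPIRE)  # entry expires after EXPIRE seconds
    print("data loaded")
    return df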


class MyApp(param.Parameterized):
    value = param.Integer(default=0, bounds=(0, 10))
    data = param.Integer()
    clear_cache = param.Action()

    def __init__(self, **params):
        super().__init__(**params)

        self.data_panel = pn.pane.Str()
        self.loading_spinner = pn.widgets.indicators.LoadingSpinner(
            width=25, height=25, sizing_mode="fixed"
        )
        self.clear_cache = self._clear_cache

        self.view = pn.Column(
            self.loading_spinner,
            self.param.value,
            self.data_panel,
            self.param.clear_cache,
            max_width=500,
        )

        self._update_data()

    @param.depends("value", watch=True)
    def _update_data(self):
        self.loading_spinner.value = True
        self.data_panel.object = f"Data: {_get_data(self.value)}"
        self.loading_spinner.value = False

    def _clear_cache(self, *events):
        cache.clear()


if __name__.startswith("bokeh"):
    MyApp().view.servable()
```

I have tried adding the cachetools TTLCache on top. So far this seems to speed things up even more.

So I updated _get_data from the example above to:

```python
from cachetools import cached, TTLCache


# In-memory layer in front of the disk cache; entries live for 10 minutes.
@cached(cache=TTLCache(maxsize=1024, ttl=600))
def _get_data(value):
    key = str(value)
    df = cache.get(key)  # disk cache from the example above
    if df is not None:
        return df

    print("loading data", value)
    time.sleep(1)  # simulate a slow database call

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    cache.set(key, df, expire=EXPIRE)  # entry expires after EXPIRE seconds
    print("data loaded")
    return df
```

I will need to figure out how to get things working globally. Right now I think the in-memory cache only works per session.
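
My guess at the cause, which I have not verified: the app script is executed again for every new session, so a module-level TTLCache is created afresh each time. If that is right, keeping the cache object in a process-wide dictionary should fix it. A sketch of the idea, where `get_shared` is a hypothetical helper and the plain dict `process_wide` stands in for pn.state.cache:

```python
def get_shared(registry, name, factory):
    """Return the object stored under `name`, creating it on first use."""
    if name not in registry:
        registry[name] = factory()
    return registry[name]


# Stand-in for pn.state.cache (assumption: that dict is shared by all sessions).
process_wide = {}


def new_session():
    # Mimics the script body running once per session.
    return get_shared(process_wide, "data_cache", dict)


first = new_session()
second = new_session()
first["prices"] = [1, 2, 3]
print(second["prices"])  # both sessions see the same cache object
```

In the app, the registry would be pn.state.cache and the factory would build the TTLCache, with the decorator then receiving the shared object instead of a new one. With several threads touching the same cachetools cache, passing a lock to `cached` is probably needed as well.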

Really good idea.
An adaptation from https://gist.github.com/GuiMarthe/8ebcc912fd9052ba64adc264900a9bb0

The bottleneck for me is essentially loading the data, but I like the idea of decorators. They keep the pandas processing clean.

```python
from functools import wraps
from pathlib import Path

import pandas as pd


def cache_pandas_result(cache_dir, hard_reset: bool, geoformat=False):
    '''
    This decorator caches a pandas.DataFrame returning function.
    It saves the pandas.DataFrame in a parquet file in the cache_dir.
    It uses the following naming scheme for the caching files:
        cache_dir / function_name + '.trc.pqt'
    The file name does not depend on the arguments, so every call to the
    same function shares one cache file.
    Parameters:
    cache_dir: a pathlib.Path object
    hard_reset: bool, recompute and overwrite the cached file when True
    geoformat: bool, read the cached file back with geopandas when True
    '''
    def build_caching_function(func):
        @wraps(func)
        def cache_function(*args, **kwargs):
            if not isinstance(cache_dir, Path):
                raise TypeError('cache_dir should be a pathlib.Path object')

            cache_file = cache_dir / (func.__name__ + '.trc.pqt')

            # Cache miss (or forced refresh): compute and write the parquet file.
            if hard_reset or (not cache_file.exists()):
                result = func(*args, **kwargs)
                if not isinstance(result, pd.DataFrame):
                    raise TypeError(f"The result of computing {func.__name__} is not a DataFrame")
                result.to_parquet(cache_file)
                return result

            # Cache hit: read the stored file instead of calling func.
            print(f"{cache_file.name} exists")
            if geoformat:
                import geopandas as gpd
                result = gpd.read_parquet(cache_file)
            else:
                result = pd.read_parquet(cache_file)
            return result
        return cache_function
    return build_caching_function
```

Thanks @slamer59 . You are thinking along the same lines as I am. :+1:

The DiskCache and cachetools approach above should be more general than DataFrames, for example by also being able to cache plots and machine learning models. But the pandas caching function you are pointing to could be more performant, as it uses Parquet to store the DataFrames.

But I doubt whether the pandas cache function can handle an intensive multi-user scenario. It also does not support expiration, tags/namespaces etc.
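
To illustrate what I think is missing, here is a sketch of two small additions a file-based cache would need: an age check so files can expire, and an atomic write so that a concurrent reader never sees a half-written file. Both helpers are hypothetical and use only the standard library; they work on raw bytes rather than Parquet to keep the example self-contained.

```python
import os
import tempfile
import time
from pathlib import Path


def is_fresh(path, max_age_seconds, now=None):
    """True if the file exists and was modified within max_age_seconds."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) <= max_age_seconds


def atomic_write_bytes(path, data):
    """Write to a temporary file in the same directory, then rename it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace swaps the file in one step on the same filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise
```

For the decorator above, the same pattern would mean writing the Parquet file to a temporary name and renaming it, and treating a stale file like a missing one. Two users can still trigger the same slow computation at the same time; avoiding that would need a lock on top.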

Just my two cents.

Thanks, Marc; it would be great to get better support for a workflow of editing a text file and iterating on the results. This issue doesn’t come up much for people (like myself) using Jupyter, because Jupyter makes it easy to execute one cell to load the data then have any number of other cells for iteratively developing visualizations and apps using that data. If using Jupyter isn’t appropriate, then yes, it would be great to have good on-disk caching to reduce the cost of getting feedback on your edits.

Thanks @jbednar

I think that caching could also help a Jupyter user. It could help in two ways.

  • During development in Jupyter the user does not have to manually download and save data if that is a slow thing to repeat. She does not even have to spend time working out whether she should serialize as JSON, CSV, Parquet, pickle or something else. She can just use a cache decorator on the relevant function. Then the object will be persisted to disk, and the data is still there the next day when she reopens the notebook. And when testing the app, the parts that are not under development could use cached calculation results to speed things up as well.
  • In production it should be the same improvement as for editor users. Their apps will be much, much faster.

I think this discussion will get very confusing if we don’t completely separate disk and memory caching, preferably into different issues or discourse topics because the implementation and use cases are so different. On-disk caching will definitely help development when using a text editor, and may also be of use for external data sources that might go down (though I would hope that’s not your only copy of such data!). But I don’t see how on-disk caching will improve the app-user experience, which is the province of in-memory caching, already supported by pn.state.

Maybe you are talking about disk caching for datasets so large they don’t fit in memory but with an original source so slow it makes sense to store a local disk copy instead of re-querying, but if they are that big, seems likely to me that they’d need special handling (e.g. storing in a user-selected fast file format appropriate for that particular data type), not just generic memoization. So I’m not seeing how generic disk-based caching is going to make an impact on the app-user experience in typical cases.

Unless I’m missing something or confused about that (always possible!), I think we’ll be able to make progress much better if we discuss possible improvements to how an app gets things in and out of pn.state (e.g. easily persisting to pn.state via a decorator) separately from how pn.state (or other state) might get persisted to disk. To me those are two very different topics with different but complementary solutions.


I’ve been exploring for a while and what I found is that a good and simple solution to use is DiskCache.

I’ve created the demo app below. Check it out https://awesome-panel.org/caching-example :slight_smile:

Reply to @jbednar above

The reasons why I’m not using pn.state.cache a lot are: 1) There is no easy-to-use decorator with expiration. 2) It is not persistent, and my apps are updated and restarted multiple times a day. 3) The cache cannot be “pre-loaded” outside the app.

  1. This could be solved relatively easily.

  2. and 3. These probably could be solved too, with enough effort. But why do that when there are existing solutions like DiskCache and Redis?
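
For point 1, this is roughly the decorator I have in mind: a sketch that memoizes into any dictionary, such as pn.state.cache, with a time-to-live. `memoize_ttl` is a hypothetical name, the arguments must be hashable, and there is no locking or size limit.

```python
import functools
import time


def memoize_ttl(store, ttl, clock=time.monotonic):
    """Memoize into `store` (any dict) and recompute after `ttl` seconds."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (func.__qualname__, args, tuple(sorted(kwargs.items())))
            hit = store.get(key)
            # Each entry is a (timestamp, value) pair; reuse it while it is young enough.
            if hit is not None and clock() - hit[0] < ttl:
                return hit[1]
            value = func(*args, **kwargs)
            store[key] = (clock(), value)
            return value

        return wrapper

    return decorator
```

Usage would be something like `@memoize_ttl(pn.state.cache, ttl=300)` on a data-loading function. The `clock` parameter is only there so that the expiry can be tested without waiting.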

By the way, as I understand it DiskCache uses a combination of memory and disk, so for me it’s not a discussion of one or the other but of how to combine the two most efficiently.

Thanks for the investigation here. I would like to see two issues come out of this:

  1. Improvements for the in-memory cache based around pn.state.cache and pn.state.as_cached and memoize decorators.

  2. An issue discussing integration with an out-of-memory (i.e. disk) cache. My main concern here is that disk caches are quite dependent on the kind of data you mean to persist; however, since DataFrames are going to be the dominant use case, I’d be quite happy to ship something in Panel to improve the caching experience there. At the same time though, a savvy user will be able to achieve quite good performance by persisting columnar data to something like Parquet and nD-arrays to something like Zarr. Beyond that, a persistent in-memory solution like Redis will surely provide some additional speedup. In either case I’d love to see an issue discussing these options.


@philippjfr. See also this comment on stacking caches https://github.com/grantjenks/python-diskcache/issues/192#issuecomment-770412879 from the developer of DiskCache.

Thanks for making that video; looks great!

I’m not sure if we’re all talking about the same things here, so here’s an example of what I mean by disk vs memory caching: our new example https://examples.pyviz.org/ship_traffic/ that uses both types. First, it uses disk-based caching to Parquet files to capture the results of 10 minutes of processing a large set of CSVs, creating a Parquet file that takes 10 seconds to load. That way the full pipeline from the original source data is captured, while not having to run through it every time. Only the first time the script is run on a given filesystem does it ever need to build the Parquet file, at the cost of some disk space taken up that’s smaller than the original data. You could certainly run that in batch if you prefer.

pn.state is then used both to avoid having a user wait even those 10 seconds to load from disk and, even more importantly, so that only a single copy of that dataset is needed in Python memory. If a new copy were created for every user, the machine would quickly run out of memory, while reusing the copy in pn.state not only speeds things up but also avoids the memory issue.

Because these are two very different concerns, they were implemented very differently. Yes, a decorator could replace the explicit call to pn.state, which might be nice, though it wouldn’t be significantly shorter and a lot of people aren’t comfortable with decorators. It would also be nice to have some support for automatic Parquet-based caching of a columnar dataframe, but here that would maybe save a line or two of code and not be any faster. I’d be surprised if DiskCache or any other general-purpose caching solution could compete on speedup, but would be happy to be proven wrong!

I’m not sure how what you’re caching in your example relates to this; is it pickled HoloViews objects? If so that wouldn’t apply to the case in my example, where the HoloViews object isn’t weighty; it’s the underlying DataFrame that needs caching.


I’m caching both DataFrames and HoloViews objects.

My work use cases seldom have static data. We get new prices, weather data etc. all the time. So we need to work with things like expiration.

We query the database a lot for smaller, filtered datasets, and those requests are slow: 0.5 to 5 seconds each.