Best practices for large, high-memory dataframes and selections

Hello!

I’m in the process of building a dashboard where the base data could potentially be quite a large dataframe. By this I mean millions of rows, with a lot of selections and resamples to apply on top (the size I’m working with currently is OK, around 0.5–1 GB in memory, but it could grow). For example, say it was oil field data: you could have a hierarchy of company → site → well, where users select which of these values they want to see, along with other fields such as a date range.

There are also resamplings to do on the data, e.g. how many events occurred per day, that kind of thing.

Given that I initially need to host the dashboard on a server without a huge amount of memory, I have been playing around with how best to structure the data and dashboard to make it as efficient as possible. In the past I’ve loaded the dataframe into the cache in a setup script, along with the selector options (a rough sketch of that setup is below; it’s not ideal, since I would normally fill the options in automatically), and then built a Parameterized class like the one that follows it.
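The setup script looks roughly like this (the file name and the way I derive the options are just illustrative):

import pandas as pd
import panel as pn

# Load the full dataframe once, before any sessions are created
df = pd.read_parquet('oil_field_data.parquet')  # hypothetical file
pn.state.cache['data'] = df

# Pre-compute the selector options from the data itself
pn.state.cache['options'] = {
    'accounts': sorted(df['account_name'].unique()),
    'sites': {
        account: sorted(df.loc[df['account_name'] == account, 'site'].unique())
        for account in df['account_name'].unique()
    },
    'wells': {
        site: sorted(df.loc[df['site'] == site, 'well_name'].unique())
        for site in df['site'].unique()
    },
}

With the cache populated, I then have something like: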

import param
import panel as pn

class ExampleDataframe(param.Parameterized):
    df = param.DataFrame(precedence=-1)

    account_selector = param.ObjectSelector(
        objects=pn.state.cache['options']['accounts'],
        default="Company A",
        label="Account"
    )

    site_selector = param.ObjectSelector(
        objects=pn.state.cache['options']['sites']["Company A"],
        default=pn.state.cache['options']['sites']["Company A"][0],
        label="Site"
    )

    well_selector = param.ObjectSelector(
        objects=pn.state.cache['options']['wells'][pn.state.cache['options']['sites']["Company A"][0]],
        default=pn.state.cache['options']['wells'][pn.state.cache['options']['sites']["Company A"][0]][0],
        label="Well"
    )

    def __init__(self, **params):
        super().__init__(**params)
        self.update_df()

    @param.depends('account_selector', watch=True)
    def update_site_selector(self):
        self.param.site_selector.objects = pn.state.cache['options']['sites'][self.account_selector]
        self.site_selector = self.param.site_selector.objects[0]

    @param.depends('site_selector', watch=True)
    def update_well_selector(self):
        self.param.well_selector.objects = pn.state.cache['options']['wells'][self.site_selector]
        self.well_selector = self.param.well_selector.objects[0]

    @param.depends('account_selector', 'site_selector', watch=True)
    def update_df(self):
        self.df = pn.state.cache['data'].query(
            f"account_name=='{self.account_selector}' & "
            f"site=='{self.site_selector}'"
        ).reset_index(drop=True)
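For completeness, I wire this up and serve it roughly like so (a sketch; Panel renders the Parameter objects as widgets):

example = ExampleDataframe()

pn.Column(
    example.param.account_selector,
    example.param.site_selector,
    example.param.well_selector,
).servable()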

I could then define some views, either in the class above or in another param class inheriting from it, where the resamples are done. Below I resample by day; ideally there would be another selector that lets the user change the resampling frequency (a sketch of that idea follows the class):

import hvplot.pandas  # noqa: F401 -- registers the .hvplot accessor

class OilViews(ExampleDataframe):
    @param.depends('df')
    def site_view(self):
        plot_df = (
            self.df.groupby('well_name')
            .resample('D', on='command_schedule_time')
            .agg({'command_id': 'nunique'})
        ).reset_index()

        return plot_df.hvplot.line(
            x='command_schedule_time',
            y='command_id',
            by='well_name',
            ylabel='Number of Commands'
        )
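And for the frequency selector mentioned above, something like this (a sketch; the label-to-offset-alias mapping is just an example):

class OilViews(ExampleDataframe):
    # Keys are the labels shown to the user, values are pandas offset aliases
    resample_freq = param.ObjectSelector(
        objects={'Hourly': 'H', 'Daily': 'D', 'Weekly': 'W'},
        default='D',
        label='Resample frequency'
    )

    @param.depends('df', 'resample_freq')
    def site_view(self):
        plot_df = (
            self.df.groupby('well_name')
            .resample(self.resample_freq, on='command_schedule_time')
            .agg({'command_id': 'nunique'})
        ).reset_index()

        return plot_df.hvplot.line(
            x='command_schedule_time',
            y='command_id',
            by='well_name',
            ylabel='Number of Commands'
        )

Using param.depends without watch=True here means the view is re-rendered whenever df or the frequency changes, rather than being triggered as a side effect.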

This seems to work OK, but it doesn’t seem very efficient. When I create new sessions on the dashboard and watch the memory, I can see it stepping up each time, and I can imagine this becoming an issue.

I was wondering if anyone has dealt with a large base dataset that requires a lot of selections and manipulations, and has found a low-footprint way to go about it?
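One direction I’ve been considering (just a sketch, reusing the hypothetical names from above) is to keep the big dataframe only in pn.state.cache and never store a filtered copy on the session object, so the filtered frame is a temporary local rather than session state:

class LeanViews(param.Parameterized):
    account_selector = param.ObjectSelector(label="Account")
    site_selector = param.ObjectSelector(label="Site")

    @param.depends('account_selector', 'site_selector')
    def site_view(self):
        # Filter the shared frame on demand; df is a local variable,
        # so it can be garbage-collected once the plot is built
        df = pn.state.cache['data'].query(
            f"account_name == '{self.account_selector}' & "
            f"site == '{self.site_selector}'"
        )
        plot_df = (
            df.groupby('well_name')
            .resample('D', on='command_schedule_time')
            .agg({'command_id': 'nunique'})
        ).reset_index()

        return plot_df.hvplot.line(
            x='command_schedule_time',
            y='command_id',
            by='well_name'
        )

The aggregated plot_df should at least be much smaller than the filtered frame, so only that small result persists per session.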

In the past I have experimented with out-of-core solutions like vaex, but I have not seen many examples of combining such methods with Panel. It may be that down the line a vaex/dash setup becomes more suitable for what I need, but I am very keen to get as far as I can with Panel!
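For reference, what appealed to me about vaex was roughly this pattern (a sketch from memory; the file and column names are made up and I may be misremembering groupby details):

import vaex

# The file is memory-mapped rather than loaded, so the full dataset
# is not duplicated per session
df = vaex.open('oil_field_data.hdf5')

# Selections are lazy expressions; nothing is materialised here
selection = df[(df.account_name == 'Company A') & (df.site == 'Site 1')]

# Only the (small) aggregated result ends up in memory
counts = selection.groupby(by='well_name', agg='count')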

Any advice anyone could offer would be very much appreciated; I may even be going about the above example in entirely the wrong way! Thanks!

EDIT: I should also add that with a setup like the above, one problem I see is that just switching the account selector back and forth to view the different data makes the memory grow and grow as update_df is called. Is the previous data not being released?
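For what it’s worth, this is roughly how I’ve been watching the memory (a sketch using psutil, including a forced collection to rule out simple garbage-collector lag):

import gc
import os

import psutil

def log_rss(label):
    """Print the resident set size of the current process in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print(f'{label}: {rss_mb:.1f} MB')

log_rss('before switching account')
# ... toggle the account selector in the UI ...
gc.collect()  # rule out uncollected reference cycles
log_rss('after switching + gc.collect()')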