Dask + HoloViews + Datashader

Hello everyone! My question is related to performance. I have a DataFrame with 82 million rows and run the following code:

def get_path(magnitude, month, day, interval, utc_offset):
    df_query = dask_df.loc[
        (dask_df.magnitude >= magnitude)
        & (dask_df.visible_flag == 1)
        & (dask_df.utc_offset == utc_offset)
        & (dask_df.month == month)
        & (dask_df.day == day)
        & (dask_df.intervals == interval)
    ]
    # persist() returns a new collection; the result must be reassigned
    df_query = df_query.persist()
    paths = [g[['longitude', 'latitude', 'magnitude']]
             for _, g in df_query.groupby(['unit_id', 'track', 'road'])
             if len(g) > 2]
    path = hv.Path(paths, vdims='magnitude')
    return path


dmap = hv.DynamicMap(get_path, kdims=['Magnitude', 'Month', 'Day', 'Interval', 'TZ'])
dmap = dmap.redim.range(Magnitude=(1, 20), Month=(6, 7), Day=(22, 30), Interval=(1, 4))
dmap = dmap.redim.values(TZ=utc_offset)

rastered = spread(rasterize(dmap, aggregator='mean', x_sampling=10, precompute=True), how='source', px=4)

Then I display the result. It looks as it should, but it works very slowly. I realized that the problem is in the rasterization step. The documentation recommends using a Dask DataFrame, but I can't switch to Dask. How can I adapt my code to improve performance?

Maybe you can persist the DataFrame once outside get_path if it fits in memory, or at least not call persist inside the callback?

Does Dask give a performance boost in my example? I tried running this code on a Dask DataFrame and did not notice any improvement. I forgot to mention that my main problem appears when I use the zoom.

Dask is a tool for distributing computation conveniently, letting you make use of independent computing resources (cores or CPUs and associated memory) to get the results faster and to work with datasets too large for any one machine. Using Dask speeds things up only to the extent that you can supply such additional computational resources, and if you’re running on a single core on problems that fit into memory, Dask won’t speed anything up.

For your example of 82 million rows, the actual rendering should be quick with or without Dask, which you can test by running spread(rasterize(get_path(...), aggregator='mean', x_sampling=10, precompute=True), how='source', px=4) (where ... should specify appropriate arguments to select all the data). Assuming that runs quickly enough, the slow part would be the fact that get_path is re-executed on every interactive update. The speed of that part is up to you, since that's your code!
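One way to see which stage inside get_path dominates is to time the selection and the per-group path construction separately, before blaming the rasterization. A rough sketch with synthetic data standing in for the real frame (column names and sizes are made up):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # toy stand-in for the 82-million-row frame
df = pd.DataFrame({
    "magnitude": rng.integers(1, 20, n),
    "longitude": rng.random(n),
    "latitude": rng.random(n),
    "unit_id": rng.integers(0, 50, n),
})

# Time the row selection and the groupby path-building separately.
t0 = time.perf_counter()
df_query = df[df.magnitude >= 5]
t1 = time.perf_counter()
paths = [g[["longitude", "latitude", "magnitude"]].to_numpy()
         for _, g in df_query.groupby("unit_id") if len(g) > 2]
t2 = time.perf_counter()
print(f"filter: {t1 - t0:.3f}s  groupby: {t2 - t1:.3f}s  groups: {len(paths)}")
```

On large data the Python-level iteration over groupby results is usually the dominant cost, which no amount of Datashader tuning will fix.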

Thank you very much for the answer and for your work!
Can I ask another question that is relevant for me?
I have implemented the Explorer class as specified in Dashboard — Examples 0.1.0 documentation, and I noticed that DynamicMap does not cache the results. That is, when I change the value of the sliders, the viewable method runs every time and recalculates the values. This does not satisfy me from a performance point of view. How can I make DynamicMap cache the results, that is, remember the parameter values?

Hi @padu

Could you provide a minimal, reproducible code example including the data, and if possible images or videos showing the issues? It is very hard to identify the exact issues and help solve them without those. Thanks.

Yes, of course:

import time

import holoviews as hv
import panel as pn
import param as pm
from holoviews.operation.datashader import rasterize, spread


class Explorer(pm.Parameterized):
    cm            = pm.Selector(cmaps, default=cmaps['green-yellow-red'])
    spreading     = pm.Integer(2, bounds=(1, 5))
    magnitude     = pm.Integer(2, bounds=(1, 50))
    month         = pm.Integer(6, bounds=(6, 7))
    day           = pm.Integer(22, bounds=(22, 30))
    tz            = pm.Selector(utc_offset)
    interval      = pm.Integer(1, bounds=(1, 4))
    basemap       = pm.Selector(bases)
    
    @pm.depends('basemap')
    def tiles(self):
        return self.basemap
    
    @pm.depends('magnitude', 'month', 'day', 'interval', 'tz')
    def _get_path(self):
        start = time.time()
        df_query = df.loc[
            (df.magnitude >= self.magnitude)
            & (df.visible_flag == 1)
            & (df.utc_offset == self.tz)
            & (df.month == self.month)
            & (df.day == self.day)
            & (df.intervals == self.interval)
        ]
        paths = [g[['longitude', 'latitude', 'magnitude']].to_numpy()
                 for _, g in df_query.groupby(['unit_id', 'track', 'road'])
                 if len(g) > 2]
        path = hv.Path(paths, vdims='magnitude')
        print(f'time is:{time.time()-start}')
        return path

    def viewable(self, **kwargs):
        rasterized = rasterize(hv.DynamicMap(self._get_path), aggregator='mean', precompute=True)
        spreaded = spread(rasterized, px=self.param.spreading, how="source").apply.opts(
            colorbar=True, cmap=self.param.cm, tools=['hover'], width=1100, height=600)
        
        return hv.DynamicMap(self.tiles) * spreaded
    
explorer = Explorer(name="")


panel = pn.Row(pn.Column(pn.Param(explorer.param, expand_button=False)), explorer.viewable())
panel.servable()

For example, every time I change the interval slider the result is recalculated, even though the DynamicMap documentation says that it caches based on the parameters the called function depends on, which in my case is the _get_path method.


Without the data or the complete code file, that is not reproducible; a reproducible example is one that anyone can run. In any case, when you specify pm.depends, it is a declaration that the indicated code should be re-run by DynamicMap any time one of the depended-on parameters changes. So I'm not sure what behavior you're expecting here.
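If the goal is to avoid recomputing for parameter combinations that have already been seen, one option (a sketch, not built-in behavior; DynamicMap caches frames per key-dimension value, but a callback driven by pm.depends is re-run on every dependency change) is to memoize the expensive query yourself, e.g. with functools.lru_cache keyed on hashable scalar arguments:

```python
import functools
import pandas as pd

# Toy stand-in for the real DataFrame.
df = pd.DataFrame({
    "magnitude": [1, 5, 9],
    "month": [6, 6, 7],
})

calls = {"n": 0}  # counts how often the body actually runs

@functools.lru_cache(maxsize=32)
def query(magnitude, month):
    # Scalar arguments are hashable, so repeated slider values
    # hit the cache instead of re-filtering the frame.
    calls["n"] += 1
    return df[(df.magnitude >= magnitude) & (df.month == month)]

query(5, 6)
query(5, 6)  # cache hit: the body runs only once
```

Inside the Explorer class, _get_path could delegate to such a cached helper, passing self.magnitude, self.month, etc. as plain arguments.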
