Dask + HoloViews + Datashader

Hello everyone! My question is related to performance. I have a DataFrame with 82 million rows and run the following code:

def get_path(magnitude, month, day, interval, utc_offset):
    df_query = dask_df.loc[
        (dask_df.magnitude >= magnitude)
        & (dask_df.visible_flag == 1)
        & (dask_df.utc_offset == utc_offset)
        & (dask_df.month == month)
        & (dask_df.day == day)
        & (dask_df.intervals == interval)
    ]
    # persist() returns a new collection; the result must be reassigned
    df_query = df_query.persist()
    paths = [g[['longitude', 'latitude', 'magnitude']]
             for _, g in df_query.groupby(['unit_id', 'track', 'road'])
             if len(g) > 2]
    path = hv.Path(paths, vdims='magnitude')
    return path


dmap = hv.DynamicMap(get_path, kdims=['Magnitude', 'Month', 'Day', 'Interval', 'TZ'])
dmap = dmap.redim.range(Magnitude=(1, 20), Month=(6, 7), Day=(22, 30), Interval=(1, 4))
dmap = dmap.redim.values(TZ=utc_offset)

rastered = spread(rasterize(dmap, aggregator='mean', x_sampling=10, precompute=True), how='source', px=4)

Then I display the result. It looks as it should, but it works very slowly. I realized that the problem is in the rasterization step. The documentation recommends using a Dask DataFrame, but I can't switch to Dask. How can I adapt my code to improve performance?

Maybe you can persist the DataFrame once outside get_path if it fits in memory, or at least not call persist inside the callback?

Does Dask give a performance boost in my example? I tried running this code on a Dask DataFrame and did not notice any improvement. I forgot to mention that my main problem appears when I use the zoom.

Dask is a tool for distributing computation conveniently, letting you make use of independent computing resources (cores or CPUs and associated memory) to get the results faster and to work with datasets too large for any one machine. Using Dask speeds things up only to the extent that you can supply such additional computational resources, and if you’re running on a single core on problems that fit into memory, Dask won’t speed anything up.

For your example of 82 million rows, the actual rendering should be quick with or without Dask, which you can test by running spread(rasterize(get_path(...), aggregator='mean', x_sampling=10, precompute=True), how='source', px=4) (where ... should specify appropriate arguments to select all the data). Assuming that runs quickly enough, the slow part would be the fact that get_path is re-executed on every interactive update. The speed of that part is up to you, since that's your code!
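One way to see which stage inside get_path dominates is to time the selection and the per-group path construction separately, before blaming the rasterization. A rough sketch with synthetic data standing in for the real frame (column names and sizes are made up):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # toy stand-in for the 82-million-row frame
df = pd.DataFrame({
    "magnitude": rng.integers(1, 20, n),
    "longitude": rng.random(n),
    "latitude": rng.random(n),
    "unit_id": rng.integers(0, 50, n),
})

# Time the row selection and the groupby path-building separately.
t0 = time.perf_counter()
df_query = df[df.magnitude >= 5]
t1 = time.perf_counter()
paths = [g[["longitude", "latitude", "magnitude"]].to_numpy()
         for _, g in df_query.groupby("unit_id") if len(g) > 2]
t2 = time.perf_counter()
print(f"filter: {t1 - t0:.3f}s  groupby: {t2 - t1:.3f}s  groups: {len(paths)}")
```

On large data the Python-level iteration over groupby results is usually the dominant cost, which no amount of Datashader tuning will fix.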

Thank you very much for the answer and for your work!
Can I ask another question that is relevant for me?
I have implemented the Explorer class as specified in Dashboard — Examples 0.1.0 documentation, and I noticed that DynamicMap does not cache the results. That is, when I change the value of the sliders, the viewable method runs every time and recalculates the values. This does not satisfy me from a performance point of view. How can I make DynamicMap cache the results, that is, remember the parameter values?

Hi @padu

Could you provide a minimal, reproducible code example including the data, and if possible images or videos showing the issues? It is very hard to identify the exact issues and help solve them without those. Thanks.

Yes, of course:

import time

import holoviews as hv
import panel as pn
import param as pm
from holoviews.operation.datashader import rasterize, spread


class Explorer(pm.Parameterized):
    cm            = pm.Selector(cmaps, default=cmaps['green-yellow-red'])
    spreading     = pm.Integer(2, bounds=(1, 5))
    magnitude     = pm.Integer(2, bounds=(1, 50))
    month         = pm.Integer(6, bounds=(6, 7))
    day           = pm.Integer(22, bounds=(22, 30))
    tz            = pm.Selector(utc_offset)
    interval      = pm.Integer(1, bounds=(1, 4))
    basemap       = pm.Selector(bases)
    
    @pm.depends('basemap')
    def tiles(self):
        return self.basemap
    
    @pm.depends('magnitude', 'month', 'day', 'interval', 'tz')
    def _get_path(self):
        start = time.time()
        df_query = df.loc[
            (df.magnitude >= self.magnitude)
            & (df.visible_flag == 1)
            & (df.utc_offset == self.tz)
            & (df.month == self.month)
            & (df.day == self.day)
            & (df.intervals == self.interval)
        ]
        paths = [g[['longitude', 'latitude', 'magnitude']].to_numpy()
                 for _, g in df_query.groupby(['unit_id', 'track', 'road'])
                 if len(g) > 2]
        path = hv.Path(paths, vdims='magnitude')
        print(f'time is:{time.time()-start}')
        return path

    def viewable(self, **kwargs):
        rasterized = rasterize(hv.DynamicMap(self._get_path), aggregator='mean', precompute=True)
        spreaded = spread(rasterized, px=self.param.spreading, how="source").apply.opts(
            colorbar=True, cmap=self.param.cm, tools=['hover'], width=1100, height=600)
        
        return hv.DynamicMap(self.tiles) * spreaded
    
explorer = Explorer(name="")


panel = pn.Row(pn.Column(pn.Param(explorer.param, expand_button=False)), explorer.viewable())
panel.servable()

For example, every time I change the interval slider the result is recalculated, even though the DynamicMap documentation says that it caches based on the parameters the called function depends on, which in my case is the _get_path method.


Without the data or the complete code file, that is not reproducible; a reproducible example is one that anyone can run. In any case, when you specify pm.depends, it is a declaration that the indicated code should be re-run by DynamicMap any time one of the depended-on parameters changes. So I'm not sure what behavior you're expecting here.
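If the goal is to avoid recomputing for parameter combinations that have already been seen, one option (a sketch, not built-in behavior; DynamicMap caches frames per key-dimension value, but a callback driven by pm.depends is re-run on every dependency change) is to memoize the expensive query yourself, e.g. with functools.lru_cache keyed on hashable scalar arguments:

```python
import functools
import pandas as pd

# Toy stand-in for the real DataFrame.
df = pd.DataFrame({
    "magnitude": [1, 5, 9],
    "month": [6, 6, 7],
})

calls = {"n": 0}  # counts how often the body actually runs

@functools.lru_cache(maxsize=32)
def query(magnitude, month):
    # Scalar arguments are hashable, so repeated slider values
    # hit the cache instead of re-filtering the frame.
    calls["n"] += 1
    return df[(df.magnitude >= magnitude) & (df.month == month)]

query(5, 6)
query(5, 6)  # cache hit: the body runs only once
```

Inside the Explorer class, _get_path could delegate to such a cached helper, passing self.magnitude, self.month, etc. as plain arguments.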
