How to create a progressive visualization dashboard involving datashader

Nithanaroy · April 18, 2020, 1:51am

I am able to process 5GB worth snappy compressed parquet local files and visualize them using holoviews, datashader, panel and bokeh on my mac. However, each user query takes a few seconds or even minutes sometimes and the visualization appears at once after all processing. I was wondering if there is any native support to progressively update the visualization from a dask dataframe as and when parts of the result are available.

import param, panel as pn
import dask.dataframe as dd
import holoviews as hv

class DataExplorer(param.Parameterized):
    xaxis_filter = param.Range(...)
    
    def load_data(self):
        return dd.read_parquet("*.parquet")

    def process_data(self, df):
        # Expensive compuation on a dask dataframe for example,
        return df[df.xaxis.between(*self.xaxis_filter)]

    @param.depends('xaxis_filter')
    def make_view(self):
        plot_df = self.process_data(self.load_data())
        points = hv.Points(plot_df) # too slow and pub-sub behaviour might help
        scatter = (dynspread(datashade(points))
        tooltip = rasterize(points, streams=[RangeXY]).apply(hv.QuadMesh)
        return (scatter * tooltip).opts(opts.QuadMesh(tools=['hover'], alpha=0, hover_alpha=0.2))

explorer = DataExplorer(name="")
dashboard = pn.Column(explorer.param, explorer.make_view)

I tried streaming visualizations but the size of Buffer started growing drastically making the implementation unusable on my mac. Any ideas or help is appreciated.

Thank you.

Nithanaroy · April 18, 2020, 12:45pm

In the original dask implementation, without any progressive visualization, the memory consumption was always way below the limits during computation and comes back to original after the plot is generated. It gives me hope that, this task is achievable on my mac with the above approach
Just do no know how to implement it with cool libraries like streamz and holoviews natively

philippjfr · April 18, 2020, 11:37pm

This is a great question which has been asked a few times. I’ll try to play around with some approaches to get this working. One quick comment for now:

    @param.depends('xaxis_filter')
    def make_view(self):
        plot_df = self.process_data(self.load_data())
        points = hv.Points(plot_df) # too slow and pub-sub behaviour might help
        scatter = (dynspread(datashade(points))
        tooltip = rasterize(points, streams=[RangeXY]).apply(hv.QuadMesh)
        return (scatter * tooltip).opts(opts.QuadMesh(tools=['hover'], alpha=0, hover_alpha=0.2))

You no longer need a QuadMesh to get a tooltip since bokeh has supported Image tooltips for quite a while now, so I’d just go with:

points = hv.Points(plot_df)
rasterized = rasterize(points)
return dynspread(shade(rasterized)) * rasterized.opts(tools=['hover'], alpha=0)

philippjfr · April 18, 2020, 11:58pm

Some more tips on how to structure this. If you have sufficient memory I’d suggest persisting the 5GBs into memory, alternatively I’d at least persist the filtered version to memory. Here is how I would write your app:

import param, panel as pn
import dask.dataframe as dd
import holoviews as hv

class DataExplorer(param.Parameterized):
    
    xaxis_filter = param.Range()
    
    def __init__(self, filename, **params)
        self.data = dd.read_parquet(filename).persist()
        super(DataExplorer, self).__init__(**params)
        x_range = self.data.xaxis.min(), self.data.xaxis.max()
        self.param.xaxis_filter.bounds = x_range
        self.xaxis_filter = x_range

    def view(self):
        points = hv.Points(self.data).apply.select(xaxis=self.param.xaxis_filter)
        rasterized = rasterize(points)
        return dynspread(shade(rasterized)) * rasterized.opts(alpha=0)

explorer = DataExplorer(name="")
dashboard = pn.Column(explorer.param, explorer.view())

This should be significantly more efficient 1) it persists the data into memory, and 2) it doesn’t have to load and select on the data twice. I’m not sure how familliar you are with dask but dask is lazy and as such as you had written it previously load_data did not actually load any data and process_data did not actually process any data. Therefore when it came to rendering it would have to load the data and then apply the selection twice on initialization (once for the rasterized plot and once for the datashaded plot). And then every time you zoomed on the plot it would again load all 5 GB into memory. This is probably why it took minutes to render.

Nithanaroy · April 23, 2020, 11:02pm

Thanks for the tips. Without hv.QuadMesh how to create hover info as shown in Working with large data using datashader — HoloViews v1.18.1

And what do you mean by # too slow and pub-sub behaviour might help?

philippjfr · April 23, 2020, 11:14pm

Thanks for the tips. Without hv.QuadMesh how to create hover info as shown in Working with large data using datashader — HoloViews v1.18.1

Not entirely sure what you mean, adding .opts(tools=[‘hover’], alpha=0) will give you a hover tool.

That was a comment in your code, not mine.

Nithanaroy · April 23, 2020, 11:22pm

This is what I was trying to achieve HoverInfoWithDefinedBox

But without hv.QuadMesh, and only .opts(tools=[‘hover’], alpha=0), a minute box (may be per pixel) is shown. The tooltip shows the count of this minute box, where most are zero in my case.

Oh , added that pub-sub comment in the original question for explanation purposes and forgot about it. I was thinking of streamz when writing this. But couldn’t figure out the right implementation with Dask

philippjfr · April 23, 2020, 11:33pm

I see, in that case do go back to using a QuadMesh:

scatter = dynspread(datashade(points))
tooltip = rasterize(points, streams=[RangeXY]).apply(hv.QuadMesh)
return (scatter * tooltip).opts(opts.QuadMesh(tools=['hover'], alpha=0, hover_alpha=0.2))

Nithanaroy · April 24, 2020, 5:29am

Thanks a lot for this idea as well. I was relying on dask’s experimental cache() functionality but didn’t realize that we load the data 2nd time - again for hover info. It is a lot faster now and smartly uses memory once I added persist() in an instance method.

Tried a few more things for my original question - streaming progressive visualization but got some weird undeclared variable use errors from streamz. Shared the toy code via Colab here. Google colab seems to have problems with running distributed Dask workloads. The notebook runs fine on local machine if you are interested to run it.

Nithanaroy · April 28, 2020, 7:01am

Any more suggestions @philippjfr ?

Nithanaroy · May 16, 2020, 8:48am

What am I missing in this way to solve this problem?