Question about the visualization using datashader and dask

arthurlalayan · August 21, 2023, 1:41pm

I’m trying to visualize my processed output, which was computed in Dask using the “persist” function. The result of my calculation is a DataArray. However, when I apply the “shade” function from the datashader library after the computation, the dashboard shows that Dask doesn’t perform any computation; instead, all the processing appears to take place on my own machine. Is this a problem, or is datashader intended to work this way?

I’d be happy if someone could help me to understand the issue.

ianthomas23 · September 4, 2023, 11:29am

Yes, this is intended. Any Canvas function such as points() uses whatever dask partitioning the supplied DataFrame (or whatever) has. Each partition is processed separately and creates its own numpy array of shape (height, width) to return. These would be, for example, the counts for each partition. The individual-partition numpy arrays are combined using a tree reduction and wrapped in an xarray.DataArray to return to the user. This DataArray is not daskified, so no future operations on it are either.
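To make the flow above concrete, here is a minimal NumPy-only sketch. It is not datashader's actual implementation: the helper names aggregate_partition and tree_reduce are hypothetical, random points stand in for the dask partitions, and a plain count is assumed as the reduction.

```python
import numpy as np

def aggregate_partition(xs, ys, width, height, x_range, y_range):
    """Count how many points of one partition fall into each pixel."""
    counts, _, _ = np.histogram2d(
        ys, xs, bins=(height, width), range=(y_range, x_range)
    )
    return counts.astype(np.uint32)  # shape (height, width)

def tree_reduce(arrays):
    """Sum the per-partition arrays pairwise until a single array remains."""
    arrays = list(arrays)
    while len(arrays) > 1:
        paired = [arrays[i] + arrays[i + 1] for i in range(0, len(arrays) - 1, 2)]
        if len(arrays) % 2:  # carry an unpaired array up to the next level
            paired.append(arrays[-1])
        arrays = paired
    return arrays[0]

# Hypothetical data: 5 "partitions" of 1000 points each in the unit square.
rng = np.random.default_rng(0)
width, height = 8, 4
partitions = [(rng.random(1000), rng.random(1000)) for _ in range(5)]

per_partition = [
    aggregate_partition(xs, ys, width, height, (0, 1), (0, 1))
    for xs, ys in partitions
]
total = tree_reduce(per_partition)  # one in-memory (height, width) array
```

The combined result is a single small in-memory array, which is why later operations on it run locally rather than on the cluster.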

It could be daskified, but it is not obvious that this is worth the effort or indeed how it would be achieved.

Is it worth the effort?
Take the US Census data, which has 300 million rows of data split into 35 dask partitions, so about 8.6 million rows per partition. The returned DataArray has 850 x 500 pixels, so about 0.4 million pixels. The cost of processing this for an operation like shade is usually negligible compared to the calculations performed for Canvas.points on a single dask partition of the census data, so there is no benefit in repartitioning it to split the calculations up.
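The arithmetic behind this comparison can be checked in a few lines. This is only a rough sketch: it compares element counts, which is an assumed proxy for cost and not a measurement of actual runtime.

```python
rows = 300_000_000          # rows in the census dataset
n_partitions = 35           # dask partitions of the source data
width, height = 850, 500    # size of the returned DataArray in pixels

rows_per_partition = rows / n_partitions   # about 8.57 million
pixels = width * height                    # 425,000
ratio = rows_per_partition / pixels        # about 20

print(f"{rows_per_partition:,.0f} rows per partition")
print(f"{pixels:,} pixels in the output")
print(f"one partition holds about {ratio:.0f}x more rows than the output has pixels")
```

So even a single source partition involves roughly 20 times as many elements as the whole aggregated image.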

How would the partitioning be achieved?
The source dataset is partitioned by rows, i.e. individual people in the census. The returned DataArray of counts or whatever cannot be partitioned this way because it only has spatial extents of rows and columns, so it cannot use the same partitioning as the source dataset. One could split it into the same number of partitions as the source dataset, but then individual partitions would be too small to be efficiently processed. We could support dask partitioning here, but we would have to force the user to decide how to partition between the Canvas.points and shade calls, and I do not imagine that most end users are interested in this responsibility, particularly when it will usually have almost no bearing on the overall performance.
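As a rough illustration of why such partitions would be small, the following sketch splits an 850 x 500 array into 35 horizontal bands with NumPy. Splitting along rows is just one assumed choice of spatial partitioning, the zero-filled array is a stand-in for real aggregated counts, and none of this is datashader API.

```python
import numpy as np

height, width, n_partitions = 500, 850, 35

# Stand-in for the aggregated counts returned by Canvas.points.
agg = np.zeros((height, width), dtype=np.uint32)

# Split spatially into horizontal bands, one per source partition.
bands = np.array_split(agg, n_partitions, axis=0)
sizes = [band.size for band in bands]

print(len(bands), min(sizes), max(sizes))  # → 35 11900 12750
```

Each band holds only about 12,000 pixels, compared with millions of rows in each source partition, so per-task overhead would likely dominate any work done on it.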