I’m trying to visualize the output of a computation that I ran in Dask using the “persist” function. The result of the computation is a DataArray. However, when I then apply the “shade” function from the datashader library, the dashboard shows that Dask doesn’t perform any computation; all of the processing appears to happen on my local machine. Is this a problem, or is datashader intended to work this way?
I’d be happy if someone could help me to understand the issue.
Yes, this is intended. Any Canvas function such as points() uses whatever dask partitioning the supplied DataFrame (or other input) has. Each partition is processed separately and creates its own numpy array of shape (height, width) to return. These would be, for example, the counts of each partition. The individual-partition numpy arrays are combined using a tree reduction and wrapped in an xarray.DataArray to return to the user. This is not daskified, so no future operations on it are either.
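To illustrate the idea, here is a minimal sketch in plain numpy. It is not datashader's actual implementation: the helper names aggregate_partition and tree_reduce are hypothetical, the "partitions" are just random arrays, and the grid size is made up for the example. Each partition produces its own (height, width) array of counts, and the arrays are then summed pairwise until one remains.

```python
import numpy as np

def aggregate_partition(x, y, width, height, x_range, y_range):
    """Count the points of one partition on a (height, width) grid.

    Hypothetical stand-in for the per-partition work done by a Canvas function.
    """
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts.astype(np.uint32)

def tree_reduce(arrays):
    """Sum arrays pairwise, level by level, until a single array remains."""
    arrays = list(arrays)
    while len(arrays) > 1:
        paired = [a + b for a, b in zip(arrays[0::2], arrays[1::2])]
        if len(arrays) % 2:
            # An odd array out is carried up to the next level unchanged.
            paired.append(arrays[-1])
        arrays = paired
    return arrays[0]

rng = np.random.default_rng(0)
width, height = 85, 50  # illustrative grid size
# Five fake partitions of 1000 (x, y) points each, all inside the unit square.
partitions = [rng.random((1000, 2)) for _ in range(5)]

per_partition = [
    aggregate_partition(p[:, 0], p[:, 1], width, height, (0, 1), (0, 1))
    for p in partitions
]
total = tree_reduce(per_partition)
```

The result total is an ordinary in-memory numpy array, which is why anything applied to it afterwards runs locally rather than on the dask workers.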
It could be daskified, but it is not obvious that this is worth the effort or indeed how it would be achieved.
Is it worth the effort?
Take the US Census data, which has 300 million rows of data split into 35 dask partitions, so about 8.6 million rows per partition. The returned DataArray has 850 x 500 pixels, so about 0.4 million pixels. The cost of processing this for an operation like shade is usually negligible compared to the calculations performed by Canvas.points for a single dask partition of the census data, so there is no benefit in repartitioning it to split the calculations up.
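The arithmetic behind that comparison can be checked in a few lines. This only restates the figures quoted above; element counts are a rough proxy for cost, since the per-element work of the two operations differs.

```python
rows = 300_000_000        # rows in the census dataset
n_partitions = 35         # dask partitions of the source data

rows_per_partition = rows / n_partitions   # roughly 8.57 million
pixels = 850 * 500                         # 425,000 pixels in the output

# One source partition holds about 20 times as many rows as the
# whole aggregated DataArray has pixels.
ratio = rows_per_partition / pixels

print(f"{rows_per_partition:,.0f} rows per partition")
print(f"{pixels:,} pixels")
print(f"ratio: {ratio:.1f}")
```

So even the entire aggregated output is a small fraction of the size of one input partition.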
How would the partitioning be achieved?
The source dataset is partitioned by rows, i.e. individual people in the census. The returned DataArray of counts (or whatever was aggregated) cannot be partitioned this way, because it only has spatial extents of rows and columns. So it cannot use the same partitioning as the source dataset. One could split it into the same number of partitions as the source dataset, but then individual partitions would be too small to be processed efficiently. We could support dask partitioning here, but we would have to force the user to decide how to partition between the Canvas.points and shade calls, and I do not imagine that most end users are interested in this responsibility, particularly when it will usually have almost no bearing on the overall performance.