I have loved the HoloViz tools for years now.
As I increasingly have to deal with “BigData” (larger-than-memory data), please help me demystify one core thing about datashader and dask that I thought the two together were capable of:
Can datashader plot data from a huge (let’s say 500 GB) lazy dask array without me having to write logic for loading chunks and calling .compute() on them?
When I try to datashade a 5 GB lazy dask array, it takes much, much, much longer to plot than calling .compute() first and then datashading the result.
The data is stored in (multiple) HDF5 files, and the dask chunks are aligned with the HDF5 chunks.
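Roughly the kind of pipeline I mean, as a minimal sketch (the file name and the "data" dataset are just placeholders for my actual setup, and I’m illustrating the idea with datashader’s Canvas.raster on a 2D array):

```python
import dask.array as da
import datashader as ds
import h5py
import numpy as np
import xarray as xr

# Open one of the HDF5 files lazily and wrap its dataset in a dask array
# whose chunks match the on-disk HDF5 chunking.
f = h5py.File("measurement_000.h5", "r")
dset = f["data"]
lazy = da.from_array(dset, chunks=dset.chunks)

# Wrap it in xarray (with coordinates) so datashader can aggregate it.
ny, nx = lazy.shape
arr = xr.DataArray(
    lazy,
    dims=("y", "x"),
    coords={"y": np.arange(ny), "x": np.arange(nx)},
    name="data",
)

cvs = ds.Canvas(plot_width=800, plot_height=600)

# Lazy path: chunks are pulled from disk while aggregating.
agg_lazy = cvs.raster(arr)

# Eager path: load everything into memory first, then aggregate.
agg_eager = cvs.raster(arr.compute())
```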
Loading data from disk will always take a while; there’s nothing datashader can do about that (although there’s probably some work we could do to optimize it, e.g. by computing the plot ranges more efficiently). That said, if you do have the memory available, or a cluster to distribute to, you can still leverage dask by using .persist() instead of .compute(). That will load the data into memory but still allow datashader to parallelize over each chunk.
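A minimal sketch of what I mean, with a random placeholder array standing in for your HDF5-backed data:

```python
import dask.array as da
import datashader as ds
import numpy as np
import xarray as xr

# Placeholder for the HDF5-backed array; the chunking is what matters.
lazy = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
arr = xr.DataArray(
    lazy,
    dims=("y", "x"),
    coords={"y": np.arange(20_000), "x": np.arange(20_000)},
    name="data",
)

# .persist() loads the chunks into memory (locally, or across a cluster)
# but keeps the chunked dask structure, so datashader can still
# aggregate the chunks in parallel.
arr = arr.persist()

cvs = ds.Canvas(plot_width=800, plot_height=600)
agg = cvs.raster(arr)
```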
Thanks. I don’t use a distributed cluster, so I’m stuck on a local machine.
The question remains:
Why is it incomparably MUCH faster to use datashade(da.compute()) than just datashade(da) on the same amount of data, even when it easily fits into memory?
Are you using da.persist()? Otherwise you are still forcing it to load the data into memory (multiple times) during the datashading step, which is not a fair comparison.
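For a fair comparison, something along these lines (again with a random placeholder array standing in for your HDF5-backed one):

```python
import time

import dask.array as da
import datashader as ds
import numpy as np
import xarray as xr

# Placeholder data standing in for the HDF5-backed array.
lazy = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
arr = xr.DataArray(
    lazy,
    dims=("y", "x"),
    coords={"y": np.arange(20_000), "x": np.arange(20_000)},
    name="data",
)
cvs = ds.Canvas(plot_width=800, plot_height=600)

# Not a fair comparison: the aggregation has to materialize the lazy
# chunks itself, so the loading cost is paid inside the datashading call.
t0 = time.perf_counter()
cvs.raster(arr)
print("lazy dask array:", time.perf_counter() - t0)

# Fairer comparison: pay the loading cost once with .persist(), then let
# datashader aggregate the already-in-memory chunks in parallel.
arr_mem = arr.persist()
t0 = time.perf_counter()
cvs.raster(arr_mem)
print("persisted dask array:", time.perf_counter() - t0)
```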