How to plot a huge image that doesn't fit in memory?

Datashader seems to load the entire dataset before doing anything (and crashes), so I was wondering if it could just skip every 10th sample at the farthest zoom level (e.g. is there a decimate function for datashader outside of HoloViews?).

import numpy as np
import dask.array as da
import xarray as xr
from datashader import transfer_functions as tf

def sample(range_=(0.0, 2.4)):
    # ~730 GiB of random float32 data (280000 x 700000), dask-backed,
    # in 10000 x 10000 chunks; `range_` is unused in this minimal example.
    xs = da.arange(700000, dtype=np.float32)
    ys = da.arange(280000, dtype=np.float32)
    return xr.DataArray(
        da.random.random((len(ys), len(xs)), chunks=(10000, 10000)).astype(np.float32),
        coords=[('y', ys), ('x', xs)])

ds = sample(range_=(0, 5))
tf.shade(ds)  # shade the full-resolution array directly, with no Canvas aggregation

Then it fails:

distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker

KilledWorker: ("('random_sample-astype-', 13, 11)", <WorkerState 'tcp://ip', name: 0, memory: 0, processing: 601>)

Are you saying that even with distributed, operating on a cluster, the total memory available on the cluster isn’t sufficient for this image? Or that it’s not really distributed, and that the image is larger than this one machine’s memory? In either case, I’m not sure how to ensure that your DataArray is backed by Dask, but it seems like it should be feasible (if slow) to work out of core using such a Dask array as long as you don’t persist it. Probably needs some arguments to DataArray?
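One approach that might help (just a sketch, not verified at this scale): aggregate the dask-backed array down to the output resolution with datashader's Canvas.raster before shading, instead of passing the full-resolution DataArray straight to tf.shade. The canvas size below is arbitrary:

import datashader as dsh  # aliased to avoid clashing with the `ds` variable above
from datashader import transfer_functions as tf

arr = sample(range_=(0, 5))  # the dask-backed DataArray from the snippet above

# Regrid/downsample to the output resolution first, so only the small
# aggregated array needs to be materialized, then shade that.
cvs = dsh.Canvas(plot_width=800, plot_height=400)
agg = cvs.raster(arr)
img = tf.shade(agg)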


It’s about 720 GB total; it opens fine and runs for a while, then randomly crashes. The minimal example above should reproduce the issue.

I am also using:

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2, memory_limit="8GiB")  # memory_limit is per worker, so 32 GiB total
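For the "skip every 10th sample" idea from the original question, here is a rough, untested sketch of decimating the dask-backed DataArray before shading; the factor of 10 is only an example and would need to be chosen so the result actually fits in memory:

from datashader import transfer_functions as tf

arr = sample(range_=(0, 5))  # the dask-backed DataArray from the first snippet

# Keep every 10th sample along each axis; this stays lazy on the dask array.
decimated = arr[::10, ::10]

# Or average 10x10 blocks instead of dropping samples (also lazy):
coarse = arr.coarsen(x=10, y=10, boundary='trim').mean()

# Only shade once the reduced array is small enough to hold in memory;
# a factor of 10 per axis still leaves a multi-GB array here.
tf.shade(coarse)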

Similar: