Memory issue rasterizing dask-based polygons to a high-resolution canvas

I’m prototyping zonal statistics on high-resolution (30 m) global rasters using xarray-spatial, with datashader rasterizing my polygons/zones as an intermediate step.
I’m passing a spatialpandas dask GeoDataFrame to canvas.polygons to run the rasterization in parallel on a cluster. Testing this with 200K square polygons on a 1.6-billion-pixel canvas runs into memory issues until the workers die with the warning "Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS". Yet the same rasterization with a regular (non-dask) GeoDataFrame succeeds within a couple of seconds on a laptop with far less CPU and memory than the dask cluster.
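For context, the call looks roughly like this (a minimal sketch; ddf is the dask-backed spatialpandas GeoDataFrame, and the canvas dimensions, extent, and column names are illustrative):

import datashader as ds
import spatialpandas.dask  # registers dask-backed GeoDataFrame support

# ddf: dask-backed spatialpandas GeoDataFrame with ~200K polygons.
# 40,000 x 40,000 pixels = 1.6e9 pixels, matching the canvas described above.
cvs = ds.Canvas(plot_width=40_000, plot_height=40_000,
                x_range=(-180, 180), y_range=(-90, 90))
zones = cvs.polygons(ddf, geometry="geometry", agg=ds.max("zone"))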

I’m using the following versions:
datashader 0.14.1
spatialpandas 0.4.4
dask 2022.7.0

Any insight into resolving this would be much appreciated!

Best,
Solomon

Hi Solomon,

Please can you answer a few questions:

  1. How much RAM does each cluster node have?
  2. How many dask workers are you using?
  3. Are you running a single dask worker or multiple dask workers on each cluster node?

Thanks,
ian

Hi Ian,
Thanks for getting back to me. The cluster is running on ECS using dask-cloudprovider to configure and launch it.

  1. RAM isn’t set explicitly, since this is a Fargate-managed ECS cluster (config below), but my understanding is that this is equivalent to the default worker memory of 16 GB.
  2. I’ve tried 24 and 48 workers with the same results.
  3. A single worker per node (as in #1, not set explicitly).

Here is how I’m instantiating the cluster:

from dask_cloudprovider.aws import ECSCluster

cluster = ECSCluster(
    fargate_scheduler=True,
    fargate_workers=True,
    fargate_spot=False,
    n_workers=48,
    image="daskdev/dask:2022.7.0-py3.8",
    task_role_arn="***",
    environment={
        "EXTRA_PIP_PACKAGES": "datashader==0.14.1 spatialpandas==0.4.4 numpy==1.21.6"
    },
    scheduler_timeout="30 minutes",
    execution_role_arn="***",
    find_address_timeout=360,
)

Let me know if I need to provide more info.

Thanks again,
Solomon

Hi Solomon,

I am not surprised that you are running out of RAM. The way that dask parallelisation works in datashader is to partition the GeoDataFrame but not the Canvas, so each dask worker has a subset of the GeoDataFrame and a complete canvas. The canvas probably uses 4 bytes per pixel, so 1.6e9 pixels requires 1.6e9 × 4 bytes ≈ 6.4 GB. When each dask worker has rasterised its partition of the GeoDataFrame to its local canvas, the canvases are combined into the single one that is returned to the user. Any node that is combining canvases therefore needs storage for two of them, i.e. roughly 13 GB. Add to this the RAM required for the partition of the GeoDataFrame and whatever workspace the calculations need, and it is not surprising that 16 GB of RAM is not enough. Even if you had enough RAM on each node, it would probably still be pretty inefficient, as multiple transfers of 6.4 GB arrays between workers isn’t ideal.
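In numbers, the per-worker cost looks like this (assuming a 4-byte aggregate dtype such as float32):

pixels = 1.6e9
canvas_gb = pixels * 4 / 1e9      # ~6.4 GB per full-resolution canvas
combine_gb = 2 * canvas_gb        # ~12.8 GB held while merging two canvases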

If you do all the processing on a single node then you only need one canvas, not two, i.e. about 6.4 GB less RAM, but of course you will then need enough RAM for the entire GeoDataFrame.

The design use case for datashader is to produce images for visualisation, so 1e6 pixels (e.g. 1000x1000) is expected but 1e9 is not. You are free to use it with 1e9 pixels, but the parallelisation is not working in your favour. What you really want is to partition the Canvas but not the GeoDataFrame. This could be done in datashader, but it would be a fundamental architectural change to reverse the direction in which parallelisation occurs, and I am not aware of anyone actively working on this.

It might be interesting to halve your canvas size in both dimensions so that it only needs about 1.6 GB, and see if that works OK. If it does, you could do your own manual parallelisation of the canvas by processing one quarter of the overall canvas at a time against the whole GeoDataFrame, then combining the 4 sub-canvases afterwards, as in the sketch below.
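Something along these lines (a minimal sketch; the extent, dimensions, and aggregation column are placeholders, and gdf is the full non-dask spatialpandas GeoDataFrame):

import datashader as ds
import xarray as xr

x0, x1, y0, y1 = -180, 180, -90, 90   # full raster extent (illustrative)
W, H = 40_000, 40_000                 # full canvas: 1.6e9 pixels
xmid, ymid = (x0 + x1) / 2, (y0 + y1) / 2

quadrants = [
    ((x0, xmid), (y0, ymid)), ((xmid, x1), (y0, ymid)),   # bottom row
    ((x0, xmid), (ymid, y1)), ((xmid, x1), (ymid, y1)),   # top row
]

tiles = []
for x_range, y_range in quadrants:
    # Each quarter canvas only needs ~1.6 GB instead of ~6.4 GB
    cvs = ds.Canvas(plot_width=W // 2, plot_height=H // 2,
                    x_range=x_range, y_range=y_range)
    tiles.append(cvs.polygons(gdf, geometry="geometry", agg=ds.max("zone")))

# Stitch the four quarter canvases back into a single aggregate
bottom = xr.concat(tiles[:2], dim="x")
top = xr.concat(tiles[2:], dim="x")
full = xr.concat([bottom, top], dim="y")

The four calls are independent, so they could also be farmed out to separate processes or nodes if a single machine is too slow.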

Hi Ian,
The strategy of partitioning the canvas and parallelizing the rasterization manually, as you suggested, has indeed resolved my issue. Many thanks for the prompt and detailed help!

Best,
Solomon
