Fast visualization of large datasets (without saving to disk) using hvplot, kerchunk, and datashader

With a little preprocessing with Kerchunk, you can visualize a large dataset on the fly, without first downloading it to disk, by passing an fsspec URL and streaming the data.

Kerchunk is a wrapper that makes a netCDF file (and other older file formats) cloud-optimized; i.e. it makes loading just the chunks of data you actually want fast.
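Under the hood, a Kerchunk reference file is just JSON mapping each Zarr chunk key to a byte range in the original file, so a reader can issue HTTP range requests for only the chunks it needs. A minimal sketch of the idea (all names, URLs, and offsets here are made up for illustration):

```python
import json

# A toy "reference file": each chunk key maps to [url, byte_offset, byte_length],
# so a reader can fetch just those bytes with an HTTP range request instead of
# downloading the whole netCDF file. Values are hypothetical.
references = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # chunk (0, 0, 0) of a hypothetical variable "t2m"
        "t2m/0.0.0": ["https://example.com/1979.nc", 8192, 1000000],
    },
}

def chunk_byte_range(refs: dict, key: str) -> tuple:
    """Resolve a chunk key to (url, offset, length), as a reference reader would."""
    url, offset, length = refs["refs"][key]
    return url, offset, length

print(json.dumps(references["refs"]))
print(chunk_byte_range(references, "t2m/0.0.0"))
```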

This video demos visualizing streaming data from Kerchunk vs THREDDS.

Key points:

  1. Kerchunk + THREDDS streaming is bounded by internet speed (here, conference wifi at 0.5 Mbps)
  2. Once the entire slice of the dataset has been downloaded, Kerchunk is significantly faster than THREDDS at loading the chunks within the frame
  3. All of this is done without saving the data to local disk (besides the reference files)


Next steps:

  1. Combine with stac + panel to explore an entire catalog
  2. Compare against a locally downloaded file converted to zarr + dask

Proof of Concept (messy) Code:

import os
import fsspec
import xarray as xr
import apache_beam as beam
import hvplot.xarray
import time
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithKerchunk,
    CombineReferences,
    WriteCombinedReference,
)

target_root = "references"
store_name = "Pangeo_Forge"
full_path = os.path.join(target_root, store_name, "reference.json")

years = list(range(1979, 1980))
time_dim = ConcatDim("time", keys=years)
def format_function(time):
    # Base URL elided in the original post; returns the source file for a given year
    return f"{time}.nc"
pattern = FilePattern(format_function, time_dim, file_type="netcdf4")
pattern = pattern.prune()
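For context on the `prune()` call: it trims the pattern down to its first couple of entries (two, if I remember the default right) so you can test a recipe end-to-end quickly. A pure-Python sketch of the idea, reusing `format_function` (the longer `years` list here is just for illustration):

```python
def format_function(time):
    return f"{time}.nc"

years = list(range(1979, 1990))
urls = [format_function(y) for y in years]

def prune(items, nkeep=2):
    # Mimic of FilePattern.prune: keep only the first `nkeep` entries
    return items[:nkeep]

print(prune(urls))  # → ['1979.nc', '1980.nc']
```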

transforms = (
    # Create a beam PCollection from our input file pattern
    beam.Create(pattern.items())
    # Open each input URL with fsspec
    | OpenURLWithFSSpec()
    # Open with Kerchunk and create references for each file
    | OpenWithKerchunk(file_type=pattern.file_type)
    # Use Kerchunk's `MultiZarrToZarr` functionality to combine the reference files into a single
    # reference file. *Note*: Setting the correct concat_dims and identical_dims is important.
    | CombineReferences(
        concat_dims=["time"],
        identical_dims=["lat", "lon", "crs"],
    )
    # Write the combined Kerchunk reference to file.
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

with beam.Pipeline() as p:
    p | transforms

mapper = fsspec.get_mapper("reference://", fo=full_path)
ds = xr.open_dataset(
    mapper, engine="zarr", decode_coords="all", backend_kwargs={"consolidated": False}
)

t = time.perf_counter()
display(ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"KERCHUNK INITIAL Elapsed time: {elapsed:0.4f} seconds")

def url_gen(year):
    # THREDDS OPeNDAP URL template elided in the original post
    return (
        ...
    )

urls_list = [url_gen(year) for year in years]
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")
t = time.perf_counter()
display(netcdf_ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"THREDDS INITIAL Elapsed time: {elapsed:0.4f} seconds")


One issue / question I have: does hvplot + datashader have to pre-load the entire dataset initially, regardless of xlim/ylim? It takes 17 seconds before showing anything; after that it's quick to zoom in and out. Also, shifting the time slider is a tad slow (on conference wifi).


Subsetting with sel before plotting: 3 seconds

t = time.perf_counter()
display(ds.sel(lat=slice(30, 26)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"Elapsed time: {elapsed:0.4f} seconds")

vs setting ylim on the full dataset: 16 seconds

t = time.perf_counter()
display(ds.hvplot("lon", "lat", rasterize=True, ylim=(26, 30)))
elapsed = time.perf_counter() - t
print(f"Elapsed time: {elapsed:0.4f} seconds")
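My guess at the difference: sel subsets the data before anything is rasterized, while ylim only sets the visible range, so the initial rasterization can still touch the full extent. A toy numpy illustration (not the actual hvplot code path) of how much less data the pre-subset version needs to touch, assuming a hypothetical 0.25-degree global grid:

```python
import numpy as np

# Hypothetical 0.25-degree global grid (721 lats x 1440 lons), like many
# reanalysis products; the real dataset's shape may differ.
lat = np.linspace(90, -90, 721)
data = np.random.rand(721, 1440)

# Analogous to ds.sel(lat=slice(30, 26)): slice first, then aggregate
mask = (lat <= 30) & (lat >= 26)
subset = data[mask]

print(subset.shape, data.shape)
print(f"fraction of rows touched: {subset.shape[0] / data.shape[0]:.3f}")
```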

This is now live: