With a little preprocessing with Kerchunk, you can visualize a large dataset on the fly, without first downloading it to disk, by passing an fsspec URL and streaming the data.
Kerchunk makes a netCDF file (and other older file formats) cloud optimized by building a small reference index, so loading only the chunks of data you actually want is fast.
This video demos visualizing data streamed with Kerchunk vs. THREDDS.
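For intuition, generating a reference for a single file with Kerchunk alone looks roughly like this (a minimal sketch, not part of the demo; the URL is the same gridMET file used below and the output filename is a placeholder). The PoC code further down does this per file and combines the results with pangeo-forge-recipes:
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "http://www.northwestknowledge.net/metdata/data/bi_1979.nc"
# Scan the HDF5 chunk layout once and build a dict of byte-range references.
with fsspec.open(url) as f:
    refs = SingleHdf5ToZarr(f, url).translate()
# The reference file is small; it only points at byte ranges in the remote netCDF.
with open("bi_1979_reference.json", "w") as out:
    json.dump(refs, out)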
Key points:
- Initial streaming with both Kerchunk and THREDDS is bounded by internet speed (here, conference wifi at ~0.5 Mbps)
- Once the requested slice of the dataset has been downloaded, Kerchunk is significantly faster than THREDDS at loading the chunks within the frame (see the timing sketch appended after the PoC code)
- All of this is done without saving the data to local disk (besides the reference files)
Ideas:
- Combine with STAC + Panel to explore an entire catalog
- Compare against a locally downloaded file converted to Zarr + Dask (see the sketch after this list)
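A rough sketch of the second idea (hedged; the local filenames and chunk size are placeholders): download one of the gridMET files, convert it to Zarr, and reopen it with Dask chunks to get a local baseline to benchmark against.
import fsspec
import xarray as xr

url = "http://www.northwestknowledge.net/metdata/data/bi_1979.nc"
# Download the file to local disk (this is the step the Kerchunk approach avoids).
with fsspec.open(url) as remote, open("bi_1979.nc", "wb") as local:
    local.write(remote.read())

# Convert to Zarr, then reopen lazily with Dask chunks for the comparison.
xr.open_dataset("bi_1979.nc").to_zarr("bi_1979.zarr", mode="w")
local_ds = xr.open_zarr("bi_1979.zarr", chunks={"day": 50})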
Proof of Concept (messy) Code:
import os
import fsspec
import xarray as xr
import apache_beam as beam
import hvplot.xarray
import time
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)
target_root = "references"
store_name = "Pangeo_Forge"
full_path = os.path.join(target_root, store_name, "reference.json")
years = list(range(1979, 1980))
time_dim = ConcatDim("time", keys=years)
def format_function(time):
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"
pattern = FilePattern(format_function, time_dim, file_type="netcdf4")
pattern = pattern.prune()
transforms = (
    # Create a beam PCollection from our input file pattern
    beam.Create(pattern.items())
    # Open with Kerchunk and create references for each file
    | OpenWithKerchunk(file_type=pattern.file_type)
    # Use Kerchunk's `MultiZarrToZarr` functionality to combine the reference files into a single
    # reference file. *Note*: Setting the correct concat_dims and identical_dims is important.
    | CombineReferences(
        concat_dims=["day"],
        identical_dims=["lat", "lon", "crs"],
    )
    # Write the combined Kerchunk reference to file.
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

# Execute the Beam pipeline; without this step the combined reference is never written.
with beam.Pipeline() as p:
    p | transforms
# KERCHUNK
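# Open the combined reference as a virtual Zarr store; chunk bytes are streamed
# over HTTP on demand instead of downloading the whole file first.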
mapper = fsspec.get_mapper(
    "reference://",
    fo=full_path,
    remote_protocol="http",
)
ds = xr.open_dataset(
    mapper, engine="zarr", decode_coords="all", backend_kwargs={"consolidated": False}
)
t = time.perf_counter()
display(ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"KERCHUNK INITIAL Elapsed time: {elapsed:0.4f} seconds")
# THREDDS
def url_gen(year):
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )

urls_list = [url_gen(year) for year in years]
# Open the same files over OPeNDAP; data is also streamed lazily, not downloaded.
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")
t = time.perf_counter()
display(netcdf_ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"THREDDS INITIAL Elapsed time: {elapsed:0.4f} seconds")