With a little preprocessing with Kerchunk, you can visualize a large dataset on the fly, without first downloading it to disk, by passing an fsspec URL and streaming the data.
Kerchunk makes a netCDF file (and other older file formats) cloud optimized by building a small reference index, so loading only the chunks of data you actually want is fast.
This video demos visualizing data streamed with Kerchunk vs. THREDDS.
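For intuition, generating a reference for a single file with Kerchunk alone looks roughly like this (a minimal sketch, not part of the demo; the URL is the same gridMET file used below and the output filename is a placeholder). The PoC code further down does this per file and combines the results with pangeo-forge-recipes:
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "http://www.northwestknowledge.net/metdata/data/bi_1979.nc"
# Scan the HDF5 chunk layout once and build a dict of byte-range references.
with fsspec.open(url) as f:
    refs = SingleHdf5ToZarr(f, url).translate()
# The reference file is small; it only points at byte ranges in the remote netCDF.
with open("bi_1979_reference.json", "w") as out:
    json.dump(refs, out)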
Key points:
- Initial streaming with both Kerchunk and THREDDS is bounded by internet speed (here, conference wifi at ~0.5 Mbps)
- Once the requested slice of the dataset has been downloaded, Kerchunk is significantly faster than THREDDS at loading the chunks within the frame (see the timing sketch appended after the PoC code)
- All of this is done without saving the data to local disk (besides the reference files)
Ideas:
- Combine with STAC + Panel to explore an entire catalog
- Compare against a locally downloaded file converted to Zarr + Dask (see the sketch after this list)
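A rough sketch of the second idea (hedged; the local filenames and chunk size are placeholders): download one of the gridMET files, convert it to Zarr, and reopen it with Dask chunks to get a local baseline to benchmark against.
import fsspec
import xarray as xr

url = "http://www.northwestknowledge.net/metdata/data/bi_1979.nc"
# Download the file to local disk (this is the step the Kerchunk approach avoids).
with fsspec.open(url) as remote, open("bi_1979.nc", "wb") as local:
    local.write(remote.read())

# Convert to Zarr, then reopen lazily with Dask chunks for the comparison.
xr.open_dataset("bi_1979.nc").to_zarr("bi_1979.zarr", mode="w")
local_ds = xr.open_zarr("bi_1979.zarr", chunks={"day": 50})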
Proof of Concept (messy) Code:
import os
import fsspec
import xarray as xr
import apache_beam as beam
import hvplot.xarray
import time
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)
target_root = "references"
store_name = "Pangeo_Forge"
full_path = os.path.join(target_root, store_name, "reference.json")
years = list(range(1979, 1980))
time_dim = ConcatDim("time", keys=years)
def format_function(time):
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"
pattern = FilePattern(format_function, time_dim, file_type="netcdf4")
pattern = pattern.prune()
transforms = (
    # Create a beam PCollection from our input file pattern
    beam.Create(pattern.items())
    # Open with Kerchunk and create references for each file
    | OpenWithKerchunk(file_type=pattern.file_type)
    # Use Kerchunk's `MultiZarrToZarr` functionality to combine the reference files into a single
    # reference file. *Note*: Setting the correct concat_dims and identical_dims is important.
    | CombineReferences(
        concat_dims=["day"],
        identical_dims=["lat", "lon", "crs"],
    )
    # Write the combined Kerchunk reference to file.
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

# Execute the Beam pipeline; without this step the combined reference is never written.
with beam.Pipeline() as p:
    p | transforms
# KERCHUNK
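# Open the combined reference as a virtual Zarr store; chunk bytes are streamed
# over HTTP on demand instead of downloading the whole file first.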
mapper = fsspec.get_mapper(
    "reference://",
    fo=full_path,
    remote_protocol="http",
)
ds = xr.open_dataset(
    mapper, engine="zarr", decode_coords="all", backend_kwargs={"consolidated": False}
)
t = time.perf_counter()
display(ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"KERCHUNK INITIAL Elapsed time: {elapsed:0.4f} seconds")
# THREDDS
def url_gen(year):
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )

urls_list = [url_gen(year) for year in years]
# Open the same files over OPeNDAP; data is also streamed lazily, not downloaded.
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")
t = time.perf_counter()
display(netcdf_ds.sel(lat=slice(48, 0)).hvplot("lon", "lat", rasterize=True))
elapsed = time.perf_counter() - t
print(f"THREDDS INITIAL Elapsed time: {elapsed:0.4f} seconds")