Compatibility of Pandas Series in Datashader

CrashLandonB · September 7, 2022, 1:12pm

New to Datashader, but I get the impression that Datashader primarily wants DataFrames as its data source, and is less integrated with Pandas Series objects. Is this accurate?

More detail:
I’m looking to plot time series data in an efficient/performant manner. For example, I want to create a figure with 6 subplots. Each subplot might have 10-50 series plotted on it. Each series will have from 10k to 1 million points. The data will be asynchronous (having a different time array for every data array). I currently gather all of these time series signals into a dictionary of Pandas Series objects. Can this dictionary be used in Datashader without converting to a DataFrame? I have been avoiding such a conversion, assuming that combining many async parameters into a DataFrame will be a big hit to both performance and memory. Are these valid assumptions?

Thanks in advance for your feedback!

ianthomas23 · September 8, 2022, 7:29am

Yes, for lines Datashader is expecting a DataFrame. It could be a Pandas, Dask, cuDF, or Dask-cuDF DataFrame. A DataFrame is a collection of Series objects so there shouldn’t necessarily be a performance/memory hit from combining them in this way, but I am not a Pandas expert so I cannot advise on good/bad ways to do the combining.

If you are dealing with a lot of data then you should look at using a Dask DataFrame.

ianthomas23 · September 8, 2022, 5:07pm

I should expand further on this. The choice for combining the data is whether to have one big DataFrame or a separate DataFrame per time series. The former could be really inefficient if the time coordinates do not overlap in your various time series, but you can treat it as a single entity and use just one call to e.g. canvas.line() in Datashader. The latter are much easier to create and won’t require lots of extra RAM, but you will have to make a separate call to canvas.line() for each DataFrame. Each call will return a separate xarray.DataArray that you will have to combine yourself. If you are just counting the number of lines per pixel in the canvas then this is easy (the count arrays can simply be added together), but if you are doing something more complicated then combining them will be more complicated too.

CrashLandonB · September 8, 2022, 7:29pm

Thanks for the feedback. Both points are helpful. Sometimes I will have many parameters with the same time array, but often with separate time arrays.

I’m wondering about “counting the number of lines in a pixel” for the case of multiple signals with the same time array, vs multiple signals with different time arrays (but on the same plot). Something I may have to test out… Thanks!