Computing the plot in another process raises `weakref.ReferenceType`

terry · December 17, 2022, 2:03pm

I’m generating 100+ plots in a loop (exploratory reasons …) and displaying them with holoviews.Layout(plot_list). This takes more than 3 minutes. I tried to speed up the computation with ray.io (on a distributed cluster) by returning df.plot(datashade=True) from each remote task. However, the process raises TypeError: cannot pickle 'weakref.ReferenceType' object. Are there other ways to speed up plot generation, or do I have to copy some of the data to avoid weak references?

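import holoviews
import pandas
import ray


def plot(df1, df2, iloc_start: int, iloc_end: int):
    # NOTE: `plot_opts` is not defined in this snippet; it is assumed to be a
    # dict of plot keyword arguments available in the enclosing scope.
    @ray.remote
    def get_plot(df1, df2, iloc, plot_opts):
        pandas.options.plotting.backend = "holoviews"

        # each row of df2 names the two df1 columns to plot together
        row = df2.iloc[iloc]
        col1 = row["list_members"][0]
        col2 = row["list_members"][1]
        df = df1[[col1, col2]]
        return df.plot(**plot_opts, title=f"{col1}, {col2}").options(shared_axes=False)

    plots = []
    refs_task = []
    ref_df1 = ray.put(df1)
    ref_df2 = ray.put(df2)

    for iloc in range(iloc_start, iloc_end):
        refs_task.append(get_plot.remote(ref_df1, ref_df2, iloc, plot_opts))

    # collect results as tasks finish; only fetch the ones that are done
    while refs_task:
        refs_task_done, refs_task = ray.wait(refs_task)
        plots += ray.get(refs_task_done)
    return holoviews.Layout(plots).cols(5)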

Traceback

File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/worker.py:2309, in get(object_refs, timeout)
   2307     worker.core_worker.dump_object_store_memory_usage()
   2308 if isinstance(value, RayTaskError):
-> 2309     raise value.as_instanceof_cause()
   2310 else:
   2311     raise value

RayTaskError(TypeError): ray::get_plot() (pid=36727, ip=192.168.0.107)
  File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 627, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
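
For reference, the final TypeError can be reproduced without Ray or HoloViews. Below is a minimal sketch using only the standard library. `Source`, `Plot` and `unpicklable_attrs` are hypothetical names made up for illustration and are not part of either library. It shows that an object holding a `weakref.ref` fails to pickle, and one simple way to find which attribute is responsible.

```python
import pickle
import weakref


class Source:
    """Hypothetical stand-in for some data owner."""


class Plot:
    """Hypothetical stand-in for an object that keeps a weak reference."""

    def __init__(self, source, title):
        self.title = title
        self._source_ref = weakref.ref(source)  # weak references cannot be pickled


def unpicklable_attrs(obj):
    """Return the names of instance attributes that fail to pickle."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad


source = Source()
plot = Plot(source, "demo")

try:
    pickle.dumps(plot)
    pickled_ok = True
except TypeError as exc:
    pickled_ok = False
    print(exc)  # the exact wording depends on the Python version

print(unpicklable_attrs(plot))
```

The helper only inspects top-level attributes. A real plot object may hide the weak reference several levels deep, so it would likely need to recurse.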

maximlt · December 29, 2022, 8:46am

Hi @terry,

Could you provide a piece of code that one can run to reproduce your issue? I’ve personally never used ray.io.

From what I know, HoloViews (hvPlot returns HoloViews objects) was originally used in a similar way, with the creation of many HoloViews objects distributed across a cluster.

I’m not sure what the issue is there. Either that’s a bug, in which case it’d be nice to be able to reproduce it, or you have to configure ray.io's serializer for HoloViews objects, as they need some special handling for the options tree (see this issue with some useful context).
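
To illustrate what "configuring the serializer" could look like in general, here is a rough sketch using only Python's standard `copyreg` module. The `Source` and `Plot` classes and the `reduce_plot` function are hypothetical stand-ins, not HoloViews or Ray APIs. The idea is to register a custom reducer that drops the weak reference before pickling. Ray documents a comparable hook, `ray.util.register_serializer`, but whether that is enough for a datashaded HoloViews object is untested here.

```python
import copyreg
import pickle
import weakref


class Source:
    """Hypothetical data owner."""


class Plot:
    """Hypothetical object that holds a weak reference to its source."""

    def __init__(self, title, source=None):
        self.title = title
        self._source_ref = weakref.ref(source) if source is not None else None


def reduce_plot(plot):
    """Custom reducer: keep the title, drop the unpicklable weak reference."""
    return Plot, (plot.title,)


# Tell pickle to use reduce_plot whenever it encounters a Plot instance.
copyreg.pickle(Plot, reduce_plot)

source = Source()
original = Plot("demo", source)
restored = pickle.loads(pickle.dumps(original))

print(restored.title)        # the picklable state survives
print(restored._source_ref)  # the weak reference is gone after the round trip
```

The trade-off is that the restored object loses whatever the weak reference pointed to, so this only helps if that link can be dropped or rebuilt on the receiving side.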

terry · December 30, 2022, 3:39pm

It seems the weakref exception occurs only when using .plot(datashade=True), which I sadly need.
I’ve tried Ray's serialization debugging function, but it reports the same weakref info. I also tried serializing with the dill library, without success.

import holoviews
import pandas
import ray


@ray.remote
def calculate_plot_on_remote_node(df, column_name):
    pandas.options.plotting.backend = "holoviews"
    # when `.plot(datashade=True)` we get `TypeError: cannot pickle 'weakref.ReferenceType' object`
    return df[column_name].plot()

def plot_alot_of_columns(df):
    plots = []
    refs_task = []  # save references to tasks we distribute
    ref_df = ray.put(df)  # store `df` once in Ray's object store so every task can read it
    
    # let's say we want to plot columns
    for column_name in df.columns:
        # distribute tasks and save references so we can collect them later
        refs_task.append(
            calculate_plot_on_remote_node.remote(ref_df, column_name))
        # plots.append(
            # calculate_plot_on_remote_node(df, column_name))
    
    # collect processed tasks
    while refs_task:
        refs_task_done, refs_task = ray.wait(refs_task)
        plots += ray.get(refs_task_done)
     
    return holoviews.Layout(plots).cols(3)


# This will start `ray` processes on local machine (also known in `ray` terminology 
# as the "driver node" - the node from which we're distributing tasks).
ray.init()  # N.B. this must be called only once in the script/jupyterlab

df = pandas.DataFrame({"col1": [i for i in range(10)], 
                       "col2": [i for i in range(10,20)],
                       "col3": [i for i in range(20,30)]})
plot_alot_of_columns(df)
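
One possible workaround, sketched here with only the standard library: have each remote task return plain, picklable data instead of a plot object, and build the plot on the driver. The example below bins points into a fixed-size grid of counts, which is similar in spirit to what Datashader produces. `aggregate_counts` is a hypothetical helper, and turning the grid into a HoloViews element on the driver (for example with `holoviews.Image`) is an assumption that is not shown or tested here.

```python
import pickle


def aggregate_counts(xs, ys, width, height):
    """Bin (x, y) points into a height x width grid of counts (plain lists)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_span = (x_max - x_min) or 1  # avoid division by zero for constant data
    y_span = (y_max - y_min) or 1
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # clamp so the maximum value lands in the last bin
        col = min(int((x - x_min) / x_span * width), width - 1)
        row = min(int((y - y_min) / y_span * height), height - 1)
        grid[row][col] += 1
    return grid


# This is what a remote task would return instead of a plot object.
xs = list(range(10))
ys = list(range(10, 20))
grid = aggregate_counts(xs, ys, width=5, height=5)

# Plain lists of ints survive pickling, so they can cross process boundaries.
restored = pickle.loads(pickle.dumps(grid))
print(restored == grid)
```

The cost of this approach is that the result is a static aggregate: it does not re-render on zoom the way a dynamic datashaded plot does.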