Problem with `link_selections` with `index_cols`

jerry.vinokurov · July 22, 2024, 3:29pm

I’m having some trouble understanding how a link_selections instance is supposed to work with the index_cols argument. My initial understanding was that this argument lets you link selections across plots that are created from the same data. However, I am running into a problem with the most trivial implementation of this approach I can think of. An MRE is provided below:

import holoviews as hv
from holoviews.selection import link_selections
import numpy as np
import pandas as pd
import panel as pn

hv.extension('bokeh')
pn.extension()

# some fake data
data = np.random.default_rng(seed=42).normal(size=(100, 3))
cols = ['x', 'y', 'z']
df = pd.DataFrame(data, columns=cols)
df['id'] = np.arange(100)
df['foo'] = np.random.default_rng(seed=42).integers(low=1, high=10, size=100)

# want to link two plots across the `id` column
ls = link_selections.instance(index_cols=['id'])

img1 = hv.Points(df, kdims=['x', 'y'], vdims=['id']).opts(tools=['box_select'])
img2 = hv.Points(df, kdims=['x', 'z'], vdims=['id']).opts(tools=['box_select'])

pn.Row(ls(img1), ls(img2)).servable()

If I do this, I get the following error (truncated somewhat):

File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/element/selection.py", line 334, in _get_selection_expr_for_stream_value
    expr, _, _ = self._get_index_selection(kwargs['index'], index_cols)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/element/selection.py", line 39, in _get_index_selection
    vals = dim(index_dim).apply(ds.iloc[index], expanded=False)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/interface.py", line 33, in __getitem__
    res = self._perform_getitem(self.dataset, index)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/interface.py", line 98, in _perform_getitem
    return dataset.clone(data, kdims=kdims, vdims=vdims, datatype=datatype)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/__init__.py", line 1203, in clone
    return super().clone(data, shared_data, new_type, *args, **overrides)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/dimension.py", line 561, in clone
    return clone_type(data, *args, **{k:v for k,v in settings.items()
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/__init__.py", line 329, in __init__
    initialized = Interface.initialize(type(self), data, kdims, vdims,
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/interface.py", line 253, in initialize
    (data, dims, extra_kws) = interface.init(eltype, data, kdims, vdims)
  File "/Users/jerry/Development/project/venv/lib/python3.10/site-packages/holoviews/core/data/pandas.py", line 83, in init
    raise DataError('Dimensions may not reference duplicated DataFrame '
holoviews.core.data.interface.DataError: Dimensions may not reference duplicated DataFrame columns (found duplicate 'id' columns). If you want to plot a column against itself simply declare two dimensions with the same name.

Ok, so it’s mad that I have the same dimension declared twice. This comes about because in _get_index_selection the following appears:

ds = self.clone(kdims=index_cols, new_type=Dataset)

This clones the element as a Dataset with kdims equal to the supplied index_cols. Because vdims are not overridden in that call, the clone keeps the element’s original vdims, so if any of the index_cols is also present in the vdims, the resulting dataset declares the id column twice. However, if I leave id out of the vdims (by making them blank or specifying, say, vdims=['foo'] in hv.Points), then I get a different error message about not being able to resolve the id column. And that makes sense: I left it off, so of course it can’t.

This feels like a bug to me, but I could also be missing something about the internals, so I would love to hear from people who are more familiar with them than I am. I have worked around this by making the change that can be seen in this PR, but I’m not sure if it has any downsides or knock-on effects that I’m failing to consider.

jerry.vinokurov · July 25, 2024, 1:43pm

So, for anyone who might be struggling with this: it occurs because the objects produced by e.g. hv.Points(df) are not identical to those produced by hv.Points(hv.Dataset(df)). To see this at work, consider the following snippet:

import holoviews as hv
import numpy as np
import pandas as pd

hv.extension('bokeh')

data = np.random.default_rng(seed=42).normal(size=(100, 3))
idx = np.arange(100)
cols = ['x', 'y', 'z']
df = pd.DataFrame(data, columns=cols)
df['id'] = idx
df['foo'] = np.random.default_rng(seed=42).integers(low=1, high=10, size=100)

points_from_df = hv.Points(df, kdims=['x', 'y'])
ds = hv.Dataset(df)
points_from_ds = hv.Points(ds, kdims=['x', 'y'])

print('points from df:        ', points_from_df)
print('points from df dataset:', points_from_df.dataset)
print('points from ds:        ', points_from_ds)
print('points from ds dataset:', points_from_ds.dataset)
print('underlying data equal? ', points_from_df.data.equals(points_from_ds.data))

If you thought these were the same, they are not:

points from df:         :Points   [x,y]   (z,id,foo)
points from df dataset: :Dataset   [x,y]   (z,id,foo)
points from ds:         :Points   [x,y]
points from ds dataset: :Dataset   [x,y,z,id,foo]
underlying data equal?  True

Now, when self.clone is called as in my original post, the results look like this:

points from df clone:   :Dataset   [id]   (z,id,foo)
points from ds clone:   :Dataset   [id]

As you can see, using the bare dataframe as input results in a duplication of the id column, whereas using the dataset as input does not. However, you cannot pass in vdims explicitly to the Points constructor to solve this problem because if you leave out the linking id column, the underlying dataset will be missing the id column during the link selection and will therefore throw a different error.

This feels a bit buggy to me, as I would have expected the result to be the same whether I pass in a pd.DataFrame or an hv.Dataset. Having to manually create the Dataset before passing it to the plot constructor is slightly annoying but at least now I know it works. But it’s very unexpected behavior for these two seemingly identical paths to produce very different results.

jerry.vinokurov · July 25, 2024, 1:48pm

@ahuang11 I know you’re keeping a list of HoloViews/Panel best practices, so you might consider adding “Always turn your data into a Dataset first” to that list. Hopefully this will keep people from getting tripped up like I did.

ahuang11 · July 26, 2024, 6:29pm

Thanks for sharing your solution! I think this could be a bug (I’d expect them to return identical output). Would you mind submitting a GitHub issue?

Oh I just saw your PR. I’ll look into it a bit more.