Plotting Large Datasets

saulobertolino · July 12, 2020, 10:58am

I’m working with a large dataset (62,576,606 rows) and I’m trying to draw a scatter plot. It crashes my Jupyter Notebook because it runs out of memory. I’ve tried to use holoviews.operation.datashader (datashade, shade, dynspread, rasterize) as in the example from http://holoviews.org/user_guide/Large_Data.html, but it didn’t work. Does anybody have a guide for working with large datasets in HoloViews?

Marc · July 12, 2020, 11:35am

Hi @saulobertolino

I’m not an expert in Datashader, but if you could post a minimal code example of what you have been trying and a link to the data, then I think the community would be able to help.

tmikolajczyk · July 12, 2020, 2:36pm

You can also try starting the notebook with this command:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

carl · July 13, 2020, 2:49pm

Hi @saulobertolino,

I don’t normally use that amount of data in one go; I usually work with about a quarter of that, and it is usually fine. Anyway, I gave the following a try with some random data, generating a pandas DataFrame with 63,072,001 rows. I have no doubt it could be better. Note that I have a Windows 10 HP EliteBook with 16 GB of RAM, and this pushes it to 80%+ RAM usage with a few other things running. All of this was done from within JupyterLab.

Generate some random data in df

#adapted from https://stackoverflow.com/questions/42941310/creating-pandas-dataframe-with-datetime-index-and-random-values-in-column

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(730), freq='s')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'Datetime': days, 'data': data})
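
As a rough check on why a frame of this size is heavy, the row count and the raw column size can be worked out by hand. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, and the 8-byte integer width is an assumption (NumPy's default integer can be 4 bytes on some Windows setups). Any copies made while converting or plotting come on top of this figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_in_range(days: int) -> int:
    """Rows produced by a one-second date range spanning `days` days, both endpoints included."""
    return days * SECONDS_PER_DAY + 1


def estimate_bytes(n_rows: int, bytes_per_row: int) -> int:
    """Raw column storage only; ignores the index and any temporary copies."""
    return n_rows * bytes_per_row


n_rows = rows_in_range(730)

# Assumption: an 8-byte datetime64 column plus an 8-byte integer column.
total = estimate_bytes(n_rows, 8 + 8)

print(f"{n_rows:,} rows, roughly {total / 1024**3:.2f} GiB of raw column data")
```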

Now get data into shape and plot

import holoviews as hv
import datashader as ds

from holoviews.operation.datashader import datashade, rasterize
from holoviews import opts, dim

import holoviews.operation.datashader as hd
hv.extension("bokeh", "matplotlib")

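# Convert the timestamps to a NumPy datetime64 array for the x axis
dates = np.array(df['Datetime'], dtype=np.datetime64)

# datashade aggregates the points in Python, so the browser only receives an image
random_data = datashade(
    hv.Scatter((dates, df['data']), 'DateTime (mm/dd)', 'random data', label='random plot'),
    cmap=['lightblue', 'darkblue'],
)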

plot = random_data.opts(height=800,width=1500)
plot

Normally I’d be loading the data from a file and getting it into shape for plotting, which also costs some resources, but with the above I didn’t really need to do anything except generate the random data.

Hopefully of some use

saulobertolino · July 13, 2020, 5:52pm

Hi carl!
I think your example solved my problem! I just need to understand how to configure the colors of my plot.
Let me try to explain:
I have two features plotted in my scatter (just like your example), but I have another feature (the third column) with two discrete values (‘A’ and ‘B’, for example). Before this, I would create two new DataFrames using pandas filters (for example, df_2 = df[df['Label'] == 'A']), then create two scatters and overlay them in one plot. I tried to do the same here and it didn’t work. Is there a way to color the points according to that third feature?

carl · July 15, 2020, 3:17pm

Hi @saulobertolino,

I think, as @Marc said, we really need something to work with here; otherwise I feel like I’m guessing and going round in circles with my limited knowledge. I’ll give it a go, though. Sorry, I’m a bit of a bric-a-brac DIY-ist; I’m sure that with some supplied information you’ll get decent help here. From what you’ve said, I’ve put the following together from a couple of sources.

Generate some random y1 and y2 data and assign each row randomly to category ‘A’ or ‘B’. I’ve left a time series in for x just because that’s what I normally work with, rather than an arbitrary random x value.

#adapted from: https://stackoverflow.com/questions/42941310/creating-pandas-dataframe-with-datetime-index-and-random-values-in-column
#              https://docs.scipy.org/doc//numpy-1.15.0/reference/generated/numpy.random.choice.html
#              https://datashader.org/user_guide/Points.html
#              https://datashader.org/getting_started/Interactivity.html

import pandas as pd
import numpy as np
from datetime import datetime, timedelta


date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(730), freq='s')

np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(181, high=280, size=len(days))
data3 = np.random.choice(['A', 'B'], size=len(days))


df = pd.DataFrame({'datetime':days, 'y1': data1, 'y2':data2, 'cat':data3})

Let’s define the cat column as a pandas category

df['cat'] = df['cat'].astype('category')
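
Datashader's count_cat reducer expects the column to have a categorical dtype, which is why this conversion matters. As a side effect, a categorical column usually takes less memory than plain strings. A small pandas-only sketch with made-up data (labels and as_category are illustrative names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1111)

# One million random 'A'/'B' labels stored as plain strings
labels = pd.Series(rng.choice(["A", "B"], size=1_000_000))

# The categorical version stores small integer codes plus one copy of each label
as_category = labels.astype("category")

string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(list(as_category.cat.categories))
print(f"strings: {string_bytes / 1e6:.1f} MB, category: {category_bytes / 1e6:.1f} MB")
```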

OK, this is where I went in circles, and I hope I understood what you mean: you have a couple of columns, and the values in those columns are assigned to categories. I’ve stuck with A and B in this case. I couldn’t find a nice way other than to split the initial DataFrame into two. I’m sure it is possible to work from a single frame, but that is my lack of understanding.

So let’s get cracking and see what I’ve done here: first some more imports, then the df is split into two copies, each is plotted, and finally they are overlaid. The plots are coloured according to the A or B assignment. Note that I had to sample the data right down so that the colour difference is visible. The data sets used aren’t really suitable as a demonstration, but maybe this will work with your data…

import holoviews as hv
import holoviews.operation.datashader as hd
hd.shade.cmap=["lightblue", "darkblue"]
hv.extension("bokeh", "matplotlib") 
import datashader as ds

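# One frame per y column; with no dimensions given, hv.Points takes the first two columns as x and y
df1 = df.drop(columns=['y2'])
df2 = df.drop(columns=['y1'])

# Sample heavily so the two category colours can be told apart in this random data
points1 = hv.Points(df1.sample(1000))
points2 = hv.Points(df2.sample(1000))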

datashaded = hd.datashade(points1, aggregator=ds.count_cat('cat'))
scatter1 = hd.dynspread(datashaded, threshold=0.1).opts(height=500,width=500)

datashaded2 = hd.datashade(points2, aggregator=ds.count_cat('cat'))
scatter2 = hd.dynspread(datashaded2, threshold=0.1).opts(height=500,width=500)

plot = scatter1 + scatter2
plot

Now for the overlay itself

plot2 = scatter1 * scatter2
plot2.opts(ylabel='overlayed')

Again, I hope this is of some help. [I’m quietly confident I’m not doing things quite right, but it seems to function.] I didn’t check whether the colours could be changed from red and blue, but since it’s not the standard datashade blue tone, I believe it should be fairly easy to alter the colours to suit. I haven’t tried any further.

Thanks, Carl.