First calls to canvas.line are always incredibly slow

Hello everyone,

I want to stream data using datashader along with bokeh. I know that holoviews would already do the job for me; however, interactivity with holoviews does not work on one particular server installation (while it does on my computer). I decided to quickly write a bokeh plot that more or less does the same thing as holoviews. While this plot works fine, I have noticed something strange. When calling canvas.line, the first call always takes about 1 second no matter how many points I plot, which is very slow. This means that if I want to plot n lines at the same time, the first draw will take about n*1s to execute. If I take bokeh out of the picture, the problem persists, so I know it is only related to datashader.

I prepared a little example, where I plot n random walks, which are updated at a regular interval. In this example, I time the calls of each individual canvas.line.

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
import time

x_name = "x"

def shade(data, y):
    cvs = ds.Canvas(plot_width=800, plot_height=400)
    aggs = []

    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data).astype(float)

    for i, yn in enumerate(y):
        t = time.time()

        aggs.append(cvs.line(data, x_name, yn))

        t = time.time() - t
        print(f"line: {i}, csv.line: {t:.07}s")

    imgs = [tf.shade(aggs[i]) for i in range(len(y))]
    return tf.stack(*imgs)

def init_data(num_lines):
    # Start every series at the origin and collect the y column names.
    data = {x_name: np.array([0])}
    ys = []
    for i in range(num_lines):
        key = f"y{i}"
        data[key] = np.array([0])
        ys.append(key)
    return data, ys

def update_data(data, num_data, num_lines, refresh_rate):
    last_time = data[x_name][-1]
    new_time = np.linspace(last_time + 1e-3, last_time + refresh_rate / 1000, num_data)
    x = list(np.append(data[x_name], new_time))
    data[x_name] = x

    # Create random walk
    for i in range(num_lines):
        key = f"y{i}"
        last_y = data[key][-1]
        rand = np.random.normal(0, np.sqrt(10 / num_data), size=num_data)
        new_y = np.cumsum(rand) + last_y
        y = list(np.append(data[key], new_y))
        data[key] = y

    return data, last_time

num_data = 1000
num_lines = 10
refresh_rate = 500
total_time = 30

data, y = init_data(num_lines)
for i in range(int(total_time / (refresh_rate / 1000))):
    data, last_time = update_data(data, num_data, num_lines, refresh_rate)
    t = time.time()
    imgs = shade(data, y)
    t = time.time() - t
    print(f"Time: {last_time}s, Shade function timing: {t:.07}s\n")
    time.sleep(refresh_rate / 1000)

Using Python 3.7.7 and datashader 0.11.0, I obtain the following results:

line: 0, csv.line: 1.019807s
line: 1, csv.line: 0.6681943s
line: 2, csv.line: 0.6904411s
line: 3, csv.line: 0.6443765s
line: 4, csv.line: 0.6933248s
line: 5, csv.line: 0.7077668s
line: 6, csv.line: 0.732996s
line: 7, csv.line: 0.6384065s
line: 8, csv.line: 0.6951737s
line: 9, csv.line: 0.712997s
Time: 0s, Shade function timing: 7.337082s

line: 0, csv.line: 0.0017488s
line: 1, csv.line: 0.001763821s
line: 2, csv.line: 0.001781225s
line: 3, csv.line: 0.001730204s
line: 4, csv.line: 0.001755476s
line: 5, csv.line: 0.001938581s
line: 6, csv.line: 0.002449274s
line: 7, csv.line: 0.001697063s
line: 8, csv.line: 0.001592875s
line: 9, csv.line: 0.001425743s
Time: 0.5s, Shade function timing: 0.143779s

line: 0, csv.line: 0.005766869s
line: 1, csv.line: 0.008542538s
line: 2, csv.line: 0.00506115s
line: 3, csv.line: 0.01425767s
line: 4, csv.line: 0.004119158s
line: 5, csv.line: 0.00257349s
line: 6, csv.line: 0.002157927s
line: 7, csv.line: 0.002114534s
line: 8, csv.line: 0.00369072s
line: 9, csv.line: 0.00231266s
Time: 1.0s, Shade function timing: 0.2164218s


You can see here that the first calls to canvas.line always take a lot of time, and the subsequent calls are okay. If I want to plot 32 lines, I would have to wait roughly 32 seconds for the first draw, which makes datashader unusable. Can you explain to me what I did wrong, or whether there is a better way to use datashader?

Thank you for your answers.

I'm a little bit confused by your statement that “the first calls to canvas.line always take a lot of time”. In your example it’s really just the first iteration that’s slow, and it speeds up significantly after that. I can’t quite explain why the whole first iteration is so slow. I would expect only the very first call to be slow, since numba has to compile the appropriate kernel, which should then be cached for subsequent calls. I can investigate that further.
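One way to separate one-time compilation cost from steady-state cost is to time the same call several times in a row. The following is only a minimal sketch: the helper names `time_repeated_calls` and `report` are made up for illustration and are not part of datashader.

```python
import time


def time_repeated_calls(fn, repeats=3):
    """Call `fn` several times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def report(durations):
    """Summarise the first call against the mean of the remaining calls."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else float("nan")
    return f"first call: {first:.4f}s, later calls (mean): {steady:.4f}s"


# Hypothetical usage with the objects from the example above:
#   durations = time_repeated_calls(lambda: cvs.line(data, x_name, "y0"))
#   print(report(durations))
```

If the first duration is much larger than the rest, the overhead is most likely a one-time cost such as compilation rather than the rendering itself.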

Secondly, though, your implementation is quite sub-optimal. Canvas.line has several options for rendering multiple lines at once, so instead of iterating over the lines and then stacking them, you should render them all in a single call like this:

def shade(data, y):
    cvs = ds.Canvas(plot_width=800, plot_height=400)

    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data).astype(float)
    t = time.time()
    img = tf.shade(cvs.line(data, x_name, y))
    t = time.time() - t
    print(f"csv.line: {t:.07}s")
    return img

Okay, thank you for your answer.

What I meant by “slow first calls” is that the calls to canvas.line during the first call of the shade function are very slow.

As for my sub-optimal implementation: in my real use case (which I did not show in the example), the lines do not all have the same number of points. E.g. the first line would have 800 x and y elements, the second line 900, and so on. This means I cannot build a single pandas DataFrame without padding it with None values all over the place. Is there a way to do this with only one pandas DataFrame?

You could concatenate the DataFrames and separate each line with NaNs. That has its own problems with efficiency, but I suspect it’ll still be better than stacking.
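A minimal sketch of that idea, using only pandas and numpy: each line is a pair of x and y arrays of arbitrary length, and one NaN row is appended after every line as a separator. The function name `concat_with_nan_breaks` and the (x, y) pair layout are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def concat_with_nan_breaks(lines, x_name="x", y_name="y"):
    """Stack variable-length lines into one DataFrame, separated by NaN rows.

    `lines` is a list of (x_values, y_values) pairs; each pair may have a
    different length. The name and argument layout are hypothetical.
    """
    parts = []
    for xs, ys in lines:
        xs = np.asarray(xs, dtype=float)
        ys = np.asarray(ys, dtype=float)
        if xs.shape != ys.shape:
            raise ValueError("x and y of a line must have the same length")
        # Append one NaN vertex so the line is broken before the next begins.
        parts.append(pd.DataFrame({
            x_name: np.append(xs, np.nan),
            y_name: np.append(ys, np.nan),
        }))
    return pd.concat(parts, ignore_index=True)


# Two lines with 3 and 5 points respectively.
lines = [
    (np.arange(3), np.array([0.0, 1.0, 0.5])),
    (np.arange(5), np.array([1.0, 0.0, 2.0, 1.5, 1.0])),
]
df = concat_with_nan_breaks(lines)
```

The resulting frame should then be usable in a single call such as cvs.line(df, "x", "y"), assuming the NaN rows are treated as breaks between lines as described above. Note that all lines end up in one aggregate, so they can no longer be shaded individually.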