Crafting Data Visualizations from Scratch

April 27, 2024 (2y ago)

This article demonstrates a methodical approach to generating graphical data visualizations from scratch, using C and Raylib. We will be doing this in C and only assuming that you have the ability to draw pixels on the screen and basic shapes (lines rectangles, circles, etc) which is provided to us by Raylib. We also compile the result to web assembly so that you can see the generated data visualizations in this article itself.

Why build data visualizations from scratch?

There are a plethora of data visualization libraries out there that can help to generate data visualizations. These libraries may not sufficiently encourage users to engage deeply with the underlying communicative intent of the visualization. Instead, they focus on getting something onto the screen "fast" and moving on. Additionally, the currently popular libraries often lack some kind of extensibility or have documentation that are encyclopedic in length.

Rather than sort through endless documentation and libraries, I want to explain how it is not that hard to actually just draw the data visualization graphics yourself. In doing so you will also have to think about all the details, and it's those details that can make or break a great visualization.

So what to visualize?

For the purposes of this article we will be generating a simple bar chart which tracks the usage frequency of letters in the English language.

Here is the original dataset. Interestingly, this dataset was used by the inventor of Morse code in order to aid in determing codes that would give shorter overall combinations. Eg. if the code for e was very long, it would mean the overal length of telegram messages would be significantly longer since e is such a commonly used letter.

And here is a preview of the chart that we will be producing in this article:

Scales

With rendering any data visualization to a computer screen you need a method for turning the data values into appropriate pixel values. To get these pixel values you need to come up with an appropriate function that maps the data space to pixel space. This function is typically called a "scale". For most ordinary data visualizations you probably want to convert the data values linearly to pixel values. For example, if the data ranges from 1 to 10 but you have a 100-pixel wide are to render it onto, you would need a scaling function that multiplies the data values by 10 to get the corresponding pixel values.

Linear Scale

What that in mind we can write down a function that takes the data range, the pixel range and a data value, and returns the corresponding pixel value that that data value corresponds to.

float linear_scale(
    float data_min,
    float data_max,
    float pixel_min,
    float pixel_max,
    float data_value
) {
  return pixel_min + (data_value - data_min) *
    (pixel_max - pixel_min) / (data_max - data_min);
}

what is going on here? What we have is a linear function that converts the data to pixels. The conversion "factor" is the slope of this function which you can get by taking known values in the data and in the pixels and taking their ratio.

What I mean by this is is that it is not too difficult to know the extent of the area on the screen you want to render into (eg. the width or height in pixels) and you also can compute the min / max of the data values you want to render into that space. Then using the high school math formula for point-slope you can get out this linear equation for converting data values to pixel values.

Rendering Data

Now that we have a scale we can begin to render some actual data to the screen. For a simple example let's do the following. Take the following three points as a dataset:

float data[] = {1.0f, 2.0f, 5.0f};

Say that we have a 300 pixel tall area on which to render these points. We compute the min / max easily (min is 1 max is 5) and pixel_min is 0 and pixel_max is 300. We can then use our scale function to get the pixel values for these data points.

float y_accessor(float data_value) {
    return linear_scale(1.0f, 5.0f, 0.0f, 300.0f, data_value);
}

With this function we can just pass in arbitrary data values and get out a corresponding pixel value. Let's see it in action:

And the code for that looks something like

static float simple[] = {1.0f, 2.0f, 5.0f};

float linear_scale(
    float data_min,
    float data_max,
    float pixel_min,
    float pixel_max,
    float data_value
) {
  return pixel_min + (data_value - data_min) *
    (pixel_max - pixel_min) / (data_max - data_min);
}

int main_loop() {
  BeginDrawing();
  ClearBackground(BLACK);
  int WIDTH = GetScreenWidth();
  int HEIGHT = GetScreenHeight();
  // Axis line
  DrawLine(
    MARGINS.left,
    MARGINS.top,
    MARGINS.left,
    HEIGHT - MARGINS.bottom,
    WHITE
  );
  for (int i = 0; i <= 2; ++i) {
    float y = scale_linear(
      1.0f, 5.0f, MARGINS.bottom, HEIGHT - MARGINS.top,
                           simple[i]);
    DrawCircle(2 * MARGINS.left, HEIGHT - y, 5, WHITE);
  }
  EndDrawing();
  return 1;
}

What we have done here is draw the axis which represents the scale that we computed. And along with that we have rendered the three points in our dataset (slightly shifted off the axis line so that they are visible).

This is the first step on our way to rendering a full bar chart, but it's worth noting that at this point we already have a lot to work with. We can show a 1d distribution of data with only a single axis. This can be useful in certain cases in and of itself.

Bars

Now that we have the ability to draw pixels that correspond to data points we want to apply this more generally to drawing a bar chart. In this chart we will have one bar for each letter representing it's frequency of usage in the English language (from the sample in the dataset).

In order to show the bars we need to decide which order we will show the bars in. In some datasets (like a time-series) this sorting is natural. But for our dataset we have a choice. We could order the chart in alphabetical order or we could order it in order of the frequency (or some more exotic sortings like vowels-first). But for our purposes we will show the order of the bars in the order of the frequency of the letters.

To plot the "X" axis (of the letters) we effectively need a categorical scale. This function takes a letter and returns the corresponding pixel value -- the X position of the bar.

float scale_categorical(
  int domain_count,
  float range_min,
  float range_max,
  int value
) {
  return scale_linear(
    0,
    domain_count - 1,
    range_min,
    range_max,
    value
  );
}

What is this function doing? Basically we provide the categorical scale with the number of categories in the domain and it maps them uniformly to buckets in the pixel range -- we have it actually do this by just calling our linear scale function where we scale the domain from zero to the count minus one.

Given the categorical x scale, we are ready to start to draw bars on the screen.

Here is the rest of the implementation for bars

struct row {
  char letter;
  float frequency;
};

static struct row letter_frequencies[] = {
    {'E', 0.12702}, {'T', 0.09056}, {'A', 0.08167}, {'O', 0.07507},
    {'I', 0.06966}, {'N', 0.06749}, {'S', 0.06327}, {'H', 0.06094},
    {'R', 0.05987}, {'D', 0.04253}, {'L', 0.04025}, {'C', 0.02782},
    {'U', 0.02758}, {'M', 0.02406}, {'W', 0.02360}, {'F', 0.02288},
    {'G', 0.02015}, {'Y', 0.01974}, {'P', 0.01929}, {'B', 0.01492}};

static int rows = sizeof(letter_frequencies) / sizeof(struct row);

#define BAR_WIDTH 20

int main_loop() {
  BeginDrawing();

  ClearBackground(BLACK);

  int WIDTH = GetScreenWidth();
  int HEIGHT = GetScreenHeight();

  for (int i = 0; i < rows; ++i) {
    float y = scale_linear(
      0,
      0.12702,
      MARGINS.bottom,
      HEIGHT - MARGINS.top,
      letter_frequencies[i].frequency
    );
    float x = scale_categorical(
        rows,
        MARGINS.left,
        WIDTH - MARGINS.right - MARGINS.left - BAR_WIDTH,
        i
    );
    DrawRectangle(
      x,
      HEIGHT - y,
      BAR_WIDTH,
      y - MARGINS.bottom,
      WHITE
    );
  }
  EndDrawing();
  return 1;
}

One note is that we have to take into account the width of the bar into our calculations (we've opted for a fixed bar width, you could have a dynamic bar width eg. and fixed gaps or something like that). So we use a categorical scale to compute the x point and the same linear scale function to compute the y point (we've inserted the maximum y value manually, you probably don't want to compute that in the render loop.)

Decorations

Margins

It quickly is becoming aparent that we might want to do something like show ticks or values that are outside the "bounds" of the chart. We can adjust our chart by setting margins around the edges of the chart and reducing the effective draw area for the data. This will change the computations of our scales (needing to take the margins into account for the maximum / minimum pixel values).

We actually did this in the code above (otherwise it would have been too difficult to see the axis). But to spell out here how you deal with margins:

You need to adjust the width and height down by the size of the margins and define an effective chart area -- This means that all your scales need to have a range in the pixel values of the effective area instead of inside of the entire canvas area.

You can see above when we call scale_linear and scale_categorical we appropriately adjust by the margins for the range of the screen pixel values.

Axes

We have already drawn some basic axes as simple lines, but there are some nice improvements that we can make to those. We can add ticks and labels, and we can also produce more exotic axes lines eg. where we show explicitly the max / min value in the dataset on the axis line itself.

Ticks and Labels

Given that we now have some margin space, we have some blank are into which we can render axis labels, ticks and tick labels. Rendering Tick lables can be a bit tricky since you need to take into account the size of the characters in the text that you are rendering (ensuring alignment and that it doesn't overlfow).

Here is how for example we have rendered the tick labels for the Y axis:

    int textWidth =
        MeasureText(TextFormat("%.2f", i * Y_MAX / Y_TICKS), 10);
    DrawText(
      TextFormat("%.2f", i * Y_MAX / Y_TICKS),
      MARGINS.left - textWidth - TICK_LENGTH - 4,
      HEIGHT - y - FONT_SIZE / 2, FONT_SIZE,
      WHITE
    );

we render this in a loop over the number of Y_TICKS. Essentially what we do here is measure the length of the text we are about to render. Raylib provides a helper function for this MeasureText but you may have to do this manually depending on exactly how you are doing text rendering. Once we know the width of the text we can offset the x position that we will start the text from so that all of the tick labels end at the same X value (right aligned).

Grid-lines

Lastly we can add some gridlines to the chart. These are handy to be able to draw the eye to the values shown on the axis for quick comparisons and lookups.

A nice tip from Tufte is to make the grid lines the same color as the background but place them on top of the data. This way the grid lines are not distracting, but can actually show more precisely how far above or below a given tick the bar is.

Wrapping up

With that we have rendered the bar chart that we set out to produce. Hopefully you found the process not too difficult -- or at least you see that it is indeed tractable to produce charts yourself without relying on a library.

Will Bunting