How to make a histogram...in Python

Now that I'm able to track the views on my site, I thought I might as well pretty up the data a little bit. Being an embedded developer by trade, I've been a firm believer in finding third-party libraries, ignoring them, and reinventing the wheel. Why use pip to install a package when I can just write the single function I actually need?

If you haven't seen my previous post where I introduced the script...

More dabbling in CGI: Site Statistics

... I decided that a nice feature to add would be a histogram of site usage, errors, new visitors, etc. I found a few Python libraries that provide ASCII graphing functions, but adding the functionality myself sounded like a fun evening project.

tl;dr

If you just want to look at the code, here you go:

import math

def print_histogram(value_dict, width=10, reverse=False):
    # Pick an axis max just above the largest value:
    # 2.5, 5 or 10 times a power of ten.
    max_value = max(value_dict.values())
    digits = len(str(max_value))
    possible = list(map(lambda x: pow(10, digits-1) * x, [2.5, 5, 10]))
    pick = min(list(filter(lambda x: x > max_value, possible)))

    # Value represented by a single '#', rounded up so bars never overflow.
    per_square = math.ceil(pick / width)

    # Pad every key to the same width so the bars line up.
    max_key_size = 1 + max(map(len, value_dict.keys()))
    for k in sorted(value_dict.keys(), reverse=reverse):
        v = value_dict[k]
        cnt = math.floor(v / per_square)
        print("{}|{}".format(k.ljust(max_key_size), '#'*cnt))

Histogram?

For anyone who doesn't know the difference, a bar graph is a graphical representation of data that is partitioned into categories and displayed as bars of varying length. A histogram is similar, but rather than independent categories, its bars cover contiguous values (or ranges) of an ordered input, so the position of each bar means something. In a bar graph there is no relationship between one category and the next.

As an example, if I were to poll 100 people on their favorite ice cream flavor I could create a bar graph that looked like this:

Favorite Flavors

Vanilla            |######
Chocolate          |########
Blue Moon          |#
Mackinac Isl Fudge |###

There is no direct relationship between the flavors and no sequential ordering of them in the graph. I could swap the categories around and it would still be a bar graph. If, however, I were to graph the ages of those polled who liked ice cream, it would look like this:

Who likes ice cream, based on age

12 |############
13 |########
14 |####
15 |#######
16 |##

The order of the bars is important as it's showing the progression of age. It could go from 16 to 12, but part of what one sees in a histogram is the ebb and flow of the values reported over the input data range. Here we see that as kids got older they started to lose interest in ice cream.

The Problem

An important part of any graph is figuring out the ranges of the axes. For the distribution input it's easy: just take the range over what we are measuring. In most cases it will be a range of dates, with each day having its own bar. But how long should we make the other axis? The easiest method would be to just pick the largest value and make that the max of the range. If you're always working with round numbers this would probably be fine, but that never seems to be the case. Instead, graphs tend to use the next round number just greater than the largest value.

max_value = max(value_dict.values())
digits = len(str(max_value))
possible = list(map(lambda x: pow(10, digits-1) * x, [2.5, 5, 10]))
pick = min(list(filter(lambda x: x > max_value, possible)))

For some flavor, rather than just using round numbers, my graphs will grow in a pattern: 2.5, 5, 10, 25, 50, 100, 250, 500, 1000, etc.

Let's start out by grabbing the largest value in the data set and storing it in `max_value`. Next we need to know which power of 10 scales our possible range values to be just greater than our maximum value. To do this we count the number of digits the max value contains. If our max value was 42, our digit count would be 2, making our possible axis max values 25, 50, or 100. To get `possible` we raise 10 to the power of `digits - 1` and multiply by our smallest range values `2.5, 5, 10`. To pick the value we filter out those that would be too small, in this case 25, and then take the smallest value that is left, 50.
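To see the selection in action, here is a small sketch that wraps those four lines in a helper. The name `pick_axis_max` is made up for this illustration and is not part of the script, and it assumes the values are non-negative integers, as view counts are.

```python
def pick_axis_max(max_value):
    """Return the smallest candidate axis max that exceeds max_value.

    Candidates are 2.5, 5 and 10 times a power of ten. Assumes max_value
    is a non-negative integer, since the digit count comes from its
    string form.
    """
    digits = len(str(max_value))
    possible = [pow(10, digits - 1) * x for x in (2.5, 5, 10)]
    return min(p for p in possible if p > max_value)


for value in (7, 42, 99, 100, 374):
    print(value, "->", pick_axis_max(value))
```

Here 42 lands on 50 and 374 on 500, while an exact 100 jumps up to 250 because the comparison is strictly greater than.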

Since we are working with ASCII characters, we have a fixed number of characters that make up our bar. Based on the total number we can display, `width`, each character is worth the axis max divided by the width.

per_square = math.ceil(pick / width)

Since simple uses of floating point numbers can be problematic, we need to do a little finagling with the value per character. To make sure our graph does not overflow, we assign the value per square to be the ceiling of this ratio. In cases where the ratio is not a whole number, a full bar represents a value slightly larger than the axis max. This means the largest bar shown won't overflow the graph.
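A quick check of that claim, as a sketch with illustrative numbers: an axis max of 25 over a width of 10 gives a raw ratio of 2.5, which rounds up to 3 per character. The helper name `per_square_for` is an assumption for this example only.

```python
import math


def per_square_for(pick, width):
    """Value represented by one character, rounded up to a whole number."""
    return math.ceil(pick / width)


width = 10
for pick in (25, 50, 500):
    per_square = per_square_for(pick, width)
    # A full bar of `width` characters covers at least the axis max.
    full_bar_value = per_square * width
    # A value equal to the axis max still fits within `width` characters.
    longest_bar = math.floor(pick / per_square)
    print(pick, per_square, full_bar_value, longest_bar)
```

For an axis max of 25 a full bar stands for 30, so a value of 25 only needs 8 of the 10 characters.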

A little bit of key handling makes sure everything lines up: find the largest key length and add 1.

max_key_size = 1 + max(map(len, value_dict.keys()))

Now to calculate how many characters need to be displayed.

for k in sorted(value_dict.keys(), reverse=reverse):
    v = value_dict[k]
    cnt = math.floor(v / per_square)
    print("{}|{}".format(k.ljust(max_key_size), '#'*cnt))

We loop over all the keys, sorted and optionally reversed. Next we get the value, `v`, and convert that to the count of characters in our bar. Here we floor the ratio to again make sure that we don't overflow the graph if there is any strange computer float nonsense going on. The last step is to print the key and bar.
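As a worked example of the flooring step, here is a sketch using a few of the daily counts from the output further down. With an axis max of 500 and a width of 30, each character is worth 17. The helper name `bar_length` is hypothetical.

```python
import math

per_square = 17  # ceil(500 / 30): axis max 500 spread over 30 characters


def bar_length(v, per_square):
    """Number of '#' characters for value v, rounded down."""
    return math.floor(v / per_square)


for v in (374, 176, 30, 16):
    print(v, '#' * bar_length(v, per_square))
```

Anything below 17 rounds down to an empty bar, which is why the quiet days show only the `|`.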

k.ljust(max_key_size)

Here we left justify the key by the size we measured previously, padding the rest of the string with spaces.

'#'*cnt

In Python, if you multiply a string by a number `N`, you get the string repeated `N` times.
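Putting the two pieces together, a minimal sketch with two of the ice cream flavors from earlier (the bar counts are hard-coded here rather than computed):

```python
keys = ["Vanilla", "Blue Moon"]
max_key_size = 1 + max(map(len, keys))  # longest key plus one space

for k, cnt in zip(keys, (6, 1)):
    # Pad the key with spaces on the right, then append the bar.
    line = "{}|{}".format(k.ljust(max_key_size), '#' * cnt)
    print(line)
```

Both `|` characters end up in the same column because every key is padded to the same width.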

The Result

The last step is to generate a dictionary of contiguous inputs over the data. Using `timedelta` from the `datetime` module, we can initialize the dictionary with a zero for every day before processing the entries.

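from datetime import datetime, timedelta

per_day = {}  # date string -> visit count
# entry_count (number of days to graph) is set elsewhere in the script
for i in range(0, entry_count):
    d = datetime.today() - timedelta(days=i)
    entry_date = d.strftime('%Y-%m-%d')
    per_day[entry_date] = 0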
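Why bother with the zeros? A self-contained sketch makes it clearer. The `log_dates` list is invented for illustration, and a fixed `today` stands in for `datetime.today()` so the result is repeatable.

```python
from datetime import datetime, timedelta

entry_count = 5
today = datetime(2022, 10, 11)  # fixed date so the example is repeatable

# Start every day in the window at zero so quiet days still get a row.
per_day = {}
for i in range(entry_count):
    d = today - timedelta(days=i)
    per_day[d.strftime('%Y-%m-%d')] = 0

# Hypothetical dates pulled from log entries; some days never appear.
log_dates = ["2022-10-11", "2022-10-10", "2022-10-10", "2022-10-07"]
for entry_date in log_dates:
    if entry_date in per_day:
        per_day[entry_date] += 1

for day in sorted(per_day):
    print(day, per_day[day])
```

Without the first loop, 2022-10-08 and 2022-10-09 would be missing from the dictionary and the graph would silently skip them, breaking the contiguous axis a histogram relies on.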

As an afterthought, I should probably add an option to display the values along with the graph, but for the moment I create keys containing the date and count, with the value just containing the count.

per_day_counts = {}
for k in sorted(per_day.keys(), reverse=True)[:entry_count]:
    per_day_counts["{}: {}".format(k, per_day.get(k))] = per_day.get(k)
print_histogram(per_day_counts, 30, True)

The output looks like this:

### Visits Per Day (Last 30 days)
2022-10-11: 30  |#
2022-10-10: 176 |##########
2022-10-09: 374 |######################
2022-10-08: 243 |##############
2022-10-07: 229 |#############
2022-10-06: 262 |###############
2022-10-05: 262 |###############
2022-10-04: 182 |##########
2022-10-03: 237 |#############
2022-10-02: 355 |####################
2022-10-01: 239 |##############
2022-09-30: 134 |#######
2022-09-29: 148 |########
2022-09-28: 184 |##########
2022-09-27: 154 |#########
2022-09-26: 19  |#
2022-09-25: 35  |##
2022-09-24: 78  |####
2022-09-23: 5   |
2022-09-22: 6   |
2022-09-21: 16  |
2022-09-20: 8   |
2022-09-19: 6   |
2022-09-18: 13  |
2022-09-17: 3   |
2022-09-16: 3   |
2022-09-15: 20  |#
2022-09-14: 6   |
2022-09-13: 4   |
2022-09-12: 2   |
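The option to display values alongside the bars could be sketched roughly like this. It is an assumption about how such a feature might look, not code from the script: the name `histogram_lines_with_values` and the choice to return lines instead of printing them are both hypothetical.

```python
import math


def histogram_lines_with_values(value_dict, width=10, reverse=False):
    """Build histogram rows that show each value next to its key.

    Hypothetical variant of print_histogram; assumes non-negative
    integer values and string keys.
    """
    max_value = max(value_dict.values())
    digits = len(str(max_value))
    possible = [pow(10, digits - 1) * x for x in (2.5, 5, 10)]
    pick = min(p for p in possible if p > max_value)
    per_square = math.ceil(pick / width)

    max_key_size = max(map(len, value_dict))
    lines = []
    for k in sorted(value_dict, reverse=reverse):
        v = value_dict[k]
        # Pad both the key and the value so the bars line up.
        label = "{}: {}".format(k.ljust(max_key_size), str(v).ljust(digits))
        lines.append("{} |{}".format(label, '#' * math.floor(v / per_square)))
    return lines


sample = {"2022-10-09": 374, "2022-10-10": 176, "2022-10-11": 30}
for line in histogram_lines_with_values(sample, 30, True):
    print(line)
```

With this approach the dictionary can keep plain dates as keys, so the extra `per_day_counts` step would no longer be needed.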

If you want to use this script, details can be found here. Just clone the repo and modify it to match your log structure.

Traffic.py

$ published: 2022-10-11 00:02 $

$ tags: programming, gemini $

-- CC-BY-4.0 jecxjo 2022-10-11
