Building a scalable popularity algorithm with Ruby on Rails
Recently I had to implement a popularity/hotness algorithm for our discovery pages at Ripplr. I had never done this before but I implemented fancier algorithms in the past, how hard could this be?
Harder than expected actually. All of my attempts depended, one way or another, on the current_time (e.g. total_points_in_the_last_hour
, points / time_since_created
, etc). But how can we order and read from the database in a performant way if the current_time is different each time?
Use created_at as your base score
I stumbled upon a gist of Reddit’s algorithm and some articles discussing it and a possible solution started to take shape.
At its core, what Reddit does is storing the created_at timestamp as the base score for each new post. From there, they can manipulate how the votes shift this score up or down and by how much.
This is what the most basic score method could look like:
In this example, a post with 100 upvotes will need around 16h (100 * 10 minutes) to be overtaken by a new post with 0 upvotes. Or instead, 8h to be overtaken by a new one with 50 upvotes.
The big improvement is that this score only needs to be calculated in two scenarios: when a post is created and when a vote is submitted. That means it can be saved in the database and used for querying. As a bonus, we can add an index to make reading almost instantaneously.
Benchmarking 1M records
To illustrate the benefits, I ran 4 scenarios with 1M records to see how they would perform:
A. Load all records sort them and return the top 10 results;
B. (small optimization on A) Pluck values from the database, sort them and then load the top 10 results;
C. Load the top 10 results using the score for ordering;
D. Load the top 10 results using the score for ordering (with an index).
As expected, A & B fall short of an acceptable standard in the real world. Letting the database handle the ordering and on top of that adding an index to the specific column brings the query time to less than a millisecond — making it very easy to scale from there.
We ended up shipping a variation of the method above. For our use case, we added a third variable to the score (# comments) and used a logarithm to smooth the impact of each of these variables.
Hopefully, this post helps other people explore alternative scoring techniques so that they can make their queries as fast as possible.