Weighted Randomness using PHP and MySQL

So when making my Fake Name Generator, I ran into the problem of how to select a random record, but still have it weighted so some records show up more often than others.

My first step was to prepare the database. My table looked something like this:

Rank (1 = most common, 2 = second most common…) Name Percentage (For example, 1.006% of the sample had the last name Smith, so this column would equal 1.006) Cumulative Percentage (This record’s percentage and the percentage of all of the higher ranking records added together)

So the goal was to have a query that would somehow randomly select a record and weight it according to the percentage column.

So the next step was to search Google. The results sucked. Horribly. The most common example involved using ORDER BY rand()*(1/PercentageColumn) LIMIT 1. Yeah, that works, but not well. Essentially this requires your DBMS to sort the entire table every time you want ONE row. For a small table of a few thousand records, that’s fine. But my table has 88,799 records. Sorting 88,799 records takes a second or two, which is unacceptable when you have to perform this query at least 4 times to get a single full fake name.

The second most common example involved duplicate records. First you’d need whole numbers, so you’d multiply the percentage column by 1,000. Then you’d duplicate records based on the percentage. For example, Loomis, which occurs .002%, would then have the percentage column equal 2, and so you’d need 2 records in your table to represent Loomis. Again, this works with small tables, but that would require my table to have 866,620 records. So what’d I do? I went to my dad of course. :o)

My dad is an expert at everything database. Seriously. He is a genius (MCSE = Marty Can Solve Everything). And he had a GENIUS idea.

Here is what we did (using 1990 US Census data):

Added start and finish columns. We’ll talk about these later.
Some names had a percentage of 0.000 because they are extremely uncommon. We took those names and changed the percentage to 0.0001.
Updated each percentage column so percentage = percentage * 10,000.
Ran a script (using PHP) to get a sum of the finish column for all records in the table.
Set the start column of current record so start = sum(finish) + 1.
Set the finish column of current record so finish = sum(finish) + percentage.
Repeat for each record.

So the first record is Smith, which has a percentage of 10060. We do a sum on finish for the table, and get 0. So start = 0 + 1, or in other words, start = 1. Finish = 0 + 10060, or in other words, finish = 10060.

Second record is Johnson, which has a percentage of 8100. We do a sum on finish for the table, and get 10060. So start = 10060 + 1, or 10061. Finish = 10060 + 8100, or 18160.

What good does this do, you ask? Easy: We select max(finish) from the table. We then use PHP to generate a random number from 1 to max(finish). Suppose the number is 15684. We then do a select statement that does WHERE start <= 15684 AND finish >= 15684. The only row that should satisfy that query is Johnson.

It takes a bit to setup, but once its there it is incredibly easy to use, and the query are simple enough that there is little impact on your server.

Just thought this might help another person struggling with the limitations of MySQL’s rand().