Weighted Randomness using PHP and MySQL

So when making my Fake Name Generator, I ran into the problem of how to select a random record, but still have it weighted so some records show up more often than others.

My first step was to prepare the database. My table looked something like this:

Rank (1 = most common, 2 = second most common…)
Name
Percentage (For example, 1.006% of the sample had the last name Smith, so this column would equal 1.006)
Cummulative Percentage (This record’s percentage and the percentage of all of the higher ranking records added together)

So the goal was to have a query that would somehow randomly select a record and weight it according to the percentage column.

So the next step was to search Google. The results sucked. Horribly. The most common example involved using ORDER BY rand()*(1/PercentageColumn) LIMIT 1. Yeah, that works, but not well. Essentially this requires your DBMS to sort the entire table every time you want ONE row. For a small table of a few thousand records, thats fine. But my table has 88,799 records. Sorting 88,799 records takes a second or two, which is unacceptable when you have to perform this query at least 4 times to get a single full fake name.

The second most common example involved duplicate records. First you’d need whole numbers, so you’d multiply the percentage column by 1,000. Then you’d duplicate records based on the percentage. For example, Loomis, which occurs .002%, would then have the percentage column equal 2, and so you’d need 2 records in your table to represent Loomis. Again, this works with small tables, but that would require my table to have 866,620 records.
So what’d I do? I went to my dad of course. :o)

My dad is an expert at everything database. Seriously. He is a genius (MCSE = Marty Can Solve Everything). And he had a GENIUS idea.

Here is what we did (using 1990 US Census data):

  1. Added start and finish columns. We’ll talk about these later.
  2. Some names had a percentage of 0.000 because they are extremely uncommon. We took those names and changed the percentage to 0.0001.
  3. Updated each percentage column so percentage = percentage * 10,000.
  4. Ran a script (using PHP) to get a sum of the finish column for all records in the table.
  5. Set the start column of current record so start = sum(finish) + 1.
  6. Set the finish column of current record so finish = sum(finish) + percentage.
  7. Repeat for each record.

So the first record is Smith, which has a percentage of 10060. We do a sum on finish for the table, and get 0. So start = 0 + 1, or in other words, start = 1. Finish = 0 + 10060, or in other words, finish = 10060.

Second record is Johnson, which has a percentage of 8100. We do a sum on finish for the table, and get 10060. So start = 10060 + 1, or 10061. Finish = 10060 + 8100, or 18160.

What good does this do, you ask? Easy: We select max(finish) from the table. We then use PHP to generate a random number from 1 to max(finish). Suppose the number is 15684. We then do a select statement that does WHERE start <= 15684 AND finish >= 15684. The only row that should satisfy that query is Johnson.

It takes a bit to setup, but once its there it is incredibly easy to use, and the query are simple enough that there is little impact on your server.

Just thought this might help another person struggling with the limitations of MySQL’s rand().

3 comments

  1. By far, this single page and your willingness to share has the potential of being the best part of my day.

    I only have two questions that can help fully explain this to me.

    1. Did you set/pick the percentage of each name yourself, or is there some math you applied to each name/record that I’m missing?

    2. Why are you using the number 10,000? Could I just as easily use the number 100?

    Again, thanks for the info so far. You’ve been a wonderful help as I’ve been searching for this information all day with nothing worth talking about … until now. ;~)

  2. @DECODV:

    1. I used data provided by the US Census. The percentages were included with the data.

    2. I used 10,000 to get nice whole numbers. 1,000 probably would’ve made more sense (1.006 * 1,000 = 1006, as opposed to 1.006 * 10,000 = 10060). You could use 100 (1.006 * 100 = 100.6), but then you’ll have to deal with floats (decimals), which takes a bit more effort than plain old integers and defeats the purpose of multiplying the percentage.

    Hope this helps!

  3. Dude, you rock! Thanks for both the info and prompt response. You just made my day.

Leave a Reply

Your email address will not be published. Required fields are marked *