Weighted Randomness using PHP and MySQL

So when mak­ing my Fake Name Gen­er­a­tor, I ran into the prob­lem of how to select a ran­dom record, but still have it weighted so some records show up more often than others.

My first step was to pre­pare the data­base. My table looked some­thing like this:

Rank (1 = most com­mon, 2 = sec­ond most com­mon…)
Name
Per­cent­age (For exam­ple, 1.006% of the sam­ple had the last name Smith, so this col­umn would equal 1.006)
Cum­mu­la­tive Per­cent­age (This record’s per­cent­age and the per­cent­age of all of the higher rank­ing records added together)

So the goal was to have a query that would some­how ran­domly select a record and weight it accord­ing to the per­cent­age column.

So the next step was to search Google. The results sucked. Hor­ri­bly. The most com­mon exam­ple involved using ORDER BY rand()*(1/PercentageColumn) LIMIT 1. Yeah, that works, but not well. Essen­tially this requires your DBMS to sort the entire table every time you want ONE row. For a small table of a few thou­sand records, thats fine. But my table has 88,799 records. Sort­ing 88,799 records takes a sec­ond or two, which is unac­cept­able when you have to per­form this query at least 4 times to get a sin­gle full fake name.

The sec­ond most com­mon exam­ple involved dupli­cate records. First you’d need whole num­bers, so you’d mul­ti­ply the per­cent­age col­umn by 1,000. Then you’d dupli­cate records based on the per­cent­age. For exam­ple, Loomis, which occurs .002%, would then have the per­cent­age col­umn equal 2, and so you’d need 2 records in your table to rep­re­sent Loomis. Again, this works with small tables, but that would require my table to have 866,620 records.
So what’d I do? I went to my dad of course. :o)

My dad is an expert at every­thing data­base. Seri­ously. He is a genius (MCSE = Marty Can Solve Every­thing). And he had a GENIUS idea.

Here is what we did (using 1990 US Cen­sus data):

  1. Added start and fin­ish columns. We’ll talk about these later.
  2. Some names had a per­cent­age of 0.000 because they are extremely uncom­mon. We took those names and changed the per­cent­age to 0.0001.
  3. Updated each per­cent­age col­umn so per­cent­age = per­cent­age * 10,000.
  4. Ran a script (using PHP) to get a sum of the fin­ish col­umn for all records in the table.
  5. Set the start col­umn of cur­rent record so start = sum(finish) + 1.
  6. Set the fin­ish col­umn of cur­rent record so fin­ish = sum(finish) + percentage.
  7. Repeat for each record.

So the first record is Smith, which has a per­cent­age of 10060. We do a sum on fin­ish for the table, and get 0. So start = 0 + 1, or in other words, start = 1. Fin­ish = 0 + 10060, or in other words, fin­ish = 10060.

Sec­ond record is John­son, which has a per­cent­age of 8100. We do a sum on fin­ish for the table, and get 10060. So start = 10060 + 1, or 10061. Fin­ish = 10060 + 8100, or 18160.

What good does this do, you ask? Easy: We select max(finish) from the table. We then use PHP to gen­er­ate a ran­dom num­ber from 1 to max(finish). Sup­pose the num­ber is 15684. We then do a select state­ment that does WHERE start <= 15684 AND fin­ish >= 15684. The only row that should sat­isfy that query is Johnson.

It takes a bit to setup, but once its there it is incred­i­bly easy to use, and the query are sim­ple enough that there is lit­tle impact on your server.

Just thought this might help another per­son strug­gling with the lim­i­ta­tions of MySQL’s rand().

This entry was posted in Random. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

3 Comments

  1. Posted January 12, 2007 at 11:31 pm | Permalink

    By far, this sin­gle page and your will­ing­ness to share has the poten­tial of being the best part of my day.

    I only have two ques­tions that can help fully explain this to me.

    1. Did you set/pick the per­cent­age of each name your­self, or is there some math you applied to each name/record that I’m missing?

    2. Why are you using the num­ber 10,000? Could I just as eas­ily use the num­ber 100?

    Again, thanks for the info so far. You’ve been a won­der­ful help as I’ve been search­ing for this infor­ma­tion all day with noth­ing worth talk­ing about … until now. ;~)

  2. Posted January 12, 2007 at 11:40 pm | Permalink

    @DECODV:

    1. I used data pro­vided by the US Cen­sus. The per­cent­ages were included with the data.

    2. I used 10,000 to get nice whole num­bers. 1,000 prob­a­bly would’ve made more sense (1.006 * 1,000 = 1006, as opposed to 1.006 * 10,000 = 10060). You could use 100 (1.006 * 100 = 100.6), but then you’ll have to deal with floats (dec­i­mals), which takes a bit more effort than plain old inte­gers and defeats the pur­pose of mul­ti­ply­ing the percentage.

    Hope this helps!

  3. Posted January 12, 2007 at 11:43 pm | Permalink

    Dude, you rock! Thanks for both the info and prompt response. You just made my day.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>