May 21, 2008

Selecting multiple random rows from table in rails

Some time ago I told you how to properly select a random row from a table. This time I'm gonna show how to select multiple random rows.

First of all, we can't just select one row ten times, because we usually want to be sure that there are no duplicate rows in the results. Selecting again when we hit a duplicate is not a good idea performace-wise.

There are 3 solutions to this problem and you should choose the right solution based on the number of rows in the table and distribution of rows' ids.

Small number of rows, not evenly distributed ids

First, select all ids from which you want to choose random rows. If you have no conditions, just go with:

ids = Model.find( :all, :select => 'id' ).map( &:id )

id column is indexed, so the operation is pretty fast. The only bottleneck is that when you have a lot of rows in the table you have to transfer a lot of data from the DB to the app and store that in memory. For example if you have 2000000 rows in a table you'd have to transfer about 8MB of data, which is a lot. If you have a couple of thousands of rows, this should not be a problem.
You can, of course, specify conditions for the ids, for example you can use something like:

ids = Model.find( :all, :select => 'id', :conditions => { :sponsored => true } ).map( &:id )

to choose only from among sponsored content.

Once you have the collection of IDs, it's time for the real magic to begin. Now to select 5 distinct rows at random just do:

Model.find( (1..5).map { ids.delete_at( ids.size * rand ) } )


And that's all. This whole method uses only 2 queries to the database: first one selecting all potentially selectable ids, the other one selecting the actual result.

Many rows, continuous ids

Now what do we do if we have a big table? We sure can't select all the ids first, it would take too much memory. If you know that ids in your table are continuous (from 1 to count, without any gaps) this is very easy and straightforward. Just choose 5 different numbers from 1 to count and use Model.find with an array as a parameter to select all of them at once. How to choose 5 distinct random numbers is beyond scope of this post.

Many rows, discontinuous ids

This is the worst case, as we don't know which ids are in the table and which are not and there's a lot of rows to select from. The only thing that comes to my mind is to choose 5 random numbers between 1 and count and select each row separately with:

Model.find :first, :offset => x

If you want to avoid having to pick 5 random numbers (which can be tricky!) you can also go like this:

res = []
5.times { res.push( Model.find :first, :offset => ( (Model.count-res.size) * rand ).to_i, :conditions => [ 'id not in (?)', res.map(&:id) ] ) }


So you have one query per selected row which is not very nice, but I can't come up with a solution that would be both fast and fair.

And the obvious - just pick again!

Of course there's also a very obvious solution in the 3rd case - if we pick one number twice, just pick one again. You'll usually want to select no more than 10 random rows this way, so if you have lots of rows (say, 100000) what are the odds that you'll get a double?

5 random numbers...

As far as I know there's no easy algorithm for picking k different numbers at random from range 1..n, where k <= n. I could come up with a O(k log k) solution, but it's really no fun to implement. Maybe in my next post...