First...what is a bucket sort? Wikipedia has a good summary. The basic idea is to put your data into buckets, sort each bucket, then stitch the data back together. A really simple example that probably makes it clear is to imagine a list of books that you want to sort by author. Say you have 2,000 books. One way you could do that is to put all books from authors with names starting from Aa to Am in one pile, all from An to Az in another pile, and so on, for 52 piles. Then you could sort those 52 piles individually and put them back together.
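The book-pile idea can be sketched in a few lines of Python. This isn't the post's actual code (that's linked below); it's a minimal illustration that uses 26 first-letter piles instead of 52 half-letter ones, with `pile_books_by_author` being a made-up name:

```python
from collections import defaultdict

def pile_books_by_author(authors):
    """Group author names into piles keyed by first letter, sort each
    pile, then stitch the piles back together in letter order."""
    piles = defaultdict(list)
    for name in authors:
        piles[name[0].upper()].append(name)
    result = []
    for letter in sorted(piles):          # walk the piles A..Z
        result.extend(sorted(piles[letter]))  # sort within each pile
    return result

# pile_books_by_author(["Tolkien", "Austen", "Asimov"])
# → ["Asimov", "Austen", "Tolkien"]
```

Each pile is small, so sorting within a pile is cheap, and the piles are already in order relative to each other.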

Is this any faster than other sorting algorithms? It depends on the data you have and the buckets you choose. I wrote some simple code as an example. The algorithm I used is to bucket the data, run insertion sorts on the buckets, then stitch them back together. You can do this a lot of different ways. I opted to specify a factor to divide the max element by to get the number of buckets. E.g., if the max value is ~5,000 and the factor is 100, it creates ~50 buckets. What are the results?

With 1,000 ints ranging from 0 to 10,000 randomly arranged in a list, I get the following performance:

Note that the 1-bucket case is just the original list, so it shows the approximate time for a plain insertion sort. It's clear that bucketing the data speeds it up. Once you get above 1,000 buckets, you have more buckets than data, so you get no value from adding more, and the overhead of the extra buckets starts making it slower.
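The diminishing returns are easy to see without timing anything: past a point, extra buckets are just empty. A small standalone check (hypothetical parameter names, same 1,000-ints-up-to-10,000 setup as above):

```python
import random

random.seed(1)
n, max_val = 1000, 10000
data = [random.randint(0, max_val) for _ in range(n)]

for num_buckets in (10, 100, 1000, 10000):
    width = max(max_val // num_buckets, 1)   # value range per bucket
    occupied = {min(x // width, num_buckets - 1) for x in data}
    # At most min(n, num_buckets) buckets can be non-empty, so with
    # 10,000 buckets at least 9,000 are guaranteed to sit empty.
    print(f"{num_buckets:>6} buckets: {len(occupied)} non-empty")
```

Every empty bucket still has to be created and visited during the stitch-together pass, which is where the slowdown comes from.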

To make it clear how bucketing is working, I've included a plot of sample raw data, and a plot of the raw data after bucketing into 10 buckets but before sorting:

Finally...that last plot might make it obvious that an insertion sort on the bucketed data set will be pretty fast since the data are nearly sorted. Here's a final plot comparing the two algorithms:
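The "nearly sorted" intuition can be made concrete: insertion sort's work is proportional to the number of element shifts (inversions), so bucketing alone slashes the work even before any per-bucket sorting. A standalone sketch, with `insertion_sort_shifts` being a made-up instrumented variant:

```python
import random

def insertion_sort_shifts(a):
    """Insertion sort that also counts element shifts; the shift count
    equals the number of inversions, so nearly sorted input is cheap."""
    a = list(a)
    shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            shifts += 1
        a[j + 1] = key
    return a, shifts

random.seed(0)
data = [random.randint(0, 10000) for _ in range(1000)]

# Bucket into 10 buckets (width 1,000) and concatenate WITHOUT sorting
# the buckets -- this mimics the "after bucketing, before sorting" plot.
buckets = [[] for _ in range(10)]
for x in data:
    buckets[min(x // 1000, 9)].append(x)
bucketed = [x for b in buckets for x in b]

_, raw_shifts = insertion_sort_shifts(data)
_, bucketed_shifts = insertion_sort_shifts(bucketed)
print(raw_shifts, bucketed_shifts)  # bucketed needs far fewer shifts
```

Both calls produce the same sorted list; the bucketed input just gets there with an order of magnitude less shifting.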

The code I used for this is here:

source

