Performance Profiling and Collection.retainAll

I’ve been working on optimising a very CPU intensive algorithm recently, trying to squeeze out as much straight-line performance as possible. Now the first principle of performance optimisation is to “profile before you optimise” to ensure you spend your time on high reward optimisations, so I’ve been using the excellent (but exceedingly boringly named) YourKit Java Profiler to find major problem areas.

One of the methods that showed up prominently were many calls to Collection.retainAll(Collection). This method, not commonly used, removes any elements from the target collection that are not in the passed collection. The algorithm was using ArrayLists for both collections which I figured would be massively inefficient, as it would require the entire collection (worst case) to be iterated for each element in the “retain” collection.

As an experiment I changed the target collection (the one on which the “retainAll” method is called) to a HashSet instead… and got the same result! That was unexpected, I was sure that would have improved things, after all weren’t we now looking up elements using a hash rather than iteration???

So I wrote a little tester which simulated retaining approximately 90% of the elements from a large String collection:


        Collection<String> things = new ArrayList<String>(NUM);
        Collection<String> thingsToKeep = new ArrayList<String>(NUM);

        populateCollections(things, thingsToKeep);

        long startTime = System.currentTimeMillis();
        things.retainAll(thingsToKeep);
        System.out.println("retainAll call took " + (System.currentTimeMillis() - startTime) + " milliseconds");

Running this yielded an average retainAll time of 750ms. Changing the target collection (”things”) to a HashSet…


        Collection<String> things = new HashSet<String>(NUM);
        Collection<String> thingsToKeep = new ArrayList<String>(NUM);

… yielded exactly the same results. Finally I decided to change the “thingsToKeep” to a HashSet as well to see what difference (if any) there would be…


        Collection<String> things = new HashSet<String>(NUM);
        Collection<String> thingsToKeep = new HashSet<String>(NUM);

… and low and behold the output:

retainAll call took 16 milliseconds

That’s a 97.9% improvement! Looking at the code for retainAll, it’s no surprise… it iterates over the target collection (”things”), looking up each element in the passed collection (”thingsToKeep”) with a call to “contains”. So changing the target Collection type doesn’t make a skerrick of difference… it’s the passed collection type that matters.

Morals of the story:

  1. If using retainAll on large collections, make sure the passed collection supports fast lookups (e.g.: a HashSet). The target collection type doesn’t matter.
  2. Choosing the correct collection type for the job at hand really matters! Think carefully before choosing your collection type!

Hello blog world

Despite being a software developer I’ve never been particularly ahead of the curve when it comes to internet trends. I’ve long written off blogging as a narcissistic indulgence, mostly of those with little in the way of writing talent and even less in the way of interesting things to say (no I don’t care who your favourite “Idol” is!).

Well blogging has definitely come of age now, and it seems that anyone who’s anyone has their own blog these days, and there’s even a few that are worth reading. Maybe I’m getting vain, but I’ve finally taken the plunge and got a blog of my very own. My only hope is that someone, somewhere, sometime might find something of interest spewing forth from these pages.

Ah… immortality at last!