Over the last several weeks, I’ve blogged about two different methods for solving the small cell suppression problem using SAS macro code. In the first, we used a heuristic approach to find a solution that was workable but not necessarily optimal. In the second, we solved the problem to proven optimality with SAS PROC OPTMODEL. But all of this leaves a few open questions…
For example, how much better is the optimal approach than the heuristic? Is there ever a reason not to prefer the optimal approach? And what are some other improvements and techniques that a researcher using these macros might want to know about? I’ll spend this post reflecting on our two solutions and covering a few of these bases.
When I originally wrote these macros, the first question on my mind was “how far from optimal is the heuristic, anyway?” So let’s take a look at that, using the data on some of the most commonly used first and last names in North Carolina that we’ve used as an example in several of these posts.
There are, of course, several ways you could measure performance, but I’ll stick with looking at the total number of individuals represented in the suppressed cells, since that’s what we used as our objective function for the optimal algorithm. So, let’s pump our test data through both algorithms and see how they do! (Note: this code assumes you’ve already submitted the macro code for each algorithm…)
I was actually shocked by the results. The total number of individuals represented by this data set (including double-counting for each level of aggregation) was 440,244. The heuristic algorithm suppressed data on 4,755 individuals (again double-counting). The optimal algorithm suppressed data on 4,726. In other words, the optimal algorithm suppressed 29 fewer individuals, which makes it roughly 0.6% better than the heuristic. Frankly, not really much of an improvement.
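For anyone who wants to check the arithmetic, here is a quick sketch in Python (not SAS, and not part of the macros). It uses only the three totals quoted above; the variable names are made up for illustration.

```python
# Totals quoted in the post (all include double-counting across aggregation levels).
total_individuals = 440_244
heuristic_suppressed = 4_755
optimal_suppressed = 4_726

# Absolute and relative improvement of the optimal solution over the heuristic.
saved = heuristic_suppressed - optimal_suppressed
relative_improvement_pct = 100 * saved / heuristic_suppressed

# Share of all individuals that each approach ends up suppressing.
heuristic_share_pct = 100 * heuristic_suppressed / total_individuals
optimal_share_pct = 100 * optimal_suppressed / total_individuals

print(f"Individuals saved by the optimal approach: {saved}")
print(f"Relative improvement: {relative_improvement_pct:.2f}%")
print(f"Share suppressed (heuristic): {heuristic_share_pct:.2f}%")
print(f"Share suppressed (optimal): {optimal_share_pct:.2f}%")
```

Either way you slice it, both approaches suppress a little over 1% of the individuals in the table.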
Of course, it could be that there would be a starker difference with “real world” data sets. To truly assess performance, you’d need to try these algorithms on a decent-sized sample of real data sets and see how they compare. I’m not going to do all that heavy lifting here, but I’d love to hear back from anybody who’s tried it out!
Choosing an Algorithm
I’m sure many people are wondering why we would even think about using the heuristic algorithm when an optimal approach is available to us. However, there are two good reasons that you might want to think about using the heuristic approach.
- SAS/OR is expensive, so unless you already have a license, you may not want to drop the cash for it just to get access to PROC OPTMODEL.
- The optimal approach is computationally expensive. Even on a behemoth of a workstation, I’ve run into suppression problems that PROC OPTMODEL couldn’t solve because the amount of RAM required was simply too great. In those situations, your only options are to reduce the size of the problem (see below) or go the non-optimal route. The advantage of the heuristic is that, if your data can be sorted by PROC SORT, it can be suppressed by that macro. You’d have to get a really big data set for that not to work out.
With that in mind, using the optimal approach makes sense if you have ready access to PROC OPTMODEL, a small enough problem that the system can handle it, and a real need to suppress as little data as possible. The heuristic approach makes sense if you’re bounded by available cash or available computational resources, or you simply don’t need perfection.
Of course, there are lots of ways to think about tweaking and improving these models. One of the most obvious is to reduce the complexity of the problems you submit to the optimal macro, so that PROC OPTMODEL doesn’t blow up on you. There are two ways to accomplish this.
- You can split the data set being suppressed into each of its various “levels” of aggregation (think _TYPE_ in PROC MEANS results when a CLASS statement is used). The lowest level of aggregation will still be a fairly large problem, but you’ll definitely buy yourself some space by removing the other levels.
- You can also split the data by any dimension you don’t aggregate across. For example, when I worked at the North Carolina Department of Commerce, I suppressed the data for www.nctower.com, a website that reported on the post-graduation employment outcomes of the State’s college and university students. We let you look at outcomes for individual campuses and subject areas, but we also let you aggregate across campuses and subject areas. (In other words, you could look at the salary of everyone who got a bachelor’s degree at NC State, or of those who got a bachelor’s degree in, say, mechanical engineering at NC State.) But the one thing we didn’t let you aggregate across was type of degree. You could never see the average salary of all graduates lumped together; you had to pick a type of degree (bachelor’s vs. master’s vs. doctorate). This meant that we could break our data set into pieces based on degree type, one for each type, and then suppress each piece separately. This made each individual optimization problem substantially smaller.
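The second idea can be sketched in a few lines. The example below is Python rather than SAS, purely to illustrate the shape of it. The field names (`degree`, `campus`, `n`), the threshold of 10, and the `suppress_piece` function are hypothetical stand-ins, and `suppress_piece` only flags small cells rather than doing the real work of either macro.

```python
from collections import defaultdict

def split_by(rows, key):
    """Group rows into separate lists, one per distinct value of `key`."""
    pieces = defaultdict(list)
    for row in rows:
        pieces[row[key]].append(row)
    return dict(pieces)

def suppress_piece(rows, threshold=10):
    """Hypothetical stand-in for a real suppression routine.

    It only flags cells with fewer than `threshold` individuals (primary
    suppression); a real routine would also pick complementary cells.
    """
    return [{**row, "suppressed": row["n"] < threshold} for row in rows]

# Made-up cells: counts of graduates by degree type and campus.
cells = [
    {"degree": "bachelor", "campus": "A", "n": 120},
    {"degree": "bachelor", "campus": "B", "n": 4},
    {"degree": "master", "campus": "A", "n": 35},
    {"degree": "master", "campus": "B", "n": 7},
    {"degree": "doctorate", "campus": "A", "n": 12},
]

# Because nothing is ever aggregated across degree types, each piece is an
# independent, smaller problem that can be suppressed on its own.
pieces = split_by(cells, "degree")
suppressed = [row for piece in pieces.values() for row in suppress_piece(piece)]

for row in suppressed:
    print(row["degree"], row["campus"], "suppressed" if row["suppressed"] else row["n"])
```

The same grouping step would apply to the first idea, keyed on the aggregation level instead of the degree type.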
Lots of other tweaks could be made in terms of objective functions, other constraints, etc. For example, rather than only requiring that a cell’s value can’t be determined exactly, you might require that it can’t be mathematically narrowed down to within some tolerance of the original value, say, 5 individuals. You could also require that a cell not be reported when it is based on some small external factor, even if it has a large N (for example, not reporting the number of construction workers in a county if there is only one construction firm in the county, no matter how many employees that firm has). The list goes on indefinitely.
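To make the tolerance idea a bit more concrete, here is a rough Python sketch (again, not SAS and not part of the macros). It makes several simplifying assumptions of its own: it looks at a single row with a published total, ignores column totals and every other constraint, and assumes an outsider knows only that counts are non-negative. The function names and the example numbers are hypothetical.

```python
def feasible_range(row_total, published, n_suppressed):
    """Range an outsider can infer for any one suppressed cell in a row.

    Assumes the outsider knows the row total, the published cells, and that
    counts are non-negative. Column totals are ignored.
    """
    remainder = row_total - sum(published)
    if n_suppressed == 1:
        return remainder, remainder  # the lone hidden cell is fully exposed
    return 0, remainder

def is_protected(value, row_total, published, n_suppressed, tolerance=5):
    """True if the inferable range covers value +/- tolerance (floored at 0)."""
    low, high = feasible_range(row_total, published, n_suppressed)
    return low <= max(0, value - tolerance) and value + tolerance <= high

# Made-up row: total of 100, published cells 60 and 25, hidden cells 3 and 12.
row_total, published, hidden = 100, [60, 25], [3, 12]

for value in hidden:
    low, high = feasible_range(row_total, published, len(hidden))
    ok = is_protected(value, row_total, published, len(hidden))
    print(f"cell={value}: inferable range [{low}, {high}], protected={ok}")
```

In this row neither hidden cell can be pinned down exactly, so both would pass an exact-disclosure check. The cell with 12 individuals still fails a tolerance of 5, because an outsider can tell it is at most 15.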
Obviously, I won’t go into all of these possibilities here, but the framework is available. Tweak the problem to fit your individual needs, and modify the code to suit!
I hope you’ve found this series of posts on small cell suppression helpful. I’d love to know if you’re using the macros, any ways you’ve improved or built upon the ideas, or anything else you find interesting.
As for me, I’m on to other things!