In last week’s post, we constructed a set of constraints to bound a binary integer program for solving the small cell suppression problem. These constraints ensure that every group of cells that could be aggregated in a tabular report contains either zero or at least two suppressed cells.
At some point before age five, every kid masters the art of satisfying constraints with solutions that are hilariously non-optimal.
Obviously, there are plenty of ways we could satisfy our constraints – suppressing everything, for example. But we want to choose the optimal pattern of secondarily suppressed cells to minimize data loss. So, we’re going to tackle the problem using binary integer programming in PROC OPTMODEL. Strap yourself in, folks – it’s going to be an exciting ride.
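To make the optimization idea concrete, here is a minimal sketch in Python rather than PROC OPTMODEL. It is not the code from the post: the 3x3 table, the threshold of 5, and the choice to measure "data loss" as the sum of suppressed counts are all assumptions made for illustration. Each cell gets a 0/1 decision, and because the table is tiny we can simply try every combination instead of calling a real integer programming solver.

```python
from itertools import product

# Hypothetical 3x3 table of cell counts (made up for illustration).
# Each row and each column is a group a reader could aggregate across.
counts = [
    [1, 12, 9],
    [7, 15, 8],
    [6, 11, 14],
]
THRESHOLD = 5  # assumed: nonzero counts below this must be suppressed

n_rows, n_cols = len(counts), len(counts[0])
cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
primary = {(r, c) for r, c in cells if 0 < counts[r][c] < THRESHOLD}


def is_feasible(suppressed):
    """Every row and column must hold either 0 or at least 2 suppressed cells."""
    for r in range(n_rows):
        if sum((r, c) in suppressed for c in range(n_cols)) == 1:
            return False
    for c in range(n_cols):
        if sum((r, c) in suppressed for r in range(n_rows)) == 1:
            return False
    return True


def solve():
    """Enumerate all 0/1 assignments and keep the feasible one with least loss."""
    best, best_loss = None, None
    for bits in product((0, 1), repeat=len(cells)):
        suppressed = {cell for cell, bit in zip(cells, bits) if bit}
        # Primary suppressions are mandatory; the group rule must also hold.
        if not primary <= suppressed or not is_feasible(suppressed):
            continue
        # Assumed objective: total count hidden from the reader.
        loss = sum(counts[r][c] for r, c in suppressed)
        if best_loss is None or loss < best_loss:
            best, best_loss = suppressed, loss
    return best, best_loss


best, best_loss = solve()
print(sorted(best), best_loss)
```

On this toy table the search settles on the four-cell rectangle (0,0), (0,2), (1,0), (1,2), hiding a total count of 25. Enumeration only works because there are 2^9 = 512 combinations to check; a real table needs a proper solver, which is where PROC OPTMODEL comes in.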
In last week’s post we built a SAS macro that found acceptable solutions to the small cell suppression problem using a simple heuristic approach. But what if acceptable isn’t good enough? What if you want perfection? Well, then, you’re in luck!
Benjamin Franklin once attempted to become morally perfect. Too bad he didn’t have PROC OPTMODEL…
I’ve blogged previously about optimization with linear programming in SAS PROC OPTMODEL, and it turns out that the cell suppression problem is another kind of problem that can be tackled using this approach. (If you’re unfamiliar with linear programming, check out the linear programming Wikipedia article to get up to speed.) Over the next two posts, we’ll set up a SAS macro that builds the constraints necessary to bound our optimization problem, then implement the actual optimization code in PROC OPTMODEL.
Often, complex problems can be adequately solved by simple rules that provide an acceptable solution, even if they don’t necessarily get you to the optimal point. The cell suppression problem (summarized in last week’s post) is a perfect example of this – using a methodology that would be readily apparent to any human faced with tackling the problem with pen and paper, we can create a computerized solution that can appropriately suppress data sets containing tens of thousands of records disaggregated over dozens of dimensions. This heuristic method will likely suppress more data than it really needs to, but when all is said and done, it will finish the job quickly and without completely mangling your statistics.
Heuristics are kind of like Fermi estimation. Or, more accurately, I needed an image for this post and this was the best thing I could come up with.
Image credit: XKCD
We’ll start with an explanation of the basic idea, then move on to implementing it in code.
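As a rough preview of what a pen-and-paper rule might look like, here is a Python sketch. It is a guess at the general flavor, not the SAS macro from the post: the rule used here (whenever a row or column has exactly one suppressed cell, also suppress the smallest remaining cell in that group, and repeat until nothing changes), the function name `heuristic_suppress`, the table, and the threshold are all assumptions for illustration.

```python
def heuristic_suppress(counts, threshold=5):
    """Greedy secondary suppression on a 2-D table of counts.

    Assumed rule: any row or column with exactly one suppressed cell gets
    its smallest unsuppressed cell suppressed too; repeat until stable.
    Assumes every row and column has at least two cells.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    # Primary suppressions: nonzero counts below the threshold.
    suppressed = {
        (r, c)
        for r in range(n_rows)
        for c in range(n_cols)
        if 0 < counts[r][c] < threshold
    }
    # Groups a reader could aggregate across: each row, then each column.
    groups = [[(r, c) for c in range(n_cols)] for r in range(n_rows)]
    groups += [[(r, c) for r in range(n_rows)] for c in range(n_cols)]

    changed = True
    while changed:
        changed = False
        for group in groups:
            hidden = [cell for cell in group if cell in suppressed]
            if len(hidden) == 1:
                candidates = [cell for cell in group if cell not in suppressed]
                extra = min(candidates, key=lambda rc: counts[rc[0]][rc[1]])
                suppressed.add(extra)
                changed = True
    return suppressed


# Hypothetical table; only the top-left cell falls below the threshold.
table = [
    [1, 12, 9],
    [7, 15, 8],
    [6, 11, 14],
]
result = heuristic_suppress(table)
print(sorted(result))
```

On this toy table the sketch ends up hiding seven of the nine cells, even though the four-cell rectangle (0,0), (0,2), (1,0), (1,2) would satisfy the same rule. That is the trade-off described above: the rule is quick and always lands on a valid pattern, but it can suppress more than it needs to.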
The “cell suppression problem” is one type of “statistical disclosure control” in which a researcher must hide certain values in tabular reports in order to protect sensitive personal (or otherwise protected) information. For instance, suppose Wayout County, Alaska has only one resident with a PhD – we’ll call her “Jane.” Some economist comes in to do a study of the value of higher education in rural areas, and publishes a list of average salaries disaggregated by county and level of education. Whoops! The average salary for people with PhDs in Wayout County is just Jane’s salary. That researcher has just disclosed Jane’s personal information to the world, and anybody who happens to know her now knows how much money she makes. “Suppressing” or hiding the value of that cell in the report table would have saved a lot of trouble!
No, not that kind of suppression.
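The Wayout County example can be sketched in a few lines of Python. Everything here is made up for illustration: the salary figures, the second county ("Farout"), the function name `salary_report`, and the minimum cell size of 3 are assumptions, not values from any real report. The point is only that an "average" over a one-person cell is that person's salary, so the report blanks out any cell with too few people in it.

```python
from collections import defaultdict

# Made-up records: (county, education level, salary).
records = [
    ("Wayout", "PhD", 98000),        # "Jane", the only PhD in the county
    ("Wayout", "Bachelors", 52000),
    ("Wayout", "Bachelors", 61000),
    ("Wayout", "Bachelors", 58000),
    ("Farout", "PhD", 91000),
    ("Farout", "PhD", 105000),
    ("Farout", "PhD", 99000),
]
MIN_CELL_SIZE = 3  # assumed disclosure threshold


def salary_report(records, min_cell_size=MIN_CELL_SIZE):
    """Average salary per (county, education) cell; None marks a suppressed cell."""
    groups = defaultdict(list)
    for county, education, salary in records:
        groups[(county, education)].append(salary)

    report = {}
    for key, salaries in groups.items():
        if len(salaries) < min_cell_size:
            report[key] = None  # too few people: publishing would disclose
        else:
            report[key] = sum(salaries) / len(salaries)
    return report


report = salary_report(records)
for (county, education), average in sorted(report.items()):
    shown = "(suppressed)" if average is None else f"{average:,.0f}"
    print(f"{county:8} {education:10} {shown}")
```

This only handles the obvious, primary suppression. If the report also published county totals, a reader could recover Jane's cell by subtraction, which is what makes the full problem tricky.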
Over the next couple of weeks, I’ll be blogging about some algorithms used to solve the cell suppression problem, and showing how to implement them in code. For now, we’re going to start with an introduction to the intricacies of the problem.