In last week’s post, we constructed a set of constraints to bound a binary integer program for solving the small cell suppression problem. These constraints allow us to ensure that every group of data points which could be aggregated across in a tabular report contains either 0 or 2+ suppressed cells.
Obviously, there’s plenty of ways we could satisfy our constraints – suppressing everything, for example. But we want choose the optimal pattern of secondarily suppressed cells to minimize data loss. So, we’re going to tackle the problem using binary integer programming in PROC OPTMODEL. Strap yourself in, folks – it’s going to be an exciting ride.
Conceptualizing the Problem
As I’ve already stated, we’re going to be framing this as a binary integer programming problem – minimizing an objective function by making a set of binary decisions, subject to a set of constraints. But what are we deciding, what are we minimizing, and what are our constraints? Let’s walk through each step individually, illustrating the problem using our now-familiar data on bachelor’s, master’s and PhD holders in Wayout and Farout Counties, Alaska:
Our decisions are the easy part. We’re simply deciding yes or no on whether a cell is suppressed. We’ll represent this as a binary vector – one 1/0 option for every cell in our data. In PROC OPTMODEL terms, it looks something like this (where dec_set is just the set of numbers from 1 to the number of cells in our data set):
The objective is slightly more complicated – there’s lots of options. I’m choosing to minimize the total sum of individuals represented by the suppressed cells rather than, say, the total number of suppressed cells – you might suppress a few more data points, but they’ll be small. Of course, if I’ve got 5 or 6 options that will all suppress data on the same number of individuals, I’d still like to choose the option that suppresses the least amount of cells. So we’ll include a very small punishment for choosing extra cells. That makes our objective function look something like this:
Which brings us to our constraints. We basically just have two types of constraint. First, a primary suppression constraint – cells that are too small to be displayed must be suppressed. To pull this off, we’ll create a binary vector showing which cells have to be suppressed before we start our optimization. Then, in PROC OPTMODEL, we’ll specify that our decision vector must be greater than or equal to our primary suppression vector. Easy!
The second constraint type is our grouping constraints – the ones we built last week. If you’ll remember, we built a binary vector representing each possible set of cells that could be grouped and aggregated – the vectors had a 1 if the cell belonged in a group, and a 0 if it didn’t. To make sure every group has either 0 or 2+ suppressed cells, we simply multiply our decision vector by each grouping vector, sum up the result, and make sure the sum is either 0 or 2+. And that’s exactly where things get strange.
There’s no such thing as an either/or in a linear programming problem. You just have constraints. But these are mutually exclusive constraints! You could never satisfy both of them at the same time! So what do we do?
We’ll get around this problem with a clever trick. (This one was not discovered by a mom.) We’ll introduce another set of decision variables, one variable for each grouping constraint. If the decision variable is set to 1, we’re choosing to have 0 suppressed cells in that group. If the decision is 0, we’re choosing to have 2+ suppressed cells in that group. This gets us out of a pinch. If the we choose to suppress 2+ cells, we’ll subtract a ridiculously large number from our total in the 0 constraint, so it’s automatically satisfied. If we choose to suppress 0 cells, we’ll add a ridiculously large number to our total in the 2+ constraint, so it’s automatically satisfied. That way, the constraint we’re not worried about is automatically satisfied, and we only have to worry about the other. Then we just have to let the optimizer choose which constraint is the optimal one to satisfy, and which is the one to ignore! It looks something like this (where con_set is a list of numbers from 1 to the number of grouping constraints in our data):
That’s it. That’s how we conceptualize the problem. Now, we just need to turn PROC OPTMODEL loose on it! The following code will take care of building constraints, and optimizing suppression on the dummy data shown above… Of course, a lot of things are hard-coded that shouldn’t be, but it’s always good to read non-macro code before you try the macro version.
It’s always better to use a macro in these sorts of situations. That way, you can easily re-use your code in a dozen different scenarios, without having to re-write anything. The following code is readable and well-commented, so I won’t spend too much time explaining it. The last line shows how to actually run the macro against the test data listed earlier. You know, if you’re into that sort of thing.
Of course, you might want to try this out on a slightly more beefy data set. As we did with the heuristic approach to suppression, we’ll illustrate by suppressing some data on the most common first and last names in North Carolina… You can download the data in CSV format here. And you can use the following code to try it out! (Modify the file location to suit your needs.)
So, that’s it! We’ve accomplished statistical disclosure control in a SAS macro by finding an optimal secondary suppression pattern using binary integer programming. Quite a mouthful! In next week’s final post on cell suppression, we’ll talk about the relative performance of the optimal algorithm vs. the heuristic approach, some optimizations to keep the algorithm from blowing up your computer even with large data sets, and other fun odds and ends.
In the meantime, please check out my code and feel free to ask any questions or make any suggestions/corrections. I’d love to hear from you!