Optimal Cell Suppression in SAS – Final Macro

In last week’s post, we constructed a set of constraints to bound a binary integer program for solving the small cell suppression problem. These constraints allow us to ensure that every group of data points which could be aggregated across in a tabular report contains either 0 or 2+ suppressed cells.

Cop-out test answer.

At some point before age five, every kid masters the art of satisfying constraints with solutions that are hilariously non-optimal.

Obviously, there’s plenty of ways we could satisfy our constraints – suppressing everything, for example. But we want choose the optimal pattern of secondarily suppressed cells to minimize data loss. So, we’re going to tackle the problem using binary integer programming in PROC OPTMODEL. Strap yourself in, folks – it’s going to be an exciting ride.

Conceptualizing the Problem

As I’ve already stated, we’re going to be framing this as a binary integer programming problem – minimizing an objective function by making a set of binary decisions, subject to a set of constraints. But what are we deciding, what are we minimizing, and what are our constraints? Let’s walk through each step individually, illustrating the problem using our now-familiar data on bachelor’s, master’s and PhD holders in Wayout and Farout Counties, Alaska:

Our decisions are the easy part. We’re simply deciding yes or no on whether a cell is suppressed. We’ll represent this as a binary vector – one 1/0 option for every cell in our data. In PROC OPTMODEL terms, it looks something like this (where dec_set is just the set of numbers from 1 to the number of cells in our data set):

The objective is slightly more complicated – there’s lots of options. I’m choosing to minimize the total sum of individuals represented by the suppressed cells rather than, say, the total number of suppressed cells – you might suppress a few more data points, but they’ll be small. Of course, if I’ve got 5 or 6 options that will all suppress data on the same number of individuals, I’d still like to choose the option that suppresses the least amount of cells. So we’ll include a very small punishment for choosing extra cells. That makes our objective function look something like this:

Which brings us to our constraints. We basically just have two types of constraint. First, a primary suppression constraint – cells that are too small to be displayed must be suppressed. To pull this off, we’ll create a binary vector showing which cells have to be suppressed before we start our optimization. Then, in PROC OPTMODEL, we’ll specify that our decision vector must be greater than or equal to our primary suppression vector. Easy!

The second constraint type is our grouping constraints – the ones we built last week. If you’ll remember, we built a binary vector representing each possible set of cells that could be grouped and aggregated – the vectors had a 1 if the cell belonged in a group, and a 0 if it didn’t. To make sure every group has either 0 or 2+ suppressed cells, we simply multiply our decision vector by each grouping vector, sum up the result, and make sure the sum is either 0 or 2+. And that’s exactly where things get strange.

There’s no such thing as an either/or in a linear programming problem. You just have constraints. But these are mutually exclusive constraints! You could never satisfy both of them at the same time! So what do we do?

We’ll get around this problem with a clever trick. (This one was not discovered by a mom.) We’ll introduce another set of decision variables, one variable for each grouping constraint. If the decision variable is set to 1, we’re choosing to have 0 suppressed cells in that group. If the decision is 0, we’re choosing to have 2+ suppressed cells in that group. This gets us out of a pinch. If the we choose to suppress 2+ cells, we’ll subtract a ridiculously large number from our total in the 0 constraint, so it’s automatically satisfied. If we choose to suppress 0 cells, we’ll add a ridiculously large number to our total in the 2+ constraint, so it’s automatically satisfied. That way, the constraint we’re not worried about is automatically satisfied, and we only have to worry about the other. Then we just have to let the optimizer choose which constraint is the optimal one to satisfy, and which is the one to ignore! It looks something like this (where con_set is a list of numbers from 1 to the number of grouping constraints in our data):

That’s it. That’s how we conceptualize the problem. Now, we just need to turn PROC OPTMODEL loose on it! The following code will take care of building constraints, and optimizing suppression on the dummy data shown above… Of course, a lot of things are hard-coded that shouldn’t be, but it’s always good to read non-macro code before you try the macro version.

Macro Time

It’s always better to use a macro in these sorts of situations. That way, you can easily re-use your code in a dozen different scenarios, without having to re-write anything. The following code is readable and well-commented, so I won’t spend too much time explaining it. The last line shows how to actually run the macro against the test data listed earlier. You know, if you’re into that sort of thing.

Of course, you might want to try this out on a slightly more beefy data set. As we did with the heuristic approach to suppression, we’ll illustrate by suppressing some data on the most common first and last names in North Carolina… You can download the data in CSV format here. And you can use the following code to try it out! (Modify the file location to suit your needs.)

Conclusion

So, that’s it! We’ve accomplished statistical disclosure control in a SAS macro by finding an optimal secondary suppression pattern using binary integer programming. Quite a mouthful! In next week’s final post on cell suppression, we’ll talk about the relative performance of the optimal algorithm vs. the heuristic approach, some optimizations to keep the algorithm from blowing up your computer even with large data sets, and other fun odds and ends.

In the meantime, please check out my code and feel free to ask any questions or make any suggestions/corrections. I’d love to hear from you!

5 Responses

  1. Rob Pratt February 22, 2016 / 10:04 pm

    This approach is interesting. Unfortunately, the use of the “ridiculously large” number 100000000000 is asking for numerical trouble and in this case hides the fact that the problem is actually infeasible. When I ran this using SAS/OR 14.1 (the latest production release), the MILP presolver detected the infeasibility:

    NOTE: The MILP presolver value AUTOMATIC is applied.
    NOTE: Constraint two_more_con[33] causes the problem to be infeasible.
    NOTE: The MILP presolver found this problem to be infeasible.

    So let’s try getting rid of the large numbers by replacing them with numbers that are large enough but not too large. For zero_con, you can replace the large number with the much smaller value sum{j in dec_set} constraints[i, j] and enforce the same logical rule. Just consider the two cases choose_zero[i] = 0 or 1. Similarly, for two_more_con, replace the large number with 2. Again, both possibilities for choose_zero[i] do the right thing.

    If you run again with these changes, the MILP presolver still detects infeasibility. Maybe even the LP relaxation is infeasible. Let’s use the IIS detection feature to investigate further. Just replace the SOLVE WITH MILP statement with the following:

          solve with lp / iis=on;
          expand / iis;

    The EXPAND output shows the following small infeasible subsystem:

    Constraint primary_con[555]: choices[555] >= 1
    Constraint zero_con[126]: choices[555] + choose_zero[126] <= 1
    Constraint two_more_con[126]: choices[555] + 2*choose_zero[126] >= 2

    Indeed, the first constraint forces choices[555] = 1, so the second constraint forces choose_zero[126] = 0, and then the third constraint becomes 1 >= 2, which is impossible. The issue is that this group of cells contains only a single cell, so you cannot possibly suppress either 0 or 2+ cells.

    A simple way to fix the error in the formulation is to impose the constraint only if the group contains more than one cell, as follows:

    		con two_more_con {i in con_set: sum{j in dec_set} constraints[i, j] > 1.5}: (sum{j in dec_set} choices[j] * constraints[i, j]) + choose_zero[i] * 2 >= 2;

    With this corrected formulation, the MILP solver returns an optimal solution.

    • daynebatten February 23, 2016 / 8:02 am

      I’m curious as to why the “ridiculously large number” should be causing problems? My understanding is that it is several orders of magnitude smaller than the largest integer number that can be stored in a SAS numeric value, and the constraint shouldn’t have any decimal values involved. Regardless, your proposed alternative is certainly valid.

      More importantly, you’re absolutely right on the 1-cell issue. Thanks for the catch! That’s why I post this stuff online.

      In fact, I’m actually encouraged that the 1-cell issue is the only one you’ve found here. The lengthy back story on this post is that I originally tackled this problem while working for the State of NC. At the time, I didn’t have a terribly reliable way of checking my work, other than writing some additional routines to check that all my constraints had been met. Your comment gives me some encouragement that what I did was likely pretty effective, especially since I definitely had to handle the 1-cell problem in the data I was working on (for http://www.nctower.com).

      So, why the 1-cell bug if I had to handle the 1-cell situation for the project? Unfortunately, the project I was working on was grant-funded so, when I left that position, I couldn’t take my code with me. What you see here was my attempt to re-write it from scratch before all the nuances involved fell out of my memory entirely.

      Anyway, all of that to say, I really appreciate you reaching out on both this post and the lumber optimization one. Looks like you’ve been on the SAS/OR team for awhile, so thanks for your contributions. You guys are putting together some pretty fantastic software over there.

  2. Rob Pratt February 23, 2016 / 10:26 am

    As a simple example of what can go wrong with a huge value for big M, suppose you want to model the logical rule that x > 0 implies y = 1, where x is an integer variable in [0,10] and y is a binary variable. You can enforce this rule by using the linear constraint x le M y. If y = 0, the constraint forces x = 0, as desired. If y = 1, we want the resulting constraint x le M to be redundant, so take M to be any value ge 10. Notice that we want the solution (x,y) = (1,0) to be infeasible, but (x,y) = (1,1/M) satisfies the constraint. So making M huge allows a tiny value 1/M for y. The SAS MILP solver (as well as all the commercial MILP solvers) operates in floating point arithmetic, and values “close enough” to integer are considered integer. This tolerance (called INTTOL in the SAS MILP solver) is by default 1e-5:
    http://support.sas.com/documentation/cdl/en/ormpug/68156/HTML/default/viewer.htm#ormpug_milpsolver_syntax02.htm#ormpug.milpsolver.milpinttol

    For big enough M, the value of 1/M is smaller than the integer tolerance, allowing the solver to “cheat” by returning (x,y) = (1,1/M) as an integer feasible solution. To avoid this undesirable behavior, a best practice is to use the smallest value for M that does the job. In this example, M=10 is the best choice. In the latest production release, the SAS MILP presolver is more aggressive about detecting this situation and reducing the value of M on behalf of the user.

    • daynebatten February 23, 2016 / 2:24 pm

      Ah, that makes perfect sense… Less of a problem of storing that number in a double and more of a problem of doing a bunch of floating-point math involving that number… Thanks again for all the helpful feedback!

Leave a Reply

Your email address will not be published. Required fields are marked *