In last week’s post we built a SAS macro that found acceptable solutions to the small cell suppression problem using a simple heuristic approach. But what if acceptable isn’t good enough? What if you want perfection? Well, then, you’re in luck!
I’ve blogged previously about optimization with linear programming in SAS PROC OPTMODEL, and it turns out that the cell suppression problem is another class of problems that can be tackled using this approach. (If you’re unfamiliar with linear programming, check out the linear programming Wikipedia article to get up to speed.) Over the next two posts, we’ll be setting up a SAS Macro that builds the constraints necessary to bound our optimization problem, then implementing the actual optimization code in PROC OPTMODEL.
The needed constraints
For this week, we’re going to build a set of constraints that will help our optimizer find the lowest-cost pattern of suppressed cells that will actually prevent all of our sensitive data from being revealed. If you remember from our summary of the suppression problem a couple of weeks ago, there are a couple of rules that must be followed in order for all of our data to be secure. The first, and easiest, of these is that every data point based on a number of individuals smaller than a certain threshold (often 3) should be hidden. So, you wouldn’t report average GPA on a group of 3 or less students, for example.
The much harder rule to satisfy is that your data must have either 0 suppressed cells or 2+ suppressed cells across every dimension for which you report an aggregate. So, if you have a 2-dimensional table with row totals, column totals, and a grand total, you’d need to have 0 or 2+ cells in every row, every column, within the row totals, and within the column totals. This is a task that’s impossible to accomplish by hand across multi-dimensional data with lots of levels of aggregation, and a human would be hopelessly lost if tasked with finding an optimal solution
Thankfully, it’s pretty easy to represent these constraints for PROC OPTMODEL. Suppose we have a “long shaped” data set with 2-dimensional data, including aggregates, such as this handy example data of number of citizens and average salaries by degree type for a couple of fake counties in Alaska, which we described in last week’s post.
In order to suppress this data appropriately, you’d need to find patterns of 0 or 2+ suppressed cells within every group of rows that gets aggregated somewhere. Expressing this to PROC OPTMODEL is going to be super easy. When we write our actual optimization code later on, we’ll be asking PROC OPTMODEL to return us a simple binary list showing whether the value for a given row should be suppressed or not. So, to build our constraints, we build a lot of binary lists (one for each aggregated group) showing whether a row belongs to that group or not. We’ll then tell our solver that, when you multiply the list of suppression decisions by each grouping list and sum the results, you should end up with a value of either 0, or 2+. It’s just that easy.
In the example above, our first grouping might be the first three rows (since row 10 is an aggregate of them). We’d build a binary list with a 1 in the first 3 rows and a 0 in the rest. That’s our first constraint. Then, we ask the solver to produce a binary list showing whether each row should be suppressed (e.g., a 1 in row 2 indicates row 2 should be suppressed). If we multiply the suppression list by our constraint list, we’ll get a new binary list that shows only suppressed cells within our group. If the sum of that list is 1 (1 suppressed cell), we don’t have a feasible solution. If it’s something else, we’re good to go! Doing this for every possible group in the data will appropriately constrain the problem.
So how do we write code that can build all of the binary lists necessary to represent all of our constraints? As we’ve done in the past, we’ll illustrate the concept on simple data with vanilla SAS code, then turn it into a universally applicable macro. So let’s start by loading our dummy data into SAS.
We need to create constraints for every group of data that could conceivably be aggregated across. As we’ve done before, we’ll approach this using a “leave one out” algorithm. You can aggregate across any set of rows that shares the same county name, or any rows that share the same degree type, so let’s simply make a list of all the county names and degree types (including “ALL”) while leaving the other dimension out by setting it to “dummy.” Here’s some simple PROC SQL code to make it happen, followed by a DATA step to number each of our levels.
And here’s our results… Looks like we’ll be building 7 constraints to apply to our data!
Now we simply need to determine how to figure out which of the rows in our data fall within each of these groups. We can accomplish this using a cartesian product join of our data with the “levels” data set we just created. (Cartesian product join just means we’re joining all rows to all rows.) If the data county matches the constraint county, the constraint degree is set to “dummy,” and the data degree isn’t set to “ALL” (’cause that would be the aggregate row, not one of its constituent parts), the data row belongs to that constraint. Same logic applies to matches on degree, with county set to dummy.
Again, here’s some PROC SQL code to make it happen.
Our results are very long, but it looked like things worked out! The first little bit of the result set is shown below. You’ll notice that constraint 1 forms a group for all of the rows that have county set to “ALL” with a specified degree… because those are aggregated in the row where both county and degree are set to “ALL.” Not too shabby!
Of course, we’re going to want to transpose this to make it a little more user-friendly.
And here’s our final result (with a few constraints hidden in the interest of not having a ridiculously wide image)!
If you run the code, you’ll see that there’s one binary list identifying every group of rows that has an aggregated value in some other row. A perfect list of constraints!
The macro code
As I did in the last post, I’m simply going to post some commented macro code here and let you peruse it at your own leisure. If you know what you’re doing with SAS macros, this shouldn’t be too bad – it’s reasonably well-commented, and it’s basically just a modification of what we did above.
Of course, this is only the first step in actually solving the problem – we’ve built our constraints, but we still haven’t put them to use. Next week, we’ll be writing code in PROC OPTMODEL to actually solve the small cell suppression problem to proven optimality! Awesome!
Of course, as always, if you’ve got any comments, questions, or suggestions, feel free to follow up! I’d love to hear from you!