Often, complex problems can be solved by simple rules that provide an acceptable answer, even if they don’t necessarily get you to the optimal point. The cell suppression problem (summarized in last week’s post) is a perfect example of this – using a methodology that would be readily apparent to any human tackling the problem with pen and paper, we can create a computerized solution that can appropriately suppress data sets containing tens of thousands of records disaggregated over dozens of dimensions. This heuristic method will likely suppress more data than it really needs to, but when all is said and done, it will finish the job quickly and without completely mangling your statistics.
We’ll start with an explanation of the basic idea, then move on to implementing it in code.
The basic idea
Suppose you were asked to carry out small cell suppression (protect any values of 3 or less) on the following totally ridiculous table… What would you do?
Well, first, you’d go ahead and protect the cells that definitely had to be hidden.
You’d realize, of course, that this was insufficient. Using some simple math, somebody could easily back out the numbers that you just hid. So, perhaps you’d go through each column looking for suppressed cells. If you found one suppressed value in the column, you’d go ahead and suppress the second largest value in that column in order to protect the first. You’d end up with something like this…
Of course, somebody could still do math across the rows to back out the protected values. So, you’d go through every row, making sure they had either 0 or 2+ suppressed cells.
Now, you would go back and double-check your columns to make sure that you still had 0 or 2+ suppressed cells in every column (this may have changed while you were working on the rows). In this case, working on the rows didn’t mess up the columns (aren’t examples wonderful?), so you’d be done!
That’s all there is to it. That’s the simplest approach to small cell suppression.
Generalizing the idea
You’re likely thinking that, while this seems fine for a 2-dimensional table, it seems harder to implement on multi-dimensional data. But that’s not really the case… Suppose we had 3-dimensional data laid out in a cube. We’d simply do the same thing we did above, except that instead of checking rows and columns, we’d check rows front-to-back, side-to-side, and top-to-bottom, to make sure all of them had either 0 or 2+ suppressed cells. If they didn’t, we’d hide the smallest unsuppressed cell in order to get the total up to 2 with minimal data loss. This logic can be extended on and on, to any number of dimensions…
To implement this in a more code-friendly fashion, we take a “leave one out” approach. We group our data by every dimension except one, then check that every group has either 0 or 2+ hidden cells. Then we switch it up and leave out a different dimension when building the groups. We cycle through every dimension, grouping the data by all the others and checking suppression status. Then, we keep running through the cycle until we’re sure that there are no vulnerable cells. Easy!
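The leave-one-out cycle can be sketched in pseudocode like this (the dimension names are placeholders, not variables from any particular data set):

```sas
/* Pseudocode for the leave-one-out cycle (not runnable as-is):

   repeat until a full pass makes no new suppressions:
       for each dimension d in (dim1, dim2, ..., dimK):
           group the rows by the aggregation level and by
               every dimension EXCEPT d;
           within each group:
               if exactly 1 cell is hidden,
                   hide the smallest unhidden cell as its buddy;
*/
```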
Of course, when dealing with multi-dimensional data, you also have a possibility of multiple levels of aggregation. Suppose you had data on student GPA and your dimensions were race, gender, and school district. You could see aggregations for each individual gender, but all races. Or each race, but all genders. Or all races and genders. Throw district into the mix, and it gets even more complicated. As long as we include the aggregations in our data set (e.g., in an additional row marked “ALL”), and keep each level of aggregation separate from the others (using a “level” variable, for example), this will work out just fine. Stay with me, and you’ll see how this works.
Let’s get coding!
We’ll be implementing today’s algorithm in SAS. We’ll start out using the example data from last week’s post on salaries for various fake counties and educational levels in Alaska. There are two counties (“Farout” or “FAR” and “Wayout” or “WAY”) and three educational levels (Bachelor’s or “B,” Master’s or “M,” and PhD or “P”). We need to protect the values for any cell containing data on only 1 person. You can read the data in using this data step…
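Here’s a sketch of what that data step might look like. The counts mirror the setup from last week’s post (1 PhD holder per county), but the variable names (level, county, degree, citizens, avg_salary) and the salary figures are stand-ins I’ve made up for illustration:

```sas
/* Illustrative stand-in for the example data. Level mimics the
   _TYPE_ coding from PROC MEANS with CLASS county degree:
   0 = grand total, 1 = by degree only, 2 = by county only,
   3 = by county and degree. */
data salaries;
    input level county $ degree $ citizens avg_salary;
    datalines;
0 ALL ALL 12 65833
1 ALL B    6 55000
1 ALL M    4 70000
1 ALL P    2 90000
2 FAR ALL  6 64667
2 WAY ALL  6 67000
3 FAR B    3 54000
3 FAR M    2 69000
3 FAR P    1 88000
3 WAY B    3 56000
3 WAY M    2 71000
3 WAY P    1 92000
;
run;
```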
You’ll notice that this data includes a “level” variable. This is just an indicator of the level of aggregation present in the row so that we can prevent the levels of aggregation from getting mixed when we run our suppression algorithm. If you’re familiar with the CLASS statement in PROC MEANS, the level variable here is simply the equivalent of the _TYPE_ variable that would result from using that statement. Basically, any time we sort the data for processing, we’ll sort by the level variable first to keep the other information separated.
OK, let’s get started. The first thing we’ll do to this data is primary suppression!
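A primary suppression pass just needs to flag every cell that describes too few people. A minimal sketch, assuming the variable names from the data step above (citizens for the count, and a new suppressed flag):

```sas
/* Primary suppression: flag any cell describing only 1 person. */
data salaries;
    set salaries;
    suppressed = (citizens <= 1);  /* 1 = this cell must be hidden */
run;
```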
Now, we’ll need to run through our dimensions and check for groups with only 1 suppressed cell, so we can give those cells a buddy. We’ll accomplish this by sorting our data set by the level of aggregation, our dimensions (leaving one out at a time), and then by number of citizens. Then we’ll use a data step to suppress the smallest unsuppressed value in each group. For the purposes of this example, we’ll group by degrees first, then by counties, but you could obviously go the other way.
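One way to write those two passes, again assuming the variable names sketched earlier. The sort puts suppressed cells at the top of each group and orders the rest by size, so the first unsuppressed row we hit is the smallest one:

```sas
/* Pass 1: group by degree (county left out). */
proc sort data=salaries;
    by level degree descending suppressed citizens;
run;

data salaries;
    set salaries;
    by level degree;
    if first.degree then n_hidden = 0;
    n_hidden + suppressed;
    /* Exactly 1 hidden cell so far and this row is visible:
       it's the smallest unsuppressed value, so hide it too. */
    if not suppressed and n_hidden = 1 then do;
        suppressed = 1;
        n_hidden + 1;
    end;
    drop n_hidden;
run;

/* Pass 2: same idea, grouping by county (degree left out). */
proc sort data=salaries;
    by level county descending suppressed citizens;
run;

data salaries;
    set salaries;
    by level county;
    if first.county then n_hidden = 0;
    n_hidden + suppressed;
    if not suppressed and n_hidden = 1 then do;
        suppressed = 1;
        n_hidden + 1;
    end;
    drop n_hidden;
run;
```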
As the comment mentions, this would need to be looped multiple times in order to make sure everything that needed suppression got protected. But that’s all there is to it! You can, of course, hide the salary figures for all of the rows where the number of citizens is hidden, like so…
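Blanking out the published values for flagged rows is a one-step cleanup. A sketch, with the same assumed variable names:

```sas
/* Blank out both the count and the salary for suppressed cells,
   then drop the internal flag before publishing. */
data salaries_public;
    set salaries;
    if suppressed then do;
        citizens = .;
        avg_salary = .;
    end;
    drop suppressed;
run;
```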
That’s it! We’ve solved the small cell suppression problem! And here are the results… as you can see, the values for Master’s degree holders in each county got protected in order to protect the values for the PhD holders.
The macro perspective
Of course, you wouldn’t want to manually re-code this for every data set you needed to suppress… And it definitely violates some pretty basic programming best practices to be repeating as much code as I do above. But that’s why SAS macros were invented! The following is a macro that will run the previously discussed algorithm on any data set, if given the appropriate parameters. The code is fairly straightforward and well-commented, so I’ll let it speak for itself.
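Here is a sketch of what such a macro could look like. The macro name, parameter names, and helper logic are my assumptions about the shape of the code, not a verbatim reproduction:

```sas
/* Heuristic small cell suppression, generalized to any data set.
     data=    data set to suppress (modified in place)
     dims=    space-separated list of dimension variables
     level=   aggregation-level variable (e.g., _TYPE_)
     value=   the count variable being protected
     cutoff=  suppress any cell with value <= cutoff */
%macro suppress(data=, dims=, level=, value=, cutoff=1);

    /* Primary suppression. */
    data &data;
        set &data;
        suppressed = (&value <= &cutoff);
    run;

    %let ndims = %sysfunc(countw(&dims));
    %let changed = 1;

    /* Cycle through the dimensions until a full pass makes
       no new suppressions. */
    %do %while (&changed);
        %let changed = 0;

        %do i = 1 %to &ndims;
            /* Build the grouping: level plus every dimension
               except dimension i. */
            %let grpvars = &level;
            %do j = 1 %to &ndims;
                %if &j ne &i %then
                    %let grpvars = &grpvars %scan(&dims, &j);
            %end;
            %let lastvar = %scan(&grpvars, %sysfunc(countw(&grpvars)));

            /* Suppressed cells first, then smallest values. */
            proc sort data=&data;
                by &grpvars descending suppressed &value;
            run;

            data &data;
                set &data;
                by &grpvars;
                if first.&lastvar then n_hidden = 0;
                n_hidden + suppressed;
                /* A lone suppressed cell needs a buddy: hide the
                   smallest unsuppressed value in the group. */
                if not suppressed and n_hidden = 1 then do;
                    suppressed = 1;
                    n_hidden + 1;
                    call symputx('changed', 1);
                end;
                drop n_hidden;
            run;
        %end;
    %end;

%mend suppress;
```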
If you want to test this out, you could certainly do so on the dummy data described above… But you might want a slightly more beefy example. I’ve created a simple CSV from North Carolina voting rolls which includes counts of individuals with each combination of the 10 most common first names, 10 most common last names, and races present in the state – download it in CSV format here. Using the SAS code below, you can create aggregate data at every possible level, then suppress it. Poke around and see what you can learn.
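Something along these lines would do the aggregate-then-suppress step. The file name, column names, and cutoff here are all assumptions about the CSV’s layout rather than the exact code from the post:

```sas
/* Assumed CSV layout: first_name, last_name, race, n (a count). */
proc import datafile="nc_names.csv" out=names dbms=csv replace;
run;

/* Build every level of aggregation. The automatic _TYPE_ variable
   records which CLASS variables are active in each output row, so
   it can serve directly as our level variable. */
proc means data=names noprint;
    class first_name last_name race;
    var n;
    output out=name_counts sum=n;
run;

/* Run the suppression macro discussed above (assumed to be named
   %suppress and parameterized as sketched), protecting counts of
   3 or less. */
%suppress(data=name_counts, dims=first_name last_name race,
          level=_TYPE_, value=n, cutoff=3);
```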
As we discussed above, this method doesn’t always find the optimal solution to the suppression problem. In fact, it’s not even the best heuristic. The so-called “hypercube method” – which searches across multiple dimensions for the least costly combination of “buddy cells” that can all protect each other – generates substantially better solutions. Nevertheless, in only an hour or two, we’ve put together a few lines of SAS code that can tackle the problem easily and quickly, and which generates very good solutions. Not half bad, if you ask me.
If you’ve got any questions, comments, or suggestions, please feel free to leave a comment and let me know! I’d love to hear your thoughts.