The “cell suppression problem” is one type of “statistical disclosure control” in which a researcher must hide certain values in tabular reports in order to protect sensitive personal (or otherwise protected) information. For instance, suppose Wayout County, Alaska has only one resident with a PhD – we’ll call her “Jane.” Some economist comes in to do a study of the value of higher education in rural areas, and publishes a list of average salaries disaggregated by county and level of education. Whoops! The average salary for people with PhDs in Wayout County is just Jane’s salary. That researcher has just disclosed Jane’s personal information to the world, and anybody who happens to know her now knows how much money she makes. “Suppressing” or hiding the value of that cell in the report table would have saved a lot of trouble!
Over the next couple weeks, I’ll be blogging about some algorithms used to solve the cell suppression problem, and showing how to implement them in code. For now, we’re going to start with an introduction to the intricacies of the problem.
So, there’s undoubtedly somebody out there who’s wondering why the cell suppression problem is a big deal. Why can’t you just hide any sensitive value that’s calculated based on data from only one person, and call it a day? In our example, just make sure that the data for PhDs in Wayout County Alaska is suppressed, and we’re good to go.
Well, not so fast. Take a look at the following imaginary table. It shows the number of citizens and average salaries by county and degree type for a couple of fake counties in Alaska. The table also shows aggregations across counties and across degree types, all the way up to the total number of citizens in the study, and their overall average salary. The values for Wayout County residents with PhDs have been hidden with asterisks to protect Jane’s privacy.
It doesn’t take a mathematical genius to figure out that if the PhDs in Farout County make an average of $70,000 and the PhDs in all counties combined also average $70,000, Jane must be making $70,000 as well. (Of course, you could also get at this solution by doing algebra on the weighted average for salaries within Wayout’s total, but that would be a little harder to illustrate.)
But there’s also another problem. Suppose you live in Farout County and you have a PhD. You know your salary is $60,000. Since the table reports only two PhDs in Farout County, with an average of $70,000, the other guy must be making $80,000. That dirty rascal! At any rate, privacy has still been compromised.
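To make the arithmetic concrete, here’s a minimal Python sketch of both attacks. The function name `back_out_suppressed_mean` is just something I made up for illustration, and the counts are assumptions read off the example: one PhD in Wayout (Jane), two in Farout, and no other counties in the table, so three PhDs overall.

```python
def back_out_suppressed_mean(total_mean, total_count, known_cells):
    """Recover the mean of the one hidden cell behind a published aggregate.

    known_cells is a list of (mean, count) pairs for the cells we can see.
    """
    known_sum = sum(mean * count for mean, count in known_cells)
    known_count = sum(count for _, count in known_cells)
    hidden_count = total_count - known_count
    # The aggregate's total salary minus the visible salary is the hidden salary.
    return (total_mean * total_count - known_sum) / hidden_count


# Attack 1: all PhDs average $70,000 (3 people, assumed); Farout's 2 average $70,000.
jane = back_out_suppressed_mean(70_000, 3, [(70_000, 2)])

# Attack 2: a Farout PhD earning $60,000 treats their own salary as a known "cell".
neighbor = back_out_suppressed_mean(70_000, 2, [(60_000, 1)])

print(jane, neighbor)  # 70000.0 80000.0
```

The same few lines handle both cases because each one boils down to one equation with one unknown.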
As a consequence of these mathematical realities, cells must be suppressed in such a way that the solution meets a couple of strict requirements. First, any value that is based on data from 2 or fewer individuals should be suppressed. In practice, most ethics guides and regulations set this number to at least 3 and often 4 or more. Second, within any group of values that are aggregated together, either zero or at least two values must be suppressed – or, as an alternative, the aggregated value itself must be suppressed. So, if PhDs need to be hidden in Wayout County, you’re going to have to hide the value for Master’s degree holders in Wayout County… or hide the aggregate value for the county as a whole. Anything less means somebody could work the math and back out the original suppressed value. Hiding these additional cells is known as “secondary suppression.”
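Here’s a rough sketch of what checking those two requirements might look like in Python. Everything in it is illustrative: `check_suppression` is a hypothetical helper, the Master’s counts are invented, and the table is assumed to have just two counties and two degree types. The second rule is folded into one test: within an aggregate plus its members, exactly one hidden cell is the dangerous case, because that leaves one equation with one unknown.

```python
def check_suppression(counts, suppressed, groups, threshold=2):
    """List the ways a suppression pattern breaks the two requirements.

    counts     -- {cell: number of individuals behind that cell}
    suppressed -- set of hidden cells (interior cells or totals)
    groups     -- {total_cell: [member cells that add up to it]}
    """
    problems = []
    # Requirement 1: cells built on too few individuals must be hidden.
    for cell, n in counts.items():
        if 0 < n <= threshold and cell not in suppressed:
            problems.append(f"{cell} covers only {n} individual(s) but is published")
    # Requirement 2: a lone hidden cell in a group can be backed out.
    for total, members in groups.items():
        hidden = [c for c in [total, *members] if c in suppressed]
        if len(hidden) == 1:
            problems.append(f"{hidden[0]} can be recovered from group {total}")
    return problems


# Hypothetical counts; only the PhD numbers come from the example above.
counts = {
    ("Wayout", "Masters"): 5, ("Wayout", "PhD"): 1,
    ("Farout", "Masters"): 8, ("Farout", "PhD"): 2,
}
groups = {
    ("Wayout", "All"): [("Wayout", "Masters"), ("Wayout", "PhD")],
    ("Farout", "All"): [("Farout", "Masters"), ("Farout", "PhD")],
    ("All", "Masters"): [("Wayout", "Masters"), ("Farout", "Masters")],
    ("All", "PhD"): [("Wayout", "PhD"), ("Farout", "PhD")],
}

naive = {("Wayout", "PhD")}  # hide Jane's cell and nothing else
problems = check_suppression(counts, naive, groups)
for p in problems:
    print(p)
```

Hiding only Jane’s cell trips both requirements, while hiding all four interior cells passes. Keep in mind that this looks at each group on its own; on bigger tables, combining several aggregates can sometimes still pin down a value, so treat a clean result here as a necessary check rather than a guarantee.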
Of course, some situations call for more stringent regulations. The Bureau of Labor Statistics won’t provide average salaries for an occupation in a particular town if there’s only one business in town that employs people in that occupation, for example. Even if there are 100 people in the sample and no individual’s salary could be identified, it still reveals sensitive information about the business. In other instances, data suppression regulations require not only that values can’t be exactly calculated, but also that they can’t be predicted within a certain tolerance… You get the picture. For the purposes of this series, we’re going to stick to the simpler situation, however – you just need to protect values on 3 or fewer individuals from exact calculation.
The problem with all of this is, of course, that in a very large data set, it can be extremely hard to find all of the combinations of cells that need to be hidden in order to make sure all of the sensitive data is protected. Working through every check needed to make sure that no cell is revealed can be extremely arduous if done by hand – even for relatively simple two- or three-dimensional tables. It becomes nearly impossible for highly dimensional data that can be aggregated in numerous ways.
But finding a solution isn’t even really the crux of the issue. Every time you hide a cell, you’re hiding potentially valuable information from people who may have a stake in knowing what that value is. Suppressing the whole table would protect all of the sensitive information, but it would also make the study completely useless. As a consequence, we’re highly interested in finding ideal or near-ideal solutions that release the maximum amount of data, while still making sure that the sensitive parts are protected.
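To get a feel for what “ideal” means here, the sketch below brute-forces the smallest safe pattern on a toy table. This is not one of the algorithms the series will cover, just the most naive approach imaginable. The names are hypothetical, the Bachelor’s column is invented to give the table some room, the totals are assumed to always be published, and “best” is assumed to mean “fewest hidden cells,” which ignores the fact that some cells are worth more to readers than others.

```python
from itertools import combinations


def is_safe(suppressed, groups):
    """A group is safe if it has zero or at least two hidden members."""
    return all(sum(c in suppressed for c in members) != 1 for members in groups)


def smallest_safe_pattern(cells, primary, groups):
    """Try ever-larger sets of extra cells until every group is safe."""
    optional = [c for c in cells if c not in primary]
    for extra in range(len(optional) + 1):
        for chosen in combinations(optional, extra):
            candidate = set(primary) | set(chosen)
            if is_safe(candidate, groups):
                return candidate
    return None


counties = ["Wayout", "Farout"]
degrees = ["Bachelors", "Masters", "PhD"]  # Bachelors is an invented column
cells = [(c, d) for c in counties for d in degrees]

# One group per county (its row) and one per degree (its column).
groups = [[(c, d) for d in degrees] for c in counties]
groups += [[(c, d) for c in counties] for d in degrees]

primary = {("Wayout", "PhD")}  # Jane's cell must be hidden no matter what
best = smallest_safe_pattern(cells, primary, groups)
print(sorted(best))
```

Protecting Jane’s single cell ends up costing four cells on this table. The search also tries up to 2^n subsets for n optional cells, which is fine for five cells and hopeless for a real report.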
Solving the cell suppression problem is not an easy task to accomplish by hand, but there are several algorithms that can make quick work of it, and which can be easily adapted to fit a variety of situations. We’ll start talking about them next week.