Big Blue is experimenting with an idea for customer databases called data randomization. The technique will, conceivably, preserve consumer privacy by masking data such as income, age, past purchases or medical information through mathematical calculations that can't be unwound.
For instance, if a customer enters their age as 38 when registering at an online shopping site, a randomizing plug-in in their browser will add a random number between minus 25 and 112 to that age and send the sum, rather than the true age, to the server.
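As a rough illustration of that client-side step, here is a minimal Python sketch. The function name `randomize_age`, the choice of whole-number noise, and the assumption that every value in the range is equally likely are all hypothetical; the article gives only the range, not how IBM's plug-in actually draws its numbers.

```python
import random

def randomize_age(true_age, low=-25, high=112, rng=random):
    """Return the true age plus a random whole number from low..high.

    Assumption: the noise is uniform and both endpoints are included.
    Only the returned sum would ever leave the customer's machine.
    """
    noise = rng.randint(low, high)  # randint includes both endpoints
    return true_age + noise

# A 38-year-old could be reported as anything from 13 to 150.
reported_age = randomize_age(38)
```

Because the offset is drawn fresh for each customer, a single reported value says very little about the person who submitted it.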
The wrinkle is that, at the back end, computers then apply a barrage of calculations to the scrambled data to discern patterns among all customers. The 38-year-old individual's true age can never be recovered, but an online business can figure out fairly accurately how popular it is with 38-year-olds. Unscrambled data collected by the company--such as how much a person paid for a car and on what date--could subsequently be randomized too, for additional privacy.
"The basic notion, in some sense, is kind of heresy in computer science. The normal notion is, in order to do a good job, you need to have accurate information," said Rakesh Agrawal, a senior fellow at IBM who is leading the research. "And here we are saying, 'You have good information, and we are going to perturb it or put errors into it to protect people's privacy.'"
A boon to privacy?
I find data randomization appealing on two levels. First, it's a healthy reminder of why we have big companies in the first place. They exist to hire the math geniuses and chemistry whizzes of the world, who in turn build the society of tomorrow. Without them, the Wheelo would stand as the apex of scientific achievement.
Second, it represents an opportunity to defuse the ugly conflict over privacy. A large--and seemingly growing--number of consumers are furious about how companies and institutions collect, trade and transmit their data.
In reality, most of the harvested data is never exploited for nefarious purposes. Using an ATM card does create an electronic trail of your life, but it's not as if FBI agents are sitting around right now looking at your file and thinking, "He's eaten at Carl's Jr. three times in the last month. Wanna bet he goes there again in five days?"
Still, consumers resent the practice, and the Federal Trade Commission has made protecting consumer privacy a high priority.
To spoof data harvesting, people often lie, but that actually doesn't work. Companies can reconstruct basic data patterns. "It turns out that people are not very good at lying," Agrawal said. "Essentially people leave tell-tale signs."
The randomization system relies on determining the relationship between different values through Bayesian probability. Consumers fill in their true data, which then gets randomized before being sent over.
At the corporate end, servers then try to determine what type of randomizing calculations were applied to scramble the original values.
"We basically ask the following question: 'What could have generated this distribution?'" Agrawal said.
If the computer can come up with the likely randomizing technique that was employed--adding a random number between 15 and 87, or subtracting one between 8 and 32, for example--it can then draw a chart that accurately simulates what the customer base looks like. In several contained trials, the reconstructed curve differed from the curve plotted by the original data by two to three percent.
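One way such a reconstruction could work is sketched below in Python. This is not IBM's code: it makes the simplifying assumption that the server already knows the noise range (the article describes the server inferring it) and that the noise is uniform over whole numbers. The function name `reconstruct_distribution` and its parameters are hypothetical. It starts from a flat guess about the age distribution and repeatedly applies Bayes' rule to the scrambled values until the guess settles.

```python
from collections import Counter

def reconstruct_distribution(observed, domain, low, high, iterations=50):
    """Estimate the share of customers at each true value in `domain`.

    Assumptions: each observed value is true_value + noise, the noise is a
    whole number drawn uniformly from low..high, and that range is known.
    `domain` must be a sequence such as a list or a range.
    """
    counts = Counter(observed)
    # Start from a uniform guess over every candidate true value.
    estimate = {a: 1.0 / len(domain) for a in domain}
    for _ in range(iterations):
        updated = {a: 0.0 for a in domain}
        for w, count in counts.items():
            # True values that could have produced w under the noise range.
            candidates = [a for a in domain if low <= w - a <= high]
            total = sum(estimate[a] for a in candidates)
            if total == 0.0:
                continue  # w cannot be explained by the current estimate
            for a in candidates:
                # Bayes' rule: with uniform noise the likelihood cancels,
                # leaving the current estimate as the weight for each candidate.
                updated[a] += count * estimate[a] / total
        norm = sum(updated.values())
        if norm == 0.0:
            return estimate  # no observation was usable; keep the last guess
        estimate = {a: v / norm for a, v in updated.items()}
    return estimate
```

Calling `reconstruct_distribution(noisy_ages, range(18, 91), -25, 112)` would return a dictionary mapping each candidate age to its estimated share of the customer base. How close that estimate comes to the real curve depends on how many customers there are and how wide the noise range is; the two to three percent figure above comes from IBM's trials, not from this sketch.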
"It comes back to the true distribution, always. This is the beauty of math, fortunately or unfortunately," Agrawal said. "I think the key insight was that you don't have to have access to precise information to build good models."
IBM continues to conduct trials with the technology, but Agrawal already sees some areas where it could bring benefits. Large businesses such as rental car companies could pool their data without the risk of disclosing customer lists. Hospitals could give access to records about a hepatitis outbreak without being sued. Network break-ins would become potentially less dangerous.
And when filling out a customer questionnaire at Home Depot, you won't feel compelled to claim you have 16 kitchens.
Biography
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas. He has worked as an attorney, travel writer and sidewalk hawker for a time share resort, among other occupations.

"He's eaten at Carl's Jr. three times in the last month, let's sell him these large size pants. Oh yeah, we'll raise his health insurance premium too."
But I don't understand what the point of having statistics at all is if they're random to an extent. Doesn't that defeat the purpose?
IBM are old-timers at this computing game. I learned programming on the mammoth old IBM 360 at Columbia University, because it was a prerequisite for courses in Operations Research (applied stat and probability theory, wargames, et al.).
Over the past twenty one years, life has taught me how critical those lessons were. IBM wrote the book on how to sidestep invasive info tech. The firm understood the potential threats to personal, corporate and government security, and potential abuse of basic civil liberties; all of this, before 1975.
Well done, Big Blue, as always...