November 24, 2004 10:26 AM PST

Perspective: Privacy's random answer

If IBM is right, corporate databases in the future might record your age as 157 and your income as the square root of two.

Big Blue is experimenting with an idea for customer databases called data randomization. The technique will, conceivably, preserve consumer privacy by masking data such as income, age, past purchases or medical information through mathematical calculations that can't be unwound.

For instance, if a customer submits their age as 38 when registering at an online shopping site, a randomizing plug-in in their browser software will add a number between minus 25 and 112 to their age and send the resulting sum, not the true age, to the server.
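The browser-side step can be pictured with a few lines of Python. This is only a sketch of the idea in the example above, not IBM's plug-in: the function name `randomize_age` is hypothetical, and drawing the offset as a uniformly distributed whole number is an assumption, since the description gives only the range.

```python
import random

# Bounds from the example in the text; treating the offset as a
# uniform integer is an assumption of this sketch.
NOISE_LOW = -25
NOISE_HIGH = 112


def randomize_age(true_age, rng=random):
    """Return the true age plus a random offset (hypothetical helper)."""
    offset = rng.randint(NOISE_LOW, NOISE_HIGH)  # inclusive on both ends
    return true_age + offset


if __name__ == "__main__":
    rng = random.Random(2004)  # fixed seed so repeated runs match
    # Five possible values a 38-year-old's browser might send.
    print([randomize_age(38, rng) for _ in range(5)])
```

Under these assumptions a 38-year-old could be reported as anything from 13 to 150, which is why a single record says little about the person who submitted it.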

The wrinkle is that, at the back end, computers then apply a barrage of calculations to the scrambled data to discern patterns among all customers. The 38-year-old individual's true age can never be recovered, but an online business can somewhat accurately figure out how popular it is with 38-year-olds. Unscrambled data collected by the company--such as how much a person paid for a car and on what date--could subsequently be randomized too, for additional privacy.

"The basic notion, in some sense, is kind of heresy in computer science. The normal notion is, in order to do a good job, you need to have accurate information," said Rakesh Agrawal, a senior fellow at IBM who is leading the research. "And here we are saying, 'You have good information, and we are going to perturb it or put errors into it to protect people's privacy.'"

A boon to privacy?
I find data randomization appealing on two levels. First, it's a healthy reminder of why we have big companies in the first place. They exist to hire the math geniuses and chemistry whizzes of the world, who in turn build the society of tomorrow. Without them, the Wheelo would stand as the apex of scientific achievement.

Second, it represents an opportunity to defuse the ugly conflict over privacy. A large--and seemingly growing--number of consumers are furious about how companies and institutions collect, trade and transmit their data.

In reality, most of the harvested data is never exploited for nefarious purposes. Using an ATM card does create an electronic trail of your life, but it's not like FBI agents are sitting around right now looking at your file and thinking, "He's eaten at Carl's Jr. three times in the last month. Wanna bet he goes there again in five days?"

Still, consumers resent the practice, and the Federal Trade Commission has made protecting consumer privacy a high priority.

To spoof data harvesting, people often lie, but that actually doesn't work. Companies can reconstruct basic data patterns. "It turns out that people are not very good at lying," Agrawal said. "Essentially people leave tell-tale signs."

The randomization system relies on determining the relationship between different values through Bayesian probability. Consumers fill in their true data, which then gets randomized before being sent over.

At the corporate end, servers then try to determine what type of randomizing calculations were applied to scramble the original values.

"We basically ask the following question: 'What could have generated this distribution?'" Agrawal said.

If the computer can come up with the likely randomizing technique that was employed--adding a random number between 15 and 87, or subtracting one between 8 and 32, for example--it can then draw a chart that accurately simulates what the customer base looks like. In several contained trials, the reconstructed curve differed from the curve plotted from the original data by two to three percent.
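To make the server-side idea concrete, here is a rough Python sketch of one way such a reconstruction could work: an iterative Bayesian update that repeatedly asks which true values could have produced each scrambled value and re-weights its guess accordingly. The article does not spell out IBM's actual procedure, so treat this as an illustration only. The function name `reconstruct_distribution` is hypothetical, and the sketch assumes the noise is a uniform whole number whose range the server already knows, which is simpler than inferring the randomizing technique as described above.

```python
import random
from collections import Counter


def reconstruct_distribution(observed, candidates, noise_low, noise_high,
                             iterations=50):
    """Estimate the share of customers at each candidate true value.

    observed   -- randomized values (true value + uniform integer noise)
    candidates -- possible true values, e.g. range(18, 81)
    Assumes the server knows the noise range (an assumption of this sketch).
    """
    candidates = list(candidates)
    counts = Counter(observed)
    total = len(observed)
    # Start from a flat guess: every candidate value equally likely.
    estimate = [1.0 / len(candidates)] * len(candidates)

    for _ in range(iterations):
        updated = [0.0] * len(candidates)
        for value, count in counts.items():
            # Current weight of each candidate that could have produced
            # `value`. Uniform noise has constant likelihood, so it cancels.
            weights = [
                p if noise_low <= value - c <= noise_high else 0.0
                for c, p in zip(candidates, estimate)
            ]
            norm = sum(weights)
            if norm == 0.0:
                continue  # value unreachable from any candidate; skip it
            for i, w in enumerate(weights):
                updated[i] += (count / total) * (w / norm)
        estimate = updated
    return estimate


if __name__ == "__main__":
    rng = random.Random(2004)
    # Made-up customer base for the demo, skewed toward 38-year-olds.
    true_ages = [rng.choice([25, 38, 38, 38, 52]) for _ in range(20000)]
    observed = [age + rng.randint(-25, 112) for age in true_ages]
    shares = reconstruct_distribution(observed, range(18, 81), -25, 112)
    for age, share in zip(range(18, 81), shares):
        if share > 0.02:
            print(age, round(share, 3))
```

The sketch never tries to undo the noise on any single record; it only adjusts an estimate of the overall age mix. How closely that estimate tracks the real customer base depends on the amount of data and the width of the noise, and the demo data here is invented, so it says nothing about the accuracy figures quoted above.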

"It comes back to the true distribution, always. This is the beauty of math, fortunately or unfortunately," Agrawal said. "I think the key insight was that you don't have to have access to precise information to build good models."

IBM continues to conduct trials with the technology, but Agrawal already sees some areas where it could bring benefits. Large businesses such as rental car companies could pool their data without the risk of disclosing customer lists. Hospitals could give access to records about a hepatitis outbreak without being sued. Network break-ins would become potentially less dangerous.

And when filling out a customer questionnaire at Home Depot, you won't feel compelled to claim you have 16 kitchens.

Biography
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas. He has worked as an attorney, travel writer and sidewalk hawker for a time share resort, among other occupations.

7 comments
Nefarious use?
by November 24, 2004 1:04 PM PST
If the data is available to anybody, the discussion will be more like
"He's eaten at Carl's Jr. three times in the last month, let's sell him these large size pants. Oh yeah, we'll raise his health insurance premium too."
Randomization
by November 24, 2004 2:35 PM PST
First of all, was my information randomized when I sent it to register with news.com? :-P
But I don't understand what the point of having statistics at all is if they're random, to an extent. Doesn't that defeat the purpose?
Thank the Lord IBM Understands
by malabrm1 November 24, 2004 4:05 PM PST
Over the years, I've told Amazon that I'm 952 y/o, Yahoo that I'm 6 y/o, and other sites such nonsense just to avoid receiving their spam and tracking codes. So, sue me.

IBM are old-timers at this computing game. I learned programming on the mammoth old IBM 360 at Columbia University, because it was a pre-requisite for courses in Operations Research (applied stat and probability theory, wargames, et al).

Over the past twenty one years, life has taught me how critical those lessons were. IBM wrote the book on how to sidestep invasive info tech. The firm understood the potential threats to personal, corporate and government security, and potential abuse of basic civil liberties; all of this, before 1975.

Well done, Big Blue, as always...
Not enough
by November 24, 2004 6:03 PM PST
If you can apply a mathematical formula, even with random numbers and even random functions, there will be a way to unravel it. It might take a ton of work, but if you can get all the encoded data, it will eventually be cracked. Advertising it as a perfect privacy solution preys on the majority of people who have woeful mathematical skills.