Vitaly Shmatikov thinks computer security experts and pranksters have a lot in common.
“The nice thing about being a security researcher,” he says, “is that you’re sort of paid to be a troublemaker. You are kind of paid to do things that other people don’t want to do and don’t want to think about. For a certain type of personality, this is a very good match.”
Shmatikov should know — he has been known to be a computer security troublemaker himself.
In 2006, Netflix launched a competition to develop a program that could best predict how customers would rate movies. For the contest, the company published part of its customer database with all names removed. Shmatikov and a fellow researcher at the University of Texas proved they could compare an anonymous individual’s movie ratings with public information on movie fan sites and work out that person’s identity.
After arriving at Cornell Tech just last year, Shmatikov immediately began digging into the thorny issue of what unexpected things can be inferred from the massive amount of personal data that is being collected by new digital devices like fitness sensors. “Maybe by looking at the person’s biological measurements and by looking at his social activity, I can determine when he or she is lying. Or maybe I can infer what exactly a person is doing at a particular moment,” he hypothesizes.
Cornell Tech recently sat down with Shmatikov to explore his rich imagination.
Cornell Tech: You grew up in the Soviet Union. Is that where your love of computer science started?
Vitaly Shmatikov: My parents were physicists and did do some computing, but that required programming on punch cards. I was in high school when I saw my first personal computer: a little Yamaha. But in college, I mostly studied applied mathematics.
So how did you end up in the United States studying computer science?
My parents spent a summer at the University of Washington for a research visit. After that, they thought I should go there to finish my undergraduate degree. That’s when I started studying computer science as well as math. I remember the biggest surprise to me was that you actually have to do something during the course of the semester. In Russia, the entire grade for the course is based on the final exam.
I went to Stanford for computer science and, like many PhD students, I was somewhat aimless. At the time, people barely realized the importance of security for the Web. Then Netscape Navigator appeared as the first commercial browser, and for this browser Netscape came up with a new protocol called Secure Sockets Layer, or SSL. My advisor suggested that I look for its weaknesses. I liked the process of looking at systems from a different perspective than the people who built them and trying to think creatively of all the ways in which they could fail.
Tell us about the great Netflix caper.
My colleague at the University of Texas, Arvind Narayanan, and I were already working on various privacy-related things and then one day he walked into my office and said, ‘Did you hear Netflix released this huge dataset for their data mining competition and they claim it’s all anonymous? There is no way to reconstruct people’s identities.’ And that just sounded bogus. The challenge was to actually show this in a rigorous fashion. So we just went and wrote a simple program that scraped information from a separate Internet movie database website and tried to match it against what was in the Netflix Prize dataset. And it worked.
What did you conclude?
The implications went beyond this particular dataset. We showed that, in general, it is very difficult to ‘anonymize’ data so that it cannot be re-identified.
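The matching Shmatikov describes can be illustrated with a short Python sketch. This is not the researchers' actual program: the scoring rule, the function names (movie_weights, score, best_match), the min_gap threshold, and all of the data are assumptions made up for illustration. It shows only the general idea that a handful of publicly posted ratings, especially for rarely rated movies, can single out one record in a supposedly anonymous dataset.

```python
import math
from collections import Counter


def movie_weights(anonymized):
    """Weight each movie by rarity: the fewer people rated it, the more identifying it is."""
    counts = Counter(movie for ratings in anonymized.values() for movie in ratings)
    return {movie: 1.0 / math.log(1 + n) for movie, n in counts.items()}


def score(public, record, weights, tolerance=1):
    """Sum the weights of movies that both profiles rated within `tolerance` stars."""
    return sum(
        weights.get(movie, 0.0)
        for movie, rating in public.items()
        if movie in record and abs(record[movie] - rating) <= tolerance
    )


def best_match(public, anonymized, min_gap=1.5):
    """Return the record ID whose score clearly beats the runner-up, else None."""
    weights = movie_weights(anonymized)
    ranked = sorted(
        ((score(public, rec, weights), rid) for rid, rec in anonymized.items()),
        reverse=True,
    )
    if not ranked:
        return None
    top, top_id = ranked[0]
    second = ranked[1][0] if len(ranked) > 1 else 0.0
    # Only claim a match when the best candidate stands out (hypothetical threshold).
    if top > 0 and top >= min_gap * second:
        return top_id
    return None


# Toy "anonymized" release: names replaced by IDs (all data is invented).
anonymized = {
    "user_101": {"Blockbuster A": 5, "Blockbuster B": 4,
                 "Obscure Documentary": 5, "Cult Film": 2},
    "user_102": {"Blockbuster A": 4, "Blockbuster B": 4, "Romantic Comedy": 3},
    "user_103": {"Blockbuster A": 5, "Blockbuster B": 3, "Indie Drama": 4},
}

# Ratings one person posted publicly under their real name (also invented).
public_reviews = {"Blockbuster A": 5, "Obscure Documentary": 4, "Cult Film": 2}

print(best_match(public_reviews, anonymized))
```

On this toy data the function singles out user_101, because the two rarely rated titles carry most of the weight. A public profile containing only a widely rated blockbuster matches every record equally well, so the function returns None rather than guessing.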
Does knowing what you know make you afraid to put sensitive information online?
I’m not paranoid, in part because I know that technology can only do so much. And it’s very important to understand that we need non-technological protections, like legal and regulatory protections. For example, using credit cards online doesn’t bother me because I know that even if there is a fraudulent charge, I don’t have liability. That’s a legal or regulatory mechanism that mitigates damage from technological vulnerabilities.
Tell us about the research you’re doing now.
There are several projects that I am trying to get started here. Machine learning is big these days. These amazing services that we see on our mobile phones like image recognition, voice recognition and natural text translation are all enabled by collecting massive amounts of information from people and then having pretty clever algorithms learn from it.
Of course, there seems to be some kind of conflict with privacy. Data is collected for one particular stated purpose, perhaps image recognition, but then used for another purpose, such as inferring that a particular person was in a certain place at a given time. But in order for these algorithms to work properly, training them requires collecting data from everybody, keeping it in some centralized place, and using it for all kinds of purposes that owners of the data might not have intended.
So I’m trying to look at it from two perspectives. First, we’re trying to understand the invasion of privacy: what could be learned or inferred about people by having access to their data, like their biological data. Second, we’re trying to build systems that can learn from massive amounts of data and build useful predictive models without violating people’s privacy.
With four faculty members focused on security, Cornell Tech has a significant concentration of security experts. Is it fun to have a whole group with subversive personalities?
Yes, it’s great. The really nice thing about it is I feel like there is no problem we couldn’t tackle collectively here. For pretty much any problem related to security, in pretty much any space, we have some expertise.
How do you approach teaching computer security?
Last semester, for the first time, I taught a course called “Privacy in the Digital Age.” I structured it so that, for many lectures, I had external visitors who could talk about different aspects of privacy: founders of privacy-oriented startups, lawyers working on privacy issues, investigative journalists, experts on civil rights, and a former chief technologist of the Federal Trade Commission who could talk about privacy regulation and government.
I felt that this course was what a Cornell Tech education is all about. The students got exposure to issues that are not purely technical. This is not just a vocational school that teaches programming skills; we prepare students with the greater context as well.