The NSA and Big Data

This essay was first published by Praeger Security International (

Recent revelations that the National Security Agency (NSA) has been collecting large sets of telephone call meta-data raise a host of questions, with the most interesting systematic cybersecurity questions revolving around the NSA’s use of call log meta-data. This is yet another example of the phenomenon known as Big Data—the collection, storage, and analysis of massive databases of information. In this short piece, I want to describe Big Data in general terms and then talk about the NSA program as an example of the phenomenon.

Big Data

Increasingly, in a networked world, technological changes have made personal information pervasively available. As the available storehouse of data has grown, so have governmental and commercial efforts to use this personal data for their own purposes. Commercial enterprises use collected information to target their advertising and solicit new customers in order to expand market share. Governments use the data to, for example, identify and target previously unknown terror suspects—to find so-called “clean skins” who are not in any intelligence database. This capability for enhanced data analysis has already proven its utility and holds great promise for the future of commercial activity and counter-terrorism efforts.

Yet, this analytical capacity also comes at a price—the peril of creating an ineradicable trove of information about innocent individuals. That peril is typically supposed to stem from problems of misuse; in the government sphere one imagines data mining to identify political opponents, and in the private sector we fear targeted spam. To be sure, that is a danger to be guarded against.

But, the dangers of pervasively available data also arise from other factors. Often, for example, there is an absence of context to the data that permits or requires inaccurate inferences. Knowing that an individual has a criminal conviction is a bare data point; knowing what the conviction was for and in what context allows for a more granular and refined judgment.

The challenges arising from these new forms of analysis have already become the subject of significant political debate. The NSA meta-data disclosures are but the most recent and the most public.

NSA Meta-Data

The NSA, apparently, collects “meta-data” from Verizon. When we say “meta-data,” we mean the non-content portions of telephonic communications. These include data elements such as what number originated a call; what number was called; how long the call lasted; and quite possibly where, geographically, the two endpoints of the call were physically located. This meta-data is collected for every telephone call in the United States or between the United States and a foreign country. And though the disclosures do not say so directly, there is every reason to suspect that other telecommunications providers (AT&T, Sprint) are subject to similar disclosure orders. Though the contents of the phone call are not recorded, this database of meta-data for every call in the United States is a powerful analytic tool.

The data may well serve two purposes. First, it serves as a repository for what we might call “retrospective link analysis.” When, for example, the Tsarnaev brothers were suspected in connection with the Boston Marathon bombing, this repository of information could be queried to give investigators a picture of whom (if anyone) the Tsarnaevs may have been in contact with prior to the bombing. With appropriate court orders, the subscriber information for the most commonly called phone numbers might be revealed, and that could very well guide further investigation.
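To make the idea of a retrospective query concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record layout (caller, callee, duration), the phone numbers, and the function name `contacts_of` are all invented, and nothing here describes how the NSA actually stores or queries its data. It simply counts how often a subject's number appears alongside each other number.

```python
from collections import Counter

# Hypothetical call records: (caller, callee, duration_seconds).
# The numbers and the record layout are invented for illustration only.
CALLS = [
    ("555-0100", "555-0111", 120),
    ("555-0111", "555-0100", 45),
    ("555-0100", "555-0122", 300),
    ("555-0133", "555-0100", 15),
    ("555-0122", "555-0144", 60),
    ("555-0100", "555-0111", 30),
]


def contacts_of(subject, calls):
    """Count calls between `subject` and each other number, in either direction.

    Returns (number, call_count) pairs, most frequent contact first.
    """
    counts = Counter()
    for caller, callee, _duration in calls:
        if caller == subject:
            counts[callee] += 1
        elif callee == subject:
            counts[caller] += 1
    return counts.most_common()


top_contacts = contacts_of("555-0100", CALLS)
print(top_contacts)
```

Note that the query never touches the content of any call. The list of most frequent contacts falls out of the meta-data alone, which is the point of the example.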

This use, of course, is limited to the extent that the database itself is limited. In the absence of a collection program of the sort operated by the NSA, the retrospective look would only go back as far as the service providers retained the calling meta-data—and that varies from company to company. If (as some have speculated) the NSA program is more than a half-dozen years old, that database would be rich indeed—and far larger than what the commercial service providers would retain on their own account.

The second use is far broader and less narrowly tailored. It would involve the use of big data analytics for what we call social network analysis. In other words, again starting with a particular subject, we might map out not only whom he is connected to, but also how the people he knows may be connected to each other and/or to other as yet unidentified individuals.

Here the science is clear—large databases are effective in establishing social patterns only to the extent they are actually comprehensive. If your argument is that we need to do a social network analysis to find terrorist connections, then you need the entire network to provide the grist for the mill, so to speak. That, almost surely, is what the Director of National Intelligence James Clapper meant when he said, “The collection is broad in scope because more narrow collection would limit our ability to screen for and identify terrorism-related communications. Acquiring this information allows us to make connections related to terrorist activities over time.”

Describing social network analysis in the abstract is difficult. For those who want to see how it operates in a real-world context, I recommend the interesting (and amusing) post by Kieran Healy (a sociology professor at Duke), “Using Metadata to Find Paul Revere.” Healy did a very simple form of matrix analysis using only two factors (the name of a person and the names of the political clubs he belonged to) and applied it to the colonial revolutionaries. The names were familiar, e.g., Sam and John Adams, as were the clubs (the North Caucus and the Long Room Club, for example). He used data collected from historical records by David Hackett Fischer that might well have been available to the British at the time of the revolution.

What he found is quite stunning for those who don’t know big data. Perhaps it’s a bit of a spoiler to say so but it turns out that the data identify one man as the lynchpin for a large fraction of the organization of the clubs and the men in Boston—Paul Revere. And while, in historical retrospect he may not have been THE leader of the revolution, it is pretty clear that he was a significant operative in the revolutionary structure—hence his famous ride. So with just two fields of data and some relatively simple analytics, British counter-intelligence of the era might have learned about his significance. (Note, of course, that more fields of data give even greater granularity and fidelity to the conclusions.)
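For readers who want to see the mechanics, the following is a rough sketch of that kind of two-field matrix analysis. It is not the code or the data from the post described above: the people and clubs are invented placeholders, the function names are my own, and the simple "shared memberships" score used here stands in for the more refined measures a real analysis would use. The idea is to build a person-by-club table of ones and zeros, multiply it by its own transpose to count how many clubs each pair of people share, and then see who is tied to the most other people.

```python
# Hypothetical membership data: the names and clubs are invented placeholders,
# not the historical records discussed in the text.
MEMBERSHIPS = {
    "Person A": {"Club 1", "Club 2", "Club 3"},
    "Person B": {"Club 1"},
    "Person C": {"Club 2"},
    "Person D": {"Club 3"},
    "Person E": {"Club 1", "Club 2"},
}


def membership_matrix(memberships):
    """Build a person-by-club matrix of 0/1 entries (rows and columns sorted by name)."""
    people = sorted(memberships)
    clubs = sorted(set().union(*memberships.values()))
    matrix = [[1 if club in memberships[person] else 0 for club in clubs]
              for person in people]
    return people, clubs, matrix


def shared_memberships(matrix):
    """Multiply the matrix by its transpose.

    Entry [i][j] is the number of clubs that person i and person j both belong to.
    """
    n = len(matrix)
    return [[sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
            for i in range(n)]


def link_scores(people, shared):
    """Score each person by total shared memberships with everyone else."""
    return {
        person: sum(shared[i][j] for j in range(len(people)) if j != i)
        for i, person in enumerate(people)
    }


people, clubs, matrix = membership_matrix(MEMBERSHIPS)
shared = shared_memberships(matrix)
scores = link_scores(people, shared)
most_connected = max(scores, key=scores.get)
print(scores)
print(most_connected)
```

In this toy data set the person who belongs to all three clubs comes out on top, because he is the only bridge between members who otherwise never appear together. That is the same structural role the analysis assigns to Revere.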

And so, we now understand why the NSA was interested in these data sets. Large data sets can, with appropriate manipulation, reveal the organizational details of social structures. Terrorist organizations are social structures of that sort. To my mind it is pretty clear that there are reasonable grounds to believe that the telephone call meta-data database is relevant to the discovery of that structure and therefore relevant to an investigation of those terrorists.

Is It Wise?

But what is legal is not always wise. Whatever this program’s legality, the entire order is remarkably overbroad and quite likely unwise. We fear (with some reason) its potential abuse. The technique is, of course, value neutral. It can be used to discover links for other types of groups, and it can be used in other large data sets. The limits we set are only constraints of law; the technology is not self-limiting.

In the end then, for me at least, the value of this program comes down to empirical questions about which I have no data: How effective has the program been in identifying terrorist social structures? How useful has it been in retrospective investigations? How likely is it to be abused and what preventative controls are in place? And how (if at all) will the existence of the program change citizen behaviors in ways we can’t predict?

In the criminal justice world we have an old maxim: “Better that ten guilty go free than that one innocent suffer in jail.” It is, in essence, a mathematical statement of a risk preference: we’d rather suffer the crime that comes from releasing the guilty than the societal harm that comes from imprisoning an innocent by about a 10–1 margin. Today the question posed by the NSA data collection program is a similar one: “Better that X terrorist structures go undiscovered than that Y innocent Americans have their calling data collected,” where both X and Y are unknown.

For myself, in the absence of hard data, it is difficult to imagine a set of facts that would justify collecting all telephony meta-data in America. We live in a changed world after 9/11. But I would have hoped that it has not changed that much.
