Spying on the censors: How metadata could expose regimes

Abstract

Metadata helps spies to keep tabs on their targets but it can be used to reveal censorship too, as Roger Highfield reports

Simply put, metadata is data about data. In the wake of recent revelations about the US National Security Agency collecting oceans of the stuff, there has been understandable alarm about what such data can reveal about you without anyone needing to listen in to your phone calls or rummage through your emails.

But, in the light of a new analysis, metadata might also offer fascinating opportunities to spy on the censors who spy on us.

Metadata is pervasive. It is written into a digital photo file, for example, to identify who owns it, how large the image is, copyright information and keywords about it. Web pages and video files also carry metadata to describe their content, although most people don’t see it. Text files are imprinted with information about when they were created, their author, plus a summary. And so on.

By far the most common uses of all this metadata are relatively benign, though they can be annoying. Many companies use metadata to reveal our habits, interests and connections to work out what we buy, where we go online and who we talk to, so we can be targeted with hints, suggestions and advertising.

When you send or receive text messages by mobile, Twitter or third-party apps, a huge amount can be deduced. Add in a phone’s location data and email metadata and it is easy to see who you work with (activity in office hours), who you like to hang out with or are related to (lots of calls) and where you like to relax (activity in the evenings).
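To make that reasoning concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the call records are invented, the definition of office hours (Monday to Friday, 09:00 to 16:59) is arbitrary, and the names `CALLS`, `is_office_hours` and `guess_relationships` are hypothetical. It uses only timestamps and contact names, never the content of a call.

```python
from collections import Counter
from datetime import datetime

# Invented call records: (contact, ISO timestamp). No call content is used.
CALLS = [
    ("alice", "2014-03-03T10:15"),  # Monday morning
    ("alice", "2014-03-04T14:40"),  # Tuesday afternoon
    ("alice", "2014-03-05T11:05"),  # Wednesday morning
    ("bob", "2014-03-03T20:30"),    # Monday evening
    ("bob", "2014-03-08T21:10"),    # Saturday evening
    ("bob", "2014-03-09T19:45"),    # Sunday evening
]


def is_office_hours(ts):
    """Assumed rule: Monday to Friday, 09:00 to 16:59."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def guess_relationships(calls):
    """Label each contact by when most of the calls with them take place."""
    office = Counter()
    total = Counter()
    for contact, stamp in calls:
        ts = datetime.fromisoformat(stamp)
        total[contact] += 1
        if is_office_hours(ts):
            office[contact] += 1
    return {
        contact: "colleague" if office[contact] / total[contact] > 0.5
        else "friend or family"
        for contact in total
    }


print(guess_relationships(CALLS))
```

A real profile would draw on far more signals, such as location and email headers, but the principle is the same: the pattern of contact, not its content, gives the game away.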

Facebook can reveal a lot too. The Mirror project, devised by the National Media Museum in Bradford, creative agency RKCR/Y&R and Cambridge University Psychometrics Centre, can deduce much about a person’s personality. The tool, which relies on extensive research by the Cambridge team comprising more than six million test results and in excess of four million individual profiles, can use “likes” to make a pretty good guess about a user’s traits, gender and age.

Every tweet comes loaded with metadata, and if it is geotagged or carries images, that data shows up too. However, there may also be opportunities to use this data to scan for state monitoring and censorship of microblogs, such as Twitter and China’s Sina Weibo, according to work carried out by Donn Morrison at the Digital Enterprise Research Institute (now known as Insight) in Galway, Ireland.

Through various networks we can now interact with people all over the world, from friends and family to business partners and colleagues, courtesy of the telephone, email or the internet. Though these networks look haphazard, they do contain mathematical structure, not least the fact that any two people tend to be linked by a surprisingly short chain of acquaintances: it is a “small world”.

Scientific study of this “small world” idea dates back to investigations by US social psychologist Stanley Milgram at Harvard University in the 1960s. In one experiment, he sent a package to 160 people randomly selected in Omaha, Nebraska, asking them to forward it to a friend or acquaintance who they thought would help bring the package closer to a target person, a stockbroker who lived in Boston, Massachusetts. Amazingly, given the many millions of people living in the US, his experiment suggested there tended to be six people on average linking any one person with another – giving rise to the popular notion that we all may be connected by just six degrees of separation.

Above: Spying in the 21st century rarely involves guns and high-speed car chases. A more realistic James Bond (portrayed here by Daniel Craig) would spend far more time staring at computers

Credit: REX/c.MGM/Everett

Then came an interesting study putting Milgram’s observations into a theoretical framework. Duncan Watts of Columbia University and Steven Strogatz of Cornell University came up with a mathematical model to show that the six degrees of separation idea works because in every small group of friends, websites, power grids or whatever there are a few people, sites, hubs and other nodes that have much wider connections, either across continents or across social divisions. Albert-László Barabási at Northeastern University, Boston, then took the idea further, highlighting that the distribution of a network’s links approximates to what scientists call a “power law”, the signature of a so-called scale-free network. Unlike a bell curve, a power law has no typical value: a tiny fraction of nodes, the hubs, receive a hugely disproportionate share of links, while the vast majority are mostly ignored. The bottom line of all this research is that online social networks do have deep mathematical regularities.
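How a few hubs come to dominate can be seen in a toy model. The Python sketch below grows a network by “preferential attachment”, the rich-get-richer rule commonly associated with Barabási’s work: each newcomer links to one existing node, chosen with probability proportional to the links that node already has. The network size, the random seed and the function names are assumptions made for illustration, and the sketch is not drawn from any of the studies mentioned here.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network in which well-connected nodes attract more links."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in `endpoints` once per link it holds, so a uniform
    # draw from this list picks a node in proportion to its number of links.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges


def degrees(edges):
    """Count the links attached to every node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = preferential_attachment(2000)
ranked = sorted(degrees(edges).values(), reverse=True)
print("largest hub:", ranked[0], "links")
print("median node:", ranked[len(ranked) // 2], "links")
```

Run it and the best-connected node ends up with many times more links than the median node, which is the lopsided pattern a power law describes.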

When Morrison, who is now at the Norwegian University of Science and Technology, simulated on a virtual network the actions of state censors who deleted some 10 per cent of posts, he found the missing links altered the shape of the entire network, leaving it malformed and less connected. This was especially true with popular posts that had been retweeted, which of course are the ones that are more likely to be spotted by the censors.

From the shifts in the network topology (ie the arrangement of its nodes and connecting lines), Morrison was able to spot when censorship was taking place on a wide scale, with 85 per cent accuracy. That means it could be possible to create an app that uses metadata to send an alert when the authorities are tampering with posts.
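The flavour of such an experiment can be conveyed with a toy simulation. The Python sketch below is not Morrison’s method and does not attempt to reproduce his 85 per cent figure. It simply builds an invented network of posts, deletes 10 per cent of them with a bias towards popular posts, and measures one crude feature of the network’s shape: the share of surviving posts that still belong to the largest connected cluster. The network model, the seed, the choice of measure and all function names are assumptions.

```python
import random
from collections import defaultdict


def build_network(n_nodes, rng):
    """Toy network of posts: each new post links to an existing one,
    favouring posts that already have many links (an assumption)."""
    links = defaultdict(set)
    links[0].add(1)
    links[1].add(0)
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        links[new].add(target)
        links[target].add(new)
        endpoints.extend([new, target])
    return links


def censor(links, fraction, rng):
    """Delete a fraction of posts, picking popular posts more often."""
    nodes = list(links)
    weights = [len(links[n]) for n in nodes]
    quota = int(fraction * len(nodes))
    doomed = set()
    while len(doomed) < quota:
        doomed.add(rng.choices(nodes, weights=weights)[0])
    return {
        n: {m for m in nbrs if m not in doomed}
        for n, nbrs in links.items()
        if n not in doomed
    }


def largest_cluster_share(links):
    """Share of posts that sit in the largest connected cluster."""
    seen = set()
    best = 0
    for start in links:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for nbr in links[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best / len(links)


rng = random.Random(42)
network = build_network(2000, rng)
before = largest_cluster_share(network)
after = largest_cluster_share(censor(network, 0.10, rng))
print("largest cluster before censorship:", before)
print("largest cluster after censorship:", round(after, 3))
```

In this toy model the network starts out as a single cluster and breaks into fragments once the popular posts are removed. A monitoring tool would look for a shift of that kind, presumably using richer measures than this one.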

Above: This diagram, created by Raffi Krikorian at Twitter, shows what is revealed from the metadata of one Tweet

Credit: Raffi Krikorian/Twitter

China is one country where this kind of censorship takes place. Research published in 2012 revealed that 16 per cent of all Sina Weibo posts had been deleted and that censorship of messages originating in areas of potential unrest, such as Tibet, occurred in more than half of cases.

Morrison’s work complements related projects. ConceptDoppler, under development by a team at the University of California, Davis, can spot information that is filtered for keywords as it passes along routers on the internet, for instance by the “Great Firewall of China”, which blocks a range of websites.

Morrison says that sensitive word lists play an important role in the cat-and-mouse game of censorship and circumvention, but the network structure also offers some interesting new opportunities to detect and quantify censorship. Indeed, weighing up the impact of censorship on Twitter, LinkedIn and Facebook this way could turn out to be much easier to do than tracking lists of sensitive words, such as “dictator”, “anarchy” and “riot”, which may change depending on what is going through the mind of online censors at any one time.

Studies of metadata could also complement the more traditional reports sent to the Herdict project at Harvard University. When individuals can’t access a site, or have evidence of deleted Tweets and posts, they can report that experience to Herdict through browser toolbars, email, Twitter or Herdict.org.

When crowdsourced information about internet filtering, denial of service attacks and other blockages is blended with a judicious pinch of metadata, the larger-scale uses of censorship, whether for reasons of politics, morality or anything else, will be more transparent than ever before.

Footnotes

Roger Highfield is director of external affairs at the Science Museum Group. He is a former science editor of The Daily Telegraph and a former editor of New Scientist