Lessons from a life of startups, coding, countryside, and kids
The big news in the tech world today is PRISM. This is supposedly a system installed at the major Internet providers like Google, Microsoft and Yahoo that lets the NSA access any user’s data. They’ve probably read the drafts of this post before you have ;-)
The NSA document describes “Collection directly from the servers of these U.S. Service Providers: Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, Apple”. Apparently, the article contains “numerous inaccuracies”, but let’s assume for the moment that PRISM can retrieve specified information (photos, Google Docs etc) for a given person/email address.
As a software developer, my first thought was “Whoa! That is a complicated system”. Each of these companies would have different network topologies, operating systems, databases, data structures, architectures. Honestly, the technical problems faced by PRISM are second only to Jeff Goldblum infecting the alien mothership with a software virus in Independence Day.
I’m not unaware of these challenges. The product that I worked on at IBM probably did less than 50% of what PRISM would need to do (yes, I have considered that it might have been sold to the NSA). It was designed to reach into a wide variety of data sources (databases, messaging buses, files etc), poll them for new events and run scripts based on rules (and those scripts might put data into other databases etc). It was not a trivial system: more than 10 years of development, probably several hundred man-years, and millions of lines of code. It also took quite a lot of configuration to identify the data sources and data types. But how would you do that configuration for each service that Google offers? It’s a mind-boggling endeavour even for an organisation as well-funded and as well-staffed with smart people as the NSA. The slides say that PRISM has a $20m annual budget, so it’s well-funded (though not particularly huge), but I would imagine the vast majority is spent on data mining and analysis of the data rather than the actual retrieval process from the “special sources”.
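To make the shape of that kind of tool concrete, here is a minimal Python sketch of a poll-and-rules loop. It is only an illustration of the general idea, not the design of the IBM product or of PRISM: the names `PolledSource`, `Rule` and `poll_once` are hypothetical, and it assumes every event is a dict with a unique `id` field.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

Event = dict  # assumption: each event is a dict carrying a unique "id"


@dataclass
class Rule:
    """A condition plus the script (action) to run when it matches."""
    name: str
    matches: Callable[[Event], bool]
    action: Callable[[Event], None]


@dataclass
class PolledSource:
    """Wraps one data source; fetch() returns every event currently visible."""
    name: str
    fetch: Callable[[], Iterable[Event]]
    seen_ids: set = field(default_factory=set)

    def new_events(self) -> list:
        # Keep only the events that earlier polls have not already returned.
        fresh = []
        for event in self.fetch():
            if event["id"] not in self.seen_ids:
                self.seen_ids.add(event["id"])
                fresh.append(event)
        return fresh


def poll_once(sources: list, rules: list) -> int:
    """Poll every source once, run matching rules, return how many fired."""
    fired = 0
    for source in sources:
        for event in source.new_events():
            for rule in rules:
                if rule.matches(event):
                    rule.action(event)
                    fired += 1
    return fired


# Tiny demo: an in-memory list stands in for a database or message bus.
queue = [{"id": 1, "severity": "high"}, {"id": 2, "severity": "low"}]
alerts = []
demo_source = PolledSource("demo", lambda: list(queue))
escalate = Rule("escalate", lambda e: e["severity"] == "high", alerts.append)

poll_once([demo_source], [escalate])  # the high-severity event fires the rule
poll_once([demo_source], [escalate])  # nothing new, so nothing fires
```

Even in this toy version, each source needs its own `fetch` function and its own idea of what an event looks like. That per-source configuration is the part that would have to be repeated for every service a company runs.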
And then there are the operational concerns: would Google really let any outside agency run arbitrary queries on their production systems at any time? No way. I just can’t see that a) being tolerated or b) remaining secret from engineers outside that group (e.g., someone debugging a performance problem comes across a query from a system that shouldn’t be there). I’m sure the 20% time of many Google engineers will today be informally spent searching their systems for PRISM. And the pace of change in modern large-scale systems means that new servers appear daily, and old ones disappear. The PRISM system would have to be included in the configuration management system in order to keep up with these changes. It couldn’t simply be configured once and left alone to silently do its job.
I’m quite convinced that this install-and-forget-it architecture for PRISM is not a viable direction. Far too complicated and brittle. But there is a simpler alternative: a data dropbox, much like the old “dead drops” in cold war spy films.
Remember that this is not (apparently) an online monitoring system which is constantly slurping data. It is an offline system (offline not in the network sense, but in the sense that it is not continuously active) which retrieves specific data related to a user. Here’s how I think it actually works:
The advantage of this method is that it requires only a very limited understanding of a company’s technical infrastructure to implement, there’s really nothing for the NSA to keep up to date with, and the internal group can be kept small. And the NSA wouldn’t need to have the dropbox installed within the company, so there’s no chance an inquisitive engineer would find it.
Anyway, that’s how I would build PRISM: A mostly low-tech manual process utilising a small dedicated group within the target organisation.
Dear NSA, when you read this, I’m available for hire but I’m very choosy about my clients so don’t get your hopes up ;-)
Update: I think the target company might automate much of this collection if there were too many requests. Julian Assange has said that Facebook has already done this. My point is really that the NSA won’t have built this system. It would have started as a manual, in-house process as I describe, but may have transitioned to an automated system built internally to ease the reporting burden. If I were cynical, I’d suggest that the NSA might arbitrarily increase the frequency of the requests simply to force the companies into automating the process.
Update 2: This is an interesting article along the same lines, although Google have apparently denied the dropbox scenario since I wrote this post.