Twitter’s Archive

Jan 4, 2018

The day after Christmas the Library of Congress announced that it would no longer be maintaining, collecting and preserving a complete archive of all public tweets. After 31 December 2017, “the Library will continue to acquire tweets but will do so on a very selective basis”. The White Paper to which this announcement links makes it clear that public access to the archive will not be possible for some considerable time:

The Twitter collection will remain embargoed until access issues can be resolved. Three priorities have guided the Library’s work to provide access to the Twitter collection: respect the intent of the producers of the content; honor donor (Twitter) access requirements; and manage taxpayer-provided resources wisely. There is no projected timetable for providing public access at this time.

The announcement also makes it clear that the Library has a very substantial data collection in its grasp, and the scale of this collection is one reason why no feasible access can be provided for the immediate future. An earlier White Paper noted that even by January 2013 the archive ran to 130 terabytes. Twitter has been growing rapidly in the intervening years, so the archive must now be much, much larger, and it no doubt remains true “that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data” (Update on the Twitter Archive, January 2013).

Putting these announcements together, one has the impression that the Library of Congress may have bitten off more than it can comfortably chew, so calling a halt at this point may well have been a prudent choice. As the Library notes, tweets are becoming increasingly non-textual, whereas the content deposited since 2010 has been exclusively textual (no images, no videos, and URLs but no certainty of valid or recoverable links from those references). The announcement also notes that “Twitter is expanding the size of tweets beyond what was originally described…”; this sounds like a rather ‘carping’ or ‘purist’ excuse. Is it buried in the announcement to provide cover for its timing, and as a shield against accusations of betrayal from digital archaeologists in decades to come?

No doubt there will be complaints, and it is certain that the Library of Congress has an enormous technological problem (one which may be more easily solvable within a few short years) in making such an archive usefully accessible. But we should welcome the fact that a very substantial corpus of social media data has been preserved. Nothing comparable will emerge from Facebook, Snapchat, Instagram or Pinterest. Nor should we underestimate the delicate legal and ethical issues that may be involved in making this archive properly accessible and usable. Some of these legal constraints can be traced back to the terms under which Twitter granted the deposit, for example: “The Library will not provide a substantial portion of the Collection on its public website in a form that may be easily subject to bulk download.” If the archive is to be accessible via the web, it is hardly possible to prevent substantial portions being made available seriatim.

The ethical and possibly legal issues are also complex in relation to the rights of the original users of Twitter. As I draft this piece, a British journalist/politician has been deleting thousands of his recorded tweets from his publicly available account. Some of these tweets, we may assume relatively few, are shocking and embarrassing, but Mr Young has been deleting swathes, surely many more than his embarrassment requires. How will the Library of Congress deal with this and similar issues, with tweets that are part of the historical record but have been deleted by their author? There is here a key and unresolved, perhaps unresolvable, tension between the rights of academic and scholarly research and the rights of copyright and individual privacy.

In Following Searle on Twitter I point out that Twitter’s complex but entirely digital institutional structure has a close and intricate relation with its complete documentary arrangement:

….we might encourage the Twitter experts at the Library of Congress, who are curating the enormous historical dump of Twitter, to provide us with a browsing tool and a highly collapsible and expandable, four-dimensional model of the universe of Twitter content, by means of which the historian would be able to zoom into constellations in a network of membership relations and, for each node, to explode timelines of content visible from that node at that point in time. From each content node one would be able to look back at any of the timelines that fed into the author’s account prior to the tweet and also peer into the future to see any accounts that would be in subscriber relationships. (p 119 Following Searle on Twitter)

In fact, the Twitter data dump at the Library of Congress does not replicate the details of each individual’s social arrangements (the accounts following, followed by, blocked, muted and so on are not disclosed). It would be extremely tricky, and invasive of privacy, if Twitter were to pass this data to the Library of Congress, though it would of course be enormously useful to the digital researcher to be able to investigate or measure these social relations.

I suspect that the Library of Congress archive will never be more than a documentary record of the tweets that have been deposited therein. The social structure will be largely lost. But this complete documentary record, even without the social structure, will be an enormously useful prize. Twitter is above all a textual and historical record. It is deeply historical, since every tweet has its unique place in a linear temporal order. The scale of a 12-year-long archive, hundreds of terabytes deep and holding hundreds of billions of tweets, is a challenge. Recognising this, the Library should surely start with an accessible and searchable archive of the first year of Twitter’s content, and then add the second year. This would be a way of gradually testing the feasibility of access, and the ethical and legal issues that will need to be addressed in due course.

A complete archive for Twitter would capture social and textual data

The Library of Congress has an important collection on its hands, and we hope that it will in due course deliver a useful and accessible archive. Fortunately we are not wholly dependent on official or national efforts to collect and archive significant digital media and social media. For some years Brendan Brown, a private citizen and programmer, has been building and maintaining a complete archive of Donald Trump’s tweets. This effort has been well designed, is complete (even with respect to nearly all deleted tweets), and has useful information about some of Trump’s followers. Obviously, such private efforts cannot be other than partial and local, but they do show that archiving and the preservation of our digital culture has to be a distributed effort. Exact Editions is itself an example of such an approach. Digital archives will not just happen. They need to be planned, implemented, and preserved, but they will also be independently motivated and supported. Only if they are accessed and used can their preservation be shown to be worthwhile. I am sure that the Library of Congress has this thought in mind. Its Twitter archive needs to be visible, even if it cannot yet be fully accessible to all, which would be the ideal solution.

Written by adamhodgkin