The State Department Collection consists of the Central Foreign Policy Files, provided to us by the National Archives as XML files. We currently hold all the records available from the National Archives, covering the years 1973 to 1979.
The collection is rich in metadata: each document carries information about classification, handling instructions, and Traffic Analysis by Geography and Subject (TAGS), along with other document-level data. To feature the collection on our website, we stored all of this information in separate fields according to our database schema, and we cleaned the body of each cable to present the text more clearly to end users.
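The mapping from XML records to database fields can be sketched roughly as follows. This is a minimal illustration and not our actual pipeline: the element names (`record`, `classification`, `handling`, `tags`, `body`), the sample record, and the output field names are all assumptions made for the example, and the "cleaning" shown is only whitespace normalization.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real National Archives XML may differ.
SAMPLE_XML = """
<records>
  <record id="1973STATE000001">
    <classification>CONFIDENTIAL</classification>
    <handling>EXDIS</handling>
    <tags>PFOR SHUM CI</tags>
    <body>  LINE ONE OF
CABLE TEXT.   </body>
  </record>
</records>
"""


def parse_records(xml_text):
    """Turn each <record> element into a dict of database-style fields."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for rec in root.iter("record"):
        body = rec.findtext("body", default="")
        rows.append({
            "doc_id": rec.get("id"),
            "classification": rec.findtext("classification", default="").strip(),
            "handling": rec.findtext("handling", default="").strip(),
            # TAGS are assumed here to be whitespace-separated codes.
            "tags": rec.findtext("tags", default="").split(),
            # Collapse runs of whitespace and line breaks in the cable body.
            "body": " ".join(body.split()),
        })
    return rows


rows = parse_records(SAMPLE_XML)
print(rows[0])
```

Keeping the TAGS as a list rather than a single string makes it straightforward to filter or aggregate documents by individual TAG later.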
We have analyzed the collection in various ways, for example using it for event detection exercises based on both topic modeling and burst analysis. Other work using the State Department Cables includes an analysis of official classification policy. We have also contrasted our data with other publicly available datasets. For example, the visualization below shows which countries the State Department uses the Human Rights TAG for, contrasted with the countries it is most secretive about. The color of each bubble indicates how grievous each country's human rights record is according to Freedom House, on a scale of 1 to 7.
We have integrated this data with our search and exploration tools so that users can see which countries are mentioned most often and search for documents associated with those countries. Currently the countries are extracted using the TAGS, since the State Department’s internal recording system took into account which countries a document mentioned. Aggregating this information and using it in conjunction with external datasets and other forms of metadata allows us to create new ways to explore the historical record.
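As a rough sketch of the aggregation described above, the snippet below counts how many documents carry each country TAG and builds a simple index from country to document identifiers. The document records, the `COUNTRY_TAGS` lookup table, and the function name are hypothetical, and the code-to-country pairs should be read as illustrative placeholders and not as an authoritative TAGS reference.

```python
from collections import Counter

# Illustrative lookup only; not an authoritative list of TAGS country codes.
COUNTRY_TAGS = {"CI": "Chile", "FR": "France", "UR": "Soviet Union"}

# Hypothetical documents with their TAGS already split into lists.
docs = [
    {"doc_id": "doc-1", "tags": ["PFOR", "CI"]},
    {"doc_id": "doc-2", "tags": ["SHUM", "CI", "FR"]},
    {"doc_id": "doc-3", "tags": ["ETRD", "UR"]},
]


def country_mentions(docs):
    """Count documents per country and index document ids by country."""
    counts = Counter()
    index = {}
    for doc in docs:
        # set() ensures a repeated TAG counts once per document.
        for tag in set(doc["tags"]):
            country = COUNTRY_TAGS.get(tag)
            if country is not None:
                counts[country] += 1
                index.setdefault(country, []).append(doc["doc_id"])
    return counts, index


counts, index = country_mentions(docs)
print(counts.most_common())  # countries ordered by number of documents
print(index["Chile"])        # documents associated with one country
```

Subject TAGS that are not in the lookup table are simply skipped, so the same pass works on documents that mix country and subject codes.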
We are working on new techniques to facilitate other kinds of analysis. For example, researchers on our team are now using Named Entity Recognition (NER) to extract mentions of persons from the text of the Cables, and to extract more detailed country information than the TAGS can provide. Using NER allows us to take the context of usage into account, for example counting “he” or “she” as a mention of a person when the pronoun stands in for the name later in the sentence or paragraph.
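To make the idea of counting pronouns toward a person concrete, here is a deliberately naive sketch. It is not an NER system and not the method our researchers use: it assumes the person names have already been extracted, and it simply credits each “he” or “she” to the most recently named person. The names, sample text, and function name are invented for illustration.

```python
import re
from collections import Counter

PRONOUNS = {"he", "she"}


def count_person_mentions(text, known_names):
    """Count name mentions, crediting he/she to the last person named."""
    counts = Counter()
    names_by_lower = {name.lower(): name for name in known_names}
    last_person = None
    for token in re.findall(r"[A-Za-z]+", text):
        lower = token.lower()
        if lower in names_by_lower:
            last_person = names_by_lower[lower]
            counts[last_person] += 1
        elif lower in PRONOUNS and last_person is not None:
            # Naive assumption: the pronoun refers to the last named person.
            counts[last_person] += 1
    return counts


text = (
    "Smith met the delegation. He said the talks went well. "
    "Later Jones arrived, and she agreed."
)
counts = count_person_mentions(text, ["Smith", "Jones"])
print(counts)
```

A heuristic like this will misattribute pronouns whenever another person intervenes unnamed or when gender and name do not match, which is the kind of contextual judgment that a proper NER and coreference approach is meant to handle.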