Abstract
Research relating to bioterrorism and its associated risks is interdisciplinary and is performed with a wide variety of objectives. Although published reports of this research have appeared only in the past decade, there has been a steady increase in their number and a continuous diversification of sources, content, and document types. In this analysis, we explored a large set of published reports, identified from accessible indices using simple search techniques, and tried to rationalize the patterns and connectivity of the research subjects rather than the detailed content. The analysis is based on a connectivity network representation built from author-assigned keywords. Network analysis reveals a strong relationship between research aimed at bioterrorism risks and research identified with public health. Additionally, the network identifies clusters of keywords centered on emergency preparedness and food safety issues. The network structure includes a large amount of meta-information that can be used for assessment and planning of research activity and for framing specific research interests.
Consequently, public scientific research that addresses issues associated with bioterrorism has received support from several national and international agencies or programs that are primarily concerned with security. These include the US Department of Homeland Security and its development of a bioterrorism risk assessment (later reviewed by the National Research Council of the US) and the European Union research and development programs that support their CBRN (chemical, biological, radiological, and nuclear) action plan. 3
These initiatives have driven the emergence of wide-ranging research activity aimed at improving understanding and preparedness for bioterrorism hazards. Research concerning bioterrorism includes social and natural sciences and many different subjects and methodologies, and it appears in peer-reviewed articles, reviews, and commentaries. To support this research, academic publishers have established new sections in established journals, promoted special issues, and even developed new, dedicated publications. In this report, without reviewing detailed contents, we have generated a network picture of the research effort surrounding bioterrorism risks, and we have generated indicators that show the complexity, guide appreciation, and promote progression. This representation of the bioterrorism research effort is novel and is complementary to traditional systematic review methodologies 4 that provide an overview of evidence that relates to specific scientific questions in order to support decision making.
Publication Data and Network Construction
A large fraction of published scientific research is indexed by specialist organizations that include PubMed, Google Scholar, and the Thomson Reuters Web of Science. These indices can be interrogated systematically, using structured search strings, and in most cases data describing each identified item, such as titles, authors, keywords, and even abstracts can be downloaded as an organized data set. In late 2012, a search string that is the intersection of the words “bioterror” and “risk” (or “assessment” or “mathematics”), applied to the topic field of the Web of Science index, identified 679 source items. The precise search string was:
Topic=(((bioterror* OR bio-terror* OR biodefense OR bio-defense OR biodefence OR bio-defence) AND (risk OR assessment OR mathematic*)))
It includes distinct spellings and automated extensions and uses the lemmatization option supplied by the search tool. Only a few of the identified sources were published prior to 2002, but at least 40 distinct items are attributed to each subsequent year; there is a broad peak in 2006 corresponding with 80 identified sources. The number of citations per year for the identified sources has increased monotonically since 2002, and the complete set of sources has h-index=39. Approximately 70% of the sources are associated with research from the United States (5% with the UK, 4% with Germany, 3% with France). The Web of Science classifies 28% of the publications as research in “public, environmental and occupational health” and other significant fractions as research concerning “infectious diseases,” “immunology,” “international relations,” and “microbiology.” The Web of Science identifies published material matching the search parameters in more than 100 different journals.
In this case data were extracted manually from the Web of Science, as plain text, using the web interface, but this step can also be performed in code using predefined protocols. The data set can be prepared for analysis using programing tools such as perl. Unnecessary tags, stopwords, and punctuation are removed, and, where necessary, words are returned to lower case, checked for a small number of synonyms, and stemmed (we have used a conservative stemming algorithm that retains recognizable words 5 ).
Approximately 350 of the identified sources include a list of author-assigned keywords in the data record; there are typically 9 keywords per publication. These keywords are separate from, but linked with, the topic field used by the Web of Science classification system. For network analysis, the set of keywords representing each source item is programmatically transformed into a set of unique keyword pairs. The complete list of keyword pairs corresponds to a connected network that can be visualized using software tools such as NodeXL (NodeXL: Network Overview, Discovery and Exploration for Excel, http://nodexl.codeplex.com/). The network, represented in Figure 1, has 1,075 nodes and 12,502 unique edges; nodes represent unique author-assigned keywords, and edges represent pairs of keywords that occur in a single source. The actual layout in Figure 1 is constructed with the Harel-Koren method, 6 which uses a graph theoretic scheme to identify node positions (initially in high dimension). The layout is not particularly crucial, and many alternatives are possible, but the result provides a strong visualization of relational information and, when constructed systematically, provides the starting point for more quantitative exploration.

A network of author-assigned keywords for reports relating to bioterrorism and risk. Nodes represent keywords and have a size scaled according to degree. Links correspond with the use of 2 keywords in a particular source and have a width and opacity scaled by the rate of occurrence. (A) complete network, (B) many links with small weight were removed for clarity.
Network Properties
The network representation in Figure 1 scales the size of the nodes by their degree (the number of links to other nodes in the network) and scales the width and the opacity of the links by their weight (the number of times that a pair of keywords appears together). Additionally, those nodes that represent a named agent or condition have been shaded (the agent category includes words like “ebola,” “ricin,” “poliomyelitis,” and “anthracis” but not words like “toxin,” “bacteria,” or “virus”). In the second part of Figure 1, many links with very small weight have been removed for clarity. The network representation is a novel summary of a collection of published reports that concern bioterrorism and risk; the network properties can be used to indicate some patterns and some collective properties of the research.
The network in Figure 1 illustrates a complex structure with some dominant features. As expected, the “bioterrorism” and “risk” nodes, which are closely connected to the search criteria, both have strong connections with many other keywords. However, it is equally clear that “bioterrorism” is strongly associated with “public” and “health” and that keyword combinations such as “emergency” and “preparedness” are significant within the identified sources. Keywords “smallpox” and “anthrax” are the most strongly connected nodes in the agent category. NodeXL provides several tools for systematic interrogation of network properties and metrics. 7 The nodes in the keyword network have degrees in the range [2, 537], and degree is well represented by an exponential distribution. The average geodesic distance (distance measured by the number of steps on the network) between nodes of the keyword network is 2.5, and the maximum separation of any 2 nodes in the network (diameter) is 5.
Visual inspection of the network in Figure 1 indicates a prominent position for the keyword “vaccine,” particularly in relation to a group of nodes that includes “smallpox” and “vaccinia.” In Figure 2 the network nodes are represented in a plot of betweenness centrality against degree. Betweenness centrality is a network metric that measures the significance of a node for connecting different parts of the network together. 8 Figure 2 shows that the majority of the nodes in the network follow a relatively smooth variation of betweenness centrality with degree. However, 3 nodes, indicated by a different shade in Figure 2, have a relatively high value of betweenness centrality; the exceptional keyword with highest degree is “vaccine.” This property indicates that research in the area of vaccines may bring together several areas of research that each relate to bioterrorism and that it is, therefore, potentially integrative (Figure 1 indicates that research involving viruses may be one of the areas that is linked by the keyword “vaccine”). The other 2 keywords that fall outside of the smooth relationship between degree and betweenness centrality are “exposure” and “maternal.”

A relationship between betweenness centrality and degree for nodes in a keyword network that represents research in bioterrorism and risk. The lighter colored nodes (red) correspond with “vaccine,” “exposure,” and “maternal.” Color images available online at www.liebertonline.com/bsp
Since some keyword pairs occur in multiple source documents, the keyword data establishes a weighted network (eg, the keyword pair “public” and “health” occurs 34 times, and the pair “vaccination” and “smallpox” occurs 10 times). The weights are indicated visually in the network representation, but these may also be used to construct an appropriate distance measure; each link of the network is assigned a length that is the reciprocal of the weight of the link. In this case each path between 2 nodes in the network has a corresponding distance (the sum of the lengths of the individual links), and, in turn, there is a minimum distance between any 2 nodes in the network (corresponding with the shortest path). Several algorithmic processes (eg, Dijkstra 9 ) allow all of the shortest paths between pairs of network nodes to be established at once, thus enabling a systematic examination of proximity.
Table 1 is a ranked list of the shortest path length between the node representing “bioterrorism” and the nodes representing named agents. Although this scheme may include anomalies—such as the separate appearance of “anthrax,” “anthracis,” and “bacillus” in the list of agents—it represents a rough prioritization of harmful agents portrayed by the volume of reports concerning bioterrorism. The “smallpox” and “anthrax” keywords are most closely connected with “bioterrorism” via heavily weighted direct edges, but for the keyword “botulinum,” the shortest path involves a step via keyword “vaccine,” even though there is a weak direct edge.
A ranked list of agent nodes and their shortest path to the node representing the keyword “bioterrorism” in a keyword network that represents research in bioterrorism and risk
The proximity values may also be used to identify potential relationships between keywords even when direct connections—that is, those arising from the use of 2 keywords in a single report—are not explicit. Table 2 includes a list of the shortest path lengths from 2 nodes that represent the named agents “smallpox” and “crimean-congo” to a set of nodes that represent potential modes of spread (or vehicles). In the context of bioterrorism, the keyword network identifies “food,” “animals,” and “water” in connection with “smallpox.” Connectivity is generally weaker for the node representing the keyword “crimean-congo,” but, more significantly, the ranking of connections with nodes that represent modes of transmission (in the context of bioterrorism) is very different and emphasizes “agriculture,” “mosquito,” and “human.”
A list of shortest path lengths from nodes representing the keywords “smallpox” and “crimean-congo” to nodes representing keywords associated with potential modes of transmission for a keyword network that represents research in bioterrorism and risk
Network representation supports many alternatives for identification and visualization of large-scale structures in complex data. Nodes may be grouped into clusters, based on node densities or edge patterns, to reveal community structure. In Figure 3 the keyword network has been partitioned into a set of groups of nodes based on their local topology or motif; star shapes indicate a set of nodes that has a tightly connected spiderweb-type connectivity (a “clique”), and circular sections indicate a group of nodes that are arranged in a dominantly star-shaped connectivity (a “fan”). The community structure in Figure 3 is developed from the keyword network with some weak links removed so that the clusters are quite strongly indicative of particular keyword patterns (these groups can be computed from within the NodeXL program, but alternative clustering schemes can be implemented externally and imported to NodeXL for visualization).

A network of author-assigned keywords extracted from reports concerning bioterrorism. The star-shaped symbols identify clique-like groups of keywords, and the circular sectors identify fan-like clusters of keywords.
One clique structure is centered on keywords “bioterrorism,” “health,” “disease,” and “emergency” and another on keywords “assessment,” “food,” “system,” and “safety.” A third clique is centered on keywords “virus,” “smallpox,” and “vaccination.” These clique structures may indicate coherent and integrated aspects of research activity in the context of bioterrorism.
The keyword “bioterrorism” is also the apex of a keyword fan motif that includes “immunity,” “biotechnology,” and “epidemiology.” Another fan structure is centered on the keyword “disease,” and a third is centered on the keyword “risk.” In the lower right of Figure 3, there is a fan motif centered on the keyword “water.” These fan structures may indicate relatively disconnected activity that spawns cross-disciplinary research in the context of bioterrorism. Other clustering methods applied to the keyword network can identify groups of nodes associated with legal considerations about bioterrorism, groups largely associated with single pathogens, and even a small group of keywords—“drum,” “hide,” and “Africa”—that portray particular events.
Discussion
The network analysis of author-assigned keywords indicates the potential for extracting useful information from large unstructured data. The network is not optimal for analysis of specific bioterrorism incidents, but it is semiquantitative and strongly integrative so that patterns and trends, not apparent from individual reports, are discernible and accountable. It is clear that the construction and analysis of the network can (1) encapsulate information from a large disparate data supply, (2) provide visualization and pattern discovery opportunities that are beyond traditional reviews, and (3) act as a framework for a wider appreciation of multidisciplinary research. The simple analysis above identifies significant anomalies in keyword usage, supports ranking of important subsets of keywords, reveals implicit association between strategic keyword pairs (which may be suitable for initiating scenario construction), and classifies keyword groups according to a community structure. Additionally, this preliminary investigation highlights numerous opportunities for improved extraction of meaningful information from machine mined natural language sources relating to bioterrorism.
It is relatively straightforward to identify instances in which the networked information supply can be used in practical situations that emphasize the complex connectivity of bioterrorism research.
Exercises, both in the field and table based, are an essential part of planning in relation to bioterrorism. Preparation for bioterrorism exercises includes the identification of suitable documentation and relevant expertise, and it is immediately clear that this process can be aided by using information from the keyword network. In the keyword network, the word “preparedness” is not directly connected to keywords representing named agents (with the exception of “influenza”), but a distance measure still indicates a close proximity with “anthrax.” In contrast, keywords “polio” and “tularemia” are further from “preparedness,” and in each case the shortest path includes an additional keyword apart from “bioterror”; the intermediaries are “vaccine” and “smallpox.” It is possible to conclude that an exercise aimed at improved preparedness in relation to anthrax threats could be based on a large amount of directly relevant published information and could be driven, effectively, by anthrax experts. In contrast, for bioterror threats associated with polio or tularemia, published research that is directly relevant for bioterrorism preparedness is scarce, so that an exercise should plan additional scoping and should access complementary expertise arising from vaccination programs or smallpox specialists.
Bioterrorism is a complex concept shared by many stakeholder groups, so that harmonization of appreciation and of application methods across distinct disciplines is a strategic goal. A network approach, applied to segmented data, provides a method for spotting systematic differences between distinct streams of research. Segmentation of the keyword information can be achieved in many ways; most easily, filters can be applied to the meta-information associated with each report following the initial data collection, to segment by the primary location, by the time period of publication, and the like, or alternatively the search string can be augmented to select a subset of results that relate to specialized content.
Table 3 includes results for 2 networks developed from keyword subsets; one corresponds with keywords from reports for which the primary location is in the European Union, and the other corresponds with keywords from reports that are closely associated with agriculture. In the first case, the subset is obtained by filtering, and in the second an additional clause, crop OR animal OR agri* OR agro*, is included in the primary search string. Table 3 reports the values of 2 network metrics, the betweenness centrality and the clustering coefficient, for the 2 keywords “emergency” and “preparedness” (the clustering coefficient reflects the degree to which a node's neighbors are connected to each other 8 and hence in the keyword network can indicate whether a keyword has a meaning that is distinct from its neighbors). The networks representing EU-based reports and reports on agricultural bioterrorism identify 822 and 503 distinct keywords, and both have many features in common with the full keyword network (such as a strong link between the keywords “public” and “health”). However, in the network based on reports of agricultural bioterrorism, the keyword “preparedness” is relatively weakly connected (degree=8) and has a large clustering coefficient (indicating it only appears in 1 keyword string so that its meaning is not distinct). This disparity is not observed for the keyword “emergency” or within the network based on reports identified with the EU. These results indicate that the word “preparedness” may have an unclear meaning, or is less well known, for researchers who address agricultural bioterrorism and that a strategic communications process, establishing a consistent view of preparedness, may be valuable for improved harmonization and for interdisciplinary communications.
Values for network metrics measured in 3 keyword networks. Metrics correspond with keywords “emergency” and “preparedness,” and the networks correspond with the full keyword network (Full), with a network based only on reports from the European Union (EU) and a network based on reports concerning agroterrorism (Agro).
Bioterrorism research is multidisciplinary and often involves large teams working on different aspects, so it is often difficult to identify small, truly representative groups of experts for the purpose of foresight or review. A network picture of researchers can be used systematically to identify finite groups that connect with a wide population of other researchers and, hence, have experiences that cover a wide range of relevant ideas and issues.
Figure 4 shows a network where each node corresponds with an author for 1 of the reports that leads to the bioterror keyword network. Coauthors are joined by weighted links to create the network structure so that darker links correspond with strong collaborations (ie, many jointly authored reports on bioterrorism). This network includes several disconnected subsets of authors (not shown), working in isolation, as well as a large connected component that represents a core research community (the diameter of the largest connected component is 11). In Figure 4, 6 nodes corresponding to the authors with the highest values of betweenness centrality, and their corresponding links, are highlighted to reveal a cluster that stretches across a large fraction of the network. These individuals do not coincide with those that have the largest number of coauthors (highest degree), but they could reasonably form the starting point for assembling a representative group for bioterrorism research.

A network representing coauthorship of published reports that relate to bioterrorism. Each node corresponds with an author, and the weight of the links corresponds with the number of joint publications. Highlighted nodes and links correspond with 6 authors who have a high betweenness centrality.
In each of these 3 examples, it is a network property, such as the connectivity or proximity, that identifies the corresponding feature, such as a disparate word usage or a spanning group, and so these patterns would be difficult to identify by other means (eg, from lists of keywords).
Author-assigned keywords represent only a small element of the information supply that can be identified with the reports on bioterrorism and risk that are extracted from the Thomsen Reuters Web of Science. Authors, locations, titles, references, citations, and free-form abstracts can all be accessed and interrogated by automated methods. Simple processing tools, such as those available at the National Centre for Text Mining (www.nactem.ac.uk), provide powerful opportunities for unsupervised learning from complex machine-readable natural language information such as abstracts. The abstracts corresponding with the sources used above can be considered as a corpus representing published bioterrorism research and, therefore, can be an additional source for discovery. Preliminary automatic recognition of multiword terms in this information supply, using TerMine (a multiword term extraction tool based on the C-value/NC-value scoring algorithm for attribute recognition 10 ), identifies 5,981 distinct terms (and some acronyms). Table 4 includes a ranked list of the identified terms that can be compared and contrasted with both the weighted links in Figure 1 and the ranked agents in Table 1 (which correspond with author-assigned keywords). The TerMine ranking reflects both the frequencies and the significance of the terms identified in the abstracts and in general supports the author-assigned keyword analyses. However, the bioterrorism corpus can be used to construct additional structures, such as a term document matrix, that are the starting point for more detailed text mining and discovery. 11
A ranked list of multiword terms identified in free-form abstracts from published reports on bioterrorism and risk. The scores reflect both the frequency and significance of the terms in the source.
The current analyses are restricted to information extracted from 1 indexed source, but, in practice, the supply could be expanded to include other indices and could encompass a wide variety of document types, including unattributed reports (grey literature), web pages, or even messages. The techniques used to process text sources and to discover patterns scale very effectively with the data supply.
In relation to bioterrorism, a keyword network analysis, and natural language processing, highlight some areas for development. It is immediately obvious that bioterrorism does not have an established terminology, and, therefore, in unsupervised mode, words such as “anthrax,” “anthracis,” and “bacillus-anthracis” appear independent and can become problematic. The data supply used in this analysis was subjected to minimal preprocessing prior to the discovery steps, and it is clear that a specialized and targeted use of synonyms and stemming may be advantageous; currently, these resources are not developed in relation to free form sources that concern bioterrorism. Additionally, it is crucial to maintain some differentiation because “bacillus” might correspond with other organisms, and “anthrax” might be used in specific contexts or domains and so may be distinct from “anthracis.” These issues are closely related to more general considerations of terms and definitions used in natural language in relation to bioterrorism. 12
The data set extracted from the Thomsen Reuters Web of Science is complex and feature rich. Each item is associated with time and geography so that spatiotemporal features are easily accessible—that is, the network diagrams in Figure 1 are strictly dynamic. Additionally, the data could be mined in relation to detection technology, communication methods, or modeling that are beyond the simple segmentation in terms of agents and transmission presented above (eg, “defender-attacker-defender” is identified as a keyword during construction of the network representation, but it is relatively weakly connected to “bioterrorism”). Pairwise word associations, such as the agent transmission associations identified above, are relatively simple, but, within a machine-readable framework, complex word patterns that also involve actions, causations, and consequences can be explored and combined with statistical methods so that potential scenarios, encapsulated by the full literature source, could be addressed. Text mining to discovery causality and to extract events has become a major impetus in the development of biomedical sciences 13 and can contribute to the progression of research into bioterrorism and other areas of security.
Conclusion
Information relating to bioterrorism and risks that is included in research publications is extensive, complex, and disparate. Network analyses and natural language processing methods provide opportunities to explore and visualize this information supply in ways that are complementary to traditional literature reviews. A progression requires identification of strong corpora to represent bioterrorism research, more tools for preprocessing and annotation of free-form sources relating to bioterrorism, and development of discovery techniques that are specialized to bioterrorism and security research sources. However, the largely unsupervised nature of this discovery process improves appreciation of information that crosses disciplinary boundaries and may enhance an ability to address questions concerning communication and integration that are often the goals of biosecurity initiatives such as the EU CBRN Action. 3
Footnotes
Acknowledgments
This research was supported by/executed in the framework of the EU-project AniBioThreat (Grant Agreement: Home/2009/ISEC/AG/191) with the financial support from the Prevention of and Fight against Crime Programme of the European Union, European Commission—Directorate General Home Affairs. This publication reflects the views only of the author, and the European Commission cannot be held responsible for any use that may be made of the information contained therein.
