Abstract
Next-generation sequencing (NGS) technologies have revolutionized biological research by generating genomic data that were once unaffordable by traditional first-generation sequencing technologies. These sequencing methodologies provide an opportunity for in-depth analyses of host and pathogen genomes as they are able to sequence millions of templates at a time. However, these large datasets can only be efficiently explored using bioinformatics analyses requiring huge data storage and computational resources adapted for high-performance processing. High-performance computing allows for efficient handling of large data and tasks that may require multi-threading and prolonged computational times, which is not feasible with ordinary computers. However, high-performance computing resources are costly and therefore not always readily available in low-income settings. We describe the establishment of an affordable high-performance computing bioinformatics cluster consisting of 3 nodes, constructed using ordinary desktop computers and open-source software including Linux Fedora, SLURM Workload Manager, and the Conda package manager. For the analysis of large antibody sequence datasets and for complex viral phylodynamic analyses, the cluster out-performed desktop computers. This has demonstrated that it is possible to construct high-performance computing capacity capable of analyzing large NGS data from relatively low-cost hardware and entirely free (open-source) software, even in resource-limited settings. Such a cluster design has broad utility beyond bioinformatics to other studies that require high-performance computing.
Introduction
Next-generation sequencing (NGS) technologies have revolutionized the sequencing landscape through their ability to generate millions of sequences at a time and have provided new insights into pathogen and host evolution.1-3 Furthermore, the development of cheaper NGS systems means that sequencing is no longer restricted to well-resourced centers of excellence and is becoming a standard tool in many molecular biology laboratories, including in several African countries.4 For example, the Oxford Nanopore MinION costs just under US$1000, and each flow cell can generate 10 to 20 Gb of DNA sequence data.5 Although NGS technologies are becoming increasingly portable and affordable, data management is lagging behind.6 The volume of data generated by NGS brings significant memory and storage requirements, making it challenging to analyze on desktop computers.
Some bioinformatics programs required for such analyses are computationally intensive, requiring high processing power and long computational times either because of the large datasets being analyzed or the complex calculations and simulations performed.7,8 Clusters and servers have been tailored to perform such analyses; however, these are very expensive to establish. This means that researchers from low-income settings (where pathogen NGS data would be particularly useful to confront high burdens of disease) might not be able to afford to buy or rent the necessary computational resources. In settings where computational resources are available but scarce, Internet access is often a limiting factor. Transfer of large datasets between high-performance computing centers and research centers is often very expensive, or the Internet connections are too slow to be effective. In addition, some research data (eg, patient data) may contain personal and sensitive information that requires high levels of protection, making it risky to transfer into public domains. Finally, many institutions have firewalls that prevent users from accessing outside servers and clusters to analyze their data, forcing researchers to perform analyses using ordinary desktop computers.
Our laboratory is focused on studies of virus-host evolution in the context of HIV infection and vaccination. Understanding the interplay between the host immune system and the evolving virus is a major aspect of HIV vaccine design. We use bioinformatics analyses of Sanger and NGS sequence data to analyze viral and antibody gene evolution. The human antibody repertoire is enormous, at greater than 10^11 molecules per individual.9-12 NGS technologies are, therefore, ideally suited for antibody repertoire analyses as they are able to sequence millions of antibody reads at a time. Similarly, the HIV envelope protein, which is the target of antibody responses during HIV infection, varies by as much as 30% between infected individuals.13,14 The amount of data generated through these types of studies has resulted in our laboratory encountering the computational challenges described above.
To enhance our analyses, we have developed a low-cost bioinformatics cluster that is easy to build, capable of analyzing large NGS datasets and performing phylodynamic analyses, and less reliant on Internet data usage. This system, which is amenable to analysis of diverse large datasets, will enhance the ability of under-resourced researchers to independently interrogate large datasets to address locally relevant scientific problems.
Methods
Cluster architecture
Development of a multi-node cluster
Computer hardware
A 3-node (12-core) cluster, consisting of a master node and 2 subsidiary nodes, was developed using ordinary personal computer workstations (Figure 1A). The master node (Bio-Linux) has an Intel(R) Xeon(R) E3-1220 v3 CPU at 3.10 GHz with 4 cores, 32 GB RAM, and a 1 TB SSD. Nodes 01 and 02 each have an Intel(R) Core(TM) i7-3770 CPU at 3.40 GHz with 4 cores per socket and 32 GB RAM (increased from an initial 4 GB RAM per node). We also installed 11 TB and 50 TB network-attached storage (NAS) devices for raw and processed data, respectively (Figure 1A). The nodes perform the computational tasks, whereas the NAS devices store all the data that are generated. Files are also backed up to a separate file system that is further backed up to Linear Tape-Open (LTO) tapes with a capacity of 3 TB each.

Figure 1. Bioinformatics cluster architecture. (A) Schematic description of the cluster architecture, storage, and memory. (B) Bioinformatics cluster file system. The storage area is shown in blue, the user area for data analysis in green, and the restricted system files in red.
Operating system and packages
Fedora release 23 was installed on all the nodes. The standard installation of Linux was used, which does not have a graphical user interface. Fedora uses the RPM Package Manager (RPM; originally Red Hat Package Manager), and the packages were installed from the Fedora repositories using either “yum install” or “dnf install.” During installation, the package files are extracted to their correct locations on the system, and the installation is recorded in the RPM database (by default in /var/lib/rpm). Cluster users were created using the Fedora dashboard.
Networking
Users log in to the master node, which is connected to the external Internet (Figure 1A). All the other nodes are connected to the master node through a local area network (LAN). External access to the cluster is provided by ssh on a non-standard port to reduce exposure to port scanning by automated bots. A firewall was also put in place on the LAN as an additional security measure. Users log in by ssh from their terminal using a username and password supplied by the system administrator.
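As an illustration of this login route, the sketch below writes an OpenSSH client entry to a scratch file. The host alias `biocluster`, the address `cluster.example.org`, the username `jdoe`, and port 2222 are all placeholders and not the values used on our cluster.

```shell
#!/bin/bash
# Write a sample OpenSSH client entry to a scratch file (not ~/.ssh/config).
# Every value below is a hypothetical placeholder.
ssh_cfg="$(mktemp)"
cat > "$ssh_cfg" <<'EOF'
Host biocluster
    HostName cluster.example.org
    Port 2222
    User jdoe
EOF

# With such an entry, a login would look like:  ssh -F "$ssh_cfg" biocluster
cat "$ssh_cfg"
```

Keeping the port in a client entry means users do not have to remember the non-standard port each time they connect.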
Power
The electricity from the main electrical supply passes through a generator and uninterruptible power supply (UPS) before passing through a secondary local UPS, to avoid disruption of analyses due to power failure.
Cluster configuration
SLURM
Simple Linux Utility for Resource Management (SLURM) version 15.08 was installed to manage the cluster resources. SLURM allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for a defined duration. It provides a framework for starting, executing, and monitoring jobs (normally parallel jobs) on the set of allocated nodes, and it arbitrates contention for resources by managing a queue of pending jobs. This means users can share the cluster in a controlled manner. SLURM was first installed on the master node, and the process was repeated on the additional nodes. The number of nodes can be increased in future depending on the laboratory’s computational requirements. Step-by-step installation instructions were obtained from the following website: https://slurm.schedmd.com/quickstart_admin.html.
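To make the node and queue definitions concrete, the following is a minimal sketch of the kind of slurm.conf fragment that describes a 3-node, 12-core cluster. It is not our actual configuration: the hostnames (`master`, `node01`, `node02`), the cluster and partition names, and the `RealMemory` value are assumptions, and we assume here that the master node also runs jobs. `ControlMachine` is the keyword used by SLURM releases of the 15.08 era. The fragment is written to a scratch file so the sketch can be run anywhere.

```shell
#!/bin/bash
# Write an illustrative slurm.conf fragment to a scratch file.
# Hostnames, memory values, and the partition name are hypothetical.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
ClusterName=biocluster
ControlMachine=master
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=master,node01,node02 CPUs=4 RealMemory=30000 State=UNKNOWN
PartitionName=main Nodes=master,node01,node02 Default=YES MaxTime=INFINITE State=UP
EOF

# Count the CPUs declared across all nodes (3 nodes x 4 cores each).
total_cpus=$(awk '/^NodeName=/ {
    split($1, kv, "="); n = split(kv[2], hosts, ",")
    for (i = 2; i <= NF; i++)
        if ($i ~ /^CPUs=/) { sub("CPUs=", "", $i); print n * $i }
}' "$conf")
echo "CPUs declared: ${total_cpus}"
```

The `select/cons_res` setting with `CR_Core_Memory` lets several jobs share a node by core and memory, which corresponds to the non-exclusive access mentioned above. The same slurm.conf must be present on every node.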
Bioinformatics cluster file system
Data and scientific software packages are shared between nodes in the cluster using the Network File System (NFS). The /opt/conda2 directory (for scientific software) and the /home directory (for data being analyzed) are exported from the master node and mounted on the same paths on the worker nodes. This allows scripts submitted to the SLURM scheduler to run in a consistent environment on all nodes of the cluster. User information was synchronized between nodes using Ansible 2.2.0 (https://www.ansible.com/).
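The export and mount arrangement can be sketched as follows. The file contents are illustrative only: the subnet `192.168.1.0/24`, the hostname `master`, and the export options (including the read-only export of the software directory) are assumptions, and the lines are written to scratch files instead of /etc.

```shell
#!/bin/bash
# Illustrative NFS settings, written to scratch files rather than /etc.
# The subnet 192.168.1.0/24 and the hostname "master" are hypothetical.
workdir="$(mktemp -d)"

# On the master node: lines that would go in /etc/exports.
cat > "$workdir/exports.example" <<'EOF'
/opt/conda2  192.168.1.0/24(ro,sync,no_subtree_check)
/home        192.168.1.0/24(rw,sync,no_subtree_check)
EOF

# On each worker node: lines that would go in /etc/fstab,
# mounting both exports on the same paths as on the master.
cat > "$workdir/fstab.example" <<'EOF'
master:/opt/conda2  /opt/conda2  nfs  defaults  0 0
master:/home        /home        nfs  defaults  0 0
EOF

cat "$workdir/exports.example" "$workdir/fstab.example"
```

Mounting on identical paths is what allows the same job script to find both its software and its data regardless of the node on which it runs.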
The cluster has two storage facilities, 11 TB for raw data and 50 TB for processed data (Figure 1A). The storage for raw data is accessible to the institutional sequencing core (where the Illumina MiSeq instruments are housed) to upload data. Cluster users access the cluster from the master node and are able to access the raw data already uploaded in the storage. Users first have to copy the data from the raw data storage into the home directory and then analyze it using various bioinformatics tools installed in the /opt directory. Cluster users have access to the green and blue areas (Figure 1B). The red area shows an example of system files only accessible to the cluster administrator.
The user home directories are on the master node, and by default, all running jobs write their output to the master node disk, which has only 1 TB capacity. Once the disk is full, the master node is unable to allocate other nodes to run jobs. To resolve this, we decongested the master node by mounting an 11 TB volume at /share. Symbolic links (symlinks) were then created from all the user home folders to the share directory to prevent the master node from becoming congested (Figure 1B).
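The relocation can be illustrated with a small sketch in which a scratch directory stands in for the real /home and /share. The username `alice` is hypothetical, and the exact commands used on our cluster may have differed.

```shell
#!/bin/bash
# Demonstrate relocating a home directory and leaving a symlink behind.
# A scratch directory stands in for the real /home and /share;
# the username "alice" is hypothetical.
root="$(mktemp -d)"
mkdir -p "$root/home/alice" "$root/share"
echo "results" > "$root/home/alice/output.txt"

# Move the data onto the larger volume, then link the old path to it.
mv "$root/home/alice" "$root/share/alice"
ln -s "$root/share/alice" "$root/home/alice"

# The old path still works, but the data now live under share/.
cat "$root/home/alice/output.txt"
```

Because the original path still resolves, existing job scripts that write to the home directory continue to work unchanged while the output lands on the larger volume.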
Package manager and installation of bioinformatics packages
We installed several bioinformatics packages relevant to our studies of viral and antibody sequences described below. However, these could be replaced with tools relevant to local laboratory needs. We installed conda (https://conda.io/docs/intro.html) as our package manager. All the installed packages are located in /opt/conda2/ (Figures 1 and 2). All programs were installed on the master node and are executable on all the compute nodes. Conda is a package manager that installs, runs, and updates packages and their dependencies. The conda command is the primary interface for managing installations of various packages. It can query and search the package index and current installation, create new environments, and install and update packages into existing conda environments. Separate environments are created for programs that might have conflicting requirements in terms of the dependencies used or the versions of certain binaries (pre-built executables). The bioconda channel was enabled within conda to provide access to its collection of over 4000 open-source bioinformatics packages.15
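As a sketch of how separate environments might be set up, the commands below enable the bioconda channel and create two environments. The environment names (`phylo`, `align`) and the grouping of packages are assumptions made for illustration, not a record of our installation. By default the script only prints the commands; setting `DRY_RUN=0` would execute them on a machine where conda is installed.

```shell
#!/bin/bash
# Print (and optionally run) conda commands that set up two isolated
# environments. Environment names are hypothetical.
# Set DRY_RUN=0 to execute the commands instead of only printing them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

run conda config --add channels bioconda
run conda create --yes --name phylo beast
run conda create --yes --name align blast muscle clustalo
```

A job script would then activate the environment it needs before calling the program (`source activate phylo` in conda releases of that period, `conda activate phylo` in current ones).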

Figure 2. Access to bioinformatics programs on the cluster. (A) Bioinformatics programs on the cluster. (B) Example of a bash script to run programs on the cluster.
We used open-source tools and established pipelines developed by collaborators to perform bioinformatics analyses on viral and antibody sequence data. Bash scripts were used to execute the various bioinformatics programs, allowing us to specify a given node and the memory required to run each program; an example is shown in Figure 2B. Job scripts are stored in a folder accessible by all users, which standardizes the parameters used for analyses and saves time when performing similar analyses between donors.
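The general shape of such a job script can be sketched with a small generator function. This is not the script shown in Figure 2B: the function name `make_job`, the node name `node01`, the memory value, and the BEAST command line are hypothetical. The `#SBATCH` options used (`--job-name`, `--nodelist`, `--cpus-per-task`, `--mem` in megabytes, and `--output` with the `%j` job ID placeholder) are standard sbatch options.

```shell
#!/bin/bash
# Print a SLURM batch script for one analysis.
# Usage: make_job NAME NODE MEM_MB COMMAND...
# Node names, memory values, and the command are supplied by the caller.
make_job() {
    local name="$1" node="$2" mem_mb="$3"
    shift 3
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=${name}
#SBATCH --nodelist=${node}
#SBATCH --cpus-per-task=4
#SBATCH --mem=${mem_mb}
#SBATCH --output=${name}_%j.out

$*
EOF
}

# Hypothetical example: a BEAST run pinned to node01 with 16000 MB of memory.
make_job beast_run node01 16000 beast -threads 4 input.xml
```

The output could be saved and submitted with `make_job ... > beast_run.sh && sbatch beast_run.sh`, after which `squeue` lists the job while it is pending or running. Generating scripts from one template is one way to keep parameters consistent between users.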
HIV-1 and antibody datasets used for the analyses
Datasets used for the analyses were collected from participants in the Center for the AIDS Program of Research in South Africa (CAPRISA). 16 These individuals were followed up from the time of infection through chronic HIV infection. Ethics clearance for the use of samples was obtained from the Human Research Ethics Committee (Medical) of the University of the Witwatersrand (MM040202), the University of Cape Town (025/2004), and CAPRISA at the University of KwaZulu-Natal (E013/04). HIV-1 sequences were obtained using the Sanger method, 17 and antibody data were obtained by NGS on the Illumina MiSeq as described previously. 18
Results
Cluster performance with computationally intensive programs
Bioinformatics programs perform many analyses, ranging from identifying signatures in sequences to performing complex calculations, simulations, and predictions. Some of these programs are very computationally intensive, requiring large amounts of memory and long computational times. These types of analyses include virus phylodynamics, which provide insights into, for example, intra-host evolution of HIV-1 by estimating the rates of evolution, selection, diversity, and divergence, elucidating spatio-temporal distributions of viruses, and identifying the number of viral infections.19,20
To achieve this, we installed a number of phylodynamics programs on the cluster, including Bayesian Evolutionary Analysis by Sampling Trees (BEAST), 21 which uses Bayesian statistical methods that require long computational times and large amounts of memory. BEAST completed the same task faster on the bioinformatics cluster than on an ordinary machine (164 vs 284 hours, respectively; Table 1). The “ordinary machine” used for the comparison was a MacBook Pro with a 2.6-GHz Intel Core i7 processor and 16 GB RAM. The MacBook Pro makes a good comparison with the computational cluster since it is a high-end machine with above-average processing power and memory compared with other ordinary machines. Furthermore, the bioinformatics cluster analyzed 18 datasets at a time compared with one on the ordinary machine. This further highlights the benefits of the cluster in rapidly performing analyses with computationally intensive programs. For BEAST, data visualization was done using FigTree and SpreaD3 installed on local machines.22,23
Table 1. Comparison of the cluster performance with that of a high specification ordinary machine.
The cluster runs jobs faster compared with the ordinary machine and also runs multiple jobs in parallel, therefore reducing the total time to complete multiple analyses.
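For the single BEAST task reported above (164 hours on the cluster and 284 hours on the ordinary machine), the relative gain can be worked out directly. The short calculation below uses only those two figures; the variable names are ours.

```shell
#!/bin/bash
# Relative speed of the BEAST run reported in the text:
# 164 hours on the cluster versus 284 hours on the laptop.
cluster_h=164
laptop_h=284

speedup=$(awk -v c="$cluster_h" -v l="$laptop_h" 'BEGIN { printf "%.2f", l / c }')
saved_h=$(( laptop_h - cluster_h ))
saved_pct=$(awk -v c="$cluster_h" -v l="$laptop_h" 'BEGIN { printf "%.1f", 100 * (l - c) / l }')

echo "Speed-up: ${speedup}x; time saved: ${saved_h} h (${saved_pct}%)"
```

This covers a single job only. The additional benefit of running several datasets concurrently is not captured by this ratio.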
Cluster performance with large datasets
The cluster has also been applied to antibody repertoire analyses that rely on large datasets to accurately capture antibody evolution. We have thus far analyzed more than 3 TB of NGS antibody repertoire data (Figure 3A), which far exceeds the capability of ordinary desktop computers. These data were obtained from 4 HIV-infected participants at multiple time-points ranging between 7 and 281 weeks post infection (Figure 3B). Each analysis (2-10 million light and heavy chain antibody reads) took 8 to 168 computational hours, and the memory usage varied between 4 and 20 GB. The time and memory requirements of each analysis depended on the size of the input data, that is, the number of reads in the dataset. The cluster enables users to run several jobs simultaneously, thus allowing multiple time-points to be analyzed concurrently.
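One way to run several time-points concurrently is a SLURM job array, sketched below. We do not claim this is how our jobs were submitted: the time-point labels, the data path, and the resource values are hypothetical, with the memory request set near the upper end of the usage reported above. `--array` and the `%A`/`%a` output placeholders are standard sbatch features.

```shell
#!/bin/bash
#SBATCH --job-name=repertoire
#SBATCH --array=0-2
#SBATCH --cpus-per-task=4
#SBATCH --mem=20000
#SBATCH --output=repertoire_%A_%a.out

# One array task per time-point; labels and paths are hypothetical.
timepoints=(wk007 wk100 wk281)

# SLURM sets SLURM_ARRAY_TASK_ID for each task; default to 0 so the
# script can also be run by hand for a quick check.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
tp="${timepoints[$task_id]}"

echo "Task ${task_id}: analyzing time-point ${tp}"
# The actual analysis command would go here, for example a pipeline
# step reading from "$HOME/data/${tp}" (hypothetical path).
```

Submitted with `sbatch`, the three tasks are scheduled independently, so they run side by side when enough cores and memory are free and queue otherwise.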

Figure 3. Cluster performance with large datasets of antibody repertoire data. (A) Total amount of data from the SONAR analyses, followed by the breakdown per donor. (B) Number of antibody sequences analyzed using the bioinformatics cluster per donor and per time point. Heavy and light chain antibody sequence data are shown in black and gray, respectively.
Easy integration of bioinformatics programs on the cluster
Bioinformatics analyses of complex data often involve using different tools to perform specialized tasks as part of a pipeline or a workflow. An example is the antibody repertoire NGS data analysis that we use, which requires many steps, summarized in Figure 4, and for which all of the steps are performed on the cluster. The analysis steps involve the use of the SONAR pipeline, 24 which links several bioinformatics tools, including Clustal Omega,25,26 MUSCLE, 25 Basic Local Alignment Search Tool (BLAST+), 27 BEAST, 28 and DNAML, 29 to process the data and identify sequences related to an antibody sequence of interest (clonally related sequences) (Figure 4). 24

Figure 4. NGS analysis flowchart using the bioinformatics programs on the cluster. The antibody repertoire analysis involves first organizing the data from the sequencing facility (data pre-processing), SONAR analysis, and post SONAR analysis that involves selection of clonally related sequences. The arrows show the flow of data through the various stages, and in some instances, data can go back and forth in certain stages (shown by the thin arrows).
The pipeline was installed as follows. Dependencies were first installed using conda, or as per the developer’s instructions if not packaged within conda. SONAR was downloaded from https://github.com/scharch/SONAR and placed in /opt/conda2/pkgs/. After uncompressing the files, we installed the pipeline by executing “setup.sh” and following the command prompts, which allowed us to specify the paths to the installed dependencies. We then defined the Python path using the following command: “export PYTHONPATH=$PYTHONPATH:/opt/conda2/pkgs/sonar/”. Finally, so that all cluster users could use the bioinformatics programs, we edited the .bashrc scripts to include the paths to all the programs. Plots for data visualization were made using R, which was also installed on the cluster. The cluster, therefore, allows easy integration and automation of these bioinformatics programs, enhancing the laboratory’s throughput.
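The .bashrc step can be sketched with a small helper that appends a line only if it is not already present, so that the edit can be repeated safely for every user. The helper name `add_line_once` and the `/opt/conda2/bin` entry are assumptions for illustration, and a scratch file stands in for a real .bashrc. In bash, the assignment must contain no spaces around `=` or after `:`.

```shell
#!/bin/bash
# Append a line to a shell start-up file only if it is not already there.
add_line_once() {
    local file="$1" line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; on the cluster the target would be
# each user's ~/.bashrc. The PATH entry is a hypothetical example.
rc="$(mktemp)"
add_line_once "$rc" 'export PYTHONPATH=$PYTHONPATH:/opt/conda2/pkgs/sonar/'
add_line_once "$rc" 'export PATH=/opt/conda2/bin:$PATH'
add_line_once "$rc" 'export PATH=/opt/conda2/bin:$PATH'   # no duplicate added

cat "$rc"
```

The single quotes keep `$PYTHONPATH` and `$PATH` unexpanded in the file, so they are evaluated each time a user logs in.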
Discussion
Lack of infrastructure and limited resources are bottlenecks in research conducted in low-to-medium income settings. These regions are often also those that bear the major burden of infectious diseases such as HIV. Conducting disease surveillance and vaccine research to curb the spread of locally relevant pathogens is a public health priority. Our research laboratory is involved in HIV vaccine research which involves generating and analyzing huge amounts of sequencing data. We have successfully set up a bioinformatics cluster using cheap and low-specification CPUs and demonstrated its application in analyzing these large NGS datasets. This approach has broad utility for other pathogens and beyond bioinformatics to other studies that require high-performance computing.
The cost of constructing this cluster was reduced by using relatively old computers that were no longer used in the laboratory and that were upgraded at a small cost to accommodate more data. This approach is very feasible in resource-limited laboratories that have access to old computers, or funding to purchase relatively inexpensive ones. We also reduced costs by using open-source software to configure the cluster. There is a significant body of useful open-source software that also comes with good technical support. Much of this support comes from the community of users who interact on online platforms such as Google Groups, Gitter, GitHub, Biostars, and Stack Overflow. We used the Fedora Linux operating system, as it is open source, stable, and sponsored by Red Hat. The Conda package manager is also open-source software that makes installing bioinformatics packages easy and manageable. The SLURM scheduler is also open-source software with excellent community support.
There are a number of software packages that may be used for cluster configuration management, such as Ansible, Chef, and Puppet. Puppet has advantages when it comes to updating programs on the cluster, in that it automatically updates the other nodes, whereas Ansible has to be run each time to propagate changes to the other nodes. We used Ansible because it is simpler and easier to configure than the other two, and because our cluster is small, with only 3 nodes.
A centralized cluster for running tasks makes it easy to standardize processes and install updates on programs run by users. In addition, it enables good management of memory and storage resources as well as data security, as users do not have to carry huge quantities of data around. Access to the cluster only requires use of the command line and an understanding of Unix systems. If the user’s computer runs a Windows operating system, an SSH client such as PuTTY provides a terminal from which to log in and execute Unix commands on the cluster.
We applied the cluster to the analysis of HIV envelope glycoprotein evolution within individuals 19 and to studying the development of antibody responses to HIV infection. 30 These examples demonstrate how use of this cluster has helped overcome hurdles in data processing and analysis and has generated valuable insights into HIV vaccine design. This computational cluster was built using ordinary desktops, and the memory was upgraded to the maximum limits of the nodes. It will not perform well with analyses that require memory beyond the limit of the infrastructure, for which more sophisticated systems may be required. Future work will look into implementing parallel computing to break down large memory-demanding tasks into manageable jobs that can run within the limits of the available infrastructure.
Conclusion
In this article, we have demonstrated the building and implementation of a low-cost cluster for analyzing large NGS data and performing computationally intensive studies of intra-host evolution of HIV. The establishment of this low-cost cluster demonstrates how researchers from low-income settings can solve global challenges using relatively inexpensive resources. Such an approach of using low-cost technologies and recycling/repurposing equipment to tackle complex scientific problems is highly relevant to Africa and has broader implications in advancing creativity, research, and bringing about home-grown solutions to the challenges facing the continent.
Footnotes
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge research funding from the South African Medical Research Council (MRC) SHIP program and the International AIDS Vaccine Initiative (IAVI). IAVI’s work is made possible by generous support from many donors including: the Bill & Melinda Gates Foundation; the Ministry of Foreign Affairs of Denmark; Irish Aid; the Ministry of Finance of Japan in partnership with The World Bank; the Ministry of Foreign Affairs of the Netherlands; the Norwegian Agency for Development Cooperation (NORAD); the United Kingdom Department for International Development (DFID), and the United States Agency for International Development (USAID). The full list of IAVI donors is available at
. The contents are the responsibility of the authors and do not necessarily reflect the views of USAID or the United States Government. PLM and PvH are supported by the South African Research Chairs Initiative of the Department of Science and Technology and the NRF (Grant Nos 98341 and 64571, respectively). BMM is supported by an NRF SARChI Chair-linked PhD bursary and the Poliomyelitis Research Foundation.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
BMM conceived, designed and tested the high-performance computing cluster. RR and LD installed and maintained the cluster. PvH contributed to configuration of the cluster. LM, CS and PLM provided supervision to BMM. BMM, CS and PLM wrote the manuscript. All authors reviewed and approved the manuscript.
