Abstract
The study captures the COVID-19 lifecycle in different states of India using predictive analytics. Drawing upon the seminal susceptible–infected–removed (SIR) model of capturing the spread of viral diseases, this study models the spread of COVID-19 in the ten most infected states of India (as on 30 April 2020). Using publicly available state-wise time series data of COVID-19 patients during the period 1–30 April 2020, the study uses the forecasting technique of auto-regressive integrated moving averages (ARIMA) to predict the likely population susceptible to COVID-19 in each state. Thereafter, based on the SIR model, predictive modelling of state-wise COVID-19 data is carried out to determine: (a) the predictive accuracy; (b) the likely number of days it would take for the disease to reach the peak number of infections in a state; (c) the likely number of infections at the peak; and (d) the state-wise end date. The SIR model is implemented by running Python 3.7.4 on Jupyter Notebook and using the package Matplotlib 3.2.1 for visualization. The study offers rich insights for policymakers as well as common citizens.
Introduction
The COVID-19 pandemic has probably been the most influential public catastrophe encountered by humankind since World War II. It has changed our outlook towards life and its existential uncertainty. It has forced us to adopt a lifestyle which is different from the one we are used to. It has put a question mark over our economic survival. And even as it has given us an opportunity to reflect and introspect, learn, unlearn and relearn, it still is a nemesis. Therefore, it is but natural to explore its lifecycle patterns. A cautious endeavour to capture the uncertainty may help in planning without displaying over-optimism or undue pessimism about the future. A data-driven approach to predicting the COVID-19 lifecycle, based on a robust theoretical framework from extant research on viral infections, may give us greater insights and help us to prepare rationally for the challenges that lie ahead.
Even as researchers the world over are trying to capture the COVID-19 infection data and build predictive models, to the best of our knowledge, no such attempt has been made in India so far. Moreover, no study so far has attempted to do predictive modelling at a comparatively granular level of each state in India. Such a study is relevant at two levels. First, India offers a unique set-up for the spread of infectious diseases. While we have a robust universal vaccination programme (Banik et al., 2020) and a proven track record of completely eliminating various viral diseases, we are also hampered by a large population and high population density. Second, in a federal structure, different states in India have been able to implement various policies to prevent the spread of COVID-19 with different levels of efficacy. This has resulted in extremely different rates of spread of the disease in different states of India. Therefore, the very specific objective of this study is to carry out predictive modelling of the COVID-19 lifecycle pattern of various states in India. In the process, the study determines: (a) the predictive accuracy; (b) the likely number of days it would take for the disease to reach the peak number of infections in a state; (c) the likely number of infections at the peak; and (d) the state-wise end date. The rest of the article is structured as follows: the second section lays down the conceptual framework; the third section gives a description of the research method; the fourth section presents results from the study; and, finally, the fifth section offers a discussion on implications.
Conceptual Framework
The study is based on the susceptible–infected–removed (SIR) model (Kermack & McKendrick, 1927) of viral infection. The model divides the population into three categories: susceptible, infected and removed. The SIR model has been used by various studies in the recent past to model the spread of COVID-19 (e.g., Liu et al., 2020) and also for other infectious diseases in general (Bhattacharya et al., 2015; Muthuramakrishnan & Martin, 2016). The SIR model captures incremental changes in the number of susceptible, infected and removed individuals over a period of time in a region and, therefore, is summed up in terms of the following three equations:
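$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$
where $S$, $I$ and $R$ are the numbers of susceptible, infected and removed individuals at time $t$; $N = S + I + R$ is the total population; $\beta$ is the effective contact rate; and $\gamma$ is the removal rate.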
The SIR model captures two directions of movement—from susceptible to infected and from infected to removed (Bhattacharya et al., 2015). Further, the SIR model is based on the following assumptions:
(a) the population size is constant; (b) the contact and removal rates are constant; (c) there are no demographic changes during the period of assessment; and (d) the population is well-mixed, such that any infected individual has a probability of contacting any susceptible individual.
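The compartment dynamics described above can be illustrated with a minimal sketch, assuming the standard form of the SIR model in which new infections per unit time are βSI/N and removals are γI. The function name `simulate_sir`, the forward-Euler stepping and all parameter values below are illustrative assumptions; they are not the authors' implementation or estimates.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the SIR equations with a fixed-step forward-Euler scheme.

    Returns one (S, I, R) tuple per day, from day 0 to day `days` inclusive.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Movement from susceptible to infected, and from infected to removed.
            new_infections = beta * s * i / population * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical values chosen only to produce a visible epidemic curve.
history = simulate_sir(population=10_000, initial_infected=10,
                       beta=0.4, gamma=0.1, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"peak on day {peak_day} with about {history[peak_day][1]:.0f} infected")
```

Because every individual who leaves one compartment enters the next, S + I + R stays equal to the population size throughout, which mirrors the constant-population assumption.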
The possible variants of the SIR model that can provide a theoretical framework for modelling the spread of an epidemic are the susceptible–exposed–infectious–recovered (SEIR) and susceptible–infected–recovered–deceased (SIRD) models. The SEIR model assumes that the virus incubates inside the host for a period of time before the infected individual becomes infectious and is, therefore, more suitable for epidemics having a significant incubation period (Li et al., 1999). The SIRD model, on the other hand, considers recovered and deceased as separate compartments and, therefore, has to take the mortality rate into consideration. Some recent research has attempted to model COVID-19 using the SIRD model (e.g., Caccavo, 2020). A major prerequisite for modelling an epidemic using the SEIR or SIRD model is having to consider the possible incubation period of the virus and the mortality rate per unit time, respectively. Given that research is still in progress on various aspects of COVID-19, there is a great degree of uncertainty regarding these values. For example, Fernández-Villaverde and Jones (2020) model COVID-19 using the SIRD model, considering a mortality rate of 0.8 per cent but with the caveat that ‘there is substantial uncertainty about this number’. Since the research on COVID-19 is still in its infancy, the authors of the present study use the SIR model to avoid having to adopt additional assumptions related to the incubation period or the mortality rate. However, once more information is available, the authors intend to use the SIRD framework for modelling COVID-19 in future studies.
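For reference, a common textbook form of the SIRD variant is sketched below. The notation is assumed for illustration and is not taken from the studies cited above: $\beta$ is the effective contact rate, $\gamma$ the recovery rate, $\mu$ the mortality rate per unit time, $D$ the number of deceased individuals and $N$ the total population.

$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I - \mu I, \qquad \frac{dR}{dt} = \gamma I, \qquad \frac{dD}{dt} = \mu I$$

Merging the recovered and deceased compartments into a single removed compartment with removal rate $\gamma + \mu$ gives back the three-compartment SIR structure, which is why the SIR model needs no separate mortality estimate.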
Research Design
Data
The data for the study was obtained from publicly available sources (Rajkumar, 2020). Although the dataset covers the outbreak of COVID-19 from its beginning in each state, we considered only data from 1–30 April 2020. This was done to capture the trend of infection in the general public rather than sporadic noise in the data.
Methodology
The parameters were calculated as follows:
The time series regression technique of auto-regressive integrated moving average (ARIMA) is used to estimate the likely population susceptible to COVID-19 in each state.
The basic reproduction number,
The effective contact rate, β: Number of other individuals that an infected individual comes into contact with per day = per-day infections × 3 × 3.28. For example, let us say Delhi has an average daily infection count of 300 individuals. On day (t), these 300 infected individuals are in contact with the susceptible population, thereby infecting them. However, since it takes 3 days for an infected individual to be confirmed as a COVID-19 patient, the 300 infected individuals of day (
Since
The values of per-day infections and
The number of confirmed COVID-19 cases per day varies not only across states but also across the time period being considered (1–30 April 2020). Due to the fluctuating nature of the data, the following process was adopted to identify a representative figure for per-day infections. First, the 30 data points pertaining to confirmed cases during 1–30 April 2020 were sorted in ascending order. Second, the median of the ten highest values in the data was identified and considered as the number of infections per day. Due to the highly fluctuating number of cases per day, the overall mean, median or quartile was not considered an appropriate representative of per-day infections.
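The selection rule described above (sort the 30 daily counts, keep the ten highest, take their median) can be sketched as follows, together with the stated product per-day infections × 3 × 3.28. The function names, the parameter names and the sample series are hypothetical; the series is made up for illustration and is not state data.

```python
from statistics import median


def representative_daily_infections(daily_cases, top_n=10):
    """Median of the `top_n` highest daily confirmed-case counts."""
    ordered = sorted(daily_cases)      # ascending order
    return median(ordered[-top_n:])    # median of the ten highest values


def daily_contacts(per_day_infections, days_to_confirmation=3, contacts_factor=3.28):
    """Product stated in the text: per-day infections x 3 x 3.28."""
    return per_day_infections * days_to_confirmation * contacts_factor


# Hypothetical 30-day series of confirmed cases (10, 20, ..., 300).
april_cases = [10 * day for day in range(1, 31)]

representative = representative_daily_infections(april_cases)
print("median of ten highest values:", representative)
print("overall median, for comparison:", median(april_cases))
print(f"contacts for 300 per-day infections: {daily_contacts(300):.0f}")
```

For a rising series, the median of the ten highest values sits well above the overall median, which is the reason given for preferring it while infections are still climbing.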
The removal rate, The value of
Results
Table 1. Determining the Effective Contact Rate (β)
Using the package Matplotlib 3.2.1 (Hunter, 2007) and running the code on Jupyter Notebook (Kluyver et al., 2016), we obtained state-wise plots for ten states of India to determine: (a) the overall lifecycle and end date of the pandemic; (b) the peak date and the peak number of infected individuals; and (c) the predicted number of infected individuals as on 30 April 2020, so that the same could be compared vis-à-vis the actual data. The plots for each of the ten states having the highest number of COVID-19-infected individuals as on 30 April 2020 are given in Figure 1. The predictive accuracy (Table 2), peak time and end time (Table 3) have been calculated based on the SIR plots given in Figure 1.
The output (Table 2) suggests that the model has worked well for most of the states. The predictive accuracy of the model has been determined through a comparison of the actual and predicted data as on 30 April 2020. The predictive accuracy is high for Maharashtra, Rajasthan, Tamil Nadu, Andhra Pradesh and West Bengal. This could be due to less variation in the number of confirmed cases per day. On the other hand, for Gujarat, Delhi, Madhya Pradesh, Uttar Pradesh and Telangana, the predictions have been moderately close to the actual figure. This could be due to two factors. First, due to the variation in the number of confirmed cases per day, the calculated value of per-day infection confirmation during 1–30 April 2020 may not be a true representative of the per-day infections. A comparison of the representative value with the actual per-day infection confirmation value as on 3 May 2020 (Table 2) reveals the discrepancy. This is due to the still-climbing rate of infections. Second, it is possible that the number of susceptible people each infected individual is infecting (

Table 2. Predictive Accuracy
Predicting the Peak Date and End Date
The output in Table 3 provides the likely number of days it would take for each state to reach the peak stage, as calculated from 1 April 2020 onwards. As per the present data, all the states are likely to reach the peak stage between 10 and 20 May 2020. The predicted number of cases as on the peak date has been calculated based on the proportion of infected individuals out of the total susceptible population as on the peak date. Further, based on culmination of the infection curve in each plot, the likely end date of the disease has been estimated for each state. All the states are likely to be under the influence of the disease for 175–185 days, as calculated from 1 April 2020. Thus, the influence of the COVID-19 pandemic in India is likely to last until the first week of October 2020.
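The conversion from day counts to calendar dates can be checked with a short sketch. It assumes that 1 April 2020 is counted as day 0 and that the end of the curve is the first day after the peak on which the infected count falls below a threshold; the threshold, the function names and the sample series are hypothetical and are not the authors' procedure. Under the day-0 assumption, 175–185 days from 1 April 2020 corresponds to 23 September–3 October 2020.

```python
from datetime import date, timedelta

START_DATE = date(2020, 4, 1)  # first date of the data window, counted as day 0


def day_offset_to_date(days_from_start, start=START_DATE):
    """Convert a day count measured from the start date into a calendar date."""
    return start + timedelta(days=days_from_start)


def peak_and_end_day(infected_by_day, end_threshold=1.0):
    """Return (peak_day, end_day) for a daily series of infected counts.

    end_day is the first day at or after the peak on which the infected count
    drops below `end_threshold`, or None if the series never falls that low.
    """
    peak_day = max(range(len(infected_by_day)), key=infected_by_day.__getitem__)
    end_day = None
    for day in range(peak_day, len(infected_by_day)):
        if infected_by_day[day] < end_threshold:
            end_day = day
            break
    return peak_day, end_day


# Hypothetical infected counts, one value per day.
sample_curve = [1, 5, 20, 9, 3, 0.5, 0.1]
print(peak_and_end_day(sample_curve))
print(day_offset_to_date(175), day_offset_to_date(185))
```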
Discussion and Conclusion
A model is only as good as the data available. Moreover, every model’s predictive accuracy depends on whether its assumptions are met. First, the extent to which the available data reflects the true situation is questionable because of limited testing during the early stages of the disease. This has resulted in the state-wise data showing considerable variation in the number of COVID-19-positive cases detected every day. Second, the study is based on the following major assumptions: (a) the total susceptible population of the state would be exposed to the virus for 3 months from the first date of data collection, that is, from 1 April 2020; (b) the median of the highest ten instances of confirmed cases per day during the month of April 2020 is representative of per-day infection confirmation; (c) an infected person continues to infect susceptible individuals in the population for 3 days until removed; and (d) each infected individual infects 3.28 other individuals per day. A violation of any of these assumptions results in a poor prediction. However, the findings of the study are in alignment with a similar international endeavour by the Data-Driven Innovation Lab of the Singapore University of Technology and Design (SUTD) to predict the end date of the pandemic in various countries. That study (Luo, 2020, May 5) predicts the theoretical end date of the pandemic in India to be in September 2020. Given the uncertainty and confusion surrounding the pandemic, our model is of some help to policymakers as well as ordinary citizens, giving them some form of timeline regarding the lifecycle of COVID-19 across ten states in India.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
