Vaccination rates


Konstantin Burkin


Introduction


This project was written in a Jupyter notebook using Python 3.7.13. The work is available in my GitHub repository, where it is possible to download the Colab notebook, check the code, and reproduce the results.

Assignment

The goal of this project is to find a dataframe, describe it statistically, remove or fill in the missing values, visualise the patterns in the data, and make a prediction on the basis of the obtained data. In this case, the dataframe describing COVID-19 World Vaccination Progress was chosen. After cleaning the data and visualising the vaccination rates in different countries and continents, the vaccination data were fitted by non-linear regression, and the date on which each country could achieve herd immunity against COVID-19 was predicted.

The result of this project is a dataframe containing the names of the countries and the predicted dates on which herd immunity is reached.

Brief outline

  • Setting up the notebook environment: import of libraries, data import, data subsetting.
  • Exploration of dataframes: description of data types, description of column names, calculation of NAs.
  • Filling missing values
  • Visualisation of absolute vaccination rates across continents and fully vaccinated ratio for each country.
  • Extrapolation and prediction of the date of fully vaccinated population.


Import of data and Python libraries


The analysis uses several Python libraries for data analysis and plotting:
  • Numpy
  • Pandas
  • Plotly
  • Sklearn
  • Scipy
  • Datetime
  • Google colab
The dataframe was downloaded and read from the GitHub repository as a CSV file; it is available on GitHub and on Kaggle. The original dataframe contains 104214 observations of 16 parameters, of which only 8 were selected for this project.

Five random rows of the dataframe

Country Date Vaccinations Vaccinated Fully_Vaccinated Vaccinations_Ratio Vaccinated_Ratio Fully_Vaccinated_Ratio
Canada 2021-07-22 47056217.0 26783674.0 20268756.0 123.61 70.36 53.24
Italy 2021-05-31 35434142.0 23815455.0 12280305.0 58.70 39.45 20.34
Palestine 2022-02-15 NaN NaN NaN NaN NaN NaN
Bangladesh 2021-12-29 NaN NaN NaN NaN NaN NaN
Montenegro 2022-03-28 668025.0 289643.0 281511.0 106.36 46.12 44.82


Exploratory Data Analysis


Statistical data description

The dataframe describes vaccination rates in 235 countries across the world. Vaccinations began at the end of 2020, and vaccination programs still continue. Several parameters describe the vaccination rates; the corresponding column names are listed below with their descriptions. It is important to underline that the number of vaccinations can be larger than the population size, since many vaccines require two shots. Moreover, many people travelled abroad to receive a better or an additional vaccine.

Table of variables

Variable Description Datatype NAs
Country The country for which the vaccination rate is provided object 0
Date Date for the data entry object 0
Vaccinations The absolute number of immunizations in the country float64 50241
Vaccinated Total number of people vaccinated. A person, depending on the immunization scheme, will receive one or more (typically 2) vaccines; at a certain moment, the number of vaccinations might be larger than the number of people float64 52683
Fully_Vaccinated The number of people that received the entire set of immunization according to the immunization scheme (typically 2) float64 55236
Vaccinations_Ratio The ratio between vaccination number and total population up to the date in the country float64 50241
Vaccinated_Ratio The ratio between population immunized and total population up to the date in the country float64 52683
Fully_Vaccinated_Ratio The ratio between population fully immunized and total population up to the date in the country float64 55236
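
The datatype and NA columns of such a table can be produced directly with pandas. The snippet below is a minimal sketch on a tiny made-up frame; the values and the variable names `df` and `summary` are illustrative assumptions, not the project's actual data or code.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that reuses some of the report's column names.
# The values are illustrative only, not taken from the real dataset.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "Vaccinations": [10.0, np.nan, 5.0, 8.0],
    "Fully_Vaccinated": [np.nan, np.nan, 1.0, 2.0],
})

# One row per column: its datatype and the number of missing entries.
summary = pd.DataFrame({
    "Datatype": df.dtypes.astype(str),
    "NAs": df.isna().sum(),
})
print(summary)
```

Calling `df.describe()` on the numeric columns would give the mean, minimum, maximum and quartiles in the same way.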

Statistical description of numeric data

Vaccinations Vaccinated Fully_Vaccinated Vaccinations_Ratio Vaccinated_Ratio Fully_Vaccinated_Ratio
mean 2.12e+08 1.03e+08 8.57e+07 86.25 42.69 37.19
min 0.00e+00 0.00e+00 1.00e+00 0.00 0.00 0.00
max 1.17e+10 5.17e+09 4.71e+09 355.75 124.88 122.94
25% 8.28e+05 4.88e+05 3.87e+05 17.83 12.25 7.49
50% 6.36e+06 3.77e+06 2.95e+06 75.80 45.47 35.78
75% 3.96e+07 2.21e+07 1.82e+07 142.02 69.76 63.88


All three ratio columns (Vaccinations_Ratio, Vaccinated_Ratio, and Fully_Vaccinated_Ratio) contain values above 100%. For Vaccinations_Ratio this is expected, since many vaccines require two immunization shots and each shot is counted. However, the fact that some countries report more fully vaccinated people than their total population (Fully_Vaccinated_Ratio > 100%) seems erroneous.

Filling missing values

It is unlikely that vaccination rates change drastically between two consecutive data entries. Therefore, it seems reasonable to fill the missing values by linear interpolation. The plots in the next section make it possible to check that the filled values are plausible and that the linear approximation was appropriate.
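
The idea of filling gaps by linear interpolation can be sketched as follows. This is a minimal illustration on made-up numbers, not the project's actual code; the grouping by country and the `limit_area="inside"` option (which fills only gaps between two known values) are assumptions about a sensible setup.

```python
import numpy as np
import pandas as pd

# Made-up example: two countries with gaps in the reported ratio.
df = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 2
    ),
    "Fully_Vaccinated_Ratio": [np.nan, 1.0, np.nan, 3.0,
                               10.0, np.nan, np.nan, 40.0],
})

# Sort so that each country's rows are in chronological order.
df = df.sort_values(["Country", "Date"]).reset_index(drop=True)

# Interpolate within each country so values never leak across countries.
# limit_area="inside" leaves leading and trailing gaps untouched.
df["Filled"] = (
    df.groupby("Country")["Fully_Vaccinated_Ratio"]
      .transform(lambda s: s.interpolate(method="linear", limit_area="inside"))
)
print(df)
```

In this sketch the value before a country's first report stays missing, because there is nothing to interpolate from.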

Data visualisation

The first graph presents absolute vaccination rates across six continents. Filling the missing values did not produce any unusual values or outliers. Therefore, the filled data can be used to build predictions for the end of vaccinations.

Before going to the next step, it is interesting to notice that Europe, Asia, and North America were the first continents to develop vaccines against COVID-19 and deploy full-scale vaccination programs. Other continents, like Africa, South America, and Australia, lagged behind in vaccination rates, since they had neither a developed pharmaceutical industry nor enough resources, and they had to wait for vaccine supplies from developed countries. These patterns can be seen in the graphs: Europe, Asia, and North America have had high vaccination rates since the end of 2020.
The ratio of fully vaccinated people is plotted below for each country. It is possible to choose the country of interest using the dropdown menu. Again, it is evident that although vaccinations began at the end of 2020, many countries started their vaccination programs much later. At that time only developed Asian, European, and North American countries could financially and logistically afford to begin vaccinations. Less developed countries started to receive vaccines later, in the beginning of 2021.


Prediction


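To predict the end of vaccinations, non-linear regression was used to fit the vaccination rates in each country. The logistic function was used:
$$ f(x) = \frac{L}{1 + e^{-k (x - x_0)}} $$
where \(L\) is the maximum value of the curve (the plateau), \(k\) is the growth rate, and \(x_0\) is the midpoint of the curve.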

This type of non-linear regression fitted curves for 199 countries with values of \(R^2 > 0.9\). The example of vaccination rates in Ireland shows that this function describes the data very well. Countries whose data could not be fitted with the logistic function were not analyzed in the following work. Generally, these countries did not have enough vaccination data.
After the data were fitted with non-linear regression, it is interesting to estimate the date when vaccination programs can be closed. To do that, it is necessary to determine the threshold after which vaccinations can be stopped. The threshold, in a given population, is the point where the disease reaches a steady state, which means that the infection level is neither growing nor declining exponentially. The threshold is evaluated using the formula below.
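
A logistic fit of this kind can be sketched with `scipy.optimize.curve_fit`. The example below is a minimal sketch that uses synthetic data instead of the real vaccination series; the starting guesses in `p0`, the noise level, and all variable names are assumptions made for illustration and are not taken from the project's code.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, L, k, x0):
    """Logistic curve with plateau L, growth rate k and midpoint x0."""
    return L / (1.0 + np.exp(-k * (x - x0)))

# Synthetic "vaccination ratio" series: a known logistic curve plus noise.
rng = np.random.default_rng(0)
days = np.arange(0, 300, dtype=float)  # days since the first data point
ratio = logistic(days, 80.0, 0.05, 150.0) + rng.normal(0.0, 1.0, days.size)

# Rough starting guesses: plateau near the maximum, midpoint near the middle.
p0 = [ratio.max(), 0.1, np.median(days)]
params, _ = curve_fit(logistic, days, ratio, p0=p0, maxfev=10000)
L_fit, k_fit, x0_fit = params

# Coefficient of determination of the fitted curve.
residuals = ratio - logistic(days, *params)
r2 = 1.0 - np.sum(residuals**2) / np.sum((ratio - ratio.mean())**2)
print(f"L={L_fit:.1f}, k={k_fit:.3f}, x0={x0_fit:.1f}, R^2={r2:.3f}")
```

A country would pass the \(R^2 > 0.9\) filter described above only if `r2` computed in this way exceeds 0.9.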
$$ p_c = 1 - \frac{1}{R_0} $$
where \(R_0\) is the basic reproduction number, the average number of new infections caused by each case in a population where each individual is equally likely to come into contact with any other susceptible individual in the population;
\(p_c\) is the critical proportion of the population that needs to be immune to stop the transmission of the disease, which is the same as the "herd immunity threshold" (HIT). Information about herd immunity thresholds was found here.
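
As a quick worked example of the threshold formula, the sketch below evaluates it for a few values of \(R_0\). The function name `herd_immunity_threshold` is a hypothetical helper introduced here, and returning zero for \(R_0 \le 1\) is an assumption made so that the formula never yields a negative proportion.

```python
def herd_immunity_threshold(r0: float) -> float:
    """Critical immune proportion p_c = 1 - 1/R_0, clipped at zero."""
    if r0 <= 1.0:
        # The formula would be zero or negative: no threshold is needed.
        return 0.0
    return 1.0 - 1.0 / r0

# R_0 = 4 and R_0 = 5 bracket the 75-80% range quoted for the Alpha variant.
for r0 in (0.5, 2.0, 4.0, 5.0):
    print(f"R0 = {r0}: HIT = {100 * herd_immunity_threshold(r0):.0f}%")
```

For \(R_0 = 4\) this gives \(1 - 1/4 = 0.75\), that is, a threshold of 75%.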

Values of \(R_0\) and herd immunity thresholds (HITs) of well-known infectious diseases prior to intervention

Disease Transmission \(R_0\) HIT, %
Measles Aerosol 12-18 92-94
Chickenpox (varicella) Aerosol 10-12 90-92
Mumps Respiratory droplets 10-12 90-92
COVID-19 (ancestral strain) Respiratory droplets and aerosol 2.4-3.4 58-71
COVID-19 (Alpha variant) Respiratory droplets and aerosol 4-5 75-80
COVID-19 (Delta variant) Respiratory droplets and aerosol 5.1 80
COVID-19 (Omicron variant) Respiratory droplets and aerosol 9.5 89
Rubella Respiratory droplets 6-7 83-86
Polio Fecal-oral route 5-7 80-86
Pertussis Respiratory droplets 5.5 82
Smallpox Respiratory droplets 3.5-6.0 71-83
HIV/AIDS Body fluids 2-5 50-80
SARS Respiratory droplets 2-4 50-75
Diphtheria Saliva 1.7-4.3 41-77
Common cold Respiratory droplets 2-3 50-67
Monkeypox Physical contact, body fluids, respiratory droplets 1.5-2.7 31-63
Influenza (1918 pandemic strain) Respiratory droplets 2 50
Ebola (2014 outbreak) Body fluids 1.4-1.8 31-44
Influenza (2009 pandemic strain) Respiratory droplets 1.3-2.0 25-51
Influenza (seasonal strains) Respiratory droplets 1.2-1.4 17-29
Andes hantavirus Respiratory droplets and body fluids 0.8-1.6 0-36
Nipah virus Body fluids 0.5 0
MERS Respiratory droplets 0.3-0.8 0


For this project, the lowest threshold for the Alpha variant of COVID-19, 75%, was used. Using the fitted logistic curves, the dates when immunizations reach the threshold were found (shown below). It is important to underline that out of the 199 countries whose fitted curves had \(R^2 > 0.9\), only 38 have reached or will reach the herd immunity threshold in the near future. For the other countries, the fitted curve shows that herd immunity cannot be achieved under the current trend in vaccination rates.
However, it is important to point out that many people could have acquired immunity to COVID-19 after having had the disease. That could have lowered the percentage of the population that needs to be vaccinated to reach the herd immunity threshold. Unfortunately, this factor has not yet been taken into account in this project.
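
One way to obtain such a date is to invert the logistic function: solving \(L / (1 + e^{-k (x - x_0)}) = p\) for \(x\) gives \(x = x_0 - \ln(L/p - 1)/k\), which has a solution only when the plateau \(L\) exceeds the threshold \(p\). The sketch below illustrates this under stated assumptions: the helper `threshold_date` is hypothetical, \(x\) is assumed to be measured in days since a known start date, and the parameter values are made up rather than fitted to any country.

```python
import math
from datetime import date, timedelta
from typing import Optional

def threshold_date(L: float, k: float, x0: float, start: date,
                   threshold: float = 75.0) -> Optional[date]:
    """Date on which a logistic curve (x in days since `start`) reaches
    `threshold`, or None if the curve plateaus at or below it."""
    if L <= threshold or k <= 0:
        return None  # the fitted curve never crosses the threshold
    days = x0 - math.log(L / threshold - 1.0) / k
    # Round up to the first whole day on which the threshold is met.
    return start + timedelta(days=math.ceil(days))

# Made-up parameters, not fitted values from the project.
print(threshold_date(L=85.0, k=0.04, x0=200.0, start=date(2021, 1, 1)))
print(threshold_date(L=60.0, k=0.04, x0=200.0, start=date(2021, 1, 1)))  # None
```

A plateau below the threshold (the second call) corresponds to the countries for which the fitted curve never reaches herd immunity.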

Predicted dates of the end of vaccinations

Country \(R^2\) Prediction Country \(R^2\) Prediction
Argentina 0.998 01 Jan 2022 Isle of Man 0.986 03 Oct 2021
Australia 0.999 12 Dec 2021 Italy 0.997 23 Nov 2021
Bangladesh 0.994 23 May 2022 Japan 0.998 08 Nov 2021
Belgium 0.998 12 Oct 2021 Kuwait 0.997 30 Jan 2022
Brazil 0.998 31 Mar 2022 Luxembourg 0.991 25 Sep 2022
Brunei 0.997 21 Nov 2021 Macao 0.986 01 Feb 2022
Cambodia 0.998 03 Nov 2021 Malaysia 0.999 03 Nov 2021
Canada 0.993 19 Sep 2021 Malta 0.994 03 Aug 2021
Chile 0.983 07 Sep 2021 Nepal 0.986 21 Jul 2022
Congo 0.984 12 Apr 2023 New Zealand 0.997 31 Dec 2021
Costa Rica 0.993 17 Feb 2022 Peru 0.998 23 Feb 2022
Cuba 0.993 22 Nov 2021 Portugal 0.999 06 Sep 2021
Denmark 0.996 01 Oct 2021 Samoa 0.962 24 May 2022
Faeroe Islands 0.996 23 Sep 2021 Singapore 0.993 13 Sep 2021
Finland 0.999 25 Dec 2021 South Korea 0.998 11 Nov 2021
France 0.997 24 Dec 2021 Spain 0.998 15 Sep 2021
Guernsey 0.994 25 Dec 2021 Taiwan 0.999 08 Feb 2022
Iceland 0.995 06 Sep 2021 United Arab Emirates 0.995 19 Aug 2021
Ireland 0.999 01 Oct 2021 Vietnam 1.000 21 Jan 2022


Results


In this project, the countries that reached the threshold of herd immunity were found. For those countries that can reach this threshold, the date of the threshold attainment was calculated.
  • The vaccination rates were examined, cleaned and plotted for each country and continent. All the graphs in this project are interactive and can be thoroughly examined.
  • The vaccination rates were fitted by the logistic function. This regression fitted the data from 199 countries with \(R^2 > 0.9\). Other countries were not analyzed due to lack of data.
  • Out of these 199 countries, only 38 reached or are predicted to reach 75% immunized population, which is the lowest herd immunity threshold for the Alpha variant of COVID-19.
  • The mathematical model used to fit the vaccination rates shows that the other countries cannot reach the herd immunity threshold under the current trend.



