Provisional Death Counts - Influenza, Pneumonia, COVID

Author

Collin Real

Description

      Our dataset includes weekly provisional death counts for influenze, pneumonia, and COVID from 2020-2023 for all age groups exported from the CDC website. Click here to view the CDC data source.

Import Libraries

library(dplyr)
library(here)
library(tidyverse)
library(ggplot2)
library(reticulate)
use_python("/usr/local/bin/python3")

Load DataFrame

data_location <- here::here("cdcdata-exercise","full_clean_data.csv")
df = read.csv(data_location)
str(df)
'data.frame':   800 obs. of  9 variables:
 $ Start.Week      : chr  "12/29/2019" "12/29/2019" "12/29/2019" "12/29/2019" ...
 $ End.Week        : chr  "01/04/2020" "01/04/2020" "01/04/2020" "01/04/2020" ...
 $ MMWRyear        : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ MMWRweek        : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Jurisdiction    : chr  "United States" "United States" "United States" "United States" ...
 $ Age.Group       : chr  "All Ages" "0-17 years" "18-64 years" "65 years and over" ...
 $ COVID.19.Deaths : num  0 0 0 0 1 0 0 1 2 0 ...
 $ Pneumonia.Deaths: num  4102 19 767 3316 4137 ...
 $ Influenza.Deaths: num  432 22 183 227 472 14 186 272 465 12 ...
summary(df)
  Start.Week          End.Week            MMWRyear       MMWRweek    
 Length:800         Length:800         Min.   :2020   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:2020   1st Qu.:13.00  
 Mode  :character   Mode  :character   Median :2021   Median :25.50  
                                       Mean   :2021   Mean   :25.66  
                                       3rd Qu.:2022   3rd Qu.:38.00  
                                       Max.   :2023   Max.   :53.00  
 Jurisdiction        Age.Group         COVID.19.Deaths Pneumonia.Deaths 
 Length:800         Length:800         Min.   :    0   Min.   :    0.0  
 Class :character   Class :character   1st Qu.:   14   1st Qu.:  153.8  
 Mode  :character   Mode  :character   Median : 1164   Median : 2608.0  
                                       Mean   : 2879   Mean   : 2947.8  
                                       3rd Qu.: 3416   3rd Qu.: 4017.0  
                                       Max.   :25974   Max.   :16884.0  
 Influenza.Deaths 
 Min.   :   0.00  
 1st Qu.:   3.00  
 Median :  11.00  
 Mean   :  56.27  
 3rd Qu.:  32.00  
 Max.   :1048.00  

Histogram Plots

df %>%
    select(Pneumonia.Deaths, Influenza.Deaths, COVID.19.Deaths) %>%
    gather() %>%
    ggplot(aes(value)) +
    geom_histogram(color='black', fill='steelblue') +
    facet_wrap(~key, scales = 'free') +
    ggtitle(("Numerical Predictors - Histograms")) + 
    theme_minimal()

Box Plots

df %>%
    select(Pneumonia.Deaths, Influenza.Deaths, COVID.19.Deaths) %>%
    gather() %>%
    ggplot(aes(value)) +
    geom_boxplot(color='black', fill='steelblue') +
    facet_wrap(~key, scales = 'free') +
    ggtitle(("Numerical Predictors - Boxplots")) + 
    theme_minimal()

This Section contributed by Joaquin Ramirez

Explanation:

Data Structure:

The synthetic dataset has 800 observations and 9 variables, mirroring the structure of the original dataset.

# Create a synthetic dataset with the same structure as the original
set.seed(123) # For reproducibility

# Define the number of observations
n <- 800

# Generate synthetic data
synthetic_df <- tibble(
  Start.Week = sample(seq.Date(as.Date('2020-01-01'), as.Date('2023-12-31'), by="week"), n, replace = TRUE),
  End.Week = sample(seq.Date(as.Date('2020-01-08'), as.Date('2024-01-07'), by="week"), n, replace = TRUE),
  MMWRyear = sample(2020:2023, n, replace = TRUE),
  MMWRweek = sample(1:53, n, replace = TRUE),
  Jurisdiction = rep("United States", n),
  Age.Group = sample(c('0-17 years', '18-64 years', '65 years and over', 'All Ages'), n, replace = TRUE),
  COVID.19.Deaths = round(rnorm(n, mean = 2879, sd = 5000)),
  Pneumonia.Deaths = round(rnorm(n, mean = 2948, sd = 3000)),
  Influenza.Deaths = round(rnorm(n, mean = 56, sd = 100))
)

# Summarize the synthetic data
str(synthetic_df)
tibble [800 × 9] (S3: tbl_df/tbl/data.frame)
 $ Start.Week      : Date[1:800], format: "2023-01-11" "2023-12-13" ...
 $ End.Week        : Date[1:800], format: "2021-05-12" "2021-01-27" ...
 $ MMWRyear        : int [1:800] 2023 2022 2022 2020 2023 2022 2022 2023 2022 2020 ...
 $ MMWRweek        : int [1:800] 47 6 26 28 6 4 22 15 6 3 ...
 $ Jurisdiction    : chr [1:800] "United States" "United States" "United States" "United States" ...
 $ Age.Group       : chr [1:800] "18-64 years" "0-17 years" "65 years and over" "65 years and over" ...
 $ COVID.19.Deaths : num [1:800] 5604 9771 2061 -2147 6097 ...
 $ Pneumonia.Deaths: num [1:800] -1716 2743 5832 3735 3139 ...
 $ Influenza.Deaths: num [1:800] 8 49 100 35 14 -42 102 -18 -81 201 ...
summary(synthetic_df)
   Start.Week            End.Week             MMWRyear       MMWRweek   
 Min.   :2020-01-01   Min.   :2020-01-08   Min.   :2020   Min.   : 1.0  
 1st Qu.:2021-01-13   1st Qu.:2020-12-09   1st Qu.:2021   1st Qu.:14.0  
 Median :2022-01-26   Median :2021-11-24   Median :2022   Median :28.0  
 Mean   :2022-01-17   Mean   :2021-12-18   Mean   :2022   Mean   :27.3  
 3rd Qu.:2023-01-25   3rd Qu.:2022-12-21   3rd Qu.:2022   3rd Qu.:40.0  
 Max.   :2023-12-27   Max.   :2024-01-03   Max.   :2023   Max.   :53.0  
 Jurisdiction        Age.Group         COVID.19.Deaths    Pneumonia.Deaths 
 Length:800         Length:800         Min.   :-11825.0   Min.   :-6492.0  
 Class :character   Class :character   1st Qu.:  -223.8   1st Qu.:  992.2  
 Mode  :character   Mode  :character   Median :  2786.5   Median : 2936.0  
                                       Mean   :  2823.4   Mean   : 2931.6  
                                       3rd Qu.:  6107.0   3rd Qu.: 4880.5  
                                       Max.   : 16635.0   Max.   :11465.0  
 Influenza.Deaths 
 Min.   :-318.00  
 1st Qu.: -15.00  
 Median :  51.00  
 Mean   :  49.57  
 3rd Qu.: 116.00  
 Max.   : 402.00  

Variables included are as follows:

  • Start.Week: ranges from January 1, 2020, to December 27, 2023, with a median date of January 26, 2022.

  • End.Week: ranges from January 8, 2020, to January 3, 2024, with a median date of November 24, 2021.

  • MMWRyear: ranges from 2020 to 2023, with a median year of 2022.

  • MMWRweek: ranges from 1 to 53, with a median week of 28.

  • Jurisdiction: has only one value: “United States”.

  • Age.Group: includes “0-17 years”, “18-64 years”, “65 years and over”, and “All Ages”.

  • COVID-19 Deaths: Minimum: -11825, Maximum: 16635, Median: 2786.5, Mean: 2823.4

  • Pneumonia Deaths: Minimum: -6492, Maximum: 11465, Median: 2936, Mean: 2931.6

  • Influenza Deaths: Minimum: -318, Maximum: 402, Median: 51, Mean: 49.57

# Histogram Plots for Synthetic Data
synthetic_df %>%
  select(Pneumonia.Deaths, Influenza.Deaths, COVID.19.Deaths) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(color='black', fill='steelblue') +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Numerical Predictors - Histograms (Synthetic Data)") + 
  theme_minimal()

Histograms of Synthetic Data:

  • COVID-19 Deaths: The histogram for COVID-19 deaths shows a symmetrical distribution centered around zero, with a range extending roughly from -10,000 to 10,000. The highest frequency of values lies close to the mean, with the counts tapering off symmetrically as they move away from the center.

  • Influenza Deaths: Similar to the COVID-19 deaths, the histogram for influenza deaths also shows a symmetrical distribution around zero. The range is narrower, extending from approximately -200 to 400, with the highest concentration of values close to the mean.

  • Pneumonia Deaths: The histogram for pneumonia deaths displays a wider range compared to influenza, extending from around -5,000 to 10,000. The distribution is symmetrical, with the highest frequencies near the center.

# Box Plots for Synthetic Data
synthetic_df %>%
  select(Pneumonia.Deaths, Influenza.Deaths, COVID.19.Deaths) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot(color='black', fill='steelblue') +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Numerical Predictors - Boxplots (Synthetic Data)") + 
  theme_minimal()

Box Plots of Synthetic Data:

  • COVID-19 Deaths: The box plot for COVID-19 deaths indicates that the data is centered around zero. The interquartile range (IQR) is very narrow, with a few outliers extending to approximately negative or positive 10,000.

  • Influenza Deaths: The box plot for influenza deaths also shows a central tendency around zero with a narrow IQR. Outliers are present but less extreme compared to COVID-19 deaths.

  • Pneumonia Deaths: The box plot for pneumonia deaths follows a similar pattern, with the data centered around zero and a narrow IQR. There are a few significant outliers extending to negative or positive 10,000.

Conclusion:

The synthetic data generation process has created a dataset that closely mirrors the original data’s structure and statistical properties. Key insights from the synthetic data include:

  • COVID-19 Deaths: The data shows a significant impact, with death counts ranging from -11,825 to 16,635. The median death count is 2,786.5, indicating substantial variability in the number of deaths.

  • Pneumonia Deaths: Death counts range from -6,492 to 11,465, with a median of 2,936. This suggests considerable fluctuation in the impact of pneumonia over the observed period.

  • Influenza Deaths: The range of death counts is narrower, from -318 to 402, with a median of 51. This indicates that while influenza deaths are generally lower, they still exhibit variability.

Overall, the synthetic dataset provides us with a realistic and an alternative for analysis, allowing us to explore and model the data without compromising the original dataset. This synthetic data retains the key characteristics necessary for effective exploratory data analysis and modeling.