Data Origin

COVID-19 Deaths: 1/03/2020 - 17/04/2020

COVID-19 hit the news headlines in December 2019, the first identified case was in Wuhan, China, and has since spread globally resulting in a pandemic. The first recorded case hit UK shores on the 31st January and the UK experience its first COVID-19 related death on 28th February in Scotland. In England the first death was on 4th March. The following considers datasets available through the Office for National Statistics website concerning the deaths caused by COVID-19 in England from the 1st March 2020 to the 17th April 2020.

Between 1 March and 17 April 2020 there were 90,232 deaths occurring in England and Wales that were registered by 18th April. Almost 22.5% of these deaths involved the coronvirus (COVID-19).

The information used in these datasets were collected by the Office for National Statistics when certified deaths were registered with the local registry offices. The data provided offers information regarding the number of total COVID-19 related deaths as well as the death rate per 100,000 people according to age, region, gender and deprivation scale.

The data used was provided in a very large excel spreadsheet, with multiple sheets, and I have used a number of differing datasets for this project. As a results I do not have one ‘master’ dataset, instead I have a few. The datasets provided by the Office for National Statistics were not very R friendly, so they needed some manipulation before uploading to R as csv files. Below is an example of a couple of the datasets I have used.

Weekly Deaths by Age Group and Gender

##     ï..Week Total.1 Total1.4 Total5.9 Total10.14 Total15.19 Total20.24
## 1 28-Feb-20       0        0        0          0          0          0
## 2 06-Mar-20       0        0        0          0          0          0
## 3 13-Mar-20       0        0        0          0          0          0
## 4 20-Mar-20       0        0        0          0          0          1
## 5 27-Mar-20       0        0        0          0          2          1
## 6 03-Apr-20       0        0        0          0          3          5
##   Total25.29 Total30.34 Total35.39 Total40.44 Total45.49 Total50.54 Total55.59
## 1          0          0          0          0          0          0          0
## 2          0          0          0          0          0          0          0
## 3          0          0          0          1          0          0          1
## 4          1          0          0          2          7         16         13
## 5          3         11          9          9         26         37         71
## 6          5         12         14         27         70         92        191
##   Total60.64 Total65.69 Total70.74 Total75.79 Total80.84 Total85.89 Total90.
## 1          0          0          0          0          0          0        0
## 2          0          0          0          1          1          0        1
## 3          4          5          6          2          6          8        7
## 4         20         27         35         46         73         84       67
## 5         90        126        210        285        340        323      263
## 6        269        354        572        784        981        832      778
##   F.1 F1.4 F5.9 F10.14 F15.19 F20.24 F25.29 F30.34 F35.39 F40.44 F45.49 F50.54
## 1   0    0    0      0      0      0      0      0      0      0      0      0
## 2   0    0    0      0      0      0      0      0      0      0      0      0
## 3   0    0    0      0      0      0      0      0      0      1      0      0
## 4   0    0    0      0      0      1      1      0      0      1      5      8
## 5   0    0    0      0      1      1      2      3      4      2     13     13
## 6   0    0    0      0      2      2      3      4      4     11     29     31
##   F55.59 F60.64 F65.69 F70.74 F75.79 F80.84 F85.89 F90. M.1 M1.4 M5.9 M10.14
## 1      0      0      0      0      0      0      0    0   0    0    0      0
## 2      0      0      0      0      1      0      0    1   0    0    0      0
## 3      0      3      1      2      1      0      1    2   0    0    0      0
## 4      5      9     11     13     16     33     39   35   0    0    0      0
## 5     25     28     44     74     95    118    142  124   0    0    0      0
## 6     73     84    112    183    274    359    331  396   0    0    0      0
##   M15.19 M20.24 M25.29 M30.34 M35.39 M40.44 M45.49 M50.54 M55.59 M60.64 M65.69
## 1      0      0      0      0      0      0      0      0      0      0      0
## 2      0      0      0      0      0      0      0      0      0      0      0
## 3      0      0      0      0      0      0      0      0      1      1      4
## 4      0      0      0      0      0      1      2      8      8     11     16
## 5      1      0      1      8      5      7     13     24     46     62     82
## 6      1      3      2      8     10     16     41     61    118    185    242
##   M70.74 M75.79 M80.84 M85.89 M90.
## 1      0      0      0      0    0
## 2      0      0      1      0    0
## 3      4      1      6      7    5
## 4     22     30     40     45   32
## 5    136    190    222    181  139
## 6    389    510    622    501  382

Deprivation across regions in England

This data was also acquired from the Office for National Statistics Website.

##   ï..OA11CD OAC11CD                                   OAC11NM  LSOA11CD
## 1 E00000001     2d3                   EU White-Collar Workers E01000001
## 2 E00000016     2d3                   EU White-Collar Workers E01000002
## 3 E00000010     2d3                   EU White-Collar Workers E01000003
## 4 E00000030     3b2              Bangladeshi Mixed Employment E01000005
## 5 E00000049     4b1                  Asian Terraces and Flats E01000006
## 6 E00175106     3b3 Multi-Ethnic Professional Service Workers E01000007
##                    LSOA11NM IMD SOAC11CD                 SOAC11NM  MSOA11CD
## 1       City of London 001A   9       6a  Inner city cosmopolitan E02000001
## 2       City of London 001B  10       6a  Inner city cosmopolitan E02000001
## 3       City of London 001C   5       6a  Inner city cosmopolitan E02000001
## 4       City of London 001E   3       6a  Inner city cosmopolitan E02000001
## 5 Barking and Dagenham 016A   5       7b Young ethnic communities E02000017
## 6 Barking and Dagenham 015A   3       6a  Inner city cosmopolitan E02000016
##                   MSOA11NM   LAD17CD              LAD17NM LACCD
## 1       City of London 001 E09000001       City of London  5a1r
## 2       City of London 001 E09000001       City of London  5a1r
## 3       City of London 001 E09000001       City of London  5a1r
## 4       City of London 001 E09000001       City of London  5a1r
## 5 Barking and Dagenham 016 E09000002 Barking and Dagenham  4a1r
## 6 Barking and Dagenham 015 E09000002 Barking and Dagenham  4a1r
##                                     LACNM   RGN11CD RGN11NM  CTRY11CD CTRY11NM
## 1                     London Cosmopolitan E12000007  London E92000001  England
## 2                     London Cosmopolitan E12000007  London E92000001  England
## 3                     London Cosmopolitan E12000007  London E92000001  England
## 4                     London Cosmopolitan E12000007  London E92000001  England
## 5 Ethnically Diverse Metropolitan  Living E12000007  London E92000001  England
## 6 Ethnically Diverse Metropolitan  Living E12000007  London E92000001  England
##      FID
## 1 146308
## 2 146348
## 3 146346
## 4 146302
## 5 146306
## 6 146497

Research Questions

COVID-19 is a factor in everyone’s lives at the moment, most people are aware that the elderly and sick are the most vulnerable, but even in the first six weeks of the virus hitting the UK people from every age and background were affected.

The following graphics will attempt to show that even though the death rate among the most vulnerable is high, anyone could be afflicted by it.

Visualisation 1

Total Deaths in England, Deaths caused by COVID-19

#this will produce a basic barchart

ggplot(TotalDeaths, aes(x=Type, y=Deaths, fill=Type))+geom_bar(stat = 'identity', position='dodge')+
  ggtitle('Total COVID-19 Deaths compared to Total Deaths in England')

Total Deaths in England

Cause Total Deaths Male Female
COVID-19 19,135 11,472 7,843
Overall deaths 84,908 43,306 41,602
Percentage of Deaths caused by COVID-19 22.5361568% 26.4905556% 18.852459%

By April 17th 22.5% of deaths in England was caused by COVID-19, this was higher for Males (26.5%) compared to females (18.9%).

Visualiation 2

Deaths by COVID-19, Gender Comparison

Its seems that by the 10th April 2020 males were more likely to be killed than females by COVID-19 related illnesses.

Female.Deaths <- ggplot(Animate.data, aes(x=Age.Group, y=FemaleDeaths, fill=Age.Group))+
  geom_bar(stat='identity')+
  
  labs(title='Cumulative Death Count in Females according to Age Group', 
       subtitle='{closest_state}',
              #The {} function allows the 'Week' Label to change as the y axis increases
       x='Age Group')+
  
  geom_text(aes(label=format(round(FemaleDeaths)), y=FemaleDeaths),
            position=position_dodge(0.9), vjust=-1)+
  #the round function is required as the labels on each bar would present with up to 7 decimal places, which is very hard to read.
  
  scale_fill_gradient(low='red1', high='gold1')+
  #the gradient function slowly changes the first colour to the last colour
  
  scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10), labels = c("0-10", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+"))+
  scale_y_continuous(limits = c(0,3000), breaks = seq(0,3000,by=500))+
  theme_classic()+
  #theme_classic ensures a blank background on the figure
  theme(legend.position='none')+
  
  transition_states(states=Week, transition_length=3, state_length = 3)+
  ease_aes('cubic-in-out')
  #These functions are essential for gganimate to animate the bars.

Female.Deaths.gif <- animate(Female.Deaths, width=540, height=280)

Female.Deaths.gif

#The above code is the same for the Male animated bar chart, with amendments to include the male data

Visualisation 3

How COVID-19 has affected different age groups

By the 10th April 2020 no children under the age of 10 years had been killed by COVID-19, and so a figure is not included to represent this age group, as it would just be blank.

The original data set split age groups into every 5 years, (e.g. 5-9, 10,14), totalling 20 age groups. To represent these small age ranges would produce too many figures, and be too many variables on a graph to be meaningful. It was decided instead to group these age groups together to cover 10 years (e.g. 10-19, 20-29, 30,39) resulting in a total of 10 age groups.

Age.Percent = mutate(Age.Percent,
                     Deaths.Percent = (Deaths/sum(Deaths)*100))

#waffle plots are 10x10 grids, in which each cell represents 1%, summing up to total 100%

parts <- c(Age.Percent$Deaths.Percent)
Age.Waffle  <-  waffle(parts, rows=10)+
  labs(fill='Age Group',
       title='Percentage of Deaths per Age Group',
       caption='The lower age groups (0-9, 10-19, 20-29, 30-39) are combined as they have small influence  
       on the total in comparison to the other age groups')+
  scale_fill_discrete(labels=c(Age.Percent$AgeGp))

Age.Waffle

Age.Percent = mutate(Age.Percent,
                     Deaths.Percent = (Deaths/sum(Deaths)*100))
#Calculates percentages for each value in the Number of Deaths column


parts <- c(Age.Percent$Deaths.Percent)
#creates a vector with the percentages for the waffle chart

Age.Waffle  <-  waffle(parts, rows=10)+
  labs(fill='Age Group',
       title='Percentage of Deaths per Age Group',
       caption='The lower age groups (0-9, 10-19, 20-29, 30-39) are combined as they have small influence on the total in comparison to the other age groups')+
  scale_fill_discrete(labels=c(Age.Percent$AgeGp))
#here the figures under each '5-year' age group are added together to produce figures for the new age groups.
WD$Total0.9 <- WD$Total.1 + WD$Total1.4 + WD$Total5.9
WD$Total10.19 <- WD$Total10.14 + WD$Total15.19
WD$Total20.29 <- WD$Total20.24 + WD$Total25.29
WD$Total30.39 <- WD$Total30.34 + WD$Total35.39
WD$Total40.49 <- WD$Total40.44 + WD$Total45.49
WD$Total50.59 <- WD$Total50.54 + WD$Total55.59
WD$Total60.69 <- WD$Total60.64 + WD$Total65.69
WD$Total70.79 <- WD$Total70.74 + WD$Total75.79
WD$Total80.89 <- WD$Total80.84 + WD$Total85.89


#A new data frame is created to only include the new '10-year' age groups.
Weekly.Age <- data.frame("Week" = WD$ï..Week,
"Age0" = WD$Total0.9,
"Age1" = WD$Total10.19,
"Age2" = WD$Total20.29,
"Age3" = WD$Total30.39, 
"Age4" = WD$Total40.49,
"Age5" = WD$Total50.59, 
"Age6" = WD$Total60.69, 
"Age7" = WD$Total70.79, 
"Age8" = WD$Total80.89, 
"Age9" = WD$Total90.)
Age01 = ggplot(Weekly.Age, aes(x=Week, y=Age1))+
#Each plot will represent one age group, with the first being the 10-19 age group. Ages <10 were not included in these figures due to there not being any COVID-19 related deaths in the time period considered.
  
  geom_area(fill="turquoise3", alpha=0.4) +
  geom_line(color="turquoise4", size=2) +
  geom_point(size=3, color="turquoise4") +
  #The three functions, area, line and point allow a filled line graph to be produced.
  
  ggtitle("COVID-19 Deaths: Age 10-19")+
  labs(x='Week Ending',  y='Number of Deaths', subtitle = 'Note that the y axis only goes up to 100 deaths, compared to 3000 for the older age categories')+
  scale_y_continuous(limits = c(0,100), breaks = seq(0,100,by=10))+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7), labels = c("28-Feb-20", "06-Mar-20", "13-Mar-20", "20-Mar-20", "27-Mar-20", "03-Apr-20", "10-Apr-20"))
#Labelling the x axis with each week end date makes for easier understanding and viewing of the figure.

#All age plots use the same code as Age01, with the y axis altered to fit the correct age group.
#all figures are arranged using 'grid.arrange' from library 'grid' & 'gridExtra' for easy viewing and comparison.

grid.arrange(Age01, Age02, Age03, ncol=3)

#Note that the y-axis limit is 100 deaths for the first 4 figures and increases to 3000 deaths for the final 5 figures

grid.arrange(Age04, Age05, Age06, ncol=3)

grid.arrange(Age07, Age08, Age09, ncol=3)

The following figure presents the total deaths for each age group caused by COVID-19. Hover over each data point to see its y value.

Total.Age.Plot <- ggplot(Total.Age, aes(x=Age, y=Deaths, group=1))+
  geom_area(color='darksalmon', alpha=0.4)+
  geom_line(colour='darkred', size=2)+
  geom_point(colour='darkred', size=4)+
  scale_y_continuous(limits=c(0,6000), breaks=seq(0,6000,by=500))+
  ggtitle('Total COVID-19 Deaths in each Age Group by 10th April 2020')+
  labs(x='Age Group', y='No of Deaths')+
  theme_minimal()

ggplotly(Total.Age.Plot)
#plotly is used to make an interactive figure. If the mouse is hovered over a datapoint on the figure, the data will show.

Visualisation 4

How COVID-19 has affected different regions in England

Death.Region.Bar <- ggplot(Death.Region, aes(x= reorder(Region, Death.Rate), y=Death.Rate, fill=Region))+geom_bar(stat = 'identity', position='dodge')+
  #the Regions along the x axis have been reordered according to their death rate.
  
  ggtitle('COVID-19 Deaths Rate according to Region')+
  
  labs(x = 'Region', y='Deaths Rate')+
  
  theme(axis.text.x = element_text(angle = 90))+
  #axis labels have been rotated 90 degrees so that they don't overlap
  
  scale_y_continuous(limits=c(0,100), breaks= seq(0,100, by=10))+
  scale_colour_brewer(palette='Accent')+
  
  theme(legend.position = 'none')
  #the legend is unnecessary so it has been removed

ggplotly(Death.Region.Bar, tooltip = c("Death.Rate"))

Visualisation 5

How COVID-19 has affected different groups according to their level of deprivation

The Office for National Statistics use the ‘English Index of Multiple Deprivation’ (IMD) to assess the level of deprivation in different regions of England. The dataset provided by the Office for National Statistics uses LSOAs (lower layer super output areas), which are geographical hierarchies. In England there are over 30000 of these areas. For simplicity, these LSOA’s were grouped by region and an average deprivation score was calculated.

The IMD considers a multitude of factors including:

  • Income
  • Employment
  • Education
  • Health
  • Crime
  • Barriers to Housing & Services
  • Living Environment

The scale has scores of 1-10, with 1 being the highest deprivation, and 10 being the lowest level of deprivation.

Deprivation.Region <- read.csv('Deprivation by Region.csv')
#This dataset has over 32000 variables, considering the deprivation of individual LSOAs would be very complicated and time consuming.

Mean.Deprivation.Region <- aggregate(IMD~Region, Deprivation.Region, mean)
#A new data set was created where the mean IMD score was calculated for each region.

mean.deprivation.plot <- ggplot(Mean.Deprivation.Region, aes(x= Region, y=IMD, fill=Region))+
  geom_bar(stat='identity', position='dodge')+
  ggtitle('Average Deprivation Score (IMD) of each Region in England')+
  labs(x = 'Region', y='Mean Deprivation Score')+
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())+
  scale_y_continuous(limits=c(0,10), breaks = seq(0,10, by=0.5))+
  scale_colour_brewer(palette='Accent')+
  scale_fill_discrete(labels=c('East Midlands', 'East of England', 'London', 'North East', 'North West', 'South East', 'South West','West Midlands', 'Yorkshire'))
  

ggplotly(mean.deprivation.plot)
Deprivation.plot <- ggplot(Deprivation, aes(x=Deprivation.Scale, y=Death.Rate, fill=Deprivation.Scale))+geom_bar(stat = 'identity', position='dodge')+
  ggtitle('COVID-19 Death Rate according to Deprivation Scale')+
  labs(x = 'Deprivation Scale', y='Death Rate, per 100,000')+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9,10))+
  scale_fill_gradient(low = "blue",high = "red")+
  theme(legend.position='none')
ggplotly(Deprivation.plot)

Summary

What I have learned:

I’ve learned that with R you can be very specific about how you want your outputs to look, and maintain consistency in similar outputs by altering the same piece of code. This is something that is not very easy and quite time consuming on software such as Microsoft Excel.

What I would do next:

The obvious thing to do next would be to wait for the peak of COVID-19 to pass and reassess the data to see what, if any, changes there are in the trends seen so far. With more time and practive I would like to be able to manipulate data frames more thoroughly to create more interesting comparisons of different types of data.

The thing I struggled to get my head around the most was what type of graph was appropriate for the data I wanted to present (basic maths I know). If I were to do more data visualisations with more time I would sit down and learn the purpose of each type of graph in detail so that the data is presented in the most efficient way.