COVID-19 hit the news headlines in December 2019, the first identified case was in Wuhan, China, and has since spread globally resulting in a pandemic. The first recorded case hit UK shores on the 31st January and the UK experience its first COVID-19 related death on 28th February in Scotland. In England the first death was on 4th March. The following considers datasets available through the Office for National Statistics website concerning the deaths caused by COVID-19 in England from the 1st March 2020 to the 17th April 2020.
Between 1 March and 17 April 2020 there were 90,232 deaths occurring in England and Wales that were registered by 18th April. Almost 22.5% of these deaths involved the coronvirus (COVID-19).
The information used in these datasets were collected by the Office for National Statistics when certified deaths were registered with the local registry offices. The data provided offers information regarding the number of total COVID-19 related deaths as well as the death rate per 100,000 people according to age, region, gender and deprivation scale.
The data used was provided in a very large excel spreadsheet, with multiple sheets, and I have used a number of differing datasets for this project. As a results I do not have one ‘master’ dataset, instead I have a few. The datasets provided by the Office for National Statistics were not very R friendly, so they needed some manipulation before uploading to R as csv files. Below is an example of a couple of the datasets I have used.
## ï..Week Total.1 Total1.4 Total5.9 Total10.14 Total15.19 Total20.24
## 1 28-Feb-20 0 0 0 0 0 0
## 2 06-Mar-20 0 0 0 0 0 0
## 3 13-Mar-20 0 0 0 0 0 0
## 4 20-Mar-20 0 0 0 0 0 1
## 5 27-Mar-20 0 0 0 0 2 1
## 6 03-Apr-20 0 0 0 0 3 5
## Total25.29 Total30.34 Total35.39 Total40.44 Total45.49 Total50.54 Total55.59
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 1 0 0 1
## 4 1 0 0 2 7 16 13
## 5 3 11 9 9 26 37 71
## 6 5 12 14 27 70 92 191
## Total60.64 Total65.69 Total70.74 Total75.79 Total80.84 Total85.89 Total90.
## 1 0 0 0 0 0 0 0
## 2 0 0 0 1 1 0 1
## 3 4 5 6 2 6 8 7
## 4 20 27 35 46 73 84 67
## 5 90 126 210 285 340 323 263
## 6 269 354 572 784 981 832 778
## F.1 F1.4 F5.9 F10.14 F15.19 F20.24 F25.29 F30.34 F35.39 F40.44 F45.49 F50.54
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 1 0 0
## 4 0 0 0 0 0 1 1 0 0 1 5 8
## 5 0 0 0 0 1 1 2 3 4 2 13 13
## 6 0 0 0 0 2 2 3 4 4 11 29 31
## F55.59 F60.64 F65.69 F70.74 F75.79 F80.84 F85.89 F90. M.1 M1.4 M5.9 M10.14
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 1 0 0 1 0 0 0 0
## 3 0 3 1 2 1 0 1 2 0 0 0 0
## 4 5 9 11 13 16 33 39 35 0 0 0 0
## 5 25 28 44 74 95 118 142 124 0 0 0 0
## 6 73 84 112 183 274 359 331 396 0 0 0 0
## M15.19 M20.24 M25.29 M30.34 M35.39 M40.44 M45.49 M50.54 M55.59 M60.64 M65.69
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 1 1 4
## 4 0 0 0 0 0 1 2 8 8 11 16
## 5 1 0 1 8 5 7 13 24 46 62 82
## 6 1 3 2 8 10 16 41 61 118 185 242
## M70.74 M75.79 M80.84 M85.89 M90.
## 1 0 0 0 0 0
## 2 0 0 1 0 0
## 3 4 1 6 7 5
## 4 22 30 40 45 32
## 5 136 190 222 181 139
## 6 389 510 622 501 382
This data was also acquired from the Office for National Statistics Website.
## ï..OA11CD OAC11CD OAC11NM LSOA11CD
## 1 E00000001 2d3 EU White-Collar Workers E01000001
## 2 E00000016 2d3 EU White-Collar Workers E01000002
## 3 E00000010 2d3 EU White-Collar Workers E01000003
## 4 E00000030 3b2 Bangladeshi Mixed Employment E01000005
## 5 E00000049 4b1 Asian Terraces and Flats E01000006
## 6 E00175106 3b3 Multi-Ethnic Professional Service Workers E01000007
## LSOA11NM IMD SOAC11CD SOAC11NM MSOA11CD
## 1 City of London 001A 9 6a Inner city cosmopolitan E02000001
## 2 City of London 001B 10 6a Inner city cosmopolitan E02000001
## 3 City of London 001C 5 6a Inner city cosmopolitan E02000001
## 4 City of London 001E 3 6a Inner city cosmopolitan E02000001
## 5 Barking and Dagenham 016A 5 7b Young ethnic communities E02000017
## 6 Barking and Dagenham 015A 3 6a Inner city cosmopolitan E02000016
## MSOA11NM LAD17CD LAD17NM LACCD
## 1 City of London 001 E09000001 City of London 5a1r
## 2 City of London 001 E09000001 City of London 5a1r
## 3 City of London 001 E09000001 City of London 5a1r
## 4 City of London 001 E09000001 City of London 5a1r
## 5 Barking and Dagenham 016 E09000002 Barking and Dagenham 4a1r
## 6 Barking and Dagenham 015 E09000002 Barking and Dagenham 4a1r
## LACNM RGN11CD RGN11NM CTRY11CD CTRY11NM
## 1 London Cosmopolitan E12000007 London E92000001 England
## 2 London Cosmopolitan E12000007 London E92000001 England
## 3 London Cosmopolitan E12000007 London E92000001 England
## 4 London Cosmopolitan E12000007 London E92000001 England
## 5 Ethnically Diverse Metropolitan Living E12000007 London E92000001 England
## 6 Ethnically Diverse Metropolitan Living E12000007 London E92000001 England
## FID
## 1 146308
## 2 146348
## 3 146346
## 4 146302
## 5 146306
## 6 146497
COVID-19 is a factor in everyone’s lives at the moment, most people are aware that the elderly and sick are the most vulnerable, but even in the first six weeks of the virus hitting the UK people from every age and background were affected.
The following graphics will attempt to show that even though the death rate among the most vulnerable is high, anyone could be afflicted by it.
#this will produce a basic barchart
ggplot(TotalDeaths, aes(x=Type, y=Deaths, fill=Type))+geom_bar(stat = 'identity', position='dodge')+
ggtitle('Total COVID-19 Deaths compared to Total Deaths in England')
Cause | Total Deaths | Male | Female |
---|---|---|---|
COVID-19 | 19,135 | 11,472 | 7,843 |
Overall deaths | 84,908 | 43,306 | 41,602 |
Percentage of Deaths caused by COVID-19 | 22.5361568% | 26.4905556% | 18.852459% |
By April 17th 22.5% of deaths in England was caused by COVID-19, this was higher for Males (26.5%) compared to females (18.9%).
Its seems that by the 10th April 2020 males were more likely to be killed than females by COVID-19 related illnesses.
Female.Deaths <- ggplot(Animate.data, aes(x=Age.Group, y=FemaleDeaths, fill=Age.Group))+
geom_bar(stat='identity')+
labs(title='Cumulative Death Count in Females according to Age Group',
subtitle='{closest_state}',
#The {} function allows the 'Week' Label to change as the y axis increases
x='Age Group')+
geom_text(aes(label=format(round(FemaleDeaths)), y=FemaleDeaths),
position=position_dodge(0.9), vjust=-1)+
#the round function is required as the labels on each bar would present with up to 7 decimal places, which is very hard to read.
scale_fill_gradient(low='red1', high='gold1')+
#the gradient function slowly changes the first colour to the last colour
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10), labels = c("0-10", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+"))+
scale_y_continuous(limits = c(0,3000), breaks = seq(0,3000,by=500))+
theme_classic()+
#theme_classic ensures a blank background on the figure
theme(legend.position='none')+
transition_states(states=Week, transition_length=3, state_length = 3)+
ease_aes('cubic-in-out')
#These functions are essential for gganimate to animate the bars.
Female.Deaths.gif <- animate(Female.Deaths, width=540, height=280)
Female.Deaths.gif
#The above code is the same for the Male animated bar chart, with amendments to include the male data
By the 10th April 2020 no children under the age of 10 years had been killed by COVID-19, and so a figure is not included to represent this age group, as it would just be blank.
The original data set split age groups into every 5 years, (e.g. 5-9, 10,14), totalling 20 age groups. To represent these small age ranges would produce too many figures, and be too many variables on a graph to be meaningful. It was decided instead to group these age groups together to cover 10 years (e.g. 10-19, 20-29, 30,39) resulting in a total of 10 age groups.
Age.Percent = mutate(Age.Percent,
Deaths.Percent = (Deaths/sum(Deaths)*100))
#waffle plots are 10x10 grids, in which each cell represents 1%, summing up to total 100%
parts <- c(Age.Percent$Deaths.Percent)
Age.Waffle <- waffle(parts, rows=10)+
labs(fill='Age Group',
title='Percentage of Deaths per Age Group',
caption='The lower age groups (0-9, 10-19, 20-29, 30-39) are combined as they have small influence
on the total in comparison to the other age groups')+
scale_fill_discrete(labels=c(Age.Percent$AgeGp))
Age.Waffle
Age.Percent = mutate(Age.Percent,
Deaths.Percent = (Deaths/sum(Deaths)*100))
#Calculates percentages for each value in the Number of Deaths column
parts <- c(Age.Percent$Deaths.Percent)
#creates a vector with the percentages for the waffle chart
Age.Waffle <- waffle(parts, rows=10)+
labs(fill='Age Group',
title='Percentage of Deaths per Age Group',
caption='The lower age groups (0-9, 10-19, 20-29, 30-39) are combined as they have small influence on the total in comparison to the other age groups')+
scale_fill_discrete(labels=c(Age.Percent$AgeGp))
#here the figures under each '5-year' age group are added together to produce figures for the new age groups.
WD$Total0.9 <- WD$Total.1 + WD$Total1.4 + WD$Total5.9
WD$Total10.19 <- WD$Total10.14 + WD$Total15.19
WD$Total20.29 <- WD$Total20.24 + WD$Total25.29
WD$Total30.39 <- WD$Total30.34 + WD$Total35.39
WD$Total40.49 <- WD$Total40.44 + WD$Total45.49
WD$Total50.59 <- WD$Total50.54 + WD$Total55.59
WD$Total60.69 <- WD$Total60.64 + WD$Total65.69
WD$Total70.79 <- WD$Total70.74 + WD$Total75.79
WD$Total80.89 <- WD$Total80.84 + WD$Total85.89
#A new data frame is created to only include the new '10-year' age groups.
Weekly.Age <- data.frame("Week" = WD$ï..Week,
"Age0" = WD$Total0.9,
"Age1" = WD$Total10.19,
"Age2" = WD$Total20.29,
"Age3" = WD$Total30.39,
"Age4" = WD$Total40.49,
"Age5" = WD$Total50.59,
"Age6" = WD$Total60.69,
"Age7" = WD$Total70.79,
"Age8" = WD$Total80.89,
"Age9" = WD$Total90.)
Age01 = ggplot(Weekly.Age, aes(x=Week, y=Age1))+
#Each plot will represent one age group, with the first being the 10-19 age group. Ages <10 were not included in these figures due to there not being any COVID-19 related deaths in the time period considered.
geom_area(fill="turquoise3", alpha=0.4) +
geom_line(color="turquoise4", size=2) +
geom_point(size=3, color="turquoise4") +
#The three functions, area, line and point allow a filled line graph to be produced.
ggtitle("COVID-19 Deaths: Age 10-19")+
labs(x='Week Ending', y='Number of Deaths', subtitle = 'Note that the y axis only goes up to 100 deaths, compared to 3000 for the older age categories')+
scale_y_continuous(limits = c(0,100), breaks = seq(0,100,by=10))+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7), labels = c("28-Feb-20", "06-Mar-20", "13-Mar-20", "20-Mar-20", "27-Mar-20", "03-Apr-20", "10-Apr-20"))
#Labelling the x axis with each week end date makes for easier understanding and viewing of the figure.
#All age plots use the same code as Age01, with the y axis altered to fit the correct age group.
#all figures are arranged using 'grid.arrange' from library 'grid' & 'gridExtra' for easy viewing and comparison.
grid.arrange(Age01, Age02, Age03, ncol=3)
#Note that the y-axis limit is 100 deaths for the first 4 figures and increases to 3000 deaths for the final 5 figures
grid.arrange(Age04, Age05, Age06, ncol=3)
grid.arrange(Age07, Age08, Age09, ncol=3)
The following figure presents the total deaths for each age group caused by COVID-19. Hover over each data point to see its y value.
Total.Age.Plot <- ggplot(Total.Age, aes(x=Age, y=Deaths, group=1))+
geom_area(color='darksalmon', alpha=0.4)+
geom_line(colour='darkred', size=2)+
geom_point(colour='darkred', size=4)+
scale_y_continuous(limits=c(0,6000), breaks=seq(0,6000,by=500))+
ggtitle('Total COVID-19 Deaths in each Age Group by 10th April 2020')+
labs(x='Age Group', y='No of Deaths')+
theme_minimal()
ggplotly(Total.Age.Plot)
#plotly is used to make an interactive figure. If the mouse is hovered over a datapoint on the figure, the data will show.
Death.Region.Bar <- ggplot(Death.Region, aes(x= reorder(Region, Death.Rate), y=Death.Rate, fill=Region))+geom_bar(stat = 'identity', position='dodge')+
#the Regions along the x axis have been reordered according to their death rate.
ggtitle('COVID-19 Deaths Rate according to Region')+
labs(x = 'Region', y='Deaths Rate')+
theme(axis.text.x = element_text(angle = 90))+
#axis labels have been rotated 90 degrees so that they don't overlap
scale_y_continuous(limits=c(0,100), breaks= seq(0,100, by=10))+
scale_colour_brewer(palette='Accent')+
theme(legend.position = 'none')
#the legend is unnecessary so it has been removed
ggplotly(Death.Region.Bar, tooltip = c("Death.Rate"))
The Office for National Statistics use the ‘English Index of Multiple Deprivation’ (IMD) to assess the level of deprivation in different regions of England. The dataset provided by the Office for National Statistics uses LSOAs (lower layer super output areas), which are geographical hierarchies. In England there are over 30000 of these areas. For simplicity, these LSOA’s were grouped by region and an average deprivation score was calculated.
The IMD considers a multitude of factors including:
The scale has scores of 1-10, with 1 being the highest deprivation, and 10 being the lowest level of deprivation.
Deprivation.Region <- read.csv('Deprivation by Region.csv')
#This dataset has over 32000 variables, considering the deprivation of individual LSOAs would be very complicated and time consuming.
Mean.Deprivation.Region <- aggregate(IMD~Region, Deprivation.Region, mean)
#A new data set was created where the mean IMD score was calculated for each region.
mean.deprivation.plot <- ggplot(Mean.Deprivation.Region, aes(x= Region, y=IMD, fill=Region))+
geom_bar(stat='identity', position='dodge')+
ggtitle('Average Deprivation Score (IMD) of each Region in England')+
labs(x = 'Region', y='Mean Deprivation Score')+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())+
scale_y_continuous(limits=c(0,10), breaks = seq(0,10, by=0.5))+
scale_colour_brewer(palette='Accent')+
scale_fill_discrete(labels=c('East Midlands', 'East of England', 'London', 'North East', 'North West', 'South East', 'South West','West Midlands', 'Yorkshire'))
ggplotly(mean.deprivation.plot)
Deprivation.plot <- ggplot(Deprivation, aes(x=Deprivation.Scale, y=Death.Rate, fill=Deprivation.Scale))+geom_bar(stat = 'identity', position='dodge')+
ggtitle('COVID-19 Death Rate according to Deprivation Scale')+
labs(x = 'Deprivation Scale', y='Death Rate, per 100,000')+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9,10))+
scale_fill_gradient(low = "blue",high = "red")+
theme(legend.position='none')
ggplotly(Deprivation.plot)
I’ve learned that with R you can be very specific about how you want your outputs to look, and maintain consistency in similar outputs by altering the same piece of code. This is something that is not very easy and quite time consuming on software such as Microsoft Excel.
The obvious thing to do next would be to wait for the peak of COVID-19 to pass and reassess the data to see what, if any, changes there are in the trends seen so far. With more time and practive I would like to be able to manipulate data frames more thoroughly to create more interesting comparisons of different types of data.
The thing I struggled to get my head around the most was what type of graph was appropriate for the data I wanted to present (basic maths I know). If I were to do more data visualisations with more time I would sit down and learn the purpose of each type of graph in detail so that the data is presented in the most efficient way.