Creating data visualisation beyond default.
In this take-home exercise, we will explore the pros and cons of creating data visualisations using ggplot2 and Tableau. The two data visualisations included in this exercise are:
Reference charts were already created using Tableau for in-class exercise 2, hence they will be used as the proposed sketch for the data visualisations for ggplot2.
As Tableau produces interactive charts and dashboards, we will use plotly to convert the ggplot2 plots to produce interactive graphs for a more objective comparison.
A list of packages, namely tidyverse, readxl, knitr, plotly are required for this makeover exercise. This code chunk installs the required packages and loads them onto RStudio environment.
packages = c('tidyverse', 'readxl', 'knitr', 'plotly')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Pareto charts show the ordered frequency counts of data. They show the ordered frequency counts of values for the different levels of a categorical or nominal variable. These charts are often used to identify areas to focus on first in process improvement, as supported by the Pareto Principle (80/20 Rule).
The dataset to be used for the pareto chart is based on a fictitious superstore data by Tableau. It consists of three worksheets, namely: Orders, People and Returns.
Data import was accomplished using read_xls() of readxl package, which is useful for reading excel into a tibble.
orders <- read_xls("data/Superstore-2021.xls",
sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
sheet = "Returns")
head(orders)
# A tibble: 6 x 21
`Row ID` `Order ID` `Order Date` `Ship Date`
<dbl> <chr> <dttm> <dttm>
1 1 CA-2020-152156 2020-11-08 00:00:00 2020-11-11 00:00:00
2 2 CA-2020-152156 2020-11-08 00:00:00 2020-11-11 00:00:00
3 3 CA-2020-138688 2020-06-12 00:00:00 2020-06-16 00:00:00
4 4 US-2019-108966 2019-10-11 00:00:00 2019-10-18 00:00:00
5 5 US-2019-108966 2019-10-11 00:00:00 2019-10-18 00:00:00
6 6 CA-2018-115812 2018-06-09 00:00:00 2018-06-14 00:00:00
# ... with 17 more variables: Ship Mode <chr>, Customer ID <chr>,
# Customer Name <chr>, Segment <chr>, Country/Region <chr>,
# City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
# Product ID <chr>, Category <chr>, Sub-Category <chr>,
# Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
# Profit <dbl>
head(returns)
# A tibble: 6 x 2
Returned `Order ID`
<chr> <chr>
1 Yes CA-2018-100762
2 Yes CA-2018-100762
3 Yes CA-2018-100762
4 Yes CA-2018-100762
5 Yes CA-2018-100867
6 Yes CA-2018-102652
To combine the separate data frames, the left_join() of dplyr is used to join the returns data frame and orders data frame by using Order ID as the unique identifier.
# A tibble: 6 x 22
Returned `Order ID` `Row ID` `Order Date` `Ship Date`
<chr> <chr> <dbl> <dttm> <dttm>
1 Yes CA-2018-1~ 6315 2018-11-24 00:00:00 2018-11-29 00:00:00
2 Yes CA-2018-1~ 6316 2018-11-24 00:00:00 2018-11-29 00:00:00
3 Yes CA-2018-1~ 6317 2018-11-24 00:00:00 2018-11-29 00:00:00
4 Yes CA-2018-1~ 6318 2018-11-24 00:00:00 2018-11-29 00:00:00
5 Yes CA-2018-1~ 6315 2018-11-24 00:00:00 2018-11-29 00:00:00
6 Yes CA-2018-1~ 6316 2018-11-24 00:00:00 2018-11-29 00:00:00
# ... with 17 more variables: Ship Mode <chr>, Customer ID <chr>,
# Customer Name <chr>, Segment <chr>, Country/Region <chr>,
# City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
# Product ID <chr>, Category <chr>, Sub-Category <chr>,
# Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
# Profit <dbl>
To achieve the frequency count of returns by sub-category, group_by() of dplyr package is used to group the orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.
freq_returned <- joined_tab %>%
group_by(`Sub-Category`) %>%
summarise('Returns'=n()) %>%
ungroup()
head(freq_returned)
# A tibble: 6 x 2
`Sub-Category` Returns
<chr> <int>
1 Accessories 251
2 Appliances 177
3 Art 177
4 Binders 552
5 Bookcases 51
6 Chairs 238
By default, the values of the tibble data frame is sorted according to the values of the first column. In this case, the values are sorted alphabetically by Sub-Category field.
We will need to sort the sub-category field by descending order of values in the Returns field. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.
freq_sorted <- freq_returned %>%
arrange(desc(Returns))
head(freq_sorted)
# A tibble: 6 x 2
`Sub-Category` Returns
<chr> <int>
1 Binders 552
2 Paper 487
3 Phones 309
4 Furnishings 266
5 Accessories 251
6 Chairs 238
Lastly, we can compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.
The newly computed cumulative frequency values will be stored in a new field called cumfreq.
freq_cum <- freq_sorted %>%
mutate(cumfreq = cumsum(Returns)) %>%
mutate(cum = cumsum(Returns)/sum(Returns))
head(freq_cum)
# A tibble: 6 x 4
`Sub-Category` Returns cumfreq cum
<chr> <int> <int> <dbl>
1 Binders 552 552 0.171
2 Paper 487 1039 0.322
3 Phones 309 1348 0.418
4 Furnishings 266 1614 0.500
5 Accessories 251 1865 0.578
6 Chairs 238 2103 0.652
A pareto chart was plotted using ggplot2 as follows:
pa <- ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-`Returns`))) +
geom_col(aes(y=`Returns`), fill = 'lightblue', width= 0.8) +
geom_point(aes(y=`cumfreq`), color = 'grey20', size = 0.8) +
geom_line(aes(y =`cumfreq`, group = 1), colour = 'grey20', size = 0.4) +
labs(x = "Sub-Category", title = "Pareto Chart of returns by sub-category") +
scale_y_continuous(
name = 'Returns (absolute frequency)', breaks = seq(0, 3500, 300), minor_breaks = seq(0, 3500, 100),
sec.axis = sec_axis(~.*1/sum(freq_cum$Returns), name = 'Cumulative Frequency', breaks = seq(0, 1, by = 0.1), labels = scales::percent)
) +
geom_hline(yintercept=0.8*sum(freq_cum$Returns), linetype="dashed", color = "grey50") +
geom_text(aes(17, 0.8*sum(freq_cum$Returns), label = "80.0%", vjust = -0.5), size = 2) +
theme(panel.background = element_rect(fill = 'white', colour = 'grey60', size = 0.5, linetype = 'solid'),
panel.grid.major = element_line(size = 0.3, linetype = 'solid', colour = 'grey85'),
panel.grid.minor = element_line(size = 0.2, linetype = 'solid', colour = 'grey90'),
text = element_text(size = 5.5),
axis.ticks.x = element_blank())
pa
From the previous chart plotted using ggplot2, the secondary y axes was based on a one-to-one transformation of the primary y axes. The bar and line charts were plotted with reference to the primary y axes, hence the secondary y axes will not be transferred to the interactive chart using ggplotly() of plotly library.
Therefore, an interactive pareto chart was plotted using plot_ly() from plotly. A scatter plot was not included in this case as one can simply hover across the line to view the corresponding cumulative percentage for each product. Similarly, hovering on the bars will reveal the absolute frequency of returns for each product sub-category.
plot_ly(freq_cum, x = ~reorder(`Sub-Category`,-`Returns`), y = ~`Returns`, type = "bar", name = "Returns") %>%
add_trace(x = ~reorder(`Sub-Category`,-`Returns`), y = ~`cum`*100,type = "scatter", mode = "lines", yaxis = "y2", name = "Cum. %") %>%
layout(title = "Pareto chart of returns by sub-category",
xaxis = list(title = "Sub-Category"),
yaxis = list(title = "Returns (Absolute Frequency)", showgrid = F),
yaxis2 = list(overlaying = "y", side = "right", title = "Cumulative Percentage (%)", range = list(0, 100)),
legend = list(orientation="h", yanchor="bottom",y=0.9,xanchor="top",x=0.2))
From the pareto chart, we can tell that 8 sub-categories account for 80% of the returned products. The sub-categories are Binders, Paper, Phones, Furnishings, Accessories, Chairs, Storage, Appliances and Art. As these are about 50% of the product sub-categories, the distribution of returns across can be considered relatively even. The store owners can still use the chart to focus on products with high returns such as Binders and Paper.
A population pyramid depicts the distribution of a population by age groups and sex. The pyramid can be used to visualize the age of a particular population. It is also used determine the overall age distribution of a population and an indication of the reproductive capabilities and likelihood of the continuation of a species.
For this task, the data entitled Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2021 should is used. The data set is available at Department of Statistics home page.
Data import was accomplished using read_csv() of readr package, which is useful for reading delimited files into a tibble.
pop <- read_csv("data/respopagesextod2021.csv")
head(pop)
# A tibble: 6 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1- a~ 0 2021
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3-Ro~ 10 2021
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4-Ro~ 10 2021
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5-Ro~ 30 2021
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC Fla~ 0 2021
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Landed P~ 0 2021
First, we are going to compute the frequency count of the total population in Singapore by gender and age. In the code chunk below, group_by() of dplyr package is used to group the orders by age and gender. Then, summarise() of dplyr is used to count (i.e. n()) the number of residents.
freq_pop <- pop %>%
group_by(`AG`, `Sex`) %>%
summarise('Count'= sum(`Pop`)) %>%
ungroup()
head(freq_pop)
# A tibble: 6 x 3
AG Sex Count
<chr> <chr> <dbl>
1 0_to_4 Females 87730
2 0_to_4 Males 91400
3 10_to_14 Females 97980
4 10_to_14 Males 102330
5 15_to_19 Females 100190
6 15_to_19 Males 105100
By default, the values of the tibble data frame is sorted according to the values of the first column. In this case, the values are sorted alphanumerically by the age group.
Hence, we will sort the data based on the categorical age-group assigned in the original dataset. This task is first performed by changing the order using mutate() and then sorting via arrange() of dplyr package.
order <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24", "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49", "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over")
sorted_pop <- freq_pop %>%
mutate(AG = factor(AG, levels = order)) %>%
arrange(AG)
head(sorted_pop)
# A tibble: 6 x 3
AG Sex Count
<fct> <chr> <dbl>
1 0_to_4 Females 87730
2 0_to_4 Males 91400
3 5_to_9 Females 97120
4 5_to_9 Males 102390
5 10_to_14 Females 97980
6 10_to_14 Males 102330
A Age-Sex Population Pyramid was plotted using ggplot2 as follows:
p <- ggplot(sorted_pop, aes(x = ifelse(Sex == "Males", yes = -Count, no = Count),
y = AG, fill = Sex)) +
geom_col() +
scale_x_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))),"k")) +
labs (x = "Population", y = "Age", title='Singapore Age-Sex Population Pyramid 2021') +
theme_bw() +
theme(axis.ticks.y = element_blank()) +
scale_fill_manual(values = c("Males" = "lightblue", "Females" = "lightpink"))
p
The static pyramid was converted to an interactive chart by using ggplotly() of the plotly library. By hovering your mouse on the bars, you can view the population stats for each age group and gender.
ggplotly(p,session="knitr")
From the plot, we can tell that Singapore has a constrictive population because there is a lower percentage of younger people. This indicates declining birth rates in Singapore, since each succeeding age group is getting smaller and smaller. It may also represent a higher life expectancy for the older population. As Singapore enters into a silver economy, government bodies can use such data to put fourth appropriate measures and policies to manage its ageing population in the future.