ISSS608 Take-home Exercise 1

Creating data visualisation beyond default.

Author

Affiliation

Ranice Tan

SMU MITB

Published

Jan. 27, 2022

DOI

1. Overview

In this take-home exercise, we will explore the pros and cons of creating data visualisations using ggplot2 and Tableau. The two data visualisations included in this exercise are:

Challenges Faced

Comparing between Tableau and R

2. Proposed Sketch

Reference charts were already created using Tableau for in-class exercise 2, hence they will be used as the proposed sketch for the data visualisations for ggplot2.

Superstore Returns Pareto Chart by Tableau

Returns Pareto Chart

Singapore Population Pyramid 2021 by Tableau

Population Pyramid As Tableau produces interactive charts and dashboards, we will use plotly to convert the ggplot2 plots to produce interactive graphs for a more objective comparison.

3. Step-by-step Description

Installing Packages

A list of packages, namely tidyverse, readxl, knitr, plotly are required for this makeover exercise. This code chunk installs the required packages and loads them onto RStudio environment.

packages = c('tidyverse', 'readxl', 'knitr', 'plotly')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.1. Pareto Chart

Pareto charts show the ordered frequency counts of data. They show the ordered frequency counts of values for the different levels of a categorical or nominal variable. These charts are often used to identify areas to focus on first in process improvement, as supported by the Pareto Principle (80/20 Rule).

3.1.1 Import Dataset

The dataset to be used for the pareto chart is based on a fictitious superstore data by Tableau. It consists of three worksheets, namely: Orders, People and Returns.

Data import was accomplished using read_xls() of readxl package, which is useful for reading excel into a tibble.

orders <- read_xls("data/Superstore-2021.xls",
                   sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
                   sheet = "Returns")
head(orders)
# A tibble: 6 x 21
  `Row ID` `Order ID`     `Order Date`        `Ship Date`        
     <dbl> <chr>          <dttm>              <dttm>             
1        1 CA-2020-152156 2020-11-08 00:00:00 2020-11-11 00:00:00
2        2 CA-2020-152156 2020-11-08 00:00:00 2020-11-11 00:00:00
3        3 CA-2020-138688 2020-06-12 00:00:00 2020-06-16 00:00:00
4        4 US-2019-108966 2019-10-11 00:00:00 2019-10-18 00:00:00
5        5 US-2019-108966 2019-10-11 00:00:00 2019-10-18 00:00:00
6        6 CA-2018-115812 2018-06-09 00:00:00 2018-06-14 00:00:00
# ... with 17 more variables: Ship Mode <chr>, Customer ID <chr>,
#   Customer Name <chr>, Segment <chr>, Country/Region <chr>,
#   City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
#   Product ID <chr>, Category <chr>, Sub-Category <chr>,
#   Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
#   Profit <dbl>
head(returns)
# A tibble: 6 x 2
  Returned `Order ID`    
  <chr>    <chr>         
1 Yes      CA-2018-100762
2 Yes      CA-2018-100762
3 Yes      CA-2018-100762
4 Yes      CA-2018-100762
5 Yes      CA-2018-100867
6 Yes      CA-2018-102652

3.1.2 Data Wrangling

Joining the two data frames

To combine the separate data frames, the left_join() of dplyr is used to join the returns data frame and orders data frame by using Order ID as the unique identifier.

joined_tab <- left_join(returns, orders,
                        by = c('Order ID' = 'Order ID'))
head(joined_tab)
# A tibble: 6 x 22
  Returned `Order ID` `Row ID` `Order Date`        `Ship Date`        
  <chr>    <chr>         <dbl> <dttm>              <dttm>             
1 Yes      CA-2018-1~     6315 2018-11-24 00:00:00 2018-11-29 00:00:00
2 Yes      CA-2018-1~     6316 2018-11-24 00:00:00 2018-11-29 00:00:00
3 Yes      CA-2018-1~     6317 2018-11-24 00:00:00 2018-11-29 00:00:00
4 Yes      CA-2018-1~     6318 2018-11-24 00:00:00 2018-11-29 00:00:00
5 Yes      CA-2018-1~     6315 2018-11-24 00:00:00 2018-11-29 00:00:00
6 Yes      CA-2018-1~     6316 2018-11-24 00:00:00 2018-11-29 00:00:00
# ... with 17 more variables: Ship Mode <chr>, Customer ID <chr>,
#   Customer Name <chr>, Segment <chr>, Country/Region <chr>,
#   City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
#   Product ID <chr>, Category <chr>, Sub-Category <chr>,
#   Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
#   Profit <dbl>

Compute the frequency count by Sub-Category

To achieve the frequency count of returns by sub-category, group_by() of dplyr package is used to group the orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.

freq_returned <- joined_tab %>%
  group_by(`Sub-Category`) %>%
  summarise('Returns'=n()) %>%
  ungroup()
head(freq_returned)
# A tibble: 6 x 2
  `Sub-Category` Returns
  <chr>            <int>
1 Accessories        251
2 Appliances         177
3 Art                177
4 Binders            552
5 Bookcases           51
6 Chairs             238

Sorting data

By default, the values of the tibble data frame is sorted according to the values of the first column. In this case, the values are sorted alphabetically by Sub-Category field.

We will need to sort the sub-category field by descending order of values in the Returns field. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.

freq_sorted <- freq_returned %>%
  arrange(desc(Returns))
head(freq_sorted)
# A tibble: 6 x 2
  `Sub-Category` Returns
  <chr>            <int>
1 Binders            552
2 Paper              487
3 Phones             309
4 Furnishings        266
5 Accessories        251
6 Chairs             238

Compute the cumulative frequency

Lastly, we can compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.

The newly computed cumulative frequency values will be stored in a new field called cumfreq.

freq_cum <- freq_sorted %>%
  mutate(cumfreq = cumsum(Returns)) %>%
  mutate(cum = cumsum(Returns)/sum(Returns))
head(freq_cum)
# A tibble: 6 x 4
  `Sub-Category` Returns cumfreq   cum
  <chr>            <int>   <int> <dbl>
1 Binders            552     552 0.171
2 Paper              487    1039 0.322
3 Phones             309    1348 0.418
4 Furnishings        266    1614 0.500
5 Accessories        251    1865 0.578
6 Chairs             238    2103 0.652

3.1.3 Plot Pareto Chart

Static Pareto Chart

A pareto chart was plotted using ggplot2 as follows:

pa <- ggplot(data = freq_cum, 
       aes(x = reorder(`Sub-Category`,-`Returns`))) +
  geom_col(aes(y=`Returns`), fill = 'lightblue', width= 0.8) +
  geom_point(aes(y=`cumfreq`), color = 'grey20', size = 0.8) +
  geom_line(aes(y =`cumfreq`, group = 1), colour = 'grey20', size = 0.4) +
  labs(x = "Sub-Category", title = "Pareto Chart of returns by sub-category") +
  scale_y_continuous(
    name = 'Returns (absolute frequency)', breaks = seq(0, 3500, 300), minor_breaks = seq(0, 3500, 100),
    sec.axis = sec_axis(~.*1/sum(freq_cum$Returns), name = 'Cumulative Frequency', breaks = seq(0, 1, by = 0.1), labels = scales::percent)
  ) +
    geom_hline(yintercept=0.8*sum(freq_cum$Returns), linetype="dashed", color = "grey50") +
  geom_text(aes(17, 0.8*sum(freq_cum$Returns), label = "80.0%", vjust = -0.5), size = 2) +
  theme(panel.background = element_rect(fill = 'white', colour = 'grey60', size = 0.5, linetype = 'solid'),
        panel.grid.major = element_line(size = 0.3, linetype = 'solid', colour = 'grey85'),
        panel.grid.minor = element_line(size = 0.2, linetype = 'solid', colour = 'grey90'),
        text = element_text(size = 5.5),
        axis.ticks.x = element_blank()) 
pa

Interactive Pareto Chart

From the previous chart plotted using ggplot2, the secondary y axes was based on a one-to-one transformation of the primary y axes. The bar and line charts were plotted with reference to the primary y axes, hence the secondary y axes will not be transferred to the interactive chart using ggplotly() of plotly library.

Therefore, an interactive pareto chart was plotted using plot_ly() from plotly. A scatter plot was not included in this case as one can simply hover across the line to view the corresponding cumulative percentage for each product. Similarly, hovering on the bars will reveal the absolute frequency of returns for each product sub-category.

plot_ly(freq_cum, x = ~reorder(`Sub-Category`,-`Returns`), y = ~`Returns`, type = "bar", name = "Returns") %>%
  add_trace(x = ~reorder(`Sub-Category`,-`Returns`), y = ~`cum`*100,type = "scatter", mode = "lines", yaxis = "y2", name = "Cum. %") %>%
  layout(title = "Pareto chart of returns by sub-category",
         xaxis = list(title = "Sub-Category"),
         yaxis = list(title = "Returns (Absolute Frequency)", showgrid = F),
         yaxis2 = list(overlaying = "y", side = "right", title = "Cumulative Percentage (%)", range = list(0, 100)),
         legend = list(orientation="h", yanchor="bottom",y=0.9,xanchor="top",x=0.2)) 
BindersPaperPhonesFurnishingsAccessoriesChairsStorageAppliancesArtTablesFastenersLabelsSuppliesMachinesBookcasesEnvelopesCopiers0100200300400500020406080100
ReturnsCum. %Pareto chart of returns by sub-categorySub-CategoryReturns (Absolute Frequency)Cumulative Percentage (%)

Insights from Visualisation

From the pareto chart, we can tell that 8 sub-categories account for 80% of the returned products. The sub-categories are Binders, Paper, Phones, Furnishings, Accessories, Chairs, Storage, Appliances and Art. As these are about 50% of the product sub-categories, the distribution of returns across can be considered relatively even. The store owners can still use the chart to focus on products with high returns such as Binders and Paper.

3.2. Population Pyramid

A population pyramid depicts the distribution of a population by age groups and sex. The pyramid can be used to visualize the age of a particular population. It is also used determine the overall age distribution of a population and an indication of the reproductive capabilities and likelihood of the continuation of a species.

3.2.1 Import Dataset

For this task, the data entitled Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2021 should is used. The data set is available at Department of Statistics home page.

Data import was accomplished using read_csv() of readr package, which is useful for reading delimited files into a tibble.

pop <- read_csv("data/respopagesextod2021.csv")
head(pop)
# A tibble: 6 x 7
  PA         SZ                     AG     Sex   TOD         Pop  Time
  <chr>      <chr>                  <chr>  <chr> <chr>     <dbl> <dbl>
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1- a~     0  2021
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3-Ro~    10  2021
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4-Ro~    10  2021
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5-Ro~    30  2021
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC Fla~     0  2021
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Landed P~     0  2021

3.2.2 Data Wrangling

Compute the frequency count by Age and Sex

First, we are going to compute the frequency count of the total population in Singapore by gender and age. In the code chunk below, group_by() of dplyr package is used to group the orders by age and gender. Then, summarise() of dplyr is used to count (i.e. n()) the number of residents.

freq_pop <- pop %>%
  group_by(`AG`, `Sex`) %>%
  summarise('Count'= sum(`Pop`)) %>%
  ungroup()
head(freq_pop)
# A tibble: 6 x 3
  AG       Sex      Count
  <chr>    <chr>    <dbl>
1 0_to_4   Females  87730
2 0_to_4   Males    91400
3 10_to_14 Females  97980
4 10_to_14 Males   102330
5 15_to_19 Females 100190
6 15_to_19 Males   105100

Sort data table

By default, the values of the tibble data frame is sorted according to the values of the first column. In this case, the values are sorted alphanumerically by the age group.

Hence, we will sort the data based on the categorical age-group assigned in the original dataset. This task is first performed by changing the order using mutate() and then sorting via arrange() of dplyr package.

order <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24", "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49", "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over")

sorted_pop <- freq_pop %>%
  mutate(AG =  factor(AG, levels = order)) %>%
  arrange(AG)

head(sorted_pop)
# A tibble: 6 x 3
  AG       Sex      Count
  <fct>    <chr>    <dbl>
1 0_to_4   Females  87730
2 0_to_4   Males    91400
3 5_to_9   Females  97120
4 5_to_9   Males   102390
5 10_to_14 Females  97980
6 10_to_14 Males   102330

3.2.3 Plot Age-Sex Population Pyramid

Static Age-Sex Population Pyramid

A Age-Sex Population Pyramid was plotted using ggplot2 as follows:

p <- ggplot(sorted_pop, aes(x = ifelse(Sex == "Males", yes = -Count, no = Count), 
                          y = AG, fill = Sex)) + 
  geom_col() +
  scale_x_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))),"k")) +
  labs (x = "Population", y = "Age", title='Singapore Age-Sex Population Pyramid 2021') +
  theme_bw() +
  theme(axis.ticks.y = element_blank()) +
  scale_fill_manual(values = c("Males" = "lightblue", "Females" = "lightpink"))
p

Interactive Age-Sex Population Pyramid

The static pyramid was converted to an interactive chart by using ggplotly() of the plotly library. By hovering your mouse on the bars, you can view the population stats for each age group and gender.

ggplotly(p,session="knitr")
150k100k50k0k50k100k150k0_to_45_to_910_to_1415_to_1920_to_2425_to_2930_to_3435_to_3940_to_4445_to_4950_to_5455_to_5960_to_6465_to_6970_to_7475_to_7980_to_8485_to_8990_and_over
SexFemalesMalesSingapore Age-Sex Population Pyramid 2021PopulationAge

Insights from visualisation

From the plot, we can tell that Singapore has a constrictive population because there is a lower percentage of younger people. This indicates declining birth rates in Singapore, since each succeeding age group is getting smaller and smaller. It may also represent a higher life expectancy for the older population. As Singapore enters into a silver economy, government bodies can use such data to put fourth appropriate measures and policies to manage its ageing population in the future.