Take-Home Exercise 3

Creating interactive visualisations and animations.

Author: Ranice Tan

Affiliation: SMU MITB

Published: Feb. 19, 2022

1 Overview

In this take-home exercise, we explore data visualisations for multidimensional data, using Starbucks drinks and their nutritional values.

Source: Starbucks

1.1 Challenges Faced

2 Installing Packages

The following packages and libraries were installed for this exercise:

Show code
# Packages needed for this exercise
packages = c('tidyverse','knitr', 'corrplot', 'ggstatsplot', 'rmarkdown', 'heatmaply', 'dendextend','parallelPlot')
# Install any package that is missing, then attach it
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3 Dataset

For this task, the Starbucks Drinks dataset is used.

3.1 Data Preparation

Importing the data set

The dataset was imported using the read_csv() function.

Show code
drinks <- read_csv("data/starbucks_drink.csv")

kable(tail(drinks[,c(1:5, 15:18)]))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|
| tea | Iced Teavana® London Fog Tea Latte | 24 | 160 | 25 | 40–60 | Venti Iced | Almond | NA |
| tea | Iced Teavana® London Fog Tea Latte | 24 | 180 | 35 | 40–60 | Venti Iced | Coconut | NA |
| tea | Iced Teavana® London Fog Tea Latte | 24 | 180 | 0 | 40–60 | Venti Iced | Nonfat milk | NA |
| tea | Iced Teavana® London Fog Tea Latte | 24 | 230 | 50 | 40–60 | Venti Iced | Whole Milk | NA |
| tea | Iced Teavana® London Fog Tea Latte | 24 | 210 | 35 | 40 | Venti Iced | 2% Milk | NA |
| tea | Iced Teavana® London Fog Tea Latte | 24 | 210 | 25 | 40–60 | Venti Iced | Soy (United States) | NA |

Cleaning Caffeine(mg) Field

The field ‘Caffeine(mg)’ is read as a string datatype because some cells contain a range of values rather than a single number. Rows containing ‘40+’ were first converted to ‘40’. Next, rows containing a range, e.g. ‘40–60’, were identified and converted to the maximum value of the range, e.g. ‘60’, using a for loop and an ifelse() condition. Lastly, the column was converted to a numeric datatype.

Show code
# Open-ended values such as '40+' are treated as 40
drinks["Caffeine(mg)"][drinks["Caffeine(mg)"] == '40+'] <- '40'

# For ranges such as '40–60', keep the upper bound (the text after the en dash).
# sub() is used so that upper bounds of any number of digits are kept whole.
for (i in seq_len(nrow(drinks))) {
  value <- drinks[["Caffeine(mg)"]][i]
  drinks[i, "Caffeine(mg)"] <- ifelse(grepl("–", value),
                                      sub("^.*–", "", value),
                                      value)
}

# Convert the cleaned strings to numbers
drinks["Caffeine(mg)"] <- as.numeric(unlist(drinks["Caffeine(mg)"]))

kable(tail(drinks[,c(1:2, 15)]))
| Category | Name | Caffeine(mg) |
|---|---|---|
| tea | Iced Teavana® London Fog Tea Latte | 60 |
| tea | Iced Teavana® London Fog Tea Latte | 60 |
| tea | Iced Teavana® London Fog Tea Latte | 60 |
| tea | Iced Teavana® London Fog Tea Latte | 60 |
| tea | Iced Teavana® London Fog Tea Latte | 40 |
| tea | Iced Teavana® London Fog Tea Latte | 60 |
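
As a complement to the loop above, the same cleaning rule can be sketched in vectorised base R. The helper name clean_caffeine() and the toy values are hypothetical and only mirror the formats described in the text; the en dash is written as the escape \u2013 so the pattern does not depend on the file encoding.

```r
# Toy values mirroring the formats seen in the Caffeine(mg) column
# (illustrative only, not taken from the dataset).
caffeine_raw <- c("40+", "40\u201360", "85", "150\u2013200", NA)

# Hypothetical helper: strip a trailing "+" and keep the upper bound of a range.
clean_caffeine <- function(x) {
  x <- sub("\\+$", "", x)          # "40+"   -> "40"
  x <- sub("^.*\u2013", "", x)     # "40–60" -> "60" (text after the en dash)
  as.numeric(x)                    # NA stays NA
}

caffeine_clean <- clean_caffeine(caffeine_raw)
print(caffeine_clean)
```

Because sub() works on the whole vector at once, no explicit loop is needed, and upper bounds with three digits are kept intact.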

3.2 Data Wrangling

Identifying Top 4 Largest Categories of Drinks

To identify the largest drink categories, group_by() was used to group the drinks by category and summarise() with n() was used to count the number of drinks in each category. arrange(desc()) then sorted the counts in descending order and top_n() selected the 4 largest categories. Finally, filter() was used to keep only the drinks that belong to these categories: espresso, frappuccino blended beverages, kids and others, and tea.

Show code
top_cat <- drinks %>%
  group_by(`Category`) %>%
  summarise(Total=n()) %>%
  arrange(desc(Total)) %>%
  top_n(4) %>%
  ungroup

top_cat_list <- as.vector(top_cat$Category)

topcat <- drinks %>%
  filter(Category %in% top_cat_list)

kable(head(topcat[,c(1:5, 15:18)]))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 50 | 30 | 85 | Tall | Almond | NA |
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 70 | 35 | 85 | Tall | Coconut | NA |
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 70 | 0 | 85 | Tall | Nonfat milk | NA |
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 120 | 50 | 85 | Tall | Whole Milk | NA |
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 100 | 35 | 85 | Tall | 2% Milk | NA |
| espresso | Iced Starbucks® Blonde Caffè Latte | 12 | 100 | 25 | 85 | Tall | Soy (United States) | NA |
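
The count-sort-select logic described above can be checked on a small vector. This is a minimal base R sketch with made-up category labels; it uses table() and sort() instead of the dplyr verbs, purely to show what the pipeline computes.

```r
# Hypothetical category labels; the real ones come from drinks$Category.
category <- c("tea", "espresso", "tea", "smoothie", "espresso", "tea",
              "frappuccino", "espresso", "tea", "frappuccino")

counts <- sort(table(category), decreasing = TRUE)  # rows per category, largest first
top2 <- names(counts)[1:2]                          # labels of the two largest groups

print(counts)
print(top2)

# Keep only the values that belong to the largest groups, as filter() does
kept <- category[category %in% top2]
print(length(kept))
```

One difference worth keeping in mind: top_n() keeps ties, so it can return more rows than requested, whereas the positional indexing in this sketch always returns exactly two labels.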

Identifying Top 3 Largest Drink Names

To identify the drink names with the most entries, group_by() was used to group the drinks by name and summarise() with n() was used to count the number of entries for each name. arrange(desc()) then sorted the counts in descending order and top_n() selected the 3 names with the most entries. Finally, filter() was used to keep only the rows for these drinks: iced coffee, hot chocolate and pumpkin spice crème.

Show code
top_drinks <- drinks %>%
  group_by(`Name`) %>%
  summarise(Total=n()) %>%
  arrange(desc(Total)) %>%
  top_n(3) %>%
  ungroup

top_drinks_list <- as.vector(top_drinks$Name)

topdrinks <- drinks %>%
  filter(Name %in% top_drinks_list)

kable(head(topdrinks[,c(1:5, 15:18)]))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|
| iced-coffee | Iced Coffee with Milk | 30 | 35 | 20 | 240 | Trenta Iced | Almond | Unsweetened |
| iced-coffee | Iced Coffee with Milk | 30 | 210 | 25 | 190 | Trenta Iced | Coconut | Sweetened |
| iced-coffee | Iced Coffee with Milk | 30 | 50 | 25 | 240 | Trenta Iced | Coconut | Unsweetened |
| iced-coffee | Iced Coffee with Milk | 30 | 210 | 0 | 190 | Trenta Iced | Nonfat milk | Sweetened |
| iced-coffee | Iced Coffee with Milk | 30 | 50 | 0 | 240 | Trenta Iced | Nonfat milk | Unsweetened |
| iced-coffee | Iced Coffee with Milk | 30 | 240 | 40 | 190 | Trenta Iced | Whole Milk | Sweetened |

Normalising Against Volume of Drink

The nutritional values of the drinks depend on the choice of milk and whipped cream, and are reported for different volumes and serving sizes. To make the drinks comparable, we normalise the nutritional values by the volume of the drink for each milk and whipped cream type.

The group_by() function was used to group the drinks by name, milk and whipped cream, and summarise() was used to calculate the nutritional value of the drink per unit volume. Next, the name, milk and whipped cream columns were joined using paste() to create a single full name for the drink. The full name was then assigned as the row name of the dataset using row.names().

Show code
topdrinks_norm <- topdrinks %>%
  group_by(`Name`, `Milk`, `Whipped Cream`) %>%
  summarise('Calories per oz'= mean(`Calories`/`Portion(fl oz)`),
            'Calories from fat per oz'= mean(`Calories from fat`/`Portion(fl oz)`),
            'Total Fat(g/oz)'= mean(`Total Fat(g)`/`Portion(fl oz)`),
            'Saturated fat(g/oz)'= mean(`Saturated fat(g)`/`Portion(fl oz)`),
            'Trans fat(g/oz)'= mean(`Trans fat(g)`/`Portion(fl oz)`),
            'Cholesterol(mg/oz)'= mean(`Cholesterol(mg)`/`Portion(fl oz)`),
            'Sodium(mg/oz)'= mean(`Sodium(mg)`/`Portion(fl oz)`),
            'Total Carbohydrate(g/oz)'= mean(`Total Carbohydrate(g)`/`Portion(fl oz)`),
            'Dietary Fiber(g/oz)'= mean(`Dietary Fiber(g)`/`Portion(fl oz)`),
            'Sugars(g/oz)'= mean(`Sugars(g)`/`Portion(fl oz)`),
            'Protein(g/oz)'= mean(`Protein(g)`/`Portion(fl oz)`),
            'Caffeine(mg/oz)'= mean(`Caffeine(mg)`/`Portion(fl oz)`)) %>%
  ungroup()

topdrinks_norm$Full_Name <- paste(topdrinks_norm$Name, topdrinks_norm$Milk, topdrinks_norm$`Whipped Cream`)

topdrinks_n <- topdrinks_norm %>%
  select(c(4:15))

row.names(topdrinks_n) <- topdrinks_norm$Full_Name

kable(head(topdrinks_n[,c(1:7)]))
| Calories per oz | Calories from fat per oz | Total Fat(g/oz) | Saturated fat(g/oz) | Trans fat(g/oz) | Cholesterol(mg/oz) | Sodium(mg/oz) |
|---|---|---|---|---|---|---|
| 20.86667 | 5.000000 | 0.5641667 | 0.3700000 | 0.000 | 1.745833 | 10.083333 |
| 26.83333 | 10.075000 | 1.1066667 | 0.6783333 | 0.005 | 3.537500 | 10.641667 |
| 15.86667 | 4.358333 | 0.5000000 | 0.1250000 | 0.000 | 0.000000 | 9.441667 |
| 21.56667 | 9.183333 | 1.0075000 | 0.4283333 | 0.000 | 1.979167 | 10.083333 |
| 17.47500 | 5.641667 | 0.6383333 | 0.5641667 | 0.000 | 0.000000 | 8.883333 |
| 23.34167 | 10.175000 | 1.1458333 | 0.8791667 | 0.000 | 1.979167 | 9.441667 |
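
To see why dividing by volume makes serving sizes comparable, here is a small worked sketch in base R. The two rows are invented for illustration (one drink in two sizes), and aggregate() stands in for the group_by() and summarise() steps.

```r
# Hypothetical rows: the same drink in two sizes (values are illustrative only).
toy <- data.frame(
  Name     = c("Hot Chocolate", "Hot Chocolate"),
  Milk     = c("Whole Milk", "Whole Milk"),
  Portion  = c(8, 16),        # fl oz
  Calories = c(160, 320)
)

# Per-ounce value for every row: 160 / 8 and 320 / 16 are both 20
toy$cal_per_oz <- toy$Calories / toy$Portion

# Average the per-ounce value within each Name/Milk combination
per_oz <- aggregate(cal_per_oz ~ Name + Milk, data = toy, FUN = mean)
print(per_oz)
```

The two serving sizes collapse into a single row with 20 calories per ounce, so the size of the cup no longer drives the comparison.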

4 Visualisation

In this section, we will explore data visualisations for multidimensional data using a corrgram, a heatmap and a parallel coordinate plot.

4.1 Corrgram

A corrgram is a visual display technique that helps us to represent the pattern of relations among a set of variables in terms of their correlations.
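
Every corrgram in this section is a picture of a correlation matrix. As a minimal sketch with invented values (not the Starbucks data), the matrix itself can be computed with cor():

```r
# Illustrative values only (not from the Starbucks data).
toy <- data.frame(
  calories = c(100, 150, 200, 250, 300),
  sugars   = c(10, 16, 19, 27, 30),
  caffeine = c(150, 120, 100, 60, 40)
)

r <- cor(toy)          # Pearson correlation matrix, one row and column per variable
print(round(r, 2))
```

The matrix is symmetric with ones on the diagonal, which is why a corrgram can show different encodings (numbers, shapes, colours) in its upper and lower triangles without losing information.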

4.1.1 Corrgram using pairs()

The figure below shows a conventional corrgram using the pairs() function.

Show code
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
  usr <- par("usr")
  on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use="complete.obs"))
  txt <- format(c(r, 0.123456789), digits=digits)[1]
  txt <- paste(prefix, txt, sep="")
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}

pairs(drinks[,4:15], 
      upper.panel = panel.cor,
      label.pos = 0.5, 
      line.main = 3,
      cex.labels = 0.5, 
      font.labels = 0.5,
      gap = 0.2)

4.1.2 Corrgram using ggstatsplot

The figure below shows a corrgram created with the ggcorrmat() function of the ggstatsplot package. ggcorrplot arguments and ggplot2 components can be passed to the function to customise the corrgram’s size and colour.

Show code
ggstatsplot::ggcorrmat(
  data = drinks, 
  cor.vars = 4:15,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         lab_col = "black",
                         lab_size = 2,
                         pch.col = "red",
                         pch.cex = 6),
  title = "Correlogram for Starbucks Drink dataset",
  subtitle = "One pair is not significant at p < 0.05",
  ggplot.component = list(theme_void(base_size = 9),
                          theme(plot.title=element_text(size=12),
                                plot.subtitle=element_text(size=9),
                                legend.text = element_text(size=6),
                                axis.text.x = element_text(size = 6, angle = 45, hjust = 0.6),
                                axis.text.y = element_text(size = 6, hjust = 1)
                                ))
  )

4.1.3 Multiple Corrgrams using ggstatsplot

The figure below shows multiple corrgrams, one for each of the top 4 categories, created with the grouped_ggcorrmat() function of the ggstatsplot package.

Show code
grouped_ggcorrmat(
    data = topcat,
    cor.vars = 4:15,
    grouping.var = Category,
    type = "p",
    p.adjust.method = "holm",
    plotgrid.args = list(ncol = 2),
    ggcorrplot.args = list(outline.color = "black",
                           lab_col = "black",
                           lab_size = 1.5,
                           pch.col = "red",
                           pch.cex = 3),
    annotation.args = list(tag_levels = "a",
                           title = "Correlogram for Top 4 Categories of Starbucks Drink dataset",
                           subtitle = "The categories are: Espresso, Frappuccino blended beverages, Kids Drinks and Tea"),
    ggplot.component = list(theme_void(base_size = 7),
                          theme(plot.title = element_text(size=5),
                                plot.subtitle = element_text(size=3),
                                legend.text = element_text(size=5),
                                axis.text.x = element_text(size = 5, angle = 45, hjust = 0.6),
                                axis.text.y = element_text(size = 5, hjust = 1),
                                strip.text.x = element_text(size = 7),
                                legend.key.size = unit(3, 'mm')
                                ))
    )

4.1.4 Corrgram with significance level of 0.1 using corrplot()

The figure below shows a corrgram created with the corrplot package, combined with a significance test at the 0.1 level. The corrgram reveals that not all correlation pairs are statistically significant. For example, the correlation between total carbohydrate and sugars is statistically significant at the 0.1 level, but the correlation between caffeine and trans fat is not.

Show code
drinks.cor <- cor(drinks[, 4:15])

# cor.mtest() expects the raw data, not the correlation matrix
drinks.sig <- cor.mtest(drinks[, 4:15], conf.level = .9)

corrplot.mixed(drinks.cor,
               lower = "number",
               upper = "square",
               order="AOE",
               tl.pos = "lt",
               tl.col = "black",
               tl.cex = .6,
               tl.srt = 45,
               pch.col = "grey70",
               pch.cex = 1.5,
               number.cex = .6,
               cl.cex = .6,
               lower.col = "black",
               p.mat = drinks.sig$p, 
               sig.level = 0.1,
               title = "Correlogram of Starbucks Drinks with significance level of 0.1",
               mar=c(0,0,1,0)
               )
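
The significance test behind each cell can be reproduced for a single pair with base R's cor.test(). This is a minimal sketch with invented vectors; it assumes, as the corrplot documentation describes, that cor.mtest() applies this kind of test to every pair of columns in the raw data.

```r
# Illustrative vectors only (not from the Starbucks data).
x <- 1:10
y <- c(2, 4, 5, 4, 5, 7, 8, 9, 10, 12)   # rises with x
z <- c(5, 3, 6, 2, 7, 4, 6, 3, 5, 4)     # no clear relation to x

p_xy <- cor.test(x, y)$p.value
p_xz <- cor.test(x, z)$p.value

# TRUE means the pair is significant at the 0.1 level
print(c(x_y = p_xy, x_z = p_xz) < 0.1)
```

Pairs whose p-value is at or above sig.level are the ones corrplot marks as not significant.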

4.1.5 Corrgram with hierarchical clustering

The dend_expend() and find_k() functions of the dendextend package were used to determine the best clustering method and number of clusters.

Show code
drinks_matrix <- data.matrix(drinks.cor)

# Select columns 1 to 12; without the comma only the first 12 cells are selected
drinks_d <- dist(normalize(drinks_matrix[, c(1:12)]), method = "euclidean")

drinks_clust <- hclust(drinks_d, method = "average")
num_k <- find_k(drinks_clust)
plot(num_k)

Next, the corrgram was plotted using corrplot() and hclust based on the results of hierarchical clustering.

Show code
corrplot(drinks.cor,
         method = "ellipse",
         order="hclust",
         hclust.method = "ward.D",
         addrect = 3,
         tl.pos = "lt",
         tl.col = "black",
         tl.cex = .6,
         tl.srt = 45,
         number.cex = .6,
         cl.cex = .6,
         title = "Correlogram of Starbucks Drinks with 3 levels of hierarchical cluster",
         mar=c(0,0,1,0))
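
To illustrate how variables end up in the same rectangle, the sketch below clusters a small invented correlation matrix in base R. It assumes a dissimilarity of 1 - r between variables, which is a common choice and, to my understanding, what corrplot uses for order = "hclust"; the data and variable names are illustrative only.

```r
# Illustrative values only (not from the Starbucks data).
toy <- data.frame(
  calories = c(100, 150, 200, 250, 300),
  sugars   = c(10, 16, 19, 27, 30),
  fat      = c(2, 4, 5, 8, 9),
  caffeine = c(150, 120, 100, 60, 40)
)

r  <- cor(toy)
d  <- as.dist(1 - r)                 # highly correlated variables are close together
hc <- hclust(d, method = "ward.D")   # same linkage as hclust.method above
groups <- cutree(hc, k = 2)          # cut the tree into 2 groups
print(groups)
```

In this toy example calories, sugars and fat share one group while caffeine, which is negatively correlated with the rest, sits alone.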

4.1.6 Conclusion

In general, the corrgram for all Starbucks drinks shows that caffeine is mostly negatively correlated with the other nutritional factors except protein, whereas the remaining factors are positively correlated with one another. The diagram also shows that the following pairs of nutritional factors of Starbucks drinks are highly correlated (r > 0.90):

The trans fat(g) and caffeine(mg) pair is not significant at p < 0.05 and has a correlation coefficient of only 0.01.

From the multiple corrgrams, an interesting finding is that caffeine is positively correlated with the other factors for kids drinks and tea.

The Starbucks drinks nutrition factors can be separated into 3 clusters:

The nutrition factors in each cluster are correlated with one another. Caffeine is standalone as it is not highly correlated with the others and generally has a negative correlation with the rest.

4.2 Heatmap

A heatmap is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

The top drinks dataset where nutritional values have been normalised against the unit volume will be used for plotting the heat map.

First, the dend_expend() and find_k() functions of the dendextend package were used to determine the best clustering method and number of clusters.

Show code
topdrinks_matrix <- data.matrix(topdrinks_n)

# Select columns 1 to 12; without the comma only the first 12 cells are selected
topdrinks_d <- dist(normalize(topdrinks_matrix[, c(1:12)]), method = "euclidean")

topdrinks_clust <- hclust(topdrinks_d, method = "average")
topnum_k <- find_k(topdrinks_clust)
plot(topnum_k)

Next, the heatmaply package was used to plot the heatmap for Iced Coffee, Hot Chocolate and Pumpkin Spice Crème for different combinations of milk and whipped cream.

Show code
heatmaply(percentize(topdrinks_matrix),
          colors = Blues,
          k_row = 4,
          margins = c(0, 100, 30, 50), #btm, left, top, right
          fontsize_row = 6,
          fontsize_col = 6,
          xlab = "Nutrition",
          ylab = "Drinks",
          main = "Heatmap of Top 3 popular Starbucks Drinks")
[Interactive heatmap: "Heatmap of Top 3 popular Starbucks Drinks". x-axis: Nutrition (the 12 per-ounce nutritional values); y-axis: Drinks (each combination of drink, milk and whipped cream or sweetener); colour scale from 0.25 to 1.00.]
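
The colour scale runs up to 1.00 because percentize() replaces each value with its empirical percentile within its column. A minimal base R sketch of that idea, using ecdf() on an invented matrix (this illustrates the transformation and is not the heatmaply source code):

```r
# Illustrative matrix: rows are drinks, columns are nutrients on different scales.
m <- cbind(sodium_mg = c(5, 10, 20, 40),
           fat_g     = c(0.1, 0.4, 0.2, 0.3))

# Replace each value with the share of values in its column that are <= it
pct <- apply(m, 2, function(col) ecdf(col)(col))
print(pct)
```

After the transformation both columns take values between 0.25 and 1, so a nutrient measured in milligrams no longer dominates one measured in grams.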

4.2.1 Conclusion

The heatmap compares the nutritional value of hot chocolate, pumpkin spice crème and iced coffee, which are popular drinks at Starbucks. It shows that hot chocolate and pumpkin spice crème are generally unhealthier, containing higher sodium, sugars, carbohydrates and cholesterol levels than iced coffee. On the other hand, iced coffee contains higher caffeine levels than hot chocolate and pumpkin spice crème. The impact of milk, whipped cream and sweetener choices on the nutritional value of the drinks was further analysed using hierarchical clustering. The drinks were separated into 4 clusters:

For Hot Chocolate and Pumpkin Spice Crème, the nutritional value was determined first by whipped cream and then by milk type. In general, drinks with no whipped cream and plant-based milk are considered healthier, with lower sodium, sugars, carbohydrates and cholesterol levels. For Iced Coffee, the nutritional value was determined first by sweetener and then by milk type. Unsweetened iced coffee with plant-based milk is considered healthier, with lower sodium, sugars, carbohydrates and cholesterol levels.

4.3 Parallel Coordinate Plot

Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an n-dimensional space, a backdrop is drawn consisting of n parallel lines, typically vertical and equally spaced. A point in n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point.

The parallel coordinate plot was created using the parallelPlot package.

Show code
drinks.pp <- drinks %>%
  select(c(4:15))

histoVisibility <- rep(TRUE, ncol(drinks.pp))

parallelPlot(drinks.pp,
             rotateTitle = TRUE,
             continuousCS = 'YlGnBu',
             histoVisibility = histoVisibility)
[Interactive parallel coordinate plot with one axis per variable: Calories, Calories from fat, Total Fat(g), Saturated fat(g), Trans fat(g), Cholesterol(mg), Sodium(mg), Total Carbohydrate(g), Dietary Fiber(g), Sugars(g), Protein(g), Caffeine(mg).]
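
The construction described at the start of this section (one vertical axis per variable, one polyline per observation) can be sketched in base R. The data and the helper rescale01() are hypothetical; each column is rescaled to [0, 1] so that all axes share one vertical scale, which is a simplification of what an interactive widget does.

```r
# Illustrative values only (not from the Starbucks data).
toy <- data.frame(
  calories = c(100, 300, 200),
  sugars   = c(10, 30, 19),
  caffeine = c(150, 40, 100)
)

# Hypothetical helper: rescale a column to the range [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- sapply(toy, rescale01)

# One polyline per row: x is the axis index, y the rescaled value on that axis
matplot(t(scaled), type = "l", lty = 1, xaxt = "n",
        xlab = "", ylab = "rescaled value")
axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled))
```

Lines that cross between two neighbouring axes indicate a negative relationship between those variables, while roughly parallel lines indicate a positive one.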

4.3.1 Conclusion

The findings from the parallel coordinate plot are generally in line with the corrgram above. Drinks with high calories typically have high total carbohydrate, sugars and sodium and lower caffeine, and vice versa. Some factors, such as trans fat, dietary fibre and cholesterol, are unevenly distributed, with most drinks having low values for those factors. They may therefore not be good indicators of the calorific content of the drinks.
