Data Wrangling With R
Introduction
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. (Wikipedia, 2020)
Data scientists and data analysts spend the majority of their time performing data wrangling. And as such, any aspiring data scientist or data analyst must master the techniques of data wrangling in a programming language of his/her choice (R
or Python
).
That said, in this article, we will demonstrate how to perform data wrangling with dplyr
using South Sudan 2008 census data that we obtained from South Sudan Data Portal.
Our workflow
- Load packages
- Import the dataset
- Data wrangle with
dplyr
- Tabulate data with
gt
- Visualize data with
ggplot2
&plotly
- Summary
Load packages
Here we will load tidyverse
so that we can access both dplyr
for data wrangling and readr
for importing CSV files; gt
for tabulating data; and plotly
for plotting interactive graphs.
# Load packages
library(RWordPress)
library(knitr)
library(XMLRPC)
library(tidyverse)
library(gt)
library(plotly)
Import the Dataset
Now, let’s import our dataset from a data folder within our blog directory and call it “ss_2008_census_data_raw”.
# Import the dataset
ss_2008_census_data_raw <- readr::read_csv("ss_2008_census_data.csv")
# Let's view the first 5 rows with slice_head()
ss_2008_census_data_raw %>% slice_head(n = 5)
Here we see that there are no blank rows or NAs. So let’s view the last 5 rows.
# Let's view the last 5 rows with slice_tail()
ss_2008_census_data_raw %>% slice_tail(n = 5)
Here, a quick examination of the last five rows reveals three rows with missing values or NAs. And two of which are the source where we obtained the data from and the source link; the third to the last row has all nas.
Wrangle the data with Dplyr
Now, let’s begin our data wrangling process here.
- Since we have already determined that our data has three rows with
NAs
and because these rows add no value to our data, we will delete them with!is.na()
.
# Subset the data
ss_2008_census_data_clean <- ss_2008_census_data_raw %>%
# Select the columns to keep with select() & contains()
select(contains("Name"), "2008") %>%
# Delete nas with !is.na()
filter(!is.na(`2008`)) %>%
# Rename the columns with set_names()
set_names("State", "Category", "Age Category", "Population") %>%
# Separate the Category column into three columns with separate()
separate(Category,
into = c("Pop.", "Population Category", "Other"),
sep = " ") %>%
# Delete the extra columns that we no longer need:'Pop.' & 'Other'
select(-Other, -Pop.)
# Inspect the top 5 rows once again
ss_2008_census_data_clean %>%
slice_head(n = 5)
Next, view the last 5 rows one more time to ensure that we have indeed deleted the rows with NAs
.
# Inspect the last 5 rows
ss_2008_census_data_clean %>%
slice_tail(n = 5)
The above outputs show that we have successfully deleted the NAs
. So, let’s now transform the Age Category column. Since there are too many age categories, we will reduce them by creating new categories as shown below.
# Modify the age category; let's subset the data again
ss_2008_census_wrangled_tbl <- ss_2008_census_data_clean %>%
# Transform the age category column with case_when()
mutate(`Age Category` = case_when(
`Age Category` %in% c("0 to 4", "5 to 9", "10 to 14", "15 to 19") ~ "0-19",
`Age Category` %in% c( "20 to 24", "25 to 29", "30 to 34" ) ~ "20-34",
`Age Category` %in% c("35 to 39", "40 to 44", "45 to 49") ~ "35-49",
`Age Category` %in% c( "50 to 54", "55 to 59", "60 to 64") ~ "50-64",
`Age Category` == "Total" ~ "Total",
TRUE ~ ">= 65")
) %>%
# Group the dataset by state and population category and then summarize
group_by(State, `Population Category`,
`Age Category`) %>%
summarise(Population = sum(Population),
.groups = "drop")
# Inspect the new column names
ss_2008_census_wrangled_tbl %>%
slice_head(n = 5)
Modify the State Names
Sometimes it may be necessary to transform variable names to make them shorter or longer, depending on your project (particularly when visualizing the data). Here, we will shorten the state names to illustrate how to accomplish that with the case_when()
.
# Transform state names to the appropriate length
ss_2008_census_data_short_state_names_tbl <- ss_2008_census_wrangled_tbl %>%
# Modify the state column
mutate(State = case_when(
State == "Northern Bahr el Ghazal" ~ "N. Bhar el G.",
State == "Western Bahr el Ghazal" ~ "W. Bhar el G.",
State == "Western Equatoria" ~ "W. Equatoria",
State == "Central Equatoria" ~ "C. Equatoria",
State == "Eastern Equatoria" ~ "E. Equatoria",
TRUE ~ State))
# Inspect the new names
ss_2008_census_data_short_state_names_tbl %>%
# Select and view the distinct state names
pull(State) %>% unique() %>%
# Format state names as a tibble
tibble()
Tabulate state total populations with gt
Now that we have wrangled and tidied our data let’s tabulate the state total populations to see which state has the largest population and which one has the smallest population.
While there are several R packages for creating data tables, I personally love the gt
package – I use it every day at work. With that being said, the other R Packages for creating data tables can be found at the bottom of this link: gt.
# Tabulate state general populations with gt()
state_gt <- ss_2008_census_wrangled_tbl %>%
# Filter to obtain state populations
filter(`Population Category`== "Total" &
`Age Category` == "Total") %>%
# Select state and total population columns only
select(1, 4) %>%
# Arrange the data in descending order by population
arrange(desc(Population)) %>%
# Add the total column to display South Sudan total population
add_row(State = "Total Population",
Population = sum(.$Population)) %>%
# Add the percentage column with mutate() & format values with the scales::percent()
mutate(Proportion = (Population / 8260490) %>%
scales::percent(accuracy = 0.1)) %>%
# Initialize a gt table
gt() %>%
# Add the title and the subtitle
tab_header(
title = "South Sudan 2008 Census Data by State",
subtitle = "Jonglei State has the largest population based on this analysis!"
) %>%
# Format the column values; remove decimal points; apply a comma as the separator
fmt_number(
columns = vars(Population),
decimals = 0,
use_seps = TRUE) %>%
# Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
tab_options(
heading.background.color = "#4caf50",
column_labels.background.color = "#010f05",
heading.title.font.size = 25,
heading.subtitle.font.size = 20,
table.width = "50%"
) %>%
# Apply the color-coding to the total row
tab_style(
style = list(
cell_fill(color = "#010f05"),
cell_text(weight = "bold",
color = "white")),
locations = cells_body(
columns = vars(State, Population, Proportion),
rows = State == "Total" | Population == 8260490)) %>%
# Align column values
cols_align(
columns = 2:3,
align = "center"
) %>%
cols_align(
columns = 1,
align = "left"
) %>%
# Add a source note
tab_source_note(
source_note = "Data Source: http://southsudan.opendataforafrica.org"
)
# Display the gt table
state_gt
<!–html_preserve–>
South Sudan 2008 Census Data by State | ||
---|---|---|
Jonglei State has the largest population! | ||
State | Population | Proportion |
Jonglei | 1,358,602 | 16.4% |
Central Equatoria | 1,103,557 | 13.4% |
Warrap | 972,928 | 11.8% |
Upper Nile | 964,353 | 11.7% |
Eastern Equatoria | 906,161 | 11.0% |
Northern Bahr el Ghazal | 720,898 | 8.7% |
Lakes | 695,730 | 8.4% |
Western Equatoria | 619,029 | 7.5% |
Unity | 585,801 | 7.1% |
Western Bahr el Ghazal | 333,431 | 4.0% |
Total Population | 8,260,490 | 100.0% |
Data Source: http://southsudan.opendataforafrica.org |
<!–/html_preserve–>
Tabulate the data by Gender
# Tabulate the state populations by gender
gender_gt <- ss_2008_census_wrangled_tbl %>%
# Filter to extract state total populations by gender
filter(`Age Category` == "Total",
`Population Category` %in% c("Female", "Male")) %>%
# Rename a column
rename(Gender = `Population Category`) %>%
# Delete unwanted column
select(-`Age Category`) %>%
# Spread the data with pivot_wider()
pivot_wider(
names_from = Gender,
values_from = Population) %>%
# Arrange data in a descending order using the Male column
arrange(desc(Male)) %>%
# Initialize a gt table
gt() %>%
# Add the title and the subtitle
tab_header(
title = "South Sudan 2008 Census Data by State and Gender",
subtitle = "Jonglei State has the largest population in both categories: Male & Female based on this analysis"
) %>%
# Add the row totals with grand_summary_rows()
grand_summary_rows(
columns = vars("Female", "Male"),
fns = list(
`Grand Totals` = ~ sum(.)),
formatter = fmt_number,
use_seps = FALSE
) %>%
# Align column values
cols_align(
columns = 2:3,
align = "center") %>%
# Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
tab_options(
heading.background.color = "#4caf50",
column_labels.background.color = "#010f05",
heading.title.font.size = 25,
heading.subtitle.font.size = 15,
table.width = "50%"
)
# Display the table
gender_gt
<!–html_preserve–>
South Sudan 2008 Census Data by State and Gender | |||
---|---|---|---|
Jonglei State has the largest population in both categories: Male & Female based on this analysis | |||
State | Female | Male | |
Jonglei | 624275 | 734327 | |
Central Equatoria | 521835 | 581722 | |
Upper Nile | 438923 | 525430 | |
Warrap | 502194 | 470734 | |
Eastern Equatoria | 440974 | 465187 | |
Lakes | 329850 | 365880 | |
Northern Bahr el Ghazal | 372608 | 348290 | |
Western Equatoria | 300586 | 318443 | |
Unity | 285554 | 300247 | |
Western Bahr el Ghazal | 156391 | 177040 | |
Grand Totals | — | 3973190.00 | 4287300.00 |
<!–/html_preserve–>
Tabulate the data by Age Category
# Transform the data further to display it by age category
age_category_gt <- ss_2008_census_wrangled_tbl %>%
# Exclude state total populations
filter(
`Age Category` != "Total"
) %>%
# Spread the data with pivot_wider()
pivot_wider(
names_from = `Age Category`,
values_from = Population) %>%
# Arrange by 0-19 column
arrange(State, `Population Category`, desc(`0-19`)) %>%
# Initialize a gt table
gt() %>%
# Add the title and the subtitle
tab_header(
title = "South Sudan 2008 Census Data by Age Category",
subtitle = "Jonglei State still leads all the other states in all Age Categories"
) %>%
# Style the total rows
tab_style(
style = list(
cell_fill(color = "white"),
cell_text(weight = "bold")),
locations = cells_body(
columns = 1:7,
rows = `Population Category` == "Total" | State == "Total")
) %>%
# Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
tab_options(
heading.background.color = "#4caf50",
column_labels.background.color = "grey",
heading.title.font.size = 30,
heading.subtitle.font.size = 15,
table.width = "60%"
)
# Display the gt table
age_category_gt
<!–html_preserve–>
South Sudan 2008 Census Data by Age Category | ||||||
---|---|---|---|---|---|---|
Jonglei State still leads all the other states in all Age Categories | ||||||
State | Population Category | >= 65 | 0-19 | 20-34 | 35-49 | 50-64 |
Central Equatoria | Female | 8596 | 283092 | 139942 | 66745 | 23460 |
Central Equatoria | Male | 11409 | 308935 | 153332 | 79238 | 28808 |
Central Equatoria | Total | 20005 | 592027 | 293274 | 145983 | 52268 |
Eastern Equatoria | Female | 8637 | 243642 | 111079 | 57120 | 20496 |
Eastern Equatoria | Male | 12528 | 274404 | 99862 | 55139 | 23254 |
Eastern Equatoria | Total | 21165 | 518046 | 210941 | 112259 | 43750 |
Jonglei | Female | 12384 | 329048 | 164193 | 87198 | 31452 |
Jonglei | Male | 22658 | 419182 | 157319 | 90925 | 44243 |
Jonglei | Total | 35042 | 748230 | 321512 | 178123 | 75695 |
Lakes | Female | 6396 | 176918 | 86832 | 42932 | 16772 |
Lakes | Male | 10100 | 198581 | 87219 | 49536 | 20444 |
Lakes | Total | 16496 | 375499 | 174051 | 92468 | 37216 |
Northern Bahr el Ghazal | Female | 12585 | 200375 | 89179 | 48861 | 21608 |
Northern Bahr el Ghazal | Male | 13523 | 204291 | 63709 | 45635 | 21132 |
Northern Bahr el Ghazal | Total | 26108 | 404666 | 152888 | 94496 | 42740 |
Unity | Female | 7801 | 163798 | 66837 | 33267 | 13851 |
Unity | Male | 8999 | 179616 | 62313 | 34091 | 15228 |
Unity | Total | 16800 | 343414 | 129150 | 67358 | 29079 |
Upper Nile | Female | 10144 | 237435 | 108924 | 60058 | 22362 |
Upper Nile | Male | 15746 | 294848 | 113552 | 70681 | 30603 |
Upper Nile | Total | 25890 | 532283 | 222476 | 130739 | 52965 |
Warrap | Female | 10625 | 273397 | 127170 | 66936 | 24066 |
Warrap | Male | 12345 | 275805 | 94888 | 63010 | 24686 |
Warrap | Total | 22970 | 549202 | 222058 | 129946 | 48752 |
Western Bahr el Ghazal | Female | 3527 | 83151 | 41467 | 20767 | 7479 |
Western Bahr el Ghazal | Male | 4171 | 92265 | 45326 | 26307 | 8971 |
Western Bahr el Ghazal | Total | 7698 | 175416 | 86793 | 47074 | 16450 |
Western Equatoria | Female | 7369 | 148059 | 83592 | 45314 | 16252 |
Western Equatoria | Male | 11541 | 162324 | 77197 | 47857 | 19524 |
Western Equatoria | Total | 18910 | 310383 | 160789 | 93171 | 35776 |
<!–/html_preserve–>
Visualize the Data with ggplot2
Now that we have tabulated our data with gt
, let’s visualize it with ggplot2
and plotly
.
# Plot the state population data
state_gg <- ss_2008_census_wrangled_tbl %>%
# Filter to isolate state total populations
filter(
`Population Category` == "Total" &
`Age Category` == "Total") %>%
# Select the columns to use
select(State, Population) %>%
# Arrange the data in descending order by population
arrange(desc(Population)) %>%
# Modify the state names to wrap them in the x-axis
mutate(State = State %>% str_wrap(width = 8)) %>%
# Initialize a ggplot
ggplot(aes(x = State %>% as_factor(), y = Population)) +
# Add the bars
geom_bar(aes(fill = State),
stat = "identity") +
# Add the title
labs(title = "South Sudan 2008 Population Census Results by State",
x = NULL,
y = NULL) +
# Format the population values
scale_y_continuous(labels = scales::number) +
# Adjust the y-axis scale
expand_limits(y = c(0, 1500000)) +
# Hide the legend; center the title; add a margin around
theme(
plot.title = element_text(hjust = 0.5,
margin = margin(0, 0, 10, 0)),
legend.position = "none",
axis.text.x = element_text(hjust = 1),
strip.text.x = element_text(margin = margin(5, 5, 5, 5, unit = "pt"))
)
# Display the graph
state_gg
Plot an interactive graph with plotly
# Plot an interactive graph
ggplotly(state_gg)
## Error in file(con, "rb"): cannot open the connection
Plot a Stacked Bar Graph
# Plot a stacked bar graph
gender_gg_stacked <- ss_2008_census_wrangled_tbl %>%
# Filter to isolate the state total population by gender
filter(`Age Category` == "Total" &
`Population Category` %in% c('Male', 'Female')) %>%
# Rename a column
rename(Gender = `Population Category`) %>%
# Delete unwanted column
select(-`Age Category`) %>%
# Arrange by population column in descending order
arrange(desc(Population)) %>%
# Modify state names to wrap around
mutate(State = State %>% str_wrap(width = 8)) %>%
# Initialize a ggplot
ggplot(aes(x = State %>% as_factor(), y = Population)) +
# Add the bars
geom_bar(aes(fill = Gender),
stat = "identity") +
# Add the title
labs(
title = "South Sudan 2008 Population Census Results by State and Gender",
x = NULL,
y = NULL) +
# Hide the legend; center the title
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "none",
axis.text.x = element_text(hjust = 1)) +
# Add data labels on the y-axis
scale_y_continuous(labels = scales::number) +
# Adjust the y-axis scale
expand_limits(y = c(0, 1500000))
# Display the graph
ggplotly(gender_gg_stacked)
## Error in file(con, "rb"): cannot open the connection
Visualize the Data by Age Category
# Subset the data
age_category_gg <- ss_2008_census_wrangled_tbl %>%
# Filter out the totals
filter(`Population Category` != "Total",
`Age Category` != "Total") %>%
# Arrange the data in descending order by population
arrange(desc(Population)) %>%
# Initialize the ggplot
ggplot(aes(x = State %>% as_factor(), y = Population)) +
# Add the bars
geom_bar(aes(fill = State), stat = "identity") +
# Facet the graph with facet_wrap()
facet_wrap(~ `Age Category`, scales = "free_y") +
# Add the title
labs(
title = "South Sudan 2008 Population Census Results by Age Category",
x = NULL,
y = NULL) +
# Format the population values
scale_y_continuous(labels = scales::number) +
# Adjust the y-axis scale
expand_limits(y = 0) +
# Hide the legend; center the title
theme(
plot.title = element_text(margin = margin(0, 0, 15, 0, unit = "pt")),
legend.position = "none",
strip.text.x = element_text(margin = margin(5, 5, 5, 5, unit = "pt")),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Display the graph
age_category_gg
Summary
Data wrangling is an important aspect of data science, and as such, any aspiring data scientist or data analyst must master the techniques of data wrangling with R
(or with Python
, for those who use Python) to optimize their productivity. Because, in practice, data scientists and data analysts spend about 80% of their time performing data wrangling and the remainder of their time on data analysis and data modeling. With that being said, in this article, we have thoroughly demonstrated how to perform data wrangling with dplyr
; tabulate data with gt
; and visualize data with both ggplot2
and plotly
.