• Home
      • ABOUT US
      • CONTACT US
      • JIT LEADERSHIP
      • FAMILY FIRST INITIATIVE
  • JIT LEADERSHIP
  • CONTACT US
  • FORUMS
  • EVENTS
  • FAQs
  • GIVE
MY ACCOUNT
Jonglei Institute of Technology

ABOUT US

Learn more about who we are and what we do.

FAMILY FIRST

Learn more about how JIT program may help you and your family.

COURSES

Kickstart your data science journey with our free world-class online courses.

BLOG

Monthly articles on data science, Python R, and statistics.
ENROLL TODAY!

Data Science

Data Wrangling With R

  • Posted by Alier Reng
  • Categories Data Science, R, Statistics
  • Date August 12, 2020
  • Comments 0 comment
  • Tags

Introduction

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. (Wikipedia, 2020)

Data scientists and data analysts spend the majority of their time performing data wrangling. And as such, any aspiring data scientist or data analyst must master the techniques of data wrangling in a programming language of his/her choice (R or Python).

That said, in this article, we will demonstrate how to perform data wrangling with dplyr using South Sudan 2008 census data that we obtained from South Sudan Data Portal.

Our workflow

  • Load packages
  • Import the dataset
  • Data wrangle with dplyr
  • Tabulate data with gt
  • Visualize data with ggplot2 & plotly
  • Summary

Load packages

Here we will load tidyverse so that we can access both dplyr for data wrangling and readr for importing CSV files; gt for tabulating data; and plotly for plotting interactive graphs.

# Load packages
library(RWordPress)
library(knitr)
library(XMLRPC)
library(tidyverse)
library(gt)
library(plotly)

Import the Dataset

Now, let’s import our dataset from a data folder within our blog directory and call it “ss_2008_census_data_raw”.

# Import the dataset
ss_2008_census_data_raw <- readr::read_csv("ss_2008_census_data.csv") 

# Let's view the first 5 rows with slice_head()
ss_2008_census_data_raw %>% slice_head(n = 5) 

Here we see that there are no blank rows or NAs. So let’s view the last 5 rows.

# Let's view the last 5 rows with slice_tail()
ss_2008_census_data_raw %>% slice_tail(n = 5) 

Here, a quick examination of the last five rows reveals three rows with missing values or NAs. And two of which are the source where we obtained the data from and the source link; the third to the last row has all nas.

Wrangle the data with Dplyr

Now, let’s begin our data wrangling process here.

  • Since we have already determined that our data has three rows with NAs and because these rows add no value to our data, we will delete them with !is.na().
# Subset the data
ss_2008_census_data_clean <-  ss_2008_census_data_raw %>% 

    # Select the columns to keep with select() & contains()
    select(contains("Name"), "2008") %>% 

    # Delete nas with !is.na()
    filter(!is.na(`2008`)) %>% 

    # Rename the columns with set_names()
    set_names("State", "Category", "Age Category", "Population") %>% 

    # Separate the Category column into three columns with separate()
    separate(Category,

           into = c("Pop.", "Population Category", "Other"),

           sep  = " ") %>%

  # Delete the extra columns that we no longer need:'Pop.' & 'Other'
   select(-Other, -Pop.)  


# Inspect the top 5 rows once again
ss_2008_census_data_clean %>% 

  slice_head(n = 5) 

Next, view the last 5 rows one more time to ensure that we have indeed deleted the rows with NAs.

# Inspect the last 5 rows
ss_2008_census_data_clean %>% 

  slice_tail(n = 5)

The above outputs show that we have successfully deleted the NAs. So, let’s now transform the Age Category column. Since there are too many age categories, we will reduce them by creating new categories as shown below.

# Modify the age category; let's subset the data again
ss_2008_census_wrangled_tbl <- ss_2008_census_data_clean %>%

  # Transform the age category column with case_when()
  mutate(`Age Category` = case_when(
    `Age Category`    %in%  c("0 to 4", "5 to 9", "10 to 14", "15 to 19")  ~  "0-19",
    `Age Category`    %in%  c( "20 to 24", "25 to 29", "30 to 34" )        ~ "20-34",
    `Age Category`    %in%  c("35 to 39", "40 to 44", "45 to 49")          ~ "35-49",
    `Age Category`    %in%  c( "50 to 54", "55 to 59", "60 to 64")         ~ "50-64",
    `Age Category`     ==   "Total"                                        ~ "Total",
    TRUE                                                                   ~ ">= 65")
  ) %>% 

  # Group the dataset by state and population category and then summarize
  group_by(State, `Population Category`, 
           `Age Category`) %>% 
  summarise(Population = sum(Population),
            .groups    = "drop")


# Inspect the new column names
ss_2008_census_wrangled_tbl %>% 

  slice_head(n = 5)

Modify the State Names

Sometimes it may be necessary to transform variable names to make them shorter or longer, depending on your project (particularly when visualizing the data). Here, we will shorten the state names to illustrate how to accomplish that with the case_when().

# Transform state names to the appropriate length
ss_2008_census_data_short_state_names_tbl <- ss_2008_census_wrangled_tbl %>% 

  # Modify the state column
  mutate(State = case_when(
    State     == "Northern Bahr el Ghazal" ~  "N. Bhar el G.",
    State     == "Western Bahr el Ghazal"  ~  "W. Bhar el G.",
    State     == "Western Equatoria"       ~  "W. Equatoria",
    State     == "Central Equatoria"       ~  "C. Equatoria",
    State     == "Eastern Equatoria"       ~  "E. Equatoria",
    TRUE                                   ~  State)) 


# Inspect the new names
ss_2008_census_data_short_state_names_tbl %>% 

  # Select and view the distinct state names
  pull(State) %>% unique() %>% 

  # Format state names as a tibble
  tibble()

Tabulate state total populations with gt

Now that we have wrangled and tidied our data let’s tabulate the state total populations to see which state has the largest population and which one has the smallest population.

While there are several R packages for creating data tables, I personally love the gt package – I use it every day at work. With that being said, the other R Packages for creating data tables can be found at the bottom of this link: gt.

# Tabulate state general populations with gt()
state_gt <- ss_2008_census_wrangled_tbl %>% 

  # Filter to obtain state populations
  filter(`Population Category`== "Total" &
    `Age Category`            == "Total") %>% 

  # Select state and total population columns only
  select(1, 4) %>%  

  # Arrange the data in descending order by population
  arrange(desc(Population)) %>% 

  # Add the total column to display South Sudan total population
  add_row(State      = "Total Population",
          Population = sum(.$Population)) %>%

  # Add the percentage column with mutate() & format values with the scales::percent()
  mutate(Proportion =  (Population / 8260490) %>% 
           scales::percent(accuracy = 0.1)) %>% 


  # Initialize a gt table
  gt() %>% 

  # Add the title and the subtitle
  tab_header(
    title    = "South Sudan 2008 Census Data by State", 
    subtitle = "Jonglei State has the largest population based on this analysis!"
      ) %>%

  # Format the column values; remove decimal points; apply a comma as the separator
  fmt_number(
    columns  = vars(Population),
    decimals = 0, 
    use_seps = TRUE) %>%

  # Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
  tab_options(
    heading.background.color       = "#4caf50",
    column_labels.background.color = "#010f05",
    heading.title.font.size        = 25,
    heading.subtitle.font.size     = 20,
    table.width                    = "50%"
              ) %>% 

  # Apply the color-coding to the total row
  tab_style(
    style              = list(
      cell_fill(color  = "#010f05"),
      cell_text(weight = "bold",
                color  = "white")),
    locations          = cells_body(
      columns          = vars(State, Population, Proportion),
      rows             = State == "Total" | Population == 8260490)) %>% 

  # Align column values
  cols_align(
      columns = 2:3,
      align   = "center"
     ) %>% 

  cols_align(
    columns = 1,
    align   = "left"
  ) %>% 

  # Add a source note
  tab_source_note(
      source_note = "Data Source: http://southsudan.opendataforafrica.org"

  )


# Display the gt table
state_gt

<!–html_preserve–>

South Sudan 2008 Census Data by State
Jonglei State has the largest population!
State Population Proportion
Jonglei 1,358,602 16.4%
Central Equatoria 1,103,557 13.4%
Warrap 972,928 11.8%
Upper Nile 964,353 11.7%
Eastern Equatoria 906,161 11.0%
Northern Bahr el Ghazal 720,898 8.7%
Lakes 695,730 8.4%
Western Equatoria 619,029 7.5%
Unity 585,801 7.1%
Western Bahr el Ghazal 333,431 4.0%
Total Population 8,260,490 100.0%
Data Source: http://southsudan.opendataforafrica.org

<!–/html_preserve–>

Tabulate the data by Gender

# Tabulate the state populations by gender
gender_gt <- ss_2008_census_wrangled_tbl %>% 

  # Filter to extract state total populations by gender
  filter(`Age Category` == "Total",
         `Population Category` %in% c("Female", "Male")) %>% 

  # Rename a column
  rename(Gender = `Population Category`) %>% 

 # Delete unwanted column
  select(-`Age Category`) %>% 

  # Spread the data with pivot_wider()
  pivot_wider(
         names_from  = Gender, 
         values_from = Population) %>% 

  # Arrange data in a descending order using the Male column
  arrange(desc(Male)) %>%

  # Initialize a gt table
  gt() %>% 

  # Add the title and the subtitle
  tab_header(
    title    = "South Sudan 2008 Census Data by State and Gender", 
    subtitle = "Jonglei State has the largest population in both categories: Male & Female based on this analysis"
      ) %>%

  # Add the row totals with grand_summary_rows()
  grand_summary_rows(
    columns          = vars("Female", "Male"),
    fns              = list(
      `Grand Totals` = ~ sum(.)),
    formatter        = fmt_number,
    use_seps         = FALSE
  ) %>% 

  # Align column values
  cols_align(
    columns = 2:3,
    align   = "center") %>% 

  # Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
  tab_options(
    heading.background.color       = "#4caf50",
    column_labels.background.color = "#010f05",
    heading.title.font.size        = 25,
    heading.subtitle.font.size     = 15,
    table.width                    = "50%"
              ) 


# Display the table
gender_gt

<!–html_preserve–>

South Sudan 2008 Census Data by State and Gender
Jonglei State has the largest population in both categories: Male & Female based on this analysis
State Female Male
Jonglei 624275 734327
Central Equatoria 521835 581722
Upper Nile 438923 525430
Warrap 502194 470734
Eastern Equatoria 440974 465187
Lakes 329850 365880
Northern Bahr el Ghazal 372608 348290
Western Equatoria 300586 318443
Unity 285554 300247
Western Bahr el Ghazal 156391 177040
Grand Totals — 3973190.00 4287300.00

<!–/html_preserve–>

Tabulate the data by Age Category

# Transform the data further to display it by age category
age_category_gt <- ss_2008_census_wrangled_tbl %>%

  # Exclude state total populations
  filter(
    `Age Category` != "Total"
    ) %>% 

  # Spread the data with pivot_wider()
  pivot_wider(
       names_from   = `Age Category`, 
       values_from  = Population) %>% 

  # Arrange by 0-19 column
  arrange(State, `Population Category`, desc(`0-19`)) %>%     


  # Initialize a gt table
  gt() %>% 

  # Add the title and the subtitle
  tab_header(
    title    = "South Sudan 2008 Census Data by Age Category", 
    subtitle = "Jonglei State still leads all the other states in all Age Categories"
      ) %>% 

  # Style the total rows
  tab_style(
    style              = list(
      cell_fill(color  = "white"),
      cell_text(weight = "bold")),
    locations          = cells_body(
      columns          = 1:7,
      rows             = `Population Category` == "Total" | State == "Total")
  ) %>% 

  # Modify the background color; adjust the title and the subtitle font sizes; adjust the table width
  tab_options(
    heading.background.color       = "#4caf50",
    column_labels.background.color = "grey",
    heading.title.font.size        = 30,
    heading.subtitle.font.size     = 15,
    table.width                    = "60%"
              ) 

# Display the gt table
age_category_gt

<!–html_preserve–>

South Sudan 2008 Census Data by Age Category
Jonglei State still leads all the other states in all Age Categories
State Population Category >= 65 0-19 20-34 35-49 50-64
Central Equatoria Female 8596 283092 139942 66745 23460
Central Equatoria Male 11409 308935 153332 79238 28808
Central Equatoria Total 20005 592027 293274 145983 52268
Eastern Equatoria Female 8637 243642 111079 57120 20496
Eastern Equatoria Male 12528 274404 99862 55139 23254
Eastern Equatoria Total 21165 518046 210941 112259 43750
Jonglei Female 12384 329048 164193 87198 31452
Jonglei Male 22658 419182 157319 90925 44243
Jonglei Total 35042 748230 321512 178123 75695
Lakes Female 6396 176918 86832 42932 16772
Lakes Male 10100 198581 87219 49536 20444
Lakes Total 16496 375499 174051 92468 37216
Northern Bahr el Ghazal Female 12585 200375 89179 48861 21608
Northern Bahr el Ghazal Male 13523 204291 63709 45635 21132
Northern Bahr el Ghazal Total 26108 404666 152888 94496 42740
Unity Female 7801 163798 66837 33267 13851
Unity Male 8999 179616 62313 34091 15228
Unity Total 16800 343414 129150 67358 29079
Upper Nile Female 10144 237435 108924 60058 22362
Upper Nile Male 15746 294848 113552 70681 30603
Upper Nile Total 25890 532283 222476 130739 52965
Warrap Female 10625 273397 127170 66936 24066
Warrap Male 12345 275805 94888 63010 24686
Warrap Total 22970 549202 222058 129946 48752
Western Bahr el Ghazal Female 3527 83151 41467 20767 7479
Western Bahr el Ghazal Male 4171 92265 45326 26307 8971
Western Bahr el Ghazal Total 7698 175416 86793 47074 16450
Western Equatoria Female 7369 148059 83592 45314 16252
Western Equatoria Male 11541 162324 77197 47857 19524
Western Equatoria Total 18910 310383 160789 93171 35776

<!–/html_preserve–>

Visualize the Data with ggplot2

Now that we have tabulated our data with gt, let’s visualize it with ggplot2 and plotly.

# Plot the state population data
state_gg <- ss_2008_census_wrangled_tbl %>% 

  # Filter to isolate state total populations
  filter(
    `Population Category` == "Total" &
      `Age Category`      == "Total") %>% 

  # Select the columns to use
  select(State, Population) %>%  

  # Arrange the data in descending order by population
  arrange(desc(Population)) %>%

  # Modify the state names to wrap them in the x-axis
  mutate(State = State %>% str_wrap(width = 8)) %>% 

  # Initialize a ggplot
  ggplot(aes(x = State %>% as_factor(), y = Population)) +

  # Add the bars
  geom_bar(aes(fill = State), 
           stat     = "identity") + 

  # Add the title
  labs(title    = "South Sudan 2008 Population Census Results by State", 
       x        = NULL,
       y        = NULL) +

  # Format the population values
  scale_y_continuous(labels = scales::number) +

  # Adjust the y-axis scale
  expand_limits(y = c(0, 1500000)) +

  # Hide the legend; center the title; add a margin around
  theme(
    plot.title      = element_text(hjust  = 0.5, 
                                   margin = margin(0, 0, 10, 0)),
    legend.position = "none",
    axis.text.x     = element_text(hjust = 1),
    strip.text.x    = element_text(margin = margin(5, 5, 5, 5, unit = "pt"))
  ) 

# Display the graph
state_gg

plot of chunk unnamed-chunk-11

Plot an interactive graph with plotly

# Plot an interactive graph
ggplotly(state_gg)
## Error in file(con, "rb"): cannot open the connection

Plot a Stacked Bar Graph

# Plot a stacked bar graph 
gender_gg_stacked <- ss_2008_census_wrangled_tbl %>% 

  # Filter to isolate the state total population by gender
  filter(`Age Category` == "Total" &
           `Population Category` %in% c('Male', 'Female')) %>% 

  # Rename a column
  rename(Gender = `Population Category`) %>% 

  # Delete unwanted column
  select(-`Age Category`) %>% 

  # Arrange by population column in descending order
  arrange(desc(Population)) %>% 

  # Modify state names to wrap around
  mutate(State = State %>% str_wrap(width = 8)) %>% 

  # Initialize a ggplot
  ggplot(aes(x = State %>% as_factor(), y = Population)) +

  # Add the bars
  geom_bar(aes(fill = Gender), 
           stat     = "identity") +

  # Add the title
  labs(
    title = "South Sudan 2008 Population Census Results by State and Gender",
    x   = NULL,
    y   = NULL) +

  # Hide the legend; center the title
  theme(
    plot.title      = element_text(hjust = 0.5),
    legend.position = "none",
    axis.text.x     = element_text(hjust = 1)) +

  # Add data labels on the y-axis
  scale_y_continuous(labels = scales::number) +

  # Adjust the y-axis scale
  expand_limits(y = c(0, 1500000)) 


# Display the graph
ggplotly(gender_gg_stacked)
## Error in file(con, "rb"): cannot open the connection

Visualize the Data by Age Category

# Subset the data
age_category_gg <- ss_2008_census_wrangled_tbl %>%

  # Filter out the totals
  filter(`Population Category` != "Total",
         `Age Category`        != "Total") %>% 

  # Arrange the data in descending order by population
  arrange(desc(Population)) %>%

  # Initialize the ggplot
  ggplot(aes(x = State %>% as_factor(), y = Population)) +

  # Add the bars
  geom_bar(aes(fill = State), stat = "identity") + 

  # Facet the graph with facet_wrap()
  facet_wrap(~ `Age Category`, scales = "free_y") +

  # Add the title
  labs(
    title = "South Sudan 2008 Population Census Results by Age Category",
    x  = NULL,
    y  = NULL) + 

  # Format the population values
  scale_y_continuous(labels = scales::number) +

  # Adjust the y-axis scale
  expand_limits(y = 0) +

  # Hide the legend; center the title 
  theme(
    plot.title      = element_text(margin = margin(0, 0, 15, 0, unit = "pt")),
    legend.position = "none",
    strip.text.x    = element_text(margin = margin(5, 5, 5, 5, unit = "pt")),
    axis.text.x     = element_text(angle = 45, hjust = 1)
  )

# Display the graph
age_category_gg

plot of chunk unnamed-chunk-14

Summary

Data wrangling is an important aspect of data science, and as such, any aspiring data scientist or data analyst must master the techniques of data wrangling with R (or with Python, for those who use Python) to optimize their productivity. Because, in practice, data scientists and data analysts spend about 80% of their time performing data wrangling and the remainder of their time on data analysis and data modeling. With that being said, in this article, we have thoroughly demonstrated how to perform data wrangling with dplyr; tabulate data with gt; and visualize data with both ggplot2 and plotly.

  • Share:
Alier Reng
Alier Reng

Data science, statistics, quantitative finance, machine learning

Alier is the head of the data science program and president at Jonglei Institute of Technology. He teaches data science, elementary statistics, and introductory personal financial management courses with Python and R. He has over six years of university-level teaching experience and more than four years of healthcare analytics experience. His research interests lie in data visualization, time series analysis, and text analytics.

He holds a Bachelor of Science in Neuroscience from the University of Texas at Dallas (2008), a Master of Science in Professional Science with a concentration in Biostatistics from Middle Tennessee State University (2011), and an MBA in Information Systems Management from LeTourneau (2017). And he's currently pursuing a Doctor of Business Administration (D.B.A) in Finance at Liberty University.

Previous post

‘Lost Boy’ Finds a Better Life in America
August 12, 2020

Leave A Reply Cancel reply

You must be logged in to post a comment.

Search

Categories

  • Data Science
  • General
  • R
  • Statistics

Latest Courses

Getting Started in R

Getting Started in R

Coming soon
Elementary Statistics

Elementary Statistics

Coming soon
Personal Financial Management

Personal Financial Management

Coming soon
jit-the-leader-of-excellence

JIT Leads the Way

We are leading the way in training the next generation of South Sudan data scientists and data analysts, for free.

Copyright © 2020 Jonglei Institute of Technology

Company

  • ABOUT US
  • CONTACT US
  • JIT LEADERSHIP
  • FAMILY FIRST INITIATIVE

Useful Links

  • FAQs
  • SITEMAP
  • FACEBOOK
  • PRIVACY POLICY
  • BECOME AN INSTRUCTOR

JIT Headquarters

P. O. Box (coming soon!), Murfreesboro, TN 37128, USA

Connect With Us

Login with your site account

Lost your password?

We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept”, you consent to the use of ALL the cookies.
Cookie settingsACCEPT
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
Necessary
Always Enabled

Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.

Non-necessary

Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.

SAVE & ACCEPT