7 Reshaping

It’s fairly common for datasets from public sources to come in formats that need to be reshaped. The World Development Indicators (WDI) is one such dataset that requires reshaping before we can analyse it. Let’s go over the steps to see how we can reshape the WDI dataset.

Let’s start by loading the tidyverse package first.

library(tidyverse)

Clear everything to make sure there’s nothing leftover in our environment

rm(list = ls())

We’re using a small sample of the WDI dataset here to simplify the tasks. Let’s load the dataset and see what it looks like.

wdi <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/wdi.csv", na = "..")

wdi
# A tibble: 5 x 7
      `¬Series.Name`    Series.Code Country.Name Country.Code X1995.YR1995
               <chr>          <chr>        <chr>        <chr>        <dbl>
1 Maternal mortality    SH.STA.MMRT       France          FRA    15.000000
2 Maternal mortality    SH.STA.MMRT        Spain          ESP     6.000000
3 Maternal mortality    SH.STA.MMRT                                     NA
4 Health expenditure SH.XPD.TOTL.ZS       France          FRA    10.355906
5 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP     7.444592
# ... with 2 more variables: X2000.YR2000 <dbl>, X2005.YR2005 <dbl>

But ideally, we’d like our data to look something like this:

# A tibble: 6 x 5
  CountryCode CountryName  Year MaternalMortality HealthExpenditure
        <chr>       <chr> <dbl>             <dbl>             <dbl>
1         ESP       Spain  1995                 6          7.444592
2         ESP       Spain  2000                 5          7.214756
3         ESP       Spain  2005                 5          8.288271
4         FRA      France  1995                15         10.355906
5         FRA      France  2000                12         10.084833
6         FRA      France  2005                10         10.932626

We can see that some country names and codes are blank, so let’s get rid of them first

wdi %>%
  filter(Country.Code != "") 
# A tibble: 4 x 7
      `¬Series.Name`    Series.Code Country.Name Country.Code X1995.YR1995
               <chr>          <chr>        <chr>        <chr>        <dbl>
1 Maternal mortality    SH.STA.MMRT       France          FRA    15.000000
2 Maternal mortality    SH.STA.MMRT        Spain          ESP     6.000000
3 Health expenditure SH.XPD.TOTL.ZS       France          FRA    10.355906
4 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP     7.444592
# ... with 2 more variables: X2000.YR2000 <dbl>, X2005.YR2005 <dbl>

So far so good. Note that we’re not making any changes yet so we can just add one function at a time to the pipeline and check the results. Once we’re satisfied with the results we save them to a variable.

We need to gather all columns that start with “X” that contain per-year values for each series (for example X1960..YR1960)

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X"))
# A tibble: 12 x 6
       `¬Series.Name`    Series.Code Country.Name Country.Code
                <chr>          <chr>        <chr>        <chr>
 1 Maternal mortality    SH.STA.MMRT       France          FRA
 2 Maternal mortality    SH.STA.MMRT        Spain          ESP
 3 Health expenditure SH.XPD.TOTL.ZS       France          FRA
 4 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
 5 Maternal mortality    SH.STA.MMRT       France          FRA
 6 Maternal mortality    SH.STA.MMRT        Spain          ESP
 7 Health expenditure SH.XPD.TOTL.ZS       France          FRA
 8 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
 9 Maternal mortality    SH.STA.MMRT       France          FRA
10 Maternal mortality    SH.STA.MMRT        Spain          ESP
11 Health expenditure SH.XPD.TOTL.ZS       France          FRA
12 Health expenditure SH.XPD.TOTL.ZS        Spain          ESP
# ... with 2 more variables: Year <chr>, Value <dbl>

Now all values are in the Value column, so we need to spread them out to individual columns based on the Series.Code. We have to make sure that we only keep the columns that make the country-year observations unique. We use select() to keep Country.Code, Country.Name, Year, plus the two columns (Series.Code and Value) that will make up the key-value pair for the spread() function.

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value) 
# A tibble: 6 x 5
  Country.Code Country.Name         Year SH.STA.MMRT SH.XPD.TOTL.ZS
*        <chr>        <chr>        <chr>       <dbl>          <dbl>
1          ESP        Spain X1995.YR1995           6       7.444592
2          ESP        Spain X2000.YR2000           5       7.214756
3          ESP        Spain X2005.YR2005           5       8.288271
4          FRA       France X1995.YR1995          15      10.355906
5          FRA       France X2000.YR2000          12      10.084833
6          FRA       France X2005.YR2005          10      10.932626

It looks good, so we can rename the variables to something meaningful.

wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value) %>% 
  rename(CountryName = Country.Name,
         CountryCode = Country.Code,
         MaternalMortality = SH.STA.MMRT,
         HealthExpenditure = SH.XPD.TOTL.ZS)
# A tibble: 6 x 5
  CountryCode CountryName         Year MaternalMortality HealthExpenditure
*       <chr>       <chr>        <chr>             <dbl>             <dbl>
1         ESP       Spain X1995.YR1995                 6          7.444592
2         ESP       Spain X2000.YR2000                 5          7.214756
3         ESP       Spain X2005.YR2005                 5          8.288271
4         FRA      France X1995.YR1995                15         10.355906
5         FRA      France X2000.YR2000                12         10.084833
6         FRA      France X2005.YR2005                10         10.932626

Now we just need to extract the 4-digit year from the Year column. The Year column is formatted as X1995.YR1995 which means that the 4-digits for the year are in position 2,3,4, and 5. We can use the substring() function to take all the characters from position 2 to 5 and assign it back to the Year column.

Since this is the last step we might as well assign the results to a new variable.

wdi_long <- wdi %>%
  filter(Country.Code != "") %>% 
  gather(Year, Value, starts_with("X")) %>% 
  select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
  spread(Series.Code, Value) %>% 
  rename(CountryName = Country.Name,
         CountryCode = Country.Code,
         MaternalMortality = SH.STA.MMRT,
         HealthExpenditure = SH.XPD.TOTL.ZS) %>%
  mutate(Year = as.numeric(substring(Year, 2, 5)))

wdi_long 
# A tibble: 6 x 5
  CountryCode CountryName  Year MaternalMortality HealthExpenditure
        <chr>       <chr> <dbl>             <dbl>             <dbl>
1         ESP       Spain  1995                 6          7.444592
2         ESP       Spain  2000                 5          7.214756
3         ESP       Spain  2005                 5          8.288271
4         FRA      France  1995                15         10.355906
5         FRA      France  2000                12         10.084833
6         FRA      France  2005                10         10.932626

You can assign it back to wdi if you want, but we’re using a different name in case we make a mistake and have to start again. This way we would’ve have to reload the file all over again.