7 Reshaping
It’s fairly common for datasets from public sources to come in formats that need to be reshaped. The World Development Indicators (WDI) is one such dataset that requires reshaping before we can analyse it. Let’s go over the steps to see how we can reshape the WDI dataset.
Let’s start by loading the tidyverse
package first.
library(tidyverse)
Clear everything to make sure there’s nothing leftover in our environment
rm(list = ls())
We’re using a small sample of the WDI dataset here to simplify the tasks. Let’s load the dataset and see what it looks like.
wdi <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/wdi.csv", na = "..")
wdi
# A tibble: 5 x 7
`¬Series.Name` Series.Code Country.Name Country.Code X1995.YR1995
<chr> <chr> <chr> <chr> <dbl>
1 Maternal mortality SH.STA.MMRT France FRA 15.000000
2 Maternal mortality SH.STA.MMRT Spain ESP 6.000000
3 Maternal mortality SH.STA.MMRT NA
4 Health expenditure SH.XPD.TOTL.ZS France FRA 10.355906
5 Health expenditure SH.XPD.TOTL.ZS Spain ESP 7.444592
# ... with 2 more variables: X2000.YR2000 <dbl>, X2005.YR2005 <dbl>
But ideally, we’d like our data to look something like this:
# A tibble: 6 x 5
CountryCode CountryName Year MaternalMortality HealthExpenditure
<chr> <chr> <dbl> <dbl> <dbl>
1 ESP Spain 1995 6 7.444592
2 ESP Spain 2000 5 7.214756
3 ESP Spain 2005 5 8.288271
4 FRA France 1995 15 10.355906
5 FRA France 2000 12 10.084833
6 FRA France 2005 10 10.932626
We can see that some country names and codes are blank, so let’s get rid of them first
wdi %>%
filter(Country.Code != "")
# A tibble: 4 x 7
`¬Series.Name` Series.Code Country.Name Country.Code X1995.YR1995
<chr> <chr> <chr> <chr> <dbl>
1 Maternal mortality SH.STA.MMRT France FRA 15.000000
2 Maternal mortality SH.STA.MMRT Spain ESP 6.000000
3 Health expenditure SH.XPD.TOTL.ZS France FRA 10.355906
4 Health expenditure SH.XPD.TOTL.ZS Spain ESP 7.444592
# ... with 2 more variables: X2000.YR2000 <dbl>, X2005.YR2005 <dbl>
So far so good. Note that we’re not making any changes yet so we can just add one function at a time to the pipeline and check the results. Once we’re satisfied with the results we save them to a variable.
We need to gather all columns that start with “X” that contain per-year values for each series (for example X1960..YR1960)
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X"))
# A tibble: 12 x 6
`¬Series.Name` Series.Code Country.Name Country.Code
<chr> <chr> <chr> <chr>
1 Maternal mortality SH.STA.MMRT France FRA
2 Maternal mortality SH.STA.MMRT Spain ESP
3 Health expenditure SH.XPD.TOTL.ZS France FRA
4 Health expenditure SH.XPD.TOTL.ZS Spain ESP
5 Maternal mortality SH.STA.MMRT France FRA
6 Maternal mortality SH.STA.MMRT Spain ESP
7 Health expenditure SH.XPD.TOTL.ZS France FRA
8 Health expenditure SH.XPD.TOTL.ZS Spain ESP
9 Maternal mortality SH.STA.MMRT France FRA
10 Maternal mortality SH.STA.MMRT Spain ESP
11 Health expenditure SH.XPD.TOTL.ZS France FRA
12 Health expenditure SH.XPD.TOTL.ZS Spain ESP
# ... with 2 more variables: Year <chr>, Value <dbl>
Now all values are in the Value
column, so we need to spread them out to individual columns based on the Series.Code
. We have to make sure that we only keep the columns that make the country-year observations unique. We use select()
to keep Country.Code
, Country.Name
, Year
, plus the two columns (Series.Code
and Value
) that will make up the key-value pair for the spread()
function.
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value)
# A tibble: 6 x 5
Country.Code Country.Name Year SH.STA.MMRT SH.XPD.TOTL.ZS
* <chr> <chr> <chr> <dbl> <dbl>
1 ESP Spain X1995.YR1995 6 7.444592
2 ESP Spain X2000.YR2000 5 7.214756
3 ESP Spain X2005.YR2005 5 8.288271
4 FRA France X1995.YR1995 15 10.355906
5 FRA France X2000.YR2000 12 10.084833
6 FRA France X2005.YR2005 10 10.932626
It looks good, so we can rename the variables to something meaningful.
wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value) %>%
rename(CountryName = Country.Name,
CountryCode = Country.Code,
MaternalMortality = SH.STA.MMRT,
HealthExpenditure = SH.XPD.TOTL.ZS)
# A tibble: 6 x 5
CountryCode CountryName Year MaternalMortality HealthExpenditure
* <chr> <chr> <chr> <dbl> <dbl>
1 ESP Spain X1995.YR1995 6 7.444592
2 ESP Spain X2000.YR2000 5 7.214756
3 ESP Spain X2005.YR2005 5 8.288271
4 FRA France X1995.YR1995 15 10.355906
5 FRA France X2000.YR2000 12 10.084833
6 FRA France X2005.YR2005 10 10.932626
Now we just need to extract the 4-digit year from the Year
column. The Year
column is formatted as X1995.YR1995
which means that the 4-digits for the year are in position 2
,3
,4
, and 5
. We can use the substring()
function to take all the characters from position 2
to 5
and assign it back to the Year
column.
Since this is the last step we might as well assign the results to a new variable.
wdi_long <- wdi %>%
filter(Country.Code != "") %>%
gather(Year, Value, starts_with("X")) %>%
select(Country.Code, Country.Name, Year, Series.Code, Value) %>%
spread(Series.Code, Value) %>%
rename(CountryName = Country.Name,
CountryCode = Country.Code,
MaternalMortality = SH.STA.MMRT,
HealthExpenditure = SH.XPD.TOTL.ZS) %>%
mutate(Year = as.numeric(substring(Year, 2, 5)))
wdi_long
# A tibble: 6 x 5
CountryCode CountryName Year MaternalMortality HealthExpenditure
<chr> <chr> <dbl> <dbl> <dbl>
1 ESP Spain 1995 6 7.444592
2 ESP Spain 2000 5 7.214756
3 ESP Spain 2005 5 8.288271
4 FRA France 1995 15 10.355906
5 FRA France 2000 12 10.084833
6 FRA France 2005 10 10.932626
You can assign it back to wdi
if you want, but we’re using a different name in case we make a mistake and have to start again. This way we would’ve have to reload the file all over again.