Introduction

In this section, we show how to reshape data from long to wide (and vice versa). To demonstrate, we use data from Sweeps 4 (23y) and 8 (50y) on cohort member’s height and weight collected.

The packages we use are:

# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files

Reshaping Raw Data from Wide to Long

We begin by loading the data from each sweep and merging these together into a single wide format data frame; see Combining Data Across Sweeps for further explanation on how this is achieved. Note, the names of the height and weight variables in Sweep 4 and Sweep 8 follow a similar convention, which is the exception rather than the rule in NCDS data. Below, we convert the variable names in the Sweep 4 data frame to upper case so that they closely match those in the Sweep 8 data frame. This will make reshaping easier.

df_23y <- read_dta("23y/ncds4.dta",
                      col_select = c("ncdsid", "dvwt23", "dvht23")) %>%
rename_with(str_to_upper)

df_50y <- read_dta("50y/ncds_2008_followup.dta",
                   col_select = c("NCDSID", "DVWT50", "DVHT50"))

df_wide <- df_23y %>%
  full_join(df_50y, by = "NCDSID")

df_wide has 5 columns. Besides, the identifier, NCDSID, there are 4 columns for height and weight measurements at each sweep. Each of these 4 columns is suffixed by two numbers indicating the age at assessment. We can reshape the dataset into long format (one row per person x sweep combination) using the pivot_longer() function so that the resulting data frame has four columns: one person identifier, a variable for age of assessment (fup), and variables for height and weight. We specify the columns to be reshaped using the cols argument, provide the new variable names in the names_to argument, and the pattern the existing column names take using the names_pattern argument. For names_pattern we specify "^(.*)(\\d\\d)$", which breaks the column name into two pieces: the first characters ("(.*)") and two digits at the end of the name ("(\\d\\d)$"). names_pattern uses regular expressions. . matches single characters, and .* modifies this to make zero or more characters. \\d is a special character denoting a digit. As noted, the final two digits character hold information on age of assessment; in the reshaped data frame the character is stored as a value in a new column fup. .value is a placeholder for the new columns in the reshaped data frame that store the values from the columns selected by cols; these new columns are named using the first piece from names_pattern - in this case DVHT (height) and DVWT (weight).

df_long <- df_wide %>%
  pivot_longer(cols = matches("DV(HT|WT)\\d\\d"),
               names_to = c(".value", "fup"),
               names_pattern = "^(.*)(\\d\\d)$")

df_long

# A tibble: 28,028 × 4
   NCDSID  fup   DVHT      DVWT     
   <chr>   <chr> <dbl+lbl> <dbl+lbl>
N10001N 23     1.63     59.4     
N10001N 50    NA        66.7     
N10002P 23     1.90     73.5     
N10002P 50    NA        79.4     
N10004R 23     1.65     76.2     
N10004R 50    NA        NA       
N10007U 23     1.63     52.2     
N10007U 50    NA        72.1     
N10009W 23     1.73     66.7     
N10009W 50     1.7      78       
# ℹ 28,018 more rows

Reshaping Raw Data from Long to Wide

We can also reshape the data from long to wide format using the pivot_wider() function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns to be reshaped using the values_from argument, provide the old column names in the names_from argument, and use the names_glue argument to specify the convention to follow for the new column names. The names_glue argument uses curly braces ({}) to reference the values from the names_from and .value arguments. As we are specifying multiple columns in values_from, .value is a placeholder for the names of the variables selected in values_from.

df_long %>%
  pivot_wider(names_from = fup,
              values_from = matches("DV(HT|WT)"),
              names_glue = "{.value}{fup}")

# A tibble: 14,014 × 5
   NCDSID  DVHT23    DVHT50    DVWT23    DVWT50   
   <chr>   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
N10001N 1.63      NA         59.4      66.7    
N10002P 1.90      NA         73.5      79.4    
N10004R 1.65      NA         76.2      NA      
N10007U 1.63      NA         52.2      72.1    
N10009W 1.73       1.7       66.7      78      
N10011Q 1.68       1.7       63.5      95      
N10012R 1.96      NA        114.      133.     
N10013S 1.78      NA         83.5      95.2    
N10014T 1.55      NA         57.2      63.5    
N10015U 1.80      NA         73.0      78      
# ℹ 14,004 more rows

Note, in the original df_wide tibble, DVHT23 and DVWT23 were labelled numeric vectors - this class allows users to add metadata to variables (value labels, etc.). DVHT50 and DVWT50, on the other hand, were standard numeric vectors. When reshaping to long format, multiple variables are effectively appended together. The final reshape variables can only have one set of properties. pivot_longer() merges variables together to preserve variables attributes, but in some cases will throw an error (where variables are of inconsistent types) or print a warning (where value labels are inconsistent). Note above, where we reshape df_long back to wide format, all weight and height variables now have labelled numeric type.