Introduction
In this section, we show how to reshape data from long to wide (and vice versa). To demonstrate, we use data from Sweeps 8 (51y) and 11 (51y) on cohort member’s height and weight collected.
The packages we use are:
# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files
Reshaping Raw Data from Wide to Long
We begin by loading the data from each sweep and merging these together into a single wide format data frame; see Combining Data Across Sweeps for further explanation on how this is achieved. Note, the names of the height and weight variables in Sweep 8 and Sweep 11 follow a similar convention, which is the exception rather than the rule in BCS70 data. Below, we convert the variable names in the Sweep 8 data frame to lower case so that they closely match those in the Sweep 11 data frame. This will make reshaping easier.
df_42y <- read_dta("42y/bcs70_2012_derived.dta",
col_select = c("BCSID", "BD9HGHTM", "BD9WGHTK")) %>%
rename_with(str_to_lower)
df_51y <- read_dta("51y/bcs11_age51_main.dta",
col_select = c("bcsid", "bd11hghtm", "bd11wghtk"))
df_wide <- df_42y %>%
full_join(df_51y, by = "bcsid")
df_wide
has 5 columns. Besides, the identifier, bcsid
, there are 4 columns for height and weight measurements at each sweep. Each of these 4 columns is prefix by three characters indicating the sweep at assessment. We can reshape the dataset into long format (one row per person x sweep combination) using the pivot_longer()
function so that the resulting data frame has four columns: one person identifier, a variable for age of assessment (fup
), and variables for height and weight. We specify the columns to be reshaped using the cols
argument, provide the new variable names in the names_to
argument, and the pattern the existing column names take using the names_pattern
argument. For names_pattern
we specify "^bd(\\d{1,2})([A-Za-z].+)$"
, which breaks the column name into two pieces: one or two digits indicating sweep (and after bd
; (\\d{1,2})
) and subsequent characters at the end of the name ("([A-Za-z].+)$"
). names_pattern
uses regular expressions. .
matches single characters, and .+
modifies this to make one or more characters. \\d
is a special character denoting a digit. [A-Za-z]
indicates any alphabetic character, upper or lower case. As noted, the digits hold information on sweep of assessment; in the reshaped data frame the character is stored as a value in a new column sweep
. .value
is a placeholder for the new columns in the reshaped data frame that store the values from the columns selected by cols
; these new columns are named using the first piece from names_pattern
- in this case hghtm
(height) and wghtk
(weight).
df_long <- df_wide %>%
pivot_longer(cols = matches("^bd"),
names_to = c("sweep", ".value"),
names_pattern = "^bd(\\d{1,2})([A-Za-z].+)$")
df_long
# A tibble: 21,366 × 4
bcsid sweep hghtm wghtk
<chr> <chr> <dbl+lbl> <dbl+lbl>
1 B10001N 9 1.55 55.8
2 B10001N 11 1.55 50.8
3 B10003Q 9 1.85 82.6
4 B10003Q 11 1.85 83.5
5 B10004R 9 1.60 57.2
6 B10004R 11 1.6 57.2
7 B10007U 9 1.52 82.6
8 B10007U 11 NA NA
9 B10009W 9 1.63 54.9
10 B10009W 11 1.63 60.3
# ℹ 21,356 more rows
Reshaping Raw Data from Long to Wide
We can also reshape the data from long to wide format using the pivot_wider()
function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns to be reshaped using the values_from
argument, provide the old column names in the names_from
argument, and use the names_glue
argument to specify the convention to follow for the new column names. The names_glue
argument uses curly braces ({}
) to reference the values from the names_from
and .value
arguments. As we are specifying multiple columns in values_from
, .value
is a placeholder for the names of the variables selected in values_from
.
df_long %>%
pivot_wider(names_from = sweep,
values_from = c(hghtm, wghtk),
names_glue = "{.value}_{sweep}")
# A tibble: 10,683 × 5
bcsid hghtm_9 hghtm_11 wghtk_9 wghtk_11
<chr> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 B10001N 1.55 1.55 55.8 50.8
2 B10003Q 1.85 1.85 82.6 83.5
3 B10004R 1.60 1.6 57.2 57.2
4 B10007U 1.52 NA 82.6 NA
5 B10009W 1.63 1.63 54.9 60.3
6 B10010P 1.65 NA -8 [No information] NA
7 B10011Q 1.63 1.65 76.2 82.6
8 B10013S 1.63 1.63 63.5 66.7
9 B10015U 1.83 1.8 77.6 82.6
10 B10016V 1.88 1.88 114. 118
# ℹ 10,673 more rows
Note, in the original df_wide
tibble, the height and weight variables were labelled numeric vectors - this class allows users to add metadata to variables (value labels, etc.). When reshaping to long format, multiple variables are effectively appended together, but the final reshape variables can only have one set of properties. pivot_longer()
tries to preserve variables attributes, but in some cases will throw an error (where variables are of inconsistent types) or print a warning (where value labels are inconsistent).