Introduction
In this section, we show how to combine NCDS data across sweeps.
As an example, we use data on cohort members’ weight. These are contained in files which have one row per cohort-member. As a reminder, we have organised the data files so that each sweep has its own folder, which is named according to the age of follow-up (e.g., 55y for the ninth major sweep).
We begin by combining data from the Sweeps 4 (23y) and Sweep 8 (50y), showing how to combine these datasets in wide (one row per observational unit) and long (multiple rows per observational unit) formats by merging and appending, respectively. Because variable names change between sweeps in unpredictable ways, it is not straightforwardly possible to combine data from multiple sweeps programmatically (as we are able to do for, e.g., the MCS).
We use the following packages:
# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files
Merging Across Sweeps
The variables dvwt23
and DVWT50
contains the weight of the cohort member at Sweeps 4 (23y) and Sweep 8 (50y), respectively. Note, these are derived variable which convert raw weight measurements into kilograms. The variable names follow the same convention (with the exception that at age 23y, lower case is used). This bucks the more general case where conceptually similar variables have different (potentially, non-descriptive) names, when combining data including early sweeps.
We will use the read_dta()
function from haven
to read in the data from the four sweeps, specifying the col_select
argument to keep only the variables we need (the identifier and weight variables).
df_23y <- read_dta("23y/ncds4.dta",
col_select = c("ncdsid", "dvwt23"))
df_50y <- read_dta("50y/ncds_2008_followup.dta",
col_select = c("NCDSID", "DVWT50"))
We can merge these datasets by row using the *_join()
family of functions. These share a common syntax. They take two data frames (x
and y
) as arguments, as well as a by
argument that specifies the variable(s) to join on. The *_join()
functions are:
full_join()
: Returns all rows fromx
andy
, and all columns fromx
andy
. For rows without matches in bothx
andy
, the missing valueNA
is used for columns that are not used as identifiers.inner_join()
: Returns all rows fromx
andy
where there are matching rows in both data frames.left_join()
: Returns all rows fromx
, and all columns fromx
andy
. Rows inx
with no match iny
will haveNA
values in the new columns fromy
.right_join()
: Returns all rows fromy
, and all columns fromx
andy
. Rows iny
with no match inx
will haveNA
values in the columns ofx
.
In the current context, where x
is data from the Sweeps 4 (df_23y
) and y
is data from the 50y sweep (df_50y
): full_join()
will return a row for each individual present in the Sweep 4 or Sweep 8, with the weight from each sweep in the same row; inner_join()
will return a row for each individual who was present in all these sweeps, with the weight from each sweep in the same row; left_join()
will return a row for each individual in the fourth sweep, with the weight from the eighth sweep in the same row if the individual was present in the eighth sweep; right_join()
will return a row for each individual in the eighth sweep, with the weight from the fourth sweep in the same row if the individual was present in the fourth sweep.
The *_join()
functions can handle multiple variables to join on, and can also handle situations where the identifiers have different names across x
and y
. To specify the identifiers, we pass a vector to the by
argument. In this case, we pass a named vector so that ncdsid
in df_23y
can be matched to NCDSID
in df_50y
.
df_23y %>%
full_join(df_50y, by = c(ncdsid = "NCDSID"))
# A tibble: 14,014 × 3
ncdsid dvwt23 DVWT50
<chr> <dbl+lbl> <dbl>
1 N10001N 59.4 66.7
2 N10002P 73.5 79.4
3 N10004R 76.2 NA
4 N10007U 52.2 72.1
5 N10009W 66.7 78
6 N10011Q 63.5 95
7 N10012R 114. 133.
8 N10013S 83.5 95.2
9 N10014T 57.2 63.5
10 N10015U 73.0 78
# ℹ 14,004 more rows
df_23y %>%
inner_join(df_50y, by = c(ncdsid = "NCDSID"))
# A tibble: 8,313 × 3
ncdsid dvwt23 DVWT50
<chr> <dbl+lbl> <dbl>
1 N10001N 59.4 66.7
2 N10002P 73.5 79.4
3 N10007U 52.2 72.1
4 N10009W 66.7 78
5 N10011Q 63.5 95
6 N10012R 114. 133.
7 N10013S 83.5 95.2
8 N10014T 57.2 63.5
9 N10015U 73.0 78
10 N10016V 63.5 70.8
# ℹ 8,303 more rows
df_23y %>%
left_join(df_50y, by = c(ncdsid = "NCDSID"))
# A tibble: 12,537 × 3
ncdsid dvwt23 DVWT50
<chr> <dbl+lbl> <dbl>
1 N10001N 59.4 66.7
2 N10002P 73.5 79.4
3 N10004R 76.2 NA
4 N10007U 52.2 72.1
5 N10009W 66.7 78
6 N10011Q 63.5 95
7 N10012R 114. 133.
8 N10013S 83.5 95.2
9 N10014T 57.2 63.5
10 N10015U 73.0 78
# ℹ 12,527 more rows
df_23y %>%
right_join(df_50y, by = c(ncdsid = "NCDSID"))
# A tibble: 9,790 × 3
ncdsid dvwt23 DVWT50
<chr> <dbl+lbl> <dbl>
1 N10001N 59.4 66.7
2 N10002P 73.5 79.4
3 N10007U 52.2 72.1
4 N10009W 66.7 78
5 N10011Q 63.5 95
6 N10012R 114. 133.
7 N10013S 83.5 95.2
8 N10014T 57.2 63.5
9 N10015U 73.0 78
10 N10016V 63.5 70.8
# ℹ 9,780 more rows
Note, the *_join()
functions will merge any matching rows. Unlike Stata
, we do not have to explicitly state whether we want a 1-to-1, many-to-1, 1-to-many, or many-to-many merge. This is determined by the data that are inputted to *_join()
.
When the by = ...
isn’t used explicitly, the *_join()
will merge on any variables which have the same names across the two datasets. As df_23y
has variables in lower case and df_50y
has variables in upper case, we could have renamed the variables in df_23y
in one fell swoop with rename_with(str_to_upper)
. There are usually many ways of achieving the same thing.
df_23y %>%
rename_with(str_to_upper) %>% # Converts all variable names to upper case
full_join(df_50y)
Joining with `by = join_by(NCDSID)`
# A tibble: 14,014 × 3
NCDSID DVWT23 DVWT50
<chr> <dbl+lbl> <dbl>
1 N10001N 59.4 66.7
2 N10002P 73.5 79.4
3 N10004R 76.2 NA
4 N10007U 52.2 72.1
5 N10009W 66.7 78
6 N10011Q 63.5 95
7 N10012R 114. 133.
8 N10013S 83.5 95.2
9 N10014T 57.2 63.5
10 N10015U 73.0 78
# ℹ 14,004 more rows
Appending Sweeps
To put the data into long format, we can use the bind_rows()
function. (In this case, the data will have one row per cohort-member x sweep combination.) To work properly, we need to name the variables consistently across sweeps, which here means removing the age-specific suffixes (e.g., the number 23
from dvwt23
in df_3y
). We also need to add a variable to identify the sweep the data comes from. Below, we use the mutate()
function to create a sweep
variable and then use the rename_with()
function to remove the suffixes and rename the variables consistently across sweeps. (Given we only had one variable to rename, we could have done this manually with rename()
, but this approach is more scalable.)
df_23y_nosuffix <- df_23y %>%
rename_with(str_to_upper) %>%
rename_with(~ str_remove(.x, "23$")) %>% # Removes the suffix '23' from variable names
mutate(sweep = 23, .before = 1)
df_50y_nosuffix <- df_50y %>%
rename_with(~ str_remove(.x, "50$")) %>%
mutate(sweep = 50, .before = 1)
rename_with()
applies a function to the names of the variables. In this case, we use the str_remove()
function from the stringr
package (part of the tidyverse
) to remove the suffix from the variable names. The ~
symbol is used to create an anonymous function, which is applied to each variable name. The .x
symbol in the anonymous function is a placeholder for the variable name. str_remove()
takes a regular expression. The $
symbol is used to match the end of the string (so 23$
removes the 23
where it is the last characters in a variable name). Note, for the mutate()
call, the .before
argument is used to specify the position of the new variable in the data frame - here we specify sweep
as the first column. Below we see what the formatted data frames look like:
df_23y_nosuffix
# A tibble: 12,537 × 3
sweep NCDSID DVWT
<dbl> <chr> <dbl+lbl>
1 23 N10001N 59.4
2 23 N10002P 73.5
3 23 N10004R 76.2
4 23 N10007U 52.2
5 23 N10009W 66.7
6 23 N10011Q 63.5
7 23 N10012R 114.
8 23 N10013S 83.5
9 23 N10014T 57.2
10 23 N10015U 73.0
# ℹ 12,527 more rows
df_50y_nosuffix
# A tibble: 9,790 × 3
sweep NCDSID DVWT
<dbl> <chr> <dbl>
1 50 N10001N 66.7
2 50 N10002P 79.4
3 50 N10007U 72.1
4 50 N10008V 69.8
5 50 N10009W 78
6 50 N10011Q 95
7 50 N10012R 133.
8 50 N10013S 95.2
9 50 N10014T 63.5
10 50 N10015U 78
# ℹ 9,780 more rows
Now the data have been prepared, we can use bind_rows()
to append the data frames together. This will stack the data frames on top of each other, so the number of rows is equal to the sum of rows in the individual datasets. The bind_rows()
function can handle data frames with different numbers of columns. Missing columns are filled with NA
values.
bind_rows(df_23y_nosuffix, df_50y_nosuffix) %>%
arrange(NCDSID, sweep) # Sorts the dataset by ID and sweep
# A tibble: 22,327 × 3
sweep NCDSID DVWT
<dbl> <chr> <dbl+lbl>
1 23 N10001N 59.4
2 50 N10001N 66.7
3 23 N10002P 73.5
4 50 N10002P 79.4
5 23 N10004R 76.2
6 23 N10007U 52.2
7 50 N10007U 72.1
8 50 N10008V 69.8
9 23 N10009W 66.7
10 50 N10009W 78
# ℹ 22,317 more rows
Notice that with bind_rows()
a cohort member has only as many rows of data as the times they appeared in Sweeps 4 and 8. This differs from *_join()
where an explicit missing NA
value is generated for the missing sweep. The tidyverse
function complete()
can be used to create missing rows, which can be useful if you need to generate a balanced panel of observations from which to begin analysis with (e.g., when performing multiple imputation in long format).
bind_rows(df_23y_nosuffix, df_50y_nosuffix) %>%
complete(NCDSID, sweep) %>% # Ensure cohort members have a row for each sweep
arrange(NCDSID, sweep)
# A tibble: 28,028 × 3
NCDSID sweep DVWT
<chr> <dbl> <dbl+lbl>
1 N10001N 23 59.4
2 N10001N 50 66.7
3 N10002P 23 73.5
4 N10002P 50 79.4
5 N10004R 23 76.2
6 N10004R 50 NA
7 N10007U 23 52.2
8 N10007U 50 72.1
9 N10008V 23 NA
10 N10008V 50 69.8
# ℹ 28,018 more rows