Introduction

In this section, we show how to combine MCS data across sweeps, assuming the data to be merged are in a consistent format (e.g., one row per family); for information on munging data to have a consistent structure see the page Combining Data Within a Sweep

As an example, we use data on cohort members’ height, which was recorded in Sweeps 2-7 and is available in the mcs[2-7]_cm_interview.dta files. These files contain one row per cohort-member. As a reminder, we have organised the data files so that each sweep has its own folder, which is named according to the age of follow-up (e.g., 3y for the second sweep).

We begin by combining data from the second and third sweeps, showing how to combine these datasets in wide (one row per observational unit) and long (multiple rows per observational unit) formats by merging and appending, respectively. We then show how to combine data from multiple sweeps programmatically using the dplyr and purrr packages (from the tidyverse).

We use the following packages:

# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files

Merging Across Sweeps

The variable [B-G]CHTCM00 contains the height of the cohort member at Sweep 2-7, except for Sweep 5, where the variable is called ECHTCMA0. The cohort-member identifiers are stored across two variables in the mcs[2-7]_cm_interview.dta files: MCSID and [A-G]CNUM00. MCSID is the family identifier and [A-G]CNUM00 identifies the cohort member within the family. We will use the read_dta() function from haven to read in the data from the second and third sweeps, specifying the col_select argument to keep only the variables we need (the two identifier variables and height).

df_3y <- read_dta("3y/mcs2_cm_interview.dta",
                  col_select = c("MCSID", "BCNUM00", "BCHTCM00"))

df_5y <- read_dta("5y/mcs3_cm_interview.dta",
                  col_select = c("MCSID", "CCNUM00", "CCHTCM00"))

We can merge these datasets by row using the *_join() family of functions. These share a common syntax. They take two data frames (x and y) as arguments, as well as a by argument that specifies the variable(s) to join on. The *_join() functions are:

full_join(): Returns all rows from x and y, and all columns from x and y. For rows without matches in both x and y, the missing value NA is used for columns that are not used as identifiers.
inner_join(): Returns all rows from x and y where there are matching rows in both data frames.
left_join(): Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns from y.
right_join(): Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the columns of x.

In the current context, where x is data from the second sweep (df_3y) and y is data from the third sweep (df_5y): full_join() will return a row for each individual present in the second or third sweeps, with the height from each sweep in the same row; inner_join() will return a row for each individual who was present in both sweeps, with the height from each sweep in the same row; left_join() will return a row for each individual in the second sweep, with the height from the third sweep in the same row if the individual was present in the third sweep; right_join() will return a row for each individual in the third sweep, with the height from the second sweep in the same row if the individual was present in the second sweep.

The *_join() functions can handle multiple variables to join on, and can also handle situations where the identifiers have different names across x and y. To specify the identifiers, we pass a vector to the by argument. In this case, we pass a named vector so that BCNUM00 in df_3y can be matched to CCNUM00 in df_5y.

df_3y %>%
  full_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00"))

# A tibble: 17,242 × 4
   MCSID   BCNUM00                             BCHTCM00                 CCHTCM00
   <chr>   <dbl+lbl>                           <dbl+lbl>                <dbl+lb>
M10001N 1 [1st Cohort Member of the family]  97                      114.    
M10002P 1 [1st Cohort Member of the family]  96                      110.    
M10007U 1 [1st Cohort Member of the family] 102                      118     
M10008V 1 [1st Cohort Member of the family]  -2 [No Measurement tak…  NA     
M10008V 2 [2nd Cohort Member of the family]  -2 [No Measurement tak…  NA     
M10011Q 1 [1st Cohort Member of the family] 106                      121     
M10014T 1 [1st Cohort Member of the family]  97                       NA     
M10015U 1 [1st Cohort Member of the family]  94                      110.    
M10016V 1 [1st Cohort Member of the family] 102                      118.    
M10017W 1 [1st Cohort Member of the family]  99                      110.    
# ℹ 17,232 more rows

df_3y %>%
  inner_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00"))

# A tibble: 13,967 × 4
   MCSID   BCNUM00                             BCHTCM00  CCHTCM00 
   <chr>   <dbl+lbl>                           <dbl+lbl> <dbl+lbl>
M10001N 1 [1st Cohort Member of the family]  97       114.     
M10002P 1 [1st Cohort Member of the family]  96       110.     
M10007U 1 [1st Cohort Member of the family] 102       118      
M10011Q 1 [1st Cohort Member of the family] 106       121      
M10015U 1 [1st Cohort Member of the family]  94       110.     
M10016V 1 [1st Cohort Member of the family] 102       118.     
M10017W 1 [1st Cohort Member of the family]  99       110.     
M10018X 1 [1st Cohort Member of the family]  97       113.     
M10020R 1 [1st Cohort Member of the family]  97       112.     
M10021S 1 [1st Cohort Member of the family]  90       108      
# ℹ 13,957 more rows

df_3y %>%
  left_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00"))

# A tibble: 15,778 × 4
   MCSID   BCNUM00                             BCHTCM00                 CCHTCM00
   <chr>   <dbl+lbl>                           <dbl+lbl>                <dbl+lb>
M10001N 1 [1st Cohort Member of the family]  97                      114.    
M10002P 1 [1st Cohort Member of the family]  96                      110.    
M10007U 1 [1st Cohort Member of the family] 102                      118     
M10008V 1 [1st Cohort Member of the family]  -2 [No Measurement tak…  NA     
M10008V 2 [2nd Cohort Member of the family]  -2 [No Measurement tak…  NA     
M10011Q 1 [1st Cohort Member of the family] 106                      121     
M10014T 1 [1st Cohort Member of the family]  97                       NA     
M10015U 1 [1st Cohort Member of the family]  94                      110.    
M10016V 1 [1st Cohort Member of the family] 102                      118.    
M10017W 1 [1st Cohort Member of the family]  99                      110.    
# ℹ 15,768 more rows

df_3y %>%
  right_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00"))

# A tibble: 15,431 × 4
   MCSID   BCNUM00                             BCHTCM00  CCHTCM00 
   <chr>   <dbl+lbl>                           <dbl+lbl> <dbl+lbl>
M10001N 1 [1st Cohort Member of the family]  97       114.     
M10002P 1 [1st Cohort Member of the family]  96       110.     
M10007U 1 [1st Cohort Member of the family] 102       118      
M10011Q 1 [1st Cohort Member of the family] 106       121      
M10015U 1 [1st Cohort Member of the family]  94       110.     
M10016V 1 [1st Cohort Member of the family] 102       118.     
M10017W 1 [1st Cohort Member of the family]  99       110.     
M10018X 1 [1st Cohort Member of the family]  97       113.     
M10020R 1 [1st Cohort Member of the family]  97       112.     
M10021S 1 [1st Cohort Member of the family]  90       108      
# ℹ 15,421 more rows

Note, the *_join() functions will merge any matching rows. Unlike Stata, we do not have to explicitly state whether we want a 1-to-1, many-to-1, 1-to-many, or many-to-many merge. This is determined by the data that are inputted to *_join(). R will not throw an error if matches are not 1-to-1, so care must be taken, for instance, when merging the different data structures. See Combining Data Within a Sweep for more information.

Appending Sweeps

To put the data into long format, we can use the bind_rows() function. (In this case, the data will have one row per cohort-member x sweep combination.) To work properly, we need to name the variables consistently across sweeps, which here means removing the sweep-specific prefixes (e.g., the letter B from BCNUM00 and BCHTCM00 in df_3y). We also need to add a variable to identify the sweep the data comes from. Below, we use the mutate() function to create a sweep variable and then use the rename_with() function to remove the prefixes and rename the variables consistently across sweeps.

df_3y_noprefix <- df_3y %>%
  mutate(sweep = 2, .before = 1) %>%
  rename_with(~ str_remove(.x, "^B"))

df_5y_noprefix <- df_5y %>%
  mutate(sweep = 3, .before = 1) %>%
  rename_with(~ str_remove(.x, "^C"))

rename_with() applies a function to the names of the variables. In this case, we use the str_remove() function from the stringr package (part of the tidyverse) to remove the prefix from the variable names. The ~ symbol is used to create an anonymous function, which is applied to each variable name. The .x symbol in the anonymous function is a placeholder for the variable name. str_remove() takes a regular expression. The ^ symbol is used to match the start of the string (so ^C removes the C where it is the first character in a variable name - necessary to avoid removing the C within, e.g., MCSID). Note, for the mutate() call, the .before argument is used to specify the position of the new variable in the data frame - here we specify sweep as the first column. Below we see what the formatted data frames look like:

df_3y_noprefix

# A tibble: 15,778 × 4
   sweep MCSID   CNUM00                              CHTCM00                   
   <dbl> <chr>   <dbl+lbl>                           <dbl+lbl>                 
   2 M10001N 1 [1st Cohort Member of the family]  97                       
   2 M10002P 1 [1st Cohort Member of the family]  96                       
   2 M10007U 1 [1st Cohort Member of the family] 102                       
   2 M10008V 1 [1st Cohort Member of the family]  -2 [No Measurement taken]
   2 M10008V 2 [2nd Cohort Member of the family]  -2 [No Measurement taken]
   2 M10011Q 1 [1st Cohort Member of the family] 106                       
   2 M10014T 1 [1st Cohort Member of the family]  97                       
   2 M10015U 1 [1st Cohort Member of the family]  94                       
   2 M10016V 1 [1st Cohort Member of the family] 102                       
   2 M10017W 1 [1st Cohort Member of the family]  99                       
# ℹ 15,768 more rows

df_5y_noprefix

# A tibble: 15,431 × 4
   sweep MCSID   CNUM00                              CHTCM00  
   <dbl> <chr>   <dbl+lbl>                           <dbl+lbl>
   3 M10001N 1 [1st Cohort Member of the family] 114.     
   3 M10002P 1 [1st Cohort Member of the family] 110.     
   3 M10007U 1 [1st Cohort Member of the family] 118      
   3 M10011Q 1 [1st Cohort Member of the family] 121      
   3 M10015U 1 [1st Cohort Member of the family] 110.     
   3 M10016V 1 [1st Cohort Member of the family] 118.     
   3 M10017W 1 [1st Cohort Member of the family] 110.     
   3 M10018X 1 [1st Cohort Member of the family] 113.     
   3 M10020R 1 [1st Cohort Member of the family] 112.     
   3 M10021S 1 [1st Cohort Member of the family] 108      
# ℹ 15,421 more rows

Now the data have been prepared, we can use bind_rows() to append the data frames together. This will stack the data frames on top of each other, so the number of rows is equal to the sum of rows in the individual datasets. The bind_rows() function can handle data frames with different numbers of columns. Missing columns are filled with NA values.

bind_rows(df_3y_noprefix, df_5y_noprefix) %>%
  arrange(MCSID, CNUM00, sweep) # Sorts the dataset by ID and sweep

Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -1

# A tibble: 31,209 × 4
   sweep MCSID   CNUM00                              CHTCM00                   
   <dbl> <chr>   <dbl+lbl>                           <dbl+lbl>                 
   2 M10001N 1 [1st Cohort Member of the family]  97                       
   3 M10001N 1 [1st Cohort Member of the family] 114.                      
   2 M10002P 1 [1st Cohort Member of the family]  96                       
   3 M10002P 1 [1st Cohort Member of the family] 110.                      
   2 M10007U 1 [1st Cohort Member of the family] 102                       
   3 M10007U 1 [1st Cohort Member of the family] 118                       
   2 M10008V 1 [1st Cohort Member of the family]  -2 [No Measurement taken]
   2 M10008V 2 [2nd Cohort Member of the family]  -2 [No Measurement taken]
   2 M10011Q 1 [1st Cohort Member of the family] 106                       
   3 M10011Q 1 [1st Cohort Member of the family] 121                       
# ℹ 31,199 more rows

Notice that with bind_rows() a cohort member has only as many rows of data as the times they appeared in Sweeps 2 and 3. This differs from *_join() where an explicit missing NA value is generated for the missing sweep. The tidyverse function complete() can be used to create missing rows, which can be useful if you need to generate a balanced panel of observations from which to begin analysis with (e.g., when performing multiple imputation in long format).

bind_rows(df_3y_noprefix, df_5y_noprefix) %>%
    complete(sweep, MCSID, CNUM00) %>% # Ensure cohort members have a row for each sweep
    arrange(MCSID, CNUM00, sweep) 

Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -1

# A tibble: 68,092 × 4
   sweep MCSID   CNUM00                              CHTCM00  
   <dbl> <chr>   <dbl+lbl>                           <dbl+lbl>
   2 M10001N 1 [1st Cohort Member of the family]  97      
   3 M10001N 1 [1st Cohort Member of the family] 114.     
   2 M10001N 2 [2nd Cohort Member of the family]  NA      
   3 M10001N 2 [2nd Cohort Member of the family]  NA      
   2 M10002P 1 [1st Cohort Member of the family]  96      
   3 M10002P 1 [1st Cohort Member of the family] 110.     
   2 M10002P 2 [2nd Cohort Member of the family]  NA      
   3 M10002P 2 [2nd Cohort Member of the family]  NA      
   2 M10007U 1 [1st Cohort Member of the family] 102      
   3 M10007U 1 [1st Cohort Member of the family] 118      
# ℹ 68,082 more rows

Combing Sweeps Programatically

Combining sweeps manually can become tedious when more than two sweeps need to be combined Instead, iterative programming can be used automate the process. Below we show how to merge and append multiple sweeps together with very little code using the purrr package (part of the tidyverse).

Merging Programmatically

Before merging the datasets together, we need to load the data for each sweep. We can do this by creating a function, load_height_wide(), which takes a single argument sweep and loads the height data for that sweep. The function uses the glue() function from the glue package to create the file path. We create and subset a vector of follow-up ages (fups) to identify the correct folder to obtain the mcs{sweep}_cm_interview.dta file from. The glue() function is used to create strings from R objects. The curly braces ({}) act as placeholders for variables or function calls that are computed when the string is evaluated - e.g., when sweep = 1, {fup}y/mcs{sweep}_cm_interview.dta = 0y/mcs1_cm_interview.dta. (fup is determined by subsetting the relevant element in the vectors fups.) glue is part of the tidyverse, but is not a core package, so needs to be loaded explicitly.

The file path is piped to the read_dta() function to read in the data, with the col_select argument used to keep only the variables we need. Note we use a regular expression to select the CNUM00 and height variables as these have slightly different names each sweep. Typically variable names only differ on the sweep prefix used (ACHTM00, BCHTM00), but in Sweep 5 (age 11y), the name of the height variable (ECHTCMA00) diverged slightly from this pattern. Below, we also include a step to rename() the [B-G]CNUM00 variable to CNUM00 to ensure consistency across sweeps as this will make merging more straightforward later.

library(glue)
fups <- c(0, 3, 5, 7, 11, 14, 17)

load_height_wide <- function(sweep){
  fup <- fups[sweep]
  prefix <- LETTERS[sweep]
  
  glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
    read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>%
    rename(CNUM00 = matches("CNUM00"))
}

To confirm the function is working correctly, below we call it twice to load data from the second and third sweeps.

load_height_wide(2)

# A tibble: 15,778 × 3
   MCSID   CNUM00                              BCHTCM00                  
   <chr>   <dbl+lbl>                           <dbl+lbl>                 
M10001N 1 [1st Cohort Member of the family]  97                       
M10002P 1 [1st Cohort Member of the family]  96                       
M10007U 1 [1st Cohort Member of the family] 102                       
M10008V 1 [1st Cohort Member of the family]  -2 [No Measurement taken]
M10008V 2 [2nd Cohort Member of the family]  -2 [No Measurement taken]
M10011Q 1 [1st Cohort Member of the family] 106                       
M10014T 1 [1st Cohort Member of the family]  97                       
M10015U 1 [1st Cohort Member of the family]  94                       
M10016V 1 [1st Cohort Member of the family] 102                       
M10017W 1 [1st Cohort Member of the family]  99                       
# ℹ 15,768 more rows

load_height_wide(3)

# A tibble: 15,431 × 3
   MCSID   CNUM00                              CCHTCM00 
   <chr>   <dbl+lbl>                           <dbl+lbl>
M10001N 1 [1st Cohort Member of the family] 114.     
M10002P 1 [1st Cohort Member of the family] 110.     
M10007U 1 [1st Cohort Member of the family] 118      
M10011Q 1 [1st Cohort Member of the family] 121      
M10015U 1 [1st Cohort Member of the family] 110.     
M10016V 1 [1st Cohort Member of the family] 118.     
M10017W 1 [1st Cohort Member of the family] 110.     
M10018X 1 [1st Cohort Member of the family] 113.     
M10020R 1 [1st Cohort Member of the family] 112.     
M10021S 1 [1st Cohort Member of the family] 108      
# ℹ 15,421 more rows

We could manually load and merge successively using multiple load_height_wide() and full_join() function calls. However, this is rather verbose:

load_height_wide(2) %>%
  full_join(load_height_wide(3), by = c("MCSID", "CNUM00")) %>%
  full_join(load_height_wide(4), by = c("MCSID", "CNUM00")) %>%
  full_join(load_height_wide(6), by = c("MCSID", "CNUM00")) %>%
  full_join(load_height_wide(7), by = c("MCSID", "CNUM00"))

# A tibble: 17,568 × 7
   MCSID   CNUM00                  BCHTCM00  CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
   <chr>   <dbl+lbl>               <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
M10001N 1 [1st Cohort Member o…  97       114.     128.      NA       NA     
M10002P 1 [1st Cohort Member o…  96       110.     123      163.     174.    
M10007U 1 [1st Cohort Member o… 102       118      129      174.     181.    
M10008V 1 [1st Cohort Member o…  -2 [No …  NA       NA       NA       NA     
M10008V 2 [2nd Cohort Member o…  -2 [No …  NA       NA       NA       NA     
M10011Q 1 [1st Cohort Member o… 106       121      137       NA       NA     
M10014T 1 [1st Cohort Member o…  97        NA       NA       NA       NA     
M10015U 1 [1st Cohort Member o…  94       110.     122.     164.     169     
M10016V 1 [1st Cohort Member o… 102       118.     130      167      185.    
M10017W 1 [1st Cohort Member o…  99       110.     121.      NA       NA     
# ℹ 17,558 more rows

More efficiently, we can use the map() function from the purrr package (part of the tidyverse) to apply the load_height_wide() function to each sweep in turn. The map() function takes an object to be looped over as its first argument and a function to apply as its second argument. The function can be written as an anonymous function, similar to rename_with(). .x is a placeholder for the current elements of the object being looped over (in this case: 2, then 3, then 4, …, then 7). The map() function returns the results as a list. (Variants of map() return other data types, as we see shortly). Below we use map() to run load_height_wide() for sweeps 2-7. To save space, we do not print the output.

map(2:7, ~ load_height_wide(.x))

To merge list of datasets returned by map() together, we can use the reduce() function from purrr. reduce() has a similar syntax to map(): it takes an object as its first argument, and a function as its second argument. It applies the function to the first two elements of the list, and then successively applies the function to the result and the third element of the list onwards, until the list is finished. Below, we use reduce() to apply the full_join() function to the list of data frames. We specify full_join() in an anonymous function. .x and .y are the first and second inputs, respectively. In this case, at the first iteration sweep 2 (.x) is merged with sweep 3 (.y), and at the second iteration, the result of the first iteration (.x) is merged with sweep 4 (.y). This process is repeated until sweep 7 has been merged in.

map(2:7, load_height_wide) %>%
  reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00")))

# A tibble: 17,614 × 8
   MCSID   CNUM00         BCHTCM00  CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
   <chr>   <dbl+lbl>      <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
M10001N 1 [1st Cohort…  97       114.     128.      NA       NA       NA     
M10002P 1 [1st Cohort…  96       110.     123      144.     163.     174.    
M10007U 1 [1st Cohort… 102       118      129      154.     174.     181.    
M10008V 1 [1st Cohort…  -2 [No …  NA       NA       NA       NA       NA     
M10008V 2 [2nd Cohort…  -2 [No …  NA       NA       NA       NA       NA     
M10011Q 1 [1st Cohort… 106       121      137      168.      NA       NA     
M10014T 1 [1st Cohort…  97        NA       NA       NA       NA       NA     
M10015U 1 [1st Cohort…  94       110.     122.     143      164.     169     
M10016V 1 [1st Cohort… 102       118.     130      152.     167      185.    
M10017W 1 [1st Cohort…  99       110.     121.      NA       NA       NA     
# ℹ 17,604 more rows

Appending Programmatically

Programatically appending datasets together is slightly more straightforward as we can use a variant of map() called map_dfr(). Instead of returning a list, map_dfr() returns a data frame by calling bind_rows() in the background at the end. First, we create a function, load_height_long(), to load the height data from a given sweep and format it so that it can be appended to the other sweeps (i.e., giving variables consistent names). As above, the rename_with() function renames the variables to remove the sweep-specific prefixes. The relevant prefix is determined by subsetting the inbuilt LETTERS vectors, which contains the letters of the alphabet in upper case ("A", "B", "C", …, "Z"; i.e., LETTERS[2] returns "B").

load_height_long <- function(sweep){
  fup <- fups[sweep]
  prefix <- LETTERS[sweep]
  
  glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
    read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>%
    rename_with(~ str_replace(.x, glue("^{prefix}"), "")) %>%
    mutate(sweep = !!sweep, .before = 1)
}

To load data from sweeps 2-7 and append them together, we can use map_dfr() with the load_height_long() function.

map_dfr(2:7, ~ load_height_long(.x))

Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -1

Warning: `..1$CHTCM00` and `..3$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -8 and -1

Warning: `..1$CHTCM00` and `..5$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -1

Warning: `..1$CHTCM00` and `..6$CHTCM00` have conflicting value labels.
ℹ Labels for these values will be taken from `..1$CHTCM00`.
✖ Values: -5 and -1

# A tibble: 80,873 × 5
   sweep MCSID   CNUM00                              CHTCM00             CHTCMA0
   <int> <chr>   <dbl+lbl>                           <dbl+lbl>           <dbl+l>
   2 M10001N 1 [1st Cohort Member of the family]  97                 NA     
   2 M10002P 1 [1st Cohort Member of the family]  96                 NA     
   2 M10007U 1 [1st Cohort Member of the family] 102                 NA     
   2 M10008V 1 [1st Cohort Member of the family]  -2 [No Measuremen… NA     
   2 M10008V 2 [2nd Cohort Member of the family]  -2 [No Measuremen… NA     
   2 M10011Q 1 [1st Cohort Member of the family] 106                 NA     
   2 M10014T 1 [1st Cohort Member of the family]  97                 NA     
   2 M10015U 1 [1st Cohort Member of the family]  94                 NA     
   2 M10016V 1 [1st Cohort Member of the family] 102                 NA     
   2 M10017W 1 [1st Cohort Member of the family]  99                 NA     
# ℹ 80,863 more rows

Coda: Merging Parent Level Files

As discussed in the Data Structures page, the mcs[1-7]_parent_*.dta files contain identifiers for the respondent (MCSID and [A-G]PNUM00), but also for the type of interview they completed (MCSID and [A-G]ELIG00). We can use either of these to merge parent-level datasets together across sweeps. When doing so, it is sometimes worth keeping the information on the other identifiers to retain information on the respondent or interview type; for instance, this may help to determine why a response the same survey item differed markedly between sweeps.

df_parent_5y <- read_dta("5y/mcs3_parent_cm_interview.dta",
                         col_select = c("MCSID", "CCNUM00", "CPNUM00", "CELIG00", "CPFRTP00"))

df_parent_7y <- read_dta("7y/mcs4_parent_cm_interview.dta",
                         col_select = c("MCSID", "DCNUM00", "DPNUM00", "DELIG00", "DPFRTP00"))

df_parent_5y %>%
  full_join(df_parent_7y, 
             by = c("MCSID",
                    "CCNUM00" = "DCNUM00",
                    "CPNUM00" = "DPNUM00")) # Merge by person

# A tibble: 27,861 × 7
   MCSID   CPNUM00   CELIG00               CCNUM00     CPFRTP00 DELIG00 DPFRTP00
   <chr>   <dbl+lbl> <dbl+lbl>             <dbl+lbl>   <dbl+lb> <dbl+l> <dbl+lb>
M10001N 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1 [Mai… 2 [Two] 
M10002P 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1 [Mai… 3 [Thre…
M10002P 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre…
M10007U 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1 [Mai… 3 [Thre…
M10007U 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre…
M10011Q 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1 [Mai… 3 [Thre…
M10011Q 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre…
M10015U 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1 [Mai… 1 [One] 
M10015U 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 1 [One] 
M10016V 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1 [Mai… 2 [Two] 
# ℹ 27,851 more rows

df_parent_5y %>%
  full_join(df_parent_7y, 
            by = c("MCSID", 
                   "CCNUM00" = "DCNUM00",
                   "CELIG00" = "DELIG00"))  # Merge by interview type

# A tibble: 27,770 × 7
   MCSID   CPNUM00   CELIG00               CCNUM00     CPFRTP00 DPNUM00 DPFRTP00
   <chr>   <dbl+lbl> <dbl+lbl>             <dbl+lbl>   <dbl+lb> <dbl+l> <dbl+lb>
M10001N 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1       2 [Two] 
M10002P 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1       3 [Thre…
M10002P 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2       3 [Thre…
M10007U 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1       3 [Thre…
M10007U 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2       3 [Thre…
M10011Q 1         1 [Main Interview]    1 [1st Coh…  3 [Thr… 1       3 [Thre…
M10011Q 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2       3 [Thre…
M10015U 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1       1 [One] 
M10015U 2         2 [Partner Interview] 1 [1st Coh… -1 [Not… 2       1 [One] 
M10016V 1         1 [Main Interview]    1 [1st Coh…  2 [Two] 1       2 [Two] 
# ℹ 27,760 more rows