|
| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Combining Data Across Sweeps |
| 4 | +nav_order: 3 |
| 5 | +parent: BCS70 |
| 6 | +format: docusaurus-md |
| 7 | +--- |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | +[Download the R script for this |
| 13 | +page](../purl/bcs70-merging_across_sweeps.R) |
| 14 | + |
| 15 | +# Introduction |
| 16 | + |
| 17 | +In this section, we show how to combine BCS70 data across sweeps. |
| 18 | + |
| 19 | +As an example, we use data on cohort members’ height. These are |
| 20 | +contained in files which have one row per cohort-member. As a reminder, |
| 21 | +we have organised the data files so that each sweep [has its own folder, |
| 22 | +which is named according to the age of |
| 23 | +follow-up](https://cls-data.github.io/docs/bcs70-sweep_folders.html) |
| 24 | +(e.g., 10y for the third major sweep). |
| 25 | + |
| 26 | +We begin by combining data from Sweep 9 (42y) and Sweep 11 (51y), |
| 27 | +showing how to combine these datasets in **wide** (one row per |
| 28 | +observational unit) and **long** (multiple rows per observational unit) |
| 29 | +formats by *merging* and *appending*, respectively. Because variable |
| 30 | +names change between sweeps in unpredictable ways, it is not |
| 31 | +straightforward to combine data from multiple sweeps |
| 32 | +*programmatically* (as we are able to do for, e.g., the |
| 33 | +[MCS](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html)). |
| 34 | + |
| 35 | +We use the following packages: |
| 36 | + |
| 37 | +```r |
| 38 | +# Load Packages |
| 39 | +library(tidyverse) # For data manipulation |
| 40 | +library(haven) # For importing .dta files |
| 41 | +``` |
| 42 | + |
| 43 | +# Merging Across Sweeps |
| 44 | + |
| 45 | +The variables `BD9HGHTM` and `bd11hghtm` contain the height of the |
| 46 | +cohort member at Sweep 9 (42y) and Sweep 11 (51y), respectively. Note, |
| 47 | +these are derived variables which convert raw height measurements into |
| 48 | +metres. The variable names follow the same convention (with the |
| 49 | +exception that at age 51, lower case is used). This is unlike the more |
| 50 | +general case, particularly when combining data that include early |
| 51 | +sweeps, where conceptually similar variables have different |
| 52 | +(potentially non-descriptive) names. |
| 53 | + |
| 54 | +We will use the `read_dta()` function from `haven` to read in the data |
| 55 | +from the two sweeps, specifying the `col_select` argument to keep only |
| 56 | +the variables we need (the identifier and height variables). |
| 57 | + |
| 58 | +```r |
| 59 | +df_42y <- read_dta("42y/bcs70_2012_derived.dta", |
| 60 | + col_select = c("BCSID", "BD9HGHTM")) |
| 61 | + |
| 62 | +df_51y <- read_dta("51y/bcs11_age51_main.dta", |
| 63 | + col_select = c("bcsid", "bd11hghtm")) |
| 64 | +``` |
| 65 | + |
| 66 | +We can merge these datasets by row using the `*_join()` family of |
| 67 | +functions. These share a common syntax. They take two data frames (`x` |
| 68 | +and `y`) as arguments, as well as a `by` argument that specifies the |
| 69 | +variable(s) to join on. The `*_join()` functions are: |
| 70 | + |
| 71 | +1. `full_join()`: Returns all rows from `x` and `y`, and all columns |
| 72 | + from `x` and `y`. For rows without matches in both `x` and `y`, the |
| 73 | + missing value `NA` is used for columns that are not used as |
| 74 | + identifiers. |
| 75 | +2. `inner_join()`: Returns all rows from `x` and `y` where there are |
| 76 | + matching rows in both data frames. |
| 77 | +3. `left_join()`: Returns all rows from `x`, and all columns from `x` |
| 78 | + and `y`. Rows in `x` with no match in `y` will have `NA` values in |
| 79 | + the new columns from `y`. |
| 80 | +4. `right_join()`: Returns all rows from `y`, and all columns from `x` |
| 81 | + and `y`. Rows in `y` with no match in `x` will have `NA` values in |
| 82 | + the columns of `x`. |
| 83 | + |
| 84 | +In the current context, where `x` is data from Sweep 9 (`df_42y`) and |
| 85 | +`y` is data from Sweep 11 (`df_51y`): `full_join()` will return a row |
| 86 | +for each individual present in Sweep 9 or Sweep 11, with the height from |
| 87 | +each sweep in the same row; `inner_join()` will return a row for each |
| 88 | +individual who was present in both these sweeps, with the height from |
| 89 | +each sweep in the same row; `left_join()` will return a row for each |
| 90 | +individual in the 9th sweep, with the height from the 11th sweep in the |
| 91 | +same row if the individual was present in the 11th sweep; `right_join()` |
| 92 | +will return a row for each individual in the 11th sweep, with the height |
| 93 | +from the 9th sweep in the same row if the individual was present in the |
| 94 | +9th sweep. |
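
The difference between the four joins can be sketched on a small toy example. The data frames, the `id` column and the height values below are hypothetical and are not taken from the BCS70 files; the point is only to show which rows each join keeps.

```r
library(dplyr)

# Hypothetical toy data: three people observed in x, three in y
x <- tibble(id = c("A", "B", "C"), height_x = c(1.60, 1.75, 1.82))
y <- tibble(id = c("B", "C", "D"), height_y = c(1.74, 1.82, 1.68))

# Number of rows returned by each type of join
join_rows <- c(
  full  = nrow(full_join(x, y, by = "id")),   # keeps A, B, C and D
  inner = nrow(inner_join(x, y, by = "id")),  # keeps B and C only
  left  = nrow(left_join(x, y, by = "id")),   # keeps A, B and C
  right = nrow(right_join(x, y, by = "id"))   # keeps B, C and D
)
join_rows
```

Comparing row counts in this way is a quick check that a join has kept the set of cohort members you intended.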
| 95 | + |
| 96 | +The `*_join()` functions can handle multiple variables to join on, and |
| 97 | +can also handle situations where the identifiers have different names |
| 98 | +across `x` and `y`. To specify the identifiers, we pass a vector to the |
| 99 | +`by` argument. In this case, we pass a *named vector* so that `BCSID` in |
| 100 | +`df_42y` can be matched to `bcsid` in `df_51y`. |
| 101 | + |
| 102 | +```r |
| 103 | +df_42y %>% |
| 104 | + full_join(df_51y, by = c(BCSID = "bcsid")) |
| 105 | +``` |
| 106 | + |
| 107 | +``` text |
| 108 | +# A tibble: 10,683 × 3 |
| 109 | + BCSID BD9HGHTM bd11hghtm |
| 110 | + <chr> <dbl+lbl> <dbl+lbl> |
| 111 | + 1 B10001N 1.55 1.55 |
| 112 | + 2 B10003Q 1.85 1.85 |
| 113 | + 3 B10004R 1.60 1.6 |
| 114 | + 4 B10007U 1.52 NA |
| 115 | + 5 B10009W 1.63 1.63 |
| 116 | + 6 B10010P 1.65 NA |
| 117 | + 7 B10011Q 1.63 1.65 |
| 118 | + 8 B10013S 1.63 1.63 |
| 119 | + 9 B10015U 1.83 1.8 |
| 120 | +10 B10016V 1.88 1.88 |
| 121 | +# ℹ 10,673 more rows |
| 122 | +``` |
| 123 | + |
| 124 | +```r |
| 125 | +df_42y %>% |
| 126 | + inner_join(df_51y, by = c(BCSID = "bcsid")) |
| 127 | +``` |
| 128 | + |
| 129 | +``` text |
| 130 | +# A tibble: 7,174 × 3 |
| 131 | + BCSID BD9HGHTM bd11hghtm |
| 132 | + <chr> <dbl+lbl> <dbl+lbl> |
| 133 | + 1 B10001N 1.55 1.55 |
| 134 | + 2 B10003Q 1.85 1.85 |
| 135 | + 3 B10004R 1.60 1.6 |
| 136 | + 4 B10009W 1.63 1.63 |
| 137 | + 5 B10011Q 1.63 1.65 |
| 138 | + 6 B10013S 1.63 1.63 |
| 139 | + 7 B10015U 1.83 1.8 |
| 140 | + 8 B10016V 1.88 1.88 |
| 141 | + 9 B10018X 1.73 1.7 |
| 142 | +10 B10020R 1.50 1.47 |
| 143 | +# ℹ 7,164 more rows |
| 144 | +``` |
| 145 | + |
| 146 | +```r |
| 147 | +df_42y %>% |
| 148 | + left_join(df_51y, by = c(BCSID = "bcsid")) |
| 149 | +``` |
| 150 | + |
| 151 | +``` text |
| 152 | +# A tibble: 9,841 × 3 |
| 153 | + BCSID BD9HGHTM bd11hghtm |
| 154 | + <chr> <dbl+lbl> <dbl+lbl> |
| 155 | + 1 B10001N 1.55 1.55 |
| 156 | + 2 B10003Q 1.85 1.85 |
| 157 | + 3 B10004R 1.60 1.6 |
| 158 | + 4 B10007U 1.52 NA |
| 159 | + 5 B10009W 1.63 1.63 |
| 160 | + 6 B10010P 1.65 NA |
| 161 | + 7 B10011Q 1.63 1.65 |
| 162 | + 8 B10013S 1.63 1.63 |
| 163 | + 9 B10015U 1.83 1.8 |
| 164 | +10 B10016V 1.88 1.88 |
| 165 | +# ℹ 9,831 more rows |
| 166 | +``` |
| 167 | + |
| 168 | +```r |
| 169 | +df_42y %>% |
| 170 | + right_join(df_51y, by = c(BCSID = "bcsid")) |
| 171 | +``` |
| 172 | + |
| 173 | +``` text |
| 174 | +# A tibble: 8,016 × 3 |
| 175 | + BCSID BD9HGHTM bd11hghtm |
| 176 | + <chr> <dbl+lbl> <dbl+lbl> |
| 177 | + 1 B10001N 1.55 1.55 |
| 178 | + 2 B10003Q 1.85 1.85 |
| 179 | + 3 B10004R 1.60 1.6 |
| 180 | + 4 B10009W 1.63 1.63 |
| 181 | + 5 B10011Q 1.63 1.65 |
| 182 | + 6 B10013S 1.63 1.63 |
| 183 | + 7 B10015U 1.83 1.8 |
| 184 | + 8 B10016V 1.88 1.88 |
| 185 | + 9 B10018X 1.73 1.7 |
| 186 | +10 B10020R 1.50 1.47 |
| 187 | +# ℹ 8,006 more rows |
| 188 | +``` |
| 189 | + |
| 190 | +Note, the `*_join()` functions will merge any matching rows. Unlike |
| 191 | +`Stata`, we do not have to explicitly state whether we want a 1-to-1, |
| 192 | +many-to-1, 1-to-many, or many-to-many merge. This is determined by the |
| 193 | +data that are inputted to `*_join()`. |
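
If you would like a check similar to the one Stata provides, recent versions of `dplyr` (1.1.1 or later, to our knowledge) accept a `relationship` argument in the `*_join()` functions. The sketch below uses hypothetical toy data with a deliberately duplicated identifier; it is an illustration under that version assumption, not part of the original script.

```r
library(dplyr)

# Hypothetical toy data: id "A" appears twice in y_dup
x     <- tibble(id = c("A", "B"), height_x = c(1.60, 1.75))
y_dup <- tibble(id = c("A", "A", "B"), height_y = c(1.60, 1.61, 1.74))

# Without a check, the duplicated id yields an extra row in the result
n_unchecked <- nrow(left_join(x, y_dup, by = "id"))

# With relationship = "one-to-one", the same join stops with an error,
# which we catch here so that the script carries on
checked <- tryCatch(
  left_join(x, y_dup, by = "id", relationship = "one-to-one"),
  error = function(e) "not one-to-one"
)
```

Here `n_unchecked` is 3 rather than 2, and `checked` holds the fallback string because the join is not one-to-one.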
| 194 | + |
| 195 | +When the `by` argument is not specified, `*_join()` will merge on |
| 196 | +any variables which have the same names across the two datasets. As |
| 197 | +`df_42y` has variables in upper case and `df_51y` has variables in lower |
| 198 | +case, we could have renamed the variables in `df_42y` in one fell swoop |
| 199 | +with `rename_with(str_to_lower)`. There are usually many ways of |
| 200 | +achieving the same thing. |
| 201 | + |
| 202 | +```r |
| 203 | +df_42y %>% |
| 204 | + rename_with(str_to_lower) %>% # Converts all variable names to lower case |
| 205 | + full_join(df_51y) |
| 206 | +``` |
| 207 | + |
| 208 | +``` text |
| 209 | +Joining with `by = join_by(bcsid)` |
| 210 | +``` |
| 211 | + |
| 212 | +``` text |
| 213 | +# A tibble: 10,683 × 3 |
| 214 | + bcsid bd9hghtm bd11hghtm |
| 215 | + <chr> <dbl+lbl> <dbl+lbl> |
| 216 | + 1 B10001N 1.55 1.55 |
| 217 | + 2 B10003Q 1.85 1.85 |
| 218 | + 3 B10004R 1.60 1.6 |
| 219 | + 4 B10007U 1.52 NA |
| 220 | + 5 B10009W 1.63 1.63 |
| 221 | + 6 B10010P 1.65 NA |
| 222 | + 7 B10011Q 1.63 1.65 |
| 223 | + 8 B10013S 1.63 1.63 |
| 224 | + 9 B10015U 1.83 1.8 |
| 225 | +10 B10016V 1.88 1.88 |
| 226 | +# ℹ 10,673 more rows |
| 227 | +``` |
| 228 | + |
| 229 | +# Appending Sweeps |
| 230 | + |
| 231 | +To put the data into long format, we can use the `bind_rows()` function. |
| 232 | +(In this case, the data will have one row per cohort-member x sweep |
| 233 | +combination.) For this to work properly, we need to name the variables |
| 234 | +consistently across sweeps, which here means removing the sweep-specific |
| 235 | +lettering (e.g., the string `BD9` from `BD9HGHTM` in `df_42y`). We also |
| 236 | +need to add a variable to identify the sweep the data comes from. Below, |
| 237 | +we use the `mutate()` function to create a `sweep` variable and then use |
| 238 | +the `rename_with()` function to remove the prefixes and rename the |
| 239 | +variables consistently across sweeps. (Given we only had one variable to |
| 240 | +rename, we could have done this manually with `rename()`, but this |
| 241 | +approach is more scalable.) |
| 242 | + |
| 243 | +```r |
| 244 | +df_42y_nosuffix <- df_42y %>% |
| 245 | + rename_with(str_to_lower) %>% |
| 246 | + rename_with(~ str_remove(.x, "^bd9")) %>% # Removes the prefix 'bd9' from variable names |
| 247 | + mutate(sweep = 9, .before = 1) |
| 248 | + |
| 249 | +df_51y_nosuffix <- df_51y %>% |
| 250 | + rename_with(~ str_remove(.x, "^bd11")) %>% |
| 251 | + mutate(sweep = 11, .before = 1) |
| 252 | +``` |
| 253 | + |
| 254 | +`rename_with()` applies a function to the names of the variables. In |
| 255 | +this case, we use the `str_remove()` function from the `stringr` package |
| 256 | +(part of the `tidyverse`) to remove the prefix from the variable names. |
| 257 | +The `~` symbol is used to create an [*anonymous |
| 258 | +function*](https://r4ds.hadley.nz/iteration.html), which is applied to |
| 259 | +each variable name. The `.x` symbol in the anonymous function is a |
| 260 | +placeholder for the variable name. `str_remove()` takes a regular |
| 261 | +expression. The `^` symbol is used to match the start of the string (so |
| 262 | +`^bd9` removes `bd9` only where these are the first characters in a variable |
| 263 | +name). Note, for the `mutate()` call, the `.before` argument is used to |
| 264 | +specify the position of the new variable in the data frame - here we |
| 265 | +specify `sweep` as the first column. Below we see what the formatted |
| 266 | +data frames look like: |
| 267 | + |
| 268 | +```r |
| 269 | +df_42y_nosuffix |
| 270 | +``` |
| 271 | + |
| 272 | +``` text |
| 273 | +# A tibble: 9,841 × 3 |
| 274 | + sweep bcsid hghtm |
| 275 | + <dbl> <chr> <dbl+lbl> |
| 276 | + 1 9 B10001N 1.55 |
| 277 | + 2 9 B10003Q 1.85 |
| 278 | + 3 9 B10004R 1.60 |
| 279 | + 4 9 B10007U 1.52 |
| 280 | + 5 9 B10009W 1.63 |
| 281 | + 6 9 B10010P 1.65 |
| 282 | + 7 9 B10011Q 1.63 |
| 283 | + 8 9 B10013S 1.63 |
| 284 | + 9 9 B10015U 1.83 |
| 285 | +10 9 B10016V 1.88 |
| 286 | +# ℹ 9,831 more rows |
| 287 | +``` |
| 288 | + |
| 289 | +```r |
| 290 | +df_51y_nosuffix |
| 291 | +``` |
| 292 | + |
| 293 | +``` text |
| 294 | +# A tibble: 8,016 × 3 |
| 295 | + sweep bcsid hghtm |
| 296 | + <dbl> <chr> <dbl+lbl> |
| 297 | + 1 11 B10001N 1.55 |
| 298 | + 2 11 B10003Q 1.85 |
| 299 | + 3 11 B10004R 1.6 |
| 300 | + 4 11 B10009W 1.63 |
| 301 | + 5 11 B10011Q 1.65 |
| 302 | + 6 11 B10013S 1.63 |
| 303 | + 7 11 B10015U 1.8 |
| 304 | + 8 11 B10016V 1.88 |
| 305 | + 9 11 B10018X 1.7 |
| 306 | +10 11 B10020R 1.47 |
| 307 | +# ℹ 8,006 more rows |
| 308 | +``` |
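
To see why the `^` anchor in the renaming step matters, the sketch below applies `str_remove()` with and without it. The name `abd9x` is invented purely for illustration and does not exist in the BCS70 data.

```r
library(stringr)

# Hypothetical variable names; "abd9x" is made up to show the anchor's effect
old_names <- c("bcsid", "bd9hghtm", "abd9x")

anchored   <- str_remove(old_names, "^bd9")  # strips "bd9" only at the start
unanchored <- str_remove(old_names, "bd9")   # strips the first "bd9" anywhere

anchored
unanchored
```

With the anchor, `abd9x` is left untouched; without it, the name would be shortened to `ax`, which is rarely what is intended when removing a sweep prefix.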
| 309 | + |
| 310 | +Now the data have been prepared, we can use `bind_rows()` to append the |
| 311 | +data frames together. This will stack the data frames on top of each |
| 312 | +other, so the number of rows is equal to the sum of rows in the |
| 313 | +individual datasets. The `bind_rows()` function can handle data frames |
| 314 | +with different numbers of columns. Missing columns are filled with `NA` |
| 315 | +values. |
| 316 | + |
| 317 | +```r |
| 318 | +bind_rows(df_42y_nosuffix, df_51y_nosuffix) %>% |
| 319 | + arrange(bcsid, sweep) # Sorts the dataset by ID and sweep |
| 320 | +``` |
| 321 | + |
| 322 | +``` text |
| 323 | +# A tibble: 17,857 × 3 |
| 324 | + sweep bcsid hghtm |
| 325 | + <dbl> <chr> <dbl+lbl> |
| 326 | + 1 9 B10001N 1.55 |
| 327 | + 2 11 B10001N 1.55 |
| 328 | + 3 9 B10003Q 1.85 |
| 329 | + 4 11 B10003Q 1.85 |
| 330 | + 5 9 B10004R 1.60 |
| 331 | + 6 11 B10004R 1.6 |
| 332 | + 7 9 B10007U 1.52 |
| 333 | + 8 9 B10009W 1.63 |
| 334 | + 9 11 B10009W 1.63 |
| 335 | +10 9 B10010P 1.65 |
| 336 | +# ℹ 17,847 more rows |
| 337 | +``` |
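
The behaviour of `bind_rows()` when the data frames have different columns can be sketched with toy data. The data frames and the extra `wghtk` column below are hypothetical and are used only to show how the missing column is filled.

```r
library(dplyr)

# Hypothetical toy data: sweep_b carries an extra column, wghtk
sweep_a <- tibble(sweep = 9,  id = c("A", "B"), hghtm = c(1.60, 1.75))
sweep_b <- tibble(sweep = 11, id = c("A", "C"), hghtm = c(1.60, 1.82),
                  wghtk = c(70, 85))

# Rows are stacked; wghtk is NA for the rows that came from sweep_a
stacked <- bind_rows(sweep_a, sweep_b)
stacked
```

The result has four rows and four columns, with `NA` in `wghtk` for the two rows from `sweep_a`.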
| 338 | + |
| 339 | +Notice that with `bind_rows()` a cohort member has only as many rows of |
| 340 | +data as the times they appeared in Sweeps 9 and 11. This differs from |
| 341 | +`*_join()` where an explicit missing `NA` value is generated for the |
| 342 | +missing sweep. The `tidyverse` function `complete()` [can be used to |
| 343 | +create missing |
| 344 | +rows](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit), |
| 345 | +which can be useful if you need to generate a balanced panel of |
| 346 | +observations from which to begin analysis (e.g., when performing |
| 347 | +multiple imputation in long format). |
| 348 | + |
| 349 | +```r |
| 350 | +bind_rows(df_42y_nosuffix, df_51y_nosuffix) %>% |
| 351 | + complete(bcsid, sweep) %>% # Ensure cohort members have a row for each sweep |
| 352 | + arrange(bcsid, sweep) |
| 353 | +``` |
| 354 | + |
| 355 | +``` text |
| 356 | +# A tibble: 21,366 × 3 |
| 357 | + bcsid sweep hghtm |
| 358 | + <chr> <dbl> <dbl+lbl> |
| 359 | + 1 B10001N 9 1.55 |
| 360 | + 2 B10001N 11 1.55 |
| 361 | + 3 B10003Q 9 1.85 |
| 362 | + 4 B10003Q 11 1.85 |
| 363 | + 5 B10004R 9 1.60 |
| 364 | + 6 B10004R 11 1.6 |
| 365 | + 7 B10007U 9 1.52 |
| 366 | + 8 B10007U 11 NA |
| 367 | + 9 B10009W 9 1.63 |
| 368 | +10 B10009W 11 1.63 |
| 369 | +# ℹ 21,356 more rows |
| 370 | +``` |
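
The effect of `complete()` is easier to follow on a very small example. The sketch below uses hypothetical toy data in which person `B` is observed at sweep 9 only; the names and values are illustrative and not drawn from the BCS70 files.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: person B has no row for sweep 11
long <- tibble(
  id    = c("A", "A", "B"),
  sweep = c(9, 11, 9),
  hghtm = c(1.60, 1.60, 1.75)
)

# complete() adds a row for every id x sweep combination, filling with NA
balanced <- long %>%
  complete(id, sweep) %>%
  arrange(id, sweep)
balanced
```

The implicit gap for `B` at sweep 11 becomes an explicit row with `NA` for `hghtm`, giving two rows per person.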