Commit 0fe1812

committed
add download links
1 parent a593b20 commit 0fe1812

39 files changed

+1514
-153
lines changed

docs/bcs70-data_discovery.md

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ format: docusaurus-md
99

1010

1111

12+
[Download the R script for this page](../purl/bcs70-data_discovery.R)
13+
1214
# Introduction
1315

1416
In this section, we show a few `R` functions for exploring BCS70 data;
Lines changed: 370 additions & 0 deletions
@@ -0,0 +1,370 @@
1+
---
2+
layout: default
3+
title: Combining Data Across Sweeps
4+
nav_order: 3
5+
parent: BCS70
6+
format: docusaurus-md
7+
---
8+
9+
10+
11+
12+
[Download the R script for this
13+
page](../purl/bcs70-merging_across_sweeps.R)
14+
15+
# Introduction
16+
17+
In this section, we show how to combine BCS70 data across sweeps.
18+
19+
As an example, we use data on cohort members’ height. These are
20+
contained in files which have one row per cohort-member. As a reminder,
21+
we have organised the data files so that each sweep [has its own folder,
22+
which is named according to the age of
23+
follow-up](https://cls-data.github.io/docs/bcs70-sweep_folders.html)
24+
(e.g., 10y for the third major sweep).
25+
26+
We begin by combining data from Sweep 9 (42y) and Sweep 11 (51y),
27+
showing how to combine these datasets in **wide** (one row per
28+
observational unit) and **long** (multiple rows per observational unit)
29+
formats by *merging* and *appending*, respectively. Because variable
30+
names change between sweeps in unpredictable ways, it is not
31+
straightforwardly possible to combine data from multiple sweeps
32+
*programmatically* (as we are able to do for, e.g., the
33+
[MCS](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html)).
34+
35+
We use the following packages:
36+
37+
```r
38+
# Load Packages
39+
library(tidyverse) # For data manipulation
40+
library(haven) # For importing .dta files
41+
```
42+
43+
# Merging Across Sweeps
44+
45+
The variables `BD9HGHTM` and `bd11hghtm` contain the height of the
46+
cohort member at Sweep 9 (42y) and Sweep 11 (51y), respectively. Note,
47+
these are derived variables which convert raw height measurements into
48+
metres. The variable names follow the same convention (with the
49+
exception that at age 51, lower case is used). This bucks the more
50+
general case where conceptually similar variables have different
51+
(potentially, non-descriptive) names, when combining data including
52+
early sweeps.
53+
54+
We will use the `read_dta()` function from `haven` to read in the data
55+
from the two sweeps, specifying the `col_select` argument to keep only
56+
the variables we need (the identifier and height variables).
57+
58+
```r
59+
df_42y <- read_dta("42y/bcs70_2012_derived.dta",
60+
col_select = c("BCSID", "BD9HGHTM"))
61+
62+
df_51y <- read_dta("51y/bcs11_age51_main.dta",
63+
col_select = c("bcsid", "bd11hghtm"))
64+
```
65+
66+
We can merge these datasets by row using the `*_join()` family of
67+
functions. These share a common syntax. They take two data frames (`x`
68+
and `y`) as arguments, as well as a `by` argument that specifies the
69+
variable(s) to join on. The `*_join()` functions are:
70+
71+
1. `full_join()`: Returns all rows from `x` and `y`, and all columns
72+
from `x` and `y`. For rows without matches in both `x` and `y`, the
73+
missing value `NA` is used for columns that are not used as
74+
identifiers.
75+
2. `inner_join()`: Returns all rows from `x` and `y` where there are
76+
matching rows in both data frames.
77+
3. `left_join()`: Returns all rows from `x`, and all columns from `x`
78+
and `y`. Rows in `x` with no match in `y` will have `NA` values in
79+
the new columns from `y`.
80+
4. `right_join()`: Returns all rows from `y`, and all columns from `x`
81+
and `y`. Rows in `y` with no match in `x` will have `NA` values in
82+
the columns of `x`.
83+
84+
In the current context, where `x` is data from Sweep 9 (`df_42y`) and
85+
`y` is data from Sweep 11 (`df_51y`): `full_join()` will return a row
86+
for each individual present in Sweep 9 or Sweep 11, with the height from
87+
each sweep in the same row; `inner_join()` will return a row for each
88+
individual who was present in both these sweeps, with the height from
89+
each sweep in the same row; `left_join()` will return a row for each
90+
individual in the 9th sweep, with the height from the 11th sweep in the
91+
same row if the individual was present in the 11th sweep; `right_join()`
92+
will return a row for each individual in the 11th sweep, with the height
93+
from the 9th sweep in the same row if the individual was present in the
94+
9th sweep.
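
The difference between the four joins is easiest to see on a very small example. The sketch below uses two made-up data frames (`toy_x` and `toy_y`, with hypothetical identifiers and heights, not real BCS70 records) that share two of their three identifiers, and counts the rows each join returns.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not real BCS70 records):
# "A" appears only in toy_x, "D" only in toy_y, "B" and "C" in both
toy_x <- tibble(id = c("A", "B", "C"), height_42 = c(1.60, 1.75, 1.82))
toy_y <- tibble(id = c("B", "C", "D"), height_51 = c(1.74, 1.82, 1.68))

# Number of rows returned by each type of join
join_rows <- c(
  full  = nrow(full_join(toy_x, toy_y, by = "id")),   # everyone in either
  inner = nrow(inner_join(toy_x, toy_y, by = "id")),  # only those in both
  left  = nrow(left_join(toy_x, toy_y, by = "id")),   # everyone in toy_x
  right = nrow(right_join(toy_x, toy_y, by = "id"))   # everyone in toy_y
)
join_rows
```

Here the full join has four rows (A, B, C, D), the inner join two (B, C), and the left and right joins three each.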
95+
96+
The `*_join()` functions can handle multiple variables to join on, and
97+
can also handle situations where the identifiers have different names
98+
across `x` and `y`. To specify the identifiers, we pass a vector to the
99+
`by` argument. In this case, we pass a *named vector* so that `BCSID` in
100+
`df_42y` can be matched to `bcsid` in `df_51y`.
101+
102+
```r
103+
df_42y %>%
104+
full_join(df_51y, by = c(BCSID = "bcsid"))
105+
```
106+
107+
``` text
108+
# A tibble: 10,683 × 3
109+
BCSID BD9HGHTM bd11hghtm
110+
<chr> <dbl+lbl> <dbl+lbl>
111+
1 B10001N 1.55 1.55
112+
2 B10003Q 1.85 1.85
113+
3 B10004R 1.60 1.6
114+
4 B10007U 1.52 NA
115+
5 B10009W 1.63 1.63
116+
6 B10010P 1.65 NA
117+
7 B10011Q 1.63 1.65
118+
8 B10013S 1.63 1.63
119+
9 B10015U 1.83 1.8
120+
10 B10016V 1.88 1.88
121+
# ℹ 10,673 more rows
122+
```
123+
124+
```r
125+
df_42y %>%
126+
inner_join(df_51y, by = c(BCSID = "bcsid"))
127+
```
128+
129+
``` text
130+
# A tibble: 7,174 × 3
131+
BCSID BD9HGHTM bd11hghtm
132+
<chr> <dbl+lbl> <dbl+lbl>
133+
1 B10001N 1.55 1.55
134+
2 B10003Q 1.85 1.85
135+
3 B10004R 1.60 1.6
136+
4 B10009W 1.63 1.63
137+
5 B10011Q 1.63 1.65
138+
6 B10013S 1.63 1.63
139+
7 B10015U 1.83 1.8
140+
8 B10016V 1.88 1.88
141+
9 B10018X 1.73 1.7
142+
10 B10020R 1.50 1.47
143+
# ℹ 7,164 more rows
144+
```
145+
146+
```r
147+
df_42y %>%
148+
left_join(df_51y, by = c(BCSID = "bcsid"))
149+
```
150+
151+
``` text
152+
# A tibble: 9,841 × 3
153+
BCSID BD9HGHTM bd11hghtm
154+
<chr> <dbl+lbl> <dbl+lbl>
155+
1 B10001N 1.55 1.55
156+
2 B10003Q 1.85 1.85
157+
3 B10004R 1.60 1.6
158+
4 B10007U 1.52 NA
159+
5 B10009W 1.63 1.63
160+
6 B10010P 1.65 NA
161+
7 B10011Q 1.63 1.65
162+
8 B10013S 1.63 1.63
163+
9 B10015U 1.83 1.8
164+
10 B10016V 1.88 1.88
165+
# ℹ 9,831 more rows
166+
```
167+
168+
```r
169+
df_42y %>%
170+
right_join(df_51y, by = c(BCSID = "bcsid"))
171+
```
172+
173+
``` text
174+
# A tibble: 8,016 × 3
175+
BCSID BD9HGHTM bd11hghtm
176+
<chr> <dbl+lbl> <dbl+lbl>
177+
1 B10001N 1.55 1.55
178+
2 B10003Q 1.85 1.85
179+
3 B10004R 1.60 1.6
180+
4 B10009W 1.63 1.63
181+
5 B10011Q 1.63 1.65
182+
6 B10013S 1.63 1.63
183+
7 B10015U 1.83 1.8
184+
8 B10016V 1.88 1.88
185+
9 B10018X 1.73 1.7
186+
10 B10020R 1.50 1.47
187+
# ℹ 8,006 more rows
188+
```
189+
190+
Note, the `*_join()` functions will merge any matching rows. Unlike
191+
`Stata`, we do not have to explicitly state whether we want a 1-to-1,
192+
many-to-1, 1-to-many, or many-to-many merge. This is determined by the
193+
data that are inputted to `*_join()`.
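
Because the type of merge is not declared, it can be worth checking whether the identifier is unique in each data frame before joining. The following is a minimal sketch using made-up data (`toy_x`, `toy_y` and their columns are hypothetical), not a required step in the workflow.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: the identifier "B" appears twice in toy_x
toy_x <- tibble(id = c("A", "B", "B"), visit = c(1, 1, 2))
toy_y <- tibble(id = c("A", "B"), height = c(1.60, 1.75))

# Count the identifiers that occur more than once in each data frame
n_dup_x <- toy_x %>% count(id) %>% filter(n > 1) %>% nrow()
n_dup_y <- toy_y %>% count(id) %>% filter(n > 1) %>% nrow()

# toy_x has a duplicated id and toy_y does not, so this is a many-to-1 merge:
# the single "B" row of toy_y is matched to both "B" rows of toy_x
merged <- left_join(toy_x, toy_y, by = "id")
merged
```

In `dplyr` 1.1.0 and later, the `*_join()` functions also accept a `relationship` argument (for example, `relationship = "one-to-one"`), which raises an error if the data do not have the expected structure.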
194+
195+
When the `by` argument is not supplied, the `*_join()` functions merge on
196+
any variables which have the same names across the two datasets. As
197+
`df_42y` has variables in upper case and `df_51y` has variables in lower
198+
case, we could have renamed the variables in `df_42y` in one fell swoop
199+
with `rename_with(str_to_lower)`. There are usually many ways of
200+
achieving the same thing.
201+
202+
```r
203+
df_42y %>%
204+
rename_with(str_to_lower) %>% # Converts all variable names to lower case
205+
full_join(df_51y)
206+
```
207+
208+
``` text
209+
Joining with `by = join_by(bcsid)`
210+
```
211+
212+
``` text
213+
# A tibble: 10,683 × 3
214+
bcsid bd9hghtm bd11hghtm
215+
<chr> <dbl+lbl> <dbl+lbl>
216+
1 B10001N 1.55 1.55
217+
2 B10003Q 1.85 1.85
218+
3 B10004R 1.60 1.6
219+
4 B10007U 1.52 NA
220+
5 B10009W 1.63 1.63
221+
6 B10010P 1.65 NA
222+
7 B10011Q 1.63 1.65
223+
8 B10013S 1.63 1.63
224+
9 B10015U 1.83 1.8
225+
10 B10016V 1.88 1.88
226+
# ℹ 10,673 more rows
227+
```
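
A further way of specifying identifiers with different names is `join_by()`, which is available in `dplyr` 1.1.0 and later (the version implied by the `Joining with` message above). The sketch below assumes that version and uses made-up values in place of the real files.

```r
library(dplyr)
library(tibble)

# Hypothetical toy versions of the two sweeps (values are made up)
toy_42 <- tibble(BCSID = c("A", "B"), BD9HGHTM = c(1.60, 1.75))
toy_51 <- tibble(bcsid = c("B", "C"), bd11hghtm = c(1.74, 1.68))

# Match BCSID in the first data frame to bcsid in the second
merged <- full_join(toy_42, toy_51, by = join_by(BCSID == bcsid))
merged
```

As with the named vector, the identifier column in the result takes its name from the first data frame.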
228+
229+
# Appending Sweeps
230+
231+
To put the data into long format, we can use the `bind_rows()` function.
232+
(In this case, the data will have one row per cohort-member x sweep
233+
combination.) To work properly, we need to name the variables
234+
consistently across sweeps, which here means removing the sweep-specific
235+
lettering (e.g., the string `BD9` from `BD9HGHTM` in `df_42y`). We also
236+
need to add a variable to identify the sweep the data comes from. Below,
237+
we use the `mutate()` function to create a `sweep` variable and then use
238+
the `rename_with()` function to remove the prefixes and rename the
239+
variables consistently across sweeps. (Given we only had one variable to
240+
rename, we could have done this manually with `rename()`, but this
241+
approach is more scalable.)
242+
243+
```r
244+
df_42y_nosuffix <- df_42y %>%
245+
rename_with(str_to_lower) %>%
246+
rename_with(~ str_remove(.x, "^bd9")) %>% # Removes the prefix 'bd9' from variable names
247+
mutate(sweep = 9, .before = 1)
248+
249+
df_51y_nosuffix <- df_51y %>%
250+
rename_with(~ str_remove(.x, "^bd11")) %>%
251+
mutate(sweep = 11, .before = 1)
252+
```
253+
254+
`rename_with()` applies a function to the names of the variables. In
255+
this case, we use the `str_remove()` function from the `stringr` package
256+
(part of the `tidyverse`) to remove the prefix from the variable names.
257+
The `~` symbol is used to create an [*anonymous
258+
function*](https://r4ds.hadley.nz/iteration.html), which is applied to
259+
each variable name. The `.x` symbol in the anonymous function is a
260+
placeholder for the variable name. `str_remove()` takes a regular
261+
expression. The `^` symbol is used to match the start of the string (so
262+
`^bd9` removes `bd9` only where these are the first characters of a variable
263+
name). Note, for the `mutate()` call, the `.before` argument is used to
264+
specify the position of the new variable in the data frame - here we
265+
specify `sweep` as the first column. Below we see what the formatted
266+
data frames look like:
267+
268+
```r
269+
df_42y_nosuffix
270+
```
271+
272+
``` text
273+
# A tibble: 9,841 × 3
274+
sweep bcsid hghtm
275+
<dbl> <chr> <dbl+lbl>
276+
1 9 B10001N 1.55
277+
2 9 B10003Q 1.85
278+
3 9 B10004R 1.60
279+
4 9 B10007U 1.52
280+
5 9 B10009W 1.63
281+
6 9 B10010P 1.65
282+
7 9 B10011Q 1.63
283+
8 9 B10013S 1.63
284+
9 9 B10015U 1.83
285+
10 9 B10016V 1.88
286+
# ℹ 9,831 more rows
287+
```
288+
289+
```r
290+
df_51y_nosuffix
291+
```
292+
293+
``` text
294+
# A tibble: 8,016 × 3
295+
sweep bcsid hghtm
296+
<dbl> <chr> <dbl+lbl>
297+
1 11 B10001N 1.55
298+
2 11 B10003Q 1.85
299+
3 11 B10004R 1.6
300+
4 11 B10009W 1.63
301+
5 11 B10011Q 1.65
302+
6 11 B10013S 1.63
303+
7 11 B10015U 1.8
304+
8 11 B10016V 1.88
305+
9 11 B10018X 1.7
306+
10 11 B10020R 1.47
307+
# ℹ 8,006 more rows
308+
```
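
To see what the `^` anchor contributes to the renaming step, the sketch below applies `str_remove()` directly to a character vector of names. The third name, `xbd9wght`, is invented for illustration and is not a BCS70 variable.

```r
library(stringr)

# Names to clean; "xbd9wght" is a made-up name containing "bd9" mid-string
toy_names <- c("bcsid", "bd9hghtm", "xbd9wght")

# "^bd9" only matches "bd9" at the very start of a name
stripped <- str_remove(toy_names, "^bd9")
stripped
```

Only `bd9hghtm` is changed (to `hghtm`). `bcsid` has no match, and `xbd9wght` is left alone because its `bd9` is not at the start of the string.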
309+
310+
Now that the data have been prepared, we can use `bind_rows()` to append the
311+
data frames together. This will stack the data frames on top of each
312+
other, so the number of rows is equal to the sum of rows in the
313+
individual datasets. The `bind_rows()` function can handle data frames
314+
with different numbers of columns. Missing columns are filled with `NA`
315+
values.
316+
317+
```r
318+
bind_rows(df_42y_nosuffix, df_51y_nosuffix) %>%
319+
arrange(bcsid, sweep) # Sorts the dataset by ID and sweep
320+
```
321+
322+
``` text
323+
# A tibble: 17,857 × 3
324+
sweep bcsid hghtm
325+
<dbl> <chr> <dbl+lbl>
326+
1 9 B10001N 1.55
327+
2 11 B10001N 1.55
328+
3 9 B10003Q 1.85
329+
4 11 B10003Q 1.85
330+
5 9 B10004R 1.60
331+
6 11 B10004R 1.6
332+
7 9 B10007U 1.52
333+
8 9 B10009W 1.63
334+
9 11 B10009W 1.63
335+
10 9 B10010P 1.65
336+
# ℹ 17,847 more rows
337+
```
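
The two data frames above have identical columns, so the `NA`-filling behaviour of `bind_rows()` is not visible. A small sketch with made-up data shows it; `toy_a`, `toy_b` and the extra column `wghtk` are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: toy_b has an extra column that toy_a lacks
toy_a <- tibble(id = c("A", "B"), hghtm = c(1.60, 1.75))
toy_b <- tibble(id = "C", hghtm = 1.68, wghtk = 70)

# Rows are stacked; wghtk is NA for the rows that came from toy_a
stacked <- bind_rows(toy_a, toy_b)
stacked
```

Columns are matched by name rather than by position, which is why the variables need consistent names before appending.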
338+
339+
Notice that with `bind_rows()` a cohort member has only as many rows of
340+
data as the times they appeared in Sweeps 9 and 11. This differs from
341+
`*_join()` where an explicit missing `NA` value is generated for the
342+
missing sweep. The `tidyverse` function `complete()` [can be used to
343+
create missing
344+
rows](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit),
345+
which can be useful if you need to generate a balanced panel of
346+
observations from which to begin analysis (e.g., when performing
347+
multiple imputation in long format).
348+
349+
```r
350+
bind_rows(df_42y_nosuffix, df_51y_nosuffix) %>%
351+
complete(bcsid, sweep) %>% # Ensure cohort members have a row for each sweep
352+
arrange(bcsid, sweep)
353+
```
354+
355+
``` text
356+
# A tibble: 21,366 × 3
357+
bcsid sweep hghtm
358+
<chr> <dbl> <dbl+lbl>
359+
1 B10001N 9 1.55
360+
2 B10001N 11 1.55
361+
3 B10003Q 9 1.85
362+
4 B10003Q 11 1.85
363+
5 B10004R 9 1.60
364+
6 B10004R 11 1.6
365+
7 B10007U 9 1.52
366+
8 B10007U 11 NA
367+
9 B10009W 9 1.63
368+
10 B10009W 11 1.63
369+
# ℹ 21,356 more rows
370+
```
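
The effect of `complete()` can be checked on a small example: the result has one row for every combination of identifier and sweep. The sketch below uses made-up data (`toy_long` is hypothetical) in which cohort member "B" is missing from Sweep 11.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format data: "B" has no row for sweep 11
toy_long <- tibble(
  id    = c("A", "A", "B"),
  sweep = c(9, 11, 9),
  hghtm = c(1.60, 1.60, 1.75)
)

# Adds the missing (B, 11) row, with NA for hghtm
balanced <- toy_long %>%
  complete(id, sweep) %>%
  arrange(id, sweep)
balanced
```

With two identifiers and two sweeps the balanced data has four rows. The same logic explains the row count in the output above: 10,683 cohort members multiplied by 2 sweeps gives 21,366 rows.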

docs/bcs70-reshape_long_wide.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,8 @@ format: docusaurus-md
99

1010

1111

12+
[Download the R script for this page](../purl/bcs70-reshape_long_wide.R)
13+
1214
# Introduction
1315

1416
In this section, we show how to reshape data from long to wide (and vice
