Structuring code that is robust to updates in the data, changes in
methodological specifications, etc., and can get to presentation-ready
results in an automated way can mitigate errors caused by painful,
tedious tasks–it can also be a huge time saver. The cheese
package was designed with this philosophy in mind, heavily influenced by
(and explicit dependencies on) tidy
concepts. Its
intention is to provide general tools for working with data and results
during statistical analysis, in addition to a collection of functions
designated for specific statistical tasks.
Let’s take a closer look:
## Loading required package: cheese
This data comes from the UCI Machine Learning Repository, containing a collection of demographic and clinical characteristics from 303 patients. It was subsequently processed and cleaned into a format suitable for analysis–details of which can be found here.
#Look at the data
heart_disease
## # A tibble: 303 × 9
## Age Sex ChestPain BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
## <dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl> <fct> <fct>
## 1 63 Male Typical angina 145 233 TRUE 150 No No
## 2 67 Male Asymptomatic 160 286 FALSE 108 Yes Yes
## 3 67 Male Asymptomatic 120 229 FALSE 129 Yes Yes
## 4 37 Male Non-anginal pain 130 250 FALSE 187 No No
## 5 41 Female Atypical angina 130 204 FALSE 172 No No
## 6 56 Male Atypical angina 120 236 FALSE 178 No No
## 7 62 Female Asymptomatic 140 268 FALSE 160 No Yes
## 8 57 Female Asymptomatic 120 354 FALSE 163 Yes No
## 9 63 Male Asymptomatic 130 254 FALSE 147 No Yes
## 10 53 Male Asymptomatic 140 203 TRUE 155 Yes Yes
## # … with 293 more rows, and abbreviated variable names ¹Cholesterol,
## # ²BloodSugar, ³MaximumHR, ⁴ExerciseInducedAngina, ⁵HeartDisease
The functions that will be introduced are roughly in order of their
development, as many build off of one another. Selection of columns is
done with non-standard evaluation (NSE) or with
tidyselect::select_helpers
, where applicable.
divide()
is used to split up a data frame into a list of
subsets. Suppose you want to split the example data set by sex and heart
disease status:
## $Female
## $Female$No
## # A tibble: 72 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 41 Atypical angina 130 204 FALSE 172 No
## 2 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 56 Atypical angina 140 294 FALSE 153 No
## 4 48 Non-anginal pain 130 275 FALSE 139 No
## 5 58 Typical angina 150 283 TRUE 162 No
## 6 50 Non-anginal pain 120 219 FALSE 158 No
## 7 58 Non-anginal pain 120 340 FALSE 172 No
## 8 66 Typical angina 150 226 FALSE 114 No
## 9 69 Typical angina 140 239 FALSE 151 No
## 10 71 Atypical angina 160 302 FALSE 162 No
## # … with 62 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
## $Female$Yes
## # A tibble: 25 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 62 Asymptomatic 140 268 FALSE 160 No
## 2 65 Asymptomatic 150 225 FALSE 114 No
## 3 61 Asymptomatic 130 330 FALSE 169 No
## 4 51 Asymptomatic 130 305 FALSE 142 Yes
## 5 62 Asymptomatic 160 164 FALSE 145 No
## 6 60 Asymptomatic 150 258 FALSE 157 No
## 7 61 Asymptomatic 145 307 FALSE 146 Yes
## 8 43 Asymptomatic 132 341 TRUE 136 Yes
## 9 62 Non-anginal pain 130 263 FALSE 97 No
## 10 63 Asymptomatic 150 407 FALSE 154 No
## # … with 15 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
##
## $Male
## $Male$No
## # A tibble: 92 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 63 Typical angina 145 233 TRUE 150 No
## 2 37 Non-anginal pain 130 250 FALSE 187 No
## 3 56 Atypical angina 120 236 FALSE 178 No
## 4 57 Asymptomatic 140 192 FALSE 148 No
## 5 44 Atypical angina 120 263 FALSE 173 No
## 6 52 Non-anginal pain 172 199 TRUE 162 No
## 7 57 Non-anginal pain 150 168 FALSE 174 No
## 8 54 Asymptomatic 140 239 FALSE 160 No
## 9 49 Atypical angina 130 266 FALSE 171 No
## 10 64 Typical angina 110 211 FALSE 144 Yes
## # … with 82 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
## $Male$Yes
## # A tibble: 114 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 67 Asymptomatic 160 286 FALSE 108 Yes
## 2 67 Asymptomatic 120 229 FALSE 129 Yes
## 3 63 Asymptomatic 130 254 FALSE 147 No
## 4 53 Asymptomatic 140 203 TRUE 155 Yes
## 5 56 Non-anginal pain 130 256 TRUE 142 Yes
## 6 48 Atypical angina 110 229 FALSE 168 No
## 7 58 Atypical angina 120 284 FALSE 160 No
## 8 58 Non-anginal pain 132 224 FALSE 173 No
## 9 60 Asymptomatic 130 206 FALSE 132 Yes
## 10 40 Asymptomatic 110 167 FALSE 114 Yes
## # … with 104 more rows, and abbreviated variable name ¹ExerciseInducedAngina
The default behavior is to continually split the data in a
hierarchical structure based on the order of the input columns, and to
remove them from the result. The keep
argument can be used
to retain the column(s) in the data frames.
fasten()
allows you to take a list structure that
contains data frames at an arbitrary depth, and roll them back up into a
single data frame. This is useful when a statistical process needs to be
mapped to many subsets, and you want to be able to easily collect the
results without repeatedly binding data. You can call the function with
the divided data from above:
## # A tibble: 303 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 41 Atypical angina 130 204 FALSE 172 No
## 2 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 56 Atypical angina 140 294 FALSE 153 No
## 4 48 Non-anginal pain 130 275 FALSE 139 No
## 5 58 Typical angina 150 283 TRUE 162 No
## 6 50 Non-anginal pain 120 219 FALSE 158 No
## 7 58 Non-anginal pain 120 340 FALSE 172 No
## 8 66 Typical angina 150 226 FALSE 114 No
## 9 69 Typical angina 140 239 FALSE 151 No
## 10 71 Atypical angina 160 302 FALSE 162 No
## # … with 293 more rows, and abbreviated variable name ¹ExerciseInducedAngina
It was binded back to the original number of rows, but it lost the
columns it was split by. The into
argument can be used to
handle this:
## # A tibble: 303 × 9
## Sex HeartDisease Age ChestPain BP Chole…¹ Blood…² Maxim…³ Exerc…⁴
## <chr> <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 Female No 41 Atypical ang… 130 204 FALSE 172 No
## 2 Female No 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 Female No 56 Atypical ang… 140 294 FALSE 153 No
## 4 Female No 48 Non-anginal … 130 275 FALSE 139 No
## 5 Female No 58 Typical angi… 150 283 TRUE 162 No
## 6 Female No 50 Non-anginal … 120 219 FALSE 158 No
## 7 Female No 58 Non-anginal … 120 340 FALSE 172 No
## 8 Female No 66 Typical angi… 150 226 FALSE 114 No
## 9 Female No 69 Typical angi… 140 239 FALSE 151 No
## 10 Female No 71 Atypical ang… 160 302 FALSE 162 No
## # … with 293 more rows, and abbreviated variable names ¹Cholesterol,
## # ²BloodSugar, ³MaximumHR, ⁴ExerciseInducedAngina
The positions of the into
values always lineup with the
level of the list hierarchy. So, for example, if you don’t care about
retaining the heart disease status, you can do this:
## # A tibble: 303 × 8
## Sex Age ChestPain BP Cholesterol BloodSugar MaximumHR Exerci…¹
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 Female 41 Atypical angina 130 204 FALSE 172 No
## 2 Female 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 Female 56 Atypical angina 140 294 FALSE 153 No
## 4 Female 48 Non-anginal pain 130 275 FALSE 139 No
## 5 Female 58 Typical angina 150 283 TRUE 162 No
## 6 Female 50 Non-anginal pain 120 219 FALSE 158 No
## 7 Female 58 Non-anginal pain 120 340 FALSE 172 No
## 8 Female 66 Typical angina 150 226 FALSE 114 No
## 9 Female 69 Typical angina 140 239 FALSE 151 No
## 10 Female 71 Atypical angina 160 302 FALSE 162 No
## # … with 293 more rows, and abbreviated variable name ¹ExerciseInducedAngina
In contrast, if you want to forgo the sex indicator, empty strings will need to be used as placeholders so the names are applied at the correct levels.
## # A tibble: 303 × 8
## HeartDisease Age ChestPain BP Cholesterol Blood…¹ Maxim…² Exerc…³
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 No 41 Atypical angina 130 204 FALSE 172 No
## 2 No 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 No 56 Atypical angina 140 294 FALSE 153 No
## 4 No 48 Non-anginal pain 130 275 FALSE 139 No
## 5 No 58 Typical angina 150 283 TRUE 162 No
## 6 No 50 Non-anginal pain 120 219 FALSE 158 No
## 7 No 58 Non-anginal pain 120 340 FALSE 172 No
## 8 No 66 Typical angina 150 226 FALSE 114 No
## 9 No 69 Typical angina 140 239 FALSE 151 No
## 10 No 71 Atypical angina 160 302 FALSE 162 No
## # … with 293 more rows, and abbreviated variable names ¹BloodSugar, ²MaximumHR,
## # ³ExerciseInducedAngina
Obviously, the classes of the split columns are not retained from the original data since the splitting and binding processes are independent.
As shown above, the default behavior for these functions is to split
or bind “as much as possible”. However, it can be useful to have control
over the shape of the split. This is where the depth
argument comes in. Suppose you want the same data frames at the leaves
of the list, but only one level deep:
## $`Female|No`
## # A tibble: 72 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 41 Atypical angina 130 204 FALSE 172 No
## 2 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 56 Atypical angina 140 294 FALSE 153 No
## 4 48 Non-anginal pain 130 275 FALSE 139 No
## 5 58 Typical angina 150 283 TRUE 162 No
## 6 50 Non-anginal pain 120 219 FALSE 158 No
## 7 58 Non-anginal pain 120 340 FALSE 172 No
## 8 66 Typical angina 150 226 FALSE 114 No
## 9 69 Typical angina 140 239 FALSE 151 No
## 10 71 Atypical angina 160 302 FALSE 162 No
## # … with 62 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
## $`Male|No`
## # A tibble: 92 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 63 Typical angina 145 233 TRUE 150 No
## 2 37 Non-anginal pain 130 250 FALSE 187 No
## 3 56 Atypical angina 120 236 FALSE 178 No
## 4 57 Asymptomatic 140 192 FALSE 148 No
## 5 44 Atypical angina 120 263 FALSE 173 No
## 6 52 Non-anginal pain 172 199 TRUE 162 No
## 7 57 Non-anginal pain 150 168 FALSE 174 No
## 8 54 Asymptomatic 140 239 FALSE 160 No
## 9 49 Atypical angina 130 266 FALSE 171 No
## 10 64 Typical angina 110 211 FALSE 144 Yes
## # … with 82 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
## $`Female|Yes`
## # A tibble: 25 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 62 Asymptomatic 140 268 FALSE 160 No
## 2 65 Asymptomatic 150 225 FALSE 114 No
## 3 61 Asymptomatic 130 330 FALSE 169 No
## 4 51 Asymptomatic 130 305 FALSE 142 Yes
## 5 62 Asymptomatic 160 164 FALSE 145 No
## 6 60 Asymptomatic 150 258 FALSE 157 No
## 7 61 Asymptomatic 145 307 FALSE 146 Yes
## 8 43 Asymptomatic 132 341 TRUE 136 Yes
## 9 62 Non-anginal pain 130 263 FALSE 97 No
## 10 63 Asymptomatic 150 407 FALSE 154 No
## # … with 15 more rows, and abbreviated variable name ¹ExerciseInducedAngina
##
## $`Male|Yes`
## # A tibble: 114 × 7
## Age ChestPain BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
## <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 67 Asymptomatic 160 286 FALSE 108 Yes
## 2 67 Asymptomatic 120 229 FALSE 129 Yes
## 3 63 Asymptomatic 130 254 FALSE 147 No
## 4 53 Asymptomatic 140 203 TRUE 155 Yes
## 5 56 Non-anginal pain 130 256 TRUE 142 Yes
## 6 48 Atypical angina 110 229 FALSE 168 No
## 7 58 Atypical angina 120 284 FALSE 160 No
## 8 58 Non-anginal pain 132 224 FALSE 173 No
## 9 60 Asymptomatic 130 206 FALSE 132 Yes
## 10 40 Asymptomatic 110 167 FALSE 114 Yes
## # … with 104 more rows, and abbreviated variable name ¹ExerciseInducedAngina
You now have list with four elements containing the subsets–the names
of which are the concatenated levels of the split columns (see the
sep
argument).
This argument also works when binding data. Going back to the original divided data frame, you may only want to bind the internal lists.
## $Female
## # A tibble: 97 × 8
## HeartDisease Age ChestPain BP Cholesterol Blood…¹ Maxim…² Exerc…³
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 No 41 Atypical angina 130 204 FALSE 172 No
## 2 No 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 No 56 Atypical angina 140 294 FALSE 153 No
## 4 No 48 Non-anginal pain 130 275 FALSE 139 No
## 5 No 58 Typical angina 150 283 TRUE 162 No
## 6 No 50 Non-anginal pain 120 219 FALSE 158 No
## 7 No 58 Non-anginal pain 120 340 FALSE 172 No
## 8 No 66 Typical angina 150 226 FALSE 114 No
## 9 No 69 Typical angina 140 239 FALSE 151 No
## 10 No 71 Atypical angina 160 302 FALSE 162 No
## # … with 87 more rows, and abbreviated variable names ¹BloodSugar, ²MaximumHR,
## # ³ExerciseInducedAngina
##
## $Male
## # A tibble: 206 × 8
## HeartDisease Age ChestPain BP Cholesterol Blood…¹ Maxim…² Exerc…³
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 No 63 Typical angina 145 233 TRUE 150 No
## 2 No 37 Non-anginal pain 130 250 FALSE 187 No
## 3 No 56 Atypical angina 120 236 FALSE 178 No
## 4 No 57 Asymptomatic 140 192 FALSE 148 No
## 5 No 44 Atypical angina 120 263 FALSE 173 No
## 6 No 52 Non-anginal pain 172 199 TRUE 162 No
## 7 No 57 Non-anginal pain 150 168 FALSE 174 No
## 8 No 54 Asymptomatic 140 239 FALSE 160 No
## 9 No 49 Atypical angina 130 266 FALSE 171 No
## 10 No 64 Typical angina 110 211 FALSE 144 Yes
## # … with 196 more rows, and abbreviated variable names ¹BloodSugar, ²MaximumHR,
## # ³ExerciseInducedAngina
Note that the positions of the into
values are directly
related to how shallow/deep you want the result to be.
The depth can also be controlled relative to the leaves by using negative integers. This can be useful if it is unclear how deep the list will be (or is). In the example above, suppose you didn’t know how deep the list was, but you knew the heart disease status was the most internal split, and thus wanted to bind it.
## $Female
## # A tibble: 97 × 8
## HeartDisease Age ChestPain BP Cholesterol Blood…¹ Maxim…² Exerc…³
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 No 41 Atypical angina 130 204 FALSE 172 No
## 2 No 57 Asymptomatic 120 354 FALSE 163 Yes
## 3 No 56 Atypical angina 140 294 FALSE 153 No
## 4 No 48 Non-anginal pain 130 275 FALSE 139 No
## 5 No 58 Typical angina 150 283 TRUE 162 No
## 6 No 50 Non-anginal pain 120 219 FALSE 158 No
## 7 No 58 Non-anginal pain 120 340 FALSE 172 No
## 8 No 66 Typical angina 150 226 FALSE 114 No
## 9 No 69 Typical angina 140 239 FALSE 151 No
## 10 No 71 Atypical angina 160 302 FALSE 162 No
## # … with 87 more rows, and abbreviated variable names ¹BloodSugar, ²MaximumHR,
## # ³ExerciseInducedAngina
##
## $Male
## # A tibble: 206 × 8
## HeartDisease Age ChestPain BP Cholesterol Blood…¹ Maxim…² Exerc…³
## <chr> <dbl> <fct> <dbl> <dbl> <lgl> <dbl> <fct>
## 1 No 63 Typical angina 145 233 TRUE 150 No
## 2 No 37 Non-anginal pain 130 250 FALSE 187 No
## 3 No 56 Atypical angina 120 236 FALSE 178 No
## 4 No 57 Asymptomatic 140 192 FALSE 148 No
## 5 No 44 Atypical angina 120 263 FALSE 173 No
## 6 No 52 Non-anginal pain 172 199 TRUE 162 No
## 7 No 57 Non-anginal pain 150 168 FALSE 174 No
## 8 No 54 Asymptomatic 140 239 FALSE 160 No
## 9 No 49 Atypical angina 130 266 FALSE 171 No
## 10 No 64 Typical angina 110 211 FALSE 144 Yes
## # … with 196 more rows, and abbreviated variable names ¹BloodSugar, ²MaximumHR,
## # ³ExerciseInducedAngina
The new depth is one less than the input. In this case, the same result was achieved because of the symmetry.
The next set of functions are rooted in functional programming
(i.e. purrr
). In
each instance, given a pre-defined function, you can easily control the
scope of the data in which it is evaluated on. This is an incredibly
useful and efficient philosophy for generating reusable code that is
non-repetitive.
Often times a statistical analysis is concerned with multiple
outcomes/targets, though each one potentially uses the same set of
explanatory variables. You could of course hard code the specific
analysis steps for each outcome (e.g. formulas, etc.), but then the code
becomes repetitive. You could also make separate data sets for each
outcome with the associated predictors and then apply the same analysis
process to each data set, but then you waste resources by storing
multiple copies of the same data in memory. dish()
allows
you to distribute a function to various pairwise sets of columns from
the same data set in a single call. The motivation described above is
merely that–this is a general function that can be applied to any
context.
As a simple first example, lets compute the Pearson correlation of blood pressure and cholesterol with all numeric columns:
## $BP
## $BP$Age
## [1] 0.2849459
##
## $BP$BP
## [1] 1
##
## $BP$Cholesterol
## [1] 0.1301201
##
## $BP$MaximumHR
## [1] -0.04535088
##
##
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
##
## $Cholesterol$BP
## [1] 0.1301201
##
## $Cholesterol$Cholesterol
## [1] 1
##
## $Cholesterol$MaximumHR
## [1] -0.003431832
The left
argument controls which columns are entered in
the first argument of f
, and the right
argument controls which columns are entered into the second. In the
example, you’ll notice that the left
columns were also
included in the set of right
columns, because they are, in
fact, numeric()
. If you didn’t want this, you can omit the
right
argument, which will then include everything not in
left
as the second argument of f
:
heart_disease %>%
dplyr::select(where(is.numeric)) %>% #Need to do this so only numeric columns are evaluated
dish(
f = cor,
left = c(BP, Cholesterol)
)
## $BP
## $BP$Age
## [1] 0.2849459
##
## $BP$MaximumHR
## [1] -0.04535088
##
##
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
##
## $Cholesterol$MaximumHR
## [1] -0.003431832
Now lets suppose you want to regress both blood pressure and
cholesterol on all other variables in the data set. The
each_
arguments allow you to control whether the column
sets are entered into the function individually as vectors, or together
in a single data frame. The former is the default, so you’ll need to use
this argument here:
heart_disease %>%
dish(
f = function(y, x) lm(y ~ ., data = x),
left = c(BP, Cholesterol),
each_right = FALSE
)
## $BP
##
## Call:
## lm(formula = y ~ ., data = x)
##
## Coefficients:
## (Intercept) Age
## 97.09264 0.50748
## SexMale ChestPainAtypical angina
## -3.95940 -9.84788
## ChestPainNon-anginal pain ChestPainAsymptomatic
## -9.56350 -10.11788
## BloodSugarTRUE MaximumHR
## 6.70106 0.09688
## ExerciseInducedAnginaYes HeartDiseaseYes
## 1.72906 6.00819
##
##
## $Cholesterol
##
## Call:
## lm(formula = y ~ ., data = x)
##
## Coefficients:
## (Intercept) Age
## 127.5108 1.2471
## SexMale ChestPainAtypical angina
## -23.6185 8.8270
## ChestPainNon-anginal pain ChestPainAsymptomatic
## 5.7660 8.0778
## BloodSugarTRUE MaximumHR
## -0.7327 0.3491
## ExerciseInducedAnginaYes HeartDiseaseYes
## 7.8039 12.5303
stratiply()
streamlines the familiar
split()
then purrr::map()
approach by making
it simple to select the stratification columns, and apply a function to
each subset. Let’s run a multiple logistic regression model using heart
disease as the outcome, stratified by sex:
heart_disease %>%
stratiply(
f = glm,
by = Sex,
formula = HeartDisease ~ . -ChestPain,
family = "binomial"
)
## $Female
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP
## -6.95485 0.03362 0.04366
## Cholesterol BloodSugarTRUE MaximumHR
## 0.00354 0.47028 -0.02465
## ExerciseInducedAnginaYes
## 2.12039
##
## Degrees of Freedom: 96 Total (i.e. Null); 90 Residual
## Null Deviance: 110.7
## Residual Deviance: 76.21 AIC: 90.21
##
## $Male
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP
## 1.877142 0.024853 0.007710
## Cholesterol BloodSugarTRUE MaximumHR
## 0.008718 -0.390310 -0.042231
## ExerciseInducedAnginaYes
## 1.106387
##
## Degrees of Freedom: 205 Total (i.e. Null); 199 Residual
## Null Deviance: 283.2
## Residual Deviance: 211.2 AIC: 225.2
Adding multiple stratification columns will produce a deeper list:
heart_disease %>%
stratiply(
f = glm,
by = c(Sex, ExerciseInducedAngina),
formula = HeartDisease ~ . -ChestPain,
family = "binomial"
)
## $Female
## $Female$No
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP Cholesterol BloodSugarTRUE
## -9.58123 0.01027 0.07213 0.00406 0.65807
## MaximumHR
## -0.02540
##
## Degrees of Freedom: 74 Total (i.e. Null); 69 Residual
## Null Deviance: 62.53
## Residual Deviance: 49.93 AIC: 61.93
##
## $Female$Yes
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP Cholesterol BloodSugarTRUE
## 3.850121 0.049450 0.012214 0.001458 0.945153
## MaximumHR
## -0.056538
##
## Degrees of Freedom: 21 Total (i.e. Null); 16 Residual
## Null Deviance: 28.84
## Residual Deviance: 23.36 AIC: 35.36
##
##
## $Male
## $Male$No
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP Cholesterol BloodSugarTRUE
## 2.728309 0.053058 -0.003653 0.005387 -1.106214
## MaximumHR
## -0.042036
##
## Degrees of Freedom: 128 Total (i.e. Null); 123 Residual
## Null Deviance: 174
## Residual Deviance: 142.3 AIC: 154.3
##
## $Male$Yes
##
## Call: .f(formula = ..1, family = "binomial", data = .x[[i]])
##
## Coefficients:
## (Intercept) Age BP Cholesterol BloodSugarTRUE
## 0.70070 -0.01348 0.03561 0.01342 1.09404
## MaximumHR
## -0.04531
##
## Degrees of Freedom: 76 Total (i.e. Null); 71 Residual
## Null Deviance: 75.94
## Residual Deviance: 60.53 AIC: 72.53
typly()
allows you to apply a function to columns of a
data frame whose type inherits at least one (or none) of the specified
candidates. It is similar to purrr::map_if()
, but uses
methods::is()
for determining types (see
some_type()
), and only returns the results for the selected
columns:
## $Age
## [1] 54.43894
##
## $BP
## [1] 131.6898
##
## $Cholesterol
## [1] 246.6931
##
## $BloodSugar
## [1] 0.1485149
##
## $MaximumHR
## [1] 149.6073
You can use the negated
argument to apply the function
to columns that are not any of the listed types:
heart_disease %>%
typly(
f = table,
types = c("numeric", "logical"),
negated = TRUE,
useNA = "always"
)
## $Sex
##
## Female Male <NA>
## 97 206 0
##
## $ChestPain
##
## Typical angina Atypical angina Non-anginal pain Asymptomatic
## 23 50 86 144
## <NA>
## 0
##
## $ExerciseInducedAngina
##
## No Yes <NA>
## 204 99 0
##
## $HeartDisease
##
## No Yes <NA>
## 164 139 0
absorb()
allows you to create collections of strings
that use keys as placeholders, which are then populated by the values.
This is useful for setting up results of your analysis to be presented
in a specific format that is independent of the data set at hand. Here’s
a simple example:
absorb(
key = c("mean", "sd", "var"),
value = c("10", "2", "4"),
text =
c(
"MEAN: mean, SD: sd",
"VAR: var = sd^2",
MEAN = "mean"
)
)
## MEAN
## "MEAN: 10, SD: 2" "VAR: 4 = 2^2" "10"
The input text
is scanned to look for the keys
(interpreted as regular expressions) and are replaced with the
corresponding values.
Let’s look at a more useful example. Suppose we have a collection summary statistics for patient age by angina and heart disease status:
age_stats <-
heart_disease %>%
dplyr::group_by(
ExerciseInducedAngina,
HeartDisease
) %>%
dplyr::summarise(
mean = round(mean(Age), 2),
sd = round(sd(Age), 2),
median = median(Age),
iqr = IQR(Age),
.groups = "drop"
) %>%
tidyr::gather(
key = "key",
value = "value",
-ExerciseInducedAngina,
-HeartDisease
)
age_stats
## # A tibble: 16 × 4
## ExerciseInducedAngina HeartDisease key value
## <fct> <fct> <chr> <dbl>
## 1 No No mean 52.4
## 2 No Yes mean 57.1
## 3 Yes No mean 53.6
## 4 Yes Yes mean 56.2
## 5 No No sd 9.68
## 6 No Yes sd 7.56
## 7 Yes No sd 8.54
## 8 Yes Yes sd 8.27
## 9 No No median 52
## 10 No Yes median 58
## 11 Yes No median 53
## 12 Yes Yes median 57
## 13 No No iqr 15
## 14 No Yes iqr 10
## 15 Yes No iqr 12.5
## 16 Yes Yes iqr 8.25
Next, you can specify the string format you want to nicely summarize the information in each group:
age_stats %>%
dplyr::group_by(
ExerciseInducedAngina,
HeartDisease
) %>%
dplyr::summarise(
Summary =
absorb(
key = key,
value = value,
text =
c(
"mean (sd) | median (iqr)"
)
),
.groups = "drop"
)
## # A tibble: 4 × 3
## ExerciseInducedAngina HeartDisease Summary
## <fct> <fct> <chr>
## 1 No No 52.42 (9.68) | 52 (15)
## 2 No Yes 57.1 (7.56) | 58 (10)
## 3 Yes No 53.61 (8.54) | 53 (12.5)
## 4 Yes Yes 56.24 (8.27) | 57 (8.25)
Of course, this is much more useful when the summary data is already
in the format of age_stats
.
In the case where the key pattern found in the string template is repeated, its values are collapsed into a concatenated string:
## [1] "(52.42_57.1_53.61_56.24) / (9.68_7.56_8.54_8.27)"
## [2] "52_58_53_57"
The sep
argument can be used to customize how values are
separated.
It may be useful in some cases to evaluate the resulting string as an expression. You can setup the prior example so that the result contains valid expressions:
absorb(
key = age_stats$key,
value = age_stats$value,
text = c("(mean) / (sd)", "median"),
sep = "+",
evaluate = TRUE,
trace = TRUE
)
## (mean) / (sd) median
## (mean) / (sd) median
## (52.42+57.1+53.61+56.24) / (sd) median
## (52.42+57.1+53.61+56.24) / (sd) 52+58+53+57
## [[1]]
## [1] 6.442584
##
## [[2]]
## [1] 220
These statistics are not entirely useful here, but you get the point.
depths()
and depths_string()
are used for
traversing a list structure to find the elements that satisfy a
predicate. The former produces a vector of the unique depths that
contain at least one element where predicate
is true, and
the latter creates a string representation of the traversal with the
actual positions.
#Make a divided data frame
div_dat1 <-
heart_disease %>%
divide(
Sex,
HeartDisease
)
#Find the depths of the data frames
div_dat1 %>%
depths(
predicate = is.data.frame
)
## [1] 2
div_dat1 %>%
depths_string(
predicate = is.data.frame
)
## [1] "{1{-1,-2},2{-1,-2}}"
In the string result, the brackets represent the level of the list,
and the integers represent the index at that level, which are negative
if predicate
is true for that element.
By default, the algorithm continues when
rlang::is_bare_list()
so that the traversal can end with
list-like objects. However, the bare
argument can be used
to traverse deeper into the list:
div_dat1 %>%
depths_string(
predicate = is.data.frame,
bare = FALSE
)
## [1] "{1{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}},2{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}}}"
Now it also evaluated the columns of each data frame in the list, though it still found the correct positions where the predicate held.
muddle()
can be used to randomly shuffle columns in a
data frame. This is a convenient way to remove confounding when
exploring the effects of a variable on an outcome.
## # A tibble: 303 × 9
## Age Sex ChestPain BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
## <dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl> <fct> <fct>
## 1 43 Female Non-anginal pain 130 214 FALSE 120 Yes No
## 2 44 Male Asymptomatic 130 204 FALSE 143 Yes Yes
## 3 68 Female Asymptomatic 128 242 FALSE 125 Yes No
## 4 35 Female Asymptomatic 120 269 FALSE 163 Yes Yes
## 5 45 Female Atypical angina 138 240 FALSE 169 No No
## 6 54 Male Asymptomatic 128 209 FALSE 182 Yes No
## 7 61 Male Asymptomatic 100 258 FALSE 152 Yes Yes
## 8 57 Female Asymptomatic 152 229 FALSE 142 No No
## 9 67 Male Non-anginal pain 110 283 FALSE 138 No Yes
## 10 51 Female Atypical angina 125 204 TRUE 161 No No
## # … with 293 more rows, and abbreviated variable names ¹Cholesterol,
## # ²BloodSugar, ³MaximumHR, ⁴ExerciseInducedAngina, ⁵HeartDisease
All columns are permuted by default. Use the at
argument
to only shuffle certain ones:
## # A tibble: 303 × 9
## Age Sex ChestPain BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
## <dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl> <fct> <fct>
## 1 47 Male Typical angina 145 233 TRUE 150 No No
## 2 49 Male Asymptomatic 160 286 FALSE 108 Yes Yes
## 3 59 Male Asymptomatic 120 229 FALSE 129 Yes Yes
## 4 35 Male Non-anginal pain 130 250 FALSE 187 No No
## 5 44 Female Atypical angina 130 204 FALSE 172 No No
## 6 29 Male Atypical angina 120 236 FALSE 178 No No
## 7 45 Female Asymptomatic 140 268 FALSE 160 No Yes
## 8 59 Female Asymptomatic 120 354 FALSE 163 Yes No
## 9 46 Male Asymptomatic 130 254 FALSE 147 No Yes
## 10 54 Male Asymptomatic 140 203 TRUE 155 Yes Yes
## # … with 293 more rows, and abbreviated variable names ¹Cholesterol,
## # ²BloodSugar, ³MaximumHR, ⁴ExerciseInducedAngina, ⁵HeartDisease
Additional arguments can be passed to sample()
as well.
For example, you might want five random values from each column:
## # A tibble: 5 × 9
## Age Sex ChestPain BP Choleste…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
## <dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl> <fct> <fct>
## 1 54 Female Asymptomatic 136 275 TRUE 163 No No
## 2 62 Male Asymptomatic 125 271 FALSE 195 No No
## 3 52 Male Non-anginal pain 150 286 FALSE 105 No No
## 4 64 Male Asymptomatic 106 273 FALSE 163 Yes No
## 5 54 Male Atypical angina 120 199 FALSE 182 Yes Yes
## # … with abbreviated variable names ¹Cholesterol, ²BloodSugar, ³MaximumHR,
## # ⁴ExerciseInducedAngina, ⁵HeartDisease
stretch()
is for spanning keys and values across the
columns. It is similar to tidyr::spread()
, but allows for
any number of keys and values to be selected. It’s functionality is more
so targeted at setting up your results to be presented in a table,
rather than for general data manipulation. Thus, the resulting column
names and orderings are setup to make it a seamless transition.
To illustrate, lets first collect the parameter estimates and standard errors from multiple regression models using blood pressure and cholesterol as outcomes, stratified by sex:
mod_results <-
heart_disease %>%
#For each sex
stratiply(
by = Sex,
f =
~.x %>%
#Collect coefficients and standard errors for model estimates on multiple outcomes
dish(
f = function(y, x) {
model <- lm(y ~ ., data = x)
tibble::tibble(
Parameter = names(model$coefficients),
Estimate = model$coefficients,
SE = sqrt(diag(vcov(model)))
)
},
left = c(BP, Cholesterol),
each_right = FALSE
)
) %>%
#Gather results into a single data frame
fasten(
into = c("Sex", "Outcome")
)
mod_results
## # A tibble: 36 × 5
## Sex Outcome Parameter Estimate SE
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Female BP (Intercept) 93.7 24.7
## 2 Female BP Age 0.533 0.210
## 3 Female BP ChestPainAtypical angina -16.3 9.73
## 4 Female BP ChestPainNon-anginal pain -15.3 9.15
## 5 Female BP ChestPainAsymptomatic -14.2 9.64
## 6 Female BP BloodSugarTRUE 6.92 5.53
## 7 Female BP MaximumHR 0.123 0.0993
## 8 Female BP ExerciseInducedAnginaYes 8.00 4.89
## 9 Female BP HeartDiseaseYes 12.1 5.21
## 10 Female Cholesterol (Intercept) -12.1 91.6
## # … with 26 more rows
Now lets span the estimates and standard errors across the columns for both outcomes:
## # A tibble: 18 × 6
## Sex Parameter BP_Estimate BP_SE Cholesterol_Es…¹ Chole…²
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female (Intercept) 93.7 24.7 -12.1 91.6
## 2 Female Age 0.533 0.210 2.30 0.779
## 3 Female BloodSugarTRUE 6.92 5.53 24.6 20.6
## 4 Female ChestPainAsymptomatic -14.2 9.64 34.0 35.8
## 5 Female ChestPainAtypical angina -16.3 9.73 22.7 36.1
## 6 Female ChestPainNon-anginal pain -15.3 9.15 34.1 34.0
## 7 Female ExerciseInducedAnginaYes 8.00 4.89 8.14 18.2
## 8 Female HeartDiseaseYes 12.1 5.21 5.88 19.3
## 9 Female MaximumHR 0.123 0.0993 0.720 0.369
## 10 Male (Intercept) 105. 14.7 151. 38.7
## 11 Male Age 0.480 0.142 0.765 0.374
## 12 Male BloodSugarTRUE 4.96 3.12 -6.85 8.23
## 13 Male ChestPainAsymptomatic -10.0 4.25 3.38 11.2
## 14 Male ChestPainAtypical angina -8.61 4.68 9.55 12.3
## 15 Male ChestPainNon-anginal pain -6.99 4.31 -0.802 11.4
## 16 Male ExerciseInducedAnginaYes -1.08 2.78 6.88 7.33
## 17 Male HeartDiseaseYes 3.07 2.80 13.4 7.39
## 18 Male MaximumHR 0.0392 0.0601 0.239 0.158
## # … with abbreviated variable names ¹Cholesterol_Estimate, ²Cholesterol_SE
The sep
argument controls how the new column names are
concatenated.
Now suppose you wanted to also span sex across the columns. It can just be added to the keys:
## # A tibble: 9 × 9
## Parameter Femal…¹ Femal…² Femal…³ Femal…⁴ Male_B…⁵ Male_…⁶ Male_…⁷ Male_…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 93.7 24.7 -12.1 91.6 105. 14.7 151. 38.7
## 2 Age 0.533 0.210 2.30 0.779 0.480 0.142 0.765 0.374
## 3 BloodSugarTR… 6.92 5.53 24.6 20.6 4.96 3.12 -6.85 8.23
## 4 ChestPainAsy… -14.2 9.64 34.0 35.8 -10.0 4.25 3.38 11.2
## 5 ChestPainAty… -16.3 9.73 22.7 36.1 -8.61 4.68 9.55 12.3
## 6 ChestPainNon… -15.3 9.15 34.1 34.0 -6.99 4.31 -0.802 11.4
## 7 ExerciseIndu… 8.00 4.89 8.14 18.2 -1.08 2.78 6.88 7.33
## 8 HeartDisease… 12.1 5.21 5.88 19.3 3.07 2.80 13.4 7.39
## 9 MaximumHR 0.123 0.0993 0.720 0.369 0.0392 0.0601 0.239 0.158
## # … with abbreviated variable names ¹Female_BP_Estimate, ²Female_BP_SE,
## # ³Female_Cholesterol_Estimate, ⁴Female_Cholesterol_SE, ⁵Male_BP_Estimate,
## # ⁶Male_BP_SE, ⁷Male_Cholesterol_Estimate, ⁸Male_Cholesterol_SE
Notice how the resulting columns are positioned. They are
sequentially sorted starting with the outer-most key in a hierarchical
manner. Since there are multiple value
columns in this
example, their names are also appended to the result in the order in
which they were requested.
grable()
stands for “graded table”. It is used to stack
headers into a knitr::kable()
in an automated fashion by
iteratively parsing the column names by the specified delimiter. The
result of stretch()
was purposely made to flow directly
into this function.
Sex | Parameter | Estimate | SE | Estimate | SE |
---|---|---|---|---|---|
Female | (Intercept) | 93.7187493 | 24.6536900 | -12.1297785 | 91.6044593 |
Female | Age | 0.5330329 | 0.2096016 | 2.2975419 | 0.7788061 |
Female | BloodSugarTRUE | 6.9178015 | 5.5329199 | 24.5796821 | 20.5583885 |
Female | ChestPainAsymptomatic | -14.1810178 | 9.6423236 | 34.0195680 | 35.8274902 |
Female | ChestPainAtypical angina | -16.3247037 | 9.7289213 | 22.6736038 | 36.1492568 |
Female | ChestPainNon-anginal pain | -15.3471449 | 9.1541813 | 34.1239627 | 34.0137249 |
Female | ExerciseInducedAnginaYes | 7.9952541 | 4.8947727 | 8.1371920 | 18.1872573 |
Female | HeartDiseaseYes | 12.0913966 | 5.2058240 | 5.8795266 | 19.3430149 |
Female | MaximumHR | 0.1226586 | 0.0992801 | 0.7201695 | 0.3688901 |
Male | (Intercept) | 105.2923383 | 14.6933205 | 150.9743256 | 38.7467039 |
Male | Age | 0.4800104 | 0.1417168 | 0.7654107 | 0.3737111 |
Male | BloodSugarTRUE | 4.9616363 | 3.1197639 | -6.8540853 | 8.2269061 |
Male | ChestPainAsymptomatic | -10.0474949 | 4.2547960 | 3.3795322 | 11.2200180 |
Male | ChestPainAtypical angina | -8.6148546 | 4.6798629 | 9.5540927 | 12.3409315 |
Male | ChestPainNon-anginal pain | -6.9875791 | 4.3056911 | -0.8015539 | 11.3542299 |
Male | ExerciseInducedAnginaYes | -1.0827067 | 2.7780388 | 6.8846691 | 7.3257672 |
Male | HeartDiseaseYes | 3.0740495 | 2.8029917 | 13.4020228 | 7.3915687 |
Male | MaximumHR | 0.0391572 | 0.0600935 | 0.2387579 | 0.1584683 |
The at
argument specifies which columns should be
spanned in the header. Let’s try it with the two-key example from
above:
Parameter | Estimate | SE | Estimate | SE | Estimate | SE | Estimate | SE |
---|---|---|---|---|---|---|---|---|
(Intercept) | 93.7187493 | 24.6536900 | -12.1297785 | 91.6044593 | 105.2923383 | 14.6933205 | 150.9743256 | 38.7467039 |
Age | 0.5330329 | 0.2096016 | 2.2975419 | 0.7788061 | 0.4800104 | 0.1417168 | 0.7654107 | 0.3737111 |
BloodSugarTRUE | 6.9178015 | 5.5329199 | 24.5796821 | 20.5583885 | 4.9616363 | 3.1197639 | -6.8540853 | 8.2269061 |
ChestPainAsymptomatic | -14.1810178 | 9.6423236 | 34.0195680 | 35.8274902 | -10.0474949 | 4.2547960 | 3.3795322 | 11.2200180 |
ChestPainAtypical angina | -16.3247037 | 9.7289213 | 22.6736038 | 36.1492568 | -8.6148546 | 4.6798629 | 9.5540927 | 12.3409315 |
ChestPainNon-anginal pain | -15.3471449 | 9.1541813 | 34.1239627 | 34.0137249 | -6.9875791 | 4.3056911 | -0.8015539 | 11.3542299 |
ExerciseInducedAnginaYes | 7.9952541 | 4.8947727 | 8.1371920 | 18.1872573 | -1.0827067 | 2.7780388 | 6.8846691 | 7.3257672 |
HeartDiseaseYes | 12.0913966 | 5.2058240 | 5.8795266 | 19.3430149 | 3.0740495 | 2.8029917 | 13.4020228 | 7.3915687 |
MaximumHR | 0.1226586 | 0.0992801 | 0.7201695 | 0.3688901 | 0.0391572 | 0.0600935 | 0.2387579 | 0.1584683 |
The tables can then be piped through subsequent customized processing
with kableExtra
tools.
descriptives()
produces a data frame in long format with
a collection of descriptive statistics.
heart_disease %>%
descriptives()
## # A tibble: 111 × 9
## fun_eval fun_key col_ind col_lab val_ind val_lab val_dbl val_chr val_cbn
## <chr> <chr> <int> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 all available 1 Age 1 NA 303 NA 303
## 2 all available 2 Sex 1 NA 303 NA 303
## 3 all available 3 ChestPain 1 NA 303 NA 303
## 4 all available 4 BP 1 NA 303 NA 303
## 5 all available 5 Cholester… 1 NA 303 NA 303
## 6 all available 6 BloodSugar 1 NA 303 NA 303
## 7 all available 7 MaximumHR 1 NA 303 NA 303
## 8 all available 8 ExerciseI… 1 NA 303 NA 303
## 9 all available 9 HeartDise… 1 NA 303 NA 303
## 10 all class 1 Age 1 NA NA numeric numeric
## # … with 101 more rows
univariate_associations()
is a special case of
dish()
specifically for computing association metrics of
one or more responses
with a collection of
predictors
.
heart_disease %>%
univariate_associations(
f = list(AIC = function(y, x) AIC(lm(y ~ x))),
responses = c(BP, Cholesterol)
)
## # A tibble: 14 × 3
## response predictor AIC
## <chr> <chr> <dbl>
## 1 BP Age 2577.
## 2 BP Sex 2602.
## 3 BP ChestPain 2598.
## 4 BP BloodSugar 2593.
## 5 BP MaximumHR 2602.
## 6 BP ExerciseInducedAngina 2602.
## 7 BP HeartDisease 2596.
## 8 Cholesterol Age 3243.
## 9 Cholesterol Sex 3244.
## 10 Cholesterol ChestPain 3259.
## 11 Cholesterol BloodSugar 3257.
## 12 Cholesterol MaximumHR 3257.
## 13 Cholesterol ExerciseInducedAngina 3256.
## 14 Cholesterol HeartDisease 3255.
univariate_table()
allows you to create a custom
descriptive table for a data set. It uses almost every function in the
package across its span of capabilities.
heart_disease %>%
univariate_table()
Variable | Level | Summary |
---|---|---|
Age | 56 (48, 61) | |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | 130 (120, 140) | |
Cholesterol | 241 (211, 275) | |
MaximumHR | 153 (133.5, 166) | |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
See vignette("describe")
for a more in-depth
introduction to these functions.