Once a dataset is cleaned and ready for statistical analysis, the
first step is typically to summarize it. The `univariate_table()`
function makes it easy to create a custom descriptive analysis while
consistently producing clean, presentation-ready output. It is built to
integrate directly into your analysis workflow (e.g. R Markdown), but it
can also be called from the console and rendered in a number of formats.
```r
## Loading required package: cheese
heart_disease %>%
  univariate_table()
```

Variable | Level | Summary |
---|---|---|
Age | | 56 (48, 61) |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | | 130 (120, 140) |
Cholesterol | | 241 (211, 275) |
MaximumHR | | 153 (133.5, 166) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
By default, an HTML table is produced containing descriptive statistics for columns in the dataset.
In the table above, the summary statistics are presented within the
cells in a particular format for each type of data. You can use the
`_summary` arguments to customize not only how the results are
presented, but also which values go into the results themselves.
Suppose that instead of the `"median (q1, q3)"` displayed for numeric
data, you want `"mean [sd] / median"`, in that exact format:
```r
heart_disease %>%
  univariate_table(
    numeric_summary =
      c(
        Summary = "mean [sd] / median"
      )
  )
```

Variable | Level | Summary |
---|---|---|
Age | | 54.44 [9.04] / 56 |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | | 131.69 [17.6] / 130 |
Cholesterol | | 246.69 [51.78] / 241 |
MaximumHR | | 149.61 [22.88] / 153 |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
The name `Summary` was used to ensure that the result for the numeric
data was bound into the same column as the result for the other data
types. If you chose to name it something else, you’d get a new column
with those summaries:
```r
heart_disease %>%
  univariate_table(
    numeric_summary =
      c(
        NewSummary = "mean [sd] / median"
      )
  )
```

Variable | Level | NewSummary | Summary |
---|---|---|---|
Age | | 54.44 [9.04] / 56 | |
Sex | Female | | 97 (32.01%) |
Sex | Male | | 206 (67.99%) |
ChestPain | Typical angina | | 23 (7.59%) |
ChestPain | Atypical angina | | 50 (16.5%) |
ChestPain | Non-anginal pain | | 86 (28.38%) |
ChestPain | Asymptomatic | | 144 (47.52%) |
BP | | 131.69 [17.6] / 130 | |
Cholesterol | | 246.69 [51.78] / 241 | |
MaximumHR | | 149.61 [22.88] / 153 | |
ExerciseInducedAngina | No | | 204 (67.33%) |
ExerciseInducedAngina | Yes | | 99 (32.67%) |
HeartDisease | No | | 164 (54.13%) |
HeartDisease | Yes | | 139 (45.87%) |
You can add as many summary columns as you want separately for each type of data:
```r
heart_disease %>%
  univariate_table(
    numeric_summary =
      c(
        `Numeric only` = "mean [sd] / median",
        Summary = "median (q1, q3)"
      ),
    categorical_summary =
      c(
        Summary = "count",
        `Categorical only` = "percent = 100 * proportion"
      )
  )
```

Variable | Level | Numeric only | Summary | Categorical only |
---|---|---|---|---|
Age | | 54.44 [9.04] / 56 | 56 (48, 61) | |
Sex | Female | | 97 | 32.01 = 100 * 0.32 |
Sex | Male | | 206 | 67.99 = 100 * 0.68 |
ChestPain | Typical angina | | 23 | 7.59 = 100 * 0.08 |
ChestPain | Atypical angina | | 50 | 16.5 = 100 * 0.17 |
ChestPain | Non-anginal pain | | 86 | 28.38 = 100 * 0.28 |
ChestPain | Asymptomatic | | 144 | 47.52 = 100 * 0.48 |
BP | | 131.69 [17.6] / 130 | 130 (120, 140) | |
Cholesterol | | 246.69 [51.78] / 241 | 241 (211, 275) | |
MaximumHR | | 149.61 [22.88] / 153 | 153 (133.5, 166) | |
ExerciseInducedAngina | No | | 204 | 67.33 = 100 * 0.67 |
ExerciseInducedAngina | Yes | | 99 | 32.67 = 100 * 0.33 |
HeartDisease | No | | 164 | 54.13 = 100 * 0.54 |
HeartDisease | Yes | | 139 | 45.87 = 100 * 0.46 |
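The relationship between `count`, `proportion`, and `percent` in the categorical summaries can be checked by hand. The following is a minimal base R sketch, not the package's own code; it reuses the `Sex` counts shown in the tables above and assumes two-digit rounding, which is what the displayed values suggest.

```r
# Counts for Sex, copied from the tables above
counts <- c(Female = 97, Male = 206)

# Proportion and percent within the variable
proportion <- counts / sum(counts)
percent <- 100 * proportion

# Rebuild the "percent = 100 * proportion" cells
percent_cells <- paste0(round(percent, 2), " = 100 * ", round(proportion, 2))

# Rebuild the default "count (percent%)" cells
default_cells <- paste0(counts, " (", round(percent, 2), "%)")

print(percent_cells)
print(default_cells)
```

The rebuilt strings match the `Sex` rows of the corresponding tables, which is a quick way to confirm what each keyword in a template refers to.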
A more visually appealing case for adding multiple summaries is probably when all the data is the same type:
```r
heart_disease %>%
  univariate_table(
    categorical_types = NULL, #Easily disable categorical data from being summarized
    numeric_summary =
      c(
        `Median (Q1, Q3)` = "median (q1, q3)",
        `Min-Max` = "min - max",
        `Mean (SD)` = "mean (sd)"
      )
  )
```

Variable | Median (Q1, Q3) | Min-Max | Mean (SD) |
---|---|---|---|
Age | 56 (48, 61) | 29 - 77 | 54.44 (9.04) |
BP | 130 (120, 140) | 94 - 200 | 131.69 (17.6) |
Cholesterol | 241 (211, 275) | 126 - 564 | 246.69 (51.78) |
MaximumHR | 153 (133.5, 166) | 71 - 202 | 149.61 (22.88) |
Or when adding a summary that applies to all columns:
```r
heart_disease %>%
  univariate_table(
    all_summary =
      c(
        `# obs. non-missing` = "available of length"
      )
  )
```

Variable | Level | Summary | # obs. non-missing |
---|---|---|---|
Age | | 56 (48, 61) | 303 of 303 |
Sex | | | 303 of 303 |
Sex | Female | 97 (32.01%) | |
Sex | Male | 206 (67.99%) | |
ChestPain | | | 303 of 303 |
ChestPain | Typical angina | 23 (7.59%) | |
ChestPain | Atypical angina | 50 (16.5%) | |
ChestPain | Non-anginal pain | 86 (28.38%) | |
ChestPain | Asymptomatic | 144 (47.52%) | |
BP | | 130 (120, 140) | 303 of 303 |
Cholesterol | | 241 (211, 275) | 303 of 303 |
BloodSugar | | | 303 of 303 |
MaximumHR | | 153 (133.5, 166) | 303 of 303 |
ExerciseInducedAngina | | | 303 of 303 |
ExerciseInducedAngina | No | 204 (67.33%) | |
ExerciseInducedAngina | Yes | 99 (32.67%) | |
HeartDisease | | | 303 of 303 |
HeartDisease | No | 164 (54.13%) | |
HeartDisease | Yes | 139 (45.87%) | |
These add an extra row for categorical variables. You may have also
noticed that the `BloodSugar` column didn’t show up in the table until
the `all_summary` argument was used. This is because it is not
classified as numeric or categorical data, and thus is not evaluated by
default. See the “Backend functionality” section to learn more.
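To make the idea of type classification concrete, here is a small base R sketch. It is not how the package is implemented: `classify_column()` and its argument names are hypothetical, the toy data frame is invented for illustration, and the defaults simply mirror the statement that only factor and numeric columns are summarized by default.

```r
# Hypothetical helper (not a cheese function): label a column by its class
classify_column <- function(x,
                            categorical_types = "factor",
                            numeric_types = "numeric") {
  if (any(class(x) %in% categorical_types)) {
    "categorical"
  } else if (any(class(x) %in% numeric_types)) {
    "numeric"
  } else {
    "other"
  }
}

# Toy columns invented for illustration
toy <- data.frame(
  Age = c(63, 41, 56),
  Sex = factor(c("Male", "Female", "Male")),
  BloodSugar = c(TRUE, FALSE, FALSE)
)

# With the default types, the logical column falls into "other"
default_classes <- sapply(toy, classify_column)

# Widening the categorical types picks it up as categorical
widened_classes <- sapply(
  toy,
  classify_column,
  categorical_types = c("factor", "logical")
)

print(default_classes)
print(widened_classes)
```

Under these assumptions, a column that lands in "other" has no numeric or categorical summary attached to it, which is consistent with it only appearing once a summary that applies to all columns is requested.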
The `strata` argument takes a `formula()` that can be used to stratify
the analysis by any number of variables. Columns on the left side will
appear down the rows, and columns on the right side will spread across
the columns. You can use `+` on either side to specify more than one
column. Let’s start by stratifying by `Sex` across the columns:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex
  )
```

Variable | Level | Female | Male |
---|---|---|---|
Age | | 57 (50, 63) | 54.5 (47, 59.75) |
ChestPain | Typical angina | 4 (4.12%) | 19 (9.22%) |
ChestPain | Atypical angina | 18 (18.56%) | 32 (15.53%) |
ChestPain | Non-anginal pain | 35 (36.08%) | 51 (24.76%) |
ChestPain | Asymptomatic | 40 (41.24%) | 104 (50.49%) |
BP | | 132 (120, 140) | 130 (120, 140) |
Cholesterol | | 254 (215, 302) | 235 (208.75, 268.5) |
MaximumHR | | 157 (142, 165) | 150.5 (132, 167.5) |
ExerciseInducedAngina | No | 75 (77.32%) | 129 (62.62%) |
ExerciseInducedAngina | Yes | 22 (22.68%) | 77 (37.38%) |
HeartDisease | No | 72 (74.23%) | 92 (44.66%) |
HeartDisease | Yes | 25 (25.77%) | 114 (55.34%) |
You can do the same thing down the rows:
```r
heart_disease %>%
  univariate_table(
    strata = Sex ~ 1
  )
```

Sex | Variable | Level | Summary |
---|---|---|---|
Female | Age | | 57 (50, 63) |
Female | ChestPain | Typical angina | 4 (4.12%) |
Female | ChestPain | Atypical angina | 18 (18.56%) |
Female | ChestPain | Non-anginal pain | 35 (36.08%) |
Female | ChestPain | Asymptomatic | 40 (41.24%) |
Female | BP | | 132 (120, 140) |
Female | Cholesterol | | 254 (215, 302) |
Female | MaximumHR | | 157 (142, 165) |
Female | ExerciseInducedAngina | No | 75 (77.32%) |
Female | ExerciseInducedAngina | Yes | 22 (22.68%) |
Female | HeartDisease | No | 72 (74.23%) |
Female | HeartDisease | Yes | 25 (25.77%) |
Male | Age | | 54.5 (47, 59.75) |
Male | ChestPain | Typical angina | 19 (9.22%) |
Male | ChestPain | Atypical angina | 32 (15.53%) |
Male | ChestPain | Non-anginal pain | 51 (24.76%) |
Male | ChestPain | Asymptomatic | 104 (50.49%) |
Male | BP | | 130 (120, 140) |
Male | Cholesterol | | 235 (208.75, 268.5) |
Male | MaximumHR | | 150.5 (132, 167.5) |
Male | ExerciseInducedAngina | No | 129 (62.62%) |
Male | ExerciseInducedAngina | Yes | 77 (37.38%) |
Male | HeartDisease | No | 92 (44.66%) |
Male | HeartDisease | Yes | 114 (55.34%) |
Or even both:
```r
heart_disease %>%
  univariate_table(
    strata = Sex ~ HeartDisease
  )
```

Sex | Variable | Level | No | Yes |
---|---|---|---|---|
Female | Age | | 54 (46, 63.25) | 60 (57, 62) |
Female | ChestPain | Typical angina | 4 (5.56%) | 0 (0%) |
Female | ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) |
Female | ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) |
Female | ChestPain | Asymptomatic | 18 (25%) | 22 (88%) |
Female | BP | | 130 (119.5, 140) | 140 (130, 158) |
Female | Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) |
Female | MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) |
Female | ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) |
Female | ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) |
Male | Age | | 52 (44, 57) | 57.5 (51, 61) |
Male | ChestPain | Typical angina | 12 (13.04%) | 7 (6.14%) |
Male | ChestPain | Atypical angina | 25 (27.17%) | 7 (6.14%) |
Male | ChestPain | Non-anginal pain | 34 (36.96%) | 17 (14.91%) |
Male | ChestPain | Asymptomatic | 21 (22.83%) | 83 (72.81%) |
Male | BP | | 130 (120, 140) | 130 (120, 140) |
Male | Cholesterol | | 229.5 (206.5, 250.75) | 247.5 (212, 282) |
Male | MaximumHR | | 163 (150, 175.75) | 141 (125, 156) |
Male | ExerciseInducedAngina | No | 77 (83.7%) | 52 (45.61%) |
Male | ExerciseInducedAngina | Yes | 15 (16.3%) | 62 (54.39%) |
Now suppose you want both stratification variables across the columns:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex + HeartDisease
  )
```

Variable | Level | Female: No | Female: Yes | Male: No | Male: Yes |
---|---|---|---|---|---|
Age | | 54 (46, 63.25) | 60 (57, 62) | 52 (44, 57) | 57.5 (51, 61) |
ChestPain | Typical angina | 4 (5.56%) | 0 (0%) | 12 (13.04%) | 7 (6.14%) |
ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) | 25 (27.17%) | 7 (6.14%) |
ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) | 34 (36.96%) | 17 (14.91%) |
ChestPain | Asymptomatic | 18 (25%) | 22 (88%) | 21 (22.83%) | 83 (72.81%) |
BP | | 130 (119.5, 140) | 140 (130, 158) | 130 (120, 140) | 130 (120, 140) |
Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) | 229.5 (206.5, 250.75) | 247.5 (212, 282) |
MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) | 163 (150, 175.75) | 141 (125, 156) |
ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) | 77 (83.7%) | 52 (45.61%) |
ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) | 15 (16.3%) | 62 (54.39%) |
The levels will span the columns in a hierarchical fashion depending on their order in the formula:
```r
heart_disease %>%
  univariate_table(
    strata = ~ HeartDisease + Sex
  )
```

Variable | Level | No: Female | No: Male | Yes: Female | Yes: Male |
---|---|---|---|---|---|
Age | | 54 (46, 63.25) | 52 (44, 57) | 60 (57, 62) | 57.5 (51, 61) |
ChestPain | Typical angina | 4 (5.56%) | 12 (13.04%) | 0 (0%) | 7 (6.14%) |
ChestPain | Atypical angina | 16 (22.22%) | 25 (27.17%) | 2 (8%) | 7 (6.14%) |
ChestPain | Non-anginal pain | 34 (47.22%) | 34 (36.96%) | 1 (4%) | 17 (14.91%) |
ChestPain | Asymptomatic | 18 (25%) | 21 (22.83%) | 22 (88%) | 83 (72.81%) |
BP | | 130 (119.5, 140) | 130 (120, 140) | 140 (130, 158) | 130 (120, 140) |
Cholesterol | | 249 (210.75, 289.5) | 229.5 (206.5, 250.75) | 268 (236, 307) | 247.5 (212, 282) |
MaximumHR | | 159 (146.75, 167.25) | 163 (150, 175.75) | 146 (133, 157) | 141 (125, 156) |
ExerciseInducedAngina | No | 64 (88.89%) | 77 (83.7%) | 11 (44%) | 52 (45.61%) |
ExerciseInducedAngina | Yes | 8 (11.11%) | 15 (16.3%) | 14 (56%) | 62 (54.39%) |
Similarly, the rows also collapse hierarchically:
```r
heart_disease %>%
  univariate_table(
    strata = HeartDisease + Sex ~ 1
  )
```

HeartDisease | Sex | Variable | Level | Summary |
---|---|---|---|---|
No | Female | Age | | 54 (46, 63.25) |
No | Female | ChestPain | Typical angina | 4 (5.56%) |
No | Female | ChestPain | Atypical angina | 16 (22.22%) |
No | Female | ChestPain | Non-anginal pain | 34 (47.22%) |
No | Female | ChestPain | Asymptomatic | 18 (25%) |
No | Female | BP | | 130 (119.5, 140) |
No | Female | Cholesterol | | 249 (210.75, 289.5) |
No | Female | MaximumHR | | 159 (146.75, 167.25) |
No | Female | ExerciseInducedAngina | No | 64 (88.89%) |
No | Female | ExerciseInducedAngina | Yes | 8 (11.11%) |
No | Male | Age | | 52 (44, 57) |
No | Male | ChestPain | Typical angina | 12 (13.04%) |
No | Male | ChestPain | Atypical angina | 25 (27.17%) |
No | Male | ChestPain | Non-anginal pain | 34 (36.96%) |
No | Male | ChestPain | Asymptomatic | 21 (22.83%) |
No | Male | BP | | 130 (120, 140) |
No | Male | Cholesterol | | 229.5 (206.5, 250.75) |
No | Male | MaximumHR | | 163 (150, 175.75) |
No | Male | ExerciseInducedAngina | No | 77 (83.7%) |
No | Male | ExerciseInducedAngina | Yes | 15 (16.3%) |
Yes | Female | Age | | 60 (57, 62) |
Yes | Female | ChestPain | Typical angina | 0 (0%) |
Yes | Female | ChestPain | Atypical angina | 2 (8%) |
Yes | Female | ChestPain | Non-anginal pain | 1 (4%) |
Yes | Female | ChestPain | Asymptomatic | 22 (88%) |
Yes | Female | BP | | 140 (130, 158) |
Yes | Female | Cholesterol | | 268 (236, 307) |
Yes | Female | MaximumHR | | 146 (133, 157) |
Yes | Female | ExerciseInducedAngina | No | 11 (44%) |
Yes | Female | ExerciseInducedAngina | Yes | 14 (56%) |
Yes | Male | Age | | 57.5 (51, 61) |
Yes | Male | ChestPain | Typical angina | 7 (6.14%) |
Yes | Male | ChestPain | Atypical angina | 7 (6.14%) |
Yes | Male | ChestPain | Non-anginal pain | 17 (14.91%) |
Yes | Male | ChestPain | Asymptomatic | 83 (72.81%) |
Yes | Male | BP | | 130 (120, 140) |
Yes | Male | Cholesterol | | 247.5 (212, 282) |
Yes | Male | MaximumHR | | 141 (125, 156) |
Yes | Male | ExerciseInducedAngina | No | 52 (45.61%) |
Yes | Male | ExerciseInducedAngina | Yes | 62 (54.39%) |
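The row and column roles in a `strata` formula can be illustrated with base R alone. The sketch below is only an illustration of how a formula's two sides can be read; `split_strata()` is a hypothetical helper, not a function from the package, and it says nothing about how `univariate_table()` parses the formula internally.

```r
# Hypothetical helper (not a cheese function): read the variables on each
# side of a strata formula, in the order they were written
split_strata <- function(f) {
  stopifnot(inherits(f, "formula"))
  if (length(f) == 3) {
    # Two-sided formula: left side goes down the rows, right side across columns
    list(rows = all.vars(f[[2]]), columns = all.vars(f[[3]]))
  } else {
    # One-sided formula: only column strata are present
    list(rows = character(0), columns = all.vars(f[[2]]))
  }
}

columns_only <- split_strata(~ Sex + HeartDisease)
rows_only <- split_strata(HeartDisease + Sex ~ 1)
both_sides <- split_strata(Sex ~ HeartDisease)

str(columns_only)
str(rows_only)
str(both_sides)
```

Because `all.vars()` returns names in the order they appear, the order written in the formula is preserved, which is the order that determines the hierarchy in the examples above.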
You can use any of the functionality described in the previous section with stratification variables as well:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex + HeartDisease,
    numeric_summary =
      c(
        `Mean (SD)` = "mean (sd)"
      ),
    categorical_summary =
      c(
        `Count (%)` = "count (percent%)"
      )
  )
```

Variable | Level | Female: No: Mean (SD) | Female: No: Count (%) | Female: Yes: Mean (SD) | Female: Yes: Count (%) | Male: No: Mean (SD) | Male: No: Count (%) | Male: Yes: Mean (SD) | Male: Yes: Count (%) |
---|---|---|---|---|---|---|---|---|---|
Age | | 54.56 (10.27) | | 59.08 (4.86) | | 51.04 (8.62) | | 56.09 (8.39) | |
ChestPain | Typical angina | | 4 (5.56%) | | 0 (0%) | | 12 (13.04%) | | 7 (6.14%) |
ChestPain | Atypical angina | | 16 (22.22%) | | 2 (8%) | | 25 (27.17%) | | 7 (6.14%) |
ChestPain | Non-anginal pain | | 34 (47.22%) | | 1 (4%) | | 34 (36.96%) | | 17 (14.91%) |
ChestPain | Asymptomatic | | 18 (25%) | | 22 (88%) | | 21 (22.83%) | | 83 (72.81%) |
BP | | 128.74 (16.54) | | 146.6 (21.12) | | 129.65 (16.02) | | 131.93 (17.22) | |
Cholesterol | | 256.75 (66.22) | | 276.16 (59.88) | | 231.6 (37.64) | | 246.06 (45.44) | |
MaximumHR | | 154.03 (19.25) | | 143.16 (20.18) | | 161.78 (18.56) | | 138.4 (23.08) | |
ExerciseInducedAngina | No | | 64 (88.89%) | | 11 (44%) | | 77 (83.7%) | | 52 (45.61%) |
ExerciseInducedAngina | Yes | | 8 (11.11%) | | 14 (56%) | | 15 (16.3%) | | 62 (54.39%) |
The summary columns simply get added to the column-spanning hierarchy.
The `add_n` argument will add the sample size to the label for the
stratification group:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex,
    add_n = TRUE
  )
```

Variable | Level | Female (N=97) | Male (N=206) |
---|---|---|---|
Age | | 57 (50, 63) | 54.5 (47, 59.75) |
ChestPain | Typical angina | 4 (4.12%) | 19 (9.22%) |
ChestPain | Atypical angina | 18 (18.56%) | 32 (15.53%) |
ChestPain | Non-anginal pain | 35 (36.08%) | 51 (24.76%) |
ChestPain | Asymptomatic | 40 (41.24%) | 104 (50.49%) |
BP | | 132 (120, 140) | 130 (120, 140) |
Cholesterol | | 254 (215, 302) | 235 (208.75, 268.5) |
MaximumHR | | 157 (142, 165) | 150.5 (132, 167.5) |
ExerciseInducedAngina | No | 75 (77.32%) | 129 (62.62%) |
ExerciseInducedAngina | Yes | 22 (22.68%) | 77 (37.38%) |
HeartDisease | No | 72 (74.23%) | 92 (44.66%) |
HeartDisease | Yes | 25 (25.77%) | 114 (55.34%) |
When multiple stratification variables are added on one side of the formula, the sample size will show up on the lowest level of the hierarchy, excluding summary columns:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex + HeartDisease,
    add_n = TRUE
  )
```

Variable | Level | Female: No (N=72) | Female: Yes (N=25) | Male: No (N=92) | Male: Yes (N=114) |
---|---|---|---|---|---|
Age | | 54 (46, 63.25) | 60 (57, 62) | 52 (44, 57) | 57.5 (51, 61) |
ChestPain | Typical angina | 4 (5.56%) | 0 (0%) | 12 (13.04%) | 7 (6.14%) |
ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) | 25 (27.17%) | 7 (6.14%) |
ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) | 34 (36.96%) | 17 (14.91%) |
ChestPain | Asymptomatic | 18 (25%) | 22 (88%) | 21 (22.83%) | 83 (72.81%) |
BP | | 130 (119.5, 140) | 140 (130, 158) | 130 (120, 140) | 130 (120, 140) |
Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) | 229.5 (206.5, 250.75) | 247.5 (212, 282) |
MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) | 163 (150, 175.75) | 141 (125, 156) |
ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) | 77 (83.7%) | 52 (45.61%) |
ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) | 15 (16.3%) | 62 (54.39%) |
A limitation is that when sample size is added in the presence of row and column strata, it is displayed for the marginal groups only:
```r
heart_disease %>%
  univariate_table(
    strata = Sex ~ HeartDisease,
    add_n = TRUE
  )
```

Sex | Variable | Level | No (N=164) | Yes (N=139) |
---|---|---|---|---|
Female (N=97) | Age | | 54 (46, 63.25) | 60 (57, 62) |
Female (N=97) | ChestPain | Typical angina | 4 (5.56%) | 0 (0%) |
Female (N=97) | ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) |
Female (N=97) | ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) |
Female (N=97) | ChestPain | Asymptomatic | 18 (25%) | 22 (88%) |
Female (N=97) | BP | | 130 (119.5, 140) | 140 (130, 158) |
Female (N=97) | Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) |
Female (N=97) | MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) |
Female (N=97) | ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) |
Female (N=97) | ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) |
Male (N=206) | Age | | 52 (44, 57) | 57.5 (51, 61) |
Male (N=206) | ChestPain | Typical angina | 12 (13.04%) | 7 (6.14%) |
Male (N=206) | ChestPain | Atypical angina | 25 (27.17%) | 7 (6.14%) |
Male (N=206) | ChestPain | Non-anginal pain | 34 (36.96%) | 17 (14.91%) |
Male (N=206) | ChestPain | Asymptomatic | 21 (22.83%) | 83 (72.81%) |
Male (N=206) | BP | | 130 (120, 140) | 130 (120, 140) |
Male (N=206) | Cholesterol | | 229.5 (206.5, 250.75) | 247.5 (212, 282) |
Male (N=206) | MaximumHR | | 163 (150, 175.75) | 141 (125, 156) |
Male (N=206) | ExerciseInducedAngina | No | 77 (83.7%) | 52 (45.61%) |
Male (N=206) | ExerciseInducedAngina | Yes | 15 (16.3%) | 62 (54.39%) |
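The difference between the marginal sizes shown here and the within-cell sizes shown when both variables are on one side of the formula can be reproduced from the four cell counts reported above. This is a base R sketch for illustration only; the label format is copied from the tables, and the code is not taken from the package.

```r
# Cell counts copied from the tables above (Sex by HeartDisease)
cell_n <- matrix(
  c(72, 25,
    92, 114),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"), HeartDisease = c("No", "Yes"))
)

# Marginal sizes, as displayed when row and column strata are combined
row_n <- rowSums(cell_n)
col_n <- colSums(cell_n)

row_labels <- paste0(names(row_n), " (N=", row_n, ")")
col_labels <- paste0(names(col_n), " (N=", col_n, ")")

print(row_labels)
print(col_labels)
```

The row and column labels built this way match the headers in the last table, while the individual entries of `cell_n` are the sizes that appear when both variables are placed on the same side of the formula.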
Often when a descriptive analysis is stratified by one or more
variables, it is also of interest to add statistics that compare each
variable across the groups. The `associations` argument allows you to
add a list containing an unlimited number of functions that can produce
a scalar value to be placed in the table. First, let’s define a
function:
```r
#Function for a p-value
pval <-
  function(y, x) {
    #For categorical data use Fisher's Exact test
    if(some_type(x, "factor")) {
      p <- fisher.test(factor(y), factor(x), simulate.p.value = TRUE)$p.value
    #Otherwise use Kruskal-Wallis
    } else {
      p <- kruskal.test(x, factor(y))$p.value
    }
    ifelse(p < 0.001, "<0.001", as.character(round(p, 2)))
  }
```
The stratification variable will be placed in the second argument of the function(s) provided. Now you can add it to the function call:
```r
heart_disease %>%
  univariate_table(
    strata = ~ HeartDisease,
    associations = list(`P-value` = pval)
  )
```

Variable | Level | No | Yes | P-value |
---|---|---|---|---|
Age | | 52 (44.75, 59) | 58 (52, 62) | 0.12 |
Sex | | | | <0.001 |
Sex | Female | 72 (43.9%) | 25 (17.99%) | |
Sex | Male | 92 (56.1%) | 114 (82.01%) | |
ChestPain | | | | <0.001 |
ChestPain | Typical angina | 16 (9.76%) | 7 (5.04%) | |
ChestPain | Atypical angina | 41 (25%) | 9 (6.47%) | |
ChestPain | Non-anginal pain | 68 (41.46%) | 18 (12.95%) | |
ChestPain | Asymptomatic | 39 (23.78%) | 105 (75.54%) | |
BP | | 130 (120, 140) | 130 (120, 145) | 0.51 |
Cholesterol | | 234.5 (208.75, 267.25) | 249 (217.5, 283.5) | 0.11 |
MaximumHR | | 161 (148.75, 172) | 142 (125, 156.5) | 0.08 |
ExerciseInducedAngina | | | | <0.001 |
ExerciseInducedAngina | No | 141 (85.98%) | 63 (45.32%) | |
ExerciseInducedAngina | Yes | 23 (14.02%) | 76 (54.68%) | |
The name of the function in the list is what becomes the column label.
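The two base R tests that `pval` chooses between, and the formatting rule in its last line, can also be tried on their own. The sketch below uses toy vectors invented purely for illustration and a hypothetical helper, `format_p()`, that copies the final line of `pval`; it does not involve the package and makes no claim about how arguments are passed to association functions.

```r
# Same formatting rule as the last line of pval (hypothetical helper name)
format_p <- function(p) {
  ifelse(p < 0.001, "<0.001", as.character(round(p, 2)))
}

# Toy data, invented purely for illustration
group <- factor(c("a", "a", "a", "b", "b", "b"))
value <- c(1.2, 3.4, 2.2, 5.1, 4.8, 6.0)
level <- factor(c("u", "u", "v", "v", "v", "u"))

# Numeric variable compared across groups: Kruskal-Wallis rank sum test
p_numeric <- kruskal.test(value, group)$p.value

# Categorical variable compared across groups: Fisher's exact test
p_categorical <- fisher.test(level, group)$p.value

# Each result is a single string, which is what a table cell needs
print(format_p(p_numeric))
print(format_p(p_categorical))
print(format_p(0.0004))
```

Returning a formatted string rather than a raw number is a design choice: the table then displays exactly what the function returns, including labels such as `"<0.001"`.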
The comparison will take place across the number of subgroups there are within the column stratification:
```r
heart_disease %>%
  univariate_table(
    strata = ~ Sex + HeartDisease,
    associations = list(`P-value` = pval)
  )
```

Variable | Level | Female: No | Female: Yes | Male: No | Male: Yes | P-value |
---|---|---|---|---|---|---|
Age | | 54 (46, 63.25) | 60 (57, 62) | 52 (44, 57) | 57.5 (51, 61) | 0.53 |
ChestPain | | | | | | <0.001 |
ChestPain | Typical angina | 4 (5.56%) | 0 (0%) | 12 (13.04%) | 7 (6.14%) | |
ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) | 25 (27.17%) | 7 (6.14%) | |
ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) | 34 (36.96%) | 17 (14.91%) | |
ChestPain | Asymptomatic | 18 (25%) | 22 (88%) | 21 (22.83%) | 83 (72.81%) | |
BP | | 130 (119.5, 140) | 140 (130, 158) | 130 (120, 140) | 130 (120, 140) | 0.55 |
Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) | 229.5 (206.5, 250.75) | 247.5 (212, 282) | 0.11 |
MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) | 163 (150, 175.75) | 141 (125, 156) | 0.01 |
ExerciseInducedAngina | | | | | | <0.001 |
ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) | 77 (83.7%) | 52 (45.61%) | |
ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) | 15 (16.3%) | 62 (54.39%) | |
However, using a row stratification restricts the comparisons to within those groups:
```r
heart_disease %>%
  univariate_table(
    strata = Sex ~ HeartDisease,
    associations = list(`P-value` = pval)
  )
```

Sex | Variable | Level | No | Yes | P-value |
---|---|---|---|---|---|
Female | Age | | 54 (46, 63.25) | 60 (57, 62) | 0.17 |
Female | ChestPain | | | | <0.001 |
Female | ChestPain | Typical angina | 4 (5.56%) | 0 (0%) | |
Female | ChestPain | Atypical angina | 16 (22.22%) | 2 (8%) | |
Female | ChestPain | Non-anginal pain | 34 (47.22%) | 1 (4%) | |
Female | ChestPain | Asymptomatic | 18 (25%) | 22 (88%) | |
Female | BP | | 130 (119.5, 140) | 140 (130, 158) | 0.37 |
Female | Cholesterol | | 249 (210.75, 289.5) | 268 (236, 307) | 0.58 |
Female | MaximumHR | | 159 (146.75, 167.25) | 146 (133, 157) | 0.15 |
Female | ExerciseInducedAngina | | | | <0.001 |
Female | ExerciseInducedAngina | No | 64 (88.89%) | 11 (44%) | |
Female | ExerciseInducedAngina | Yes | 8 (11.11%) | 14 (56%) | |
Male | Age | | 52 (44, 57) | 57.5 (51, 61) | 0.29 |
Male | ChestPain | | | | <0.001 |
Male | ChestPain | Typical angina | 12 (13.04%) | 7 (6.14%) | |
Male | ChestPain | Atypical angina | 25 (27.17%) | 7 (6.14%) | |
Male | ChestPain | Non-anginal pain | 34 (36.96%) | 17 (14.91%) | |
Male | ChestPain | Asymptomatic | 21 (22.83%) | 83 (72.81%) | |
Male | BP | | 130 (120, 140) | 130 (120, 140) | 0.71 |
Male | Cholesterol | | 229.5 (206.5, 250.75) | 247.5 (212, 282) | 0.11 |
Male | MaximumHR | | 163 (150, 175.75) | 141 (125, 156) | 0.26 |
Male | ExerciseInducedAngina | | | | <0.001 |
Male | ExerciseInducedAngina | No | 77 (83.7%) | 52 (45.61%) | |
Male | ExerciseInducedAngina | Yes | 15 (16.3%) | 62 (54.39%) | |
In general, there must be at least one column stratification variable
in order to use association metrics. See `univariate_associations()`
for more details on the workhorse of this functionality.
`descriptives()` is the function that drives the computation behind the
statistics for the columns of the input dataset. Any of its arguments
can be passed from `univariate_table()` to add further customization.
As noted above, one of the columns did not appear in the table by
default because it is a `logical()` type. By default, only `factor()`
and `numeric()` types are placed into the result, though there are (at
least) three ways to include it:
You could simply convert the column to a conformable type outside of the call:
```r
heart_disease %>%
  dplyr::mutate(
    BloodSugar = factor(BloodSugar)
  ) %>%
  univariate_table()
```

Variable | Level | Summary |
---|---|---|
Age | | 56 (48, 61) |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | | 130 (120, 140) |
Cholesterol | | 241 (211, 275) |
BloodSugar | FALSE | 258 (85.15%) |
BloodSugar | TRUE | 45 (14.85%) |
MaximumHR | | 153 (133.5, 166) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
The `_types` arguments allow you to specify the data types that are to
be interpreted by the high-level function call. Let’s allow `logical()`
types to be treated as categorical variables:
```r
heart_disease %>%
  univariate_table(
    categorical_types = c("factor", "logical")
  )
```

Variable | Level | Summary |
---|---|---|
Age | | 56 (48, 61) |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | | 130 (120, 140) |
Cholesterol | | 241 (211, 275) |
BloodSugar | FALSE | 258 (85.15%) |
BloodSugar | TRUE | 45 (14.85%) |
MaximumHR | | 153 (133.5, 166) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
The most flexible approach would be to define its own set of functions. By default, the data type of anything that is not interpreted as categorical or numeric is considered “other”. There is infrastructure in place to supply functions and summaries in the same manner for these columns.
```r
heart_disease %>%
  univariate_table(
    f_other = list(count = function(x) table(x)),
    other_summary =
      c(
        Summary = "count"
      )
  )
```

Variable | Level | Summary |
---|---|---|
Age | | 56 (48, 61) |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | | 130 (120, 140) |
Cholesterol | | 241 (211, 275) |
BloodSugar | FALSE | 258 |
BloodSugar | TRUE | 45 |
MaximumHR | | 153 (133.5, 166) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
You would need to also define functions for the percentages, proportions, etc. to exactly match the other examples.
You can also add custom functions and make them available for numeric or categorical columns:
```r
heart_disease %>%
  univariate_table(
    categorical_types = NULL,
    f_numeric =
      list(
        cv = ~sd(.x) / mean(.x)
      ),
    numeric_summary =
      c(
        `Coef. of variation` = "sd / mean = cv"
      )
  )
```

Variable | Coef. of variation |
---|---|
Age | 9.04 / 54.44 = 0.17 |
BP | 17.6 / 131.69 = 0.13 |
Cholesterol | 51.78 / 246.69 = 0.21 |
MaximumHR | 22.88 / 149.61 = 0.15 |
The names of the functions become the patterns that are searched for in the string templates.
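The idea of matching function names inside a string template can be sketched in a few lines of base R. This is only an illustration of the concept, not the package's implementation: `fill_template()` is a hypothetical helper, the input vector is toy data, and two-digit rounding is assumed because that is what the tables display.

```r
# Hypothetical helper (not a cheese function): replace each function name
# found in the template with that function's rounded result
fill_template <- function(x, template, funs, digits = 2) {
  for (name in names(funs)) {
    value <- as.character(round(funs[[name]](x), digits))
    # \\b keeps a name from matching inside a longer word
    template <- gsub(paste0("\\b", name, "\\b"), value, template, perl = TRUE)
  }
  template
}

# Built-in summaries plus one custom function, named the way they
# should be referenced in templates
funs <- list(
  mean = mean,
  sd = sd,
  median = median,
  cv = function(x) sd(x) / mean(x)
)

# Toy data invented for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

cv_cell <- fill_template(x, "sd / mean = cv", funs)
mixed_cell <- fill_template(x, "mean [sd] / median", funs)

print(cv_cell)
print(mixed_cell)
```

Any text in the template that is not a function name, such as brackets, slashes, or the equals sign, is left untouched, which is why the surrounding punctuation in a template carries straight through to the table cell.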
Finally, we’ll look at a few of the appearance-related arguments. These can be applied with any combination of other arguments.
As mentioned above, the default format for the table is HTML, but you
could choose an alternative with the `format` argument:
```r
heart_disease %>%
  univariate_table(
    format = "none"
  )

## # A tibble: 14 × 3
##    Variable                Level              Summary
##    <chr>                   <chr>              <chr>
##  1 "Age"                   ""                 56 (48, 61)
##  2 "Sex"                   "Female"           97 (32.01%)
##  3 ""                      "Male"             206 (67.99%)
##  4 "ChestPain"             "Typical angina"   23 (7.59%)
##  5 ""                      "Atypical angina"  50 (16.5%)
##  6 ""                      "Non-anginal pain" 86 (28.38%)
##  7 ""                      "Asymptomatic"     144 (47.52%)
##  8 "BP"                    ""                 130 (120, 140)
##  9 "Cholesterol"           ""                 241 (211, 275)
## 10 "MaximumHR"             ""                 153 (133.5, 166)
## 11 "ExerciseInducedAngina" "No"               204 (67.33%)
## 12 ""                      "Yes"              99 (32.67%)
## 13 "HeartDisease"          "No"               164 (54.13%)
## 14 ""                      "Yes"              139 (45.87%)
```
There are also options for `"latex"`, `"pandoc"`, and `"markdown"`.
You can use the `labels` and `levels` arguments to add clean text to
any of the variable or categorical level names, and the `order`
argument to change the position of the variables in the result:
```r
heart_disease %>%
  univariate_table(
    labels =
      c(
        Age = "Age (years)",
        ChestPain = "Chest pain"
      ),
    levels =
      list(
        Sex =
          c(
            Male = "M"
          )
      ),
    order =
      c(
        "BP",
        "Age",
        "Cholesterol"
      )
  )
```

Variable | Level | Summary |
---|---|---|
BP | | 130 (120, 140) |
Age (years) | | 56 (48, 61) |
Cholesterol | | 241 (211, 275) |
Chest pain | Typical angina | 23 (7.59%) |
Chest pain | Atypical angina | 50 (16.5%) |
Chest pain | Non-anginal pain | 86 (28.38%) |
Chest pain | Asymptomatic | 144 (47.52%) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |
MaximumHR | | 153 (133.5, 166) |
Sex | Female | 97 (32.01%) |
Sex | M | 206 (67.99%) |
Notice you only need to specify values that need to be changed. Also, ordering is done with the original names even when relabeled.
The `variableName` and `levelName` arguments change the headers of the
variable-name and categorical-level columns, while `fill_blanks`
determines what goes in empty cells. Finally, the `caption` argument
labels the entire table:
```r
heart_disease %>%
  univariate_table(
    variableName = "THESE ARE VARIABLES",
    levelName = "THESE ARE LEVELS",
    fill_blanks = "BLANK",
    caption = "HERE IS MY CAPTION"
  )
```

THESE ARE VARIABLES | THESE ARE LEVELS | Summary |
---|---|---|
Age | BLANK | 56 (48, 61) |
Sex | Female | 97 (32.01%) |
Sex | Male | 206 (67.99%) |
ChestPain | Typical angina | 23 (7.59%) |
ChestPain | Atypical angina | 50 (16.5%) |
ChestPain | Non-anginal pain | 86 (28.38%) |
ChestPain | Asymptomatic | 144 (47.52%) |
BP | BLANK | 130 (120, 140) |
Cholesterol | BLANK | 241 (211, 275) |
MaximumHR | BLANK | 153 (133.5, 166) |
ExerciseInducedAngina | No | 204 (67.33%) |
ExerciseInducedAngina | Yes | 99 (32.67%) |
HeartDisease | No | 164 (54.13%) |
HeartDisease | Yes | 139 (45.87%) |