Title: Prepare Questionnaire Data for Analysis
Description: Offers a suite of functions to prepare questionnaire data for analysis (and perhaps other types of data as well). By data preparation, I mean the data analytic tasks that get raw data ready for statistical modeling (e.g., regression). There are functions to investigate missing data, reshape data, validate responses, recode variables, score questionnaires, center variables, aggregate by groups, shift scores (i.e., leads or lags), etc. The package provides functions for both single-level and multilevel (i.e., grouped) data. With a few exceptions (e.g., ncases()), functions without an "s" at the end of their primary word (e.g., center_by()) act on atomic vectors, while functions with an "s" at the end of their primary word (e.g., centers_by()) act on multiple columns of a data.frame.
Authors: David Disabato [aut, cre]
Maintainer: David Disabato <[email protected]>
License: GPL (>= 2)
Version: 0.2.0
Built: 2025-02-06 05:27:46 UTC
Source: https://github.com/cran/quest
quest is a package for pre-processing questionnaire data to get it ready for statistical modeling. It contains functions for investigating missing data (e.g., rowNA), reshaping data (e.g., wide2long), validating responses (e.g., revalids), recoding variables (e.g., recodes), scoring (e.g., scores), centering (e.g., centers), aggregating (e.g., aggs), shifting (e.g., shifts), etc. Functions whose first phrases end with an "s" are vectorized versions of their counterparts without an "s" at the end of the first phrase. For example, center inputs an atomic vector and outputs an atomic vector to center and/or scale a single variable; centers inputs a data.frame and outputs a data.frame to center and/or scale multiple variables. Functions that end in _by are calculated by group. For example, center does grand-mean centering while center_by does group-mean centering. Putting the two together, centers_by inputs a data.frame and outputs a data.frame to center and/or scale multiple variables by group. Functions that end in _ml calculate a "multilevel" result with a within-group result and a between-group result. Functions that end in _if are calculated conditional on the frequency of observed values (aka the amount of missing data). The quest package uses the str2str package internally to convert R objects from one structure to another. See str2str for details.
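To make the naming scheme concrete, here is a minimal sketch contrasting the four centering functions (mtcars is used purely for illustration, with cyl standing in as an arbitrary grouping variable):

# grand-mean centering of one variable (atomic vector in, atomic vector out)
center(x = mtcars$"mpg")
# grand-mean centering of multiple variables (data.frame in, data.frame out)
centers(data = mtcars, vrb.nm = c("mpg","disp"))
# group-mean centering of one variable
center_by(x = mtcars$"mpg", grp = mtcars$"cyl")
# group-mean centering of multiple variables by group
centers_by(data = mtcars, vrb.nm = c("mpg","disp"), grp.nm = "cyl")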
There are three main types of functions: 1) helper functions that exist primarily for convenience and to save a few lines of code (e.g., vecNA); 2) functions for wrangling questionnaire data (e.g., nom2dum, reverses); and 3) functions for preliminary statistical calculation (e.g., means_diff, corp_by).
See the table below for the abbreviations used within function and argument names:

vrb | variable
grp | group
nm | names
NA | missing values
ov | observed values
prop | proportion
sep | separator
cor | correlations
id | identifier
rtn | return
fun | function
dfm | data.frame
fct | factor
nom | nominal variable
bin | binary variable
dum | dummy variable
pomp | percentage of maximum possible
std | standardize
wth | within-groups
btw | between-groups
.cronbach()
.cronbach is the function used by the boot function within the cronbach function. It exists primarily to increase the computational efficiency of bootstrap confidence intervals within the cronbach function by doing only the minimal computations needed to compute Cronbach's alpha.
.cronbach(dat, i, use)
dat | data.frame with only the items you wish to include in the Cronbach's alpha computation and no other variables. |
i | integer vector of length = nrow(dat) specifying the rows (i.e., the bootstrap resample) to use in the computation. |
use | character vector of length 1 specifying how missing data should be handled when computing covariances. See the use argument of cov for details. |
double vector of length 1 providing Cronbach's alpha.
.cronbach(dat = attitude, i = sample(x = 1:nrow(attitude), size = nrow(attitude), replace = TRUE), use = "pairwise")
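Because .cronbach has the (data, indices) signature that boot expects of its statistic argument, it can be plugged into boot::boot directly; a sketch of this pattern (R = 100 and the pass-through of use via ... are illustrative assumptions, not the exact internals of cronbach):

# bootstrap Cronbach's alpha; named arguments after R are forwarded
# unchanged to the statistic function on every resample
boot_out <- boot::boot(data = attitude, statistic = .cronbach, R = 100L,
  use = "pairwise")
boot::boot.ci(boot_out, conf = 0.95, type = "perc") # percentile CI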
.cronbachs()
.cronbachs is the function used by the boot function within the cronbachs function. It exists primarily to increase the computational efficiency of bootstrap confidence intervals within the cronbachs function by doing only the minimal computations needed to compute Cronbach's alpha for each set of variables/items.
.cronbachs(dat, i, nm.list, use)
dat | data.frame of data. It can contain variables other than those used for the Cronbach's alpha calculations. |
i | integer vector of length = nrow(dat) specifying the rows (i.e., the bootstrap resample) to use in the computation. |
nm.list | list of character vectors specifying the sets of variables/items associated with each of the Cronbach's alpha calculations. |
use | character vector of length 1 specifying how missing data should be handled when computing covariances. See the use argument of cov for details. |
double vector of length = length(nm.list) providing Cronbach's alpha for each set of variables/items.
dat0 <- psych::bfi[1:250, ]
dat1 <- str2str::pick(x = dat0, val = c("A1","C4","C5","E1","E2","O2","O5",
  "gender","education","age"), not = TRUE, nm = TRUE)
vrb_nm_list <- lapply(X = str2str::sn(c("E","N","C","A","O")),
  FUN = function(nm) {str2str::pick(x = names(dat1), val = nm, pat = TRUE)})
.cronbachs(dat = dat1,
  i = sample(x = 1:nrow(dat1), size = nrow(dat1), replace = TRUE),
  nm.list = vrb_nm_list, use = "pairwise")
.gtheory()
.gtheory is the function used by the boot function within the gtheory function. It exists primarily to increase the computational efficiency of bootstrap confidence intervals within the gtheory function by doing only the minimal computations needed to compute the generalizability theory coefficient.
.gtheory(dat, i, cross.vrb)
dat | data.frame with only the variables/items you wish to include in the generalizability theory coefficient and no other variables/items. |
i | integer vector of length = nrow(dat) specifying the rows (i.e., the bootstrap resample) to use in the computation. |
cross.vrb | logical vector of length 1 specifying whether the variables/items should be crossed when computing the generalizability theory coefficient. If TRUE, then only the covariance structure of the variables/items will be incorporated into the estimate of reliability. If FALSE, then the mean structure of the variables/items will be incorporated. |
double vector of length 1 providing the generalizability theory coefficient.
.gtheory(dat = attitude,
  i = sample(x = 1:nrow(attitude), size = nrow(attitude), replace = TRUE),
  cross.vrb = TRUE)
.gtheory(dat = attitude,
  i = sample(x = 1:nrow(attitude), size = nrow(attitude), replace = TRUE),
  cross.vrb = FALSE)
.gtheorys()
.gtheorys is the function used by the boot function within the gtheorys function. It exists primarily to increase the computational efficiency of bootstrap confidence intervals within the gtheorys function by doing only the minimal computations needed to compute the generalizability theory coefficients.
.gtheorys(dat, i, nm.list, cross.vrb)
dat | data.frame of data. It can contain variables other than those used for the generalizability theory coefficient calculations. |
i | integer vector of length = nrow(dat) specifying the rows (i.e., the bootstrap resample) to use in the computation. |
nm.list | list of character vectors specifying the sets of variables/items associated with each of the generalizability theory coefficient calculations. |
cross.vrb | logical vector of length 1 specifying whether the variables/items should be crossed when computing the generalizability theory coefficients. If TRUE, then only the covariance structure of the variables/items will be incorporated into the estimate of reliability. If FALSE, then the mean structure of the variables/items will be incorporated. |
double vector of length = length(nm.list) providing the generalizability theory coefficient for each set of variables/items.
dat0 <- psych::bfi[1:250, ]
dat1 <- str2str::pick(x = dat0, val = c("A1","C4","C5","E1","E2","O2","O5",
  "gender","education","age"), not = TRUE, nm = TRUE)
vrb_nm_list <- lapply(X = str2str::sn(c("E","N","C","A","O")),
  FUN = function(nm) {str2str::pick(x = names(dat1), val = nm, pat = TRUE)})
.gtheorys(dat = dat1,
  i = sample(x = 1:nrow(dat1), size = nrow(dat1), replace = TRUE),
  nm.list = vrb_nm_list, cross.vrb = TRUE)
.gtheorys(dat = dat1,
  i = sample(x = 1:nrow(dat1), size = nrow(dat1), replace = TRUE),
  nm.list = vrb_nm_list, cross.vrb = FALSE)
add_sig adds symbols for various p-value cutoffs of statistical significance. The function inputs a numeric vector, matrix, or array of effect sizes (e.g., a correlation matrix) and a numeric vector, matrix, or array of p-values that correspond to the effect sizes (i.e., each row and column match), and then returns a character vector, matrix, or array of effect sizes with appended significance symbols. One of the primary applications of this function is use within corp, corp_by, and corp_ml for correlation matrices.
add_sig( x, p, digits = 3, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE )
x | double numeric vector, matrix, or array of effect sizes for which statistical significance is available. |
p | double numeric vector, matrix, or array of p-values for the effect sizes in x (with matching dimensions). |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any effect size significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any effect size significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any effect size significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any effect size significant at the p < .001 level. |
lead.zero | logical vector of length 1 specifying whether to retain the zero in front of the decimal place when the effect size is between -1 and +1. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive effect sizes (minus signs are always in front of negative effect sizes). |
There are several functions out there that do similar things. Here is one posted to R-bloggers that does it for correlation matrices using the rcorr function from the Hmisc package: https://www.r-bloggers.com/2020/07/create-a-publication-ready-correlation-matrix-with-significance-levels-in-r/.
character vector, matrix, or array with the same dimensions as x and p containing the effect sizes with their significance symbols appended to the end of each value.
corr_test <- psych::corr.test(mtcars[1:5])
r <- corr_test[["r"]]
p <- corr_test[["p"]]
add_sig(x = r, p = p)
add_sig(x = r, p = p, digits = 2)
add_sig(x = r, p = p, lead.zero = TRUE, trail.zero = FALSE)
add_sig(x = r, p = p, plus = TRUE)
noquote(add_sig(x = r, p = p)) # no quotes for character elements
add_sig_cor adds symbols for various p-value cutoffs of statistical significance. The function inputs a correlation matrix and a numeric matrix of p-values that correspond to the correlations (i.e., each row and column match), and then returns a data.frame of correlations with appended significance symbols. One of the primary applications of this function is use within corp, corp_by, and corp_ml for correlation matrices.
add_sig_cor( r, p, digits = 3, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE, diags = FALSE, lower = TRUE, upper = FALSE )
r | double numeric matrix of correlation coefficients for which statistical significance is available. Since it is a correlation matrix, it must be symmetric and is expected to be a full matrix with all elements included (not just the lower or upper triangle). |
p | double matrix of p-values for the correlations in r (with matching dimensions). |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .001 level. |
lead.zero |
logical vector of length 1 specifying whether to retain a zero in front of the decimal place. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive correlations (minus signs are always in front of negative correlations). |
diags | logical vector of length 1 specifying whether to retain the values in the diagonal of the correlation matrix. If TRUE, then the diagonal will be 1s (formatted according to digits and the other arguments). If FALSE, then the diagonal will be NA. |
lower | logical vector of length 1 specifying whether to retain the lower triangle of the correlation matrix. If TRUE, then the lower triangle correlations and their significance symbols are retained. If FALSE, then the lower triangle will all be NA. |
upper | logical vector of length 1 specifying whether to retain the upper triangle of the correlation matrix. If TRUE, then the upper triangle correlations and their significance symbols are retained. If FALSE, then the upper triangle will all be NA. |
There are several functions out there that do similar things. Here is one posted to R-bloggers that uses the rcorr function from the Hmisc package: https://www.r-bloggers.com/2020/07/create-a-publication-ready-correlation-matrix-with-significance-levels-in-r/.
data.frame with the same dimensions as r containing the correlations and their significance symbols. Elements may or may not contain NA values depending on the arguments diags, lower, and upper.
corr_test <- psych::corr.test(mtcars[1:5])
r <- corr_test[["r"]]
p <- corr_test[["p"]]
add_sig_cor(r = r, p = p)
add_sig_cor(r = r, p = p, digits = 2)
add_sig_cor(r = r, p = p, diags = TRUE)
add_sig_cor(r = r, p = p, lower = FALSE, upper = TRUE)
add_sig_cor(r = r, p = p, lead.zero = TRUE, trail.zero = FALSE)
add_sig_cor(r = r, p = p, plus = TRUE)
agg evaluates a function separately for each group and combines the results back together into an atomic vector or data.frame that is returned. Depending on the argument rep, the results of fun are repeated for each element of x in the group (TRUE) or returned only once for each group (FALSE). Depending on the argument rtn.grp, the return object is a data.frame with the groups within grp included as columns (TRUE), or an atomic vector with the groups as the names (FALSE).
agg(x, grp, rep = TRUE, rtn.grp = !rep, sep = "_", fun, ...)
x |
atomic vector. |
grp | atomic vector or list of atomic vectors (e.g., data.frame) specifying the groups. The atomic vector(s) must be the same length as x. |
rep | logical vector of length 1 specifying whether the result of fun should be repeated for every element of x in the group (TRUE) or returned only once per group (FALSE). |
rtn.grp | logical vector of length 1 specifying whether the groups (i.e., grp) should be included in the return object as columns of a data.frame (TRUE) or used as the names of a returned atomic vector (FALSE). |
sep | character vector of length 1 specifying what string should separate different group values when naming the return object. This argument is only used if rtn.grp = FALSE and there are multiple grouping variables within grp. |
fun | function to use for aggregation. This function is expected to return an atomic vector of length 1. |
... | additional named arguments to fun. |
If rep = TRUE, then agg calls ave; if rep = FALSE, then agg calls aggregate.
result of fun applied to x for each group within grp. The structure of the return object depends on the arguments rep and rtn.grp:

If rep = TRUE and rtn.grp = TRUE, then the return object is a data.frame with nrow = length(x) where the first columns are grp and the last column is the result of fun. If grp is not a list with names, then its colnames will be "Group.1", "Group.2", "Group.3", etc., similar to aggregate's return object. The colname for the result of fun will be "x".

If rep = TRUE and rtn.grp = FALSE, then the return object is an atomic vector with length = length(x) where the values are the result of fun and the names = names(x).

If rep = FALSE and rtn.grp = TRUE, then the return object is a data.frame with nrow = length(levels(interaction(grp))) where the first columns are the unique group combinations in grp and the last column is the result of fun. If grp is not a list with names, then its colnames will be "Group.1", "Group.2", "Group.3", etc., similar to aggregate's return object. The colname for the result of fun will be "x".

If rep = FALSE and rtn.grp = FALSE, then the return object is an atomic vector with length = length(levels(interaction(grp))) where the values are the result of fun and the names are the group values pasted together by sep if there are multiple grouping variables within grp (i.e., is.list(grp) && length(grp) > 1).
aggs, agg_dfm, ave, aggregate
# one grouping variable
agg(x = airquality$"Solar.R", grp = airquality$"Month", fun = mean)
agg(x = airquality$"Solar.R", grp = airquality$"Month", fun = mean,
  na.rm = TRUE) # ignoring missing values
agg(x = setNames(airquality$"Solar.R", nm = row.names(airquality)),
  grp = airquality$"Month", fun = mean, na.rm = TRUE) # keeps the names in the return object
agg(x = airquality$"Solar.R", grp = airquality$"Month", rep = FALSE,
  fun = mean, na.rm = TRUE) # do NOT repeat aggregated values
agg(x = airquality$"Solar.R", grp = airquality$"Month", rep = FALSE,
  rtn.grp = FALSE, fun = mean, na.rm = TRUE) # groups are the names of the returned atomic vector
# two grouping variables
tmp_nm <- c("vs","am") # Roxygen2 doesn't like a c() within a []
agg(x = mtcars$"mpg", grp = mtcars[tmp_nm], rep = TRUE, fun = sd)
agg(x = mtcars$"mpg", grp = mtcars[tmp_nm], rep = FALSE,
  fun = sd) # do NOT repeat aggregated values
agg(x = mtcars$"mpg", grp = mtcars[tmp_nm], rep = FALSE, rtn.grp = FALSE,
  fun = sd) # groups are the names of the returned atomic vector
agg(x = mtcars$"mpg", grp = mtcars[tmp_nm], rep = FALSE, rtn.grp = FALSE,
  sep = ".", fun = sd) # change the separator for naming
# error messages
## Not run:
agg(x = airquality$"Solar.R", grp = mtcars[tmp_nm]) # error returned
# b/c atomic vectors within grp do not have the same length as x
## End(Not run)
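Per the Details above, agg delegates to ave (rep = TRUE) and aggregate (rep = FALSE); a minimal base-R sketch of the same two computations as the first examples (output formatting may differ from agg):

# rep = TRUE: the group mean is repeated for every element
ave(airquality$"Solar.R", airquality$"Month",
  FUN = function(v) mean(v, na.rm = TRUE))
# rep = FALSE: one result per group
aggregate(x = airquality["Solar.R"], by = airquality["Month"],
  FUN = mean, na.rm = TRUE)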
agg_dfm evaluates a function on a set of variables in a data.frame separately for each group and combines the results back together. The rep and rtn.grp arguments determine exactly how the results are combined. If rep = TRUE, then the result of fun is repeated for every row of the group in data[grp.nm]; if rep = FALSE, then the result of fun for each unique combination of data[grp.nm] is returned once. If rtn.grp = TRUE, then the results are returned in a data.frame where the first columns are the groups from data[grp.nm]; if rtn.grp = FALSE, then the results are returned in an atomic vector. Note, agg_dfm evaluates fun on all the variables in data[vrb.nm] as a whole. If, instead, you want to evaluate fun separately for each variable in data[vrb.nm], then use aggs.
agg_dfm( data, vrb.nm, grp.nm, rep = FALSE, rtn.grp = !rep, sep = ".", rtn.result.nm = "result", fun, ... )
data |
data.frame of data. |
vrb.nm | character vector of colnames from data specifying the variables to evaluate fun on. |
grp.nm | character vector of colnames from data specifying the grouping variables. |
rep | logical vector of length 1 specifying whether the result of fun should be repeated for every row of the group in data[grp.nm] (TRUE) or returned only once per group (FALSE). |
rtn.grp | logical vector of length 1 specifying whether the group columns (i.e., data[grp.nm]) should be included in the return object as columns of a data.frame (TRUE) or used to name a returned atomic vector (FALSE). |
sep | character vector of length 1 specifying the string to paste the group values together with when there are multiple grouping variables (i.e., length(grp.nm) > 1). Only used if rtn.grp = FALSE. |
rtn.result.nm | character vector of length 1 specifying the name for the column of results in the return object. Only used if rtn.grp = TRUE. |
fun | function to evaluate each grouping of data[vrb.nm] by. This function is expected to return an atomic vector of length 1. |
... | additional named arguments to fun. |
If rep = TRUE, then agg_dfm calls ave_dfm; if rep = FALSE, then agg_dfm calls by. When rep = FALSE and rtn.grp = TRUE, agg_dfm is very similar to plyr::ddply; when rep = FALSE and rtn.grp = FALSE, agg_dfm is very similar to plyr::daply.
result of fun applied to each grouping of data[vrb.nm]. The structure of the return object depends on the arguments rep and rtn.grp:

If rep = TRUE and rtn.grp = TRUE, then the return object is a data.frame with nrow = nrow(data) where the first columns are data[grp.nm] and the last column is the result of fun with colname = rtn.result.nm.

If rep = TRUE and rtn.grp = FALSE, then the return object is an atomic vector with length = nrow(data) where the values are the result of fun and the names = row.names(data).

If rep = FALSE and rtn.grp = TRUE, then the return object is a data.frame with nrow = length(levels(interaction(data[grp.nm]))) where the first columns are the unique group combinations in data[grp.nm] and the last column is the result of fun with colname = rtn.result.nm.

If rep = FALSE and rtn.grp = FALSE, then the return object is an atomic vector with length = length(levels(interaction(data[grp.nm]))) where the values are the result of fun and the names are each group value pasted together by sep if there are multiple grouping variables (i.e., length(grp.nm) > 1).
### one grouping variable
## by in base R
by(data = airquality[c("Ozone","Solar.R")], INDICES = airquality["Month"],
  simplify = FALSE, FUN = function(dat) cor(dat, use = "complete")[1,2])
## rep = TRUE
# rtn.grp = TRUE
agg_dfm(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rep = TRUE, rtn.grp = TRUE, fun = function(dat) cor(dat, use = "complete")[1,2])
# rtn.grp = FALSE
agg_dfm(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rep = TRUE, rtn.grp = FALSE, fun = function(dat) cor(dat, use = "complete")[1,2])
## rep = FALSE
# rtn.grp = TRUE
agg_dfm(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rep = FALSE, rtn.grp = TRUE, fun = function(dat) cor(dat, use = "complete")[1,2])
suppressWarnings(plyr::ddply(.data = airquality[c("Ozone","Solar.R","Month")],
  .variables = "Month", .fun = function(dat) cor(dat, use = "complete")[1,2]))
# rtn.grp = FALSE
agg_dfm(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rep = FALSE, rtn.grp = FALSE, fun = function(dat) cor(dat, use = "complete")[1,2])
suppressWarnings(plyr::daply(.data = airquality[c("Ozone","Solar.R","Month")],
  .variables = "Month", .fun = function(dat) cor(dat, use = "complete")[1,2]))
### two grouping variables
## by in base R
by(data = mtcars[c("mpg","cyl","disp")], INDICES = mtcars[c("vs","am")],
  FUN = nrow, simplify = FALSE) # with multiple group columns
## rep = TRUE
# rtn.grp = TRUE
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = TRUE, rtn.grp = TRUE, fun = nrow)
# rtn.grp = FALSE
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = TRUE, rtn.grp = FALSE, fun = nrow)
## rep = FALSE
# rtn.grp = TRUE
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, rtn.grp = TRUE, fun = nrow)
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, rtn.grp = TRUE, rtn.result.nm = "value", fun = nrow)
# rtn.grp = FALSE
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, rtn.grp = FALSE, fun = nrow)
agg_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, rtn.grp = FALSE, sep = "_", fun = nrow)
aggs evaluates a function separately for each group and combines the results back together into a data.frame that is returned. Depending on rep, the results of fun are repeated for each element of data[vrb.nm] in the group (TRUE) or returned only once for each group (FALSE). Note, aggs evaluates fun separately for each variable in data[vrb.nm]. If, instead, you want to evaluate fun for the variables data[vrb.nm] as a set, then use agg_dfm.
aggs( data, vrb.nm, grp.nm, rep = TRUE, rtn.grp = !rep, sep = "_", suffix = "_a", fun, ... )
data |
data.frame of data. |
vrb.nm | character vector of colnames from data specifying the variables to aggregate. |
grp.nm | character vector of colnames from data specifying the grouping variables. |
rep | logical vector of length 1 specifying whether the result of fun should be repeated for every element of data[vrb.nm] in the group (TRUE) or returned only once per group (FALSE). |
rtn.grp | logical vector of length 1 specifying whether the group columns (i.e., data[grp.nm]) should be included in the return object as columns (TRUE) or not (FALSE). |
sep | character vector of length 1 specifying what string should separate different group values when naming the return object. This argument is only used if rtn.grp = FALSE and there are multiple grouping variables. |
suffix | character vector of length 1 specifying the string to append to the end of the colnames in the return object. |
fun | function to use for aggregation. This function is expected to return an atomic vector of length 1. |
... | additional named arguments to fun. |
If rep = TRUE, then aggs calls ave; if rep = FALSE, then aggs calls aggregate.
data.frame of aggregated values. If rep = TRUE, then nrow = nrow(data). If rep = FALSE, then nrow = length(levels(interaction(data[grp.nm]))). The names are specified by paste0(vrb.nm, suffix). If rtn.grp = TRUE, then the group columns are appended to the beginning of the data.frame.
aggs(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  fun = mean, na.rm = TRUE)
aggs(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rtn.grp = TRUE, fun = mean, na.rm = TRUE) # include the group columns
aggs(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  rep = FALSE, fun = mean, na.rm = TRUE) # do NOT repeat aggregated values
aggs(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, fun = mean, na.rm = TRUE) # with multiple group columns
aggs(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  rep = FALSE, rtn.grp = FALSE, fun = mean, na.rm = TRUE) # without returning groups
amd_bi by default computes the proportion of missing data for pairs of variables in a data.frame, with arguments to allow for counts instead of proportions (i.e., prop) or observed data rather than missing data (i.e., ov). It is bivariate in that each pair of variables is treated in isolation.
amd_bi(data, vrb.nm, prop = TRUE, ov = FALSE)
data |
data.frame of data. |
vrb.nm | character vector of the colnames from data specifying the variables. |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
data.frame with nrow = ncol = length(vrb.nm) and rownames = colnames = vrb.nm providing the frequency of missing (or observed if ov = TRUE) values per pair of variables. If prop = TRUE, the values will range from 0 to 1. If prop = FALSE, the values will range from 0 to nrow(data).
amd_bi(data = airquality, vrb.nm = names(airquality)) # proportion of missing data
amd_bi(data = airquality, vrb.nm = names(airquality), ov = TRUE) # proportion of observed data
amd_bi(data = airquality, vrb.nm = names(airquality), prop = FALSE) # count of missing data
amd_bi(data = airquality, vrb.nm = names(airquality), prop = FALSE, ov = TRUE) # count of observed data
amd_multi by default computes the proportion of missing data from listwise deletion for a set of variables in a data.frame, with arguments to allow for counts instead of proportions (i.e., prop) or observed data rather than missing data (i.e., ov). It is multivariate in that the variables are treated together as a set.
amd_multi(data, vrb.nm, prop = TRUE, ov = FALSE)
data |
data.frame of data. |
vrb.nm | character vector of the colnames from data specifying the variables. |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
numeric vector of length 1 providing the frequency of missing (or observed if ov = TRUE) rows from listwise deletion for the set of variables vrb.nm. If prop = TRUE, the value will range from 0 to 1. If prop = FALSE, the value will range from 0 to nrow(data).
amd_multi(airquality, vrb.nm = names(airquality)) # proportion of missing data
amd_multi(airquality, vrb.nm = names(airquality), ov = TRUE) # proportion of observed data
amd_multi(airquality, vrb.nm = names(airquality), prop = FALSE) # count of missing data
amd_multi(airquality, vrb.nm = names(airquality), prop = FALSE, ov = TRUE) # count of observed data
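The same quantities can be computed directly in base R with complete.cases, which may help with intuition for what "listwise deletion" counts; a minimal sketch:

# proportion of rows that would be lost to listwise deletion
mean(!complete.cases(airquality[names(airquality)]))
# count rather than proportion (compare prop = FALSE)
sum(!complete.cases(airquality[names(airquality)]))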
amd_uni by default computes the proportion of missing data for variables in a data.frame, with arguments to allow for counts instead of proportions (i.e., prop) or observed data rather than missing data (i.e., ov). It is univariate in that each variable is treated in isolation. amd_uni is a simple wrapper for colNA.
amd_uni(data, vrb.nm, prop = TRUE, ov = FALSE)
data |
data.frame of data. |
vrb.nm | character vector of the colnames from data specifying the variables. |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
numeric vector of length = length(vrb.nm) and names = vrb.nm providing the frequency of missing (or observed if ov = TRUE) values per variable. If prop = TRUE, the values will range from 0 to 1. If prop = FALSE, the values will range from 0 to nrow(data).
amd_uni(data = airquality, vrb.nm = names(airquality)) # proportion of missing data
amd_uni(data = airquality, vrb.nm = names(airquality), ov = TRUE) # proportion of observed data
amd_uni(data = airquality, vrb.nm = names(airquality), prop = FALSE) # count of missing data
amd_uni(data = airquality, vrb.nm = names(airquality), prop = FALSE, ov = TRUE) # count of observed data
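Because each variable is treated in isolation, the default result matches a simple column-wise computation in base R; a minimal sketch:

# proportion of missing values per column
colMeans(is.na(airquality[names(airquality)]))
# count of missing values per column (compare prop = FALSE)
colSums(is.na(airquality[names(airquality)]))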
auto_by computes the autoregressive coefficient by group for longitudinal data where each observation within the group represents a different timepoint. The function assumes the data are already sorted by time.
auto_by( x, grp, n = -1L, how = "cor", cw = TRUE, method = "pearson", use = "na.or.complete", REML = TRUE, control = NULL, sep = "." )
x |
numeric vector. |
grp | list of atomic vector(s) and/or factor(s) (e.g., data.frame), which each have the same length as x, specifying the groups. |
n | integer vector with length 1. Specifies the direction and magnitude of the shift. See shift for details. |
how | character vector of length 1 specifying how to compute the autoregressive coefficients. The options are 1) "cor" for a correlation, 2) "cov" for a covariance, 3) "lm" for a linear model fit separately per group, 4) "lme" for a linear mixed-effects model via nlme, and 5) "lmer" for a linear mixed-effects model via lme4. |
cw | logical vector of length 1 specifying whether the shifted vector should be group-mean centered (TRUE) or not (FALSE). This only affects the results for the mixed-effects options (how = "lme" or "lmer"). |
method | character vector of length 1 specifying the type of correlation or covariance to compute. Only used when how = "cor" or "cov". |
use | character vector of length 1 specifying how to handle missing data. Only used when how = "cor" or "cov". |
REML | logical vector of length 1 specifying whether to use restricted maximum likelihood estimation (TRUE) rather than traditional maximum likelihood (FALSE). Only used when how = "lme" or "lmer". |
control | list of control parameters for nlme::lme (when how = "lme") or lme4::lmer (when how = "lmer"). |
sep | character vector of length 1 specifying what string should separate different group values when naming the return object. This argument is only used if there are multiple grouping variables within grp. |
There are several different ways to estimate the autoregressive parameter. This function offers a variety of them via the how and cw arguments. Note that a recent simulation suggests group-mean centering via cw is the best approach when using linear mixed-effects modeling via how = "lme" or "lmer" (Hamaker & Grasman, 2015).
numeric vector of autoregressive coefficients with length = length(levels(interaction(grp))) and names = the grouping value(s) pasted together, separated by sep.
Hamaker, E. L., & Grasman, R. P. (2015). To center or not to center? Investigating inertia with a multilevel autoregressive model. Frontiers in Psychology, 5, 1492.
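Conceptually, for a single group the how = "cor" option reduces to correlating the series with its shifted self; a minimal sketch for one month of airquality (assuming a shift of one timepoint, i.e., the default n = -1L):

ozone_may <- airquality$"Ozone"[airquality$"Month" == 5]
ozone_lag1 <- c(NA, ozone_may[-length(ozone_may)]) # shift back one timepoint
cor(ozone_may, ozone_lag1, use = "na.or.complete") # lag-1 autocorrelation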
# cor
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "cor")
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = -2L,
  how = "cor") # lag across 2 timepoints
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = +1L,
  how = "cor") # lag and lead identical for cor
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "cor",
  cw = FALSE) # centering within-person identical for cor
# cov
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "cov")
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = -2L,
  how = "cov") # lag across 2 timepoints
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = +1L,
  how = "cov") # lag and lead identical for cov
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "cov",
  cw = FALSE) # centering within-person identical for cov
# lm
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "lm")
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = -2L,
  how = "lm") # lag across 2 timepoints
auto_by(x = airquality$"Ozone", grp = airquality$"Month", n = +1L,
  how = "lm") # lag and lead NOT identical for lm
auto_by(x = airquality$"Ozone", grp = airquality$"Month", how = "lm",
  cw = FALSE) # centering within-person identical for lm
# lme
chick_weight <- as.data.frame(ChickWeight)
auto_by(x = chick_weight$"weight", grp = chick_weight$"Chick", how = "lme")
control_lme <- nlme::lmeControl(maxIter = 250L, msMaxIter = 250L,
  tolerance = 1e-3, msTol = 1e-3) # custom controls
auto_by(x = chick_weight$"weight", grp = chick_weight$"Chick", how = "lme",
  control = control_lme)
auto_by(x = chick_weight$"weight", grp = chick_weight$"Chick", n = -2L,
  how = "lme") # lag across 2 timepoints
auto_by(x = chick_weight$"weight", grp = chick_weight$"Chick", n = +1L,
  how = "lme") # lag and lead NOT identical for lme
auto_by(x = chick_weight$"weight", grp = chick_weight$"Chick", how = "lme",
  cw = FALSE) # centering within-person NOT identical for lme
# lmer
bryant_2016 <- as.data.frame(lmeInfo::Bryant2016)
## Not run:
auto_by(x = bryant_2016$"outcome", grp = bryant_2016$"case", how = "lmer")
control_lmer <- lme4::lmerControl(
  check.conv.grad = lme4::.makeCC("stop", tol = 2e-3, relTol = NULL),
  check.conv.singular = lme4::.makeCC("stop",
    tol = formals(lme4::isSingular)$"tol"),
  check.conv.hess = lme4::.makeCC(action = "stop", tol = 1e-6)) # custom controls
auto_by(x = bryant_2016$"outcome", grp = bryant_2016$"case", how = "lmer",
  control = control_lmer) # TODO: for some reason lmer doesn't like this
# and is not taking into account the custom controls
auto_by(x = bryant_2016$"outcome", grp = bryant_2016$"case", n = -2L,
  how = "lmer") # lag across 2 timepoints
auto_by(x = bryant_2016$"outcome", grp = bryant_2016$"case", n = +1L,
  how = "lmer") # lag and lead NOT identical for lmer
auto_by(x = bryant_2016$"outcome", grp = bryant_2016$"case", how = "lmer",
  cw = FALSE) # centering within-person NOT identical for lmer
## End(Not run)
ave_dfm evaluates a function on a set of variables vrb.nm separately for each group within grp.nm. The results are combined back together in line with the rows of data, similar to ave. ave_dfm is different than ave or agg because it operates on a data.frame, not an atomic vector.
ave_dfm(data, vrb.nm, grp.nm, fun, ...)
data |
data.frame of data. |
vrb.nm | character vector of colnames in data specifying the variables to evaluate fun on. |
grp.nm | character vector of colnames in data specifying the grouping variables. |
fun | function that returns an atomic vector of length 1. It probably makes sense to ensure the function always returns the same typeof as well. |
... | additional named arguments to fun. |
atomic vector of length = nrow(data) providing, for each row, the result of the function fun applied to the subset of data with that row's group value (i.e., data[levels(interaction(data[grp.nm]))[i], vrb.nm]).
ave for the same functionality with atomic vector inputs; agg_dfm for similar functionality with data.frames, but which can return the result for each group once rather than repeating the result for each group value in the data.frame.
# one grouping variable
ave_dfm(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month",
  fun = function(dat) cor(dat, use = "complete")[1,2])
# two grouping variables
ave_dfm(data = mtcars, vrb.nm = c("mpg","cyl","disp"), grp.nm = c("vs","am"),
  fun = nrow) # with multiple group columns
boot_ci computes bootstrapped confidence intervals from a matrix of coefficients (or any statistical information of interest). The function is an alternative to confint2.boot for when the user does not have an object of class boot, but rather creates their own matrix of coefficients. It has limited types of bootstrapped confidence intervals at the moment, but future versions are expected to have more options.
boot_ci(coef, est = colMeans(coef), boot.ci.type = "perc2", level = 0.95)
coef | numeric matrix (or data.frame of numeric columns) of coefficients. The rows correspond to each bootstrapped resample and the columns to different coefficients. This is the equivalent of the "t" element in a boot object. |
est | numeric vector of observed coefficients from the full sample. This is the equivalent of the "t0" element in a boot object. |
boot.ci.type | character vector of length 1 specifying the type of bootstrapped confidence interval to compute. The options are currently limited to "perc2" for the naive percentile method based on the empirical quantiles of coef. |
level | double vector of length 1 specifying the confidence level. Must be between 0 and 1. |
data.frame with nrow equal to the number of coefficients bootstrapped and the columns specified below. The rownames are the colnames in the coef argument or the names in the est argument (default data.frame rownames if neither have any names). The columns are the following: 1) the original parameter estimates, 2) the bootstrapped standard errors (which do not differ by boot.ci.type), 3) the lower bound of the bootstrapped confidence intervals, and 4) the upper bound of the bootstrapped confidence intervals.
boot.ci for the confidence interval function in the boot package; confint.boot for an alternative function for boot objects.
tmp <- replicate(n = 100, expr = {
  i <- sample.int(nrow(attitude), replace = TRUE)
  colMeans(attitude[i, ])
}, simplify = FALSE)
mat <- str2str::lv2m(tmp, along = 1)
boot_ci(mat, est = colMeans(attitude))
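Assuming "perc2" is the naive percentile method described above, the interval bounds can be approximated directly from the coefficient matrix with quantile (a sketch, not necessarily the package's exact implementation):

# 95% percentile bounds for each coefficient (i.e., each column of mat)
apply(mat, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975))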
by2 applies a function to data by group and is an alternative to the base R function by. The function is a part of the split-apply-combine family of functions discussed in the plyr R package and is very similar to dlply. It splits up one data.frame .data[.vrb.nm] into a data.frame for each group in .data[.grp.nm], applies a function .fun to each data.frame, and then returns the results as a list with names equal to the group values unique(interaction(.data[.grp.nm], sep = .sep)). by2 is simply split.data.frame + lapply. Similar to dlply, the arguments all start with . so that they do not conflict with arguments from the function .fun. If you want to apply a function to an (atomic) vector rather than a data.frame, then use tapply2.
by2(.data, .vrb.nm, .grp.nm, .sep = ".", .fun, ...)
.data |
data.frame of data. |
.vrb.nm | character vector specifying the colnames of .data for the variables to apply .fun to. |
.grp.nm | character vector specifying the colnames of .data for the grouping variables. |
.sep | character vector of length 1 specifying the string to combine the group values together with when naming the elements of the return object. |
.fun | function to apply to the set of variables .data[.vrb.nm] for each group. |
... | additional named arguments to pass to .fun. |
list of objects containing the return object of .fun for each group. The names are the unique combinations of the grouping variables (i.e., unique(interaction(.data[.grp.nm], sep = .sep))).
# one grouping variable
by2(mtcars, .vrb.nm = c("mpg","cyl","disp"), .grp.nm = "vs", .fun = cov,
  use = "complete.obs")
# two grouping variables
x <- by2(mtcars, .vrb.nm = c("mpg","cyl","disp"), .grp.nm = c("vs","am"),
  .fun = cov, use = "complete.obs")
print(x)
str(x)
# compare to by
vrb_nm <- c("mpg","cyl","disp") # Roxygen runs the whole script if I put a c() in a []
grp_nm <- c("vs","am") # Roxygen runs the whole script if I put a c() in a []
y <- by(mtcars[vrb_nm], INDICES = mtcars[grp_nm], FUN = cov,
  use = "complete.obs", simplify = FALSE)
str(y) # has dimnames rather than names
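Since by2 is described above as simply split.data.frame + lapply, the two-grouping-variable example can be sketched in base R as follows (element names will be formatted slightly differently):

# split the selected columns by the interaction of the grouping variables,
# then apply the function to each piece
pieces <- split(mtcars[c("mpg","cyl","disp")],
  f = interaction(mtcars[c("vs","am")], sep = "."))
lapply(pieces, FUN = cov, use = "complete.obs")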
center centers and/or standardizes a numeric vector. It is an alternative to scale.default that returns a numeric vector rather than a numeric matrix.
center(x, center = TRUE, scale = FALSE)
x |
numeric vector. |
center |
logical vector with length 1 specifying whether grand-mean centering should be done. |
scale |
logical vector with length 1 specifying whether grand-SD scaling should be done. |
center first coerces x to a matrix in preparation for the call to scale.default. If the coercion results in a non-numeric matrix (e.g., x is a character vector or factor), then an error is returned.
numeric vector of x centered and/or standardized with the same names as x.
centers, center_by, centers_by, scale.default
center(x = mtcars$"disp")
center(x = mtcars$"disp", scale = TRUE)
center(x = mtcars$"disp", center = FALSE, scale = TRUE)
center(x = setNames(mtcars$"disp", nm = row.names(mtcars)))
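As a check on the relationship to scale.default, the following sketch shows that dropping the matrix structure of scale's output reproduces center's default behavior:

# scale() returns a one-column matrix; as.vector() drops the dim attributes
all.equal(center(x = mtcars$"disp"),
  as.vector(scale(mtcars$"disp", center = TRUE, scale = FALSE)))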
center_by centers and/or standardizes a numeric vector by group. This is sometimes called group-mean centering and/or group-SD standardizing.
center_by(x, grp, center = TRUE, scale = FALSE)
x |
numeric vector. |
grp | list of atomic vector(s) and/or factor(s) (e.g., data.frame) containing the groups. They should each have the same length as x. |
center |
logical vector with length 1 specifying whether group-mean centering should be done. |
scale |
logical vector with length 1 specifying whether group-SD scaling should be done. |
center_by first coerces x to a matrix in preparation for the core of the function, which is essentially: lapply(X = split(x = x, f = grp), FUN = scale.default). If the coercion results in a non-numeric matrix (e.g., x is a character vector or factor), then an error is returned. An error is also returned if x and the elements of grp do not have the same length.
numeric vector of x centered and/or standardized by group with the same names as x.
centers_by, center, centers, scale.default
chick_data <- as.data.frame(ChickWeight) # because the "groupedData" class calls
# `[.groupedData`, which is different than `[.data.frame`
center_by(x = chick_data[["weight"]], grp = chick_data[["Chick"]])
center_by(x = setNames(obj = chick_data[["weight"]],
  nm = row.names(chick_data)), grp = chick_data[["Chick"]]) # with names
tmp_nm <- c("Type","Treatment") # b/c Roxygen2 doesn't like a c() within a []
center_by(x = as.data.frame(CO2)[["uptake"]],
  grp = as.data.frame(CO2)[tmp_nm], scale = TRUE) # multiple grouping vectors
centers centers and/or standardizes data. It is an alternative to scale.default that returns a data.frame rather than a numeric matrix.
centers(data, vrb.nm, center = TRUE, scale = FALSE, suffix)
data |
data.frame of data. |
vrb.nm | character vector of colnames from data specifying the variables to center and/or standardize. |
center |
logical vector with length 1 specifying whether grand-mean centering should be done. |
scale |
logical vector with length 1 specifying whether grand-SD scaling should be done. |
suffix | character vector with a single element specifying the string to append to the end of the colnames of the return object. The default depends on the center and scale arguments. |
centers first coerces data[vrb.nm] to a matrix in preparation for the call to scale.default. If the coercion results in a non-numeric matrix (e.g., any columns in data[vrb.nm] are character vectors or factors), then an error is returned.
data.frame of centered and/or standardized variables with colnames specified by paste0(vrb.nm, suffix).
center, centers_by, center_by, scale.default
centers(data = mtcars, vrb.nm = c("disp","hp","drat","wt","qsec"))
centers(data = mtcars, vrb.nm = c("disp","hp","drat","wt","qsec"),
  scale = TRUE)
centers(data = mtcars, vrb.nm = c("disp","hp","drat","wt","qsec"),
  center = FALSE, scale = TRUE)
centers(data = mtcars, vrb.nm = c("disp","hp","drat","wt","qsec"),
  scale = TRUE, suffix = "_std")
centers_by
centers and/or standardizes data by group. This is sometimes
called group-mean centering and/or group-SD standardizing. The groups can be
specified by multiple columns in data
(e.g., grp.nm
with length
> 1), and interaction
will be implicitly called to create the groups.
centers_by(data, vrb.nm, grp.nm, center = TRUE, scale = FALSE, suffix)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
grp.nm |
character vector of colnames from data specifying the grouping variables. |
center |
logical vector with length 1 specifying whether group-mean centering should be done. |
scale |
logical vector with length 1 specifying whether group-SD scaling should be done. |
suffix |
character vector with a single element specifying the string to
append to the end of the colnames of the return object. The default depends
on the center and scale arguments. |
centers_by
first coerces data[vrb.nm]
to a matrix in preparation
for the core of the function, which is essentially lapply(X = split(x =
data[vrb.nm], f = data[grp.nm]), FUN = scale.default). If the coercion
results in a non-numeric matrix (e.g., any columns in data[vrb.nm]
are
character vectors or factors), then an error is returned.
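A simplified version of the split-scale-reassemble logic can be written with ave() applied per column, which preserves the original row order (an illustration, not the actual implementation):

# Sketch: group-mean center each column with ave(), keeping row order intact.
ctr <- function(v, g) v - ave(v, g, FUN = function(x) mean(x, na.rm = TRUE))
data.frame(lapply(ChickWeight[c("weight", "Time")], ctr,
   g = ChickWeight[["Chick"]]))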
data.frame of centered and/or standardized variables by group with
colnames specified by paste0(vrb.nm, suffix)
.
center_by
centers
center
scale.default
ChickWeight2 <- as.data.frame(ChickWeight) # because the "groupedData" class calls
# `[.groupedData`, which is different than `[.data.frame`
row.names(ChickWeight2) <- as.numeric(row.names(ChickWeight)) / 1000
centers_by(data = ChickWeight2, vrb.nm = c("weight","Time"), grp.nm = "Chick")
centers_by(data = ChickWeight2, vrb.nm = c("weight","Time"), grp.nm = "Chick",
   scale = TRUE, suffix = "_within")
centers_by(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"),
   grp.nm = c("Type","Treatment"), scale = TRUE) # multiple grouping columns
change
creates a change score (aka difference score) from a numeric
vector. It is assumed that the vector is already sorted by time such that the
first element is earliest in time and the last element is the latest in time.
change(x, n, undefined = NA)
x |
numeric vector. |
n |
integer vector with length 1. Specifies how the change score is
calculated. The sign of n determines whether the comparison value is a lag (negative) or a lead (positive); see shift for details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as x) specifying what to insert for the undefined elements of the change score. |
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shift
tries to circumvent this
issue by a call to round
within shift
if n
is not an
integer; however that is not a complete fail safe. The problem is that
as.integer(n)
implicit in shift
truncates rather than rounds.
See details of shift
.
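The truncation issue is easy to see directly in base R:

# Why the L suffix matters: doubles can drift, and as.integer() truncates.
0.1 + 0.2 == 0.3   # FALSE due to floating point representation
as.integer(2.9999) # 2: truncation toward zero, not rounding
round(2.9999)      # 3: rounding to the nearest integer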
an atomic vector of the same length as x
that is the change
score. If x
and undefined
are different typeofs, then the
return will be coerced to the most complex typeof (i.e., complex to simple:
character, double, integer, logical).
changes
change_by
changes_by
shift
change(x = attitude[[1]], n = -1L) # use L to prevent problems with floating point numbers
change(x = attitude[[1]], n = -2L) # can specify any integer up to the length of `x`
change(x = attitude[[1]], n = +1L) # can specify negative or positive integers
change(x = attitude[[1]], n = +2L, undefined = -999) # user-specified undefined value
change(x = attitude[[1]], n = -2L, undefined = -999) # user-specified undefined value
change(x = attitude[[1]], n = 0L) # returns a vector of zeros
## Not run:
change(x = setNames(object = letters, nm = LETTERS), n = 3L) # character vector returns an error
## End(Not run)
change_by
creates a change score (aka difference score) from a numeric
vector separately for each group. It is assumed that the vector is already
sorted within each group by time such that the first element for that group
is earliest in time and the last element for that group is the latest in
time.
change_by(x, grp, n, undefined = NA)
x |
numeric vector. |
grp |
list of atomic vector(s) and/or factor(s) (e.g., data.frame),
which each have the same length as x. |
n |
integer vector with length 1. Specifies how the change score is
calculated. The sign of n determines whether the comparison value is a lag (negative) or a lead (positive); see shift_by for details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as x) specifying what to insert for the undefined elements of the change score. |
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shift_by
tries to circumvent
this issue by a call to round
within shift_by
if n
is
not an integer; however that is not a complete fail safe. The problem is that
as.integer(n)
implicit in shift_by
truncates rather than
rounds. See details of shift_by
.
an atomic vector of the same length as x
that is the change
score by group. If x
and undefined
are different typeofs,
then the return will be coerced to the more complex typeof (i.e., complex
to simple: character, double, integer, logical).
changes_by
change
changes
shift_by
change_by(x = ChickWeight[["Time"]], grp = ChickWeight[["Chick"]], n = -1L)
tmp_nm <- c("vs","am") # multiple grouping vectors
change_by(x = mtcars[["disp"]], grp = mtcars[tmp_nm], n = +1L)
tmp_nm <- c("Type","Treatment") # multiple grouping vectors
change_by(x = as.data.frame(CO2)[["uptake"]], grp = as.data.frame(CO2)[tmp_nm], n = 2L)
changes
creates change scores (aka difference scores) from numeric
data. It is assumed that the data is already sorted by time such that the
first row is earliest in time and the last row is the latest in time.
changes
is a multivariate version of change
that operates
on multiple variables rather than just one.
changes(data, vrb.nm, n, undefined = NA, suffix)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
n |
integer vector with length 1. Specifies how the change score is
calculated. The sign of n determines whether the comparison value is a lag (negative) or a lead (positive); see shifts for details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as x) specifying what to insert for the undefined elements of the change scores. |
suffix |
character vector of length 1 specifying the string to append to
the end of the colnames of the return object. The default depends on the
n argument. |
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shifts
tries to circumvent this
issue by a call to round
within shifts
if n
is not an
integer; however that is not a complete fail safe. The problem is that
as.integer(n)
implicit in shifts
truncates rather than rounds.
See details of shifts
.
data.frame of change scores with colnames specified by
paste0(vrb.nm, suffix)
.
change
changes_by
change_by
shifts
changes(attitude, vrb.nm = names(attitude), n = -1L) # use L to prevent problems with floating point numbers
changes(attitude, vrb.nm = names(attitude), n = -2L) # can specify any integer up to the length of `x`
changes(attitude, vrb.nm = names(attitude), n = +1L) # can specify negative or positive integers
changes(attitude, vrb.nm = names(attitude), n = +2L, undefined = -999) # user-specified undefined value
changes(attitude, vrb.nm = names(attitude), n = -2L, undefined = -999) # user-specified undefined value
## Not run:
changes(str2str::d2d(InsectSprays), names(InsectSprays), n = 3L) # character vector returns an error
## End(Not run)
changes_by
creates change scores (aka difference scores) from numeric
data separately for each group. It is assumed that the data is already sorted
within each group by time such that the first row for that group is earliest
in time and the last row for that group is the latest in time.
changes_by(data, vrb.nm, grp.nm, n, undefined = NA, suffix)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
grp.nm |
character vector of colnames from data specifying the grouping variables. |
n |
integer vector with length 1. Specifies how the change score is
calculated. The sign of n determines whether the comparison value is a lag (negative) or a lead (positive); see shifts_by for details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as x) specifying what to insert for the undefined elements of the change scores. |
suffix |
character vector of length 1 specifying the string to append to
the end of the colnames of the return object. The default depends on the
n argument. |
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shifts_by
tries to circumvent
this issue by a call to round
within shifts_by
if n
is
not an integer; however that is not a complete fail safe. The problem is that
as.integer(n)
implicit in shifts_by
truncates rather than
rounds. See details of shifts_by
.
data.frame of change scores by group with colnames specified by
paste0(vrb.nm, suffix)
.
change_by
changes
change
shifts_by
changes_by(data = ChickWeight, vrb.nm = c("weight","Time"), grp.nm = "Chick", n = -1L)
changes_by(data = mtcars, vrb.nm = c("disp","mpg"), grp.nm = c("vs","am"), n = 1L)
changes_by(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"),
   grp.nm = c("Type","Treatment"), n = 2L) # multiple grouping columns
colMeans_if
calculates the mean of every column in a numeric or
logical matrix conditional on the frequency of observed data. If the
frequency of observed values in that column is less than (or equal to) that
specified by ov.min
, then NA is returned for that column.
colMeans_if(x, ov.min = 1, prop = TRUE, inclusive = TRUE)
x |
numeric or logical matrix. If not a matrix, it will be coerced to one. |
ov.min |
minimum frequency of observed values required per column. If
prop = TRUE, this is a proportion between 0 and 1; if prop = FALSE, this is a count. |
prop |
logical vector of length 1 specifying whether ov.min should be interpreted as a proportion (TRUE) or a count (FALSE). |
inclusive |
logical vector of length 1 specifying whether the mean
should be calculated if the frequency of observed values in a column is
exactly equal to ov.min. |
Conceptually this function does: apply(X = x, MARGIN = 2, FUN =
mean_if, ov.min = ov.min, prop = prop, inclusive = inclusive)
. But for
computational efficiency purposes it does not because then the missing values
conditioning would not be vectorized. Instead, it uses colMeans
and
then inserts NAs for columns that have too few observed values.
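The same idea can be sketched in a few lines of base R (assuming prop = TRUE and inclusive = TRUE; not the package's exact implementation):

# Sketch of the conditional-mean idea: take colMeans with na.rm, then blank
# out columns whose proportion of observed values falls below ov.min.
colMeans_if_sketch <- function(x, ov.min = 1) {
  x <- as.matrix(x)
  out <- colMeans(x, na.rm = TRUE)
  p_obs <- colMeans(!is.na(x)) # proportion observed per column
  out[p_obs < ov.min] <- NA_real_
  out
}
colMeans_if_sketch(airquality, ov.min = 0.9) # Ozone drops out (too much missing)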
numeric vector of length = ncol(x)
with names =
colnames(x)
providing the mean of each column or NA depending on the
frequency of observed values.
colSums_if
rowMeans_if
rowSums_if
colMeans
colMeans_if(airquality)
colMeans_if(x = airquality, ov.min = 150, prop = FALSE)
colNA
computes the frequency of missing values in a matrix by column.
This function essentially does apply(X = x, MARGIN = 2, FUN = vecNA)
.
It is also used by other functions in the quest package related to missing
values (e.g., colMeans_if
).
colNA(x, prop = FALSE, ov = FALSE)
x |
matrix with any typeof. If not a matrix, it will be coerced to a
matrix via as.matrix. |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
numeric vector of length = ncol(x)
, and names =
colnames(x)
providing the frequency of missing values (or observed
values if ov
= TRUE) per column. If prop
= TRUE, the values
will range from 0 to 1. If prop
= FALSE, the values will range from
0 to nrow(x)
.
colNA(as.matrix(airquality)) # count of missing values
colNA(as.matrix(airquality), prop = TRUE) # proportion of missing values
colNA(as.matrix(airquality), ov = TRUE) # count of observed values
colNA(as.data.frame(airquality), prop = TRUE, ov = TRUE) # proportion of observed values
colSums_if
calculates the sum of every column in a numeric or logical
matrix conditional on the frequency of observed data. If the frequency of
observed values in that column is less than (or equal to) that specified by
ov.min
, then NA is returned for that column. It also has the option to
return a value other than 0 (e.g., NA) when a column is entirely NA, which differs
from colSums(x, na.rm = TRUE)
.
colSums_if( x, ov.min = 1, prop = TRUE, inclusive = TRUE, impute = TRUE, allNA = NA_real_ )
x |
numeric or logical matrix. If not a matrix, it will be coerced to one. |
ov.min |
minimum frequency of observed values required per column. If
prop = TRUE, this is a proportion between 0 and 1; if prop = FALSE, this is a count. |
prop |
logical vector of length 1 specifying whether ov.min should be interpreted as a proportion (TRUE) or a count (FALSE). |
inclusive |
logical vector of length 1 specifying whether the sum should
be calculated if the frequency of observed values in a column is exactly
equal to ov.min. |
impute |
logical vector of length 1 specifying if missing values should
be imputed with the mean of the observed values in that column. |
allNA |
numeric vector of length 1 specifying what value should be
returned for columns that are all NA. This is most applicable when
ov.min = 0. |
Conceptually this function does: apply(X = x, MARGIN = 2, FUN = sum_if,
ov.min = ov.min, prop = prop, inclusive = inclusive)
. But for computational
efficiency purposes it does not because then the observed values conditioning
would not be vectorized. Instead, it uses colSums
and then inserts NAs
for columns that have too few observed values.
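A sketch of the conditional sum with mean imputation (assuming prop = TRUE and inclusive = TRUE; a simplified illustration, not the package's exact implementation):

# Sketch: impute column means for NAs, sum, then blank out columns whose
# proportion of observed values falls below ov.min.
colSums_if_sketch <- function(x, ov.min = 1, impute = TRUE) {
  x <- as.matrix(x)
  p_obs <- colMeans(!is.na(x)) # compute observed proportions before imputing
  if (impute) {
    idx <- which(is.na(x), arr.ind = TRUE)
    x[idx] <- colMeans(x, na.rm = TRUE)[idx[, "col"]]
  }
  out <- colSums(x, na.rm = TRUE)
  out[p_obs < ov.min] <- NA_real_
  out
}
colSums_if_sketch(airquality, ov.min = 0.9)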
numeric vector of length = ncol(x)
with names =
colnames(x)
providing the sum of each column or NA depending on the
frequency of observed values.
colMeans_if
rowSums_if
rowMeans_if
colSums
colSums_if(airquality)
colSums_if(x = airquality, ov.min = 150, prop = FALSE)
x <- data.frame("x" = c(1, 2, NA), "y" = c(1, NA, NA), "z" = c(NA, NA, NA))
colSums_if(x)
colSums_if(x, ov.min = 0)
colSums_if(x, ov.min = 0, allNA = 0)
identical(x = colSums(x, na.rm = TRUE),
   y = colSums_if(x, impute = FALSE, ov.min = 0, allNA = 0)) # identical to
# colSums(x, na.rm = TRUE)
composite
computes the composite reliability coefficient (sometimes
referred to as omega) for a set of variables/items. The composite reliability
computed in composite
assumes a unidimensional factor model with no
error covariances. In addition to the coefficient itself, the function returns its standard error
and confidence interval, the average standardized factor loading
from the factor model, the number of variables/items, and (optionally) model fit
indices of the factor model. Note, any reverse coded items need to be recoded
ahead of time so that all variables/items are keyed in the same direction.
composite( data, vrb.nm, level = 0.95, std = FALSE, ci.type = "delta", boot.ci.type = "bca.simple", R = 200L, fit.measures = c("chisq", "df", "tli", "cfi", "rmsea", "srmr"), se = "standard", test = "standard", missing = "fiml", ... )
data |
data.frame of data. |
vrb.nm |
character vector of colnames in data specifying the variables/items. |
level |
double vector of length 1 with a value between 0 and 1 specifying what confidence level to use. |
std |
logical element of length 1 specifying if the composite
reliability should be computed for the standardized version of the
variables data[vrb.nm]. |
ci.type |
character vector of length 1 specifying which type of confidence interval to compute. The "delta" option uses the delta method to compute a standard error and a symmetrical confidence interval. The "boot" option uses bootstrapping to compute an asymmetrical confidence interval as well as a (pseudo) standard error. |
boot.ci.type |
character vector of length 1 specifying which type of
bootstrapped confidence interval to compute. The options are: 1) "norm", 2)
"basic", 3) "perc", 4) "bca.simple". Only used if |
R |
integer vector of length 1 specifying how many bootstrapped
resamples to compute. Note, as the number of bootstrapped resamples
increases, the computation time will increase. Only used if ci.type = "boot". |
fit.measures |
character vector specifying which model fit indices to
include in the return object. The default option includes the chi-square
test statistic ("chisq"), degrees of freedom ("df"), tucker-lewis index
("tli"), comparative fit index ("cfi"), root mean square error of
approximation ("rmsea"), and standardized root mean residual ("srmr"). If
NULL, then no model fit indices are included in the return object. See
fitMeasures in the lavaan package for details. |
se |
character vector of length 1 specifying which type of standard
errors to compute. If ci.type = "boot", then the input value is ignored and
set to "bootstrap". See |
test |
character vector of length 1 specifying which type of test
statistic to compute. If ci.type = "boot", then the input value is ignored
and set to "bootstrap". See |
missing |
character vector of length 1 specifying how to handle missing
data. The default is "fiml" for full information maximum likelihood). See
|
... |
other arguments passed to cfa in the lavaan package. |
The factor model is estimated using the R package lavaan
. The
reliability coefficients are calculated based on the square of the sum of the
factor loadings divided by the sum of the square of the sum of the factor
loadings and the sum of the error variances (Raykov, 2001).
composite
is only able to use the "ML" estimator at the moment and
cannot model items as categorical/ordinal. However, different versions of
standard errors and test statistics are possible. For example, the "MLM"
estimator can be specified by se
= "robust.sem" and test
=
"satorra.bentler"; the "MLR" estimator can be specified by se
=
"robust.huber.white" and test
= "yuan.bentler.mplus". See
lavOptions
and scroll down to Estimation options.
double vector where the first element is the composite reliability
coefficient ("est") followed by its standard error ("se"), then its
confidence interval ("lwr" and "upr"), the average standardized factor
loading of the factor model ("average_l") and number of variables ("nvrb"),
and finally any of the fit.measures
requested.
Raykov, T. (2001). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54(2), 315–323.
# data
dat <- psych::bfi[1:250, 2:5] # the first item is reverse coded
# delta method CI
composite(data = dat, vrb.nm = names(dat), ci.type = "delta")
composite(data = dat, vrb.nm = names(dat), ci.type = "delta", level = 0.99)
composite(data = dat, vrb.nm = names(dat), ci.type = "delta", std = TRUE)
composite(data = dat, vrb.nm = names(dat), ci.type = "delta", fit.measures = NULL)
composite(data = dat, vrb.nm = names(dat), ci.type = "delta",
   se = "robust.sem", test = "satorra.bentler", missing = "listwise") # MLM estimator
composite(data = dat, vrb.nm = names(dat), ci.type = "delta",
   se = "robust.huber.white", test = "yuan.bentler.mplus", missing = "fiml") # MLR estimator
## Not run:
# bootstrapped CI
composite(data = dat, vrb.nm = names(dat), level = 0.95,
   ci.type = "boot") # slightly different estimate for some reason...
composite(data = dat, vrb.nm = names(dat), level = 0.95, ci.type = "boot",
   boot.ci.type = "perc", R = 250L) # probably want to use more resamples - this is just an example
## End(Not run)
# compare to semTools::reliability
psymet_obj <- composite(data = dat, vrb.nm = names(dat))
psymet_est <- unname(psymet_obj["est"])
lavaan_obj <- lavaan::cfa(model = make.latent(names(dat)), data = dat,
   std.lv = TRUE, missing = "fiml")
semTools_obj <- semTools::reliability(lavaan_obj)
semTools_est <- semTools_obj["omega", "latent"]
all.equal(psymet_est, semTools_est)
composites
computes the composite reliability coefficient (sometimes
referred to as omega) for multiple sets of variables/items. The composite
reliability computed in composites
assumes a unidimensional factor
model for each set of variables/items with no error covariances. In addition
to the coefficients themselves, the function returns their standard errors and confidence
intervals, the average standardized factor loading from each
factor model, the number of variables/items in each set, and (optionally) model
fit indices of the factor models. Note, any reverse coded items need to be
recoded ahead of time so that all items are keyed in the same direction for
each set of variables/items.
composites( data, vrb.nm.list, level = 0.95, std = FALSE, ci.type = "delta", boot.ci.type = "bca.simple", R = 200L, fit.measures = c("chisq", "df", "tli", "cfi", "rmsea", "srmr"), se = "standard", test = "standard", missing = "fiml", ... )
data |
data.frame of data. |
vrb.nm.list |
list of character vectors containing colnames in
data specifying each set of variables/items. |
level |
double vector of length 1 with a value between 0 and 1 specifying what confidence level to use. |
std |
logical element of length 1 specifying if the composite
reliability should be computed for the standardized version of the
variables/items. |
ci.type |
character vector of length 1 specifying which type of confidence interval to compute. The "delta" option uses the delta method to compute a standard error and a symmetrical confidence interval. The "boot" option uses bootstrapping to compute an asymmetrical confidence interval as well as a (pseudo) standard error. |
boot.ci.type |
character vector of length 1 specifying which type of
bootstrapped confidence interval to compute. The options are: 1) "norm", 2)
"basic", 3) "perc", 4) "bca.simple". Only used if |
R |
integer vector of length 1 specifying how many bootstrapped
resamples to compute. Note, as the number of bootstrapped resamples
increases, the computation time will increase. Only used if ci.type = "boot". |
fit.measures |
character vector specifying which model fit indices to
include in the return object. The default option includes the chi-square
test statistic ("chisq"), degrees of freedom ("df"), tucker-lewis index
("tli"), comparative fit index ("cfi"), root mean square error of
approximation ("rmsea"), and standardized root mean residual ("srmr"). If
NULL, then no model fit indices are included in the return object. See
fitMeasures in the lavaan package for details. |
se |
character vector of length 1 specifying which type of standard
errors to compute. If ci.type = "boot", then the input value is ignored and
implicitly set to "bootstrap". See |
test |
character vector of length 1 specifying which type of test
statistic to compute. If ci.type = "boot", then the input value is ignored
and implicitly set to "bootstrap". See |
missing |
character vector of length 1 specifying how to handle missing
data. The default is "fiml" for full information maximum likelihood. See
lavOptions in the lavaan package. |
... |
other arguments passed to cfa in the lavaan package. |
The factor models are estimated using the R package lavaan
. The
reliability coefficients are calculated based on the square of the sum of the
factor loadings divided by the sum of the square of the sum of the factor
loadings and the sum of the error variances (Raykov, 2001).
composites
is only able to use the "ML" estimator at the moment and
cannot model items as categorical/ordinal. However, different versions of
standard errors and test statistics are possible. For example, the "MLM"
estimator can be specified by se
= "robust.sem" and test
=
"satorra.bentler"; the "MLR" estimator can be specified by se
=
"robust.huber.white" and test
= "yuan.bentler.mplus". See
lavOptions
and scroll down to Estimation options for
details.
data.frame containing the composite reliability of each set of variables/items.
estimate of the reliability coefficient
standard error of the reliability coefficient
lower bound of the confidence interval of the reliability coefficient
upper bound of the confidence interval of the reliability coefficient
average standardized factor loading from the factor model
number of variables/items
any model fit indices requested by the fit.measures
argument
Raykov, T. (2001). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54(2), 315–323.
dat0 <- psych::bfi[1:250, ]
dat1 <- str2str::pick(x = dat0, val = c("A1","C4","C5","E1","E2","O2","O5",
   "gender","education","age"), not = TRUE, nm = TRUE)
vrb_nm_list <- lapply(X = str2str::sn(c("E","N","C","A","O")),
   FUN = function(nm) {str2str::pick(x = names(dat1), val = nm, pat = TRUE)})
composites(data = dat1, vrb.nm.list = vrb_nm_list)
## Not run:
start_time <- Sys.time()
composites(data = dat1, vrb.nm.list = vrb_nm_list,
   ci.type = "boot", R = 5000L) # the function is not optimized for speed at the moment
# since it will bootstrap separately for each set of variables/items
end_time <- Sys.time()
print(end_time - start_time) # takes 10 minutes on my laptop
## End(Not run)
composites(data = attitude,
   vrb.nm.list = list(names(attitude))) # also works with only one set of variables/items
confint2
is a generic function for creating confidence intervals from
various statistical information (e.g., confint2.default
) or
object classes (e.g., confint2.boot
). It is an alternative to
the original confint
generic function in the stats
package.
confint2(obj, ...)
obj |
object of a particular class (e.g., "boot") or the first argument
in the default method (e.g., the parameter estimates). |
... |
additional arguments specific to the particular method of confint2. |
depends on the particular method of confint2
, but usually a data.frame
with a column for the parameter estimate ("est"), standard error ("se"),
lower bound of the confidence interval ("lwr"), and upper bound of the confidence interval ("upr").
confint2.default
for the default method,
confint2.boot
for the boot
method,
boot
confint2.boot
is the boot
method for the generic function
confint2
and computes bootstrapped confidence intervals from an object
of class boot
(aka an object returned by the function
boot). The function is a simple wrapper for the car boot
methods for the summary
and confint
generics. See
hist.boot
for details on those methods.
## S3 method for class 'boot'
confint2(obj, boot.ci.type = "perc", level = 0.95, ...)
obj |
an object of class boot. |
boot.ci.type |
character vector of length 1 specifying the type of
bootstrapped confidence interval to compute. The options are 1) "perc" for
the regular percentile method, 2) "bca" for bias-corrected and accelerated
percentile method, 3) "norm" for the normal method that uses the
bootstrapped standard error to construct symmetrical confidence intervals
with the classic formula around the bias-corrected estimate, and 4) "basic"
for the basic method. Note, "stud" for the studentized method is NOT an
option. See hist.boot for details. |
level |
double vector of length 1 specifying the confidence level. Must be between 0 and 1. |
... |
This argument has no use. Technically, it is additional arguments
for consistency with the generic function confint2. |
The bias-corrected and accelerated percentile method (boot.ci.type
=
"bca") will often fail if the number of bootstrapped resamples is less than
the sample size. Even still, it can fail for other reasons. Following
car:::confint.boot
, confint2.boot
gives a warning if the
bias-corrected and accelerated percentile method fails for any statistic, and
implicitly switches to the regular percentile method to prevent an error.
When multiple statistics were bootstrapped, it might be that the
bias-corrected and accelerated percentile method succeeded for most of the
statistics and only failed for one statistic; however, confint2.boot
will switch to using the regular percentile method for ALL the statistics.
This may change in the future.
data.frame will be returned with nrow equal to the number of
statistics bootstrapped and columns specified below. The rownames are the
names in the "t0" element of the boot
object (default data.frame
rownames if the "t0" element does not have any names). The columns are the
following:
original parameter estimates
bootstrapped standard errors (does not differ by boot.ci.type
)
lower bound of the bootstrapped confidence intervals
upper bound of the bootstrapped confidence intervals
# a single statistic
mean2 <- function(x, i) mean(x[i], na.rm = TRUE)
boot_obj <- boot::boot(data = attitude[[1]], statistic = mean2, R = 200L)
confint2.boot(boot_obj)
confint2.boot(boot_obj, boot.ci.type = "bca")
confint2.boot(boot_obj, level = 0.99)
# multiple statistics
colMeans2 <- function(dat, i) colMeans(dat[i, ], na.rm = TRUE)
boot_obj <- boot::boot(data = attitude, statistic = colMeans2, R = 200L)
confint2.boot(boot_obj)
confint2.boot(boot_obj, boot.ci.type = "bca")
confint2.boot(boot_obj, level = 0.99)
confint2.default
is the default method for the generic function
confint2
and computes the statistical information for confidence
intervals from parameter estimates, standard errors, and degrees of freedom.
If degrees of freedom are not applicable or available, then df
can be
set to Inf
(the default) and critical z-values rather than critical
t-values will be used.
## Default S3 method:
confint2(obj, se, df = Inf, level = 0.95, ...)
obj |
numeric vector of parameter estimates. A better name for this
argument would be est, but obj is used to match the generic function. |
se |
numeric vector of standard errors. Must be the same length as
obj. |
df |
numeric vector of degrees of freedom. Must have length 1 or the
same length as obj. |
level |
double vector of length 1 specifying the confidence level. Must be between 0 and 1. |
... |
This argument has no use. Technically, it is additional arguments
for consistency with the generic function confint2. |
data.frame with nrow equal to the lengths of obj
and
se
. The rownames are taken from obj
, unless obj
does not
have any names and then the rownames are taken from the names of se
.
If neither have names, then the rownames are automatic (i.e.,
1:nrow()
). The columns are the following:
parameter estimates
standard errors
lower bound of the confidence intervals
upper bound of the confidence intervals
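The core computation can be sketched in a few lines (a simplified illustration of the estimate plus/minus critical value times standard error logic):

# Core of the default method (sketch): est +/- crit * se, where qt() with
# df = Inf reproduces the critical z-value.
est <- 10; se <- 3; level <- 0.95
crit <- qt(p = 1 - (1 - level) / 2, df = Inf)
c(est = est, lwr = est - crit * se, upr = est + crit * se)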
# single estimate
confint2.default(obj = 10, se = 3)
# multiple estimates
est <- colMeans(attitude)
se <- apply(X = str2str::d2m(attitude), MARGIN = 2,
   FUN = function(vec) sqrt(var(vec) / length(vec)))
df <- nrow(attitude) - 1
confint2.default(obj = est, se = se, df = df)
confint2.default(obj = est, se = se) # default is df = Inf and use of critical z-values
confint2.default(obj = est, se = se, df = df, level = 0.99)
# error
## Not run:
confint2.default(obj = c(10, 12), se = c(3, 4, 5))
## End(Not run)
cor_by
computes a correlation matrix for each group within numeric
data. Only the correlation coefficients are determined and not any NHST
information. If that is desired, use corp_by
which includes
significance symbols. cor_by
is simply cor
+ by2
.
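The conceptual core of "cor + by" can be sketched with base R's split() (an illustration, not the package's exact implementation):

# Sketch: split the variables by group, then correlate within each piece.
lapply(split(airquality[c("Ozone", "Solar.R", "Wind")], airquality[["Month"]]),
   cor, use = "pairwise.complete.obs")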
cor_by( data, vrb.nm, grp.nm, use = "pairwise.complete.obs", method = "pearson", sep = ".", check = TRUE )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
grp.nm |
character vector of colnames from data specifying the grouping variables. |
use |
character vector of length 1 specifying how to handle missing data
when computing the correlations. The options are 1)
"pairwise.complete.obs", 2) "complete.obs", 3) "na.or.complete", 4)
"all.obs", or 5) "everything". See details of |
method |
character vector of length 1 specifying the type of
correlations to be computed. The options are 1) "pearson", 2) "kendall", or
3) "spearman". See details of |
sep |
character vector of length 1 specifying the string to combine the
group values together with. |
check |
logical vector of length 1 specifying whether to check the
structure of the input arguments. For example, check whether
data[vrb.nm] are all mode numeric. |
list of numeric matrices containing the correlations from each group.
The listnames are the unique combinations of the grouping variables,
separated by "sep" if multiple grouping variables (i.e.,
length(grp.nm)
> 1) are input:
unique(interaction(data[grp.nm], sep = sep))
. The rownames and
colnames of each numeric matrix are vrb.nm
.
cor
for full sample correlation matrices,
corp
for full sample correlation data.frames with significance symbols,
corp_by
for full sample correlation data.frames with significance symbols
by group.
# one grouping variable
cor_by(airquality, vrb.nm = c("Ozone","Solar.R","Wind"), grp.nm = "Month")
cor_by(airquality, vrb.nm = c("Ozone","Solar.R","Wind"), grp.nm = "Month",
   use = "complete.obs", method = "spearman")
# two grouping variables
cor_by(mtcars, vrb.nm = c("mpg","disp","drat","wt"), grp.nm = c("vs","am"))
cor_by(mtcars, vrb.nm = c("mpg","disp","drat","wt"), grp.nm = c("vs","am"),
   use = "complete.obs", method = "spearman", sep = "_")
cor_miss
computes (point-biserial) correlations between missingness on
data columns and scores on other data columns.
cor_miss( data, x.nm, m.nm, ov = FALSE, use = "pairwise.complete.obs", method = "pearson" )
data |
data.frame of data. |
x.nm |
character vector of colnames in data specifying the predictor columns. |
m.nm |
character vector of colnames in data specifying the columns to assess missingness for. |
ov |
logical vector of length 1 specifying whether the correlations should be with "observedness" rather than missingness. |
use |
character vector of length 1 specifying how to deal with missing
data in the predictor columns. See cor for details. |
method |
character vector of length 1 specifying what type of
correlations to compute. See cor for details. |
cor_miss
calls make.dumNA
to create dummy vectors representing
missingness on the data[m.nm]
columns.
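The underlying idea can be sketched directly in base R (a simplified illustration of the missingness-dummy approach, not the package's exact implementation):

# Sketch: dummy-code is.na() (1 = missing) and correlate with the predictors.
m <- 1L * is.na(airquality[c("Ozone", "Solar.R")])
cor(airquality[c("Wind", "Temp")], m, use = "pairwise.complete.obs")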
numeric matrix of (point-biserial) correlations between rows of predictors and columns of missingness.
cor_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"))
cor_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"), ov = TRUE) # correlations with "observedness"
cor_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"), use = "complete.obs", method = "kendall")
cor_ml
decomposes correlations from multilevel data into within-group
and between-group correlations. The workhorse of the function is
statsBy
.
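Since statsBy is the workhorse, the two matrices can also be pulled from its output directly (a sketch; statsBy expects the grouping column inside the data, and its output matrices may include that column as well):

# Sketch: psych::statsBy returns within-group ("rwg") and between-group
# ("rbg") correlation matrices, which cor_ml repackages.
sb <- psych::statsBy(airquality[c("Ozone", "Wind", "Month")], group = "Month")
sb$rwg # within-group correlations
sb$rbg # between-group correlations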
cor_ml(data, vrb.nm, grp.nm, use = "pairwise.complete.obs", method = "pearson")
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
grp.nm |
character vector of length 1 of a colname from data specifying the grouping variable. |
use |
character vector of length 1 specifying how to handle missing
values when computing the correlations. The options are: 1.
"pairwise.complete.obs" which uses pairwise deletion, 2. "complete.obs"
which uses listwise deletion, and 3. "everything" which uses all cases and
returns NA for any correlations involving columns in data[vrb.nm] that have missing values. |
method |
character vector of length 1 specifying which type of correlations to compute. The options are: 1. "pearson" for traditional Pearson product-moment correlations, 2. "kendall" for Kendall rank correlations, and 3. "spearman" for Spearman rank correlations. |
list with two elements named "within" and "between" each containing a
numeric matrix. The first "within" matrix is the within-group correlation
matrix and the second "between" matrix is the between-group correlation
matrix. The rownames and colnames of each numeric matrix are vrb.nm
.
corp_ml
for multilevel correlations with significance symbols,
cor_by
for correlation matrices by group,
cor
for traditional, single-level correlation matrices,
statsBy
the workhorse for the cor_ml
function,
# traditional use
tmp <- c("outcome","case","session","trt_time") # roxygen2 does not like c() inside []
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp]
stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat
cor_ml(data = dat, vrb.nm = c("outcome","session","trt_time"), grp.nm = "case")
# varying the `use` and `method` arguments
cor_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"),
   grp.nm = "Month", use = "pairwise", method = "pearson")
cor_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"),
   grp.nm = "Month", use = "complete", method = "kendall")
cor_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"),
   grp.nm = "Month", use = "everything", method = "spearman")
corp
computes bivariate correlations and their associated p-values.
The function is primarily for preparing a correlation table for publication:
the correlations are appended by significance symbols (e.g., asterisks).
corp
is simply corr.test
+ add_sig_cor
.
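The two ingredients corp combines can be inspected separately (a sketch of the idea, not the package's exact implementation):

# psych::corr.test supplies the correlations ($r) and p-values ($p) that
# add_sig_cor then formats into a publication-style table.
ct <- psych::corr.test(attitude[c("rating", "complaints", "privileges")])
ct$r # correlation estimates
ct$p # p-values (entries above the diagonal are adjusted by default)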
corp( data, vrb.nm, use = "pairwise.complete.obs", method = "pearson", digits = 3L, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE, diags = FALSE, lower = TRUE, upper = FALSE )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
use |
character vector of length 1 specifying how to handle missing data
when computing the correlations. The options are 1)
"pairwise.complete.obs", 2) "complete.obs", 3) "na.or.complete", 4)
"all.obs", or 5) "everything". See details of |
method |
character vector of length 1 specifying the type of
correlations to be computed. The options are 1) "pearson", 2) "kendall", or
3) "spearman". See details of |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .001 level. |
lead.zero |
logical vector of length 1 specifying whether to retain a zero in front of the decimal place. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive correlations (minus signs are always in front of negative correlations). |
diags |
logical vector of length 1 specifying whether to retain the
values in the diagonal of the correlation matrix. If TRUE, then the
diagonal will be 1s with |
lower |
logical vector of length 1 specifying whether to retain the lower triangle of the correlation matrix. If TRUE, then the lower triangle correlations and their significance symbols are retained. If FALSE, then the lower triangle will all be NA. |
upper |
logical vector of length 1 specifying whether to retain the upper triangle of the correlation matrix. If TRUE, then the upper triangle correlations and their significance symbols are retained. If FALSE, then the upper triangle will all be NA. |
data.frame with rownames and colnames equal to vrb.nm
containing the bivariate correlations with significance symbols after the
correlation value, specified by the arguments p.10
, p.05
,
p.01
, and p.001
arguments. The specific elements of the
return object are determined by the other arguments.
add_sig_cor
for adding significant symbols to a correlation matrix,
add_sig
for adding significant symbols to any (atomic) vector, matrix, or (3D+) array,
cor
for computing only the correlation coefficients themselves
corr.test
for a function providing confidence intervals as well
corp(data = mtcars, vrb.nm = c("mpg","cyl","disp","hp","drat")) # no quotes b/c a data.frame
corp(data = attitude, vrb.nm = colnames(attitude))
corp(data = attitude, vrb.nm = colnames(attitude), p.10 = "'") # advance & privileges
corp(data = airquality, vrb.nm = colnames(airquality), plus = TRUE)
corp_by
computes a correlation data.frame for each group within
numeric data. The correlation coefficients are appended by their significant
symbols based on their associated p-values. If only the correlation
coefficients are desired, use cor_by
which returns a list of numeric
matrices. corp_by
is simply corp
+ by2
.
corp_by( data, vrb.nm, grp.nm, use = "pairwise.complete.obs", method = "pearson", sep = ".", digits = 3L, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE, diags = FALSE, lower = TRUE, upper = FALSE )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from data specifying the variables. |
grp.nm |
character vector of colnames from data specifying the grouping variables. |
use |
character vector of length 1 specifying how to handle missing data
when computing the correlations. The options are 1)
"pairwise.complete.obs", 2) "complete.obs", 3) "na.or.complete", 4)
"all.obs", or 5) "everything". See details of |
method |
character vector of length 1 specifying the type of
correlations to be computed. The options are 1) "pearson", 2) "kendall", or
3) "spearman". See details of |
sep |
character vector of length 1 specifying the string to combine the
group values together with. |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .001 level. |
lead.zero |
logical vector of length 1 specifying whether to retain a zero in front of the decimal place. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive correlations (minus signs are always in front of negative correlations). |
diags |
logical vector of length 1 specifying whether to retain the
values in the diagonal of the correlation matrix. If TRUE, then the
diagonal will be 1s with |
lower |
logical vector of length 1 specifying whether to retain the lower triangle of the correlation matrix. If TRUE, then the lower triangle correlations and their significance symbols are retained. If FALSE, then the lower triangle will all be NA. |
upper |
logical vector of length 1 specifying whether to retain the upper triangle of the correlation matrix. If TRUE, then the upper triangle correlations and their significance symbols are retained. If FALSE, then the upper triangle will all be NA. |
list of data.frames containing the correlation coefficients and their
appended significance symbols based upon their associated p-values. The
listnames are the unique combinations of the grouping variables, separated
by "sep" if multiple grouping variables (i.e., length(grp.nm)
> 1)
are input: unique(interaction(data[grp.nm], sep = sep))
. For each
data.frame, the rownames and colnames = vrb.nm
. The significance
symbols are specified by the arguments p.10
, p.05
,
p.01
, and p.001
, after the correlation value. The specific
elements of the return object are determined by the other arguments.
# one grouping variable
corp_by(airquality, vrb.nm = c("Ozone","Solar.R","Wind"), grp.nm = "Month")
corp_by(airquality, vrb.nm = c("Ozone","Solar.R","Wind"), grp.nm = "Month",
   use = "complete.obs", method = "spearman")
# two grouping variables
corp_by(mtcars, vrb.nm = c("mpg","disp","drat","wt"), grp.nm = c("vs","am"))
corp_by(mtcars, vrb.nm = c("mpg","disp","drat","wt"), grp.nm = c("vs","am"),
   use = "complete.obs", method = "spearman", sep = "_")
corp_miss
computes (point-biserial) correlations between missingness
on data columns and scores on other data columns. It also appends
significance symbols at the end of the correlations.
corp_miss( data, x.nm, m.nm, ov = FALSE, use = "pairwise.complete.obs", method = "pearson", m.suffix = if (ov) "_ov" else "_na", digits = 3L, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE )
data |
data.frame of data. |
x.nm |
character vector of colnames in data specifying the predictor columns. |
m.nm |
character vector of colnames in data specifying the columns to assess missingness for. |
ov |
logical vector of length 1 specifying whether the correlations should be with "observedness" rather than missingness. |
use |
character vector of length 1 specifying how to deal with missing
data in the predictor columns. See cor for details. |
method |
character vector of length 1 specifying what type of
correlations to compute. See cor for details. |
m.suffix |
character vector of length 1 specifying a string to append to
the end of the colnames to clarify whether they refer to missingness or
"observedness". Default is "_na" if ov = FALSE and "_ov" if ov = TRUE. |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .001 level. |
lead.zero |
logical vector of length 1 specifying whether to retain a zero in front of the decimal place. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive correlations (minus signs are always in front of negative correlations). |
cor_miss
calls make.dumNA
to create dummy vectors representing
missingness on the data[m.nm]
columns.
matrix of (point-biserial) correlations, with significance symbols appended, between rows of predictors and columns of missingness.
corp_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"))
corp_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"), ov = TRUE) # correlations with "observedness"
corp_miss(data = airquality, x.nm = c("Wind","Temp","Month","Day"),
   m.nm = c("Ozone","Solar.R"), use = "complete.obs", method = "kendall")
corp_ml
decomposes correlations from multilevel data into within-group
and between-group correlations as well as adds significance symbols to the
end of each value. The workhorse of the function is
statsBy
. corp_ml
is simply a combination of
cor_ml
and add_sig_cor
.
corp_ml( data, vrb.nm, grp.nm, use = "pairwise.complete.obs", method = "pearson", digits = 3L, p.10 = "", p.05 = "*", p.01 = "**", p.001 = "***", lead.zero = FALSE, trail.zero = TRUE, plus = FALSE, diags = FALSE, lower = TRUE, upper = FALSE )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of length 1 of a colname from |
use |
character vector of length 1 specifying how to handle missing
values when computing the correlations. The options are: 1)
"pairwise.complete.obs" which uses pairwise deletion, 2) "complete.obs"
which uses listwise deletion, and 3) "everything" which uses all cases and
returns NA for any correlations from columns in data[vrb.nm] with missing values. |
method |
character vector of length 1 specifying which type of correlations to compute. The options are: 1) "pearson" for traditional Pearson product-moment correlations, 2) "kendall" for Kendall rank correlations, and 3) "spearman" for Spearman rank correlations. |
digits |
integer vector of length 1 specifying the number of decimals to round to. |
p.10 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .10 level. |
p.05 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .05 level. |
p.01 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .01 level. |
p.001 |
character vector of length 1 specifying which symbol to append to the end of any correlation significant at the p < .001 level. |
lead.zero |
logical vector of length 1 specifying whether to retain a zero in front of the decimal place. |
trail.zero |
logical vector of length 1 specifying whether to retain zeros after the decimal place (due to rounding). |
plus |
logical vector of length 1 specifying whether to include a plus sign in front of positive correlations (minus signs are always in front of negative correlations). |
diags |
logical vector of length 1 specifying whether to retain the
values in the diagonal of the correlation matrix. If TRUE, then the
diagonal will be 1s with |
lower |
logical vector of length 1 specifying whether to retain the lower triangle of the correlation matrix. If TRUE, then the lower triangle correlations and their significance symbols are retained. If FALSE, then the lower triangle will all be NA. |
upper |
logical vector of length 1 specifying whether to retain the upper triangle of the correlation matrix. If TRUE, then the upper triangle correlations and their significance symbols are retained. If FALSE, then the upper triangle will all be NA. |
list of two elements that are data.frames with names "within" and
"between". The first data.frame has the within-group correlations with
their significance symbols at the end of the statistically significant
correlations based on their associated p-value. The second data.frame has
the between-group correlations with their significance symbols at the end
of the statistically significant correlations based on their associated
p-values. The rownames and colnames of each data.frame are vrb.nm.
The formatting of the two data.frames depends on several of the arguments.
cor_ml
for multilevel correlations without significance symbols,
corp_by
for correlations with significance symbols by group,
statsBy
the workhorse for the corp_ml
function,
add_sig_cor
for adding significance symbols to correlation matrices,
# traditional use
tmp <- c("outcome","case","session","trt_time") # roxygen2 does not like c() inside []
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp]
stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat
corp_ml(data = dat, vrb.nm = c("outcome","session","trt_time"), grp.nm = "case")
# varying the `use` and `method` arguments
corp_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"), grp.nm = "Month", use = "pairwise", method = "pearson")
corp_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"), grp.nm = "Month", use = "complete", method = "kendall")
corp_ml(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind","Temp"), grp.nm = "Month", use = "everything", method = "spearman")
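Because statsBy is the workhorse, the unformatted within-group and between-group matrices can be inspected directly from its output. A minimal sketch, reusing the stats_by object created above:

stats_by$rwg # pooled within-group correlations
stats_by$rbg # between-group correlations
stats_by$pwg; stats_by$pbg # corresponding p-values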
covs_test
computes sample covariances and tests for their significance
with the Pearson method assuming multivariate normality of the data. Note,
the normal-theory significance test for the covariance is much more sensitive
to departures from normality than the significance test for the mean. This
function is the covariance analogue to the psych::corr.test()
function
for correlations.
covs_test(data, vrb.nm, use = "pairwise", ci.level = 0.95, rtn.dfm = FALSE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames specifying the variables in
data. |
use |
character vector of length 1 specifying how missing values are
handled. Currently, there are only two options: 1) "pairwise" for pairwise
deletion (i.e., |
ci.level |
numeric vector of length 1 specifying the confidence level. It must be between 0 and 1 - or it can be NULL in which case confidence intervals are not computed and the return object does not have "lwr" or "upr" columns. |
rtn.dfm |
logical vector of length 1 specifying whether the return object should be an array (FALSE) or data.frame (TRUE). If an array, then the first two dimensions are the matrix dimensions from the covariance matrix and the 3rd dimension (aka layers) contains the statistical information (e.g., est, se, t). If data.frame, then the first two columns are the matrix dimensions from the covariance matrix expanded and the rest of the columns contain the statistical information (e.g., est, se, t). |
If rtn.dfm = FALSE
, an array where its first two dimensions
are the matrix dimensions from the covariance matrix and the 3rd dimension
(aka layers) contains the statistical information detailed below. If
rtn.dfm = TRUE
, a data.frame where its first two columns are the
expanded matrix dimensions from the covariance matrix and the rest of the
columns contain the statistical information detailed below:
sample covariances
standard errors of the covariances
t-values
degrees of freedom (n - 2)
two-sided p-values
lower bound of the confidence intervals (excluded if ci.level = NULL
)
upper bound of the confidence intervals (excluded if ci.level = NULL
)
cov
for covariance matrix estimates,
corr.test
for correlation matrix significance testing,
# traditional use
covs_test(data = attitude, vrb.nm = names(attitude))
covs_test(data = attitude, vrb.nm = names(attitude), ci.level = NULL) # no confidence intervals
covs_test(data = attitude, vrb.nm = names(attitude), rtn.dfm = TRUE) # return object as data.frame
# NOT same as simple linear regression slope
covTest <- covs_test(data = attitude, vrb.nm = names(attitude), ci.level = NULL, rtn.dfm = TRUE)
x <- covTest[with(covTest, rownames == "rating" & colnames == "complaints"), ]
lm_obj <- lm(rating ~ complaints, data = attitude)
y <- coef(summary(lm_obj))["complaints", , drop = FALSE]
print(x); print(y)
z <- x[, "cov"] / var(attitude$"complaints")
print(z) # dividing by variance of the predictor gives you the regression slope
# but the t-values and p-values are still different
# NOT same as correlation coefficient
covTest <- covs_test(data = attitude, vrb.nm = names(attitude), ci.level = NULL, rtn.dfm = TRUE)
x <- covTest[with(covTest, rownames == "rating" & colnames == "complaints"), ]
cor_test <- cor.test(x = attitude[[1]], y = attitude[[2]])
print(x); print(cor_test)
z <- x[, "cov"] / sqrt(var(attitude$"rating") * var(attitude$"complaints"))
print(z) # dividing by sqrt of the variances gives you the correlation
# but the t-values and p-values are still different
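The "cov" point estimates can be checked against base R's cov; a quick sketch using pairwise deletion to match use = "pairwise":

covTest <- covs_test(data = attitude, vrb.nm = names(attitude), rtn.dfm = TRUE)
covTest[with(covTest, rownames == "rating" & colnames == "complaints"), "cov"]
cov(attitude$"rating", attitude$"complaints", use = "pairwise.complete.obs") # same point estimate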
cronbach
computes Cronbach's alpha for a set of variables/items as an
estimate of reliability for a score. There are three different options for
confidence intervals. Missing data can be handled by either pairwise deletion
(use
= "pairwise.complete.obs") or listwise deletion (use
=
"complete.obs"). cronbach
is a wrapper for the
alpha
function in the psych
package.
cronbach( data, vrb.nm, ci.type = "delta", level = 0.95, use = "pairwise.complete.obs", stats = c("average_r", "nvrb"), R = 200L, boot.ci.type = "perc" )
cronbach( data, vrb.nm, ci.type = "delta", level = 0.95, use = "pairwise.complete.obs", stats = c("average_r", "nvrb"), R = 200L, boot.ci.type = "perc" )
data |
data.frame of data. |
vrb.nm |
character vector of colnames of |
ci.type |
character vector of length 1 specifying the type of confidence
interval to compute. The options are 1) "classic" is the Feldt et al.
(1987) procedure using only the mean covariance, 2) "delta" is the
Duhachek & Iacobucci (2004) procedure using the delta method of the
covariance matrix, or 3) "boot" is bootstrapped confidence intervals with
the method specified by boot.ci.type. |
level |
double vector of length 1 with a value between 0 and 1 specifying what confidence level to use. |
use |
character vector of length 1 specifying how to handle missing data
when computing the covariances. The options are 1) "pairwise.complete.obs",
2) "complete.obs", 3) "na.or.complete", 4) "all.obs", or 5) "everything".
See details of cov. |
stats |
character vector specifying the additional statistical information you would like related to cronbach's alpha. Options are: 1) "std.alpha" = cronbach's alpha of the standardized variables/items, 2) "G6(smc)" = Guttman's Lambda 6 reliability, 3) "average_r" = mean correlation between the variables/items, 4) "median_r" = median correlation between the variables/items, 5) "mean" = mean of the scores from averaging the variables/items together, 6) "sd" = standard deviation of the scores from averaging the variables/items together, 7) "nvrb" = number of variables/items. The default is "average_r" and "nvrb". |
R |
integer vector of length 1 specifying the number of bootstrapped
resamples to do. Only used when ci.type = "boot". |
boot.ci.type |
character vector of length 1 specifying the type of
bootstrapped confidence interval to compute. The options are 1) "perc" for
the regular percentile method, 2) "bca" for bias-corrected and accelerated
percentile method, 3) "norm" for the normal method that uses the
bootstrapped standard error to construct symmetrical confidence intervals
with the classic formula around the bias-corrected estimate, and 4) "basic"
for the basic method. Note, "stud" for the studentized method is NOT an
option. See boot.ci. |
When ci.type
= "classic" the confidence interval is based on the mean
covariance. It is the same as the confidence interval used by
alpha.ci
(Feldt, Woodruff, & Salih, 1987). When
ci.type
= "delta" the confidence interval is based on the delta method
of the covariance matrix. It is based on the standard error returned by
alpha
(Duhachek & Iacobucci, 2004).
double vector containing Cronbach's alpha, its standard error, and
its confidence interval, followed by any statistics requested via the
stats
argument.
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11, 93-103.
Duhachek, A., & Iacobucci, D. (2004). Alpha's standard error (ASE): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89(5), 792-808.
tmp_nm <- c("A2","A3","A4","A5") psych::alpha(psych::bfi[tmp_nm])[["total"]] a <- suppressMessages(psych::alpha(attitude))[["total"]]["raw_alpha"] a.ci <- psych::alpha.ci(a, n.obs = 30, n.var = 7, digits = 7) # n.var is optional and only needed to find r.bar cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "classic") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "delta") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "boot") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), stats = NULL) ## Not run: cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "boot", boot.ci.type = "bca") # will automatically convert to "perc" when "bca" fails ## End(Not run)
tmp_nm <- c("A2","A3","A4","A5") psych::alpha(psych::bfi[tmp_nm])[["total"]] a <- suppressMessages(psych::alpha(attitude))[["total"]]["raw_alpha"] a.ci <- psych::alpha.ci(a, n.obs = 30, n.var = 7, digits = 7) # n.var is optional and only needed to find r.bar cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "classic") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "delta") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "boot") cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), stats = NULL) ## Not run: cronbach(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"), ci.type = "boot", boot.ci.type = "bca") # will automatically convert to "perc" when "bca" fails ## End(Not run)
cronbachs
computes Cronbach's alpha for multiple sets of
variables/items as an estimate of reliability for multiple scores. There are
three different options for confidence intervals. Missing data can be handled
by either pairwise deletion (use
= "pairwise.complete.obs") or
listwise deletion (use
= "complete.obs"). cronbachs
is a
wrapper for the alpha
function in the psych
package.
cronbachs( data, vrb.nm.list, ci.type = "delta", level = 0.95, use = "pairwise.complete.obs", stats = c("average_r", "nvrb"), R = 200L, boot.ci.type = "perc" )
data |
data.frame of data. |
vrb.nm.list |
list of character vectors specifying the sets of
variables/items. Each element of vrb.nm.list is a character vector of colnames from data specifying one set of variables/items. |
ci.type |
character vector of length 1 specifying the type of confidence
interval to compute. The options are 1) "classic" = the Feldt et al. (1987)
procedure using only the mean covariance, 2) "delta" = the Duhachek &
Iacobucci (2004) procedure using the delta method of the covariance matrix,
or 3) "boot" = bootstrapped confidence intervals with the method specified
by boot.ci.type. |
level |
double vector of length 1 with a value between 0 and 1 specifying what confidence level to use. |
use |
character vector of length 1 specifying how to handle missing data
when computing the covariances. The options are 1) "pairwise.complete.obs",
2) "complete.obs", 3) "na.or.complete", 4) "all.obs", or 5) "everything".
See details of cov. |
stats |
character vector specifying the additional statistical information you would like related to cronbach's alpha. Options are: 1) "std.alpha" = cronbach's alpha of the standardized variables/items, 2) "G6(smc)" = Guttman's Lambda 6 reliability, 3) "average_r" = mean correlation between the variables/items, 4) "median_r" = median correlation between the variables/items, 5) "mean" = mean of the scores from averaging the variables/items together, 6) "sd" = standard deviation of the scores from averaging the variables/items together, 7) "nvrb" = number of variables/items. The default is "average_r" and "nvrb". |
R |
integer vector of length 1 specifying the number of bootstrapped
resamples to do. Only used when ci.type = "boot". |
boot.ci.type |
character vector of length 1 specifying the type of
bootstrapped confidence interval to compute. The options are 1) "perc" for
the regular percentile method, 2) "bca" for bias-corrected and accelerated
percentile method, 3) "norm" for the normal method that uses the
bootstrapped standard error to construct symmetrical confidence intervals
with the classic formula around the bias-corrected estimate, and 4) "basic"
for the basic method. Note, "stud" for the studentized method is NOT an
option. See boot.ci. |
When ci.type
= "classic" the confidence interval is based on the mean
covariance. It is the same as the confidence interval used by
alpha.ci
(Feldt, Woodruff, & Salih, 1987). When
ci.type
= "delta" the confidence interval is based on the delta method
of the covariance matrix. It is based on the standard error returned by
alpha
(Duhachek & Iacobucci, 2004).
data.frame containing the following columns:
Cronbach's alpha itself
standard error for Cronbach's alpha
lower bound of the confidence interval of Cronbach's alpha
upper bound of the confidence interval of Cronbach's alpha
any statistics requested via the stats argument
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11, 93-103.
Duhachek, A., & Iacobucci, D. (2004). Alpha's standard error (ASE): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89(5), 792-808.
dat0 <- psych::bfi
dat1 <- str2str::pick(x = dat0, val = c("A1","C4","C5","E1","E2","O2","O5", "gender","education","age"), not = TRUE, nm = TRUE)
vrb_nm_list <- lapply(X = str2str::sn(c("E","N","C","A","O")), FUN = function(nm) { str2str::pick(x = names(dat1), val = nm, pat = TRUE)})
cronbachs(data = dat1, vrb.nm.list = vrb_nm_list, ci.type = "classic")
cronbachs(data = dat1, vrb.nm.list = vrb_nm_list, ci.type = "delta")
cronbachs(data = dat1, vrb.nm.list = vrb_nm_list, ci.type = "boot")
suppressMessages(cronbachs(data = attitude, vrb.nm.list = list(names(attitude)))) # also works with only one set of variables/items
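Conceptually, cronbachs is cronbach looped over the sets of variables/items with the results stacked row-wise. A rough point-estimate sketch, reusing the dat1 and vrb_nm_list objects above:

t(sapply(X = vrb_nm_list, FUN = function(nm) cronbach(data = dat1, vrb.nm = nm))) # one row per score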
decompose
decomposes a numeric vector into within-group and
between-group components via within-group centering and group-mean
aggregation. There is an option to create a grand-mean centered version of
the between-person component as well as lead/lag versions of the original
vector and the within-group component.
decompose(x, grp, grand = TRUE, n.shift = NULL, undefined = NA)
x |
numeric vector. |
grp |
list of atomic vector(s) and/or factor(s) (e.g., data.frame),
which each have the same length as x. |
grand |
logical vector of length 1 specifying whether a grand-mean centered version of the between-group component should be computed. |
n.shift |
integer vector specifying the direction and magnitude of the
shifts. For example, a one-lead is +1 and a two-lag is -2. See shift_by. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as |
data.frame with nrow = length(x)
and row.names =
names(x)
. The first two columns correspond to the within-group component
(i.e., "wth") and the between-group component (i.e., "btw"). If grand =
TRUE, then the third column corresponds to the grand-mean centered
between-group component (i.e., "btw_c"). If shift != NULL, then the last
columns are the shifts indicated by n.shift, where the shifts of x
are first (i.e., "tot") and then the shifts of the within-group component
are second (i.e., "wth"). The naming of the shifted columns is based on the
default behavior of Shift_by
. See the details of Shift_by
. If
you don't like the default naming, then call Decompose
instead and
use the different suffix arguments.
decomposes
center_by
agg
shift_by
# single grouping variable
chick_data <- as.data.frame(ChickWeight) # because the "groupedData" class
# calls `[.groupedData`, which is different than `[.data.frame`
decompose(x = ChickWeight[["weight"]], grp = ChickWeight[["Chick"]])
decompose(x = ChickWeight[["weight"]], grp = ChickWeight[["Chick"]], grand = FALSE) # no grand-mean centering
decompose(x = setNames(obj = ChickWeight[["weight"]], nm = paste0(row.names(ChickWeight),"_row")), grp = ChickWeight[["Chick"]]) # with names
# multiple grouping variables
tmp_nm <- c("Type","Treatment") # b/c Roxygen2 doesn't like c() in a []
decompose(x = as.data.frame(CO2)[["uptake"]], grp = as.data.frame(CO2)[tmp_nm])
decompose(x = as.data.frame(CO2)[["uptake"]], grp = as.data.frame(CO2)[tmp_nm], n.shift = 1)
decompose(x = as.data.frame(CO2)[["uptake"]], grp = as.data.frame(CO2)[tmp_nm], n.shift = c(+2, +1, -1, -2))
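The two core components have simple base-R analogues: the between-group component is the group mean (cf. agg) and the within-group component is the group-mean centered score (cf. center_by). A minimal sketch, ignoring the grand-mean centering and shift options:

x <- ChickWeight[["weight"]]
g <- ChickWeight[["Chick"]]
btw <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE)) # group means
wth <- x - btw # group-mean centered scores
head(data.frame(wth, btw))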
decomposes
decomposes numeric data by group into within-group and
between- group components via within-group centering and group-mean
aggregation. There is an option to create a grand-mean centered version of
the between-group components.
decomposes( data, vrb.nm, grp.nm, grand = TRUE, n.shift = NULL, undefined = NA, suffix.wth = "_w", suffix.btw = "_b", suffix.grand = "c", suffix.lead = "_dw", suffix.lag = "_gw" )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of colnames from |
grand |
logical vector of length 1 specifying whether grand-mean centered versions of the between-group components should be computed. |
n.shift |
integer vector specifying the direction and magnitude of the
shifts. For example, a one-lead is +1 and a two-lag is -2. See shifts_by. |
undefined |
atomic vector of length 1 (probably makes sense to be the
same typeof as the vectors in |
suffix.wth |
character vector with a single element specifying the string to append to the end of the within-group component colnames of the return object. |
suffix.btw |
character vector with a single element specifying the string to append to the end of the between-group component colnames of the return object. |
suffix.grand |
character vector with a single element specifying the
string to append to the end of the grand-mean centered version of the
between-group component colnames of the return object. Note, this is a
string that is appended after suffix.btw. |
suffix.lead |
character vector with a single element specifying the
string to append to the end of the positive shift colnames of the return
object. Note, |
suffix.lag |
character vector with a single element specifying the
string to append to the end of the negative shift colnames of the return
object. Note, |
data.frame with nrow = nrow(data)
and rownames =
row.names(data)
. The first set of columns correspond to the
within-group components, followed by the between-group components. If grand
= TRUE, then the next set of columns correspond to the grand-mean centered
between-group components. If shift != NULL, then the last columns are the
shifts by group indicated by n.shift, where the shifts of
data[vrb.nm]
are first and then the shifts of the within-group
components are second.
decompose
centers_by
aggs
shifts_by
ChickWeight2 <- as.data.frame(ChickWeight)
row.names(ChickWeight2) <- as.numeric(row.names(ChickWeight)) / 1000
decomposes(data = ChickWeight2, vrb.nm = c("weight","Time"), grp.nm = "Chick")
decomposes(data = ChickWeight2, vrb.nm = c("weight","Time"), grp.nm = "Chick", suffix.wth = ".wth", suffix.btw = ".btw", suffix.grand = ".grand")
decomposes(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"), grp.nm = c("Type","Treatment")) # multiple grouping columns
decomposes(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"), grp.nm = c("Type","Treatment"), n.shift = 1) # with lead
decomposes(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"), grp.nm = c("Type","Treatment"), n.shift = c(+2, +1, -1, -2)) # with multiple lead/lags
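Because the within-group and between-group components partition each score, adding the two columns back together should reconstruct the original variable. A quick check, assuming the default suffixes "_w" and "_b":

d <- decomposes(data = ChickWeight2, vrb.nm = "weight", grp.nm = "Chick")
names(d) # e.g., "weight_w", "weight_b", "weight_bc" with the default suffixes
all.equal(d[["weight_w"]] + d[["weight_b"]], ChickWeight2[["weight"]]) # should be TRUE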
deff
computes the design effect for a multilevel numeric vector.
Design effects summarize how much larger sampling variances (i.e., squared
standard errors) are due to the multilevel structure of the data. By taking
the square root, the value summarizes how much larger standard errors are due
to the multilevel structure of the data.
deff(x, grp, how = "lme", REML = TRUE)
x |
numeric vector. |
grp |
atomic vector the same length as x specifying the groups. |
how |
character vector of length 1 specifying how the ICC(1,1) should be
calculated. There are four options: 1) "lme" uses a linear mixed effects
model with the function |
REML |
logical vector of length 1 specifying whether restricted maximum likelihood estimation (TRUE) should be used rather than traditional maximum likelihood estimation (FALSE). Only used for linear mixed effects models if how = "lme" or how = "lmer". |
Design effects are a function of both the intraclass correlation (ICC) and the average group size: DEFF = 1 + (average group size - 1) * ICC. Design effects can be large due to large ICCs and small group sizes or small ICCs and large group sizes. For example, with an ICC = .01 and an average group size of 100, the design effect would be 1 + (100 - 1) * .01 = 1.99 (i.e., roughly 2.0), whose square root is 1.41. For more information, see myths 1 and 2 in Huang (2018).
double vector of length 1 providing the design effect.
Huang, F. L. (2018). Multilevel modeling myths. School Psychology Quarterly, 33(3), 492-499.
icc_11(x = airquality$"Ozone", grp = airquality$"Month")
length_by(x = airquality$"Ozone", grp = airquality$"Month", na.rm = TRUE)
deff(x = airquality$"Ozone", grp = airquality$"Month")
sqrt(deff(x = airquality$"Ozone", grp = airquality$"Month")) # how much SE inflated
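The arithmetic behind the example in the details can be verified directly from the design-effect formula given above:

icc <- 0.01
m <- 100 # average group size
1 + (m - 1) * icc # design effect = 1.99, roughly 2.0
sqrt(1 + (m - 1) * icc) # about 1.41, the SE inflation factor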
deffs
computes the design effects for multilevel numeric data. Design
effects summarize how much larger sampling variances (i.e., squared standard
errors) are due to the multilevel structure of the data. By taking the square
root, the value summarizes how much larger standard errors are due to the
multilevel structure of the data.
deffs(data, vrb.nm, grp.nm, how = "lme", REML = FALSE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of length 1 of a colname from |
how |
character vector of length 1 specifying how the ICC(1,1) should be
calculated. There are four options: 1) "lme" uses a linear mixed effects
model with the function |
REML |
logical vector of length 1 specifying whether restricted maximum likelihood estimation (TRUE) should be used rather than traditional maximum likelihood estimation (FALSE). Only used for linear mixed effects models if how = "lme" or how = "lmer". |
Design effects are a function of both the intraclass correlation (ICC) and the average group size: DEFF = 1 + (average group size - 1) * ICC. Design effects can be large due to large ICCs and small group sizes or small ICCs and large group sizes. For example, with an ICC = .01 and an average group size of 100, the design effect would be 1 + (100 - 1) * .01 = 1.99 (i.e., roughly 2.0), whose square root is 1.41. For more information, see myths 1 and 2 in Huang (2018).
double vector providing the design effects with names =
vrb.nm
.
Huang, F. L. (2018). Multilevel modeling myths. School Psychology Quarterly, 33(3), 492-499.
iccs_11(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month")
lengths_by(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month", na.rm = TRUE)
deffs(data = airquality, vrb.nm = c("Ozone","Solar.R"), grp.nm = "Month")
describe_ml
decomposes descriptive statistics from multilevel data
into within-group and between-group descriptives. The data is first separated
out into within-group components via centers_by
and between-group
components via aggs
. Then the psych
function
describe
is applied to both.
describe_ml( data, vrb.nm, grp.nm, na.rm = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE, trim = 0.1, type = 3, quant = NULL, IQR = FALSE )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of length 1 of a colname from |
na.rm |
logical vector of length 1 specifying whether missing values
should be removed before calculating the descriptive statistics. See
|
interp |
logical vector of length 1 specifying whether the median should be standard (FALSE) or interpolated (TRUE). |
skew |
logical vector of length 1 specifying whether skewness and kurtosis should be calculated (TRUE) or not (FALSE). |
ranges |
logical vector of length 1 specifying whether the minimum,
maximum, and range (i.e., maximum - minimum) should be calculated (TRUE) or
not (FALSE). Note, if |
trim |
numeric vector of length 1 specifying the top and bottom quantiles of data that are to be excluded when calculating the trimmed mean. For example, the default value of 0.1 means that only data within the 10th - 90th quantiles are used for calculating the trimmed mean. |
type |
numeric vector of length 1 specifying the type of skewness and
kurtosis coefficients to compute. See the details of
describe. |
quant |
numeric vector specifying the quantiles to compute. For example,
the default value of c(0.25, 0.75) computes the 25th and 75th quantiles of
the group number of cases. If |
IQR |
logical vector of length 1 specifying whether to compute the Interquartile Range (TRUE) or not (FALSE), which is simply the 75th quantile minus the 25th quantile. |
list of two elements each containing a data.frame of descriptive statistics, the first for the within-person components ("within") and the second for the between-person components ("between").
tmp_nm <- c("outcome","case","session","trt_time") dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm] stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat describe_ml(data = dat, vrb.nm = c("outcome","session","trt_time"), grp.nm = "case")
tmp_nm <- c("outcome","case","session","trt_time") dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm] stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat describe_ml(data = dat, vrb.nm = c("outcome","session","trt_time"), grp.nm = "case")
dum2nom
converts dummy variables to a nominal variable. The
information from the dummy columns in a data.frame are combined into a
character vector (or factor if rtn.fct
= TRUE) representing a nominal
variable. The unique values of the nominal variable will be the dummy
colnames (i.e., dum.nm
). Note, *all* the dummy variables associated
with a nominal variable are required for this function to work properly. In
regression-like models, data analysts will exclude one dummy variable for the
category that is the reference group. If d = number of categories in the
nominal variable, then that leads to d - 1 dummy variables in the model.
dum2nom
requires all d dummy variables.
dum2nom(data, dum.nm, yes = 1L, rtn.fct = FALSE)
data |
data.frame of data. |
dum.nm |
character vector of colnames from |
yes |
atomic vector of length 1 specifying the unique value of the category in each dummy column. This must be the same value for all the dummy variables. |
rtn.fct |
logical vector of length 1 specifying whether the return object should be a factor (TRUE) or a character vector (FALSE). |
dum2nom
tests to ensure that data[dum.nm]
are indeed a set of
dummy columns. First, the dummy columns are expected to have the same mode
such that there is one yes
unique value across the dummy columns.
Second, each row in data[dum.nm]
is expected to have either 0 or 1
instance of yes
. If there is more than one instance of yes
in a
row, then an error is returned. If there are 0 instances of yes
in a
row (e.g., all missing values), NA is returned for that row. Note, any value
other than yes
will be treated as a no.
character vector (or factor if rtn.fct
= TRUE) containing the
unique values of dum.nm
- one for each dummy variable.
dum <- data.frame(
  "Quebec_nonchilled" = ifelse(CO2$"Type" == "Quebec" & CO2$"Treatment" == "nonchilled", yes = 1L, no = 0L),
  "Quebec_chilled" = ifelse(CO2$"Type" == "Quebec" & CO2$"Treatment" == "chilled", yes = 1L, no = 0L),
  "Mississippi_nonchilled" = ifelse(CO2$"Type" == "Mississippi" & CO2$"Treatment" == "nonchilled", yes = 1L, no = 0L),
  "Mississippi_chilled" = ifelse(CO2$"Type" == "Mississippi" & CO2$"Treatment" == "chilled", yes = 1L, no = 0L)
)
dum2nom(data = dum, dum.nm = names(dum)) # default
dum2nom(data = dum, dum.nm = names(dum), rtn.fct = TRUE) # return as a factor
## Not run:
dum2nom(data = npk, dum.nm = c("N","P","K")) # error due to overlapping dummy columns
dum2nom(data = mtcars, dum.nm = c("vs","am")) # error due to overlapping dummy columns
## End(Not run)
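The underlying logic is easy to see in base R: each row's category is the colname of its single yes entry. A simplified sketch with no validity checks, assuming yes = 1L and the dum object above:

idx <- apply(dum == 1L, MARGIN = 1, FUN = function(r) if (sum(r) == 1L) which(r) else NA_integer_)
nom <- names(dum)[idx] # colname of the single 1 in each row
table(nom, useNA = "always")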
freq
creates univariate frequency tables similar to table
. It
differs from table
by allowing for custom sorting by something other
than the alphanumerics of the unique values as well as returning an atomic
vector rather than a 1D-array.
freq( x, exclude = if (useNA == "no") c(NA, NaN), useNA = "always", prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = TRUE )
x |
atomic vector or list vector. If not a vector, it will be coerced to
a vector via |
exclude |
unique values of |
useNA |
character vector of length 1 specifying how to handle missing
values (i.e., whether to include NA as an element in the returned table).
There are three options: 1) "no" = don't include missing values in the
table, 2) "ifany" = include missing values if there are any, 3) "always" =
include missing values in the table, regardless of whether there are any or
not. See table. |
prop |
logical vector of length 1 specifying whether the returned table should include counts (FALSE) or proportions (TRUE). If NAs are excluded (e.g., useNA = "no" or exclude = c(NA, NaN)), then the proportions will be based on the number of observed elements. |
sort |
character vector of length 1 specifying how the returned table
will be sorted. There are three options: 1) "frequency" = the frequency of
the unique values in |
decreasing |
logical vector of length 1 specifying whether the table should be sorted in decreasing (TRUE) or increasing (FALSE) order. |
na.last |
logical vector of length 1 specifying whether the table should
have NAs last or in whatever position they end up at. This argument is only
relevant if NAs exist in x. |
The name for the table element giving the frequency of missing values is
"(NA)". This is different from table
where the name is
NA_character_
. This change allows for the sorting of tables that
include missing values, as subsetting in R is not possible with
NA_character_
names. In future versions of the package, this might
change as it should be possible to avoid this issue by subsetting with a
logical vector or integer indices instead of names. However, it is convenient
to be able to subset the return object fully by names.
numeric vector of frequencies as either counts (if prop
=
FALSE) or proportions (if prop
= TRUE) with the unique values of
x
as names (missing values have name = "(NA)"). Note, this is
different from table
, which returns a 1D-array and has class
"table".
freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "frequency", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "frequency", decreasing = FALSE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "position", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "position", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "position", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "position", decreasing = FALSE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "alphanum", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "alphanum", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "alphanum", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "alphanum", decreasing = FALSE, na.last = FALSE)
freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "frequency", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "frequency", decreasing = FALSE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "position", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "position", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "position", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "position", decreasing = FALSE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "alphanum", decreasing = TRUE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = FALSE, sort = "alphanum", decreasing = TRUE, na.last = FALSE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "alphanum", decreasing = FALSE, na.last = TRUE) freq(c(mtcars$"carb", NA, NA, mtcars$"gear"), prop = TRUE, sort = "alphanum", decreasing = FALSE, na.last = FALSE)
freq_by creates univariate frequency tables by group, applying
freq to the subset of x within each group defined by grp.
Like freq, it allows for custom sorting by something other than the
alphanumerics of the unique values as well as returning atomic vectors
(one per group) rather than 1D-arrays.
freq_by( x, grp, exclude = if (useNA == "no") c(NA, NaN), useNA = "always", prop = FALSE, sort = "frequency", decreasing = TRUE, na.last = TRUE )
x |
atomic vector. |
grp |
atomic vector or list of atomic vectors (e.g., data.frame)
specifying the groups. The atomic vector(s) must be the length of x. |
exclude |
unique values of |
useNA |
character vector of length 1 specifying how to handle missing
values (i.e., whether to include NA as an element in the returned table).
There are three options: 1) "no" = don't include missing values in the
table, 2) "ifany" = include missing values if there are any, 3) "always" =
include missing values in the table, regardless of whether there are any or
not. See table. |
prop |
logical vector of length 1 specifying whether the returned table should include counts (FALSE) or proportions (TRUE). If NAs are excluded (e.g., useNA = "no" or exclude = c(NA, NaN)), then the proportions will be based on the number of observed elements. |
sort |
character vector of length 1 specifying how the returned table
will be sorted. There are three options: 1) "frequency" = the frequency of
the unique values in |
decreasing |
logical vector of length 1 specifying whether the table should be sorted in decreasing (TRUE) or increasing (FALSE) order. |
na.last |
logical vector of length 1 specifying whether the table should
have NAs last or in whatever position they end up at. This argument is only
relevant if NAs exist in x. |
The name for the table element giving the frequency of missing values is
"(NA)". This is different from table
where the name is
NA_character_
. This change allows for the sorting of tables that
include missing values, as subsetting in R is not possible with
NA_character_
names. In future versions of the package, this might
change as it should be possible to avoid this issue by subsetting with a
logical vector or integer indices instead of names. However, it is convenient
to be able to subset the return object fully by names.
list of numeric vectors of frequencies by group. The number of list
elements are the groups specified by unique(interaction(grp, sep =
sep))
. The frequencies are either counts (if prop
= FALSE) or
proportions (if prop
= TRUE) with the unique values of x
as
names (missing values have name = "(NA)"). Note, this is different from
table
, which returns a 1D-array and has class "table".
x <- freq_by(mtcars$"gear", grp = mtcars$"vs") str(x) y <- freq_by(mtcars$"am", grp = mtcars$"vs", useNA = "no") str(y) str2str::lv2m(lapply(X = y, FUN = rev), along = 1) # ready to pass to prop.test()
x <- freq_by(mtcars$"gear", grp = mtcars$"vs") str(x) y <- freq_by(mtcars$"am", grp = mtcars$"vs", useNA = "no") str(y) str2str::lv2m(lapply(X = y, FUN = rev), along = 1) # ready to pass to prop.test()
freqs
creates a frequency table for a set of variables in a
data.frame. Depending on total
, frequencies for all the variables
together can be returned. The function probably makes the most sense for sets
of variables with similar unique values (e.g., items from a questionnaire
with similar response options).
freqs(data, vrb.nm, prop = FALSE, useNA = "always", total = "no")
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
prop |
logical vector of length 1 specifying whether the frequencies
should be counts (FALSE) or proportions (TRUE). Note, whether the
proportions include missing values depends on the useNA argument. |
useNA |
character vector of length 1 specifying how missing values
should be handled. The three options are 1) "no" = do not include NA
frequencies in the return object, 2) "ifany" = only NA frequencies if there
are any missing values (in any variable from |
total |
character vector of length 1 specifying whether the frequencies
for the set of variables as a whole should be returned. The name "total"
refers to tabulating the frequencies for the variables from
|
freqs
uses plyr::rbind.fill
to combine the results from
table
applied to each variable into a single data.frame. If a variable
from data[vrb.nm]
does not have values present in other variables from
data[vrb.nm]
, then the frequencies in the return object will be 0.
The name for the table element giving the frequency of missing values is
"(NA)". This is different from table
where the name is
NA_character_
. This change allows for the sorting of tables that
include missing values, as subsetting in R is not possible with
NA_character_
names. In future versions of the package, this might
change as it should be possible to avoid this issue by subsetting with a
logical vector or integer indices instead of names. However, it is convenient
to be able to subset the return object fully by names.
data.frame of frequencies for the variables in data[vrb.nm]
.
Depending on prop
, the frequencies are either counts (FALSE) or
proportions (TRUE). Depending on total
, the nrow is either 1)
length(vrb.nm)
(if total
= "no"), 1 + length(vrb.nm)
(if total
= "yes"), or 3) 1 (if total
= "only"). The rownames
are vrb.nm
for each variable in data[vrb.nm]
and "_total_"
for the total row (if present). The colnames are the unique values present
in data[vrb.nm]
, potentially including "(NA)" depending on
useNA
.
vrb_nm <- str2str::inbtw(names(psych::bfi), "A1","O5")
freqs(data = psych::bfi, vrb.nm = vrb_nm) # default
freqs(data = psych::bfi, vrb.nm = vrb_nm, prop = TRUE) # proportions by row
freqs(data = psych::bfi, vrb.nm = vrb_nm, useNA = "no") # without NA counts
freqs(data = psych::bfi, vrb.nm = vrb_nm, total = "yes") # include total counts
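The per-column counts come from table; freqs then aligns the unique values across columns (filling values absent from a column with 0, per the details above) and stacks the rows. The raw ingredients:

lapply(X = psych::bfi[c("A1","A2","A3")], FUN = table, useNA = "always")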
freqs_by
creates a frequency table for a set of variables in a
data.frame by group. Depending on total
, frequencies for all the
variables together can be returned by group. The function probably makes the
most sense for sets of variables with similar unique values (e.g., items from
a questionnaire with similar response options).
freqs_by( data, vrb.nm, grp.nm, prop = FALSE, useNA = "always", total = "no", sep = "." )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of colnames from |
prop |
logical vector of length 1 specifying whether the frequencies
should be counts (FALSE) or proportions (TRUE). Note, whether the
proportions include missing values depends on the useNA argument. |
useNA |
character vector of length 1 specifying how missing values
should be handled. The three options are 1) "no" = do not include NA
frequencies in the return object, 2) "ifany" = only NA frequencies if there
are any missing values (in any variable from |
total |
character vector of length 1 specifying whether the frequencies
for the set of variables as a whole should be returned. The name "total"
refers to tabulating the frequencies for the variables from
|
sep |
character vector of length 1 specifying the string to combine the
group values together with. |
freqs_by
uses plyr::rbind.fill
to combine the results from
table
applied to each variable into a single data.frame for each
group. If a variable from data[vrb.nm]
for each group does not have
values present in other variables from data[vrb.nm]
for that group,
then the frequencies in the return object will be 0.
The name for the table element giving the frequency of missing values is
"(NA)". This is different from table
where the name is
NA_character_
. This change allows for the sorting of tables that
include missing values, as subsetting in R is not possible with
NA_character_
names. In future versions of the package, this might
change as it should be possible to avoid this issue by subsetting with a
logical vector or integer indices instead of names. However, it is convenient
to be able to subset the return object fully by names.
list of data.frames containing the frequencies for the variables in
data[vrb.nm]
by group. The number of list elements are the groups
specified by unique(interaction(data[grp.nm], sep = sep))
. Depending
on prop
, the frequencies are either counts (FALSE) or proportions
(TRUE) by group. Depending on total
, the nrow for each data.frame is
either 1) length(vrb.nm)
(if total
= "no"), 1 +
length(vrb.nm)
(if total
= "yes"), or 3) 1 (if total
=
"only"). The rownames are vrb.nm
for each variable in
data[vrb.nm]
and "_total_" for the total row (if present). The
colnames for each data.frame are the unique values present in
data[vrb.nm]
, potentially including "(NA)" depending on
useNA
.
vrb_nm <- str2str::inbtw(names(psych::bfi), "A1","O5")
freqs_by(data = psych::bfi, vrb.nm = vrb_nm, grp.nm = "gender") # default
freqs_by(data = psych::bfi, vrb.nm = vrb_nm, grp.nm = "gender", prop = TRUE) # proportions by row
freqs_by(data = psych::bfi, vrb.nm = vrb_nm, grp.nm = "gender", useNA = "no") # without NA counts
freqs_by(data = psych::bfi, vrb.nm = vrb_nm, grp.nm = "gender", total = "yes") # include total counts
freqs_by(data = psych::bfi, vrb.nm = vrb_nm, grp.nm = c("gender","education")) # multiple grouping variables
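The list names of the return object follow unique(interaction(data[grp.nm], sep = sep)), so the group labels can be previewed directly. A quick sketch with two grouping columns:

unique(interaction(psych::bfi[c("gender","education")], sep = "."))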
gtheory
uses generalizability theory to compute the reliability
coefficient of a score. It assumes single-level data where the rows are cases
and the columns are variables/items. Generalizability theory coefficients in
this case are the same as intraclass correlations (ICC). The default,
cross.vrb = TRUE, computes ICC(3,k), which is identical to cronbach's alpha.
When cross.vrb is FALSE, ICC(2,k) is computed, which takes mean
is FALSE, ICC(2,k) is computed, which takes mean
differences between variables/items into account. gtheory
is a wrapper
function for ICC
.
gtheory( data, vrb.nm, ci.type = "classic", level = 0.95, cross.vrb = TRUE, R = 200L, boot.ci.type = "perc" )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
ci.type |
character vector of length = 1 specifying the type of confidence interval to compute. There are currently two options: 1) "classic" = traditional ICC-based confidence intervals (see details), 2) "boot" = bootstrapped confidence intervals. |
level |
double vector of length 1 specifying the confidence level from 0 to 1. |
cross.vrb |
logical vector of length 1 specifying whether the variables/items should be crossed when computing the generalizability theory coefficient. If TRUE, then only the covariance structure of the variables/items will be incorporated into the estimate of reliability. If FALSE, then the mean structure of the variables/items will be incorporated. |
R |
integer vector of length 1 specifying the number of bootstrapped
resamples to use. Only used if ci.type = "boot". |
boot.ci.type |
character vector of length 1 specifying the type of
bootstrapped confidence interval to compute. The options are 1) "perc" for
the regular percentile method, 2) "bca" for bias-corrected and accelerated
percentile method, 3) "norm" for the normal method that uses the
bootstrapped standard error to construct symmetrical confidence intervals
with the classic formula around the bias-corrected estimate, and 4) "basic"
for the basic method. Note, "stud" for the studentized method is NOT an
option. See boot.ci. |
When ci.type
= "classic" the confidence intervals are computed
according to the formulas laid out by McGraw and Wong (1996).
These are taken from the ICC
function in the
psych
package. They are appropriately non-symmetrical given ICCs are
bounded, ranging from 0 to 1. Therefore, there is no standard error
associated with the coefficient. Note, they differ from the confidence
intervals available in the cronbach
function. When
ci.type
= "boot" the standard deviation of the empirical sampling
distribution is returned as the standard error, which may or may not be
trustworthy depending on the value of the ICC and sample size.
double vector containing the generalizability theory coefficient,
its standard error (if ci.type = "boot"), and its confidence
interval.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30-46 (plus errata on page 390).
gtheory(attitude, vrb.nm = names(attitude), ci.type = "classic")
## Not run:
gtheory(attitude, vrb.nm = names(attitude), ci.type = "boot")
gtheory(attitude, vrb.nm = names(attitude), ci.type = "boot", R = 250L, boot.ci.type = "bca")
## End(Not run)
# comparison to cronbach's alpha:
gtheory(attitude, names(attitude))
gtheory(attitude, names(attitude), cross.vrb = FALSE)
a <- suppressMessages(psych::alpha(attitude)[["total"]]["raw_alpha"])
psych::alpha.ci(a, n.obs = 30, n.var = 7, digits = 7) # slightly different confidence interval
gtheory(attitude, vrb.nm = names(attitude), ci.type = "classic") ## Not run: gtheory(attitude, vrb.nm = names(attitude), ci.type = "boot") gtheory(attitude, vrb.nm = names(attitude), ci.type = "boot", R = 250L, boot.ci.type = "bca") ## End(Not run) # comparison to cronbach's alpha: gtheory(attitude, names(attitude)) gtheory(attitude, names(attitude), cross.vrb = FALSE) a <- suppressMessages(psych::alpha(attitude)[["total"]]["raw_alpha"]) psych::alpha.ci(a, n.obs = 30, n.var = 7, digits = 7) # slightly different confidence interval
gtheory_ml uses generalizability theory to compute the reliability coefficients of a multilevel score. It computes a within-group coefficient that assesses the reliability of the group-deviated score (e.g., after calling `center_by`) and a between-group coefficient that assesses the reliability of the mean aggregate score (e.g., after calling `agg`). It assumes two-level data where the rows are in long format and the columns are the variables/items of the score. Generalizability theory coefficients with multilevel data are analogous to intraclass correlations (ICC), but add an additional grouping variable. The default, `cross.obs` = TRUE, computes a multilevel version of ICC(3,k). When `cross.obs` = FALSE, a multilevel version of ICC(2,k) is computed, which takes mean differences between variables/items into account. gtheory_ml is a wrapper function for `mlr` from the psych package. Note, this function can take several minutes to run if you have a moderate to large dataset.
gtheory_ml(data, vrb.nm, grp.nm, obs.nm, cross.obs = TRUE)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables/items. |
| `grp.nm` | character vector of length 1 with a colname from `data` specifying the grouping variable. |
| `obs.nm` | character vector of length 1 with a colname from `data` specifying the observation variable (e.g., timepoint). |
| `cross.obs` | logical vector of length 1 specifying whether the observations should be crossed when computing the generalizability theory coefficient. If TRUE, the observations are treated as fixed; if FALSE, they are treated as random. See details. |
gtheory_ml uses `mlr`, which is based on the formulas in Shrout and Lane (2012). When `cross.obs` = TRUE, the within-group coefficient is Rc and the between-group coefficient is RkF. When `cross.obs` = FALSE, the within-group coefficient is Rcn and the between-group coefficient is RkRn.

gtheory_ml does not currently have standard errors or confidence intervals. I am not aware of mathematical formulas for analytical confidence intervals, and because the generalizability theory coefficients can take several minutes to estimate, bootstrapped confidence intervals seem too time-intensive to be useful at the moment.
gtheory_ml does not work with a single variable/item. You can still use generalizability theory to estimate between-group reliability in that instance though. To do so, reshape the variable/item from long to wide (e.g., `unstack2`) so that you have a column for each observation of that single variable/item and the rows are the groups. Then you can use `gtheory` and treat each observation as a "different" variable/item.
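As a concrete illustration of this workaround, here is a minimal sketch assuming the lme4 package is installed (a hypothetical choice of data: lme4's sleepstudy, where the single variable/item is Reaction, the groups are Subjects, and the observations are Days). It uses base reshape() rather than `unstack2`:

dat_long <- lme4::sleepstudy
# one column per observation (day), one row per group (subject)
dat_wide <- reshape(dat_long, direction = "wide", idvar = "Subject",
   timevar = "Days", v.names = "Reaction", sep = ".")
obs_nm <- grep("^Reaction\\.", names(dat_wide), value = TRUE)
# each observation treated as a "different" variable/item
gtheory(data = dat_wide, vrb.nm = obs_nm) # between-group reliability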
list with two elements. The first is named "within" and refers to the within-group reliability. The second is named "between" and refers to the between-group reliability. Each contains a double vector where the first element is named "est" and contains the generalizability theory coefficient itself, the second element is named "average_r" and contains the average correlation at that level of the data based on `cor_ml` (which is a wrapper for `statsBy`), and the third element is named "nvrb" and contains the number of variables/items. The latter two elements are included because, even though the reliability coefficients are calculated from variance components, they are indirectly based on the average correlation and number of variables/items, similar to Cronbach's alpha.
Shrout, P. E., & Lane, S. P. (2012). Psychometrics. In M. R. Mehl & T. S. Conner (Eds.), Handbook of research methods for studying daily life (pp. 302-320). New York, NY: Guilford Press.
shrout <- structure(list(
   Person = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
      1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
   Time = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
      3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
   Item1 = c(2L, 3L, 6L, 3L, 7L, 3L, 5L, 6L, 3L, 8L,
      4L, 4L, 7L, 5L, 6L, 1L, 5L, 8L, 8L, 6L),
   Item2 = c(3L, 4L, 6L, 4L, 8L, 3L, 7L, 7L, 5L, 8L,
      2L, 6L, 8L, 6L, 7L, 3L, 9L, 9L, 7L, 8L),
   Item3 = c(6L, 4L, 5L, 3L, 7L, 4L, 7L, 8L, 9L, 9L,
      5L, 7L, 9L, 7L, 8L, 4L, 7L, 9L, 9L, 6L)),
   .Names = c("Person", "Time", "Item1", "Item2", "Item3"),
   class = "data.frame", row.names = c(NA, -20L))
mlr_obj <- psych::mlr(x = shrout, grp = "Person", Time = "Time",
   items = c("Item1", "Item2", "Item3"), alpha = FALSE, icc = FALSE,
   aov = FALSE, lmer = TRUE, lme = FALSE, long = FALSE, plot = FALSE)
gtheory_ml(data = shrout, vrb.nm = c("Item1", "Item2", "Item3"),
   grp.nm = "Person", obs.nm = "Time", cross.obs = TRUE) # crossed time
gtheory_ml(data = shrout, vrb.nm = c("Item1", "Item2", "Item3"),
   grp.nm = "Person", obs.nm = "Time", cross.obs = FALSE) # nested time
gtheorys uses generalizability theory to compute the reliability coefficients of multiple scores. It assumes single-level data where the rows are cases and the columns are variables/items. Generalizability theory coefficients in this case are the same as intraclass correlations (ICC). The default, `cross.vrb` = TRUE, computes ICC(3,k), which is identical to Cronbach's alpha. When `cross.vrb` is FALSE, ICC(2,k) is computed, which takes mean differences between variables/items into account. gtheorys is a wrapper function for `ICC` from the psych package.
gtheorys(
  data,
  vrb.nm.list,
  ci.type = "classic",
  level = 0.95,
  cross.vrb = TRUE,
  R = 200L,
  boot.ci.type = "perc"
)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm.list` | list of character vectors containing colnames from `data`, where each character vector specifies the variables/items of a score. |
| `ci.type` | character vector of length 1 specifying the type of confidence interval to compute. There are currently two options: 1) "classic" = traditional ICC-based confidence intervals (see details), 2) "boot" = bootstrapped confidence intervals. |
| `level` | double vector of length 1 specifying the confidence level from 0 to 1. |
| `cross.vrb` | logical vector of length 1 specifying whether the variables/items should be crossed when computing the generalizability theory coefficients. If TRUE, then only the covariance structure of the variables/items will be incorporated into the estimates of reliability. If FALSE, then the mean structure of the variables/items will be incorporated. |
| `R` | integer vector of length 1 specifying the number of bootstrapped resamples to use. Only used if `ci.type` = "boot". |
| `boot.ci.type` | character vector of length 1 specifying the type of bootstrapped confidence interval to compute. The options are 1) "perc" for the regular percentile method, 2) "bca" for the bias-corrected and accelerated percentile method, 3) "norm" for the normal method that uses the bootstrapped standard error to construct symmetrical confidence intervals with the classic formula around the bias-corrected estimate, and 4) "basic" for the basic method. Note, "stud" for the studentized method is NOT an option. See `boot.ci` for details. |
When `ci.type` = "classic", the confidence intervals are computed according to the formulas laid out by McGraw and Wong (1996). These are taken from the `ICC` function in the psych package. They are appropriately non-symmetrical given that ICCs are bounded from 0 to 1. Therefore, there is no standard error associated with the coefficient. Note, they differ from the confidence intervals available in the `cronbachs` function. When `ci.type` = "boot", the standard deviation of the empirical sampling distribution is returned as the standard error, which may or may not be trustworthy depending on the value of the ICC and sample size.
data.frame containing the generalizability theory statistical information. The columns are as follows:

- the generalizability theory coefficient itself
- the standard error of the reliability coefficient
- the lower bound of the confidence interval for the reliability coefficient
- the upper bound of the confidence interval for the reliability coefficient
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30-46. + errata on page 390.
dat0 <- psych::bfi[1:100, ] # reduce number of rows
   # to reduce computational time of boot examples
dat1 <- str2str::pick(x = dat0, val = c("A1","C4","C5","E1","E2","O2","O5",
   "gender","education","age"), not = TRUE, nm = TRUE)
vrb_nm_list <- lapply(X = str2str::sn(c("E","N","C","A","O")),
   FUN = function(nm) {
      str2str::pick(x = names(dat1), val = nm, pat = TRUE)})
gtheorys(data = dat1, vrb.nm.list = vrb_nm_list)
## Not run:
gtheorys(data = dat1, vrb.nm.list = vrb_nm_list, ci.type = "boot") # singular messages
gtheorys(data = dat1, vrb.nm.list = vrb_nm_list, ci.type = "boot",
   R = 250L, boot.ci.type = "bca")
## End(Not run)
gtheorys(data = attitude,
   vrb.nm.list = list(names(attitude))) # also works with only one set of variables/items
gtheorys_ml uses generalizability theory to compute the reliability coefficients of multiple multilevel scores. It computes within-group coefficients that assess the reliability of the group-deviated scores (e.g., after calling `centers_by`) and between-group coefficients that assess the reliability of the mean aggregate scores (e.g., after calling `aggs`). It assumes two-level data where the rows are in long format and the columns are the variables/items of the scores. Generalizability theory coefficients with multilevel data are analogous to intraclass correlations (ICC), but add an additional grouping variable. The default, `cross.obs` = TRUE, computes a multilevel version of ICC(3,k). When `cross.obs` = FALSE, a multilevel version of ICC(2,k) is computed, which takes mean differences between variables/items into account. gtheorys_ml is a wrapper function for `mlr` from the psych package. Note, this function can take several minutes to run if you have a moderate to large dataset.
gtheorys_ml(data, vrb.nm.list, grp.nm, obs.nm, cross.obs = TRUE)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm.list` | list of character vectors of colnames from `data`, where each character vector specifies the variables/items of a score. |
| `grp.nm` | character vector of length 1 with a colname from `data` specifying the grouping variable. |
| `obs.nm` | character vector of length 1 with a colname from `data` specifying the observation variable (e.g., timepoint). |
| `cross.obs` | logical vector of length 1 specifying whether the observations should be crossed when computing the generalizability theory coefficients. If TRUE, the observations are treated as fixed; if FALSE, they are treated as random. See details. |
gtheorys_ml uses `mlr`, which is based on the formulas in Shrout and Lane (2012). When `cross.obs` = TRUE, the within-group coefficient is Rc and the between-group coefficient is RkF. When `cross.obs` = FALSE, the within-group coefficient is Rcn and the between-group coefficient is RkRn.

gtheorys_ml does not currently have standard errors or confidence intervals. I am not aware of mathematical formulas for analytical confidence intervals, and because the generalizability theory coefficients can take several minutes to estimate, bootstrapped confidence intervals seem too time-intensive to be useful at the moment.
gtheorys_ml does not work with multiple single variable/item scores. You can still use generalizability theory to estimate between-group reliability in that instance though. To do so, reshape the multiple single variables/items from long to wide (e.g., `long2wide`) so that you have a column for each observation of that single variable/item and the rows are the groups. Then you can use `gtheorys` and treat each observation as a "different" variable/item.
list with two elements. The first is named "within" and refers to the within-group reliability. The second is named "between" and refers to the between-group reliability. Each contains a data.frame with the following columns:

- the generalizability theory reliability coefficient itself
- the average correlation at each level of the data based on `cor_ml` (which is a wrapper for `statsBy`)
- the number of variables/items that make up that score

The latter two columns are included because, even though the reliability coefficients are calculated from variance components, they are indirectly based on the average correlation and number of variables/items, similar to Cronbach's alpha.
Shrout, P. E., & Lane, S. P. (2012). Psychometrics. In M. R. Mehl & T. S. Conner (Eds.), Handbook of research methods for studying daily life (pp. 302-320). New York, NY: Guilford Press.
dat <- psychTools::sai[psychTools::sai$"study" == "VALE", ] # 4 timepoints
vrb_nm_list <- list("positive_affect" = c("calm","secure","at.ease","rested",
      "comfortable","confident"), # extra: "relaxed","content","joyful"
   "negative_affect" = c("tense","regretful","upset","worrying","anxious",
      "nervous")) # extra: "jittery","high.strung","worried","rattled"
suppressMessages(gtheorys_ml(data = dat, vrb.nm.list = vrb_nm_list,
   grp.nm = "id", obs.nm = "time", cross.obs = TRUE))
suppressMessages(gtheorys_ml(data = dat, vrb.nm.list = vrb_nm_list,
   grp.nm = "id", obs.nm = "time", cross.obs = FALSE))
gtheorys_ml(data = dat, vrb.nm.list = vrb_nm_list["positive_affect"],
   grp.nm = "id", obs.nm = "time") # also works with only one set of variables/items
icc_11 computes the intraclass correlation (ICC) based on a single rater with a single dimension, aka ICC(1,1). Traditionally, this is the type of ICC used for multilevel analysis, where the value is interpreted as the proportion of variance accounted for by group membership. In other words, ICC(1,1) = the proportion of between-group variance; 1 - ICC(1,1) = the proportion of within-group variance.
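To make this interpretation concrete, here is a minimal sketch (assuming the lme4 package is installed) that recovers ICC(1,1) from the variance components of a random-intercept model:

lmer_obj <- lme4::lmer(count ~ 1 + (1 | spray), data = InsectSprays, REML = TRUE)
var_comp <- as.data.frame(lme4::VarCorr(lmer_obj))
btw <- var_comp[var_comp$"grp" == "spray", "vcov"] # between-group variance
wth <- var_comp[var_comp$"grp" == "Residual", "vcov"] # within-group variance
btw / (btw + wth) # ICC(1,1) = proportion of between-group variance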
icc_11(x, grp, how = "lme", REML = TRUE)
| Argument | Description |
| --- | --- |
| `x` | numeric vector. |
| `grp` | atomic vector the same length as `x` specifying the groups. |
| `how` | character vector of length 1 specifying how the ICC(1,1) should be calculated. There are four options: 1) "lme" uses a linear mixed effects model with the function `lme` from the nlme package, 2) "lmer" uses a linear mixed effects model with the function `lmer` from the lme4 package, 3) "aov" uses a one-way analysis of variance with the function `aov`, and 4) "raw" uses the raw data variances, which provides a biased estimate of the ICC(1,1) (not recommended; only available for teaching purposes). |
| `REML` | logical vector of length 1 specifying whether restricted maximum likelihood estimation (TRUE) should be used rather than traditional maximum likelihood estimation (FALSE). Only used for linear mixed effects models if how = "lme" or how = "lmer". |
numeric vector of length 1 providing ICC(1,1), computed based on the `how` argument.
- `iccs_11` # ICC(1,1) for multiple variables
- `icc_all_by` # all six types of ICCs by group
- `lme` # how = "lme" function
- `lmer` # how = "lmer" function
- `aov` # how = "aov" function
# BALANCED DATA (how = "aov" and "lme"/"lmer" DO provide the same value)
str(InsectSprays)
icc_11(x = InsectSprays$"count", grp = InsectSprays$"spray", how = "aov")
icc_11(x = InsectSprays$"count", grp = InsectSprays$"spray", how = "lme")
icc_11(x = InsectSprays$"count", grp = InsectSprays$"spray", how = "lmer")
icc_11(x = InsectSprays$"count", grp = InsectSprays$"spray", how = "raw")
   # biased estimator and not recommended. Only available for teaching purposes.
# UN-BALANCED DATA (how = "aov" and "lme"/"lmer" do NOT provide the same value)
dat <- as.data.frame(lmeInfo::Bryant2016)
icc_11(x = dat$"outcome", grp = dat$"case", how = "aov")
icc_11(x = dat$"outcome", grp = dat$"case", how = "lme")
icc_11(x = dat$"outcome", grp = dat$"case", how = "lmer")
icc_11(x = dat$"outcome", grp = dat$"case", how = "lme", REML = FALSE)
icc_11(x = dat$"outcome", grp = dat$"case", how = "lmer", REML = FALSE)
# how = "lme" does not account for any correlation structure
lme_obj <- nlme::lme(outcome ~ 1, random = ~ 1 | case, data = dat,
   na.action = na.exclude, correlation = nlme::corAR1(form = ~ 1 | case),
   method = "REML")
var_corr <- nlme::VarCorr(lme_obj) # VarCorr.lme
vars <- as.double(var_corr[, "Variance"])
btw <- vars[1]
wth <- vars[2]
btw / (btw + wth)
icc_all_by computes each of the six intraclass correlations (ICCs) in Shrout & Fleiss (1979) by group. The ICCs differ by whether they treat dimensions as fixed or random and whether they are for a single variable in data[vrb.nm] or the set of variables data[vrb.nm]. icc_all_by also returns information about the linear mixed effects modeling (using `lmer`) used to compute the ICCs, as well as any warning or error messages by group. For an understanding of the six different ICCs, see the following blogpost: http://www.daviddisabato.com/blog/2021/10/1/the-six-different-types-of-intraclass-correlations-iccs. icc_all_by is a combination of `by2` + `try_fun` + `ICC` (`ICC` calls `lmer` internally).
icc_all_by(data, vrb.nm, grp.nm, ci.level = 0.95, check = TRUE)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables/items. |
| `grp.nm` | character vector of colnames from `data` specifying the grouping variables. |
| `ci.level` | double vector of length 1 specifying the confidence level. It must range from 0 to 1. |
| `check` | logical vector of length 1 specifying whether to check the structure of the input arguments. |
icc_all_by internally suppresses any messages, warnings, or errors returned by `lmer` (e.g., "boundary (singular) fit: see ?isSingular") because that information is provided in the returned data.frame.
data.frame containing the unique combinations of the grouping variables data[grp.nm] and each group's intraclass correlations (ICCs), their confidence intervals, information about the merMod object from the linear mixed effects model, and any warning or error messages from `lmer`. For an understanding of the six different ICCs, see the following blogpost: http://www.daviddisabato.com/blog/2021/10/1/the-six-different-types-of-intraclass-correlations-iccs. The first columns are always unique.data.frame(data[grp.nm]). All other columns are in the following order:

- ICC(1,1) parameter estimate
- ICC(1,1) lower bound of the confidence interval
- ICC(1,1) upper bound of the confidence interval
- ICC(2,1) parameter estimate
- ICC(2,1) lower bound of the confidence interval
- ICC(2,1) upper bound of the confidence interval
- ICC(3,1) parameter estimate
- ICC(3,1) lower bound of the confidence interval
- ICC(3,1) upper bound of the confidence interval
- ICC(1,k) parameter estimate
- ICC(1,k) lower bound of the confidence interval
- ICC(1,k) upper bound of the confidence interval
- ICC(2,k) parameter estimate
- ICC(2,k) lower bound of the confidence interval
- ICC(2,k) upper bound of the confidence interval
- ICC(3,k) parameter estimate
- ICC(3,k) lower bound of the confidence interval
- ICC(3,k) upper bound of the confidence interval
- number of observations used for the linear mixed effects model. Note, this is the number of (non-missing) rows after data[vrb.nm] has been stacked together via `stack`.
- number of groups used for the linear mixed effects model. This is the number of unique combinations of the grouping variables data[grp.nm].
- logLik of the linear mixed effects model
- binary variable where 1 = the linear mixed effects model had a singularity in the random effects covariance matrix or 0 = it did not
- binary variable where 1 = the linear mixed effects model returned a warning or 0 = it did not
- binary variable where 1 = the linear mixed effects model returned an error or 0 = it did not
- character vector providing the warning messages for any warnings. If a group did not generate a warning, then the value is NA
- character vector providing the error messages for any errors. If a group did not generate an error, then the value is NA
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
# one grouping variable
x <- icc_all_by(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"),
   grp.nm = "gender")
# two grouping variables
y <- icc_all_by(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"),
   grp.nm = c("gender","education"))
# with errors
z <- icc_all_by(data = psych::bfi, vrb.nm = c("A2","A3","A4","A5"),
   grp.nm = c("age")) # NA for all ICC columns when there is an error
iccs_11 computes the intraclass correlation (ICC) for multiple variables based on a single rater with a single dimension, aka ICC(1,1). Traditionally, this is the type of ICC used for multilevel analysis, where the value is interpreted as the proportion of variance accounted for by group membership. In other words, ICC(1,1) = the proportion of between-group variance; 1 - ICC(1,1) = the proportion of within-group variance.
iccs_11(data, vrb.nm, grp.nm, how = "lme", REML = FALSE)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables. |
| `grp.nm` | character vector of length 1 of a colname from `data` specifying the grouping variable. |
| `how` | character vector of length 1 specifying how the ICC(1,1) should be calculated. There are four options: 1) "lme" uses a linear mixed effects model with the function `lme` from the nlme package, 2) "lmer" uses a linear mixed effects model with the function `lmer` from the lme4 package, 3) "aov" uses a one-way analysis of variance with the function `aov`, and 4) "raw" uses the raw data variances, which provides a biased estimate of the ICC(1,1) (not recommended; only available for teaching purposes). |
| `REML` | logical vector of length 1 specifying whether restricted maximum likelihood estimation (TRUE) should be used rather than traditional maximum likelihood (FALSE). This is only applicable to linear mixed effects models when `how` = "lme" or `how` = "lmer". |
double vector containing the ICC(1,1) of the `vrb.nm` columns in `data`, with names of the return object equal to `vrb.nm`.
- `icc_11` # ICC(1,1) for a single variable
- `icc_all_by` # all six types of ICCs by group
- `lme` # how = "lme" function
- `lmer` # how = "lmer" function
- `aov` # how = "aov" function
tmp_nm <- c("outcome","case","session","trt_time")
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm]
stats_by <- psych::statsBy(dat,
   group = "case") # requires you to include "case" column in dat
iccs_11(data = dat, vrb.nm = c("outcome","session","trt_time"), grp.nm = "case")
length_by computes the length of a (atomic) vector by group. The argument `na.rm` can be used to include (FALSE) or exclude (TRUE) missing values.
length_by(x, grp, na.rm = FALSE, sep = ".")
| Argument | Description |
| --- | --- |
| `x` | atomic vector. |
| `grp` | atomic vector or list of atomic vectors (e.g., data.frame) specifying the groups. The atomic vector(s) must be the length of `x` or else an error is returned. |
| `na.rm` | logical vector of length 1 specifying whether to include (FALSE) or exclude (TRUE) missing values. |
| `sep` | character vector of length 1 specifying what string should separate different group values when naming the return object. This argument is only used if `grp` is a list of atomic vectors (e.g., data.frame). |
integer vector of length = length(levels(interaction(grp))), with names = levels(interaction(grp)), providing the number of elements (excluding missing values if `na.rm` = TRUE) in each group.
length_by(x = mtcars$"mpg", grp = mtcars$"gear")
length_by(x = airquality$"Ozone", grp = airquality$"Month", na.rm = FALSE)
length_by(x = airquality$"Ozone", grp = airquality$"Month", na.rm = TRUE)
lengths_by computes the length of multiple columns in a data.frame by group. The argument `na.rm` can be used to include (FALSE) or exclude (TRUE) missing values. Through the use of `na.rm` = TRUE, the number of observed values for each variable by each group can be computed.
lengths_by(data, vrb.nm, grp.nm, na.rm = FALSE, sep = ".")
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables. |
| `grp.nm` | character vector of colnames from `data` specifying the grouping variables. |
| `na.rm` | logical vector of length 1 specifying whether to include (FALSE) or exclude (TRUE) missing values. |
| `sep` | character vector of length 1 specifying what string should separate different group values when naming the return object. This argument is only used if the grouping is a list of atomic vectors (e.g., data.frame). |
data.frame with colnames = `vrb.nm` and rownames = levels(interaction(grp)), providing the number of elements (excluding missing values if `na.rm` = TRUE) in each column by group.
lengths_by(mtcars, vrb.nm = c("mpg","cyl","disp"), grp = "gear")
lengths_by(mtcars, vrb.nm = c("mpg","cyl","disp"),
   grp = c("gear","vs")) # can handle multiple grouping variables
lengths_by(mtcars, vrb.nm = c("mpg","cyl","disp"),
   grp = c("gear","am")) # can handle zero lengths
lengths_by(airquality, c("Ozone","Solar.R","Wind"), grp = "Month",
   na.rm = FALSE) # include missing values
lengths_by(airquality, c("Ozone","Solar.R","Wind"), grp = "Month",
   na.rm = TRUE) # exclude missing values
long2wide reshapes data from long to wide. This is often necessary with multilevel data, where variables in the long format need to be reshaped to multiple sets of variables in the wide format. If only one column needs to be reshaped, then you can use `unstack2` or `cast` - but those do not work for *multiple* columns.
long2wide(
  data,
  vrb.nm,
  grp.nm,
  obs.nm,
  sep = ".",
  colnames.by.obs = TRUE,
  keep.attr = FALSE
)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables to be reshaped. |
| `grp.nm` | character vector of colnames from `data` specifying the grouping variables. |
| `obs.nm` | character vector of length 1 with a colname from `data` specifying the observation variable (e.g., timepoint). |
| `sep` | character vector of length 1 specifying the string that separates the name prefix (e.g., score) from its number suffix (e.g., timepoint) in the reshaped colnames. |
| `colnames.by.obs` | logical vector of length 1 specifying whether to sort the return object colnames by the observation label (TRUE) or by the order of `vrb.nm` (FALSE). |
| `keep.attr` | logical vector of length 1 specifying whether to keep the "reshapeWide" attribute (from `reshape`) in the return object. |
long2wide uses reshape(direction = "wide") to reshape the data. It attempts to streamline the task of reshaping long to wide, as the `reshape` arguments can be confusing because the same arguments are used for wide vs. long reshaping. See `reshape` if you are curious.
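For comparison, here is a minimal sketch of the underlying base-R call that long2wide streamlines (argument names from stats::reshape; ChickWeight chosen as a hypothetical example, matching the examples below):

dat_long <- as.data.frame(ChickWeight)
w_base <- reshape(dat_long[c("weight", "Chick", "Time")],
   direction = "wide", # wide vs. long reshaping
   idvar = "Chick",    # roughly grp.nm: the groups that become rows
   timevar = "Time",   # roughly obs.nm: the observation label
   v.names = "weight", # roughly vrb.nm: the column(s) to reshape
   sep = ".")          # sep: separates the name prefix from the observation suffix
head(w_base) # colnames follow paste0(v.names, sep, unique(dat_long$"Time"))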
data.frame with nrow equal to nrow(unique(data[grp.nm])) and number of reshaped columns equal to length(vrb.nm) * length(unique(data[[obs.nm]])). The colnames will have the structure paste0(vrb.nm, sep, unique(data[[obs.nm]])). The reshaped colnames are sorted by the observation labels if `colnames.by.obs` = TRUE and sorted by `vrb.nm` if `colnames.by.obs` = FALSE. Overall, the columns are in the following order: 1) `grp.nm` of the groups, 2) reshaped columns, 3) additional columns that were not reshaped.
# SINGLE GROUPING VARIABLE
dat_long <- as.data.frame(ChickWeight) # b/c groupedData class does weird things...
w1 <- long2wide(data = dat_long, vrb.nm = "weight", grp.nm = "Chick",
   obs.nm = "Time") # NAs inserted for missing observations in some groups
w2 <- long2wide(data = dat_long, vrb.nm = "weight", grp.nm = "Chick",
   obs.nm = "Time", sep = "_")
head(w1); head(w2)
w3 <- long2wide(data = dat_long, vrb.nm = "weight", grp.nm = "Chick",
   obs.nm = "Time", sep = "_T", keep.attr = TRUE)
attributes(w3)
# MULTIPLE GROUPING VARIABLES
tmp <- psychTools::sai
grps <- interaction(tmp[1:3], drop = TRUE)
dups <- duplicated(grps)
dat_long <- tmp[!(dups), ] # for some reason there are duplicate groups in the data
vrb_nm <- str2str::pick(names(dat_long), val = c("study","time","id"), not = TRUE)
w4 <- long2wide(data = dat_long, vrb.nm = vrb_nm, grp.nm = c("study","id"),
   obs.nm = "time")
w5 <- long2wide(data = dat_long, vrb.nm = vrb_nm, grp.nm = c("study","id"),
   obs.nm = "time", colnames.by.obs = FALSE) # colnames sorted by `vrb.nm` instead
head(w4); head(w5)
make.dummy creates dummy columns (i.e., dichotomous numeric vectors coded 0 and 1) from logical conditions. If you want to make logical conditions from columns of a data.frame, you will need to call the data.frame and its columns explicitly, as this function does not use non-standard evaluation.
make.dummy(..., rtn.lgl = FALSE)
| Argument | Description |
| --- | --- |
| `...` | logical conditions that evaluate to logical vectors of the same length. If the logical vectors are not the same length, an error is returned. The names of the arguments are the colnames in the return object. If unnamed, then default R data.frame naming is used, which can get ugly. |
| `rtn.lgl` | logical vector of length 1 specifying whether the dummy columns should be logical vectors (TRUE) rather than numeric vectors (FALSE). |
data.frame of dummy columns based on the logical conditions in `...`. If `rtn.lgl` = TRUE, then the columns are logical vectors. If `rtn.lgl` = FALSE, then the columns are numeric vectors where 0 = FALSE and 1 = TRUE. The colnames are the names of the arguments in `...`. If not specified, then default data.frame names are created from the logical conditions themselves (which can get ugly).
make.dummy(attitude$"rating" > 50) # ugly colnames
make.dummy("rating_50plus" = attitude$"rating" > 50,
   "advance_50minus" = attitude$"advance" < 50)
make.dummy("rating_50plus" = attitude$"rating" > 50,
   "advance_50minus" = attitude$"advance" < 50, rtn.lgl = TRUE)
## Not run:
make.dummy("rating_50plus" = attitude$"rating" > 50,
   "mpg_20plus" = mtcars$"mpg" > 20)
## End(Not run)
make.dumNA makes dummy columns (i.e., dichotomous numeric vectors coded 0 and 1) for missing data. Each variable is treated in isolation.
make.dumNA(data, vrb.nm, ov = FALSE, rtn.lgl = FALSE, suffix = "_m")
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `vrb.nm` | character vector of colnames from `data` specifying the variables to make missing data dummy columns for. |
| `ov` | logical vector of length 1 specifying whether the dummy columns should be reverse coded such that missing values = 0/FALSE and observed values = 1/TRUE. |
| `rtn.lgl` | logical vector of length 1 specifying whether the dummy columns should be logical vectors (TRUE) rather than numeric vectors (FALSE). |
| `suffix` | character vector of length 1 specifying the string that should be appended to the end of the colnames in the return object. |
data.frame of numeric (logical if `rtn.lgl` = TRUE) columns where missing = 1 and observed = 0 (flipped if `ov` = TRUE) for each variable. The colnames are created by paste0(vrb.nm, suffix).
make.dumNA(data = airquality, vrb.nm = c("Ozone","Solar.R"))
make.dumNA(data = airquality, vrb.nm = c("Ozone","Solar.R"),
   rtn.lgl = TRUE) # logical vectors returned
make.dumNA(data = airquality, vrb.nm = c("Ozone","Solar.R"),
   ov = TRUE, suffix = "_o") # 1 = observed value
make.fun_if makes a function that evaluates conditional on a specified minimum frequency of observed values. Within the returned function, if the frequency of observed values is less than (or equal to) `ov.min`, then `false` is returned rather than the result of `fun`.
make.fun_if(
  fun,
  ...,
  ov.min.default = 1,
  prop.default = TRUE,
  inclusive.default = TRUE,
  false = NA
)
| Argument | Description |
| --- | --- |
| `fun` | function that takes an atomic vector as its first argument. The first argument does not have to be named "x" within `fun`. |
| `...` | additional arguments with parameters to pass to `fun`. |
| `ov.min.default` | numeric vector of length 1 specifying what the default should be for the argument `ov.min` in the returned function. |
| `prop.default` | logical vector of length 1 specifying what the default should be for the argument `prop` in the returned function. |
| `inclusive.default` | logical vector of length 1 specifying what the default should be for the argument `inclusive` in the returned function. |
| `false` | vector of length 1 specifying what should be returned if the observed values condition is not met within the returned function. The default is NA. Whatever the value is, it will be coerced to the same mode as the result of `fun`. |
function that takes an atomic vector `x` as its first argument and `...` as other arguments, ending with `ov.min`, `prop`, and `inclusive` as final arguments with defaults specified by `ov.min.default`, `prop.default`, and `inclusive.default`, respectively.
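For instance, a minimal sketch of the returned function's behavior (a hypothetical mean_if wrapper, with values chosen so the observed proportion is exactly .75):

mean_if <- make.fun_if(fun = mean, na.rm = TRUE) # always have na.rm = TRUE
x <- c(1, 2, NA, 4) # 3 of 4 values observed (.75)
mean_if(x, ov.min = .75) # condition met (inclusive = TRUE by default): returns the mean
mean_if(x, ov.min = .80) # condition not met: returns the `false` value (NA by default)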
# SD
sd_if <- make.fun_if(fun = sd, na.rm = TRUE) # always have na.rm = TRUE
sd_if(x = airquality[[1]], ov.min = .75) # proportion of observed values
sd_if(x = airquality[[1]], ov.min = 116, prop = FALSE) # count of observed values
sd_if(x = airquality[[1]], ov.min = 116, prop = FALSE,
   inclusive = FALSE) # does not include the ov.min value itself
# skewness
skew_if <- make.fun_if(fun = psych::skew, type = 1) # always have type = 1
skew_if(x = airquality[[1]], ov.min = .75) # proportion of observed values
skew_if(x = airquality[[1]], ov.min = 116, prop = FALSE) # count of observed values
skew_if(x = airquality[[1]], ov.min = 116, prop = FALSE,
   inclusive = FALSE) # does not include the ov.min value itself
# mode
popular <- function(x) names(sort(table(x), decreasing = TRUE))[1]
popular_if <- make.fun_if(fun = popular) # works with character vectors too
popular_if(x = c(unlist(dimnames(HairEyeColor)),
   rep.int(x = NA, times = 10)), ov.min = .50)
popular_if(x = c(unlist(dimnames(HairEyeColor)),
   rep.int(x = NA, times = 10)), ov.min = .60)
make.latent makes the model syntax for a latent factor in `lavaan`. The return object can be used as part of the model syntax for calls to `lavaan`, `sem`, `cfa`, etc.
make.latent(
  x,
  nm.latent = "latent",
  error.var = FALSE,
  nm.par = FALSE,
  suffix.load = "_l",
  suffix.error = "_e"
)
| Argument | Description |
| --- | --- |
| `x` | character vector specifying the colnames in your data that correspond to the variables indicating the latent factor (e.g., questionnaire items). |
| `nm.latent` | character vector of length 1 specifying what the latent factor should be labeled as in the return object. |
| `error.var` | logical vector of length 1 specifying whether the model syntax for the error variances should be included in the return object. |
| `nm.par` | logical vector of length 1 specifying whether the model syntax should include names for the factor loading (and error variance) parameters. |
| `suffix.load` | character vector of length 1 specifying what string should be appended to the end of the elements of `x` when naming the factor loading parameters. Only used if `nm.par` = TRUE. |
| `suffix.error` | character vector of length 1 specifying what string should be appended to the end of the elements of `x` when naming the error variance parameters. Only used if `error.var` = TRUE and `nm.par` = TRUE. |
character vector of length 1 providing the model syntax. The newline character "\n" is used to delineate new lines within the model syntax.
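For example, to render those line breaks when inspecting the syntax, the return object can be passed to cat():

cat(make.latent(x = names(psych::bfi)[1:5]))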
make.latent(x = names(psych::bfi)[1:5], error.var = FALSE, nm.par = FALSE)
make.latent(x = names(psych::bfi)[1:5], error.var = FALSE, nm.par = TRUE)
make.latent(x = names(psych::bfi)[1:5], error.var = TRUE, nm.par = FALSE)
make.latent(x = names(psych::bfi)[1:5], error.var = TRUE, nm.par = TRUE)
make.product creates product terms (i.e., interactions) from various components. make.product uses `Center` to optionally center and/or scale the predictors and/or moderators before making the product terms.
make.product(
  data,
  x.nm,
  m.nm,
  center.x = FALSE,
  center.m = FALSE,
  scale.x = FALSE,
  scale.m = FALSE,
  suffix.x = "",
  suffix.m = "",
  sep = ":",
  combo = TRUE
)
| Argument | Description |
| --- | --- |
| `data` | data.frame of data. |
| `x.nm` | character vector of colnames from `data` specifying the predictor columns. |
| `m.nm` | character vector of colnames from `data` specifying the moderator columns. |
| `center.x` | logical vector of length 1 specifying whether the predictor columns should be grand-mean centered before making the product terms. |
| `center.m` | logical vector of length 1 specifying whether the moderator columns should be grand-mean centered before making the product terms. |
| `scale.x` | logical vector of length 1 specifying whether the predictor columns should be grand-SD scaled before making the product terms. |
| `scale.m` | logical vector of length 1 specifying whether the moderator columns should be grand-SD scaled before making the product terms. |
| `suffix.x` | character vector of length 1 specifying any suffix to add to the end of the predictor colnames `x.nm` in the colnames of the return object. |
| `suffix.m` | character vector of length 1 specifying any suffix to add to the end of the moderator colnames `m.nm` in the colnames of the return object. |
| `sep` | character vector of length 1 specifying the string to connect `x.nm` and `m.nm` in the colnames of the return object. |
| `combo` | logical vector of length 1 specifying whether all combinations of the predictors and moderators should be calculated or only those in parallel to each other (i.e., the first predictor with the first moderator, the second predictor with the second moderator, etc.). |
data.frame with product terms (i.e., interactions) as columns. The colnames are created by paste(paste0(x.nm, suffix.x), paste0(m.nm, suffix.m), sep = sep).
make.product(data = attitude, x.nm = c("complaints","privileges"),
   m.nm = "learning", center.x = TRUE, center.m = TRUE,
   suffix.x = "_c", suffix.m = "_c") # with grand-mean centering
make.product(data = attitude, x.nm = c("complaints","privileges"),
   m.nm = c("learning","raises"), combo = TRUE) # all possible combinations
make.product(data = attitude, x.nm = c("complaints","privileges"),
   m.nm = c("learning","raises"), combo = FALSE) # only combinations "in parallel"
mean_change tests for mean change across two timepoints with a dependent two-samples t-test. The function also calculates the descriptive statistics for the timepoints and the standardized mean difference (i.e., Cohen's d) based on either the standard deviation of the pre-timepoint, the pooled standard deviation of the pre-timepoint and post-timepoint, or the standard deviation of the change score (post - pre). mean_change is simply a wrapper for `t.test` plus some extra calculations.
mean_change(
  pre,
  post,
  standardizer = "pre",
  d.ci.type = "unbiased",
  ci.level = 0.95,
  check = TRUE
)
| Argument | Description |
| --- | --- |
| `pre` | numeric vector of the variable at the pre-timepoint. |
| `post` | numeric vector of the variable at the post-timepoint. The elements must correspond to the same cases in `pre`. |
| `standardizer` | character vector of length 1 specifying what to use for standardization when computing the standardized mean difference (i.e., Cohen's d). There are three options: 1. "pre" for the standard deviation of the pre-timepoint, 2. "pooled" for the pooled standard deviation of the pre-timepoint and post-timepoint, 3. "change" for the standard deviation of the change score (post - pre). The default is "pre", which I believe makes the most theoretical sense (see Cumming, 2012); however, "change" is the traditional choice originally proposed by Jacob Cohen (Cohen, 1988). |
| `d.ci.type` | character vector of length 1 specifying how to compute the confidence interval (and standard error) of the standardized mean difference. The options include 1. "unbiased", which calculates the unbiased standard error of Cohen's d based on the formulas in Viechtbauer (2007) and constructs a symmetrical confidence interval from it, and 2. "classic", which constructs a symmetrical confidence interval from the traditional standard error of Cohen's d. |
| `ci.level` | double vector of length 1 specifying the confidence level. It must range from 0 to 1. |
| `check` | logical vector of length 1 specifying whether the input arguments should be checked for errors (e.g., whether `pre` and `post` are the same length). |
mean_change calculates the mean change as `post` - `pre` such that increases over time have a positive mean change estimate and decreases over time have a negative mean change estimate. This would be as if the post-timepoint was `x` and the pre-timepoint was `y` in t.test(paired = TRUE).
list of numeric vectors containing statistical information about the mean change: 1) nhst = dependent two-samples t-test stat info in a numeric vector, 2) desc = descriptive statistics stat info in a numeric vector, 3) std = standardized mean difference stat info in a numeric vector.

1) nhst = dependent two-samples t-test stat info in a numeric vector:

- mean change estimate (i.e., post - pre)
- standard error
- t-value
- degrees of freedom
- two-sided p-value
- lower bound of the confidence interval
- upper bound of the confidence interval

2) desc = descriptive statistics stat info in a numeric vector:

- mean of the post variable
- mean of the pre variable
- standard deviation of the post variable
- standard deviation of the pre variable
- sample size of the change score
- Pearson correlation between the pre and post variables

3) std = standardized mean difference stat info in a numeric vector:

- Cohen's d estimate
- Cohen's d standard error
- Cohen's d lower bound of the confidence interval
- Cohen's d upper bound of the confidence interval
Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd ed. Hillsdale, NJ: Erlbaum.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
Viechtbauer, W. (2007). Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. Journal of Educational and Behavioral Statistics, 32(1), 39-60.
- `means_change` # for multiple sets of prepost pairs of variables
- `t.test` # the workhorse for mean_change
- `mean_diff` # for an independent two-samples t-test
- `mean_test` # for a one-sample t-test
# dependent two-samples t-test
mean_change(pre = mtcars$"disp", post = mtcars$"hp") # standardizer = "pre"
mean_change(pre = mtcars$"disp", post = mtcars$"hp", d.ci.type = "classic")
mean_change(pre = mtcars$"disp", post = mtcars$"hp", standardizer = "pooled")
mean_change(pre = mtcars$"disp", post = mtcars$"hp", ci.level = 0.99)
mean_change(pre = mtcars$"hp", post = mtcars$"disp", ci.level = 0.99)
# note, when flipping pre and post, the cohen's d estimate
# changes with standardizer = "pre" because the "pre" variable is different.
# This does not happen for standardizer = "pooled" or "change". For example...
mean_change(pre = mtcars$"disp", post = mtcars$"hp", standardizer = "pooled")
mean_change(pre = mtcars$"hp", post = mtcars$"disp", standardizer = "pooled")
mean_change(pre = mtcars$"disp", post = mtcars$"hp", standardizer = "change")
mean_change(pre = mtcars$"hp", post = mtcars$"disp", standardizer = "change")
# same as intercept-only regression with the change score
mean_change(pre = mtcars$"disp", post = mtcars$"hp")
lm_obj <- lm(hp - disp ~ 1, data = mtcars)
coef(summary(lm_obj))
mean_compare compares means across 3+ independent groups with a one-way ANOVA. The function also calculates the descriptive statistics for each group and the variance explained (i.e., R^2 aka eta^2) by the nominal grouping variable. mean_compare is simply a wrapper for `oneway.test` plus some extra calculations. mean_compare will work with 2 independent groups; however, it arguably makes more sense to use `mean_diff` in that case.
mean_compare(
  x,
  nom,
  lvl = levels(as.factor(nom)),
  var.equal = TRUE,
  r2.ci.type = "Fdist",
  ci.level = 0.95,
  rtn.table = TRUE,
  check = TRUE
)
| Argument | Description |
| --- | --- |
| `x` | numeric vector. |
| `nom` | atomic vector (e.g., factor) the same length as `x` specifying the 3+ independent groups. |
| `lvl` | character vector with length 3+ specifying the unique values for the 3+ groups. If unspecified, the default is levels(as.factor(nom)). |
| `var.equal` | logical vector of length 1 specifying whether the variances of the groups are assumed to be equal (TRUE) or not (FALSE). If TRUE, a traditional one-way ANOVA is computed; if FALSE, Welch's ANOVA is computed. These two tests differ by their denominator degrees of freedom, F-value, and p-value. |
| `r2.ci.type` | character vector with length 1 specifying the type of confidence intervals to compute for the variance explained (i.e., R^2 aka eta^2). There are currently two options: 1) "Fdist", which calculates a non-symmetrical confidence interval based on the non-central F distribution (pg. 38, Smithson, 2003), and 2) "classic", which calculates the confidence interval based on a large-sample theory standard error (eq. 3.6.3 in Cohen, Cohen, West, & Aiken, 2003), which is taken from Olkin & Finn (1995) - just above eq. 10. The confidence intervals for R^2-adjusted use the same formula as R^2, but replace R^2 with R^2-adjusted. Technically, the R^2-adjusted confidence intervals can have poor coverage (pg. 54, Smithson, 2003). |
| `ci.level` | numeric vector of length 1 specifying the confidence level. It must range from 0 to 1. |
| `rtn.table` | logical vector of length 1 specifying whether the traditional ANOVA table should be returned as the last element of the return object. |
| `check` | logical vector of length 1 specifying whether the input arguments should be checked for errors (e.g., whether `x` and `nom` are the same length). |
list of numeric vectors containing statistical information about the mean comparison: 1) nhst = one-way ANOVA stat info in a numeric vector, 2) desc = descriptive statistics stat info in a numeric vector, 3) std = standardized effect sizes stat info in a numeric vector, 4) anova = traditional ANOVA table in a numeric matrix (only returned if rtn.table = TRUE).

1) nhst = one-way ANOVA stat info in a numeric vector:

- average mean difference across group pairs
- NA to remind the user there is no standard error for the average mean difference
- F-value
- numerator degrees of freedom
- denominator degrees of freedom
- two-sided p-value

2) desc = descriptive statistics stat info in a numeric vector (note there could be more than 3 groups - groups i, j, and k are just provided as an example):

- mean of group k
- mean of group j
- mean of group i
- standard deviation of group k
- standard deviation of group j
- standard deviation of group i
- sample size of group k
- sample size of group j
- sample size of group i

3) std = standardized effect sizes stat info in a numeric vector:

- R^2 estimate
- R^2 standard error (only available if r2.ci.type = "classic")
- R^2 lower bound of the confidence interval
- R^2 upper bound of the confidence interval
- R^2-adjusted estimate
- R^2-adjusted standard error (only available if r2.ci.type = "classic")
- R^2-adjusted lower bound of the confidence interval
- R^2-adjusted upper bound of the confidence interval

4) anova = traditional ANOVA table in a numeric matrix (only returned if rtn.table = TRUE). The dimlabels of the matrix are "effect" for the rows and "info" for the columns. There are two rows with rownames 1. "nom" and 2. "Residuals", where "nom" refers to the between-group effect of the nominal variable and "Residuals" refers to the within-group residual errors. There are 5 columns with colnames 1. "SS" = sum of squares, 2. "df" = degrees of freedom, 3. "MS" = mean squares, 4. "F" = F-value, and 5. "p" = p-value. Note, the F-value and p-value will differ from the "nhst" returned vector if var.equal = FALSE because the traditional ANOVA table always assumes variances are equal (i.e., var.equal = TRUE).
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). New York, NY: Routledge.
Olkin, I., & Finn, J. D. (1995). Correlations redux. Psychological Bulletin, 118(1), 155-164.
Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage Publications.
- `oneway.test` # the workhorse for mean_compare
- `means_compare` # for multiple variables across the same 3+ groups
- `ci.R2` # for confidence intervals of the variance explained
- `mean_diff` # for a single variable across only 2 groups
mean_compare(x = mtcars$"mpg", nom = mtcars$"gear")
mean_compare(x = mtcars$"mpg", nom = mtcars$"gear", var.equal = FALSE)
mean_compare(x = mtcars$"mpg", nom = mtcars$"gear", rtn.table = FALSE)
mean_compare(x = mtcars$"mpg", nom = mtcars$"gear", r2.ci.type = "classic")
mean_diff tests for mean differences across two independent groups with an independent two-samples t-test. The function also calculates the descriptive statistics for each group and the standardized mean difference (i.e., Cohen's d) based on the pooled standard deviation. mean_diff is simply a wrapper for `t.test` plus some extra calculations.
mean_diff(x, bin, lvl = levels(as.factor(bin)), var.equal = TRUE, d.ci.type = "unbiased", ci.level = 0.95, check = TRUE)
x |
numeric vector. |
bin |
atomic vector (e.g., factor) the same length as |
lvl |
character vector with length 2 specifying the unique values for
the two groups. If |
var.equal |
logical vector of length 1 specifying whether the variances of the groups are assumed to be equal (TRUE) or not (FALSE). If TRUE, a traditional independent two-samples t-test is computed; if FALSE, Welch's t-test is computed. These two tests differ by their degrees of freedom and p-values. |
d.ci.type |
character vector with length 1 specifying the type of
confidence intervals to compute for the standardized mean difference (i.e.,
Cohen's d). There are currently three options: 1) "unbiased" which
calculates the unbiased standard error of Cohen's d based on formula 25 in
Viechtbauer (2007). A symmetrical confidence interval is then calculated
based on the standard error. 2) "tdist" which calculates the confidence
intervals based on the t-distribution using the function
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
mean_diff
calculates the mean difference as x[bin == lvl[2] ]
-
x[bin == lvl[1] ]
such that it is group 2 - group 1. Group 1 corresponds
to the first factor level of bin
(after being coerced to a factor).
Group 2 corresponds to the second factor level of bin
(after being coerced
to a factor). This was set up to handle dummy coded treatment variables in a
desirable way. For example, if bin
is a numeric vector with values
0
and 1
, the default factor coercion will have the first factor
level be "0" and the second factor level be "1". The result will then
correspond to 1 - 0. However, if the first factor level of bin
is
"treatment" and the second factor level is "control", the result will
correspond to control - treatment. If the opposite is desired (e.g.,
treatment - control), this can be reversed within the function by specifying
the lvl
argument as c("control","treatment")
. Note,
mean_diff
diverts from t.test
by calculating the mean
difference as group 2 - group 1 (as opposed to the group 1 - group 2 that
t.test
does). However, group 2 - group 1 is the convention that
psych::cohen.d
uses as well.
mean_diff
calculates the pooled standard deviation in a different way
than cohen.d
. Therefore, the Cohen's d estimates (and
confidence intervals if d.ci.type == "tdist") differ from those in
cohen.d
. mean_diff
uses the total degrees of
freedom in the denominator while cohen.d
uses the total
sample size in the denominator - based on the notation in McGrath & Meyer
(2006). However, almost every introduction to statistics textbook uses the
total degrees of freedom in the denominator and that is what makes more sense
to me. See examples.
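As a concrete illustration of the conventions described above, here is a
minimal base R sketch (not the package's source code) of the group 2 - group 1
mean difference and a pooled-SD Cohen's d with the total degrees of freedom
(n1 + n2 - 2) in the denominator:

x <- mtcars$"mpg"; bin <- mtcars$"vs"
g1 <- x[bin == levels(as.factor(bin))[1]] # group 1 = first factor level ("0")
g2 <- x[bin == levels(as.factor(bin))[2]] # group 2 = second factor level ("1")
mean(g2) - mean(g1) # mean difference estimate (group 2 - group 1)
sd_pooled <- sqrt((sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2)) /
   (length(g1) + length(g2) - 2)) # total degrees of freedom in the denominator
(mean(g2) - mean(g1)) / sd_pooled # Cohen's d estimate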
list of numeric vectors containing statistical information about the mean difference: 1) nhst = independent two-samples t-test stat info in a numeric vector, 2) desc = descriptive statistics stat info in a numeric vector, 3) std = standardized mean difference stat info in a numeric vector
1) nhst = independent two-samples t-test stat info in a numeric vector
mean difference estimate (i.e., group 2 - group 1)
standard error
t-value
degrees of freedom
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a numeric vector
mean of group 2
mean of group 1
standard deviation of group 2
standard deviation of group 1
sample size of group 2
sample size of group 1
3) std = standardized mean difference stat info in a numeric vector
Cohen's d estimate
Cohen's d standard error
Cohen's d lower bound of the confidence interval
Cohen's d upper bound of the confidence interval
McGrath, R. E., & Meyer, G. J. (2006). When effect sizes disagree: the case of r and d. Psychological Methods, 11(4), 386-401.
Viechtbauer, W. (2007). Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. Journal of Educational and Behavioral Statistics, 32(1), 39-60.
t.test
the workhorse for mean_diff,
means_diff
for multiple variables across the same two groups,
cohen.d
for another standardized mean difference function,
mean_change
for dependent two-sample t-test,
mean_test
for one-sample t-test,
# independent two-samples t-test
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs")
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs", lvl = c("1","0"))
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs", lvl = c(1, 0)) # levels don't have to be character
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs", d.ci.type = "classic")

# compare to psych::cohen.d()
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs", d.ci.type = "tdist")
tmp_nm <- c("mpg","vs") # because otherwise Roxygen2 gets upset
cohend_obj <- psych::cohen.d(mtcars[tmp_nm], group = "vs")
as.data.frame(cohend_obj[["cohen.d"]]) # different estimate of cohen's d
# of course, this also leads to different confidence interval bounds as well

# same as intercept-only regression when var.equal = TRUE
mean_diff(x = mtcars$"mpg", bin = mtcars$"vs", d.ci.type = "tdist")
lm_obj <- lm(mpg ~ vs, data = mtcars)
coef(summary(lm_obj))

# errors
## Not run:
mean_diff(x = mtcars$"mpg", bin = attitude$"ratings") # `bin` has length different than `x`
mean_diff(x = mtcars$"mpg", bin = mtcars$"gear") # `bin` has more than two unique values (other than missing values)
## End(Not run)
mean_if
calculates the mean of a numeric or logical vector conditional
on a specified minimum frequency of observed values. If the frequency of
observed values is less than (or equal to) ov.min
, then NA
is
returned rather than the mean.
mean_if(x, trim = 0, ov.min = 1, prop = TRUE, inclusive = TRUE)
x |
numeric or logical vector. |
trim |
numeric vector of length 1 specifying the proportion of values
from each end of |
ov.min |
minimum frequency of observed values required. If |
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the mean
should be calculated if the frequency of observed values is exactly equal
to |
numeric vector of length 1 providing the mean of x
or
NA
conditional on whether the frequency of observed data is greater than
(or equal to) ov.min
.
mean.default
sum_if
make.fun_if
mean_if(x = airquality[[1]], ov.min = .75) # proportion of observed values
mean_if(x = airquality[[1]], ov.min = 116, prop = FALSE) # count of observed values
mean_if(x = airquality[[1]], ov.min = 116, prop = FALSE, inclusive = FALSE) # don't include the ov.min value itself
mean_if(x = c(TRUE, NA, FALSE, NA), ov.min = .50) # works with logical vectors as well as numeric
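For intuition, the conditional logic documented above amounts to something
like the following hedged base R sketch (not mean_if's source):

x <- airquality[[1]]
ov <- sum(!is.na(x)) # frequency of observed values
if (ov / length(x) >= .75) mean(x, na.rm = TRUE) else NA # prop = TRUE, inclusive = TRUE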
mean_test
computes the sample mean and compares it against a specified
population mu
value. This is sometimes referred to as a one-sample
t-test. It provides the same results as t.test
, but
provides the confidence interval for the mean difference from mu rather than
the mean itself. The function also calculates the descriptive statistics and
the standardized mean difference (i.e., Cohen's d) based on the sample
standard deviation.
mean_test(x, mu = 0, d.ci.type = "tdist", ci.level = 0.95, check = TRUE)
x |
numeric vector. |
mu |
numeric vector of length 1 specifying the population mean value to compare the sample mean against. |
d.ci.type |
character vector with length 1 specifying the type of
confidence interval to compute for the standardized mean difference (i.e.,
Cohen's d). There are currently two options: 1. "tdist" which calculates
the confidence intervals based on the t-distribution using the function
|
ci.level |
numeric vector of length 1 specifying the confidence level. It must be between 0 and 1. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, checking whether
|
list of numeric vectors containing statistical information about the sample mean: 1) nhst = one-sample t-test stat info in a numeric vector, 2) desc = descriptive statistics stat info in a numeric vector, 3) std = standardized mean difference stat info in a numeric vector
1) nhst = one-sample t-test stat info in a numeric vector
mean - mu estimate
standard error
t-value
degrees of freedom
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a numeric vector
mean of x
population value of comparison
standard deviation of x
sample size of x
3) std = standardized mean difference stat info in a numeric vector
Cohen's d estimate
Cohen's d standard error
Cohen's d lower bound of the confidence interval
Cohen's d upper bound of the confidence interval
means_test
one-sample t-tests for multiple variables,
t.test
same results,
mean_diff
independent two-sample t-test,
mean_change
dependent two-sample t-test,
# one-sample t-test
mean_test(x = mtcars$"mpg")
mean_test(x = attitude$"rating", mu = 50)
mean_test(x = attitude$"rating", mu = 50, d.ci.type = "classic")

# compare to t.test()
mean_test(x = attitude$"rating", mu = 50, ci.level = .99)
t.test(attitude$"rating", mu = 50, conf.level = .99)

# same as intercept-only regression when mu = 0
mean_test(x = mtcars$"mpg")
lm_obj <- lm(mpg ~ 1, data = mtcars)
coef(summary(lm_obj))
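The documented difference from t.test can be checked directly: because the
confidence interval is for the mean difference from mu rather than the mean
itself, the t.test interval shifted by mu should match (a hedged
illustration, not part of the package):

tt <- t.test(attitude$"rating", mu = 50)
tt$conf.int - 50 # should match the "lwr"/"upr" bounds from mean_test()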
means_change
tests for mean changes across two timepoints for multiple
prepost pairs of variables via dependent two-samples t-tests. The function
also calculates the descriptive statistics for the timepoints and the
standardized mean differences (i.e., Cohen's d) based on either the standard
deviation of the pre-timepoint, pooled standard deviation of the
pre-timepoint and post-timepoint, or the standard deviation of the change
score (post - pre). means_change
is simply a wrapper for
t.test
plus some extra calculations.
means_change(data, prepost.nm.list, standardizer = "pre", d.ci.type = "unbiased", ci.level = 0.95, check = TRUE)
data |
data.frame of data. |
prepost.nm.list |
list of length-2 character vectors specifying the
colnames from |
standardizer |
character vector of length 1 specifying what to use for standardization when computing the standardized mean difference (i.e., Cohen's d). There are three options: 1. "pre" for the standard deviation of the pre-timepoint, 2. "pooled" for the pooled standard deviation of the pre-timepoint and post-timepoint, 3. "change" for the standard deviation of the change score (post - pre). The default is "pre", which I believe makes the most theoretical sense (see Cumming, 2012); however, "change" is the traditional choice originally proposed by Jacob Cohen (Cohen, 1988). |
d.ci.type |
character vector of length 1 specifying how to compute the
confidence intervals (and standard errors) of the standardized mean
differences. There are currently two options: 1. "unbiased" which
calculates the unbiased standard error of Cohen's d based on the formulas
in Viechtbauer (2007). If |
ci.level |
double vector of length 1 specifying the confidence level.
|
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, checking whether
|
For each prepost pair of variables, means_change
calculates the mean
change as data[[ prepost.nm.list[[i]][2] ]]
- data[[
prepost.nm.list[[i]][1] ]]
(which corresponds to post - pre) such that
increases over time have a positive mean change estimate and decreases over
time have a negative mean change estimate. This would be as if the
post-timepoint was x
and the pre-timepoint y
in
t.test(paired = TRUE)
.
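A minimal sketch of that direction convention, treating two mtcars columns
as stand-in pre/post variables (an illustration only, not the package's
internals):

pre <- mtcars$"disp"; post <- mtcars$"hp"
mean(post - pre) # mean change estimate (post - pre); negative = decrease over time
t.test(x = post, y = pre, paired = TRUE) # same direction: x - y = post - pre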
list of data.frames containing statistical information about the mean
change for each prepost pair of variables (the rownames of the data.frames
are the names of prepost.nm.list
): 1) nhst = dependent two-samples
t-test stat info in a data.frame, 2) desc = descriptive statistics stat info
in a data.frame, 3) std = standardized mean difference stat info in a data.frame.
1) nhst = dependent two-samples t-test stat info in a data.frame
mean change estimate (i.e., post - pre)
standard error
t-value
degrees of freedom
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a data.frame
mean of the post variable
mean of the pre variable
standard deviation of the post variable
standard deviation of the pre variable
sample size of the change score
Pearson correlation between the pre and post variables
3) std = standardized mean difference stat info in a data.frame
Cohen's d estimate
Cohen's d standard error
Cohen's d lower bound of the confidence interval
Cohen's d upper bound of the confidence interval
Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd ed. Hillsdale, NJ: Erlbaum.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
Viechtbauer, W. (2007). Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. Journal of Educational and Behavioral Statistics, 32(1), 39-60.
mean_change
for a single pair of prepost variables,
t.test
the workhorse for means_change,
means_diff
for multiple independent two-sample t-tests,
means_test
for multiple one-sample t-tests,
# dependent two-sample t-tests
prepost_nm_list <- list("first_pair" = c("disp","hp"), "second_pair" = c("carb","gear"))
means_change(mtcars, prepost.nm.list = prepost_nm_list)
means_change(mtcars, prepost.nm.list = prepost_nm_list, d.ci.type = "classic")
means_change(mtcars, prepost.nm.list = prepost_nm_list, standardizer = "change")
means_change(mtcars, prepost.nm.list = prepost_nm_list, ci.level = 0.99)

# same as intercept-only regression with the change score
means_change(data = mtcars, prepost.nm.list = c("disp","hp"))
lm_obj <- lm(hp - disp ~ 1, data = mtcars)
coef(summary(lm_obj))
means_compare
compares means across 3+ independent groups with a
separate one-way ANOVA for each variable. The function also calculates the
descriptive statistics for each group and the variance explained (i.e., R^2 -
aka eta^2) by the nominal grouping variable. means_compare
is simply a
wrapper for oneway.test
plus some extra calculations.
means_compare
will work with 2 independent groups; however, it arguably
makes more sense to use means_diff
in that case.
means_compare(data, vrb.nm, nom.nm, lvl = levels(as.factor(data[[nom.nm]])), var.equal = TRUE, r2.ci.type = "classic", ci.level = 0.95, rtn.table = TRUE, check = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
nom.nm |
character vector of length 1 with colnames from |
lvl |
character vector with length 3+ specifying the unique values for
the 3+ groups. If |
var.equal |
logical vector of length 1 specifying whether the variances of the groups are assumed to be equal (TRUE) or not (FALSE). If TRUE, a traditional one-way ANOVA is computed; if FALSE, Welch's ANOVA is computed. These two tests differ by their denominator degrees of freedoms, F-values, and p-values. |
r2.ci.type |
character vector with length 1 specifying the type of confidence intervals to compute for the variance explained (i.e., R^2 or eta^2). There are currently two options: 1) "Fdist" which calculates a non-symmetrical confidence interval based on the non-central F distribution (pg. 38, Smithson, 2003), 2) "classic" which calculates the confidence interval based on a large-sample theory standard error (eq. 3.6.3 in Cohen, Cohen, West, & Aiken, 2003), which is taken from Olkin & Finn (1995) - just above eq. 10. The confidence intervals for R^2-adjusted use the same formula as R^2, but replace R^2 with R^2 adjusted. Technically, the R^2 adjusted confidence intervals can have poor coverage (pg. 54, Smithson, 2003) |
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the traditional ANOVA tables should be returned as the last element of the return object. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
list of data.frames containing statistical information about the mean
comparisons for each variable (the rows of the data.frames are
vrb.nm
): 1) nhst = one-way ANOVA stat info in a data.frame,
2) desc = descriptive statistics stat info in a data.frame,
3) std = standardized effect sizes stat info in a data.frame,
4) anova = traditional ANOVA table in a numeric 3D array (only
returned if rtn.table = TRUE)
1) nhst = one-way ANOVA stat info in a data.frame
average mean difference across group pairs
NA to remind the user there is no standard error for the average mean difference
F-value
numerator degrees of freedom
denominator degrees of freedom
two-sided p-value
2) desc = descriptive statistics stat info in a data.frame (note there could be more than 3 groups - groups i, j, and k are just provided as an example)
mean of group k
mean of group j
mean of group i
standard deviation of group k
standard deviation of group j
standard deviation of group i
sample size of group k
sample size of group j
sample size of group i
3) std = standardized effect sizes stat info in a data.frame
R^2 estimate
R^2 standard error (only available if r2.ci.type
= "classic")
R^2 lower bound of the confidence interval
R^2 upper bound of the confidence interval
R^2-adjusted estimate
R^2-adjusted standard error (only available if r2.ci.type
= "classic")
R^2-adjusted lower bound of the confidence interval
R^2-adjusted upper bound of the confidence interval
4) anova = traditional ANOVA table in a numeric 3D array (only returned if rtn.table = TRUE).
The dimlabels of the array are "effect" for
the rows, "info" for the columns, and "vrb" for the layers. There are two
rows with rownames 1. "nom" and 2. "Residuals" where "nom" refers to the
between-group effect of the nominal variable and "Residuals" refers to the
within-group residual errors. There are 5 columns with colnames 1. "SS" = sum
of squares, 2. "df" = degrees of freedom, 3. "MS" = mean squares, 4. "F" =
F-value, and 5. "p" = p-value. Note the F-value and p-value will differ from
the "nhst" returned vector if var.equal
= FALSE because the
traditional ANOVA table always assumes variances are equal (i.e.
var.equal
= TRUE). There are as many layers as length(vrb.nm)
with the laynames equal to vrb.nm
.
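For example, assuming the return object stores this array under the name
"anova" (as listed above), one layer can be extracted by its layname (a
hedged usage sketch):

cmp <- means_compare(mtcars, vrb.nm = c("mpg","wt","qsec"), nom.nm = "gear")
cmp[["anova"]][ , , "mpg"] # "effect" x "info" matrix for the layer "mpg"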
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). New York, NY: Routledge.
Olkin, I., & Finn, J. D. (1995). Correlations redux. Psychological Bulletin, 118(1), 155-164.
Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage Publications.
oneway.test
the workhorse for means_compare
,
mean_compare
for a single variable across the same 3+ groups,
ci.R2
for confidence intervals of the variance explained,
means_diff
for multiple variables across only 2 groups,
means_compare(mtcars, vrb.nm = c("mpg","wt","qsec"), nom.nm = "gear")
means_compare(mtcars, vrb.nm = c("mpg","wt","qsec"), nom.nm = "gear", var.equal = FALSE)
means_compare(mtcars, vrb.nm = c("mpg","wt","qsec"), nom.nm = "gear", rtn.table = FALSE)
means_compare(mtcars, vrb.nm = "mpg", nom.nm = "gear")
means_diff
tests for mean differences across two independent groups
with independent two-samples t-tests. The function also calculates the
descriptive statistics for each group and the standardized mean differences
(i.e., Cohen's d) based on the pooled standard deviations. means_diff
is simply a wrapper for t.test
plus some extra
calculations.
means_diff(data, vrb.nm, bin.nm, lvl = levels(as.factor(data[[bin.nm]])), var.equal = TRUE, d.ci.type = "unbiased", ci.level = 0.95, check = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames specifying the variables in
|
bin.nm |
character vector of length 1 specifying the binary variable in
|
lvl |
character vector with length 2 specifying the unique values for
the two groups. If |
var.equal |
logical vector of length 1 specifying whether the variances of the groups are assumed to be equal (TRUE) or not (FALSE). If TRUE, a traditional independent two-samples t-test is computed; if FALSE, Welch's t-test is computed. These two tests differ by their degrees of freedom and p-values. |
d.ci.type |
character vector with length 1 specifying the type of
confidence intervals to compute for the standardized mean difference (i.e.,
Cohen's d). There are currently three options: 1) "unbiased" which
calculates the unbiased standard error of Cohen's d based on formula 25 in
Viechtbauer (2007). A symmetrical confidence interval is then calculated
based on the standard error. 2) "tdist" which calculates the confidence
intervals based on the t-distribution using the function
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if
|
means_diff
calculates the mean differences as
data[[vrb.nm]][data[[bin.nm]] == lvl[2], ]
-
data[[vrb.nm]][data[[bin.nm]] == lvl[1], ]
such that it is group 2 -
group 1. Group 1 corresponds to the first factor level of
data[[bin.nm]]
(after being coerced to a factor). Group 2 correspond
to the second factor level of data[[bin.nm]]
(after being coerced to a
factor). This was set up to handle dummy coded treatment variables in a
desirable way. For example, if data[[bin.nm]]
is a numeric vector with
values 0
and 1
, the default factor coercion will have the first
factor level be "0" and the second factor level be "1". The result will then
correspond to 1 - 0. However, if the first factor level of
data[[bin.nm]]
is "treatment" and the second factor level is
"control", the result will correspond to control - treatment. If the opposite
is desired (e.g., treatment - control), this can be reversed within the
function by specifying the lvl
argument as
c("control","treatment")
. Note, means_diff
diverts from
t.test
by calculating the mean difference as group 2 - group 1 (as
opposed to the group 1 - group 2 that t.test
does). However, group 2 -
group 1 is the convention that psych::cohen.d
uses as well.
means_diff
calculates the pooled standard deviation in a different way
than cohen.d
. Therefore, the Cohen's d estimates (and
confidence intervals if d.ci.type == "tdist") differ from those in
cohen.d
. means_diff
uses the total degrees of
freedom in the denominator while cohen.d
uses the total
sample size in the denominator - based on the notation in McGrath & Meyer
(2006). However, almost every introduction to statistics textbook uses the
total degrees of freedom in the denominator and that is what makes more sense
to me. See examples.
list of data.frames containing statistical information about
the mean differences (the rownames of each data.frame are vrb.nm
):
1) nhst = independent two-samples t-test stat info in a data.frame,
2) desc = descriptive statistics stat info in a data.frame,
3) std = standardized mean difference stat info in a data.frame
1) nhst = independent two-samples t-test stat info in a data.frame
mean difference estimate (i.e., group 2 - group 1)
standard error
t-value
degrees of freedom
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a data.frame
mean of group 2
mean of group 1
standard deviation of group 2
standard deviation of group 1
sample size of group 2
sample size of group 1
3) std = standardized mean difference stat info in a data.frame
Cohen's d estimate
Cohen's d standard error
Cohen's d lower bound of the confidence interval
Cohen's d upper bound of the confidence interval
McGrath, R. E., & Meyer, G. J. (2006). When effect sizes disagree: the case of r and d. Psychological Methods, 11(4), 386-401.
Viechtbauer, W. (2007). Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. Journal of Educational and Behavioral Statistics, 32(1), 39-60.
mean_diff
for independent two-sample t-test of a single variable,
t.test
the workhorse for means_diff,
cohen.d
for another standardized mean difference function,
means_change
for dependent two-sample t-tests,
means_test
for one-sample t-tests,
# independent two-samples t-tests
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs")
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", d.ci.type = "classic")
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", lvl = c("1","0")) # signs are reversed
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", lvl = c(1,0)) # can provide numeric levels for dummy variables

# compare to psych::cohen.d()
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", d.ci.type = "tdist")
tmp_nm <- c("mpg","cyl","disp","vs") # so that Roxygen2 doesn't freak out
cohend_obj <- psych::cohen.d(mtcars[tmp_nm], group = "vs")
as.data.frame(cohend_obj[["cohen.d"]]) # different estimate of cohen's d
# of course, this also leads to different confidence interval bounds as well

# same as intercept-only regression when var.equal = TRUE
means_diff(data = mtcars, vrb.nm = "mpg", bin.nm = "vs")
lm_obj <- lm(mpg ~ vs, data = mtcars)
coef(summary(lm_obj))

# if levels are not unique values in data[[bin.nm]]
## Not run:
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", lvl = c("zero", "1")) # an error message is returned
means_diff(data = mtcars, vrb.nm = c("mpg","cyl","disp"), bin.nm = "vs", lvl = c("0", "one")) # an error message is returned
## End(Not run)
means_test
computes sample means and compares them against specified
population mu
values. These are sometimes referred to as one-sample
t-tests. It provides the same results as t.test
, but
provides the confidence intervals for the mean differences from mu rather
than the mean itself. The function also calculates the descriptive statistics
and the standardized mean differences (i.e., Cohen's d) based on the sample
standard deviations.
means_test(data, vrb.nm, mu = 0, d.ci.type = "tdist", ci.level = 0.95, check = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames specifying the variables in
|
mu |
numeric vector of length = |
d.ci.type |
character vector with length 1 specifying the type of
confidence intervals to compute for the standardized mean differences
(i.e., Cohen's d). There are currently two options: 1. "tdist" which
calculates the confidence intervals based on the t-distribution using the
function |
ci.level |
numeric vector of length 1 specifying the confidence level. It must be between 0 and 1. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, checking whether
|
list of data.frames containing statistical information about the
sample means (the rownames of the data.frames are vrb.nm
): 1)
nhst = one-sample t-test stat info in a data.frame, 2) desc = descriptive
statistics stat info in a data.frame, 3) std = standardized mean difference
stat info in a data.frame
1) nhst = one-sample t-test stat info in a data.frame
mean - mu estimate
standard error
t-value
degrees of freedom
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a data.frame
mean of x
population value of comparison
standard deviation of x
sample size of x
3) std = standardized mean difference stat info in a data.frame
Cohen's d estimate
Cohen's d standard error
Cohen's d lower bound of the confidence interval
Cohen's d upper bound of the confidence interval
mean_test
one-sample t-test for a single variable,
t.test
same results,
means_diff
independent two-sample t-tests for multiple variables,
means_change
dependent two-sample t-tests for multiple variables,
# one-sample t-tests
means_test(data = attitude, vrb.nm = names(attitude), mu = 50)
means_test(data = attitude, vrb.nm = c("rating","complaints","privileges"), mu = c(60, 55, 50))
means_test(data = attitude, vrb.nm = names(attitude), mu = 50, ci.level = 0.90)
means_test(airquality, vrb.nm = names(airquality)) # different df and n due to missing data

# compare to t.test
means_test(data = attitude, vrb.nm = "rating", mu = 50, ci.level = .99)
t.test(attitude$"rating", mu = 50, conf.level = .99)

# same as intercept-only regression
means_test(data = attitude, vrb.nm = "rating")
lm_obj <- lm(rating ~ 1, data = attitude)
coef(summary(lm_obj))
mode2
calculates the statistical mode - a measure of central tendency
- of a numeric vector. This is in contrast to mode
in base R,
which returns the storage mode of an object. In the case that multiple modes
exist, the multiple
argument allows the user to specify if they want
the multiple modes returned or just one.
mode2(x, na.rm = FALSE, multiple = FALSE)
x |
atomic vector |
na.rm |
logical vector of length 1 specifying if missing values should
be removed from |
multiple |
logical vector of length 1 specifying if multiple modes
should be returned in the case they exist. If multiple modes exist and
|
atomic vector of the same storage mode as x
providing the
statistical mode(s).
# ONE MODE
vec <- c(7,8,9,7,8,9,9)
mode2(vec)
mode2(vec, multiple = TRUE)

# TWO MODES
vec <- c(7,8,9,7,8,9,8,9)
mode2(vec)
mode2(vec, multiple = TRUE)

# WITH NA
vec <- c(7,8,9,7,8,9,NA,9)
mode2(vec)
mode2(vec, na.rm = TRUE)
vec <- c(7,8,9,7,8,9,NA,9,NA,NA)
mode2(vec)
mode2(vec, multiple = TRUE)
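The underlying idea is simple enough to sketch in base R (a hedged
illustration, not mode2's source): tabulate the values and keep the one(s)
tied for the maximum frequency.

vec <- c(7, 8, 9, 7, 8, 9, 9)
tab <- table(vec)
as.numeric(names(tab)[tab == max(tab)]) # all values tied for the max frequency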
n_compare
tests whether all the values for a variable have equal
frequency with a chi-square test of goodness of fit. n_compare
does
not currently allow for user-specified unequal frequencies of values; this is
possible with chisq.test
. The function also calculates
the counts and overall percentages for the value frequencies.
n_compare
is simply a wrapper for chisq.test
plus
some extra calculations.
n_compare(x, simulate.p.value = FALSE, B = 2000)
x |
atomic vector. Probably makes sense to contain relatively few unique values. |
simulate.p.value |
logical vector of length 1 specifying whether the
p-value should be based on a Monte Carlo simulation rather than the classic
formula. See |
B |
integer vector of length 1 specifying how many Monte Carlo
simulations to run. Only used if |
list of numeric vectors containing statistical information about the frequency comparison: 1) nhst = chi-square test of goodness of fit stat info in a numeric vector, 2) count = numeric vector with the table of counts, 3) percent = numeric vector with the table of overall percentages
1) nhst = chi-square test of goodness of fit stat info in a numeric vector
average difference in subsample sizes (i.e., |ni - nj|)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (# of unique values - 1)
two-sided p-value
2) count = numeric vector with the table of counts plus an additional element for the total. The names are 1. "n_'lvl[k]'", 2. "n_'lvl[j]'", 3. "n_'lvl[i]'", ..., X = "total"
3) percent = numeric vector with the table of overall percentages plus an additional element for the total. The names are 1. "n_'lvl[k]'", 2. "n_'lvl[j]'", 3. "n_'lvl[i]'", ..., X = "total"
chisq.test
the workhorse for n_compare
,
props_test
for multiple dummy variables,
prop_diff
for chi-square test of independence,
n_compare(mtcars$"cyl")
n_compare(mtcars$"gear")
n_compare(mtcars$"cyl", simulate.p.value = TRUE)

# compare to chisq.test()
n_compare(mtcars$"cyl")
chisq.test(table(mtcars$"cyl"))
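For intuition, the equal-frequency null hypothesis can be worked out by hand
(a hedged sketch, not the package's code): with k unique values and n total
cases, each value is expected n / k times.

obs <- table(mtcars$"cyl") # observed counts
expd <- sum(obs) / length(obs) # expected count under equal frequencies
sum((obs - expd)^2 / expd) # chi-square statistic with df = k - 1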
ncases
counts how many cases in a data.frame have
a specified frequency of observed values across a set of columns. This function
is similar to nrow
and is essentially partial.cases
+ sum
. The user
can have ncases
return the number of complete cases by setting ov.min = 1
,
prop = TRUE
, and inclusive = TRUE
(the default).
ncases(data, vrb.nm = names(data), ov.min = 1, prop = TRUE, inclusive = TRUE)
data |
data.frame or matrix of data. |
vrb.nm |
a character vector of colnames from |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the case should
be included if the frequency of observed values in a row is exactly equal to |
integer vector of length 1 providing the nrow in data
with the given amount of observed values.
vrb_nm <- c("Ozone","Solar.R","Wind")
nrow(airquality[vrb_nm]) # number of cases regardless of missing data
sum(complete.cases(airquality[vrb_nm])) # number of complete cases
ncases(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind"),
   ov.min = 2/3) # number of rows with at least 2 of the 3 variables observed
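The documented composition (partial.cases + sum) can be verified directly
(a hedged illustration):

sum(partial.cases(data = airquality, vrb.nm = c("Ozone","Solar.R","Wind"),
   ov.min = 2/3)) # should match ncases() with the same arguments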
ncases_by
computes the ncases of a data.frame by group. Through the
use of the ov.min
, prop
, and inclusive
arguments, the
user can specify how many missing values are allowed in a row for it to be
counted. ncases_by
is simply a wrapper for ncases
+
agg_dfm
.
ncases_by(data, vrb.nm = str2str::pick(names(data), val = grp.nm, not = TRUE), grp.nm, sep = ".", ov.min = 1L, prop = TRUE, inclusive = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of colnames from |
sep |
character vector of length 1 specifying what string to use to
separate the groups when naming the return object. |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the case
should be included if the frequency of observed values in a row is exactly
equal to |
atomic vector with names = unique(interaction(data[grp.nm], sep
= sep))
and length = length(unique(interaction(data[grp.nm], sep =
sep)))
providing the ncases for each group.
# one grouping variable
tmp_nm <- c("outcome","case","session","trt_time")
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm]
stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat
ncases_by(data = dat, grp.nm = "case")
dat2 <- as.data.frame(ChickWeight)
ncases_by(data = dat2, grp.nm = "Chick")

# two grouping variables
tmp <- reshape(psych::bfi[1:10, ], varying = 1:25, timevar = "item",
   ids = row.names(psych::bfi)[1:10], direction = "long", sep = "")
tmp_nm <- c("id","item","N","E","C","A","O") # Roxygen runs the whole script
dat3 <- str2str::stack2(tmp[tmp_nm], select.nm = c("N","E","C","A","O"),
   keep.nm = c("id","item"))
ncases_by(dat3, grp.nm = c("id","vrb_names"))
ncases_desc
computes descriptive statistics about the number of cases
by group in a data.frame. This is often done in diary studies to obtain
information about compliance for the sample. Through the use of the
ov.min
, prop
, and inclusive
arguments, the user can
specify how many missing values are allowed in a row for it to be counted.
ncases_desc
is simply ncases_by
+ psych::describe
.
ncases_desc(data, vrb.nm = str2str::pick(names(data), val = grp.nm, not = TRUE), grp.nm, ov.min = 1, prop = TRUE, inclusive = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE, trim = 0.1, type = 3, quant = c(0.25, 0.75), IQR = FALSE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of colnames from |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the case
should be included if the frequency of observed values in a row is exactly
equal to |
interp |
logical vector of length 1 specifying whether the median should be standard (FALSE) or interpolated (TRUE). |
skew |
logical vector of length 1 specifying whether skewness and kurtosis should be calculated (TRUE) or not (FALSE). |
ranges |
logical vector of length 1 specifying whether the minimum,
maximum, and range (i.e., maximum - minimum) should be calculated (TRUE) or
not (FALSE). Note, if |
trim |
numeric vector of length 1 specifying the top and bottom quantiles of data that are to be excluded when calculating the trimmed mean. For example, the default value of 0.1 means that only data within the 10th - 90th quantiles are used for calculating the trimmed mean. |
type |
numeric vector of length 1 specifying the type of skewness and
kurtosis coefficients to compute. See the details of
|
quant |
numeric vector specifying the quantiles to compute. For example,
the default value of c(0.25, 0.75) computes the 25th and 75th quantiles of
the group number of cases. If |
IQR |
logical vector of length 1 specifying whether to compute the Interquartile Range (TRUE) or not (FALSE), which is simply the 75th quantile - 25th quantile. |
numeric vector containing descriptive statistics about number of cases by group. Note, which elements are returned depends on the arguments. See each argument's description.
number of groups
mean
standard deviation
median (standard if interp
= FALSE, interpolated if interp
= TRUE)
trimmed mean based on trim
median absolute deviation
minimum
maximum
maximum - minimum
skewness
kurtosis
standard error of the mean
75th quantile - 25th quantile
quantiles, which are named by quant
(e.g., 0.25 = "Q0.25")
tmp_nm <- c("outcome","case","session","trt_time")
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm]
stats_by <- psych::statsBy(dat, group = "case") # doesn't include everything you want
ncases_desc(data = dat, grp.nm = "case")

dat2 <- as.data.frame(ChickWeight)
ncases_desc(data = dat2, grp.nm = "Chick")
ncases_desc(data = dat2, grp.nm = "Chick", trim = .05)
ncases_desc(data = dat2, grp.nm = "Chick", ranges = FALSE)
ncases_desc(data = dat2, grp.nm = "Chick", quant = NULL)
ncases_desc(data = dat2, grp.nm = "Chick", IQR = TRUE)
ncases_ml
computes the number of cases and number of groups in the data
that are at least partially observed, given a specified frequency of observed
values across a set of columns. ncases_ml
allows the user to specify
the frequency of columns that need to be observed in order to count the case.
Groups can be excluded if no rows in the data for a group have enough
observed values to be counted as cases. This is simply a combination of
partial.cases
+ nrow_ml
. Note, ncases_ml
is essentially
a version of nrow_ml
that accounts for missing data.
ncases_ml(data, vrb.nm = str2str::pick(names(data), val = grp.nm, not = TRUE), grp.nm, ov.min = 1L, prop = TRUE, inclusive = TRUE)
data |
data.frame of data. |
vrb.nm |
a character vector of colnames from |
grp.nm |
character vector of colnames from |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the case
should be included if the frequency of observed values in a row is exactly
equal to |
list with two elements providing the sample sizes (accounting for
missing data). The first element is named "within" and contains the number
of cases in the data. The second element is named "between" and contains
the number of groups in the data. Cases are counted if the frequency of
observed values is greater than (or equal to, if inclusive
= TRUE) ov.min.
nrow_ml
ncases_by
partial.cases
# NO MISSING DATA
# one grouping variable
ncases_ml(data = as.data.frame(ChickWeight), grp.nm = "Chick")
# multiple grouping variables
ncases_ml(data = mtcars, grp.nm = c("vs","am"))

# YES MISSING DATA
# only within
nrow_ml(data = airquality, grp.nm = "Month")
ncases_ml(data = airquality, grp.nm = "Month")
# both within and between
airquality2 <- airquality
airquality2[airquality2$"Month" == 6, "Ozone"] <- NA
nrow_ml(data = airquality2, grp.nm = "Month")
ncases_ml(data = airquality2, grp.nm = "Month")
ngrp
computes the number of groups in data given one or more grouping
variables. This is simply a combination of unique.data.frame
+
nrow
.
ngrp(data, grp.nm)
data |
data.frame of data. |
grp.nm |
character vector of colnames from |
integer vector of length 1 specifying the number of groups.
nrow_ml
ncases_ml
nrow_by
ncases_by
# one grouping variable
Orthodont2 <- as.data.frame(nlme::Orthodont)
ngrp(Orthodont2, grp.nm = "Subject")
length(unique(Orthodont2$"Subject"))

# two grouping variables
co2 <- as.data.frame(CO2)
ngrp(co2, grp.nm = c("Plant"))
grp_nm <- c("Type","Treatment")
ngrp(co2, grp.nm = grp_nm)
unique.data.frame(co2[grp_nm])
#TODO: how does it handle factor levels with no cases?
nhst
computes the statistical information for null hypothesis
significance testing (NHST), t-values, p-values, etc., from parameter
estimates, standard errors, and degrees of freedom. If degrees of freedom are
not applicable or available, then df
can be set to Inf
(the
default) and z-values rather than t-values will be computed.
nhst(est, se, df = Inf, ci.level = 0.95, p.value = "two.sided")
est |
numeric vector of parameter estimates. |
se |
numeric vector of standard errors. Must be the same length as
|
df |
numeric vector of degrees of freedom. Must be of length 1 or have the
same length as |
ci.level |
double vector of length 1 specifying the confidence level. Must be between 0 and 1 - or can be NULL in which case no confidence intervals are computed and the return object does not have the columns "lwr" or "upr". |
p.value |
character vector of length 1 specifying the type of p-values to compute. The options are 1) "two.sided" which computed non-directional, two-tailed p-values, 2) "less", which computes negative-directional, one-tailed p-values, or 3) "greater", which computes positive-directional, one-tailed p-values. |
data.frame with nrow equal to the lengths of est
and
se
. The rownames are taken from est
, unless est
does not
have any names and then the rownames are taken from the names of se
.
If neither have names, then the rownames are automatic (i.e.,
1:nrow()
). The columns are the following:
parameter estimates
standard errors
t-values (z-values if df = Inf)
degrees of freedom
p-values
lower bound of the confidence intervals (excluded if ci.level = NULL
)
upper bound of the confidence intervals (excluded if ci.level = NULL
)
est <- colMeans(attitude)
se <- apply(X = str2str::d2m(attitude), MARGIN = 2,
   FUN = function(vec) sqrt(var(vec) / length(vec)))
df <- nrow(attitude) - 1
nhst(est = est, se = se, df = df)
nhst(est = est, se = se) # default is df = Inf resulting in z-values
nhst(est = est, se = se, df = df, ci.level = NULL) # no "lwr" or "upr" columns
nhst(est = est, se = se, df = df, ci.level = 0.99)
nom2dum
converts a nominal variable into a set of dummy variables.
There is one dummy variable for each unique value in the nominal variable.
Note, base R does this recoding internally through the
model.matrix.default
function, but it is used in the context of
regression-like models and it is not clear how to simplify it for general use
cases outside that context.
nom2dum(nom, yes = 1L, no = 0L, prefix = "", rtn.fct = FALSE)
nom |
character vector (or any atomic vector, including factors, which will be then coerced to a character vector) specifying the nominal variable. |
yes |
atomic vector of length 1 specifying what unique value should represent rows when the nominal category of interest is present. For a traditional dummy variable this value would be 1. |
no |
atomic vector of length 1 specifying what unique value should represent rows when the nominal category of interest is absent. For a traditional dummy variable this value would be 0. |
prefix |
character vector of length 1 specifying the string that should be appended to the beginning of each colname in the return object. |
rtn.fct |
logical vector of length 1 specifying whether the columns of
the return object should be factors where the first level is |
Note that yes
and no
are assumed to be the same typeof. If
they are not, then the columns in the return object will be coerced to the
most complex typeof (i.e., most to least: character, double, integer,
logical).
data.frame of dummy columns with colnames specified by
paste0(prefix, unique(nom))
and rownames specified by
names(nom)
or default data.frame
rownames (i.e.,
c("1","2","3", etc.) if names(nom)
is NULL
.
nom2dum(infert$"education") # default
nom2dum(infert$"education", prefix = "edu_") # use of the `prefix` argument
nom2dum(nom = infert$"education", yes = "one", no = "zero",
   rtn.fct = TRUE) # returns factor columns
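The core recoding is easy to sketch in base R (a hedged illustration, not
nom2dum's source): one indicator column per unique value of the nominal
variable.

nom <- as.character(infert$"education")
sapply(unique(nom), function(lvl) ifelse(nom == lvl, 1L, 0L)) # yes = 1L, no = 0L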
nrow_by
computes the nrow of a data.frame by group. nrow_by
is
simply a wrapper for nrow
+ agg_dfm
.
nrow_by(data, grp.nm, sep = ".")
data |
data.frame of data. |
grp.nm |
character vector of colnames from |
sep |
character vector of length 1 specifying what string to use to
separate the groups when naming the return object. |
atomic vector with names = unique(interaction(data[grp.nm], sep
= sep))
and length = length(unique(interaction(data[grp.nm], sep =
sep)))
providing the nrow for each group.
# one grouping variable
tmp_nm <- c("outcome","case","session","trt_time")
dat <- as.data.frame(lmeInfo::Bryant2016)[tmp_nm]
stats_by <- psych::statsBy(dat, group = "case") # requires you to include "case" column in dat
nrow_by(data = dat, grp.nm = "case")
dat2 <- as.data.frame(ChickWeight)
nrow_by(data = dat2, grp.nm = "Chick")

# two grouping variables
tmp <- reshape(psych::bfi[1:10, ], varying = 1:25, timevar = "item",
   ids = row.names(psych::bfi)[1:10], direction = "long", sep = "")
tmp_nm <- c("id","item","N","E","C","A","O") # Roxygen runs the whole script
dat3 <- str2str::stack2(tmp[tmp_nm], select.nm = c("N","E","C","A","O"),
   keep.nm = c("id","item"))
nrow_by(dat3, grp.nm = c("id","vrb_names"))
nrow_ml
computes the number rows in the data as well as the number of
groups in the data. This corresponds to the within-group sample size and
between-group sample size (ignoring any missing data). This is simply a
combination of nrow
+ ngrp
.
nrow_ml(data, grp.nm)
data |
data.frame of data. |
grp.nm |
character vector of colnames from |
list with two elements providing the sample sizes (ignoring missing data). The first element is named "within" and contains the number of rows in the data. The second element is named "between" and contains the number of groups in the data.
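A minimal sketch of the two sample sizes, assuming complete data (nrow_ml2 is an illustrative name, not the package's code):

nrow_ml2 <- function(data, grp.nm) {
   list(within = nrow(data), # number of rows
      between = nrow(unique(data[grp.nm]))) # number of unique groups
}
nrow_ml2(data = as.data.frame(ChickWeight), grp.nm = "Chick")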
ncases_ml
nrow_by
ncases_by
ngrp
# one grouping variable
nrow_ml(data = as.data.frame(ChickWeight), grp.nm = "Chick")
# multiple grouping variables
nrow_ml(data = mtcars, grp.nm = c("vs","am"))
partial.cases
indicates which cases are at least partially observed,
given a specified frequency of observed values across a set of columns. This
function builds off complete.cases
. While
complete.cases
requires completely observed cases,
partial.cases
allows the user to specify the frequency of columns
required to be observed. The default arguments are equal to
complete.cases
.
partial.cases(data, vrb.nm, ov.min = 1, prop = TRUE, inclusive = TRUE)
data |
data.frame or matrix of data. |
vrb.nm |
a character vector of colnames from |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the case
should be included if the frequency of observed values in a row is exactly
equal to |
logical vector of length = nrow(data)
with names =
rownames(data)
specifying if the frequency of observed values is
greater than (or equal to, if inclusive
= TRUE) ov.min
.
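A minimal sketch of the logic when prop = TRUE and inclusive = TRUE (an illustrative equivalent, not necessarily the package's implementation):

vrb_nm <- c("Ozone","Solar.R","Wind")
ov_prop <- rowMeans(!is.na(airquality[vrb_nm])) # proportion observed per row
cases2keep_sketch <- ov_prop >= .66 # keep rows with at least 66% observed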
cases2keep <- partial.cases(data = airquality,
   vrb.nm = c("Ozone","Solar.R","Wind"), ov.min = .66)
airquality2 <- airquality[cases2keep, ] # all cases with 2/3 variables observed
cases2keep <- partial.cases(data = airquality,
   vrb.nm = c("Ozone","Solar.R","Wind"), ov.min = 1, prop = TRUE,
   inclusive = TRUE)
complete_cases <- complete.cases(airquality)
identical(x = unname(cases2keep), y = complete_cases)
# partial.cases(ov.min = 1, prop = TRUE, inclusive = TRUE) = complete.cases()
pomp
recodes a numeric vector to percentage of maximum possible (POMP)
units. This can be useful when data is measured with arbitrary units (e.g.,
Likert scale).
pomp(x, mini, maxi, relative = FALSE, unit = 1)
x |
numeric vector. |
mini |
numeric vector of length 1 specifying the minimum numeric value possible. |
maxi |
numeric vector of length 1 specifying the maximum numeric value possible. |
relative |
logical vector of length 1 specifying whether relative POMP
scores (rather than absolute POMP scores) should be created. If TRUE, then
the |
unit |
numeric vector of length 1 specifying how many percentage points
are desired for the units. Traditionally, POMP scores use |
There are two common approaches to POMP scores: 1) absolute POMP units, where the minimum and maximum are the smallest/largest values possible from the measurement instrument (e.g., 1 to 7 on a Likert scale), and 2) relative POMP units, where the minimum and maximum are the smallest/largest values observed in the data (e.g., 1.3 to 6.8 on a Likert scale). Both will be correlated perfectly with the original units as they are each linear transformations.
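A minimal sketch of the absolute POMP transformation, assuming the standard formula (pomp_sketch is an illustrative name, not the package's code):

pomp_sketch <- function(x, mini, maxi, unit = 1)
   (x - mini) / (maxi - mini) * 100 / unit
pomp_sketch(x = 1:6, mini = 1, maxi = 6) # 0 to 100 in 1-point units
pomp_sketch(x = 1:6, mini = 1, maxi = 6, unit = 50) # 0 to 2 in 50-point units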
numeric vector from recoding x
to percentage of maximum
possible (pomp) with units specified by unit
.
vec <- psych::bfi[[1]]
pomp(x = vec, mini = 1, maxi = 6) # absolute POMP units
pomp(x = vec, relative = TRUE) # relative POMP units
pomp(x = vec, mini = 1, maxi = 6, unit = 100) # unit = 100
pomp(x = vec, mini = 1, maxi = 6, unit = 50) # unit = 50
pomps
recodes numeric data to percentage of maximum possible (POMP)
units. This can be useful when data is measured with arbitrary units (e.g.,
Likert scale).
pomps(data, vrb.nm, mini, maxi, relative = FALSE, unit = 1, suffix = paste0("_p", unit))
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
mini |
numeric vector of length 1 specifying the minimum numeric value possible. Note, this is assumed to be the same for each variable. |
maxi |
numeric vector of length 1 specifying the maximum numeric value possible. Note, this is assumed to be the same for each variable. |
relative |
logical vector of length 1 specifying whether relative POMP
scores (rather than absolute POMP scores) should be created. If TRUE, then
the |
unit |
numeric vector of length 1 specifying how many percentage points
are desired for the units. Traditionally, POMP scores use |
suffix |
character vector of length 1 specifying the string to add to the end of the column names in the return object. |
There are two common approaches to POMP scores: 1) absolute POMP units, where the minimum and maximum are the smallest/largest values possible from the measurement instrument (e.g., 1 to 7 on a Likert scale), and 2) relative POMP units, where the minimum and maximum are the smallest/largest values observed in the data (e.g., 1.3 to 6.8 on a Likert scale). Both will be correlated perfectly with the original units as they are each linear transformations.
data.frame of variables recoded to percentage of maximum possible
(pomp) with units specified by unit
and names specified by
paste0(vrb.nm, suffix)
.
vrb_nm <- names(psych::bfi)[grepl(pattern = "A", x = names(psych::bfi))]
pomps(data = psych::bfi, vrb.nm = vrb_nm, mini = 1, maxi = 6) # absolute POMP units
pomps(data = psych::bfi, vrb.nm = vrb_nm, relative = TRUE) # relative POMP units
pomps(data = psych::bfi, vrb.nm = vrb_nm, mini = 1, maxi = 6, unit = 100) # unit = 100
pomps(data = psych::bfi, vrb.nm = vrb_nm, mini = 1, maxi = 6, unit = 50) # unit = 50
pomps(data = psych::bfi, vrb.nm = vrb_nm, mini = 1, maxi = 6, suffix = "_pomp")
prop_compare
tests for proportion differences across 3+ independent
groups with a chi-square test of independence. The function also calculates
the descriptive statistics for each group, Cramer's V and its confidence
interval as a standardized effect size, and can provide the X by 2
contingency tables. prop_compare
is simply a wrapper for
prop.test
plus some extra calculations.
prop_compare(x, nom, lvl = levels(as.factor(nom)), yates = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
x |
numeric vector that only has values of 0 or 1 (or missing values), otherwise known as a dummy variable. |
nom |
atomic vector that takes on three or more unordered values (or missing values), otherwise known as a nominal variable. |
lvl |
character vector with length 3+ specifying the unique values for
the 3+ independent groups. If |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the X by 2 contingency table of counts with totals and the X by 2 overall percentages table. If TRUE, then the last two elements of the return object are "count" containing a matrix of counts and "percent" containing a matrix of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
The confidence interval for Cramer's V is calculated with Fisher's r-to-z transformation, as Cramer's V is a kind of multiple correlation coefficient. Cramer's V is transformed to Fisher's z units, a symmetric confidence interval for Fisher's z is calculated, and then the lower and upper bounds are back-transformed to Cramer's V units.
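A minimal sketch of that interval, assuming the usual 1/sqrt(n - 3) standard error for Fisher's z (cramer_ci, v, and n are illustrative names, not the package's internals):

cramer_ci <- function(v, n, ci.level = 0.95) {
   z <- atanh(v) # Fisher's r-to-z transformation
   crit <- qnorm(1 - (1 - ci.level) / 2)
   tanh(z + c(lwr = -1, upr = 1) * crit / sqrt(n - 3)) # back-transform to V units
}
cramer_ci(v = 0.30, n = 150)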
list of numeric vectors containing statistical information about the
proportion comparisons: 1) nhst = chi-square test of independence stat info
in a numeric vector, 2) desc = descriptive statistics stat info in a
numeric vector, 3) std = standardized effect size and its confidence
interval in a numeric vector, 4) count = numeric matrix with dim =
[X+1, 3]
of the X by 2 contingency table of counts with an
additional row and column for totals (if rtn.table
= TRUE), 5)
percent = numeric matrix with dim = [X+1, 3]
of the X by 2
contingency table of overall percentages with an additional row and column
for totals (if rtn.table
= TRUE).
1) nhst = chi-square test of independence stat info in a numeric vector
average proportion difference absolute value (i.e., |group j - group i|)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (of the nominal variable)
two-sided p-value
2) desc = descriptive statistics stat info in a numeric vector (note there could be more than 3 groups - groups i, j, and k are just provided as an example):
proportion of group k
proportion of group j
proportion of group i
standard deviation of group k
standard deviation of group j
standard deviation of group i
sample size of group k
sample size of group j
sample size of group i
3) std = standardized effect size and its confidence interval in a numeric vector
Cramer's V estimate
lower bound of Cramer's V confidence interval
upper bound of Cramer's V confidence interval
4) count = numeric matrix with dim = [X+1, 3]
of the X by 2
contingency table of counts with an additional row and column for totals (if
rtn.table
= TRUE).
The 3+ unique observed values of nom
- plus the total - are the rows
and the two unique observed values of x
(i.e., 0 and 1) - plus the
total - are the columns. The dimlabels are "nom" for the rows and "x" for the
columns. The rownames are 1. 'lvl[i]', 2. 'lvl[j]', 3. 'lvl[k]', 4. "total".
The colnames are 1. "0", 2. "1", 3. "total".
5) percent = numeric matrix with dim = [X+1, 3]
of the X by 2
contingency table of overall percentages with an additional row and column
for totals (if rtn.table
= TRUE).
The 3+ unique observed values of nom
- plus the total - are the rows
and the two unique observed values of x
(i.e., 0 and 1) - plus the
total - are the columns. The dimlabels are "nom" for the rows and "x" for the
columns. The rownames are 1. 'lvl[i]', 2. 'lvl[j]', 3. 'lvl[k]', 4. "total".
The colnames are 1. "0", 2. "1", 3. "total".
prop.test
the workhorse for prop_compare
,
props_compare
for multiple dummy variables,
prop_diff
for only 2 independent groups (aka binary variable),
tmp <- replicate(n = 10, expr = mtcars, simplify = FALSE)
mtcars2 <- str2str::ld2d(tmp)
mtcars2$"cyl_fct" <- car::recode(mtcars2$"cyl",
   recodes = "4='four'; 6='six'; 8='eight'", as.factor = TRUE)
prop_compare(x = mtcars2$"am", nom = mtcars2$"cyl_fct")
prop_compare(x = mtcars2$"am", nom = mtcars2$"cyl_fct",
   lvl = c("four","six","eight")) # specify order of levels in return object
# more than 3 groups
prop_compare(x = ifelse(airquality$"Wind" >= 10, yes = 1, no = 0),
   nom = airquality$"Month")
prop_compare(x = ifelse(airquality$"Wind" >= 10, yes = 1, no = 0),
   nom = airquality$"Month", rtn.table = FALSE) # no contingency tables
prop_diff
tests for proportion differences across two independent
groups with a chi-square test of independence. The function also calculates
the descriptive statistics for each group, various standardized effect sizes
(e.g., Cramer's V), and can provide the 2x2 contingency tables.
prop_diff
is simply a wrapper for prop.test
plus
some extra calculations.
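The building blocks can be sketched directly: a 2x2 table, an (uncorrected) chi-square test, and the standard identity phi = sqrt(X2 / n). A sketch of that arithmetic, not the package's internal code:

tab <- table(mtcars$"vs", mtcars$"am") # 2x2 contingency table
chi <- chisq.test(tab, correct = FALSE) # uncorrected so the phi identity holds
phi <- sqrt(unname(chi$statistic) / sum(tab)) # phi coefficient from chi-square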
prop_diff(x, bin, lvl = levels(as.factor(bin)), yates = TRUE, zero.cell = 0.05, smooth = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
x |
numeric vector that only has values of 0 or 1 (or missing values), otherwise known as a dummy variable. |
bin |
atomic vector that only takes on two values (or missing values), otherwise known as a binary variable. |
lvl |
character vector with length 2 specifying the unique values for
the two groups. If |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
zero.cell |
numeric vector of length 1 specifying what value to impute
for zero cell counts in the 2x2 contingency table when computing the
tetrachoric correlation. See |
smooth |
logical vector of length 1 specifying whether a smoothing
algorithm should be applied when estimating the tetrachoric correlation.
See |
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the 2x2 contingency table of counts with totals and the 2x2 overall percentages table. If TRUE, then the last two elements of the return object are "count" containing a matrix of counts and "percent" containing a matrix of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
list of numeric vectors containing statistical information about the
proportion difference: 1) nhst = chi-square test of independence stat info in a numeric vector,
2) desc = descriptive statistics stat info in a numeric vector, 3) std = various
standardized effect sizes in a numeric vector, 4) count = numeric matrix with
dim = [3, 3]
of the 2x2 contingency table of counts with an additional
row and column for totals (if rtn.table
= TRUE), 5) percent = numeric
matrix with dim = [3, 3]
of the 2x2 contingency table of overall percentages
with an additional row and column for totals (if rtn.table
= TRUE)
1) nhst = chi-square test of independence stat info in a numeric vector
proportion difference estimate (i.e., group 2 - group 1)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (will always be 1)
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) desc = descriptive statistics stat info in a numeric vector
proportion of group 2
proportion of group 1
standard deviation of group 2
standard deviation of group 1
sample size of group 2
sample size of group 1
3) std = various standardized effect sizes in a numeric vector
Cramer's V estimate
Cohen's h estimate
Phi coefficient estimate
Yule coefficient estimate
Tetrachoric correlation estimate
odds ratio estimate
risk ratio estimate (i.e., group 2 / group 1). Note this value will often differ when recoding variables (as it should).
4) count = numeric matrix with dim = [3, 3]
of the 2x2 contingency table of
counts with an additional row and column for totals (if rtn.table
= TRUE).
The two unique observed values of x
(i.e., 0 and 1) - plus the
total - are the rows and the two unique observed values of bin
- plus
the total - are the columns. The dimlabels are "x" for the rows and "bin" for
the columns. The rownames are 1. "0", 2. "1", 3. "total". The colnames are 1.
'lvl[1]', 2. 'lvl[2]', 3. "total".
5) percent = numeric matrix with dim = [3, 3]
of the 2x2 contingency table of overall percentages with an additional
row and column for totals (if rtn.table
= TRUE).
The two unique observed values of x
(i.e., 0 and 1) - plus the total -
are the rows and the two unique observed values of bin
- plus the total -
are the columns. The dimlabels are "x" for the rows and "bin" for the columns.
The rownames are 1. "0", 2. "1", 3. "total". The colnames are 1. 'lvl[1]',
2. 'lvl[2]', 3. "total".
prop.test
the workhorse for prop_diff
,
props_diff
for multiple dummy variables,
phi
for another phi coefficient function
Yule
for another yule coefficient function
tetrachoric
for another tetrachoric coefficient function
# chi-square test of independence
# x = "am", bin = "vs"
mtcars2 <- mtcars
mtcars2$"vs_bin" <- ifelse(mtcars$"vs" == 1, yes = "yes", no = "no")
agg(mtcars2$"am", grp = mtcars2$"vs_bin", rep = FALSE, fun = mean)
prop_diff(x = mtcars2$"am", bin = mtcars2$"vs_bin")
prop_diff(x = mtcars2$"am", bin = mtcars2$"vs")
# using the `lvl` argument
prop_diff(x = mtcars2$"am", bin = mtcars2$"vs_bin")
prop_diff(x = mtcars2$"am", bin = mtcars2$"vs_bin",
   lvl = c("yes","no")) # reverses the direction of the effect
prop_diff(x = mtcars2$"am", bin = mtcars2$"vs",
   lvl = c(1, 0)) # levels don't have to be character
# recoding the variables
prop_diff(x = mtcars2$"am",
   bin = ifelse(mtcars2$"vs_bin" == "yes", yes = "no",
      no = "yes")) # reverses the direction of the effect
prop_diff(x = ifelse(mtcars2$"am" == 1, yes = 0, no = 1),
   bin = mtcars2$"vs") # reverses the direction of the effect
prop_diff(x = ifelse(mtcars2$"am" == 1, yes = 0, no = 1),
   bin = ifelse(mtcars2$"vs_bin" == "yes", yes = "no",
      no = "yes")) # double reverse means same direction of the effect
# compare to stats::prop.test
# x = "am", bin = "vs_bin" (binary as the rows; dummy as the columns)
tmp <- c("vs_bin","am")
table_obj <- table(mtcars2[tmp])
row_order <- nrow(table_obj):1
col_order <- ncol(table_obj):1
table_obj4prop <- table_obj[row_order, col_order]
prop.test(table_obj4prop)
# compare to stats::chisq.test
chisq.test(x = mtcars2$"am", y = mtcars2$"vs_bin")
# compare to psych::phi
cor(mtcars2$"am", mtcars$"vs")
psych::phi(table_obj, digits = 7)
# compare to psych::Yule
psych::Yule(table_obj)
# compare to psych::tetrachoric
psych::tetrachoric(table_obj)
# Note, I couldn't find a case where psych::tetrachoric() failed to compute
psych::tetrachoric(table_obj4prop)
# different than single logistic regression
summary(glm(am ~ vs, data = mtcars, family = binomial(link = "logit")))
prop_test
tests for a sample proportion difference from a population
proportion with a chi-square test of goodness of fit. The default is that the
goodness of fit is consistent with a population proportion Pi of 0.50. The
function also calculates the descriptive statistics, various standardized
effect sizes (e.g., Cramer's V), and can provide the 1x2 contingency tables.
prop_test
is simply a wrapper for prop.test
plus
some extra calculations.
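A minimal sketch of the underlying call, assuming prop_test forwards its arguments to stats::prop.test roughly like this:

x <- mtcars$"am"
prop.test(x = sum(x == 1, na.rm = TRUE), # count of 1s
   n = sum(!is.na(x)), # number of observed values
   p = 0.5, # population proportion pi
   correct = TRUE, # Yates' continuity correction
   conf.level = 0.95)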
prop_test(x, pi = 0.5, yates = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
x |
numeric vector that only has values of 0 or 1 (or missing values), otherwise known as a dummy variable. |
pi |
numeric vector of length 1 specifying the population proportion value to compare the sample proportion against. |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the 1x2 contingency table of counts with totals and the 1x2 overall percentages table. If TRUE, then the last two elements of the return object are "count" containing a vector of counts and "percent" containing a vector of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
list of numeric vectors containing statistical information about the
proportion difference from pi: 1) nhst = chi-square test of goodness of fit stat
info in a numeric vector, 2) desc = descriptive statistics stat info in a
numeric vector, 3) std = various standardized effect sizes in a numeric vector,
4) count = numeric vector of length 3 with table of counts with an additional
element for the total (if rtn.table
= TRUE), 5) percent = numeric vector
of length 3 with table of overall percentages with an element for the total
(if rtn.table
= TRUE)
1) nhst = chi-square test of goodness of fit stat info in a numeric vector
proportion difference estimate (i.e., sample proportion - pi)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (will always be 1)
two-sided p-value
2) desc = descriptive statistics stat info in a numeric vector
sample proportion
population proportion provided by the user (or 0.50 by default)
standard deviation
sample size
lower bound of the confidence interval of the sample proportion itself
upper bound of the confidence interval of the sample proportion itself
3) std = various standardized effect sizes in a numeric vector
Cramer's V estimate
Cohen's h estimate
4) count = numeric vector of length 3 with table of counts with an additional
element for the total (if rtn.table
= TRUE). The names are 1. "0", 2.
"1", 3. "total"
5) percent = numeric vector of length 3 with table of overall percentages with
an element for the total (if rtn.table
= TRUE). The names are 1. "0", 2.
"1", 3. "total"
prop.test
the workhorse for prop_test
,
props_test
for multiple dummy variables,
prop_diff
for chi-square test of independence,
# chi-square test of goodness of fit
table(mtcars$"am")
prop_test(mtcars$"am")
prop_test(ifelse(mtcars$"am" == 1, yes = 0, no = 1))
# different than intercept only logistic regression
summary(glm(am ~ 1, data = mtcars, family = binomial(link = "logit")))
# error from non-dummy variable
## Not run:
prop_test(ifelse(mtcars$"am" == 1, yes = "1", no = "0"))
prop_test(ifelse(mtcars$"am" == 1, yes = 2, no = 1))
## End(Not run)
props_compare
tests for proportion differences across 3+ independent
groups with chi-square tests of independence. The function also calculates
the descriptive statistics for each group, Cramer's V and its confidence
interval as a standardized effect size, and can provide the X by 2
contingency tables. props_compare
is simply a wrapper for
prop.test
plus some extra calculations.
props_compare(data, vrb.nm, nom.nm, lvl = levels(as.factor(data[[nom.nm]])), yates = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
nom.nm |
character vector of length 1 specifying the colname in
|
lvl |
character vector with length 3+ specifying the unique values for
the 3+ independent groups. If |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the X by 2 contingency table of counts with totals for each dummy variable and the X by 2 overall percentages table with totals for each dummy variable. If TRUE, then the last two elements of the return object are "count" containing an array of counts and "percent" containing an array of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
The confidence interval for Cramer's V is calculated with Fisher's r-to-z transformation, as Cramer's V is a kind of multiple correlation coefficient. Cramer's V is transformed to Fisher's z units, a symmetric confidence interval for Fisher's z is calculated, and then the lower and upper bounds are back-transformed to Cramer's V units.
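Conceptually, the props_* functions apply their singular counterpart to each dummy column. A sketch of that idea (mt is an illustrative object; this is not the package's exact implementation):

mt <- mtcars
mt$"gear_dum" <- ifelse(mt$"gear" > 3, yes = 1L, no = 0L)
results <- lapply(mt[c("am","gear_dum")],
   FUN = prop_compare, nom = mt$"cyl") # one result per dummy variable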
list of data.frames containing statistical information about the
proportion comparisons: 1) nhst = chi-square test of independence stat info
in a data.frame, 2) desc = descriptive statistics stat info in a data.frame
(note there could be more than 3 groups - groups i, j, and k are just
provided as an example), 3) std = standardized effect size and its
confidence interval in a data.frame, 4) count = numeric array with dim =
[X+1, 3, length(vrb.nm)]
of the X by 2 contingency table of counts
for each dummy variable with an additional row and column for totals (if
rtn.table
= TRUE), 5) percent = numeric array with dim = [X+1,
3, length(vrb.nm)]
of the X by 2 contingency table of overall percentages
for each dummy variable with an additional row and column for totals (if
rtn.table
= TRUE).
1) nhst = chi-square test of independence stat info in a data.frame
average proportion difference absolute value (i.e., |group j - group i|)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (of the nominal variable)
two-sided p-value
2) desc = descriptive statistics stat info in a data.frame (note there could be more than 3 groups - groups i, j, and k are just provided as an example):
proportion of group k
proportion of group j
proportion of group i
standard deviation of group k
standard deviation of group j
standard deviation of group i
sample size of group k
sample size of group j
sample size of group i
3) std = standardized effect size and its confidence interval in a data.frame
Cramer's V estimate
lower bound of Cramer's V confidence interval
upper bound of Cramer's V confidence interval
4) count = numeric array with dim = [X+1, 3, length(vrb.nm)]
of the X
by 2 contingency table of counts for each dummy variable with an additional
row and column for totals (if rtn.table
= TRUE).
The 3+ unique observed values of data[[nom.nm]]
- plus the total - are
the rows and the two unique observed values of data[[vrb.nm]]
(i.e., 0
and 1) - plus the total - are the columns. The variables in
data[vrb.nm]
are the layers. The dimlabels are "nom" for the rows and
"x" for the columns and "vrb" for the layers. The rownames are 1. 'lvl[i]',
2. 'lvl[j]', 3. 'lvl[k]', 4. "total". The colnames are 1. "0", 2. "1", 3.
"total". The laynames are vrb.nm
.
5) percent = numeric array with dim = [X+1, 3, length(vrb.nm)]
of the
X by 2 contingency table of overall percentages for each dummy variable with
an additional row and column for totals (if rtn.table
= TRUE).
The 3+ unique observed values of data[[nom.nm]]
- plus the total - are
the rows and the two unique observed values of data[[vrb.nm]]
(i.e., 0
and 1) - plus the total - are the columns. The variables in
data[vrb.nm]
are the layers. The dimlabels are "nom" for the rows, "x"
for the columns, and "vrb" for the layers. The rownames are 1. 'lvl[i]', 2.
'lvl[j]', 3. 'lvl[k]', 4. "total". The colnames are 1. "0", 2. "1", 3.
"total". The laynames are vrb.nm
.
prop.test
the workhorse for props_compare
,
prop_compare
for a single dummy variable,
props_diff
for only 2 independent groups (aka binary variable),
# rtn.table = TRUE (default)
# multiple variables
tmp <- replicate(n = 10, expr = mtcars, simplify = FALSE)
mtcars2 <- str2str::ld2d(tmp)
mtcars2$"gear_dum" <- ifelse(mtcars2$"gear" > 3, yes = 1L, no = 0L)
mtcars2$"carb_dum" <- ifelse(mtcars2$"carb" > 3, yes = 1L, no = 0L)
vrb_nm <- c("am","gear_dum","carb_dum") # dummy variables
lapply(X = vrb_nm, FUN = function(nm) {
   tmp <- c("cyl", nm)
   table(mtcars2[tmp])
})
props_compare(data = mtcars2, vrb.nm = c("am","gear_dum","carb_dum"),
   nom.nm = "cyl")
# single variable
props_compare(mtcars2, vrb.nm = "am", nom.nm = "cyl")
# rtn.table = FALSE (no "count" or "percent" list elements)
# multiple variables
props_compare(data = mtcars2, vrb.nm = c("am","gear_dum","carb_dum"),
   nom.nm = "cyl", rtn.table = FALSE)
# single variable
props_compare(mtcars2, vrb.nm = "am", nom.nm = "cyl", rtn.table = FALSE)
# more than 3 groups
airquality2 <- airquality
airquality2$"Wind_dum" <- ifelse(airquality$"Wind" >= 10, yes = 1, no = 0)
airquality2$"Solar.R_dum" <- ifelse(airquality$"Solar.R" >= 100, yes = 1, no = 0)
props_compare(airquality2, vrb.nm = c("Wind_dum","Solar.R_dum"),
   nom.nm = "Month")
props_compare(airquality2, vrb.nm = "Wind_dum", nom.nm = "Month")
props_diff
tests the proportion difference of multiple variables
across two independent groups with chi-square tests of independence. The
function also calculates the descriptive statistics for each group, various
standardized effect sizes (e.g., Cramer's V), and can provide the 2x2
contingency tables. props_diff
is simply a wrapper for
prop.test
plus some extra calculations.
props_diff(data, vrb.nm, bin.nm, lvl = levels(as.factor(data[[bin.nm]])), yates = TRUE, zero.cell = 0.05, smooth = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector specifying the colnames in |
bin.nm |
character vector of length 1 specifying the colname in |
lvl |
character vector with length 2 specifying the unique values for
the two groups. If |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
zero.cell |
numeric vector of length 1 specifying what value to impute
for zero cell counts in the 2x2 contingency table when computing the
tetrachoric correlations. See |
smooth |
logical vector of length 1 specifying whether a smoothing
algorithm should be applied when estimating the tetrachoric correlations.
See |
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the 2x2 contingency table of counts with totals and the 2x2 overall percentages table. If TRUE, then the last two elements of the return object are "count" containing a 3D array of counts and "percent" containing a 3D array of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if
|
list of data.frames containing statistical information about the proportion
differences (the rownames of each data.frame are vrb.nm
): 1)
chisqtest = chi-square tests of independence stat info in a data.frame, 2)
describes = descriptive statistics stat info in a data.frame, 3) effects =
various standardized effect sizes in a data.frame, 4) count = numeric 3D
array with dim = [3, 3, length(vrb.nm)]
of the 2x2 contingency
tables of counts with additional rows and columns for totals (if
rtn.table
= TRUE), 5) percent = numeric 3D array with dim =
[3, 3, length(vrb.nm)]
of the 2x2 contingency tables of overall
percentages with additional rows and columns for totals (if
rtn.table
= TRUE).
1) chisqtest = chi-square tests of independence stat info in a data.frame
proportion difference estimate (i.e., group 2 - group 1)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (will always be 1)
two-sided p-value
lower bound of the confidence interval
upper bound of the confidence interval
2) describes = descriptive statistics stat info in a data.frame
proportion of group 2
proportion of group 1
standard deviation of group 2
standard deviation of group 1
sample size of group 2
sample size of group 1
3) effects = various standardized effect sizes in a data.frame
Cramer's V estimate
Cohen's h estimate
Phi coefficient estimate
Yule coefficient estimate
Tetrachoric correlation estimate
odds ratio estimate
risk ratio estimate (i.e., group 2 / group 1). Note this value will often differ when recoding variables (as it should).
4) count = numeric 3D array with dim = [3, 3, length(vrb.nm)]
of the
2x2 contingency tables of counts with additional rows and columns for totals
(if rtn.table
= TRUE).
The two unique observed values of data[vrb.nm]
(i.e., 0 and 1) -
plus the total - are the rows and the two unique observed values of
data[[bin.nm]]
- plus the total - are the columns. The variables
themselves are the layers (i.e., the 3rd dimension of the array). The dimlabels
are "x" for the rows, "bin" for the columns, and "vrb" for the layers. The
rownames are 1. "0", 2. "1", 3. "total". The colnames are 1. 'lvl[1]', 2.
'lvl[2]', 3. "total". The laynames are vrb.nm
.
5) percent = numeric 3D array with dim = [3, 3, length(vrb.nm)]
of the
2x2 contingency tables of overall percentages with additional rows and
columns for totals (if rtn.table
= TRUE).
The two unique observed values of data[vrb.nm]
(i.e., 0 and 1) -
plus the total - are the rows and the two unique observed values of
data[[bin.nm]]
- plus the total - are the columns. The variables
themselves are the layers (i.e., the 3rd dimension of the array). The dimlabels
are "x" for the rows, "bin" for the columns, and "vrb" for the layers. The
rownames are 1. "0", 2. "1", 3. "total". The colnames are 1. 'lvl[1]', 2.
'lvl[2]', 3. "total". The laynames are vrb.nm
.
prop.test
the workhorse for props_diff
,
prop_diff
for a single dummy variable,
phi
for another phi coefficient function
Yule
for another yule coefficient function
tetrachoric
for another tetrachoric coefficient function
# rtn.table = TRUE (default)
# multiple variables
mtcars2 <- mtcars
mtcars2$"vs_bin" <- ifelse(mtcars$"vs" == 1, yes = "yes", no = "no")
mtcars2$"gear_dum" <- ifelse(mtcars2$"gear" > 3, yes = 1L, no = 0L)
mtcars2$"carb_dum" <- ifelse(mtcars2$"carb" > 3, yes = 1L, no = 0L)
vrb_nm <- c("am","gear_dum","carb_dum") # dummy variables
lapply(X = vrb_nm, FUN = function(nm) {
   tmp <- c("vs_bin", nm)
   table(mtcars2[tmp])
})
props_diff(data = mtcars2, vrb.nm = c("am","gear_dum","carb_dum"),
   bin.nm = "vs_bin")
# single variable
props_diff(mtcars2, vrb.nm = "am", bin.nm = "vs_bin")
# rtn.table = FALSE (no "count" or "percent" list elements)
# multiple variables
props_diff(data = mtcars2, vrb.nm = c("am","gear_dum","carb_dum"),
   bin.nm = "vs", rtn.table = FALSE)
# single variable
props_diff(mtcars, vrb.nm = "am", bin.nm = "vs", rtn.table = FALSE)
props_test
tests for multiple sample proportion differences from
population proportions with chi-square tests of goodness of fit. The default
is that the goodness of fit is consistent with a population proportion Pi of
0.50. The function also calculates the descriptive statistics, various
standardized effect sizes (e.g., Cramer's V), and can provide the 1x2
contingency tables. props_test
is simply a wrapper for
prop.test
plus some extra calculations.
props_test(data, dum.nm, pi = 0.5, yates = TRUE, ci.level = 0.95, rtn.table = TRUE, check = TRUE)
data |
data.frame of data. |
dum.nm |
character vector specifying the colnames in
|
pi |
numeric vector of length = |
yates |
logical vector of length 1 specifying whether the Yates'
continuity correction should be applied for small samples. See
|
ci.level |
numeric vector of length 1 specifying the confidence level.
|
rtn.table |
logical vector of length 1 specifying whether the return object should include the rbinded 1x2 contingency table of counts with totals and the rbinded 1x2 overall percentages table. If TRUE, then the last two elements of the return object are "count" containing a data.frame of counts and "percent" containing a data.frame of overall percentages. |
check |
logical vector of length 1 specifying whether the input
arguments should be checked for errors. For example, if |
list of data.frames containing statistical information about the
proportion differences from pi: 1) nhst = chi-square test of goodness of fit
stat info in a data.frame, 2) desc = descriptive statistics stat info in a
data.frame, 3) std = various standardized effect sizes in a data.frame,
4) count = data.frame containing the rbinded 1x2 tables of counts with an additional
column for the total (if rtn.table
= TRUE), 5) percent = data.frame
containing the rbinded 1x2 tables of overall percentages with an additional
column for the total (if rtn.table
= TRUE)
1) nhst = chi-square test of goodness of fit stat info in a data.frame
proportion difference estimate (i.e., sample proportion - pi)
NA (to remind the user there is no standard error for the test)
chi-square value
degrees of freedom (will always be 1)
two-sided p-value
2) desc = descriptive statistics stat info in a data.frame
sample proportion
population proportion provided by the user (or 0.50 by default)
standard deviation
sample size
lower bound of the confidence interval of the sample proportion itself
upper bound of the confidence interval of the sample proportion itself
3) std = various standardized effect sizes in a data.frame
Cramer's V estimate
Cohen's h estimate
4) count = data.frame containing the rbinded 1x2 tables of counts with an additional
column for the total (if rtn.table
= TRUE). The colnames are 1.
"0", 2. "1", 3. "total"
5) percent = data.frame containing the rbinded 1x2 tables of overall percentages
with an additional column for the total (if rtn.table
= TRUE). The
colnames are 1. "0", 2. "1", 3. "total"
prop.test
the workhorse for props_test
,
prop_test
for a single dummy variable,
props_diff
for chi-square tests of independence,
# multiple variables
mtcars2 <- mtcars
mtcars2$"gear_dum" <- ifelse(mtcars2$"gear" > 3, yes = 1L, no = 0L)
mtcars2$"carb_dum" <- ifelse(mtcars2$"carb" > 3, yes = 1L, no = 0L)
vrb_nm <- c("am","gear_dum","carb_dum") # dummy variables
lapply(X = vrb_nm, FUN = function(nm) {
   table(mtcars2[nm])
})
props_test(data = mtcars2, dum.nm = c("am","gear_dum","carb_dum"))
props_test(data = mtcars2, dum.nm = c("am","gear_dum","carb_dum"),
   rtn.table = FALSE)
# single variable
props_test(data = mtcars2, dum.nm = "am")
props_test(data = mtcars2, dum.nm = "am", rtn.table = FALSE)
# error from non-dummy variables
## Not run:
props_test(data = mtcars2, dum.nm = c("am","gear","carb"))
## End(Not run)
recode2other
recodes multiple unique values in a character vector to
the same new value (e.g., "other", NA_character_). Its primary use is to
recode based on the minimum frequency of the unique values so that low
frequency values can be combined into the same category; however, it also
allows for recoding particular unique values given by the user (see details).
This function is a wrapper for car::recode
, which can handle general
recoding of character vectors.
recode2other(x, freq.min, prop = FALSE, inclusive = TRUE, other.nm = "other", extra.nm = NULL)
x |
character vector. If not a character vector, it will be coerced to
one via |
freq.min |
numeric vector of length 1 specifying the minimum frequency of a unique value to keep it unchanged and consequentially recode any unique values with frequencies less than (or equal to) it. |
prop |
logical vector of length 1 specifying if |
inclusive |
logical vector of length 1 specifying whether the frequency
of a unique value exactly equal to |
other.nm |
character vector of length 1 specifying what value the other unique values should be recoded to. This can be NA_character_ resulting in recoding to a missing value. |
extra.nm |
character vector specifying extra unique values that should
be recoded to |
The extra.nm
argument allows for recode2other
to be used as a
simpler function that just recodes particular unique values to the same new
value (although arguably this is easier to do using car::recode
directly). To do so set freq.min = 0
and provide the unique values to
extra.nm
. Note that the current version of this function does not
allow for NA_character_ to be included in extra.nm
as it will end up
treating it as "NA" (see examples).
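A minimal sketch of the frequency-based recoding, ignoring the prop and extra.nm options (an illustrative equivalent, not the package's exact code):

x <- as.character(state.region)
tab <- table(x)
low_freq <- names(tab)[tab < 13] # strictly below freq.min (the default inclusive = TRUE keeps exact ties)
x_recoded <- replace(x, x %in% low_freq, "other")
table(x_recoded)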
character vector of the same length as x
with unique values
with frequency less than freq.min
recoded to other.nm
as well
as any unique values in extra.nm
. While the current version of the
function allows for recoding *to* NA values via other.nm
, it does
not allow for recoding *from* NA values via extra.nm
(see examples).
# based on minimum frequency of the unique values
state_region <- as.character(state.region)
recode2other(state_region, freq.min = 13) # freq.min as a count
recode2other(state_region, freq.min = 0.26, prop = TRUE) # freq.min as a proportion
recode2other(state_region, freq.min = 13, other.nm = "_blank_")
recode2other(state_region, freq.min = 13, other.nm = NA) # allows for other.nm to be NA
recode2other(state_region, freq.min = 13,
   extra.nm = "South") # add an extra unique value to recode
recode2other(state_region, freq.min = 13,
   inclusive = FALSE) # recodes "West" to "other"
# based on user given unique values
recode2other(state_region, freq.min = 0,
   extra.nm = c("South","West")) # recodes manually rather than by freq.min
# current version does NOT allow for NA to be a unique value that is converted to other
state_region2 <- c(NA, state_region, NA)
recode2other(state_region2, freq.min = 13) # NA remains in the character vector
recode2other(state_region2, freq.min = 0,
   extra.nm = c("South","West",NA)) # NA remains in the character vector
recodes
recodes data based on specified recodes using the
car::recode
function. This can be used for numeric or character
(including factors) data. See recode
for details. The
levels
argument from car::recode
is excluded because there is
no easy way to vectorize it when only a subset of the variables are factors.
recodes(data, vrb.nm, recodes, suffix = "_r", as.factor, as.numeric = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
recodes |
character vector of length 1 specifying the recodes. See
details of |
suffix |
character vector of length 1 specifying the string to add to the end of the colnames in the return object. |
as.factor |
logical vector of length 1 specifying if the recoded columns
should be returned as factors. The default depends on the column in
|
as.numeric |
logical vector of length 1 specifying if the recoded
columns should be returned as numeric vectors when possible. This can be
useful when having character vectors converted to numeric, such that
numbers with typeof character (e.g., "1") will be coerced to typeof numeric
(e.g., 1). Note, this argument has no effect on columns in
|
data.frame of recoded variables with colnames specified by
paste0(vrb.nm, suffix)
. In general, the columns of the data.frame
are the same typeof as those in data
except for instances when
as.factor
and/or as.numeric
change the typeof.
recodes(data = psych::bfi, vrb.nm = c("A1","C4","C5","E1","E2","O2","O5"),
   recodes = "1=6; 2=5; 3=4; 4=3; 5=2; 6=1")
re_codes <- "'Quebec' = 'canada'; 'Mississippi' = 'usa'; 'nonchilled' = 'no'; 'chilled' = 'yes'"
recodes(data = CO2, vrb.nm = c("Type","Treatment"), recodes = re_codes,
   as.factor = FALSE) # convert from factors to characters
renames
renames columns in a data.frame from a codebook. The codebook is
assumed to be a list of data.frames containing the old and new column names.
See details for how the codebook should be structured. The idea is that the
codebook has been imported as an Excel workbook with different sets of column
renaming information in different workbook sheets. This function is simply a wrapper
for plyr::rename
.
renames(data, codebook, old = 1L, new = 2L, warn_missing = TRUE, warn_duplicated = TRUE)
data |
data.frame of data. |
codebook |
list of data.frames containing the old and new column names. |
old |
numeric vector or character vector of length 1 specifying the
position or name of the column in the |
new |
numeric vector or character vector of length 1 specifying the
position or name of the column in the |
warn_missing |
logical vector of length 1 specifying whether |
warn_duplicated |
logical vector of length 1 specifying whether |
codebook
is a list of data.frames where one column refers to the old names
and another column refers to the new names. Therefore, each row of the data.frames
refers to a column in data
. The position or names of the columns in the
codebook
data.frames that contain the old (i.e., old
) and new
(i.e., new
) data
columns must be the same for each data.frame in
codebook
.
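A minimal sketch of that codebook logic (renames_sketch and the rbind-based flattening are illustrative assumptions, not the package's internals, which rely on plyr::rename):

renames_sketch <- function(data, codebook, old = 1L, new = 2L) {
   map <- do.call(rbind, lapply(codebook, function(d)
      data.frame(old = d[[old]], new = d[[new]]))) # stack the codebook sheets
   idx <- match(names(data), map$old) # find each column in the codebook
   names(data)[!is.na(idx)] <- map$new[idx[!is.na(idx)]] # rename the matches
   data
}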
data.frame identical to data
except that the old names in
codebook
have been replaced by the new names in codebook
.
code_book <- list(
   data.frame("old" = c("rating","complaints"), "new" = c("RATING","COMPLAINTS")),
   data.frame("old" = c("privileges","learning"), "new" = c("PRIVILEGES","LEARNING"))
)
renames(data = attitude, codebook = code_book, old = "old", new = "new")
reorders
re-orders the levels of factor data. The factors are columns
in a data.frame where the same reordering scheme is desired. This is often
useful before using factor data in a statistical analysis (e.g., lm
)
or a graph (e.g., ggplot
). It is essentially a vectorized version of
reorder.default
.
reorders(data, fct.nm, ord.nm = NULL, fun, ..., suffix = "_r")
data |
data.frame of data. |
fct.nm |
character vector of colnames in |
ord.nm |
character vector of length 1 or |
fun |
function that will be used to re-order the factor columns. The
function is expected to input an atomic vector of length =
|
... |
additional named arguments used by |
suffix |
character vector of length 1 specifying the string that will be appended to the end of the colnames in the return object. |
data.frame of re-ordered factor columns with colnames =
paste0(fct.nm, suffix)
.
# factor vector
reorder(x = state.region, X = state.region,
   FUN = length) # least frequent to most frequent
reorder(x = state.region, X = state.region,
   FUN = function(vec) {-1 * length(vec)}) # most frequent to least frequent
# data.frame of factors
infert_fct <- infert
fct_nm <- c("education","parity","induced","case","spontaneous")
infert_fct[fct_nm] <- lapply(X = infert[fct_nm], FUN = as.factor)
x <- reorders(data = infert_fct, fct.nm = fct_nm,
   fun = length) # least frequent to most frequent
lapply(X = x, FUN = levels)
y <- reorders(data = infert_fct, fct.nm = fct_nm,
   fun = function(vec) {-1 * length(vec)}) # most frequent to least frequent
lapply(X = y, FUN = levels)
# ord.nm specified as a different column in data.frame
z <- reorders(data = infert_fct, fct.nm = fct_nm, ord.nm = "pooled.stratum",
   fun = mean) # category with highest mean for pooled.stratum to
# category with lowest mean for pooled.stratum
lapply(X = z, FUN = levels)
revalid
recodes invalid data to specified values. For example,
sometimes invalid values are present in a vector of data (e.g., age = -1).
This function allows you to specify which values are possible and will then
recode any impossible values to undefined
. This function is a useful
wrapper for the function car::recode
, tailored for the specific use of
recoding invalid values.
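A minimal sketch of the recoding rule (an illustrative equivalent, not the package's exact code):

x <- attitude[[1]]
x_revalid <- replace(x, !(x %in% 25:75), NA) # anything outside 25:75 becomes NA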
revalid(x, valid, undefined = NA)
x |
atomic vector. |
valid |
atomic vector of valid values for |
undefined |
atomic vector of length 1 specifying what the invalid values should be recoded to. |
atomic vector with the same typeof as x
where any values not
present in valid
have been recoded to undefined
.
revalids
valid_test
valids_test
revalid(x = attitude[[1]], valid = 25:75, undefined = NA) # numeric vector
revalid(x = as.character(ToothGrowth[["supp"]]), valid = c('VC'),
   undefined = NA) # character vector
revalid(x = ToothGrowth[["supp"]], valid = c('VC'), undefined = NA) # factor
revalids
recodes invalid data to specified values. For example,
sometimes invalid values are present in a vector of data (e.g., age = -1).
This function allows you to specify which values are possible and will then
recode any impossible values to undefined
. revalids
is simply a
vectorized version of revalid
to more easily revalid multiple columns
of a data.frame at the same time.
revalids(data, vrb.nm, valid, undefined = NA, suffix = "_v")
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
valid |
atomic vector of valid values for the data. Note, the valid values must be the same for each variable. |
undefined |
atomic vector of length 1 specifying what the invalid values should be recoded to. |
suffix |
character vector of length 1 specifying the string to add to the end of the colnames in the return object. |
data.frame of recoded variables where any values not present in
valid
have been recoded to undefined
with colnames specified
by paste0(vrb.nm, suffix)
.
revalid
valids_test
valid_test
revalids(data = attitude, vrb.nm = names(attitude), valid = 25:75) # numeric data
revalids(data = as.data.frame(CO2), vrb.nm = c("Type","Treatment"),
   valid = c('Quebec','nonchilled')) # factors
reverse
reverse codes a numeric vector based on minimum and maximum
values. For example, say numerical values of response options can range from
1 to 4. The function will change 1 to 4, 2 to 3, 3 to 2, and 4 to 1. If there
are an odd number of response options, the middle in the sequence will be
unchanged.
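Given the documented mapping, reverse coding amounts to the arithmetic (mini + maxi) - x; a minimal sketch assuming numeric response options:

x <- c(1, 2, 3, 4, NA)
(1 + 4) - x # 4 3 2 1 NA; for an odd number of options, the middle value maps to itself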
reverse(x, mini, maxi)
x |
numeric vector. |
mini |
numeric vector of length 1 specifying the minimum numeric value. |
maxi |
numeric vector of length 1 specifying the maximum numeric value. |
numeric vector that correlates exactly -1 with x
.
x <- psych::bfi[[1]]
head(x, n = 15)
y <- reverse(x = psych::bfi[[1]], mini = 1, maxi = 6)
head(y, n = 15)
cor(x, y, use = "complete.obs")
reverses
reverse codes numeric data based on minimum and maximum
values. For example, say numerical values of response options can range from
1 to 4. The function will change 1 to 4, 2 to 3, 3 to 2, and 4 to 1. If there
are an odd number of response options, the middle in the sequence will be
unchanged.
reverses(data, vrb.nm, mini, maxi, suffix = "_r")
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
mini |
numeric vector of length 1 specifying the minimum numeric value. |
maxi |
numeric vector of length 1 specifying the maximum numeric value. |
suffix |
character vector of length 1 specifying the string to add to the end of the colnames in the return object. |
reverses
is simply a vectorized version of reverse
to more
easily reverse code multiple columns of a data.frame at the same time.
data.frame of reverse coded variables with colnames specified by
paste0(vrb.nm, suffix)
.
tmp <- !(is.element(el = names(psych::bfi),
   set = c("gender","education","age")))
vrb_nm <- names(psych::bfi)[tmp]
reverses(data = psych::bfi, vrb.nm = vrb_nm, mini = 1, maxi = 6)
rowMeans_if
calculates the mean of every row in a numeric or logical
matrix conditional on the frequency of observed data. If the frequency of
observed values in that row is less than (or equal to) that specified by
ov.min
, then NA is returned for that row.
rowMeans_if(x, ov.min = 1, prop = TRUE, inclusive = TRUE)
x |
numeric or logical matrix. If not a matrix, it will be coerced to one. |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the mean
should be calculated if the frequency of observed values in a row is
exactly equal to |
Conceptually this function does: apply(X = x, MARGIN = 1, FUN =
mean_if, ov.min = ov.min, prop = prop, inclusive = inclusive)
. But for
computational efficiency it does not, because the observed-values
conditioning would then not be vectorized. Instead, it uses rowMeans
and then inserts NAs for rows that have too few observed values.
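A minimal sketch of that vectorized strategy (a hypothetical reimplementation for illustration, not the package source), assuming ov.min is a proportion and inclusive = TRUE:

x <- as.matrix(airquality)
ov.min <- 0.75
out <- rowMeans(x, na.rm = TRUE) # vectorized means ignoring NAs
out[rowNA(x, prop = TRUE, ov = TRUE) < ov.min] <- NA # too few observed values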
numeric vector of length = nrow(x)
with names =
rownames(x)
providing the mean of each row or NA depending on the
frequency of observed values.
rowSums_if
colMeans_if
colSums_if
rowMeans
rowMeans_if(airquality)
rowMeans_if(x = airquality, ov.min = 5, prop = FALSE)
rowNA
compute the frequency of missing values in a matrix by row. This
function essentially does apply(X = x, MARGIN = 1, FUN = vecNA)
. It is
also used by other functions in the quest package related to missing values
(e.g., rowMeans_if
).
rowNA(x, prop = FALSE, ov = FALSE)
x |
matrix with any typeof. If not a matrix, it will be coerced to a
matrix via |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
numeric vector of length = nrow(x)
, and names =
rownames(x)
, providing the frequency of missing values (or observed
values if ov
= TRUE) per row. If prop
= TRUE, the
values will range from 0 to 1. If prop
= FALSE, the values will
range from 0 to ncol(x)
.
rowNA(as.matrix(airquality)) # count of missing values
rowNA(as.data.frame(airquality)) # with rownames
rowNA(as.matrix(airquality), prop = TRUE) # proportion of missing values
rowNA(as.matrix(airquality), ov = TRUE) # count of observed values
rowNA(as.data.frame(airquality), prop = TRUE, ov = TRUE) # proportion of observed values
rowsNA
computes the frequency of missing values for multiple sets of
columns from a data.frame. The arguments prop
and ov
allow the
user to specify if they want to sum or mean the missing values as well as
compute the frequency of observed values rather than missing values. This
function is essentially a vectorized version of rowNA
that inputs and
outputs a data.frame.
rowsNA(data, vrb.nm.list, prop = FALSE, ov = FALSE)
data |
data.frame of data. |
vrb.nm.list |
list where each element is a character vector of colnames
in |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
data.frame with the frequency of missing values (or observed values
if ov
= TRUE) for each set of variables. The names are specified by
names(vrb.nm.list)
; if vrb.nm.list
does not have any names,
then the first element from vrb.nm.list[[i]]
is used.
vrb_list <- lapply(X = c("O","C","E","A","N"), FUN = function(chr) {
   tmp <- grepl(pattern = chr, x = names(psych::bfi))
   names(psych::bfi)[tmp]
})
rowsNA(data = psych::bfi,
   vrb.nm.list = vrb_list) # names set to first elements in `vrb.nm.list`[[i]]
names(vrb_list) <- paste0(c("O","C","E","A","N"), "_m")
rowsNA(data = psych::bfi,
   vrb.nm.list = vrb_list) # names set to names(`vrb.nm.list`)
rowSums_if
calculates the sum of every row in a numeric or logical
matrix conditional on the frequency of observed data. If the frequency of
observed values in that row is less than (or equal to) that specified by
ov.min
, then NA is returned for that row. It also has the option to
return a value other than 0 (e.g., NA) when an entire row is NA, which differs
from rowSums(x, na.rm = TRUE)
.
rowSums_if(
  x,
  ov.min = 1,
  prop = TRUE,
  inclusive = TRUE,
  impute = TRUE,
  allNA = NA_real_
)
x |
numeric or logical matrix. If not a matrix, it will be coerced to one. |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the sum should
be calculated if the frequency of observed values in a row is exactly equal
to |
impute |
logical vector of length 1 specifying if missing values should
be imputed with the mean of observed values of |
allNA |
numeric vector of length 1 specifying what value should be
returned for rows that are all NA. This is most applicable when
|
Conceptually this function is doing: apply(X = x, MARGIN = 1, FUN =
sum_if, ov.min = ov.min, prop = prop, inclusive = inclusive)
. But for
computational efficiency it does not, because the observed-values
conditioning would then not be vectorized. Instead, it uses rowSums
and then inserts NAs for rows that have too few observed values.
numeric vector of length = nrow(x)
with names =
rownames(x)
providing the sum of each row or NA (or allNA
)
depending on the frequency of observed values.
rowMeans_if
colSums_if
colMeans_if
rowSums
rowSums_if(airquality)
rowSums_if(x = airquality, ov.min = 5, prop = FALSE)
x <- data.frame("x" = c(1, 1, NA), "y" = c(2, NA, NA), "z" = c(NA, NA, NA))
rowSums_if(x)
rowSums_if(x, ov.min = 0)
rowSums_if(x, ov.min = 0, allNA = 0)
identical(x = rowSums(x, na.rm = TRUE),
   y = unname(rowSums_if(x, impute = FALSE, ov.min = 0,
      allNA = 0))) # identical to rowSums(x, na.rm = TRUE)
score
calculates observed unweighted scores across a set of variables/items.
If a row's frequency of observed data is less than (or equal to)
ov.min
, then NA is returned for that row. data[vrb.nm]
is
coerced to a matrix before scoring. If the coercion leads to a character
matrix, an error is returned.
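With the defaults (avg = TRUE and no standardization), score() should reduce to a conditional row mean; a hedged sanity check under that assumption:

nm <- c("complaints","privileges","learning","raises")
all.equal(score(data = attitude, vrb.nm = nm),
   rowMeans_if(attitude[nm])) # expected TRUE if the defaults line up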
score(
  data,
  vrb.nm,
  avg = TRUE,
  ov.min = 1,
  prop = TRUE,
  inclusive = TRUE,
  impute = TRUE,
  std = FALSE,
  std.data = std,
  std.score = std
)
data |
data.frame or numeric/logical matrix |
vrb.nm |
character vector of colnames in |
avg |
logical vector of length 1 specifying whether mean scores (TRUE) or sum scores (FALSE) should be created. |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the score
should be calculated (rather than NA) if the frequency of observed values
in a row is exactly equal to |
impute |
logical vector of length 1 specifying if missing values should
be imputed with the mean of observed values from each row of
|
std |
logical vector of length 1 specifying whether 1)
|
std.data |
logical vector of length 1 specifying whether
|
std.score |
logical vector of length 1 specifying whether the score should be standardized after creation. |
numeric vector of the mean/sum of each row or NA
if the
frequency of observed values is less than (or equal to) ov.min
. The
names are the rownames of data
.
scores
rowMeans_if
rowSums_if
scoreItems
score(data = attitude,
   vrb.nm = c("complaints","privileges","learning","raises"))
score(data = attitude,
   vrb.nm = c("complaints","privileges","learning","raises"),
   std = TRUE) # standardized scoring
score(data = airquality, vrb.nm = c("Ozone","Solar.R","Temp"),
   ov.min = 0.75) # conditional on observed values
scores
calculates observed unweighted scores across multiple sets of
variables/items. If a row's frequency of observed data is less than (or equal
to) ov.min
, then NA is returned for that row. Each set of
variables/items are coerced to a matrix before scoring. If the coercion leads
to a character matrix, an error is returned. This can be tested with
lapply(X = vrb.nm.list, FUN = function(nm)
is.character(as.matrix(data[nm])))
.
scores(
  data,
  vrb.nm.list,
  avg = TRUE,
  ov.min = 1,
  prop = TRUE,
  inclusive = TRUE,
  impute = TRUE,
  std = FALSE,
  std.data = std,
  std.score = std
)
data |
data.frame or numeric/logical matrix |
vrb.nm.list |
list where each element is a character vector of colnames
in |
avg |
logical vector of length 1 specifying whether mean scores (TRUE) or sum scores (FALSE) should be created. |
ov.min |
minimum frequency of observed values required per row. If
|
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the scores
should be calculated (rather than NA) if the frequency of observed values
in a row is exactly equal to |
impute |
logical vector of length 1 specifying if missing values should
be imputed with the mean of observed values from each row of
|
std |
logical vector of length 1 specifying whether 1) the variables
should be standardized before scoring and 2) the score standardized after
creation. This argument is for convenience as these two standardization
processes are often used together. However, this argument will be
overwritten by any non-default value for |
std.data |
logical vector of length 1 specifying whether the variables/items should be standardized before scoring. |
std.score |
logical vector of length 1 specifying whether the scores should be standardized after creation. |
data.frame of mean/sum scores with NA
for any row with the
frequency of observed values less than (or equal to) ov.min
. The
colnames are specified by names(vrb.nm.list)
and rownames by
row.names(data)
.
score
rowMeans_if
rowSums_if
scoreItems
list_colnames <- list("first" = c("rating","complaints","privileges"),
   "second" = c("learning","raises","critical"))
scores(data = attitude, vrb.nm.list = list_colnames)
list_colnames <- list("first" = c("Ozone","Wind"),
   "second" = c("Solar.R","Temp"))
scores(data = airquality, vrb.nm.list = list_colnames, ov.min = .50,
   inclusive = FALSE) # scoring conditional on observed values
shift
shifts elements of a vector right (n
< 0) for lags or
left (n
> 0) for leads replacing the undefined data with a
user-defined value (e.g., NA). The number of elements shifted is equal to
abs(n)
. It is assumed that x
is already sorted by time such
that the first element is earliest in time and the last element is the latest
in time.
shift(x, n, undefined = NA)
x |
atomic vector or list vector. |
n |
integer vector with length 1. Specifies the direction and magnitude of the shift. See details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as |
If n
is negative, then shift
inserts undefined
into the
first abs(n)
elements of x
, shifting all other values of
x
to the right abs(n)
positions, and then dropping the last
abs(n)
elements of x
to preserve the original length of
x
. If n
is positive, then shift
drops the first
abs(n)
elements of x
, shifting all other values of x
left abs(n)
positions, and then inserts undefined
into the last
abs(n)
elements of x
to preserve the original length of
x
. If n
is zero, then shift
simply returns x
.
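Working through that description on a tiny vector (expected output in the comments is derived from the details above):

shift(x = 1:5, n = -2L) # NA NA 1 2 3 (lag: insert NAs at the front, drop the tail)
shift(x = 1:5, n = +2L) # 3 4 5 NA NA (lead: drop the front, insert NAs at the end)
shift(x = 1:5, n = 0L) # 1 2 3 4 5 (returned unchanged)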
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shift
tries to circumvent this
issue by a call to round
within shift
if n
is not an
integer; however, that is not a complete fail-safe. The problem is that
as.integer(n)
implicit in shift
truncates rather than rounds.
an atomic vector of the same length as x
that is shifted. If
x
and undefined
are different typeofs, then the return will
be coerced to the more complex typeof (i.e., complex to simple: character,
double, integer, logical).
shift(x = attitude[[1]], n = -1L) # use L to prevent problems with floating point numbers
shift(x = attitude[[1]], n = -2L) # can specify any integer up to the length of `x`
shift(x = attitude[[1]], n = +1L) # can specify negative or positive integers
shift(x = attitude[[1]], n = +2L, undefined = -999) # user-specified undefined value
shift(x = setNames(object = letters, nm = LETTERS), n = 3L) # names are kept
shift_by
shifts elements of a vector right (n
< 0) for lags or
left (n
> 0) for leads by group, replacing the undefined data with a
user-defined value (e.g., NA). The number of elements shifted is equal to
abs(n)
. It is assumed that x
is already sorted within each
group by time such that the first element for that group is earliest in time
and the last element for that group is the latest in time.
shift_by(x, grp, n, undefined = NA)
x |
atomic vector or list vector. |
grp |
list of atomic vector(s) and/or factor(s) (e.g., data.frame),
which each have same length as |
n |
integer vector with length 1. Specifies the direction and magnitude of the shift. See details. |
undefined |
atomic vector with length 1 (probably makes sense to be the
same typeof as |
If n
is negative, then shift_by
inserts undefined
into the
first abs(n)
elements of x
for each group, shifting all other
values of x
to the right abs(n)
positions, and then dropping
the last abs(n)
elements of x
to preserve the original length
of each group. If n
is positive, then shift_by
drops the first
abs(n)
elements of x
for each group, shifting all other values
of x
left abs(n)
positions, and then inserts undefined
into the last abs(n)
elements of x
to preserve the original
length of each group. If n
is zero, then shift_by
simply returns
x
.
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shift_by
tries to circumvent this
issue by a call to round
within shift_by
if n
is not an
integer; however, that is not a complete fail-safe. The problem is that
as.integer(n)
implicit in shift_by
truncates rather than rounds.
an atomic vector of the same length as x
that is shifted by
group. If x
and undefined
are different typeofs, then the
return will be coerced to the most complex typeof (i.e., complex to simple:
character, double, integer, logical).
shift_by(x = ChickWeight[["Time"]], grp = ChickWeight[["Chick"]], n = -1L)
tmp_nm <- c("vs","am") # b/c Roxygen2 doesn't like c() in a []
shift_by(x = mtcars[["disp"]], grp = mtcars[tmp_nm], n = 1L)
tmp_nm <- c("Type","Treatment") # b/c Roxygen2 doesn't like c() in a []
shift_by(x = as.data.frame(CO2)[["uptake"]],
   grp = as.data.frame(CO2)[tmp_nm], n = 2L) # multiple grouping vectors
shifts
shifts rows of data down (n
< 0) for lags or up (n
> 0) for leads replacing the undefined data with a user-defined value (e.g.,
NA). The number of rows shifted is equal to abs(n)
. It is assumed that
data[vrb.nm]
is already sorted by time such that the first row is
earliest in time and the last row is the latest in time.
shifts(data, vrb.nm, n, undefined = NA, suffix)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
n |
integer vector of length 1. Specifies the direction and magnitude of the shift. See details. |
undefined |
atomic vector of length 1 (probably makes sense to be the
same typeof as the vectors in |
suffix |
character vector of length 1 specifying the string to append to
the end of the colnames of the return object. The default depends on the
|
If n
is negative, then shifts
inserts undefined
into the
first abs(n)
rows of data[vrb.nm]
, shifting all other rows of
x
down abs(n)
positions, and then dropping the last
abs(n)
row of data[vrb.nm]
to preserve the original nrow of
data
. If n
is positive, then shifts
drops the first
abs(n)
rows of x
, shifting all other rows of
data[vrb.nm]
up abs(n)
positions, and then inserts
undefined
into the last abs(n)
rows of x
to preserve the
original length of data
. If n
is zero, then shifts
simply
returns data[vrb.nm]
.
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shifts
tries to circumvent this
issue by a call to round
within shifts
if n
is not an
integer; however, that is not a complete fail-safe. The problem is that
as.integer(n)
implicit in shifts
truncates rather than rounds.
data.frame of shifted data with colnames specified by suffix
.
shifts(data = attitude, vrb.nm = colnames(attitude), n = -1L)
shifts(data = mtcars, vrb.nm = colnames(mtcars), n = 2L)
shifts_by
shifts rows of data down (n
< 0) for lags or up (n
> 0) for leads replacing the undefined data with a user-defined value (e.g.,
NA). The number of rows shifted is equal to abs(n)
. It is assumed that
data[vrb.nm]
is already sorted within each group by time such that the
first row for that group is earliest in time and the last row for that group
is the latest in time. The groups can be specified by multiple columns in
data
(e.g., grp.nm
with length > 1), and interaction
will be implicitly called to create the groups.
shifts_by(data, vrb.nm, grp.nm, n, undefined = NA, suffix)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
grp.nm |
character vector of colnames from |
n |
integer vector of length 1. Specifies the direction and magnitude of the shift. See details. |
undefined |
atomic vector of length 1 (probably makes sense to be the
same typeof as the vectors in |
suffix |
character vector of length 1 specifying the string to append to
the end of the colnames of the return object. The default depends on the
|
If n
is negative, then shifts_by
inserts undefined
into
the first abs(n)
rows of data[vrb.nm]
for each group, shifting
all other rows of x
down abs(n)
positions, and then dropping
the last abs(n)
row of data[vrb.nm]
to preserve the original
nrow of each group. If n
is positive, then shifts_by
drops the
first abs(n)
rows of x
for each group, shifting all other rows
of data[vrb.nm]
up abs(n)
positions, and then inserts
undefined
into the last abs(n)
rows of x
to preserve the
original length of each group. If n
is zero, then shifts_by
simply returns data[vrb.nm]
.
It is recommended to use L
when specifying n
to prevent
problems with floating point numbers. shifts_by
tries to circumvent
this issue by a call to round
within shifts_by
if n
is
not an integer; however, that is not a complete fail-safe. The problem is that
as.integer(n)
implicit in shifts_by
truncates rather than
rounds.
data.frame of shifted data by group with colnames specified by
suffix
.
shifts_by(data = ChickWeight, vrb.nm = c("weight","Time"), grp.nm = "Chick",
   n = -1L)
shifts_by(data = mtcars, vrb.nm = c("disp","mpg"), grp.nm = c("vs","am"),
   n = 1L)
shifts_by(data = as.data.frame(CO2), vrb.nm = c("conc","uptake"),
   grp.nm = c("Type","Treatment"), n = 2L) # multiple grouping columns
sum_if
calculates the sum of a numeric or logical vector conditional
on a specified minimum frequency of observed values. If the amount of
observed data is less than (or equal to) ov.min
, then NA
is
returned rather than the sum.
sum_if(x, impute = TRUE, ov.min = 1, prop = TRUE, inclusive = TRUE)
x |
numeric or logical vector. |
impute |
logical vector of length 1 specifying if missing values should
be imputed with the mean of observed values of |
ov.min |
minimum frequency of observed values required. If |
prop |
logical vector of length 1 specifying whether |
inclusive |
logical vector of length 1 specifying whether the sum should
be calculated (rather than NA) if the frequency of observed values is
exactly equal to |
numeric vector of length 1 providing the sum of x
or NA
depending on whether the frequency of observed data is greater than (or equal
to) ov.min
.
sum_if(x = airquality[[1]], ov.min = .75) # proportion of observed values
sum_if(x = airquality[[1]], ov.min = 116, prop = FALSE) # count of observed values
sum_if(x = airquality[[1]], ov.min = 116, prop = FALSE,
   inclusive = FALSE) # does not include the ov.min value itself
sum_if(x = c(TRUE, NA, FALSE, NA),
   ov.min = .50) # works with logical vectors as well as numeric
summary_ucfa
provides a summary of a unidimensional confirmatory
factor analysis on a set of variables/items. Unidimensional meaning a
one-factor model where all variables/items load on that factor. The function
is a wrapper for cfa
and returns a list with four
vectors/matrices: 1) model info, 2) fit measures, 3) factor loadings, 4)
covariance/correlation residuals. For details on all the
cfa
arguments see lavOptions
.
summary_ucfa(
  data,
  vrb.nm,
  std.ov = FALSE,
  std.lv = TRUE,
  ordered = FALSE,
  meanstructure = TRUE,
  estimator = "ML",
  se = "standard",
  test = "standard",
  missing = "fiml",
  fit.measures = c("chisq", "df", "tli", "cfi", "rmsea", "srmr"),
  std.load = TRUE,
  resid.type = "cor.bollen",
  add.class = TRUE,
  ...
)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
std.ov |
logical vector of length 1 specifying if the variables/items should be standardized |
std.lv |
logical vector of length 1 specifying if the latent factor
should be standardized resulting in all factor loadings being estimated. If
FALSE, then the first variable/item in |
ordered |
logical vector of length 1 specifying if the variables/items should be treated as ordered categorical items where polychoric correlations are used. |
meanstructure |
logical vector of length 1 specifying if the mean
structure of the factor model should be estimated. This would be the
variable/item intercepts (and latent factor mean if |
estimator |
character vector of length 1 specifying the estimator to use
for parameter estimation. Popular options are 1) "ML" = maximum likelihood
estimation based on the multivariate normal distribution, 2) "DWLS" =
diagonally weighted least squares, which uses the diagonal of the weight
matrix, 3) "WLS" for weighted least squares, which uses the full weight
matrix (often results in computational problems), 4) "ULS" for unweighted
least squares that doesn't use a weight matrix. "DWLS", "WLS", and "ULS"
can each be used with ordered categorical items when |
se |
character vector of length 1 specifying how standard errors should be calculated. Popular options are 1) "standard" for conventional standard errors from inverting the information matrix, 2) "robust.sem" for robust standard errors, 3) "robust.huber.white" for sandwich standard errors. |
test |
character vector of length 1 specifying how the omnibus test statistic should be calculated. Popular options are 1) "standard" for the conventional chi-square statistic, 2) "Satorra-Bentler" for the Satorra-Bentler test statistic, 3) "Yuan.Bentler.Mplus" for the version of the Yuan-Bentler test statistic that Mplus uses, 4) "mean.var.adjusted" for a mean and variance adjusted test statistic, 5) "scaled.shifted" for the version of the mean and variance adjusted test statistic Mplus uses. |
missing |
character vector of length 1 specifying how to handle missing data. Popular options are 1) "fiml" = Full Information Maximum Likelihood (FIML), 2) "pairwise" = pairwise deletion, 3) "listwise" = listwise deletion. |
fit.measures |
character vector specifying which model fit indices to
include in the return object. The default option includes the chi-square
test statistic ("chisq"), degrees of freedom ("df"), tucker-lewis index
("tli"), comparative fit index ("cfi"), root mean square error of
approximation ("rmsea"), and standardized root mean residual ("srmr").
Note, if using robust corrections for |
std.load |
logical vector of length 1 specifying whether the factor loadings included in the return object should be standardized (TRUE) or not (FALSE). |
resid.type |
character vector of length 1 specifying the type of covariance/correlation residuals to include in the return object. Popular options are 1) "raw" for conventional covariance residuals, 2) "cor.bollen" for conventional correlation residuals, 3) "cor.bentler" for correlation residuals that standardizes the model-implied covariance matrix with the observed variances, 4) "standardized" for conventional z-scores of the covariance residuals. |
add.class |
logical vector of length 1 specifying whether the lavaan classes should be added to the returned vectors/matrices (TRUE) or not (FALSE). These classes do not change the underlying vector/matrix and only affect printing. |
... |
any other named arguments available in the
|
list of vectors/matrices providing statistical information about
the unidimensional confirmatory factor analysis. If add.class
= TRUE,
then the elements have lavaan classes which affect printing (except for the
first "model_info" element which always is just an integer vector). The four
elements are:
integer vector providing model information. The first element "converged" is 1 if the model converged and 0 if not. The second element "admissible" is 1 if the model is admissible (e.g., no negative variances) and 0 if not. The third element "nobs" is the number of observations used in the analysis. The fourth element "npar" is the number of parameter estimates.
double vector providing model fit indices. The number
and names of the fit indices is determined by the fit.measures
argument.
1-column double matrix providing factor loadings. The colname
is "latent" and the rownames are the vrb.nm
argument.
covariance/correlation residuals for the model. Note, even
though the name has "cov" in it, the residuals can be "cor" if the argument
resid.type
= "cor.bollen" or "cor.bentler".
# types of models
dat <- psych::bfi[1:250, 16:20] # neuroticism items
summary_ucfa(data = dat, vrb.nm = names(dat)) # default
summary_ucfa(data = dat, vrb.nm = names(dat), estimator = "ML", # MLR
   se = "robust.huber.white", test = "yuan.bentler.mplus", missing = "fiml",
   fit.measures = c("chisq.scaled","df.scaled","tli.scaled","cfi.scaled",
      "rmsea.scaled","srmr"))
summary_ucfa(data = dat, vrb.nm = names(dat), estimator = "ML", # MLM
   se = "robust.sem", test = "satorra.bentler", missing = "listwise",
   fit.measures = c("chisq.scaled","df.scaled","tli.scaled","cfi.scaled",
      "rmsea.scaled","srmr"))
summary_ucfa(data = dat, vrb.nm = names(dat), ordered = TRUE,
   estimator = "DWLS", # WLSMV
   se = "robust", test = "scaled.shifted", missing = "listwise",
   fit.measures = c("chisq.scaled","df.scaled","tli.scaled","cfi.scaled",
      "rmsea.scaled","wrmr"))

# types of info
dat <- psych::bfi[1:250, 16:20] # neuroticism items
w <- summary_ucfa(data = dat, vrb.nm = names(dat))
x <- summary_ucfa(data = dat, vrb.nm = names(dat), add.class = FALSE)
y <- summary_ucfa(data = dat, vrb.nm = names(dat),
   std.load = FALSE, resid.type = "raw")
z <- summary_ucfa(data = dat, vrb.nm = names(dat),
   std.load = FALSE, resid.type = "raw", add.class = FALSE)
lapply(w, class)
lapply(x, class)
lapply(y, class)
lapply(z, class)
tapply2
applies a function to a (atomic) vector by group and is an
alternative to the base R function tapply
. The function is
part of the split-apply-combine family of functions discussed in the
plyr
R package and is somewhat similar to dlply
.
It splits up one (atomic) vector .x
into a (atomic) vector for each
group in .grp
, applies a function .fun
to each (atomic) vector,
and then returns the results as a list with names equal to the group values
unique(interaction(.grp, sep = .sep))
. tapply2
is simply
split.default
+ lapply
. Similar to dlply
, the arguments
all start with .
so that they do not conflict with arguments from the
function .fun
. If you want to apply a function to a data.frame rather
than a (atomic) vector, then use by2
.
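Because tapply2 is described as split.default + lapply, the two approaches below should agree (a hedged check; the list names may differ slightly):

x <- mtcars$"cyl"
g <- mtcars$"vs"
a <- tapply2(x, .grp = g, .fun = median, na.rm = TRUE)
b <- lapply(X = split(x, f = g), FUN = median, na.rm = TRUE)
all.equal(a, b)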
tapply2(.x, .grp, .sep = ".", .fun, ...)
.x |
atomic vector |
.grp |
list of atomic vector(s) and/or factor(s) (e.g., data.frame)
containing the groups. They should each have same length as |
.sep |
character vector of length 1 specifying the string to combine the
group values together with. |
.fun |
function to apply to |
... |
additional named arguments to pass to |
list of objects containing the return object of .fun
for each
group. The names are the unique combinations of the grouping variables
(i.e., unique(interaction(.grp, sep = .sep))
).
# one grouping variable
tapply2(mtcars$"cyl", .grp = mtcars$"vs", .fun = median, na.rm = TRUE)

# two grouping variables
grp_nm <- c("vs","am") # Roxygen runs the whole script if I put a c() in a []
x <- tapply2(mtcars$"cyl", .grp = mtcars[grp_nm], .fun = median, na.rm = TRUE)
print(x)
str(x)

# compare to tapply
grp_nm <- c("vs","am") # Roxygen runs the whole script if I put a c() in a []
y <- tapply(mtcars$"cyl", INDEX = mtcars[grp_nm],
   FUN = median, na.rm = TRUE, simplify = FALSE)
print(y)
str(y) # has dimnames rather than names
ucfa
conducts a unidimensional confirmatory factor analysis on a set
of variables/items. Unidimensional meaning a one-factor model where all
variables/items load on that factor. The function is a wrapper for
cfa
and returns an object of class "lavaan":
lavaan
. This then allows the user to extract
statistical information from the object (e.g.,
lavInspect
). For details on all the arguments see
lavOptions
.
ucfa(
  data,
  vrb.nm,
  std.ov = FALSE,
  std.lv = TRUE,
  ordered = FALSE,
  meanstructure = TRUE,
  estimator = "ML",
  se = "standard",
  test = "standard",
  missing = "fiml",
  ...
)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
std.ov |
logical vector of length 1 specifying if the variables/items should be standardized |
std.lv |
logical vector of length 1 specifying if the latent factor
should be standardized resulting in all factor loadings being estimated. If
FALSE, then the first variable/item in |
ordered |
logical vector of length 1 specifying if the variables/items should be treated as ordered categorical items where polychoric correlations are used. |
meanstructure |
logical vector of length 1 specifying if the mean
structure of the factor model should be estimated. This would be the
variable/item intercepts (and latent factor mean if |
estimator |
character vector of length 1 specifying the estimator to use
for parameter estimation. Popular options are 1) "ML" = maximum likelihood
estimation based on the multivariate normal distribution, 2) "DWLS" =
diagonally weighted least squares, which uses the diagonal of the weight
matrix, 3) "WLS" for weighted least squares, which uses the full weight
matrix (often results in computational problems), 4) "ULS" for unweighted
least squares that doesn't use a weight matrix. "DWLS", "WLS", and "ULS"
can each be used with ordered categorical items when |
se |
character vector of length 1 specifying how standard errors should be calculated. Popular options are 1) "standard" for conventional standard errors from inverting the information matrix, 2) "robust.sem" for robust standard errors, 3) "robust.huber.white" for sandwich standard errors. |
test |
character vector of length 1 specifying how the omnibus test statistic should be calculated. Popular options are 1) "standard" for the conventional chi-square statistic, 2) "Satorra-Bentler" for the Satorra-Bentler test statistic, 3) "Yuan.Bentler.Mplus" for the version of the Yuan-Bentler test statistic that Mplus uses, 4) "mean.var.adjusted" for a mean and variance adjusted test statistic, 5) "scaled.shifted" for the version of the mean and variance adjusted test statistic Mplus uses. |
missing |
character vector of length 1 specifying how to handle missing data. Popular options are 1) "fiml" = Full Information Maximum Likelihood (FIML), 2) "pairwise" = pairwise deletion, 3) "listwise" = listwise deletion. |
... |
any other named arguments available in the
|
object of class "lavaan" lavaan
providing the return object from a call to cfa
.
dat <- psych::bfi[1:250, 16:20] # neuroticism items
ucfa(data = dat, vrb.nm = names(dat))
ucfa(data = dat, vrb.nm = names(dat), std.ov = TRUE)
ucfa(data = dat, vrb.nm = names(dat), meanstructure = FALSE,
   missing = "pairwise")
ucfa(data = dat, vrb.nm = names(dat), estimator = "ML", # MLR
   se = "robust.huber.white", test = "yuan.bentler.mplus", missing = "fiml")
ucfa(data = dat, vrb.nm = names(dat), estimator = "ML", # MLM
   se = "robust.sem", test = "satorra.bentler", missing = "listwise")
ucfa(data = dat, vrb.nm = names(dat), ordered = TRUE, estimator = "DWLS", # WLSMV
   se = "robust", test = "scaled.shifted", missing = "listwise")
valid_test
tests whether a vector has any invalid elements. Valid
values are specified by valid
. If the vector x
has any values
other than valid
, then FALSE is returned; If the vector x
only
has values in valid
, then TRUE is returned. This function can be
useful for checking data after manual human entry.
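Conceptually, with na.rm = TRUE this amounts to a base R one-liner (a hypothetical equivalent for illustration, not the package source):

x <- psych::bfi[[1]]
all(x %in% 1:6 | is.na(x)) # TRUE: NAs are treated as valid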
valid_test(x, valid, na.rm = TRUE)
x |
atomic vector or list vector. |
valid |
atomic vector or list vector of valid values. |
na.rm |
logical vector of length 1 specifying whether NA should be ignored from the validity test. If TRUE (default), then any NAs are treated as valid. |
logical vector of length 1 specifying whether all elements in
x
are valid values. If FALSE, then (at least one) invalid values are
present.
valid_test(x = psych::bfi[[1]], valid = 1:6) # returns TRUE
valid_test(x = psych::bfi[[1]], valid = 0:5) # 6 is not present in `valid`
valid_test(x = psych::bfi[[1]], valid = 1:6,
   na.rm = FALSE) # NA is not present in `valid`
valids_test
tests whether data has any invalid elements. Valid values
are specified by valid
. Each variable is tested independently. If the
variable in data[vrb.nm]
has any values other than valid
, then
FALSE is returned for that variable; If the variable in data[vrb.nm]
only has values in valid
, then TRUE is returned for that variable.
valids_test(data, vrb.nm, valid, na.rm = TRUE)
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
valid |
atomic vector or list vector of valid values. |
na.rm |
logical vector of length 1 specifying whether NA should be ignored from the validity test. If TRUE (default), then any NAs are treated as valid. |
logical vector with length = length(vrb.nm)
and names =
vrb.nm
specifying whether all elements in each variable of
data[vrb.nm]
are valid. If FALSE, then (at least one) invalid values
are present in that variable of data[vrb.nm]
.
valids_test(data = psych::bfi, vrb.nm = names(psych::bfi)[1:25],
   valid = 1:6) # returns TRUE
valids_test(data = psych::bfi, vrb.nm = names(psych::bfi)[1:25],
   valid = 0:5) # 6 is not present in `valid`
valids_test(data = psych::bfi, vrb.nm = names(psych::bfi)[1:25],
   valid = 1:6, na.rm = FALSE) # NA is not present in `valid`
valids_test(data = ToothGrowth, vrb.nm = c("supp","dose"),
   valid = list("VC", "OJ", 0.5, 1.0, 2.0)) # list vector as `valid` to allow
   # for elements of different typeof
vecNA
computes the frequency of missing values in an atomic vector.
vecNA
is essentially a wrapper for sum
or mean
+
is.na
or !is.na
and can be useful for functional programming
(e.g., lapply(FUN = vecNA)
). It is also used by other functions in the
quest package related to missing values (e.g., mean_if
).
vecNA(x, prop = FALSE, ov = FALSE)
x |
atomic vector or list vector. If not a vector, it will be coerced to
a vector via |
prop |
logical vector of length 1 specifying whether the frequency of missing values should be returned as a proportion (TRUE) or a count (FALSE). |
ov |
logical vector of length 1 specifying whether the frequency of observed values (TRUE) should be returned rather than the frequency of missing values (FALSE). |
numeric vector of length 1 providing the frequency of missing values
(or observed values if ov
= TRUE). If prop
= TRUE, the value
will range from 0 to 1. If prop
= FALSE, the value will range from 0
to length(x)
.
vecNA(airquality[[1]]) # count of missing values
vecNA(airquality[[1]], prop = TRUE) # proportion of missing values
vecNA(airquality[[1]], ov = TRUE) # count of observed values
vecNA(airquality[[1]], prop = TRUE, ov = TRUE) # proportion of observed values
wide2long
reshapes data from wide to long. This is often necessary
with multilevel data, where multiple sets of variables in the wide format
need to be reshaped to multiple rows in the long format. If only one set of
variables needs to be reshaped, then you can use
stack2
or melt.data.frame
- but that
does not work for *multiple* sets of variables. See details for more
information.
wide2long(
  data,
  vrb.nm.list,
  grp.nm = NULL,
  sep = ".",
  rtn.obs.nm = "obs",
  order.by.grp = TRUE,
  keep.attr = FALSE
)
data |
data.frame of multilevel data in the wide format. |
vrb.nm.list |
A unique argument for the |
grp.nm |
character vector specifying the colnames in |
sep |
character vector of length 1 specifying the string in the column
names provided by |
rtn.obs.nm |
character vector of length 1 specifying the new colname in the return object indicating which observation within each group the row refers to. In longitudinal panel data, this would be the returned time variable. |
order.by.grp |
logical vector of length 1 specifying whether to sort the
return object first by |
keep.attr |
logical vector of length 1 specifying whether to keep the
"reshapeLong" attribute (from |
wide2long
uses reshape(direction = "long")
to reshape the data.
It attempts to streamline the task of reshaping wide to long as the
reshape
arguments can be confusing because the same arguments are used
for wide vs. long reshaping. See reshape
if you are
curious.
IF vrb.nm.list
IS A LIST OF CHARACTER VECTORS: The conventional use of
vrb.nm.list
is to provide a list of character vectors, which specify
each set of variables to be reshaped. For example, if data
contains
data from a longitudinal panel study with the same scores at different waves,
then there might be a column for each score at each wave. vrb.nm.list
would then contain an element for each score with each element containing a
character vector of the colnames for that score at each wave (see examples).
The names of the list elements would then be the colnames in the return
object for those scores.
IF vrb.nm.list
IS A CHARACTER VECTOR: The advanced use of
vrb.nm.list
is to provide a single character vector, which specify the
variables to be reshaped (not organized by sets). In this case (i.e., if
vrb.nm.list
is not a list), then wide2long
(really
reshape
) will attempt to guess which colnames go
together as a set. It is assumed the following column naming scheme has been
used: 1) have the same name prefix for columns within a set, 2) have the same
number suffixes for each set of columns, 3) use, *and only use*, sep
in the colnames to separate the name prefix and the number suffix. For
example, the name prefixes might be "predictor" and "outcome" while the
number suffixes might be "0", "1", and "2", and the separator might be ".",
resulting in column names such as "outcome.1". The name prefix could include
separators other than sep
(e.g., "outcome_item.1"), but it cannot
include sep
(e.g., "outcome.item.1"). So "outcome_item1.1" could be
acceptable, but "outcome.item1.1" would not.
data.frame with nrow equal to nrow(data) *
length(vrb.nm.list[[1]])
if vrb.nm.list
is a list (i.e.,
conventional use) or nrow(data)
* number of unique number suffixes
in vrb.nm.list
if vrb.nm.list
is not a list (i.e., advanced
use). The columns will be in the following order: 1) grp.nm
of the
groups, 2) rtn.obs.nm
of the observation labels, 3) the reshaped
columns, 4) the additional columns that were not reshaped and instead
repeated. How the returned data.frame is sorted depends on
order.by.grp
.
# SINGLE GROUPING VARIABLE
dat_wide <- data.frame(
   x_1.1 = runif(5L), x_2.1 = runif(5L), x_3.1 = runif(5L), x_4.1 = runif(5L),
   x_1.2 = runif(5L), x_2.2 = runif(5L), x_3.2 = runif(5L), x_4.2 = runif(5L),
   x_1.3 = runif(5L), x_2.3 = runif(5L), x_3.3 = runif(5L), x_4.3 = runif(5L),
   y_1.1 = runif(5L), y_2.1 = runif(5L),
   y_1.2 = runif(5L), y_2.2 = runif(5L),
   y_1.3 = runif(5L), y_2.3 = runif(5L))
row.names(dat_wide) <- letters[1:5]
print(dat_wide)

# vrb.nm.list = list of character vectors (conventional use)
vrb_pat <- c("x_1","x_2","x_3","x_4","y_1","y_2")
vrb_nm_list <- lapply(X = setNames(vrb_pat, nm = vrb_pat), FUN = function(pat) {
   str2str::pick(x = names(dat_wide), val = pat, pat = TRUE)})
# without `grp.nm`
z1 <- wide2long(dat_wide, vrb.nm = vrb_nm_list)
# with `grp.nm`
dat_wide$"ID" <- letters[1:5]
z2 <- wide2long(dat_wide, vrb.nm = vrb_nm_list, grp.nm = "ID")
dat_wide$"ID" <- NULL

# vrb.nm.list = character vector + guessing (advanced use)
vrb_nm <- str2str::pick(x = names(dat_wide), val = "ID", not = TRUE)
# without `grp.nm`
z3 <- wide2long(dat_wide, vrb.nm.list = vrb_nm)
# with `grp.nm`
dat_wide$"ID" <- letters[1:5]
z4 <- wide2long(dat_wide, vrb.nm = vrb_nm, grp.nm = "ID")
dat_wide$"ID" <- NULL

# comparisons
head(z1); head(z3); head(z2); head(z4)
all.equal(z1, z3)
all.equal(z2, z4)

# keeping the reshapeLong attributes
z7 <- wide2long(dat_wide, vrb.nm = vrb_nm_list, keep.attr = TRUE)
attributes(z7)

# MULTIPLE GROUPING VARIABLES
bfi2 <- psych::bfi
bfi2$"person" <- unlist(lapply(X = 1:400, FUN = rep.int, times = 7))
bfi2$"day" <- rep.int(1:7, times = 400L)
head(bfi2, n = 15)

# vrb.nm.list = list of character vectors (conventional use)
vrb_pat <- c("A","C","E","N","O")
vrb_nm_list <- lapply(X = setNames(vrb_pat, nm = vrb_pat), FUN = function(pat) {
   str2str::pick(x = names(bfi2), val = pat, pat = TRUE)})
z5 <- wide2long(bfi2, vrb.nm.list = vrb_nm_list, grp = c("person","day"),
   rtn.obs.nm = "item")

# vrb.nm.list = character vector + guessing (advanced use)
vrb_nm <- str2str::pick(x = names(bfi2),
   val = c("person","day","gender","education","age"), not = TRUE)
z6 <- wide2long(bfi2, vrb.nm.list = vrb_nm, grp = c("person","day"),
   sep = "", rtn.obs.nm = "item") # need sep = "" because no character
   # separating scale name and item number
all.equal(z5, z6)
winsor
winsorizes a numeric vector by recoding extreme values as a user-identified boundary value, which is defined by z-score units. The to.na
argument provides the option of recoding the extreme values as missing.
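A minimal sketch of z-score winsorizing as described (a hypothetical reimplementation for illustration, not the package source):

x <- quakes$"stations"
mu <- mean(x, na.rm = TRUE); s <- sd(x, na.rm = TRUE)
lo <- mu + -3 * s; hi <- mu + 3 * s # boundary values in z-score units
pmin(pmax(x, lo), hi) # clamp extreme values to the boundaries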
winsor(x, z.min = -3, z.max = 3, rtn.int = FALSE, to.na = FALSE)
x |
numeric vector |
z.min |
numeric vector of length 1 specifying the lower boundary value in z-score units. |
z.max |
numeric vector of length 1 specifying the upper boundary value in z-score units. |
rtn.int |
logical vector of length 1 specifying whether the recoded values should be rounded to the nearest integer. This can be useful when working with count data and decimal values are impossible. |
to.na |
logical vector of length 1 specifying whether the extreme values should be recoded to NA rather than winsorized to the boundary values. |
Note, the psych package also has a function called winsor
, which offers
the option to winsorize a numeric vector by quantiles rather than z-scores. If you have both the quest package and the psych
package attached in your current R session (e.g., using library
),
R will default to the winsor
function from whichever package was attached last. One
way to deal with this issue is to explicitly specify which package you want to
call the winsor
function from. You can do this using the ::
operator in base R, where the package name comes before the ::
and the
function name comes after it (e.g., quest::winsor
).
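For example (psych::winsor's trim argument specifies quantiles rather than z-scores):

quest::winsor(quakes$"stations", z.min = -3, z.max = 3) # z-score winsorizing
psych::winsor(quakes$"stations", trim = 0.2) # quantile winsorizing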
numeric vector of the same length as x
with extreme values
recoded as either the boundary values or NA.
winsors
winsor (psych package)
# winsorize
table(quakes$"stations")
new <- winsor(quakes$"stations")
table(new)

# recode as NA
vecNA(quakes$"stations")
new <- winsor(quakes$"stations", to.na = TRUE)
vecNA(new)

# rtn.int = TRUE
winsor(x = cars[[1]], z.min = -2, z.max = 2, rtn.int = FALSE)
winsor(x = cars[[1]], z.min = -2, z.max = 2, rtn.int = TRUE)
winsors
winsorizes numeric data by recoding extreme values as a
user-identified boundary value, which is defined by z-score units. The to.na
argument provides the option of recoding the extreme values as missing.
winsors( data, vrb.nm, z.min = -3, z.max = 3, rtn.int = FALSE, to.na = FALSE, suffix = "_win" )
data |
data.frame of data. |
vrb.nm |
character vector of colnames from |
z.min |
numeric vector of length 1 specifying the lower boundary value in z-score units. |
z.max |
numeric vector of length 1 specifying the upper boundary value in z-score units. |
rtn.int |
logical vector of length 1 specifying whether the recoded values should be rounded to the nearest integer. This can be useful when working with count data and decimal values are impossible. |
to.na |
logical vector of length 1 specifying whether the extreme values should be recoded to NA rather than winsorized to the boundary values. |
suffix |
character vector of length 1 specifying the string to append to the end of the colnames in the return object. |
data.frame of winsorized data with extreme values recoded as either
the boundary values or NA and colnames = paste0(vrb.nm, suffix)
.
# winsorize
lapply(X = quakes[c("mag","stations")], FUN = table)
new <- winsors(quakes, vrb.nm = names(quakes))
lapply(X = new, FUN = table)

# recode as NA
vecNA(quakes)
new <- winsors(quakes, vrb.nm = names(quakes), to.na = TRUE)
vecNA(new)

# rtn.int = TRUE
winsors(data = cars, vrb.nm = names(cars), z.min = -2, z.max = 2, rtn.int = FALSE)
winsors(data = cars, vrb.nm = names(cars), z.min = -2, z.max = 2, rtn.int = TRUE)