Identify studies contributing to heterogeneity patterns found in GOSH plots
This function uses three unsupervised learning algorithms (k-means, DBSCAN and Gaussian Mixture Models) to identify studies contributing to the heterogeneity-effect size patterns found in GOSH (graphic display of study heterogeneity) plots.
Usage
gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
                 km.params = list(centers = 3,
                                  iter.max = 10, nstart = 1,
                                  algorithm = c("Hartigan-Wong",
                                                "Lloyd", "Forgy", "MacQueen"),
                                  trace = FALSE),
                 db.params = list(eps = 0.15, MinPts = 5,
                                  method = c("hybrid", "raw", "dist")),
                 gmm.params = list(G = NULL, modelNames = NULL,
                                   prior = NULL, control = emControl(),
                                   initialization = list(hcPairs = NULL,
                                                         subset = NULL,
                                                         noise = NULL),
                                   Vinv = NULL,
                                   warn = mclust.options("warn"),
                                   x = NULL, verbose = FALSE),
                 seed = 123,
                 verbose = TRUE)
Arguments
- data
An object of class gosh.rma created through the gosh function.
- km
Logical. Should the k-Means algorithm be used to identify patterns in the GOSH plot matrix? TRUE by default.
- db
Logical. Should the DBSCAN algorithm be used to identify patterns in the GOSH plot matrix? TRUE by default.
- gmm
Logical. Should a bivariate Gaussian Mixture Model be used to identify patterns in the GOSH plot matrix? TRUE by default.
- km.params
A list containing the parameters for the k-Means algorithm as implemented in kmeans. Run ?kmeans for further details.
- db.params
A list containing the parameters for the DBSCAN algorithm as implemented in dbscan. Run ?fpc::dbscan for further details.
- gmm.params
A list containing the parameters for the Gaussian Mixture Models as implemented in mclustBIC. Run ?mclust::mclustBIC for further details.
- seed
Seed used for reproducibility. Default seed is 123.
- verbose
Logical. Should a progress bar be printed in the console during clustering?
Details
GOSH Plots
GOSH (graphic display of study heterogeneity) plots were proposed by Olkin, Dahabreh and Trikalinos (2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots facilitate the detection of both (i) outliers and (ii) distinct homogeneous subgroups within the modeled data.
Data for the plots is generated by fitting a random-effects model with the same specifications as in the meta-analysis to all possible non-empty subsets of the \(k\) studies in an analysis, provided that their number, \(2^k - 1\), does not exceed \(10^6\). If there are more than \(10^6\) possible subsets, 1 million subsets are randomly sampled and used for model fitting when using the gosh function.
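As a rough illustration of the subset rule described above, the following base R sketch counts the non-empty subsets of \(k\) studies (a set of \(k\) elements has \(2^k - 1\) of them) and checks whether they would all be fitted or whether sampling applies. The helper name `n_gosh_subsets` and its `max_subsets` argument are hypothetical and exist only for this example; they are not part of the package.

```r
# Hypothetical helper: number of non-empty subsets of k studies,
# and whether that number exceeds the cap of 10^6 described above.
n_gosh_subsets <- function(k, max_subsets = 1e6) {
  n_possible <- 2^k - 1
  list(
    n_possible = n_possible,                    # all non-empty subsets
    sampled    = n_possible > max_subsets,      # TRUE if sampling applies
    n_fitted   = min(n_possible, max_subsets)   # models actually fitted
  )
}

n_gosh_subsets(10)   # 1023 possible subsets, all can be fitted
n_gosh_subsets(25)   # more than 10^6 possible subsets, so 10^6 are sampled
```

With the default cap, sampling starts at \(k = 20\) studies, since \(2^{20} - 1 > 10^6\).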
GOSH Plot Diagnostics
Although GOSH plots make it possible to detect heterogeneity patterns and distinct subgroups within the data, determining which studies contribute to a certain subgroup or pattern is often difficult or computationally intensive. To facilitate the detection of studies responsible for specific patterns within the GOSH plots, this function randomly samples \(10^4\) data points from the GOSH plot data (to speed up computation). Of these data points, only the \(z\)-transformed \(I^2\) and effect size values are used (as other heterogeneity metrics produced for the GOSH plot data using the gosh function are linear combinations of \(I^2\)). Three clustering algorithms are then applied to this data.
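The sampling and standardization step can be sketched in base R as follows. This is a minimal sketch on simulated data, not the function's internal code: the data frame `gosh_data` and its column names `estimate` and `I2` are assumptions made for illustration.

```r
set.seed(123)

# Simulated stand-in for GOSH plot data: one row per fitted subset.
# The column names 'estimate' and 'I2' are assumptions for this sketch.
n_subsets <- 50000
gosh_data <- data.frame(
  estimate = rnorm(n_subsets, mean = 0.5, sd = 0.1),
  I2       = runif(n_subsets, min = 0, max = 90)
)

# Randomly sample 10^4 rows (or all rows, if fewer are available).
n_sample <- min(1e4, nrow(gosh_data))
idx <- sample(nrow(gosh_data), n_sample)

# z-transform both variables so they are on a comparable scale.
cluster_input <- scale(gosh_data[idx, c("estimate", "I2")])

dim(cluster_input)
colMeans(cluster_input)        # approximately 0 for both columns
apply(cluster_input, 2, sd)    # 1 for both columns
```

Standardizing matters here because the clustering algorithms are distance-based, and \(I^2\) and the effect size are typically measured on very different scales.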
The first algorithm is k-Means clustering using the algorithm by Hartigan & Wong (1979) and \(m_k = 3\) cluster centers by default. The function uses the kmeans implementation to perform k-Means clustering.
As k-Means does not perform well in the presence of distinct arbitrary subclusters and noise, the function also applies DBSCAN (density reachability and connectivity clustering; Schubert et al., 2017). The hyperparameters \(\epsilon\) and \(MinPts\) can be tuned for each analysis to maintain a reasonable amount of granularity while not producing too many subclusters. The function uses the dbscan implementation to perform the DBSCAN clustering.
Lastly, as a clustering approach using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002) are integrated in the function using an internal call to the mclustBIC implementation. Clustering hyperparameters can be tuned by providing a list of parameters of the mclustBIC function in the mclust package.
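To make the first of the three clustering steps concrete, here is a minimal sketch that runs k-Means with the default settings listed under Usage on simulated, already standardized data. The helper `make_group` and the simulated matrix `cluster_input` are hypothetical. Only the k-Means step is executed because it needs nothing beyond base R; the DBSCAN and GMM calls are shown as comments and assume the fpc and mclust packages are installed.

```r
set.seed(123)

# Simulated input with three loosely separated groups (illustration only).
make_group <- function(n, mx, my) {
  cbind(estimate = rnorm(n, mx, 0.3), I2 = rnorm(n, my, 0.3))
}
cluster_input <- rbind(
  make_group(300, -2, -2),
  make_group(300,  0,  2),
  make_group(300,  2, -1)
)

# k-Means with the defaults from the Usage section:
# 3 centers, Hartigan-Wong algorithm, a single random start.
km_fit <- kmeans(cluster_input, centers = 3, iter.max = 10, nstart = 1,
                 algorithm = "Hartigan-Wong")

table(km_fit$cluster)   # cluster sizes
km_fit$centers          # cluster centers on the standardized scale

# Not run here (require additional packages):
# db_fit  <- fpc::dbscan(cluster_input, eps = 0.15, MinPts = 5)
# gmm_bic <- mclust::mclustBIC(cluster_input)
```

Because k-Means depends on its random starting centers, fixing the seed (as the seed argument does) keeps the cluster assignments reproducible between runs.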
To assess which studies predominantly contribute to a detected cluster, the function calculates the cluster imbalance of a specific study as the difference between (i) the expected share of subsets containing that study if the cluster makeup was purely random (viz., representative of the full sample), and (ii) the actual share of subsets containing that study within a cluster. Cook's distance for each study is then calculated based on a linear intercept model to determine the leverage of a specific study for each cluster makeup. Studies with a leverage value more than three times the mean in any of the generated clusters (for all clustering algorithms used) are returned as potentially influential cases, and the GOSH plot is redrawn highlighting these specific studies.
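The imbalance and influence calculation can be sketched in base R as shown below. This follows the verbal description above on simulated data and is not taken from the function's source code; the inclusion matrix `incl`, the cluster labels and the study names are all hypothetical.

```r
set.seed(123)

# Simulated stand-in: 2000 sampled subsets of 8 studies.
# incl[i, j] is TRUE if subset i contains study j.
n_subsets <- 2000
k <- 8
incl <- matrix(runif(n_subsets * k) < 0.5, nrow = n_subsets,
               dimnames = list(NULL, paste0("study", 1:k)))

# Hypothetical cluster labels that depend strongly on study 3:
# about 90% of subsets containing study 3 are assigned to cluster 1,
# compared with about 10% of subsets that do not contain it.
in_cluster_1 <- ifelse(incl[, "study3"],
                       runif(n_subsets) < 0.9,
                       runif(n_subsets) < 0.1)
cluster <- ifelse(in_cluster_1, 1L, 2L)

# (i) Expected share of subsets containing each study (full sample).
expected_share <- colMeans(incl)
# (ii) Actual share of subsets containing each study within cluster 1.
actual_share <- colMeans(incl[cluster == 1L, , drop = FALSE])
imbalance <- expected_share - actual_share

# Cook's distance from an intercept-only linear model of the imbalances.
fit <- lm(imbalance ~ 1)
cd <- setNames(cooks.distance(fit), colnames(incl))

# Flag studies whose value exceeds three times the mean.
flagged <- names(cd)[cd > 3 * mean(cd)]
flagged
```

In this simulation only study 3 drives cluster membership, so it is the study whose share within the cluster deviates from its share in the full sample.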
References
Fraley, C., & Raftery, A. E. (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97(458), 611–631.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.
Olkin, I., Dahabreh, I. J., & Trikalinos, T. A. (2012). GOSH – a Graphical Display of Study Heterogeneity. Research Synthesis Methods, 3(3), 214–223.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), Article 19.