
Identify studies contributing to heterogeneity patterns found in GOSH plots
This function uses three unsupervised learning algorithms (k-means, DBSCAN and Gaussian Mixture Models) to identify studies contributing to the heterogeneity-effect size patterns found in GOSH (graphic display of study heterogeneity) plots.
Usage
gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
km.params = list(centers = 3,
iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong",
"Lloyd", "Forgy","MacQueen"),
trace = FALSE),
db.params = list(eps = 0.15, MinPts = 5,
method = c("hybrid", "raw", "dist")),
gmm.params = list(G = NULL, modelNames = NULL,
prior = NULL, control = emControl(),
initialization = list(hcPairs = NULL,
subset = NULL,
noise = NULL),
Vinv = NULL,
warn = mclust.options("warn"),
x = NULL, verbose = FALSE),
seed = 123,
verbose = TRUE)
Arguments
- data
An object of class gosh.rma created through the gosh function.
- km
Logical. Should the k-Means algorithm be used to identify patterns in the GOSH plot matrix? TRUE by default.
- db
Logical. Should the DBSCAN algorithm be used to identify patterns in the GOSH plot matrix? TRUE by default.
- gmm
Logical. Should a bivariate Gaussian Mixture Model be used to identify patterns in the GOSH plot matrix? TRUE by default.
- km.params
A list containing the parameters for the k-Means algorithm as implemented in kmeans. Run ?kmeans for further details.
- db.params
A list containing the parameters for the DBSCAN algorithm as implemented in dbscan. Run ?fpc::dbscan for further details.
- gmm.params
A list containing the parameters for the Gaussian Mixture Models as implemented in mclustBIC. Run ?mclust::mclustBIC for further details.
- seed
Seed used for reproducibility. Default seed is 123.
- verbose
Logical. Should a progress bar be printed in the console during clustering?
Details
GOSH Plots
GOSH (graphic display of study heterogeneity) plots were proposed by Olkin, Dahabreh and Trikalinos (2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots facilitate the detection of both (i) outliers and (ii) distinct homogeneous subgroups within the modeled data.
Data for the plots is generated by fitting a random-effects model with the
same specifications as in the meta-analysis to all $2^k - 1$ possible
non-empty subsets of the $k$ studies in an analysis (the power set
$\mathcal{P}(k)$ with $\emptyset \notin \mathcal{P}(k)$), provided that
$2^k - 1 \le 10^6$. For $|\mathcal{P}(k)| > 10^6$, 1
million subsets are randomly sampled and used for model fitting when using
the gosh function.
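The subset count described above grows quickly with the number of studies. The following base R sketch only illustrates the arithmetic; the helper names n_subsets and needs_sampling are hypothetical and are not part of the function or of gosh.

```r
# Number of non-empty subsets of k studies: 2^k - 1.
n_subsets <- function(k) 2^k - 1

# Hypothetical helper: TRUE when the count exceeds the cap of 10^6,
# i.e. when 1 million subsets would be randomly sampled instead.
needs_sampling <- function(k, cap = 1e6) n_subsets(k) > cap

k <- c(5, 10, 19, 20)
data.frame(k = k, subsets = n_subsets(k), sampled = needs_sampling(k))
```

With 19 studies all 524,287 subsets can still be fitted; from 20 studies onward (1,048,575 subsets) the cap is exceeded.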
GOSH Plot Diagnostics
Although GOSH plots make it possible to detect heterogeneity patterns and distinct
subgroups within the data, determining which studies contribute to a
certain subgroup or pattern is often difficult or computationally
intensive. To facilitate the detection of studies responsible for specific
patterns within the GOSH plots, this function randomly samples $10^4$
data points from the GOSH plot data (to speed up computation). Of these data
points, only the z-transformed $I^2$ and effect size values are
used (as other heterogeneity metrics produced for the GOSH plot data using
the gosh function are linear combinations of
$I^2$). Three clustering algorithms are then applied to this data.
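The sampling and z-transformation step can be sketched in base R as follows. This is a minimal illustration on simulated values, not the function's internal code; the data frame gosh_dat and its column names I2 and estimate are assumptions made for the example.

```r
set.seed(123)  # same value as the default of the `seed` argument

# Simulated stand-in for GOSH plot data (hypothetical values, not real output):
# one row per fitted subset, with its I^2 (in percent) and pooled effect size.
gosh_dat <- data.frame(
  I2       = runif(50000, min = 0, max = 90),
  estimate = rnorm(50000, mean = 0.3, sd = 0.1)
)

# Randomly sample 10^4 rows to speed up the later clustering steps.
idx <- sample(nrow(gosh_dat), size = 1e4)
sampled <- gosh_dat[idx, ]

# z-transform (center and scale) both columns so they share a common scale.
z <- scale(sampled)

dim(z)            # 10000 rows, 2 columns
colMeans(z)       # both numerically zero
apply(z, 2, sd)   # both equal to one
```

Standardizing both columns keeps the distance-based algorithms from being dominated by whichever variable happens to have the larger numeric range.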
The first algorithm is k-Means clustering using the algorithm by Hartigan & Wong (1979) and $m_k = 3$ cluster centers by default. The function uses the kmeans implementation to perform k-Means clustering.
As k-Means does not perform well in the presence of distinct arbitrary subclusters and noise, the function also applies DBSCAN (density reachability and connectivity clustering; Schubert et al., 2017). The hyperparameters $\epsilon$ and MinPts can be tuned for each analysis to maintain a reasonable amount of granularity while not producing too many subclusters. The function uses the dbscan implementation to perform the DBSCAN clustering.
Lastly, as a clustering approach using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002) are integrated in the function using an internal call to the mclustBIC implementation. Clustering hyperparameters can be tuned by providing a list of parameters of the mclustBIC function in the mclust package.
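To give a feel for the first two clustering steps, the sketch below applies kmeans and, if the fpc package is installed, fpc::dbscan to simulated standardized data using the default parameter values listed under Usage. The simulated matrix z is an assumption made for the example, and the code is not the function's internal implementation; the Gaussian Mixture Model step is omitted here.

```r
set.seed(123)

# Simulated, z-transformed stand-in for the sampled GOSH data (hypothetical):
# two loosely separated groups in the (I^2, effect size) plane.
z <- scale(rbind(
  cbind(I2 = rnorm(500, 20, 5), estimate = rnorm(500, 0.2, 0.03)),
  cbind(I2 = rnorm(500, 70, 5), estimate = rnorm(500, 0.5, 0.03))
))

# k-Means with the defaults from km.params (3 centers, Hartigan-Wong).
km <- kmeans(z, centers = 3, iter.max = 10, nstart = 1,
             algorithm = "Hartigan-Wong")
table(km$cluster)

# DBSCAN with the defaults from db.params, run only if fpc is available.
# In the result, cluster 0 collects the points treated as noise.
if (requireNamespace("fpc", quietly = TRUE)) {
  db <- fpc::dbscan(z, eps = 0.15, MinPts = 5, method = "hybrid")
  print(table(db$cluster))
}
```

Because k-Means always returns exactly the requested number of clusters, it splits one of the two simulated groups here, whereas DBSCAN derives the number of clusters from eps and MinPts. This is one reason to compare the algorithms rather than rely on a single one.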
To assess which studies predominantly contribute to a detected cluster, the function calculates the cluster imbalance of a specific study using the difference between (i) the expected share of subsets containing a specific study if the cluster makeup was purely random (viz., representative of the full sample), and (ii) the actual share of subsets containing a specific study within a cluster. Cook's distance for each study is then calculated based on a linear intercept model to determine the leverage of a specific study for each cluster makeup. Studies with a leverage value three times above the mean in any of the generated clusters (for all used clustering algorithms) are returned as potentially influential cases, and the GOSH plot is redrawn highlighting these specific studies.
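The imbalance and Cook's distance logic can be sketched in base R as shown below. This is one possible reading of the description above, not the function's actual code: the inclusion matrix incl, the toy cluster labels, the helper flag_cluster, and the interpretation of the cutoff as "greater than three times the mean Cook's distance" are all assumptions made for illustration.

```r
set.seed(123)

k <- 8      # number of studies (hypothetical)
n <- 2000   # number of sampled subsets (hypothetical)

# Inclusion matrix: incl[i, j] is TRUE if subset i contains study j.
incl <- matrix(runif(n * k) < 0.5, nrow = n, ncol = k,
               dimnames = list(NULL, paste0("study", 1:k)))

# Toy cluster labels: cluster 2 consists only of subsets containing study 3.
cluster <- ifelse(incl[, "study3"] & runif(n) < 0.9, 2, 1)

# (i) Expected share of subsets containing each study (full sample).
expected <- colMeans(incl)

flag_cluster <- function(cl) {
  # (ii) Actual share of subsets containing each study within this cluster.
  actual <- colMeans(incl[cluster == cl, , drop = FALSE])
  imbalance <- actual - expected
  # Cook's distance of each study in an intercept-only linear model.
  cd <- unname(cooks.distance(lm(imbalance ~ 1)))
  # Assumed cutoff: more than three times the mean Cook's distance.
  colnames(incl)[cd > 3 * mean(cd)]
}

# Collect the studies flagged in any cluster.
flagged <- unique(unlist(lapply(sort(unique(cluster)), flag_cluster)))
flagged
```

In this toy setup only study3 is over-represented in one cluster and under-represented in the other, so it is the only study expected to be flagged.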
References
Fraley, C., & Raftery, A. E. (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97 (458). 611–631.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28 (1). 100–108.
Olkin, I., Dahabreh, I. J., & Trikalinos, T. A. (2012). GOSH – a Graphical Display of Study Heterogeneity. Research Synthesis Methods, 3 (3). 214–223.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems (TODS), 42 (3). Article 19.