`R/gosh.diagnostics.R`

`gosh.diagnostics.Rd`

This function uses three unsupervised learning algorithms
(*k*-means, DBSCAN and Gaussian Mixture Models) to identify studies
contributing to the heterogeneity-effect size patterns found in GOSH (graphic
display of study heterogeneity) plots.

gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE, km.params = list(centers = 3, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy","MacQueen"), trace = FALSE), db.params = list(eps = 0.15, MinPts = 5, method = c("hybrid", "raw", "dist")), gmm.params = list(G = NULL, modelNames = NULL, prior = NULL, control = emControl(), initialization = list(hcPairs = NULL, subset = NULL, noise = NULL), Vinv = NULL, warn = mclust.options("warn"), x = NULL, verbose = FALSE), seed = 123, verbose = TRUE)

data | An object of class |
---|---|

km | Logical. Should the |

db | Logical. Should the DBSCAN algorithm be used to identify patterns
in the GOSH plot matrix? |

gmm | Logical. Should a bivariate Gaussian Mixture Model be used to
identify patterns in the GOSH plot matrix? |

km.params | A list containing the parameters for the |

db.params | A list containing the parameters for the DBSCAN algorithm
as implemented in |

gmm.params | A list containing the parameters for the Gaussian Mixture Models
as implemented in |

seed | Seed used for reproducibility. Default seed is |

verbose | Logical. Should a progress bar be printed in the console during clustering? |

**GOSH Plots**

GOSH (*graphic display of study
heterogeneity*) plots were proposed by Olkin, Dahabreh and Trikalinos
(2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots
facilitate the detection of both (i) outliers and (ii) distinct homogeneous
subgroups within the modeled data.

Data for the plots is generated by fitting a random-effects model with the
same specifications as in the meta-analysis to subsets of the \(k\) studies
in an analysis. Let \(\mathcal{P}(k)\) denote the set of all non-empty
subsets of these studies, so that \(|\mathcal{P}(k)| = 2^k - 1\). If
\(|\mathcal{P}(k)| \leq 10^6\), the model is fitted to every subset in
\(\mathcal{P}(k)\). For \(|\mathcal{P}(k)| > 10^6\), 1 million subsets are
randomly sampled and used for model fitting when using the `gosh` function.
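The subset logic described above can be illustrated with a small base R sketch. The helper `gosh_subsets()` and its arguments are hypothetical and exist only for this illustration; the random branch is a simplified stand-in and not necessarily how the `gosh` function draws its samples.

```r
# Hypothetical helper: list the study subsets a GOSH analysis would fit.
# Returns a logical matrix with one row per subset and one column per study.
gosh_subsets <- function(k, max_subsets = 1e6, seed = 123) {
  n_all <- 2^k - 1                       # number of non-empty subsets
  if (n_all <= max_subsets) {
    # Enumerate every non-empty subset via the binary digits of 1..(2^k - 1)
    ids  <- seq_len(n_all)
    incl <- sapply(seq_len(k), function(j) (ids %/% 2^(j - 1)) %% 2 == 1)
    incl <- matrix(incl, nrow = n_all, ncol = k)
  } else {
    # Too many subsets: draw random inclusion patterns instead (simplified)
    set.seed(seed)
    incl <- matrix(runif(max_subsets * k) < 0.5, nrow = max_subsets, ncol = k)
    incl <- incl[rowSums(incl) > 0, , drop = FALSE]   # drop empty subsets
  }
  incl
}

subsets <- gosh_subsets(k = 4)
dim(subsets)   # 15 4: all 2^4 - 1 non-empty subsets of 4 studies
```

Each row of the returned matrix marks which studies would enter one model fit.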

**GOSH Plot Diagnostics**

Although GOSH plots make it possible to detect heterogeneity patterns and
distinct subgroups within the data, determining which studies contribute to a
certain subgroup or pattern is often difficult or computationally
intensive. To facilitate the detection of studies responsible for specific
patterns within the GOSH plots, this function randomly samples \(10^4\)
data points from the GOSH plot data (to speed up computation). Of these data
points, only the \(z\)-transformed \(I^2\) and effect size values are
used (as other heterogeneity metrics produced for the GOSH plot data using
the `gosh` function are linear combinations of \(I^2\)). Three clustering
algorithms are then applied to this data.
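The sampling and \(z\)-transformation step can be sketched in base R as follows. The data are simulated, and the column names `I2` and `estimate` are assumptions made for this illustration rather than the names used internally by the function.

```r
set.seed(123)

# Simulated stand-in for GOSH plot data: one row per fitted subset
# (column names `I2` and `estimate` are assumptions for this sketch)
gosh_data <- data.frame(
  I2       = runif(50000, min = 0, max = 90),
  estimate = rnorm(50000, mean = 0.3, sd = 0.1)
)

# Randomly sample at most 10^4 rows to speed up clustering
n_sample <- min(1e4, nrow(gosh_data))
sampled  <- gosh_data[sample(nrow(gosh_data), n_sample), ]

# z-transform both variables so they are on a comparable scale
z_data <- as.data.frame(scale(sampled[, c("I2", "estimate")]))

summary(z_data)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither variable dominates the distance computations used by the clustering algorithms.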

The first algorithm is *k*-means clustering, using the algorithm by Hartigan & Wong (1979) and \(m_k = 3\) cluster centers by default. The function uses the `kmeans` implementation to perform *k*-means clustering.

As *k*-means does not perform well in the presence of distinct, arbitrarily shaped subclusters and noise, the function also applies **DBSCAN** (*density-based spatial clustering of applications with noise*; Schubert et al., 2017). The hyperparameters \(\epsilon\) and \(MinPts\) can be tuned for each analysis to maintain a reasonable amount of granularity while not producing too many subclusters. The function uses the `dbscan` implementation to perform the DBSCAN clustering.

Lastly, as a clustering approach using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002) are integrated in the function using an internal call to the `mclustBIC` implementation. Clustering hyperparameters can be tuned by providing a list of parameters of the `mclustBIC` function in the `mclust` package.
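As a minimal sketch of the first clustering step, the following applies base R's `kmeans` with the default values shown in the usage section (`centers = 3`, `iter.max = 10`, `nstart = 1`, Hartigan-Wong algorithm) to simulated data. The simulated matrix `z_data` is an assumption standing in for the sampled, \(z\)-transformed GOSH data; the DBSCAN and GMM steps rely on add-on packages and are left out here.

```r
set.seed(123)

# Simulated, z-transformed stand-in for the sampled GOSH data:
# three loosely separated groups in the (I2, effect size) plane
z_data <- rbind(
  cbind(rnorm(200, -1.5, 0.3), rnorm(200, -1.0, 0.3)),
  cbind(rnorm(200,  0.0, 0.3), rnorm(200,  1.2, 0.3)),
  cbind(rnorm(200,  1.5, 0.3), rnorm(200, -0.2, 0.3))
)
colnames(z_data) <- c("I2", "estimate")

# k-means with the defaults listed for `km.params`
km_fit <- kmeans(z_data, centers = 3, iter.max = 10, nstart = 1,
                 algorithm = "Hartigan-Wong")

table(km_fit$cluster)   # number of sampled subsets per cluster
km_fit$centers          # cluster centers in z-units
```

Because `nstart = 1` uses a single random start, results can depend on the seed; increasing `nstart` in `km.params` is one way to make the solution more stable.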

To assess which studies predominantly contribute to a detected cluster, the function calculates the cluster imbalance of a specific study as the difference between (i) the expected share of subsets containing that study if the cluster makeup was purely random (viz., representative of the full sample), and (ii) the actual share of subsets containing that study within the cluster. Cook's distance for each study is then calculated based on a linear intercept model to determine the leverage of a specific study for each cluster makeup. Studies with a leverage value three times above the mean in any of the generated clusters (for all clustering algorithms used) are returned as potentially influential cases, and the GOSH plot is redrawn highlighting these specific studies.
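The imbalance and leverage computation can be sketched in base R as shown below. This is an illustration under stated assumptions, not the function's internal code: the inclusion matrix `incl`, the cluster labels, and the helper `flag_studies()` are all hypothetical, and the data are simulated so that one study is deliberately over-represented in cluster 1.

```r
set.seed(123)

k <- 8       # number of studies
n <- 2000    # number of sampled subsets

# Simulated inclusion matrix: TRUE if a study is part of a subset
incl <- matrix(runif(n * k) < 0.5, nrow = n, ncol = k,
               dimnames = list(NULL, paste0("study", 1:k)))

# Simulated cluster labels; study3 is made over-represented in cluster 1
cluster <- sample(1:3, n, replace = TRUE)
cluster[incl[, "study3"] & runif(n) < 0.6] <- 1

# Hypothetical helper: flag studies with outlying imbalance in one cluster
flag_studies <- function(incl, cluster, cl) {
  expected  <- colMeans(incl)                                 # share in full sample
  observed  <- colMeans(incl[cluster == cl, , drop = FALSE])  # share in cluster
  imbalance <- observed - expected
  fit  <- lm(imbalance ~ 1)                 # linear intercept-only model
  cook <- cooks.distance(fit)               # one value per study
  colnames(incl)[which(cook > 3 * mean(cook))]
}

flag_studies(incl, cluster, cl = 1)   # "study3"
```

In the function itself, this check is repeated for every cluster produced by each of the selected algorithms, and a study flagged in any of them is reported.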

Fraley, C., & Raftery, A. E. (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation.
*Journal of the American Statistical Association, 97* (458). 611–631.

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm.
*Journal of the Royal Statistical Society. Series C (Applied Statistics), 28* (1). 100–108.

Olkin, I., Dahabreh, I. J., & Trikalinos, T. A. (2012). GOSH – a Graphical Display of Study Heterogeneity.
*Research Synthesis Methods, 3* (3). 214–223.

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN Revisited, Revisited:
Why and How You Should (Still) Use DBSCAN. *ACM Transactions on Database Systems (TODS), 42* (3). Article 19.