class: center, middle, inverse, title-slide

.title[
# Pushing the AR test beyond the limit
]
.subtitle[
## A novel goodness-of-fit test for high-dimensional applications
]
.author[
### Markku Kuismin
]
.institute[
### University of Oulu, Northern Finland Birth Cohorts
]
.date[
### 2026-06-05
]

---

<style type="text/css">
.highlight-last-item > ul > li,
.highlight-last-item > ol > li {
  opacity: 0.5;
}
.highlight-last-item > ul > li:last-of-type,
.highlight-last-item > ol > li:last-of-type {
  opacity: 1;
}
</style>

---

class: center, middle

# Very short author biography

---

- I work as a biostatistician at the University of Oulu School of Medicine.
- My research interests include high-dimensional statistics, network estimation, non-parametric methods, (multivariate) statistical tests, cluster analysis, and **rejection sampling**, to mention a few examples.
- Here I will introduce a novel multivariate test based on rejection sampling that works in the high-dimensional case.

---

class: center, middle

# Background and motivation

---

# Background and motivation

- All parametric statistical methods are built on assumptions about the population distribution of the random variable.
- Many multivariate methods for high-dimensional data assume that the population distribution is multivariate normal.
- For example,
  - Gaussian Graphical Models. If not normal `->` strict conditional-independence interpretation becomes weaker.
  - High-dimensional linear discriminant analysis. If not normal `->` other methods such as RF or SVMs may be more robust alternatives.
  - Gaussian Mixture Models. If not normal `->` incorrectly merged clusters.
- In general: It is difficult to analyse data meaningfully without understanding the distribution from which they arise.

---

# Background and motivation

For the rest of this presentation, the null hypothesis is defined as

`$$H_0: f = f_0$$`

- For example, `\(H_0: f\text{ is a multivariate normal density.}\)`
- Or a plain-language description: "*The underlying population distribution is multivariate normal.*"
- Several tests have been proposed (e.g., Kolmogorov–Smirnov, the energy test by Székely and Rizzo (2005), a high-dimensional test by Chen and Xia (2023), and many others).
- Until we have a test that detects the difference with 100% power in every single scenario, I consider this an open research problem...

---

class: center, middle

# The AR statistic

---

# The AR statistic

- Based on the Accept-Reject (AR) algorithm.

`$$\rho = \frac{1}{n}\sum_{i=1}^n \text{I}\Bigl(\frac{f_0(X_i)}{Df(X_i)} > U\Bigr).$$`

- `\(\rho\)` is the proportion of observations `\(X_i\)`, among `\(n\)` observations drawn from a distribution with density `\(f\)`, that would be accepted as samples from the distribution with density `\(f_0\)`.
- Note: The fundamental objective of AR algorithms is to maximize the effective sample yield while minimizing sensitivity to a poor choice of the proposal distribution.
- My objective is the complete opposite: I want the algorithm to be sensitive to a poor choice of the proposal distribution that is used to evaluate the null hypothesis.

---

# The AR statistic

- The algorithm was adapted in Kuismin (2025) for statistical inference.
- Main idea:
  - Let `\(f\)` be the true density (population density) and `\(f_0\)` the hypothesized density.
  - Compute `\(\rho\)` under the null: Assume that `\(f_0\)` and `\(f\)` are the same distribution (set the normalizing constant `\(D = 1\)`).
  - Instead of generating pseudo-random numbers from `\(f\)`, compute `\(\rho\)` with respect to the observed data.

---

# The AR statistic

- A similar test statistic has been proposed previously, without establishing a connection to rejection sampling:
  - Györfi, L., & Van Der Meulen, E. C. (1991) investigated a test statistic based on the **total variation distance** and showed how to build a procedure that is consistent against a large class of alternatives.
  - (I do not discuss this test in detail; I mention it to connect my test procedure to the earlier literature.)
- We will later talk a little more about total variation distance, which is the common denominator underlying both methods.

---

# The AR statistic

- Technical challenges:
  - `\(f\)` is unknown (naturally) and we only have observations from it. How to determine `\(f\)`?
      - Solution: Estimate `\(f\)`, e.g., using KDE(...)
  - `\(\rho\)` depends on an external random variable `\(U \sim \text{Unif}(0, 1)\)`. How to determine a deterministic test statistic?
      - Solution: Use `\(E_U(\rho)\)` instead.
      - Denote `\(\rho(\textbf{X}) = E_U(\rho) = n^{-1}\sum_{i=1}^n \min\left(1, f_0(X_i)/\widehat{f}(X_i)\right)\)`.

---

# The AR statistic

- The statistic has a handful of attractive properties.
- **Direct probabilistic interpretation:**
  - The AR statistic can be interpreted as an estimated acceptance probability.
- **It measures how often observations would be accepted under the hypothesized population distribution.**
  - Values close to 1 indicate good agreement with the hypothesized distribution.
  - Values close to 0 indicate lack of fit and evidence against the model.
- **Interval estimates for `\(\rho(\textbf{X})\)` can be computed using simulations or the Poisson-Binomial distribution.**
- **It is a global measure of distributional agreement.**
  - It can detect differences in location, covariance, tail behavior, skewness, multimodality, and other shape features.

---

# The AR statistic

- The statistic depends on a likelihood ratio `\(f_0(X_i)/f(X_i)\)`.
- (Isn't it just an empirical likelihood test...)
- Yes, but the random threshold step makes a big difference:
  - The asymptotic behavior of likelihood ratio and empirical likelihood tests is governed by the Kullback–Leibler divergence.
  - The AR statistic, on the other hand, is connected to *total variation distance*.

---

# Total variation distance

The total variation distance (TVD) between `\(f(x)\)` and `\(f_0(x)\)` is half the `\(L_1\)`-distance between the densities (or mass functions).

`$$\|f(x) - f_0(x)\|_{\text{TV}} = \frac{1}{2}\int_{\mathcal{X}}|f(x) - f_0(x)|\, dx.$$`

<!-- -->

---

# Total variation distance

In my testing framework, the similarity between the true density `\(f\)` and the hypothesized density `\(f_0\)` can be interpreted in terms of TVD. Asymptotically, as `\(n \to \infty\)`,

`$$\rho(\textbf{X}) \xrightarrow{P} 1 - \|f(x) - f_0(x)\|_{\text{TV}},$$`

where `\(f\)` is the true density (population density) and `\(f_0\)` is the hypothesized density.

- See Kuismin (2025) for a more detailed description and proof.

---

# Total variation distance

<!-- -->

- Population dist. is mv. logistic, `\(p=5\)`, location vector `\(\mathbf{0}\)`, and variance matrix `\(\mathbf{I} = diag(1, \ldots, 1)_{5 \times 5}\)`.
- The black horizontal line is 1 - TVD between true pop. density and mvuniform densities, approximated using Monte Carlo integration. Dashed horizontal lines illustrate the 95% interval estimates.

---

# Total variation distance

<!-- -->

- Population dist. is multivariate t, `\(p=5\)`, location vector `\(\mathbf{0}\)`, and variance/scale matrix `\(\mathbf{I} = diag(1, \ldots, 1)_{5 \times 5}\)`.

---

# Total variation distance

<!-- -->

- Population dist. is multivariate normal, `\(p=5\)`, location vector `\(\mathbf{0}\)`, and variance matrix `\(\mathbf{I} = diag(1, \ldots, 1)_{5 \times 5}\)`.

---

class: center, middle

# Distribution of the test statistic

---

# Distribution of the test statistic

- The observed test statistic follows a Poisson-Binomial distribution (see Kuismin, 2025 for more details).
- Compare the distribution of the test statistic when it is computed:
  1. By running the AR algorithm multiple times (grey bars).
  2. Using the Poisson-Binomial distribution (blue bars).
- (These distributions will overlap in the barplots presented in the next slide.)

---

# Distribution of the test statistic

- Example when the population distribution is mvuniform, `\(p = 10\)`.
- The black vertical line is 1 - TVD between mvnormal and mvuniform densities, approximated using Monte Carlo integration.

<img src="nordstat_2026_presentation_files/figure-html/unnamed-chunk-7-1.png" alt="" width="100%" />

---

# Challenges

- The statistic depends on a density estimate `\(\widehat{f}\)`. `->` Estimating `\(f\)` is more or less hopeless in a high-dimensional setting.
- However, let's test how well `\(\rho(\textbf{X})\)` works when deciding whether to reject `\(H_0\)` or not.

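
A minimal sketch of how `\(\rho(\textbf{X})\)` could be computed in low dimension, assuming a product-Gaussian KDE with a Scott-type bandwidth and `\(f_0 = N_p(\mathbf{0}, \mathbf{I})\)`. The helper names `kde_at_sample()` and `ar_statistic()` and the bandwidth rule are illustrative assumptions, not the implementation in Kuismin (2025).

``` r
# Hypothetical helper: product-Gaussian KDE evaluated at the sample points.
# Kernel and Scott-type bandwidth are illustrative assumptions.
kde_at_sample <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  h <- apply(x, 2, sd) * n^(-1 / (p + 4))  # per-coordinate bandwidth
  z <- sweep(x, 2, h, "/")                 # rescale each coordinate by its bandwidth
  d2 <- as.matrix(dist(z))^2               # squared scaled distances between observations
  k <- exp(-d2 / 2) / ((2 * pi)^(p / 2) * prod(h))
  rowMeans(k)                              # density estimate at each X_i
}

# rho(X) = mean(min(1, f0(X_i) / fhat(X_i))) with f0 = N_p(0, I)
ar_statistic <- function(x) {
  p <- ncol(x)
  log_f0 <- -0.5 * rowSums(x^2) - 0.5 * p * log(2 * pi)
  mean(pmin(1, exp(log_f0) / kde_at_sample(x)))
}

set.seed(1)
n <- 200
p <- 2
x_null <- matrix(rnorm(n * p), n, p)         # drawn from the hypothesized N_p(0, I)
x_alt  <- matrix(runif(n * p, -3, 3), n, p)  # uniform alternative on [-3, 3]^p
rho_null <- ar_statistic(x_null)
rho_alt  <- ar_statistic(x_alt)
```

With these settings one would expect `rho_null` to lie clearly above `rho_alt`, in line with the interpretation that values close to 1 indicate agreement with `\(f_0\)`.
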
---

class: center, middle

# Going beyond the limit

---

# Going beyond the limit

- Although density estimation becomes infeasible as `\(p\)` increases and `\(p \gg n\)`, can we still get something useful?
- How does the AR statistic hold up?
- For comparison, the energy test statistic cannot be computed when `\(p > n\)`:

``` r
# Fewer observations than dimensions (p > n)
n = 40
p = 50
mu = rep(0, p)
I = diag(1, p)
# Draw n observations from N_p(0, I) and run the energy test of normality
x = mvtnorm::rmvnorm(n = n, mean = mu, sigma = I)
energy::mvnorm.test(x, R = 500)
```

```
##
##  Energy test of multivariate normality: estimated parameters
##
## data:  x, sample size 40, dimension 50, replicates 500
## E-statistic = NA, p-value = NA
```

---

class: center, middle

# Going beyond the limit

## Interval estimates

---

## TVD - multivariate logistic

- When there are "enough" samples and the population distribution is not too "challenging", the test works reasonably well.

<!-- -->

---

## Interval estimates - multivariate logistic

<!-- -->

---

## TVD - multivariate logistic

- When `\(p \gg n\)`, the point and interval estimates become harder to interpret and less reliable.

<!-- -->

---

## Interval estimates - multivariate logistic

- However, the test statistic still seems to be able to distinguish the population density from the hypothesized `\(N_p\!\left(\boldsymbol{0}, \boldsymbol{I}\right)\)` density.

<!-- -->

---

## TVD - multivariate normal

<!-- -->

---

## Interval estimates - multivariate normal

<!-- -->

---

class: center, middle

# Going beyond the limit

## Power simulations

---

## Power simulations

`$$H_0: \mathbf{X}_1, \ldots, \mathbf{X}_n \overset{\mathrm{i.i.d.}}{\sim} N_p\!\left(\boldsymbol{0}, \boldsymbol{I}\right).$$`

- Consider five different multivariate population distributions:
  - Multivariate t, `\(MVT(\mathbf{0}, \mathbf{I}, 5)\)`.
  - Multivariate normal, `\(MVN(\mathbf{0}, \mathbf{I})\)`.
  - Multivariate logistic, `\(MVLOGIS(\mathbf{0}, \mathbf{1})\)`.
  - Multivariate Burr, `\(MVBURR(1, \mathbf{1}, \mathbf{1})\)`.
  - Multivariate uniform, `\(MVUNIF(1)\)`.
- `\(n = 50\)`, `\(p \in \{40, 50, 70\}\)`.
- Estimate the power using `\(p\)`-values and Monte Carlo simulations.

---

## Power simulations

- Compare the AR test with three other tests for multivariate normality:
  - Energy test (Székely & Rizzo, 2005).
  - NEW test (Chen & Xia, 2023).
  - Zhou-Shao's test ("Tn test") (Zhou & Shao, 2014).
- Power evaluation: estimate how often the test rejects `\(H_0\)` under different alternatives at a pre-specified Type I error rate `\(\alpha = 0.05\)`.

---

## Power simulations
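
The power evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code behind the reported results: `estimate_power()` and `chisq_pvalue()` are hypothetical helper names, and the chi-squared test is only a simple stand-in for the AR test and its competitors.

``` r
# Generic Monte Carlo power estimate: the share of simulated data sets
# for which the test's p-value falls below alpha.
estimate_power <- function(rgen, pvalue_fun, B = 500, alpha = 0.05) {
  pvals <- replicate(B, pvalue_fun(rgen()))
  mean(pvals < alpha)
}

# Stand-in test of H0: N_p(0, I) (NOT the AR test): under H0 the sum of
# all squared entries is chi-squared with n * p degrees of freedom.
chisq_pvalue <- function(x) {
  s <- sum(x^2)
  df <- length(x)
  2 * min(pchisq(s, df), pchisq(s, df, lower.tail = FALSE))
}

set.seed(1)
n <- 50
p <- 40

# Data generators: the null model and a multivariate t alternative with 5 df
# (each row is a standard normal vector divided by sqrt(chi-squared_5 / 5)).
r_null <- function() matrix(rnorm(n * p), n, p)
r_mvt  <- function() matrix(rnorm(n * p), n, p) / sqrt(rchisq(n, df = 5) / 5)

size_hat  <- estimate_power(r_null, chisq_pvalue)  # empirical Type I error rate
power_hat <- estimate_power(r_mvt, chisq_pvalue)   # empirical power against MVT
```

Passing the data generator and the p-value function as arguments keeps the loop independent of the test, so any of the compared tests could be plugged in as `pvalue_fun`.
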

---

# References

Chen, H. & Xia, Y. (2023). A Normality Test for High-dimensional Data Based on the Nearest Neighbor Approach. Journal of the American Statistical Association, 118, 719-731, https://doi.org/10.1080/01621459.2021.1953507

Györfi, L. & Van Der Meulen, E. C. (1991). A consistent goodness of fit test based on the total variation distance. In Nonparametric Functional Estimation and Related Topics (pp. 631-645). Dordrecht: Springer Netherlands.

Kuismin, M. (2025). Using the rejection sampling for finding tests. arXiv preprint arXiv:2509.10325, https://doi.org/10.48550/arXiv.2509.10325

Székely, G. J. & Rizzo, M. L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58-80, https://doi.org/10.1016/j.jmva.2003.12.002

Zhou, M. & Shao, Y. (2014). A powerful test for multivariate normality. Journal of Applied Statistics, 41, 351-363, https://doi.org/10.1080/02664763.2013.839637

---

class: middle, title-slide

.pull-left[
# Thank you!
<br/>
]

.pull-right[
.right[
<img style="border-radius: 10%;" src="Rlogo.png" width="150px" alt="R logo" />

<!-- [
github.io](https://markkukuismin.github.io/)<br/> --> [
github.io](https://markkukuismin.github.io/)<br/> [
markku.kuismin@oulu.fi](mailto:markku.kuismin@oulu.fi)<br> ]]