Analysis of Large Data Sets

Contact details:

Wednesday

14:15

16:00

Participants:

Friday, 12-06-2020 - 14:15, https://lu-se.zoom.us/j/65067339175

Insights and algorithms for the multivariate square-root lasso

Aaron Molstad (University of Florida)

We study the multivariate square-root lasso, a method for
fitting the multivariate response (i.e. multi-task) linear regression model
with dependent errors. This estimator minimizes the nuclear norm of the
residual matrix plus a convex penalty. Unlike some existing methods for
multivariate response linear regression, which require explicit estimates
of the error covariance matrix or its inverse, the multivariate square-root
lasso criterion implicitly adapts to dependent errors and is convex. To
justify the use of this estimator, we establish an error bound which
illustrates that, like the univariate square-root lasso, the multivariate
square-root lasso is pivotal with respect to the unknown error covariance
matrix. Based on our theory, we propose a simple tuning approach which
requires fitting the model for only a single value of the tuning parameter,
i.e., it does not require cross-validation. We propose two algorithms to
compute the estimator: a prox-linear alternating direction method of
multipliers algorithm, and an accelerated first order algorithm which can
be applied in certain cases. In both simulation studies and a genomic data
application, we show that the multivariate square-root lasso can outperform
more computationally intensive methods which estimate both the regression
coefficient matrix and error precision matrix.
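
As a rough sketch of the criterion described in the abstract, the estimator can be written as below. The notation is assumed here rather than taken from the talk: Y is the n x q response matrix, X the n x p predictor matrix, B the p x q coefficient matrix, g a convex penalty, and lambda > 0 the tuning parameter. The 1/sqrt(n) scaling is also an assumption.

```latex
\hat{B} \in \operatorname*{arg\,min}_{B \in \mathbb{R}^{p \times q}}
  \left\{ \frac{1}{\sqrt{n}} \, \| Y - XB \|_{*} + \lambda \, g(B) \right\},
\qquad
\| A \|_{*} = \sum_{j} \sigma_j(A)
```

Here \sigma_j(A) denotes the j-th singular value of A. For q = 1 the residual is a vector and its nuclear norm equals its Euclidean norm, so with g taken as the l1 norm the criterion reduces to the univariate square-root lasso.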

Friday, 05-06-2020 - 16:30, https://lu-se.zoom.us/j/65067339175

The adaptive incorporation of multiple sources of information in Brain Imaging via penalized optimization

Damian Brzyski

The use of multiple sources of information in regression modeling has recently received a lot of attention in the statistical and brain imaging literature. This talk introduces a novel, fully automatic statistical procedure for estimating linear regression coefficients when additional information about connectivities between variables is available. Our method, “Adaptive Information Merging Estimator for Regression” (AIMER), enables the incorporation of multiple sources of such information, as well as the division of one source into pieces and the determination of their impact on the estimates. We performed extensive simulations to visualize the desired adjusting properties of our method and to show its advantages over existing approaches. We also applied AIMER to analyze structural brain imaging data and to reveal the association between cortical thickness and HIV-related outcomes.

Friday, 29-05-2020 - 16:30, https://lu-se.zoom.us/j/65067339175

Fast and robust procedures in high-dimensional variable selection

Wojciech Rejchel

We investigate the variable selection problem in the single index model Y=g(β′X,ϵ), where g is an unknown function. Moreover, we make no assumptions on the distribution of the errors, the existence of their moments, etc. We propose a computationally fast variable selection procedure based on the standard Lasso with the response variables replaced by their ranks. If the response variables are binary, our approach is even simpler: we treat their class labels as if they were numbers and apply the standard Lasso. We present theoretical and numerical results describing the variable selection properties of the methods.
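
The rank-based procedure described in this abstract can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the coordinate descent solver, the centring of the ranks, the function names, and the toy data (cubic link, Cauchy errors, lam=0.08) are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()           # residual for the starting point b = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * b[j]          # partial residual without feature j
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]
    return b


def rank_lasso(X, y, lam):
    """Lasso with the response replaced by its centred, scaled ranks (assumed scaling)."""
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1    # ranks 1..n, ties broken arbitrarily
    z = ranks / n - (n + 1) / (2 * n)        # subtract the mean of ranks / n
    Xc = X - X.mean(axis=0)                  # centre the columns, no intercept needed
    return lasso_cd(Xc, z, lam)


# Hypothetical toy data: single index model with a cubic link and Cauchy errors,
# so the errors have no finite moments.
rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [2.0, -2.0]
y = (X @ beta) ** 3 + rng.standard_cauchy(n)

b_hat = rank_lasso(X, y, lam=0.08)
selected = np.flatnonzero(b_hat)             # indices of the selected variables
```

Because only the ranks of y enter the fit, any strictly increasing transformation of the response leaves the selected set unchanged. For a binary response, the same `lasso_cd` call could be applied directly to the centred 0/1 labels.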

Friday, 22-05-2020 - 16:30, https://lu-se.zoom.us/j/65067339175

Brain imaging and wearable devices; statistical learning to the rescue

Jaroslaw Harezlak

The amount of medical data collected has been growing exponentially over the past few decades. Unfortunately, this growth in data acquisition has not been matched by comparable growth in the development of statistical learning methods. In my talk, I will give a brief overview of the analytical methods developed by my group and their applications in the medical and public health areas. Specifically, I will emphasize
(1) regularization methods applied to structural brain imaging data and
(2) signal processing techniques used to extract physical activity information from raw accelerometry data.

Friday, 08-05-2020 - 14:15, https://lu-se.zoom.us/j/65067339175

Screening rules for the lasso and SLOPE

Patrick Tardivel, Johan Larsson

Info about the seminar:
https://statistical-learning-seminars.github.io/

Thursday, 30-01-2020 - 14:15, 603

Data-Driven Kaplan-Meier One-Sided Two-Sample Tests

Grzegorz Wyłupek

In the talk, we discuss existing approaches from the literature to detecting stochastic ordering of two survival curves, and we pose and solve a novel testing problem concerning it. Specifically, the null hypothesis asserts the lack of ordering, while the alternative expresses its existence. The proposed test statistic is a functional of the standardized two-sample Kaplan-Meier process sampled at a randomly selected number of random points, namely the observed survival times in the pooled sample, and exploits the information contained in a specially defined one-sided weighted log-rank statistic. It automatically weighs the magnitude and sign of their components, which makes it a sensible procedure for the considered testing problem. As a result, the corresponding test asymptotically controls the errors of both kinds at the specified significance level α. The simulation study shows that the errors are also satisfactorily controlled for finite sample sizes. Furthermore, in comparison with the best and most popular tests, the new solution turns out to be a promising procedure that improves upon them. A real data analysis confirms these findings.

Thursday, 23-01-2020 - 14:15, 603

On irrepresentable condition for LASSO and SLOPE estimators

Patrick Tardivel

The irrepresentable condition is a well-known condition for sign recovery by LASSO.
In this talk we introduce a similar condition for model recovery by SLOPE.
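
For orientation, the irrepresentable condition for LASSO is commonly stated as below. The notation is assumed here and does not come from the abstract: X is the design matrix, \beta the true coefficient vector, S its support, and X_S the submatrix of columns indexed by S (taken to have full column rank). References differ on whether the inequality is strict.

```latex
\left\| X_{S^c}^{\top} X_S \left( X_S^{\top} X_S \right)^{-1}
  \operatorname{sign}(\beta_S) \right\|_{\infty} \le 1,
\qquad S = \{ j : \beta_j \neq 0 \}
```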

Thursday, 16-01-2020 - 14:15, 603

Finding structured estimates in matrix regression problems

Damian Brzyski (PWr)

Classical scalar-response regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. In medical applications, however, regressors often take the form of multi-dimensional arrays. For example, one may be interested in using MRI to identify which brain regions are associated with a health outcome. Vectorizing the two-dimensional image arrays is an unsatisfactory approach since it destroys the inherent spatial structure of the images and can be computationally challenging. We present an alternative approach - regularized matrix regression - where the matrix of regression coefficients is defined as the solution to a specific optimization problem. The method, called SParsity Inducing Nuclear Norm EstimatoR (SpINNEr), simultaneously imposes two penalty types on the regression coefficient matrix - the nuclear norm and the lasso norm - to encourage a low-rank matrix solution that also has entry-wise sparsity. A novel implementation of the alternating direction method of multipliers (ADMM) is used to build a fast and efficient numerical solver. Our simulations show that SpINNEr outperforms other methods in estimation accuracy when the response-related entries (representing the brain's functional connectivity) are arranged in well-connected communities. SpINNEr is applied to investigate associations between HIV disease-related outcomes and functional connectivity in the human brain.
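
A simplified sketch of an optimization problem combining the two penalties named in the abstract is given below. The notation is assumed for illustration: y_i is the scalar response, A_i the p x p matrix of regressors for subject i, B the coefficient matrix, and \lambda_N, \lambda_L >= 0 the two tuning parameters. Any additional covariates or scaling used in the actual method are omitted.

```latex
\hat{B} \in \operatorname*{arg\,min}_{B \in \mathbb{R}^{p \times p}}
  \left\{ \frac{1}{2} \sum_{i=1}^{n} \bigl( y_i - \langle A_i, B \rangle \bigr)^2
  + \lambda_N \, \| B \|_{*} + \lambda_L \, \| \operatorname{vec}(B) \|_{1} \right\},
\qquad \langle A_i, B \rangle = \operatorname{tr}\bigl( A_i^{\top} B \bigr)
```

The nuclear norm term pushes the solution toward low rank, while the entry-wise l1 term sets individual entries to zero. Setting \lambda_N = 0 gives a lasso on the vectorized matrix.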

Thursday, 05-12-2019 - 14:15, 603

Statistical challenges in mass spectrometry data analysis: shared peptides

Mateusz Staniak

Mass spectrometry (MS) is one of the most important technologies for the study of proteins. MS experiments generate massive amounts of complex data which require advanced pre-processing and careful statistical analysis.
In the bottom-up approach to MS, peptides - smaller segments of proteins - enter the mass spectrometer and thus measurements are made at the peptide level.
Because of this, one of the problems in protein quantification based on MS is the presence of peptides that can be assigned to multiple proteins.
Such peptides are referred to as shared or degenerate peptides.
Since it is not obvious how to assign the abundance of shared peptides to proteins, they are often discarded from the analysis. This leads to a loss of a substantial amount of data.
In this talk, I will first present the basics of mass spectrometry data analysis. Then, I will review existing methods for handling shared peptides.
I will finish with a summary of our progress on improving the methodology of protein quantification with shared peptides and related statistical challenges.
The talk is based on an ongoing collaboration with Tomasz Burzykowski (Hasselt University) and Jurgen Claesen (Belgian Nuclear Research Centre).

Thursday, 14-11-2019 - 14:15, 603

On the Model Selection Properties and Uniqueness of the Lasso and Related Estimators

Ulrike Schneider (Vienna University of Technology)

We investigate the model selection properties of the Lasso estimator in finite samples with no conditions on the regressor matrix X. We show that the set of covariates the Lasso estimator may potentially choose in high dimensions (where the number of explanatory variables p exceeds the sample size n) depends only on X and the given penalization weights. This set of potential covariates can be determined through a geometric condition on X and may be small enough (less than or equal to n in cardinality). Related to the geometric conditions in our considerations, we also provide a necessary and sufficient condition for uniqueness of the Lasso solutions. Finally, we discuss how these results carry over to other model selection procedures such as SLOPE.

Thursday, 07-11-2019 - 14:15, 603

Selection of colored saturated Gaussian models

Piotr Graczyk (Université d'Angers)

TBA

Tuesday, 29-10-2019 - 14:15, 605

Analysis of HDX-MS data: a pristine land for bioinformatics

Michał Burdukiewicz (MI2 DataLab, PW)

Hydrogen-deuterium exchange monitored by mass spectrometry (HDX-MS) has recently become a staple tool in studies of protein structure. The main application of this technique is to compare the structure of a protein altered by several factors (so-called states). The statistical frameworks introduced so far address the screening part of the analysis, i.e., the search for significant differences between states, but miss the post-screening phase of the analysis. We critically evaluate existing models and point out their strengths and weaknesses. Additionally, we provide a novel solution to a multi-state comparison problem where the region of interest inside the protein structure is already well-defined.

Thursday, 24-10-2019 - 14:15, 603

Counting faces of random polytopes and applications

Patrick Tardivel

Abstract in the attachment

Files:

Thursday, 17-10-2019 - 14:15, 603

Statistical inference with missing values

Wei Jiang

Missing data exist in almost all areas of empirical research. There are various reasons why missing data may occur, including survey non-response, unavailability of measurements, and lost data. In this presentation, I will share my experience with parametric estimation with missing covariates, based on likelihood methods and the Expectation-Maximization algorithm. Then I will focus on recent results in a supervised learning setting, for performing logistic regression with missing values. We illustrate the method on a dataset of severely traumatized patients from Paris hospitals to predict the occurrence of hemorrhagic shock, a leading cause of early preventable death in severe trauma cases. The methodology is implemented in the R package misaem.

Wednesday, 07-11-2018 - 14:15, 711/712

Topics on stochastic optimization and long-time approximation of stochastic processes

Fabien Panloup (Angers)

Stochastic optimization is a way of approximating minima of deterministic functions by a stochastic approach. I will begin my talk with some background on this topic and on the Robbins-Monro algorithm. Then, I will state some recent non-asymptotic results about the Ruppert-Polyak algorithm, which is an averaged version of the Robbins-Monro algorithm. In the last part, I will briefly introduce the problem of long-time approximation of diffusion processes and its link with the approximation of Gibbs distributions. I will conclude with some statistical applications of these methods. This talk is based on collaborations with Sébastien Gadat and Gilles Pagès.
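
To make the relation between the two algorithms concrete, here is a minimal sketch of a Robbins-Monro recursion together with its Ruppert-Polyak average. The step-size schedule gamma0 / n**alpha with alpha in (1/2, 1), the function names, and the quadratic toy objective are assumptions chosen for illustration, not details from the talk.

```python
import numpy as np


def robbins_monro_averaged(noisy_grad, x0, n_steps, gamma0=1.0, alpha=0.75, seed=0):
    """Run x_{n} = x_{n-1} - gamma_n * noisy_grad(x_{n-1}) and average the iterates.

    Returns the last Robbins-Monro iterate and the Ruppert-Polyak average.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    x_bar = np.zeros_like(x)
    for n in range(1, n_steps + 1):
        step = gamma0 / n ** alpha           # slowly decreasing step size (assumed schedule)
        x = x - step * noisy_grad(x, rng)    # Robbins-Monro update
        x_bar += (x - x_bar) / n             # running mean of x_1, ..., x_n
    return x, x_bar


# Hypothetical toy problem: minimise f(x) = 0.5 * ||x - target||^2 when only
# gradients corrupted by standard Gaussian noise can be observed.
target = np.array([3.0, -1.0])


def noisy_grad(x, rng):
    return (x - target) + rng.standard_normal(x.shape)


x_last, x_avg = robbins_monro_averaged(noisy_grad, x0=np.zeros(2), n_steps=20000)
```

Both `x_last` and `x_avg` approach `target`. The averaged iterate is typically the less noisy of the two, which is the motivation for the averaging step.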