Bayesreg: Bayesian penalized regression with continuous shrinkage prior densities
Bayesreg is a comprehensive, user-friendly MATLAB toolbox implementing state-of-the-art Bayesian linear regression and Bayesian logistic regression. It was developed by Daniel Schmidt in conjunction with Enes Makalic at the University of Melbourne. The toolbox provides highly efficient and numerically stable implementations of ridge, lasso, horseshoe and horseshoe+ regression. The lasso, horseshoe and horseshoe+ priors are recommended for data sets where the number of predictors is greater than the sample size. Predictors can be assigned to logical groupings, which may overlap so that a predictor can belong to multiple groups. This lets you exploit a priori knowledge about how predictors relate to each other (for example, grouping genetic data into genes and collections of genes such as pathways).
To support analysis of data with outliers, we provide two heavy-tailed error models in our implementation of Bayesian linear regression: Laplace and Student-t distribution errors. Most features are straightforward to use and the toolbox can work directly with MATLAB tables (including automatically handling categorical variables), or you can use standard MATLAB matrices.
Chordalysis
The aim of Chordalysis is to learn the structure of graphical models (of the joint distribution) for datasets with 1,000+ variables. It performs a forward search over junction trees, a class of models from which you can obtain either a Bayesian network or a Markov random field as output. Chordalysis supports standard statistical testing (with multiple-testing correction), as well as Bayesian alternatives such as QNML or MML.
- ICDM 2013: Scaling log-linear analysis to high-dimensional data (http://francois-petitjean.com/Research/Petitjean2013-ICDM.pdf)
- ICDM 2014: A statistically efficient and scalable method for log-linear analysis of high-dimensional data (http://francois-petitjean.com/Research/Petitjean2014-ICDM-MML.pdf)
- SDM 2015: Scaling log-linear analysis to datasets with thousands of variables (http://francois-petitjean.com/Research/Petitjean2015-SDM.pdf)
- KDD 2016: A multiple test correction for streams and cascades of statistical hypothesis tests (http://francois-petitjean.com/Research/WebbPetitjean2016-KDD.pdf)
- Behaviormetrika 2018: Experiments with Learning Graphical Models on Text (http://francois-petitjean.com/Research/Capdevila2018-Behaviormetrika.pdf)
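To give a feel for the kind of statistical test involved in log-linear analysis, here is a minimal sketch of a G-test of independence on a 2x2 contingency table, followed by a plain Bonferroni correction. This is not Chordalysis code: the function names are hypothetical, the test is restricted to two binary variables, and Bonferroni stands in for the more refined corrections described in the papers above.

```python
import math


def g_test_2x2(table):
    """Return (G, p_value) for a test of independence on a 2x2 table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row_sums[i] * col_sums[j] / total
            if observed > 0:  # a zero cell contributes nothing to G
                g += 2.0 * observed * math.log(observed / expected)
    # A 2x2 table has 1 degree of freedom, for which the chi-square
    # survival function has the closed form erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


def bonferroni_reject(p_values, alpha=0.05):
    """Flag the hypotheses rejected after a Bonferroni correction (assumed here for simplicity)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]
```

For example, `g_test_2x2([[30, 5], [5, 30]])` yields a very small p-value, while a perfectly balanced table such as `[[10, 10], [10, 10]]` gives G = 0 and a p-value of 1.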
DBA - Dynamic Time Warping Barycenter Averaging
DBA is an averaging method that takes into account non-linear warping of the time axis.
- Pattern Recognition 2011: A global averaging method for Dynamic Time Warping (http://francois-petitjean.com/Research/Petitjean2011-PR.pdf)
- ICDM 2014: Dynamic Time Warping Averaging of Time Series allows Faster and more Accurate Classification (http://francois-petitjean.com/Research/Petitjean2014-ICDM-DTW.pdf)
- ICDM 2017: Generating synthetic time series to augment sparse datasets (http://francois-petitjean.com/Research/ForestierPetitjean2017-ICDM.pdf)
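The idea behind DBA can be sketched in a few lines: align every series to the current average with DTW, then replace each point of the average by the mean of all the points aligned to it, and repeat. The code below is a simplified illustration, not the released implementation. It assumes univariate series, a squared-difference cost, no warping window, and it initialises the average with the first series (all of these are assumptions made for brevity).

```python
def dtw_path(a, b):
    """Return the optimal DTW alignment between a and b as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the last cell to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def dba_update(average, series_set):
    """One DBA iteration: average the points that DTW aligns to each position."""
    buckets = [[] for _ in average]
    for series in series_set:
        for i, j in dtw_path(average, series):
            buckets[i].append(series[j])
    return [sum(values) / len(values) for values in buckets]


def dba(series_set, iterations=10):
    """Iteratively refine an average series, starting from the first series."""
    average = list(series_set[0])
    for _ in range(iterations):
        average = dba_update(average, series_set)
    return average
```

Calling `dba([[0, 0, 1, 0], [0, 1, 0, 0]])` returns an average of the same length as the first series, with the peak preserved rather than flattened as a point-wise mean would do.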
HDP - Hierarchical Dirichlet Processes
Two central questions in Machine Learning are "what can you abstract from your data" and "when should you trust your data". In this work, we show how to hierarchically smooth a categorical probability distribution; how much we smooth and when we smooth is all learned using the theory behind Hierarchical Dirichlet Processes. The software is very easy to use: you simply give it a set of observations about a target variable and some known variables, and you can then query our model for the probability estimates. It can be the probability of getting cancer given some categorical variables about age, height, weight and risk factors; or the next word someone is going to write given the previous ones. The theory is very solid and the software versatile.
Paper: [Machine Learning 2018] Accurate estimation of conditional categorical probability distributions using Hierarchical Dirichlet Processes (http://francois-petitjean.com/Research/Petitjean2018-HDP.pdf)
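To illustrate what hierarchical smoothing means, here is a deliberately simplified sketch in which the estimate at each level of a context hierarchy is pulled towards the estimate of its parent. Unlike the HDP software, nothing is learned here: the concentration parameter is a fixed constant, the class and method names are hypothetical, and the hierarchy is simply the ordered prefixes of the context.

```python
from collections import Counter, defaultdict


class BackoffSmoother:
    """Toy hierarchical smoother with a fixed concentration (an assumption, not learned)."""

    def __init__(self, target_values, concentration=2.0):
        self.targets = list(target_values)
        self.c = concentration
        # Maps each context prefix (a tuple) to the counts of target values seen with it.
        self.counts = defaultdict(Counter)

    def observe(self, context, target):
        """Record one observation at every level of the hierarchy."""
        context = tuple(context)
        for depth in range(len(context) + 1):
            self.counts[context[:depth]][target] += 1

    def probability(self, context, target):
        """Estimate P(target | context), smoothing each level towards its parent."""
        context = tuple(context)
        p = 1.0 / len(self.targets)  # uniform distribution above the root
        for depth in range(len(context) + 1):
            node = self.counts.get(context[:depth], Counter())
            n = sum(node.values())
            p = (node[target] + self.c * p) / (n + self.c)
        return p
```

With three observations of `("old", "smoker") -> "yes"` and three of `("young", "nonsmoker") -> "no"`, querying the unseen context `("old", "nonsmoker")` falls back to what is known about `"old"` and returns 0.8 for `"yes"`.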
TSI - Time Series Indexing
Indexing is an important task in time series analysis. For example, you may want to find the stock that has performed most similarly to yours, over all the days of trading and all the stocks of the past 20 years; or the crop that has evolved most similarly to yours, given a database of all crops in the world. Time series have intrinsic properties that make them hard to index: two time series can be considered similar even if they progress at different speeds, with a bit of time lag, etc. In this work we show how to query time series databases very efficiently.
Paper: [SDM 2017] Indexing millions of time series under time warping (http://francois-petitjean.com/Research/Petitjean2017-SDM.pdf)
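As a point of comparison, the baseline that an index has to beat is a linear scan of the database under DTW. The sketch below shows such a scan, sped up with the classic LB_Keogh lower bound so that hopeless candidates are skipped without computing DTW. It is not the indexing method of the paper; it assumes equal-length univariate series, a squared-difference cost, and a fixed warping window, and all function names are hypothetical.

```python
def dtw_distance(a, b, window):
    """Squared-cost DTW between two equal-length series, limited to a warping window."""
    n = len(a)
    inf = float("inf")
    prev = [inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [inf] * (n + 1)
        for j in range(max(1, i - window), min(n, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = d + min(prev[j - 1], prev[j], curr[j - 1])
        prev = curr
    return prev[n]


def lb_keogh(query, candidate, window):
    """Lower bound on dtw_distance(query, candidate, window), computed in one pass."""
    n = len(query)
    total = 0.0
    for i, value in enumerate(candidate):
        segment = query[max(0, i - window):min(n, i + window + 1)]
        upper, lower = max(segment), min(segment)
        if value > upper:
            total += (value - upper) ** 2
        elif value < lower:
            total += (lower - value) ** 2
    return total


def nearest_neighbor(query, database, window):
    """Return (index, distance, number_pruned) of the nearest series under DTW."""
    best_index, best_dist, pruned = None, float("inf"), 0
    for index, candidate in enumerate(database):
        if lb_keogh(query, candidate, window) >= best_dist:
            pruned += 1  # the lower bound already rules this candidate out
            continue
        dist = dtw_distance(query, candidate, window)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist, pruned
```

Even with pruning, this scan still touches every series in the database, which is exactly the cost an index is meant to avoid when the database holds millions of series.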
Skopus
Skopus is a method for discovering interesting patterns in sequential data, for example the sequence of pages browsed by a visitor to your website, the sequence of actions taken by a client, or a sequence of decisions made about a process. Skopus is completely unsupervised: rather than requiring you to define expected patterns, it extracts the patterns that are the most unexpected.
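The basic building block of sequential pattern mining is checking whether a pattern occurs in a sequence and counting how often it does across a database. The sketch below shows only that ingredient, with hypothetical function names; it assumes a pattern matches when its items appear in order, not necessarily contiguously, and it does not reproduce the measure Skopus uses to decide which patterns are unexpected.

```python
def contains_pattern(sequence, pattern):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later items of the pattern must occur after earlier ones.
    return all(item in remaining for item in pattern)


def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    matches = sum(contains_pattern(sequence, pattern) for sequence in database)
    return matches / len(database)
```

For instance, in a database of browsing sessions, `support(sessions, ("home", "checkout"))` gives the share of visitors who reached the checkout page at some point after the home page.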
ALR - Accelerated Higher Order Logistic Regression
The aim of this work is to develop a preconditioner for logistic regression with high-order features. High-order features allow for a low-bias learner (required for big data), while our preconditioner makes it possible to train the model in far fewer iterations.
Paper: [Machine Learning 2016] Accelerated Higher Order Logistic Regression. (http://francois-petitjean.com/Research/ALR.pdf)
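To make "high-order features" concrete: for categorical data, an order-n feature is an indicator for a particular combination of values of n attributes. The sketch below enumerates those features for a single instance. The function name and the dictionary representation are assumptions made for illustration, and the preconditioner itself is not shown.

```python
from itertools import combinations


def higher_order_features(instance, order):
    """List the order-n features active for one instance.

    instance maps attribute names to categorical values; each feature is a
    tuple of (attribute, value) pairs, one pair per attribute in the combination.
    """
    features = []
    for attributes in combinations(sorted(instance), order):
        features.append(tuple((name, instance[name]) for name in attributes))
    return features
```

An instance with a attributes activates "a choose n" features of order n, so the feature space grows quickly with n. That growth is what makes the model low-bias, and also what makes fast convergence of the optimiser matter.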
Concept Drift
Concept drift research tackles learning from data when the distribution from which it is sampled changes over time. For example, you might be studying a disease that is mutating over time: some characteristics stay similar, so you should make the most of them, while others are changing, so you should adapt to those. This work gives the conceptual framework to allow the analysis of drift in existing datasets. The attached software provides a tool to generate synthetic data that exhibits drift.
Paper: [Data Mining and Knowledge Discovery 2016] Characterizing Concept Drift (http://arxiv.org/pdf/1511.03816v6)
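A toy example of what drifting synthetic data can look like: the sketch below generates a stream of (x, y) pairs in which the relationship between the binary attribute x and the binary class y changes abruptly at a chosen time step. It is not the generator that accompanies the paper; the function name and parameters are hypothetical, and only one simple kind of drift (an abrupt change in P(y | x)) is modelled.

```python
import random


def generate_drifting_stream(n, drift_point, p_before, p_after, seed=0):
    """Generate n (x, y) pairs whose class distribution changes at drift_point.

    Before drift_point, y equals x with probability p_before;
    from drift_point onwards, y equals x with probability p_after.
    """
    rng = random.Random(seed)  # seeded so that the stream is reproducible
    stream = []
    for t in range(n):
        x = rng.randint(0, 1)
        p_agree = p_before if t < drift_point else p_after
        y = x if rng.random() < p_agree else 1 - x
        stream.append((x, y))
    return stream
```

A classifier trained on the first part of `generate_drifting_stream(1000, 500, 0.9, 0.1)` would learn that y usually equals x, and would then be wrong most of the time after step 500 unless it adapts.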