BMLS (Bayesian Multi-label Learning with Sparse Features and Labels) implements a state-of-the-art Bayesian multi-label classification model that improves performance and efficiency by exploiting the sparsity in both features and labels. In addition, by leveraging label correlations, BMLS works effectively when the labels of a datum are only partially observed.
- DBA - Dynamic Time Warping Barycenter Averaging
- HDP - Hierarchical Dirichlet Processes
- TSI - Time Series Indexing
- Time Series Regression
- ALR - Accelerated Higher Order Logistic Regression
- Concept Drift
Bayesreg: Bayesian penalized regression with continuous shrinkage prior densities
This is a comprehensive, user-friendly MATLAB toolbox implementing the state-of-the-art in Bayesian linear regression and Bayesian logistic regression developed by Daniel Schmidt in conjunction with Enes Makalic at the University of Melbourne. The toolbox provides highly efficient and numerically stable implementations of ridge, lasso, horseshoe and horseshoe+ regression. The lasso, horseshoe and horseshoe+ priors are recommended for data sets where the number of predictors is greater than the sample size. The toolbox allows predictors to be assigned to logical groupings (potentially overlapping, so that predictors can be part of multiple groups). This can be used to exploit a priori knowledge regarding predictors and how they may be related to each other (for example, in grouping genetic data into genes and collections of genes such as pathways).
To support analysis of data with outliers, we provide two heavy-tailed error models in our implementation of Bayesian linear regression: Laplace and Student-t distribution errors. Most features are straightforward to use and the toolbox can work directly with MATLAB tables (including automatically handling categorical variables), or you can use standard MATLAB matrices.
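To see why the lasso, horseshoe and horseshoe+ priors suit problems with more predictors than samples, it helps to look at the shape of the horseshoe prior itself. The sketch below is purely illustrative (it is not the toolbox's MATLAB API): it draws coefficients from a horseshoe prior, beta ~ N(0, (lambda * tau)^2) with lambda half-Cauchy, and shows the characteristic behaviour of an infinite spike at zero combined with heavy tails.

```python
import math
import random

random.seed(0)

def sample_horseshoe(n, tau=1.0):
    """Draw n coefficients from a horseshoe prior:
    beta_j ~ N(0, (lambda_j * tau)^2),  lambda_j ~ HalfCauchy(0, 1)."""
    draws = []
    for _ in range(n):
        u = random.random()
        # Half-Cauchy local shrinkage via the inverse CDF of a Cauchy.
        lam = abs(math.tan(math.pi * (u - 0.5)))
        draws.append(random.gauss(0.0, lam * tau))
    return draws

betas = sample_horseshoe(10000, tau=0.1)
# A large share of draws are shrunk hard towards zero ...
near_zero = sum(1 for b in betas if abs(b) < 0.05) / len(betas)
# ... but the heavy Cauchy tails still permit occasional large signals.
largest = max(abs(b) for b in betas)
print(near_zero, largest)
```

This spike-plus-tails shape is exactly what lets such priors shrink the many irrelevant coefficients aggressively while leaving genuinely large effects almost untouched.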
The aim of Chordalysis is to learn the structure of graphical models (of the joint distribution) for datasets with 1,000+ variables. It performs a forward search over junction trees, an attractive class of models because the output can be read either as a Bayesian network or as a Markov random field. Chordalysis supports standard statistical testing (with multiple-test correction) as well as Bayesian scores such as QNML and MML.
- ICDM 2013: Scaling log-linear analysis to high-dimensional data (http://francois-petitjean.com/Research/Petitjean2013-ICDM.pdf)
- ICDM 2014: A statistically efficient and scalable method for log-linear analysis of high-dimensional data (http://francois-petitjean.com/Research/Petitjean2014-ICDM-MML.pdf)
- SDM 2015: Scaling log-linear analysis to datasets with thousands of variables (http://francois-petitjean.com/Research/Petitjean2015-SDM.pdf)
- KDD 2016: A multiple test correction for streams and cascades of statistical hypothesis tests (http://francois-petitjean.com/Research/WebbPetitjean2016-KDD.pdf)
- Behaviormetrika 2018: Experiments with Learning Graphical Models on Text (http://francois-petitjean.com/Research/Capdevila2018-Behaviormetrika.pdf)
DBA - Dynamic Time Warping Barycenter Averaging
DBA is an averaging method that takes into account non-linear warping of the time axis.
- Pattern Recognition 2011: A global averaging method for Dynamic Time Warping (http://francois-petitjean.com/Research/Petitjean2011-PR.pdf)
- ICDM 2014: Dynamic Time Warping Averaging of Time Series allows Faster and more Accurate Classification (http://francois-petitjean.com/Research/Petitjean2014-ICDM-DTW.pdf)
- ICDM 2017: Generating synthetic time series to augment sparse datasets (http://francois-petitjean.com/Research/ForestierPetitjean2017-ICDM.pdf)
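The idea behind DBA can be sketched in a few lines. This is a minimal, illustrative implementation (not the released software): each iteration aligns every series to the current average under DTW, then replaces each point of the average by the mean of all the points aligned to it.

```python
def dtw_path(a, b):
    """DTW alignment between two 1-D series; returns the optimal warping path."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack to recover which pairs of points were aligned together.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path

def dba_iteration(average, series_set):
    """One DBA update: align every series to the current average under DTW,
    then set each point of the average to the mean of the points aligned to it."""
    assoc = [[] for _ in average]
    for s in series_set:
        for i, j in dtw_path(average, s):
            assoc[i].append(s[j])
    return [sum(vals) / len(vals) for vals in assoc]

# Two series with the same peak occurring at slightly different times.
series = [[0.0, 1.0, 2.0, 1.0, 0.0], [0.0, 0.0, 1.0, 2.0, 1.0]]
avg = series[0][:]          # initialise from one of the series
for _ in range(3):          # a few iterations usually suffice
    avg = dba_iteration(avg, series)
print(avg)
```

Because the averaging happens after alignment, the peak is not smeared out the way an arithmetic (point-wise) mean would smear it.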
FastEE - Fast Ensemble of Elastic Distances
FastEE is a fast and scalable state-of-the-art time series classification algorithm. It is a more efficient version of the Ensemble of Elastic Distances (EE), a major component of one of the most accurate TSC algorithms, HIVE-COTE. FastEE combines eleven 1-NN time series classifiers, each using a different distance measure, and speeds up EE by leveraging the relationship between each distance measure and its parameters.
- [SDM 2018] Efficient search of the best warping window for Dynamic Time Warping
- [DMKD] FastEE: Fast Ensembles of Elastic Distances for time series classification
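The ensemble structure is simple to sketch. The toy code below (illustrative only, and using just two distance measures rather than FastEE's eleven) shows the scheme: each measure drives its own 1-NN classifier, and the ensemble takes a majority vote. The warping-window parameter of DTW is an example of the per-measure parameters whose structure FastEE exploits.

```python
def dtw(a, b, window=None):
    """DTW distance with an optional Sakoe-Chiba warping window
    (the window size is one of the parameters an EE-style ensemble tunes)."""
    n, m = len(a), len(b)
    w = max(window if window is not None else max(n, m), abs(n - m))
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nn1(query, train, dist):
    """Label of the nearest training series under a given distance measure."""
    return min(train, key=lambda xy: dist(query, xy[0]))[1]

def ensemble_predict(query, train, measures):
    """Majority vote over the per-measure 1-NN predictions."""
    votes = [nn1(query, train, d) for d in measures]
    return max(set(votes), key=votes.count)

train = [([0, 1, 2, 1, 0], "peak"), ([2, 1, 0, 1, 2], "valley")]
label = ensemble_predict([0, 0, 1, 2, 1], train, [dtw, euclidean])
print(label)  # -> "peak"
```

The expensive part in practice is tuning each measure's parameters by cross-validation; that is the step FastEE accelerates.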
HDP - Hierarchical Dirichlet Processes
Two of the central questions in Machine Learning are "what can you abstract from your data" and "when should you trust your data". In this work, we show how to hierarchically smooth a categorical probability distribution; how much we smooth, and when, is learned using the theory behind Hierarchical Dirichlet Processes. The software is very easy to use: you simply give it a set of observations about a target variable and some known variables, and you can then query our model for probability estimates. This could be the probability of getting cancer given some categorical variables about age, height, weight and risk factors, or the next word someone is going to write given the previous ones. The theory is very solid and the software versatile.
Paper: [Machine Learning 2018] Accurate estimation of conditional categorical probability distributions using Hierarchical Dirichlet Processes (http://francois-petitjean.com/Research/Petitjean2018-HDP.pdf)
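To give a feel for what "hierarchically smoothing a categorical distribution" means, here is a deliberately simplified sketch (a fixed-strength back-off estimator, far simpler than the HDP machinery, which *learns* how much to smooth): each context's conditional estimate is shrunk towards the marginal distribution of the target.

```python
from collections import Counter

def smoothed_conditional(data, m=2.0):
    """Back-off smoothing of P(target | context): each context's estimate is
    shrunk towards the marginal target distribution.  Here `m` (the smoothing
    strength) is fixed; the HDP software learns the analogous quantity."""
    marginal = Counter(y for _, y in data)
    total = sum(marginal.values())
    joint = Counter(data)
    context = Counter(x for x, _ in data)
    def p(y, x):
        prior = marginal[y] / total
        return (joint[(x, y)] + m * prior) / (context[x] + m)
    return p

# Toy data: (risk_factor, outcome) observations.
data = [("smoker", "disease"), ("smoker", "disease"), ("smoker", "healthy"),
        ("non-smoker", "healthy"), ("non-smoker", "healthy")]
p = smoothed_conditional(data, m=2.0)
# A context seen only a few times is pulled towards the marginal rate,
# and an unseen context falls back to the marginal entirely.
print(p("disease", "smoker"), p("disease", "unseen-context"))
```

With only three "smoker" observations, the raw rate 2/3 is pulled towards the marginal 2/5, giving 0.56; an unseen context returns the marginal 0.4 exactly.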
MetaTM (Topic Modelling with Metadata) is a collection of software packages implementing a series of state-of-the-art topic models for text analysis, which leverage metadata such as document labels and word embeddings to boost the performance and interpretability of topic modelling. In particular, MetaTM performs significantly better when analysing short texts, such as tweets or news abstracts. MetaTM can be used to visualise the topical content of a given corpus, as well as to perform tasks like document classification and clustering.
MetaTM consists of three packages:
- MetaLDA incorporates binary metadata, with a scalable multi-thread Java implementation.
- MetaFTM encourages topics to focus on the most related words, informed by word embeddings.
- MIGA is a topic-based document clustering model informed by document labels.
NARM (Node Attribute Relational Model) implements a state-of-the-art Bayesian random graph model that incorporates node metadata (attributes) into a relational graph. NARM can be used to model different kinds of relational graphs, such as social networks, bibliographic networks, and drug interactions, for tasks such as link prediction and community detection.
NBVAE (Negative-Binomial VAE) implements a state-of-the-art Variational AutoEncoder (VAE) for modelling discrete data such as text or relational data. NBVAE achieves improved performance on multiple tasks, including text analysis, collaborative filtering, and multi-label classification. NBVAE is implemented in TensorFlow and runs efficiently on GPUs.
ROCKET is a fast, state-of-the-art method for time series classification. It is much faster, and scales to much larger datasets, than other methods of comparable accuracy.
- [Data Mining and Knowledge Discovery 2020] Exceptionally fast and accurate time series classification using random convolutional kernels
TSI - Time Series Indexing
TSI is an important task for time series analysis. For example, you might want to find the stock that has traded most similarly to yours over every day of the past 20 years, or the crop whose growth has evolved most similarly to yours given a database of all crops in the world. Time series have intrinsic properties that make them hard to index, essentially because two series can be considered similar even if they progress at different speeds, with some time lag, etc. In this work we show how to query time series databases very efficiently.
Paper: [SDM 2017] Indexing millions of time series under time warping (http://francois-petitjean.com/Research/Petitjean2017-SDM.pdf)
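A key ingredient for indexing under time warping is a cheap lower bound on the DTW distance, so that most candidates can be discarded without ever computing the exact distance. The sketch below implements the classic LB_Keogh bound (an illustrative building block, not the paper's index structure itself); it assumes equal-length series.

```python
def lb_keogh(query, candidate, r):
    """LB_Keogh lower bound on the DTW distance with warping window r:
    build an upper/lower envelope around the candidate and sum the squared
    amounts by which the query escapes it.  If this bound already exceeds
    the best-so-far distance, the expensive exact DTW can be skipped."""
    lb = 0.0
    for i, q in enumerate(query):
        window = candidate[max(0, i - r):i + r + 1]
        lo, hi = min(window), max(window)
        if q > hi:
            lb += (q - hi) ** 2
        elif q < lo:
            lb += (q - lo) ** 2
    return lb

query = [1.0, 2.0, 3.0, 2.0, 1.0]
candidate = [1.0, 1.0, 1.0, 1.0, 1.0]
print(lb_keogh(query, candidate, r=1))  # -> 6.0
```

Because the bound never exceeds the true DTW distance, pruning with it is exact: no true nearest neighbour is ever missed.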
- WEDTM discovers sub-topics for each normal topic, which give finer-grained semantic interpretations of both topics and documents.
- DirBN automatically discovers tree-structured topic hierarchies, where topics on the higher levels are more general than those on the lower levels.
- [NeurIPS] Dirichlet Belief Networks for Topic Structure Learning
- [ICML] Inter and Intra Topic Structure Learning with Word Embeddings
Time Series Regression
Skopus is a method to discover interesting patterns in sequential data. This data could, for example, be the sequence of pages browsed by a visitor to your website, the sequence of actions taken by a client, or a sequence of decisions made about a process. Skopus is completely unsupervised: rather than requiring you to define expected patterns, it extracts the patterns that are the most unexpected.
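The raw statistic underlying sequential pattern discovery is support: the fraction of sequences that contain a pattern as a (gapped) subsequence. The sketch below shows just that counting step, as an illustration; Skopus itself goes further and ranks patterns by how much their support deviates from expectation.

```python
def occurs(pattern, sequence):
    """True if `pattern` occurs as a subsequence of `sequence` (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so items must appear in order.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences containing the pattern: the raw statistic on
    top of which a method like Skopus scores how *unexpected* a pattern is."""
    return sum(occurs(pattern, s) for s in database) / len(database)

# Hypothetical browsing sessions on a website.
sessions = [["home", "pricing", "signup"],
            ["home", "blog", "pricing", "signup"],
            ["home", "blog"]]
print(support(["pricing", "signup"], sessions))  # 2 of 3 sessions
```

High support alone is not interesting (frequent items co-occur by chance); the interesting patterns are those whose support is surprisingly high given the frequencies of their parts.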
ALR - Accelerated Higher Order Logistic Regression
The aim of this work is to develop a preconditioner for logistic regression with higher-order features. Higher-order features allow for a low-bias learner (required for big data), while our preconditioner makes it possible to train the model in far fewer iterations.
Paper: [Machine Learning 2016] Accelerated Higher Order Logistic Regression. (http://francois-petitjean.com/Research/ALR.pdf)
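To illustrate what preconditioning buys in this setting, here is a much-simplified sketch (not the paper's method): plain gradient descent on logistic regression, but with each gradient coordinate rescaled by an upper bound on the corresponding Hessian diagonal, which lets a unit step size make rapid progress.

```python
import math

def train_logistic(X, y, steps=200, lr=1.0):
    """Logistic regression by gradient descent with a diagonal preconditioner.
    Since sigmoid'(z) <= 1/4, the j-th Hessian diagonal is bounded by
    0.25 * sum_i x_ij^2; dividing each gradient coordinate by that bound
    stabilises a unit step size -- a toy version of the idea behind ALR."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    precond = [0.25 * sum(X[i][j] ** 2 for i in range(n)) + 1e-9 for j in range(d)]
    for _ in range(steps):
        grad = [0.0] * d
        for i in range(n):
            z = sum(w[j] * X[i][j] for j in range(d))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - y[i]) * X[i][j]
        for j in range(d):
            w[j] -= lr * grad[j] / precond[j]
    return w

# Tiny separable example; first feature is a bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = train_logistic(X, y)
p = 1.0 / (1.0 + math.exp(-(w[0] + 3.0 * w[1])))
print(round(p, 3))
```

With high-order feature maps the unpreconditioned gradient is badly scaled (high-order features have very different magnitudes), which is exactly when this kind of rescaling matters most.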
Concept drift research tackles learning from data when the distribution from which it is sampled moves over time. For example, you might be studying a disease that is mutating over time: some characteristics remain stable, so you should make the most of them, while others are changing, so you should adapt to them. This work gives a conceptual framework for analysing drift in existing datasets. The attached software provides a tool to generate synthetic data exhibiting drift.
Paper: [Data Mining and Knowledge Discovery 2016] Characterizing Concept Drift (http://arxiv.org/pdf/1511.03816v6)
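A minimal example of what "synthetic data with drift" means (a toy sketch, not the released generator): a binary stream whose positive-class probability is stable, drifts linearly over a window, then stabilises again.

```python
import random

def drift_stream(n, start=0.2, end=0.8, drift_start=0.4, drift_end=0.6, seed=1):
    """Generate a binary stream whose P(y=1) drifts linearly from `start`
    to `end` over the fraction [drift_start, drift_end] of the stream --
    stable before the drift and stable after it."""
    rng = random.Random(seed)
    ys = []
    for t in range(n):
        frac = t / n
        if frac <= drift_start:
            p = start
        elif frac >= drift_end:
            p = end
        else:
            p = start + (end - start) * (frac - drift_start) / (drift_end - drift_start)
        ys.append(1 if rng.random() < p else 0)
    return ys

ys = drift_stream(10000)
before = sum(ys[:3000]) / 3000   # well inside the stable pre-drift region
after = sum(ys[-3000:]) / 3000   # well inside the stable post-drift region
print(before, after)
```

Varying the width of the drift window moves smoothly between abrupt and gradual drift, which is useful for stress-testing adaptive learners under known conditions.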