Software

Machine Learning software programs

The Machine Learning group has a long history of developing state-of-the-art research software that finds use in many other projects.

Featured Program

BMLS

BMLS (Bayesian Multi-label Learning with Sparse Features and Labels) implements a state-of-the-art Bayesian multi-label classification model that improves performance and efficiency by exploiting sparsity in both features and labels. By also exploiting label correlations, BMLS remains effective when the labels of a datum are partially missing.

Read more and download

All software

Bayesreg
Chordalysis
DBA - Dynamic Time Warping Barycenter Averaging
FastEE
HDP - Hierarchical Dirichlet Processes
MetaTM
NARM
NBVAE
ROCKET
TSI - Time Series Indexing
TSL
Time Series Regression
Skopus
ALR - Accelerated Higher Order Logistic Regression
Concept Drift

Bayesreg: Bayesian penalized regression with continuous shrinkage prior densities

This is a comprehensive, user-friendly MATLAB toolbox implementing the state-of-the-art in Bayesian linear regression and Bayesian logistic regression developed by Daniel Schmidt in conjunction with Enes Makalic at the University of Melbourne. The toolbox provides highly efficient and numerically stable implementations of ridge, lasso, horseshoe and horseshoe+ regression. The lasso, horseshoe and horseshoe+ priors are recommended for data sets where the number of predictors is greater than the sample size. The toolbox allows predictors to be assigned to logical groupings (potentially overlapping, so that predictors can be part of multiple groups). This can be used to exploit a priori knowledge regarding predictors and how they may be related to each other (for example, in grouping genetic data into genes and collections of genes such as pathways).

To support analysis of data with outliers, we provide two heavy-tailed error models in our implementation of Bayesian linear regression: Laplace and Student-t distribution errors. Most features are straightforward to use and the toolbox can work directly with MATLAB tables (including automatically handling categorical variables), or you can use standard MATLAB matrices.

Chordalysis

The aim of Chordalysis is to learn the structure of graphical models (of the joint distribution) for datasets with 1,000+ variables. It performs a forward search over junction trees, a class of models from which you can obtain either a Bayesian Network or a Markov Random Field as output. Chordalysis supports standard statistical testing (with multiple-testing correction), as well as Bayesian criteria such as QNML or MML.

Learn more about Chordalysis

Papers:

ICDM 2013: Scaling log-linear analysis to high-dimensional data (http://francois-petitjean.com/Research/Petitjean2013-ICDM.pdf)
ICDM 2014: A statistically efficient and scalable method for log-linear analysis of high-dimensional data (http://francois-petitjean.com/Research/Petitjean2014-ICDM-MML.pdf)
SDM 2015: Scaling log-linear analysis to datasets with thousands of variables (http://francois-petitjean.com/Research/Petitjean2015-SDM.pdf)
KDD 2016: A multiple test correction for streams and cascades of statistical hypothesis tests (http://francois-petitjean.com/Research/WebbPetitjean2016-KDD.pdf)
Behaviormetrika 2018: Experiments with Learning Graphical Models on Text (http://francois-petitjean.com/Research/Capdevila2018-Behaviormetrika.pdf)

DBA - Dynamic Time Warping Barycenter Averaging

DBA is an averaging method that takes into account non-linear warping of the time axis.

Learn more about DBA

Papers:

Pattern Recognition 2011: A global averaging method for Dynamic Time Warping (http://francois-petitjean.com/Research/Petitjean2011-PR.pdf)
ICDM 2014: Dynamic Time Warping Averaging of Time Series allows Faster and more Accurate Classification (http://francois-petitjean.com/Research/Petitjean2014-ICDM-DTW.pdf)
ICDM 2017: Generating synthetic time series to augment sparse datasets (http://francois-petitjean.com/Research/ForestierPetitjean2017-ICDM.pdf)

FastEE - Fast Ensemble of Elastic Distances

FastEE is a fast and scalable state-of-the-art time series classification (TSC) algorithm. It is a more efficient version of the Ensemble of Elastic Distances (EE), a major component of HIVE-COTE, one of the most accurate TSC algorithms. FastEE contains 11 1-NN time series classifiers, each with a different distance measure. FastEE speeds up EE by leveraging the relationship of each distance measure with its parameters.

Papers:

Learn more about FastEE

HDP - Hierarchical Dirichlet Processes

Two of the central questions in Machine Learning are "what can you abstract from your data?" and "when should you trust your data?". In this work, we show how to hierarchically smooth a categorical probability distribution; how much and when to smooth are both learned using the theory behind Hierarchical Dirichlet Processes. The software is very easy to use: you give it a set of observations about a target variable and some known variables, and you can then query the model for probability estimates. This could be the probability of getting cancer given some categorical variables about age, height, weight and risk factors, or the next word someone is going to write given the previous ones. The theory is very solid and the software versatile.
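The idea of smoothing a sparse estimate towards a more general one can be illustrated with a minimal sketch. This is not the HDP software's API or its inference procedure: the concentration value `c` is fixed by hand here, whereas the software learns how much to smooth, and the function name and the data are hypothetical.

```python
from collections import Counter

def smoothed_estimate(child_counts, parent_probs, c=2.0):
    """Blend counts from a specific context with a parent estimate.

    p(y) = (n_y + c * parent(y)) / (n + c), where n is the number of
    observations in the specific context. A larger c means more smoothing.
    """
    n = sum(child_counts.values())
    return {
        y: (child_counts.get(y, 0) + c * p) / (n + c)
        for y, p in parent_probs.items()
    }

# Hypothetical counts over the whole population (the parent context).
overall = Counter({"cancer": 10, "healthy": 90})
total = sum(overall.values())
parent = {y: k / total for y, k in overall.items()}

# A specific context with a single observation: the raw estimate would be
# p(cancer) = 1.0, which should not be trusted.
few = Counter({"cancer": 1})
estimate = smoothed_estimate(few, parent)
```

With one observation the estimate stays close to the parent distribution; as more observations accumulate in the specific context, the counts dominate.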

Paper: [Machine Learning 2018] Accurate estimation of conditional categorical probability distributions using Hierarchical Dirichlet Processes (http://francois-petitjean.com/Research/Petitjean2018-HDP.pdf)

Learn more about HDP

MetaTM

MetaTM (Topic Modelling with Metadata) consists of software packages for a series of state-of-the-art topic models for text analysis, which leverage metadata such as document labels and word embeddings to boost the performance and interpretability of topic modelling. In particular, MetaTM performs significantly better when analysing short texts, such as tweets or news abstracts. MetaTM can be used to visualise the topical content of a given corpus, as well as to perform tasks like document classification and clustering.

MetaTM consists of three packages:

MetaLDA incorporates binary metadata, with a scalable multi-thread Java implementation.
MetaFTM enables topics to focus on the most related words, informed by word embeddings.
MIGA is a topic-based document clustering model informed by document labels.

Read more and download

NARM

NARM (Node Attribute Relational Model) implements a state-of-the-art Bayesian random graph model that incorporates node metadata (attributes) in a relational graph. NARM can be used for modelling different kinds of relational graphs, such as social networks, bibliographic networks, and drug interactions, in the tasks of link prediction and community detection.

Related paper:

[ICML] Leveraging Node Attributes for Incomplete Relational Data

Read more and download

NBVAE

NBVAE (Negative-Binomial VAE) is the software of a state-of-the-art Variational AutoEncoder (VAE) for modelling discrete data such as texts or relational data. NBVAE achieves improved performance on multiple tasks including text analysis, collaborative filtering, and multi-label classification. NBVAE is implemented in TensorFlow and runs efficiently with GPUs.

Related paper:

[AISTATS] Variational Autoencoders for Sparse and Overdispersed Discrete Data

Read more and download

ROCKET

ROCKET is a fast, state-of-the-art method for time series classification. It is much faster than other methods of comparable accuracy and can scale to much larger datasets.
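A minimal sketch of a transform based on random convolutional kernels is given below. It is a simplification for illustration and not the released ROCKET code: it omits dilation and padding, it stops before the linear classifier that would be trained on the features, and all function names are assumptions.

```python
import random

def random_kernel(rng):
    """Draw one random kernel as a (weights, bias) pair."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]  # centre the weights on zero
    bias = rng.uniform(-1.0, 1.0)
    return weights, bias

def apply_kernel(series, kernel):
    """Slide one kernel over a series and pool the outputs to two features."""
    weights, bias = kernel
    k = len(weights)
    outputs = [
        sum(w * x for w, x in zip(weights, series[i:i + k])) + bias
        for i in range(len(series) - k + 1)
    ]
    # Proportion of positive values (PPV) among the convolution outputs.
    ppv = sum(1 for o in outputs if o > 0) / len(outputs)
    return max(outputs), ppv

def transform(series, kernels):
    """Map a series to a flat feature vector: (max, PPV) per kernel."""
    features = []
    for kernel in kernels:
        features.extend(apply_kernel(series, kernel))
    return features

rng = random.Random(0)
kernels = [random_kernel(rng) for _ in range(100)]
series = [float(i % 10) for i in range(50)]  # toy series, length 50
features = transform(series, kernels)        # 2 features per kernel
```

Because the kernels are random and never trained, the only learning happens in the classifier fitted on the resulting features, which is where the speed comes from.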

Related paper:

[Data Mining and Knowledge Discovery 2020] Exceptionally fast and accurate time series classification using random convolutional kernels

Read more and download

TSI - Time Series Indexing

Time series indexing (TSI) is an important task in time series analysis. For example, you may want to find the stock that has performed most similarly to yours across all trading days and all stocks of the past 20 years, or the crop that has evolved most similarly to yours in a database of all the crops in the world. Time series have intrinsic properties that make them hard to index, essentially because two time series can be considered similar even if they progress at different speeds or with a small time lag. In this work we show how to query time series databases very efficiently.
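The effect of a time lag on similarity can be seen in a small sketch comparing Euclidean distance with dynamic time warping (DTW). This is a textbook DTW implementation for illustration only, not the indexing method of the paper; the function names and the toy series are assumptions.

```python
import math

def euclidean(a, b):
    """Point-by-point distance between two series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """DTW distance with squared point costs and no warping window."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# The same peak, shifted by two time steps.
peak = [0, 0, 1, 2, 1, 0, 0, 0]
lagged = [0, 0, 0, 0, 1, 2, 1, 0]

d_euclid = euclidean(peak, lagged)  # penalises the lag
d_dtw = dtw(peak, lagged)           # warps the time axis to absorb the lag
```

Each DTW comparison fills an n-by-m table, so scanning a large database one series at a time is expensive, which is the motivation for indexing.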

Paper: [SDM 2017] Indexing millions of time series under time warping (http://francois-petitjean.com/Research/Petitjean2017-SDM.pdf)

Learn more about TSI

TSL

TSL (Topic Structure Learning) consists of software packages for a series of state-of-the-art topic models for text analysis, which discover topics with informative structures from documents in an unsupervised way. TSL enjoys significantly better performance and interpretability for analysing both long and short documents. TSL can be used to visualise the topical structures of a given corpus, as well as to perform tasks like document classification and clustering.

TSL consists of two packages:

WEDTM discovers sub-topics for each normal topic, which give finer-grained semantic interpretations of both topics and documents.
DirBN automatically discovers tree-structured topic hierarchies, where topics on the higher levels are more general than those on the lower levels.

Time Series Regression

Time series regression is a repository that contains different models for time series regression, the task of predicting a numeric value given a time series. This task is also commonly known as Scalar-On-Function regression in the statistics community.

Papers:

Learn more about TSR

Skopus

Skopus is a method to discover interesting patterns from sequential data. This data could for example be the sequence of pages browsed by the visitor of your website, the sequence of actions taken by a client or a sequence of decisions made about a process. Skopus is completely unsupervised and extracts the patterns that are the most unexpected rather than defining expected patterns.

Learn more about Skopus

Paper: http://francois-petitjean.com/Research/Petitjean2016-Skopus.pdf

ALR - Accelerated Higher Order Logistic Regression

The aim of this work is to develop a preconditioner for logistic regression with high-order features. High-order features allow for a low-bias learner (required for big data), while our preconditioner makes it possible to train the model in far fewer iterations.
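As a rough illustration of what high-order features look like for categorical data, the sketch below builds one feature per combination of attribute values. It covers only feature construction, not the preconditioner, and it is not the ALR code: the function name and the example instance are hypothetical.

```python
from itertools import combinations

def higher_order_features(instance, order=2):
    """Return one feature per combination of `order` attribute-value pairs.

    Each feature is a tuple of (attribute, value) pairs; a logistic
    regression model would learn one weight per distinct feature.
    """
    items = sorted(instance.items())  # fixed attribute order for stable keys
    return list(combinations(items, order))

# Hypothetical instance with three categorical attributes.
instance = {"age": "30-40", "smoker": "yes", "sex": "F"}
features = higher_order_features(instance, order=2)
```

The number of such features grows quickly with the order and the number of attributes, which is why the number of training iterations matters.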

Learn more about ALR

Paper: [Machine Learning 2016] Accelerated Higher Order Logistic Regression. (http://francois-petitjean.com/Research/ALR.pdf)

Concept Drift

Concept drift research tackles learning from data when the distribution from which it is sampled changes over time. For example, you might be studying a disease that mutates over time: some characteristics stay similar, so you should make the most of them, while others change, so you should adapt to those. This work gives a conceptual framework for analysing drift in existing datasets. The attached software provides a tool to generate synthetic data containing drift.
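A toy example of synthetic data with drift is sketched below. It is not the attached generator: it produces a single abrupt drift in which the labelling rule is inverted halfway through the stream, and all names and parameters are assumptions.

```python
import random

def generate_stream(n, drift_at, rng):
    """Return n (x, y) pairs whose labelling rule flips at index `drift_at`."""
    stream = []
    for t in range(n):
        x = rng.random()
        # Before the drift, x > 0.5 means class 1; afterwards the rule is inverted.
        label = int(x > 0.5) if t < drift_at else int(x <= 0.5)
        stream.append((x, label))
    return stream

def accuracy(rule, data):
    """Fraction of pairs for which the rule predicts the stored label."""
    return sum(rule(x) == y for x, y in data) / len(data)

def old_rule(x):
    """The concept that held before the drift."""
    return int(x > 0.5)

rng = random.Random(42)
stream = generate_stream(n=1000, drift_at=500, rng=rng)

before = accuracy(old_rule, stream[:500])  # the rule matches the old concept
after = accuracy(old_rule, stream[500:])   # the same rule after the drift
```

Here the distribution of `x` never changes; only the relationship between `x` and the label does, so a model trained before the drift fails afterwards even though the inputs look the same.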

Paper: [Data Mining and Knowledge Discovery 2016] Characterizing Concept Drift (http://arxiv.org/pdf/1511.03816v6)

Learn more about Concept Drift