Software

Machine Learning software programs

The Machine Learning group has a long history of developing state-of-the-art research software that finds use in many other projects.

Featured Program

BMLS

BMLS (Bayesian Multi-label Learning with Sparse Features and Labels) implements a state-of-the-art Bayesian multi-label classification model that improves both performance and efficiency by exploiting sparsity in the features and the labels. In addition, by leveraging label correlations, BMLS works effectively even when the labels of a datum are partially missing.

Related paper:

[AISTATS] Bayesian multi-label learning with sparse features and labels, and label co-occurrences

Read more and download


All software


Bayesreg: Bayesian penalized regression with continuous shrinkage prior densities

This is a comprehensive, user-friendly MATLAB toolbox implementing the state-of-the-art in Bayesian linear regression and Bayesian logistic regression developed by Daniel Schmidt in conjunction with Enes Makalic at the University of Melbourne. The toolbox provides highly efficient and numerically stable implementations of ridge, lasso, horseshoe and horseshoe+ regression. The lasso, horseshoe and horseshoe+ priors are recommended for data sets where the number of predictors is greater than the sample size. The toolbox allows predictors to be assigned to logical groupings (potentially overlapping, so that predictors can be part of multiple groups). This can be used to exploit a priori knowledge regarding predictors and how they may be related to each other (for example, in grouping genetic data into genes and collections of genes such as pathways).

To support analysis of data with outliers, we provide two heavy-tailed error models in our implementation of Bayesian linear regression: Laplace and Student-t distribution errors. Most features are straightforward to use and the toolbox can work directly with MATLAB tables (including automatically handling categorical variables), or you can use standard MATLAB matrices.
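
The character of these shrinkage priors is easy to see by sampling from them. The sketch below (plain NumPy, illustrative only; the toolbox itself is MATLAB) compares a ridge (Gaussian) prior with a horseshoe prior, whose half-Cauchy local scales concentrate mass near zero while keeping heavy tails:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Ridge prior: beta ~ N(0, 1).
ridge = rng.normal(0.0, 1.0, n)

# Horseshoe prior: beta_j ~ N(0, lambda_j^2), lambda_j ~ half-Cauchy(0, 1).
lam = np.abs(rng.standard_cauchy(n))
horseshoe = rng.normal(0.0, 1.0, n) * lam

# The horseshoe puts far more mass near zero (aggressive shrinkage of
# noise coefficients) while its heavy tails let true signals escape.
def near_zero(x):
    return np.mean(np.abs(x) < 0.1)

print(f"mass within 0.1 of zero: ridge {near_zero(ridge):.2f}, "
      f"horseshoe {near_zero(horseshoe):.2f}")
```

This is exactly why these priors suit problems with more predictors than samples: most coefficients are shrunk hard toward zero while a few large ones survive.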

Read more and download
See the arXiv manuscript

Chordalysis

The aim of Chordalysis is to learn the structure of graphical models (of the joint distribution) for datasets with 1,000+ variables. It performs a forward search over junction trees, an attractive class of models because the result can be read as either a Bayesian network or a Markov random field. Chordalysis supports standard statistical tests (with correction for multiple testing), as well as Bayesian criteria such as QNML or MML.

Learn more about Chordalysis

Papers:

DBA - Dynamic Time Warping Barycenter Averaging

DBA is a method for averaging a set of time series that takes into account non-linear warping of the time axis.
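
A minimal sketch of the idea (illustrative Python, not the released implementation): DTW-align every series to the current average, then update each coordinate of the average to the mean of all values aligned to it, and iterate.

```python
import numpy as np

def dtw_path(a, b):
    """DTW cost matrix with backtracking; returns the optimal warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = (a[i - 1] - b[j - 1]) ** 2 + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path

def dba_iteration(average, series):
    """One DBA update: align every series to the average, then replace each
    coordinate of the average by the mean of the values aligned to it."""
    buckets = [[] for _ in average]
    for s in series:
        for i, j in dtw_path(average, s):
            buckets[i].append(s[j])
    return np.array([np.mean(b) for b in buckets])

# Two series tracing the same bump, one with a longer flat start.
series = [np.array([0., 1., 2., 1., 0.]),
          np.array([0., 0., 1., 2., 1., 0.])]
avg = series[0].copy()
for _ in range(5):
    avg = dba_iteration(avg, series)
print(avg)
```

Note how the different-length series is absorbed by warping rather than by distorting the shape of the average, which is the point of averaging under DTW instead of point-wise.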

Learn more about DBA

Papers:

ICDM 2017: Generating synthetic time series to augment sparse datasets (http://francois-petitjean.com/Research/ForestierPetitjean2017-ICDM.pdf)

FastEE - Fast Ensemble of Elastic Distances

FastEE is a fast, scalable, state-of-the-art time series classification algorithm. It is a more efficient version of the Ensemble of Elastic Distances (EE), a major component of HIVE-COTE, one of the most accurate TSC algorithms. FastEE combines 11 1-NN time series classifiers, each using a different elastic distance measure, and speeds up EE by exploiting the relationship between each distance measure and its parameters.

Papers:

Learn more about FastEE

HDP - Hierarchical Dirichlet Processes

Two of the central questions in Machine Learning are "what can you abstract from your data?" and "when should you trust your data?". In this work, we show how to hierarchically smooth a categorical probability distribution; how much we smooth, and when, is all learned using the theory behind Hierarchical Dirichlet Processes. The software is very easy to use: you simply give it a set of observations about a target variable and some known variables, and you can then query the model for probability estimates. This could be the probability of getting cancer given some categorical variables about age, height, weight and risk factors, or the next word someone is going to write given the previous ones. The theory is very solid and the software versatile.
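
The flavour of the method can be sketched with a simple fixed-strength back-off smoother: each context's estimate is shrunk toward its parent context's estimate. In the sketch below (illustrative only) the smoothing strength `alpha` is a hypothetical fixed constant, whereas the HDP software learns how much to smooth at every node.

```python
from collections import defaultdict

class BackoffSmoother:
    """Hierarchical smoothing of P(y | x1, ..., xk): the empty context is
    smoothed toward uniform, and each longer context toward its parent."""

    def __init__(self, alpha=2.0):
        self.alpha = alpha
        self.joint = defaultdict(lambda: defaultdict(int))  # context -> y -> count
        self.totals = defaultdict(int)                      # context -> count
        self.y_values = set()

    def fit(self, rows):
        for *context, y in rows:
            self.y_values.add(y)
            # Count the row under every prefix of its context.
            for k in range(len(context) + 1):
                ctx = tuple(context[:k])
                self.joint[ctx][y] += 1
                self.totals[ctx] += 1
        return self

    def prob(self, context, y):
        # Base case: uniform over the observed outcome values.
        p = 1.0 / len(self.y_values)
        # Walk from the empty context down to the full one, shrinking
        # each level's empirical frequencies toward the level above.
        for k in range(len(context) + 1):
            ctx = tuple(context[:k])
            n, n_y = self.totals[ctx], self.joint[ctx][y]
            p = (n_y + self.alpha * p) / (n + self.alpha)
        return p

rows = [("old", "smoker", "cancer"), ("old", "smoker", "cancer"),
        ("old", "nonsmoker", "healthy"), ("young", "nonsmoker", "healthy")]
m = BackoffSmoother().fit(rows)
# An unseen context backs off gracefully to its parent contexts.
print(m.prob(("young", "smoker"), "cancer"))
```

The estimates for an unseen context never collapse to zero: they inherit information from the shorter contexts, which is the "when should you trust your data" question answered by borrowing strength from the hierarchy.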

Paper: [Machine Learning 2018] Accurate estimation of conditional categorical probability distributions using Hierarchical Dirichlet Processes (http://francois-petitjean.com/Research/Petitjean2018-HDP.pdf)

Learn more about HDP

MetaTM

MetaTM (Topic Modelling with Metadata) is a suite of software packages implementing a series of state-of-the-art topic models for text analysis, which leverage metadata such as document labels and word embeddings to boost the performance and interpretability of topic modelling. In particular, MetaTM performs significantly better when analysing short texts, such as tweets or news abstracts. MetaTM can be used to visualise the topical content of a given corpus, as well as to perform tasks like document classification and clustering.

MetaTM consists of three packages:

  • MetaLDA incorporates binary metadata, with a scalable multi-thread Java implementation.
  • MetaFTM enables topics to focus on the most related words, informed by word embeddings.
  • MIGA is a topic-based document clustering model informed by document labels.

Related papers:

Read more and download

NARM

NARM (Node Attribute Relational Model) implements a state-of-the-art Bayesian random graph model that incorporates node metadata (attributes) into a relational graph. NARM can be used to model different kinds of relational graphs, such as social networks, bibliographic networks, and drug interactions, for tasks such as link prediction and community detection.

Related paper:

Read more and download

NBVAE

NBVAE (Negative-Binomial VAE) implements a state-of-the-art Variational AutoEncoder (VAE) for modelling discrete data such as text or relational data. NBVAE achieves improved performance on multiple tasks, including text analysis, collaborative filtering, and multi-label classification. NBVAE is implemented in TensorFlow and runs efficiently on GPUs.

Related paper:

Read more and download

ROCKET

ROCKET is a fast, state-of-the-art method for time series classification. It is much faster than other methods of comparable accuracy, and scales to much larger datasets.
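
The core of ROCKET is a transform based on many random convolutional kernels, whose simple pooled features then feed a linear classifier. Below is a simplified NumPy sketch of such a transform (illustrative; the released implementation differs in details such as padding and optimisation):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_kernels(num_kernels, input_length):
    """Sample ROCKET-style kernels: random length, weights, bias, dilation."""
    kernels = []
    for _ in range(num_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(0.0, 1.0, length)
        weights -= weights.mean()            # mean-centre the weights
        bias = rng.uniform(-1.0, 1.0)
        max_exp = np.log2((input_length - 1) / (length - 1))
        dilation = int(2 ** rng.uniform(0, max_exp))
        kernels.append((weights, bias, dilation))
    return kernels

def transform(x, kernels):
    """Two features per kernel: the maximum response, and the proportion
    of positive values (PPV) of the dilated convolution."""
    feats = []
    for weights, bias, dilation in kernels:
        span = (len(weights) - 1) * dilation
        conv = np.array([
            np.dot(weights, x[i:i + span + 1:dilation]) + bias
            for i in range(len(x) - span)])
        feats.extend([conv.max(), (conv > 0).mean()])
    return np.array(feats)

x = np.sin(np.linspace(0, 6, 50))
features = transform(x, random_kernels(100, len(x)))
print(features.shape)  # two features per kernel
```

Because the kernels are never trained, the transform is a single cheap pass over the data; all the learning happens in the linear classifier fitted on these features, which is where the method's speed comes from.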

Related paper:

Read more and download

TSI - Time Series Indexing

Time series indexing is an important task in time series analysis. For example, you might want to find the stock that has performed most similarly to yours over all the days of trading and all the stocks of the past 20 years, or the crop that has evolved most similarly to yours in a database of all the crops in the world. Time series have intrinsic properties that make them hard to index, essentially because two time series can be considered similar even if they progress at different speeds, with a bit of time lag, and so on. In this work we show how to query time series databases very efficiently.
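
One standard ingredient for searching under time warping is a cheap lower bound that lets most candidates be discarded without computing the full DTW distance. The sketch below shows LB_Keogh-style pruning (an illustrative bound for equal-length series, not necessarily the exact scheme used by TSI):

```python
import numpy as np

def lb_keogh(query, candidate, radius):
    """LB_Keogh lower bound on the (squared-error) DTW distance: build an
    envelope around the query and sum how far the candidate falls outside
    it. Cheap to compute and never overestimates the true DTW distance."""
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - radius): i + radius + 1]
        lo, hi = window.min(), window.max()
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (c - lo) ** 2
    return total

def search_1nn(query, database, radius, dtw_distance):
    """Nearest-neighbour search: skip any candidate whose lower bound
    already exceeds the best true DTW distance found so far."""
    best, best_idx = np.inf, -1
    for idx, cand in enumerate(database):
        if lb_keogh(query, cand, radius) >= best:
            continue  # pruned without computing DTW
        d = dtw_distance(query, cand)
        if d < best:
            best, best_idx = d, idx
    return best_idx, best

q = np.array([0., 1., 2., 1., 0.])
c = np.array([0., 0., 3., 1., 0.])
print(lb_keogh(q, c, radius=1))  # only the spike at index 2 leaves the envelope
```

The wider the allowed warping radius, the looser (smaller) the bound, so choosing the radius trades pruning power against the flexibility of the matches.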

Paper: [SDM 2017] Indexing millions of time series under time warping (http://francois-petitjean.com/Research/Petitjean2017-SDM.pdf)

Learn more about TSI

TSL

TSL (Topic Structure Learning) is a suite of software packages implementing a series of state-of-the-art topic models for text analysis, which discover topics with informative structures from documents in an unsupervised way. TSL delivers significantly better performance and interpretability when analysing both long and short documents. TSL can be used to visualise the topical structures of a given corpus, as well as to perform tasks like document classification and clustering.

TSL consists of two packages:

  • WEDTM discovers sub-topics for each normal topic, which give finer-grained semantic interpretations of both topics and documents.
  • DirBN automatically discovers tree-structured topic hierarchies, where topics on the higher levels are more general than those on the lower levels.

Related papers:

Learn more about TSL

Time Series Regression

Time Series Regression is a repository containing different models for time series regression: the task of predicting a numeric value from a time series. This task is also commonly known as scalar-on-function regression in the statistics community.
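
As a baseline illustration of the task (not one of the repository's models), one can map each series to a handful of global summary features and fit ordinary least squares; the repository's models learn far richer representations:

```python
import numpy as np

def summary_features(x):
    """A few global features of one series: location, spread, range, trend."""
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]
    return np.array([x.mean(), x.std(), x.min(), x.max(), slope])

rng = np.random.default_rng(1)
# Toy task: predict the amplitude of a noisy sine from the raw series.
amps = rng.uniform(1.0, 5.0, 200)
X = np.stack([a * np.sin(np.linspace(0, 4 * np.pi, 60))
              + rng.normal(0, 0.1, 60) for a in amps])

F = np.stack([summary_features(x) for x in X])
F = np.hstack([F, np.ones((len(F), 1))])       # intercept column
w, *_ = np.linalg.lstsq(F, amps, rcond=None)   # least-squares fit
rmse = float(np.sqrt(np.mean((F @ w - amps) ** 2)))
print(f"RMSE: {rmse:.3f}")
```

Here the standard deviation and range features almost determine the amplitude, so even this crude baseline fits well; real TSER benchmarks are far less forgiving.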

Papers:

Learn more about TSR

Skopus

Skopus is a method for discovering interesting patterns in sequential data. This data could, for example, be the sequence of pages browsed by a visitor to your website, the sequence of actions taken by a client, or a sequence of decisions made about a process. Skopus is completely unsupervised: rather than requiring expected patterns to be defined in advance, it extracts the patterns that are the most unexpected.

Learn more about Skopus

Paper: http://francois-petitjean.com/Research/Petitjean2016-Skopus.pdf

ALR - Accelerated Higher Order Logistic Regression

The aim of this work is to develop a preconditioner for logistic regression with higher-order features. Higher-order features allow for a low-bias learner (required for big data), while our preconditioner makes it possible to train the model in far fewer iterations.
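
The effect of preconditioning can be illustrated with a simple diagonal scheme (an assumption for illustration, not the paper's preconditioner): scale each coordinate's gradient by the inverse of an upper bound on the corresponding diagonal Hessian entry, so features on very different scales converge at comparable rates.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(X, y, w):
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def fit(X, y, steps=300, precondition=True):
    """Gradient descent on the logistic loss. The preconditioner divides
    coordinate j's gradient by 0.25 * sum_i X_ij^2, an upper bound on the
    j-th diagonal entry of the Hessian."""
    n, d = X.shape
    curv = 0.25 * (X ** 2).sum(axis=0)
    step = 1.0 / curv if precondition else np.full(d, 1.0 / curv.max())
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= step * (X.T @ (p - y))
    return w

# Features on wildly different scales: the setting where this pays off.
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = (X @ np.array([2.0, 0.2, 0.02]) + rng.normal(0, 0.5, 200) > 0).astype(float)

loss_plain = logistic_loss(X, y, fit(X, y, precondition=False))
loss_pre = logistic_loss(X, y, fit(X, y, precondition=True))
print(f"plain GD: {loss_plain:.3f}   preconditioned: {loss_pre:.3f}")
```

Plain gradient descent must use a step size dictated by the worst-scaled feature, so the others barely move; the preconditioned run reaches a much lower loss in the same number of iterations, which is the phenomenon ALR exploits for high-order feature spaces.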

Learn more about ALR

Paper: [Machine Learning 2016] Accelerated Higher Order Logistic Regression. (http://francois-petitjean.com/Research/ALR.pdf)

Concept Drift

Concept drift research tackles learning from data whose distribution moves over time. For example, you might be studying a disease that is mutating: some characteristics stay similar, so you should make the most of them, while others change, so you should adapt to them. This work gives a conceptual framework for analysing drift in existing datasets. The accompanying software provides a tool for generating synthetic data with specified drift.
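
A minimal example of what generating a stream with synthetic drift looks like (an illustrative sketch, not the released generator): here the class-conditional means slide gradually from one concept to another over a chosen window.

```python
import numpy as np

rng = np.random.default_rng(7)

def drifting_stream(n, drift_start, drift_end):
    """Binary-classification stream with gradual drift: before drift_start,
    class 1 sits at mean +1 and class 0 at mean -1; between drift_start and
    drift_end the means slide linearly until the classes have swapped sides."""
    X, y = [], []
    for t in range(n):
        # Drift magnitude: 0 before, 1 after, linear in between.
        d = np.clip((t - drift_start) / (drift_end - drift_start), 0.0, 1.0)
        label = int(rng.integers(0, 2))
        mean = (1 - 2 * d) * (1.0 if label else -1.0)
        X.append(rng.normal(mean, 1.0))
        y.append(label)
    return np.array(X), np.array(y)

X, y = drifting_stream(2000, drift_start=800, drift_end=1200)
# Before the drift, class 1 lies on the positive side; after it, the negative.
print(X[:800][y[:800] == 1].mean(), X[1200:][y[1200:] == 1].mean())
```

A learner trained on the early segment and never updated would see its accuracy collapse through the drift window, which is exactly the behaviour such generators let you study under controlled conditions.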

Paper: [Data Mining and Knowledge Discovery 2016] Characterizing Concept Drift (http://arxiv.org/pdf/1511.03816v6)

Learn more about Concept Drift