Data analytics: Characterising disease pathways using machine learning

This project is a collaboration of scientists working in two HBP Subprojects:

  • HPAC Platform (SP7)
  • Medical Informatics Platform (SP8)


The aim of this project is to characterise complete disease pathways, from the molecular level up to observable disorders of cognition and behaviour, and to identify unique combinations of biological and clinical signals associated with specific pathways. To pursue this goal, data mining will be used to identify biological signatures of disease across different scales (biological, anatomical, physiological and clinical variables). The data mining will use established machine learning algorithms running on HPC systems. The project also plans to demonstrate the capabilities of HPC resources for large-scale machine learning.


  • Research data: imaging, SNP and clinical variables (size <1TB)


  • Semi-supervised clustering algorithm
    • Matlab, R, native or docker
    • Output: Rules-based classification (size <1GB)
    • Algorithms and benchmarks: Density-based algorithm compared with state-of-the-art “black box” methods
    • Reference time on 120 HPC nodes: less than 1 hour
  • Support vector machine classifier, trained and cross-validated
    • Matlab, R, native or docker
    • Output: Image-based weights (size <1GB)
    • Algorithms and benchmarks: The results were compared with clinical diagnoses made by neurologists (expert knowledge)
    • Reference time on 120 HPC nodes: less than 1 hour
  • Deep learning algorithm
    • Matlab, R, Java
    • Output: Neural net for feature learning (size <1GB)
    • Algorithms and benchmarks:
      • Neural net/stacked auto-encoder
      • Atlas-based features compared with random-based features
      • Compared to clinical label
    • Reference time on 120 HPC nodes: less than 12 hours
  • Monte-Carlo bi-clustering
    • Output: Co-modules between gene expression and brain morphologies (size <1GB)
    • Algorithms and benchmarks: Compared with a separate clustering approach
    • Reference time on 120 HPC nodes: less than 24 hours
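
As an illustration of the "trained and cross-validated" step listed for the support vector machine classifier, the following is a minimal sketch in Python. It is not the project's implementation (which is listed as Matlab, R, native or docker). It assumes a linear SVM fitted by batch subgradient descent on the regularised hinge loss, 5-fold cross-validation, and synthetic two-dimensional data standing in for imaging features. All function names, parameter values and the toy data are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Fit a linear SVM by batch subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1.
    lam, lr and epochs are illustrative values, not tuned settings."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Gradient of the regulariser; the bias is left unregularised.
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # margin violated, so the hinge loss is active
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return the predicted label (+1 or -1) for one sample."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and return
    the accuracy obtained on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        w, b = train_linear_svm([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        correct = sum(predict(w, b, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


def make_toy_data(n_per_class=50, seed=0):
    """Two well-separated Gaussian clusters (hypothetical stand-in
    for per-subject imaging features with a binary clinical label)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = cross_validate(X, y)
    print("per-fold accuracy:", scores)
    print("mean accuracy:", sum(scores) / len(scores))
```

In this sketch the learned weight vector w plays the role of the per-feature weights named as the classifier output, and the per-fold accuracies are the kind of figure that could be set against the neurologists' diagnoses. Because the folds are independent of each other, they could in principle be distributed across HPC nodes.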


  • Identify multimodal clinical data [done]
  • Data pre-processing [done]
  • Data transfer
  • Implementation and testing of the different algorithms
  • Model configuration
  • Benchmark algorithms

Problems to be solved

The following problems require special attention and need to be solved in co-design with the users for implementing the use case successfully.

  • Container-based deployment of the software on HPC systems.