research | Lingchao Mao

I develop statistical machine learning and deep learning methodologies for modeling complex datasets with high dimensionality, multi-modality, and limited supervision. Most of these methods are motivated by biomedical applications but are generalizable to other application domains.

My research has three main directions:

Machine learning with limited supervision and knowledge integration
Disentanglement and fusion of multi-modal/high-dimensional datasets
Data mining and subgroup identification

1. Machine Learning with Limited Supervision and Knowledge Integration

Labeled data is often scarce in biomedical applications, which raises the challenge of learning with limited supervision. One common strategy is weakly supervised learning, where models are trained with incomplete or noisy labels. Another is to integrate domain knowledge, improving model performance by incorporating expert insights or external data sources into the learning process.
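As a minimal illustration of training with noisy labels, the sketch below applies generic label smoothing: a hard class label is converted into a soft target so that a possibly wrong annotation is penalized less severely. This is a textbook technique shown under stated assumptions, not the specific method used in the projects listed here; the function names and the `epsilon` value are hypothetical choices for illustration.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Turn an integer class label into a smoothed target distribution.

    epsilon is the total probability mass spread uniformly over all
    classes (an illustrative default, not a recommended setting).
    """
    uniform_share = epsilon / num_classes
    target = [uniform_share] * num_classes
    target[label] += 1.0 - epsilon
    return target


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Example: a 4-class problem where the annotated class is 2.
soft_target = smooth_labels(label=2, num_classes=4, epsilon=0.1)
predicted = [0.1, 0.1, 0.7, 0.1]
loss = cross_entropy(soft_target, predicted)
```

A knowledge-informed variant could replace the uniform share with class-specific weights derived from expert input; that extension is left out here because its form depends on the application.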

Precision Medicine of Brain Cancer

a Weakly-Supervised Transfer Learning model for personalized predictive modeling of tumor cell density

Knowledge-Informed Machine Learning

a live review of KIML for cancer applications

Liver Tumor Segmentation

with clinical knowledge-informed label smoothing

2. Disentanglement and Fusion of Multi-modal/High-dimensional Datasets

With advancements in technology, high-dimensional and multi-source data are increasingly being collected for biomedical applications, including imaging, genomics, clinical questionnaires, and Molecular Dynamics (MD) simulations. Learning from multi-modal datasets can leverage complementary information and lead to improved performance for prediction tasks.

Analyzing these datasets presents interesting challenges, primarily due to the limited availability of precise labels in biomedical contexts. Additionally, some datasets may have missing modalities for a portion of the samples; for instance, not all patients may have all imaging modalities collected due to accessibility or financial constraints. Moreover, complex datasets often contain a mix of signals influenced by environmental constraints, obscuring the true patterns of interest to researchers. Models must effectively disentangle the various signal sources and fuse information from these datasets for predictive modeling and knowledge discovery.
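To make the missing-modality problem concrete, here is a minimal sketch of one simple fusion strategy: averaging only the modality embeddings that are available for a given patient. It assumes each modality has already been encoded into a vector of the same length; the function name, the modality keys, and the numbers are hypothetical, and this is not the fusion method used in the projects below.

```python
def fuse_available(embeddings):
    """Average per-modality embeddings, skipping modalities that are missing.

    embeddings maps a modality name to a list of floats, or to None when
    that modality was not collected for this sample.
    """
    present = [vec for vec in embeddings.values() if vec is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    dim = len(present[0])
    if any(len(vec) != dim for vec in present):
        raise ValueError("all embeddings must have the same length")
    # Element-wise mean over the modalities that are present.
    return [sum(vec[i] for vec in present) / len(present) for i in range(dim)]


# Example: a patient with MRI and genomics data but no PET scan.
patient = {"mri": [0.2, 0.4], "pet": None, "genomics": [0.6, 0.0]}
fused = fuse_available(patient)
```

Averaging treats all available modalities as equally informative; learned weighting or imputation of the missing modality are common alternatives when that assumption does not hold.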

Early Prediction of Alzheimer’s Disease

Multimodal learning using neuroimaging and genomics data

Discovery of Biomarkers for Persistence of Post-traumatic Headache

Multimodal learning using neuroimaging, questionnaires, and mobile-collected speech data

Analyzing Steered Molecular Dynamics

AI for understanding molecular dynamics

3. Data Mining and Predictive Modeling

Data mining is a powerful technique for knowledge discovery and information analysis in large, complex datasets. For example, we analyzed millions of tweets to identify influencers in a social network. In healthcare, we mined large-scale medical claims to identify groups of patients with similar risk of hospital readmission.

Multi-cancer Screening with cfDNA

Cancer diagnosis via a simple blood draw and machine learning

Identifying Influencers on Social Media

Social network analysis of public health influencers

Prediction of Unplanned Hospitalization of Medicare Patients

CMS AI Challenge