This blog is written based on my invited talk at the Molecular Epidemiology Research Lab, Max-Delbrück-Centrum für Molekulare Medizin (MDC) in Germany. I would like to thank Dr. Sara Moazzen (Postdoctoral Researcher) and Prof. Dr. Tobias Pischon (Molecular Epidemiology Research Group Leader) for the invitation and for arranging my talk.

Disclaimer: Blogs constitute only the opinion of the author.

Introduction

A complex disease, also known as a multifactorial disease, is a disease that results from a combination of genetic, environmental, and lifestyle factors, many of which have yet to be identified. Complex diseases account for 70% of all global deaths1. Common examples of complex diseases include noncommunicable diseases such as cancer, diabetes, cardiovascular diseases, Parkinson’s disease, depressive disorders, and psychotic spectrum disorders1.

Complex diseases exhibit a high degree of heterogeneity in terms of signs, symptoms, underlying causal mechanisms, and the number of genetic and non-genetic risk factors involved1. This heterogeneity within the classical classification of complex diseases is the main challenge in research and clinical practice, and it makes disease management difficult. The clinical variability between patients diagnosed with the same complex disease suggests the existence of disease subtypes. Classification by clinicians based on clinical manifestations, physical examinations, blood assay results, and histological appearance only partially reflects the truly heterogeneous character of these diseases. Several contemporary studies have attempted to subtype complex diseases using different kinds of data. Disease subtyping, which involves grouping patients into distinct subgroups based on phenomic, metabolomic, epigenomic, proteomic, transcriptomic, and genomic data, is a promising strategy for improving diagnosis, prediction, treatment, prevention, and prognosis2,3.

The availability of advanced technologies for genome-wide profiling, electronic health records (EHR), large-scale cohort and birth registry data, and novel data-driven methods provides potential for successful subtyping of complex diseases and for tailoring disease prevention and treatment to individual differences4. Classical data-driven methods include group-based trajectory modeling, latent class growth analysis, growth mixture modeling, general growth mixture modeling, latent class analysis, and latent transition analysis3. iCluster and biclustering are also contemporary approaches for big data integration and disease subtyping4,5.
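To make one of these methods concrete, here is a minimal sketch of latent class analysis fitted with the expectation-maximization (EM) algorithm on simulated binary symptom indicators. It is an illustration only and is not taken from any of the cited studies: the function names (simulate_patients, fit_lca), the two hypothetical symptom profiles, and the smoothing constants are all assumptions made for this example, and real analyses would use dedicated software and formal model selection.

```python
import math
import random


def simulate_patients(n_per_group, seed=1):
    """Simulate binary symptom indicators for two hypothetical subtypes."""
    rng = random.Random(seed)
    profiles = [
        [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],  # subtype 0: first three symptoms common
        [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],  # subtype 1: last three symptoms common
    ]
    data, truth = [], []
    for label, profile in enumerate(profiles):
        for _ in range(n_per_group):
            data.append([1 if rng.random() < p else 0 for p in profile])
            truth.append(label)
    return data, truth


def fit_lca(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to binary indicators with EM.

    Returns (class_weights, item_probs, responsibilities), where
    responsibilities[i][k] is the posterior probability that patient i
    belongs to class k.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting values for P(symptom present | class).
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each patient.
        resp = []
        for row in data:
            log_post = []
            for k in range(n_classes):
                lp = math.log(weights[k])
                for x, p in zip(row, probs[k]):
                    lp += math.log(p if x == 1 else 1.0 - p)
                log_post.append(lp)
            top = max(log_post)  # subtract the maximum for numerical stability
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update class sizes and symptom probabilities.
        for k in range(n_classes):
            nk = sum(r[k] for r in resp)
            weights[k] = max(nk / len(data), 1e-9)
            for j in range(n_items):
                num = sum(r[k] * row[j] for r, row in zip(resp, data))
                # Light smoothing keeps probabilities strictly inside (0, 1).
                probs[k][j] = (num + 0.5) / (nk + 1.0)
    return weights, probs, resp


data, truth = simulate_patients(100)
weights, probs, resp = fit_lca(data, n_classes=2)
# Assign each patient to the class with the highest posterior probability.
assigned = [max(range(2), key=lambda k: r[k]) for r in resp]
print("estimated class sizes:", [round(w, 2) for w in weights])
```

The class numbering is arbitrary (class 0 in the output need not correspond to subtype 0 in the simulation), which is one reason subtype labels from different studies are hard to compare directly.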

Over the past decade, data-driven approaches have proven effective in exploring heterogeneity and biological drivers of complex diseases6. Additionally, these approaches can aid in identifying high-risk population groups, selecting interventions for specific patient groups with similar phenotypes, and evaluating patient prognosis7. Thus, data-driven methods have the potential to contribute to precision medicine by addressing challenges related to heterogeneity in diagnosis and treatment selection. Data-driven clustering methods, also known as unsupervised classification, identify subgroups within a sample on the basis of the observed data only. These methods differ from supervised classification, in which an algorithm predicting group membership is developed in a sample in which membership is known and then applied to new samples in which membership is unknown.
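The difference between the two settings can be sketched with a toy example. The code below is an illustration under stated assumptions, not a method from the cited studies: the two simulated "biomarkers", the group labels "A" and "B", and the helper names (kmeans, train_nearest_centroid, predict) are hypothetical. The k-means step sees only the measurements, whereas the nearest-centroid classifier is trained on a sample with known membership and then applied to a new patient.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, n_iter=50, seed=0):
    """Unsupervised: group points using only the observed values."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen patients
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
    return labels, centroids


def train_nearest_centroid(points, known_labels):
    """Supervised: learn one centroid per group from known membership."""
    groups = {}
    for p, lab in zip(points, known_labels):
        groups.setdefault(lab, []).append(p)
    return {lab: mean_point(members) for lab, members in groups.items()}


def predict(model, point):
    """Apply the trained model to a sample with unknown membership."""
    return min(model, key=lambda lab: squared_distance(point, model[lab]))


# Two simulated biomarkers for two hypothetical patient groups.
rng = random.Random(42)
group_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
group_b = [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
patients = group_a + group_b

# Unsupervised: no labels are given; cluster numbers are arbitrary.
cluster_labels, centroids = kmeans(patients, k=2)

# Supervised: labels are known in the training sample.
model = train_nearest_centroid(patients, ["A"] * 50 + ["B"] * 50)
new_patient = [4.6, 5.3]
print("predicted group for new patient:", predict(model, new_patient))
```

In the unsupervised case the number of clusters has to be chosen by the analyst, which is one source of the varying subtype counts reported in the reviews summarized below.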

In this blog, we share our experience and provide a summary of multiomics-based data-driven subtypes of complex diseases: schizophrenia, depressive disorders, Parkinson’s disease, diabetes, and cancer. These diseases are highly prevalent and place a substantial burden on patients, families, communities, and healthcare systems8-10. We also discuss the challenges associated with applying data-driven methods and highlight solutions to fully harness their capabilities.

Precision medicine and multiomics data

The term precision medicine received wider attention in January 2015, when President Obama unveiled plans for a national “precision medicine initiative” to promote the development and use of genomic tools in health care11. Precision medicine is an emerging approach for deep characterization of diseases12, precisely assessing patients’ risk and prognosis13,14, and precisely prescribing the right drugs to the right patients15, considering individual variability in molecular biomarkers, environment, and lifestyle16. Precision medicine offers a remarkable opportunity to help physicians better comprehend and practice medicine, predict the needs of their patients, share health data, improve health, and address health disparities13,14,16,17. The diversity of data for precision medicine can be ensured by including diverse populations and collecting diverse types of data, including exposome or environtome18, metabolomics19, and proteomics18 data. Precision medicine will continue to transform healthcare in the coming decade as it expands in key areas: huge cohorts, artificial intelligence (AI), routine clinical genomics, phenomics and environment, and returning value across diverse populations20.

Over the past 20 years, advances in genomic technology have enabled unparalleled access to the information contained within the human genome and provided rich evidence for the implementation of precision medicine. However, the multiple genetic variants associated with various diseases typically account for only a small fraction of the disease risk, and the expression of our genes fluctuates over time and in response to the environment. This may be due to the multifactorial nature of disease mechanisms, the strong impact of the environment, sociocultural factors, gene-environment interactions and epigenomics, and changes in metabolomics and proteomics19,21. Thus, the ability to combine and harness the explosion of omics data may offer additional insights for precision medicine and will be critical to improving treatments for patients.

Contemporary studies show that precision medicine has already gone beyond genomics and adopted a panoptic view through deep phenotyping using clinical laboratory tests, metabolomics technologies, and advanced noninvasive imaging data from diverse populations22-25. For example, a three-year precision medicine study showed that integrating whole-genome sequencing with deep phenotyping (metabolomics, advanced imaging, and clinical laboratory tests) and family/medical history helps to identify a high percentage of genotype and phenotype associations in dyslipidemia, cardiomyopathy, arrhythmia and other cardiac diseases, and diabetes and other endocrine diseases in adults22. Other evidence suggests that the accuracy and utility of current diabetes prediction models might improve, and better support precision medicine, when genetic risk is combined with clinical risk factors, age, race or ethnicity, and the natural history of disease26.

No field in science and medicine today remains untouched by big data18. Precision medicine lends itself to big data or “informatics” approaches and is focused on storing, accessing, sharing, and studying these data while taking necessary precautions to protect patients’ privacy27. The promises of precision medicine will be more quickly realized by expanding collaborations to rapidly process and interpret the growing volumes of omics data27. A confluence of biological, physical, engineering, computer, and health sciences is setting the stage for a transformative leap toward data-driven, mechanism-based health and health care for each individual. With better control of chronic disease; smaller, faster, and more successful clinical trials; and avoidance of unnecessary tests and ineffective therapies, the slope of the health care cost curve could decline28.

Subtypes of complex diseases

Schizophrenia: Habtewold and colleagues6 conducted a systematic review of 53 cross-sectional and longitudinal data-driven studies among adult patients diagnosed with schizophrenia spectrum disorders (SSD) (Table 1). Four studies identified two to five subtypes of SSD based on positive symptoms. Despite this variation, four subtypes were consistently reported across the studies. Ten studies identified three to five subtypes based on negative symptoms. Although the reports were highly inconsistent, the identified subtypes shared four main features. By combining positive and negative symptoms, 11 studies identified two to five subtypes of SSD. Again, despite the inconsistencies among reports, the reported subtypes shared four main features. Moreover, 23 studies reported three to five subtypes of SSD based on cognitive performance, with most of them identifying three subtypes. Patients with different subtypes of SSD differed in sociodemographic characteristics, clinical outcomes, daily functioning, and quality of life.

Major depressive disorder: Beijers and colleagues29 conducted a systematic review of 29 cross-sectional and longitudinal data-driven studies among individuals with depression using psychometric, biochemical, neuroimaging, and genetic data (Table 1). Six of the 29 reviewed studies consistently identified two subtypes (i.e., high and low) of depression based on biochemical/metabolomic data extracted from plasma, cerebrospinal fluid, urine, and basal ganglia. Nine of the 29 reviewed studies reported two to five subtypes of depression using clinician- or subject-scored symptom data assessed with various psychometric instruments. Two subtypes was the most frequent finding (four of the nine studies). Furthermore, five studies identified two to four subtypes of depression using neuroimaging data assessed by resting-state functional magnetic resonance imaging, canonical correlation analysis, and diffusion tensor imaging based on fractional anisotropy (FA) scores of white matter. Almost all of these (four of the five studies) identified two subtypes of depression. Finally, one of the reviewed studies identified two subtypes of depression using SNPs associated with major depressive disorder, whereas another study identified five subtypes of depression by combining sociodemographic data, clinical questionnaire scores, resting-state functional connectivity measures, and various biomarkers.

Parkinson’s disease: Recently, Lee and colleagues30 reviewed data-driven subtyping studies among patients with Parkinson’s disease (PD) according to clinical, neuroimaging, biochemical, genetic, and transcriptomic data (Table 1). The reviewed studies identified two to four subtypes based on motor symptoms, two subtypes based on non-motor symptoms, and two to four subtypes by combining motor and non-motor symptoms. Two to three subtypes of PD were identified based on structural and functional neuroimaging data. Moreover, two to four subtypes were identified using genetic data, and three subtypes were identified using biochemical/metabolomic data.

Another contemporary systematic review of data-driven studies by Pourzinal and colleagues31 described severity-based and domain-based subtypes of PD based on cognitive impairment. Severity-based subtyping studies (9 studies) identified three to five subtypes, the majority reporting three subtypes ranging from cognitively intact to severely impaired. Domain-based studies (11 studies) identified two to six subtypes, with most reporting four subtypes. Earlier, van Rooden and colleagues32 conducted a systematic review of seven data-driven studies among patients with PD. Six of the seven studies used motor symptoms for subtyping, five used motor symptoms and measures of cognition, and five used depressive symptoms, age at onset, and measures of disease progression. Broadly, the reviewed studies reported two to five subtypes of Parkinson’s disease. Six of the seven studies identified an “old age-at-onset and rapid disease progression” subtype, and five identified a “young age-at-onset and slow disease progression” subtype.

Diabetes mellitus: Sarría-Santamera and colleagues33 conducted a systematic review of 14 cross-sectional and longitudinal data-driven studies among patients with diabetes mellitus (DM) using multiomics data extracted from electronic health records, healthcare databases, and previously conducted longitudinal observational cohort studies and surveys (Table 1). Eight of the 14 reviewed studies used clinical data and biomarkers for subtyping, four used biomarkers only, one used sociodemographic, clinical, and biomarker data, and one used 73 variables. The reviewed studies identified two to five subtypes of DM, with six studies identifying five subtypes.

Cancer: Zhao and colleagues34 reviewed 20 data-driven studies using microarray and RNA-seq, mutation, microRNA (miRNA), copy number variation (CNV), and DNA methylation data in patients mainly diagnosed with breast cancer, colorectal cancer, pancreatic ductal adenocarcinoma (PDAC), leukemia, lymphoma, pancreatic cancer, and glioblastoma (Table 1). Four studies identified four to ten subtypes of breast cancer using molecular data. Despite inconsistent naming and numbers of subtypes across studies, Zhao and colleagues concluded that breast tumors fall primarily into three major subtypes. Seven studies identified three to six subtypes of colorectal cancer using mRNA data. Despite these inconsistencies, the Colorectal Cancer Subtyping Consortium (CRCSC) classified colorectal cancer into four robust subtypes. Moreover, three studies identified two to three subtypes of PDAC using miRNA and mRNA data, whereas two studies identified two to 16 subtypes of leukemia using mRNA and methylation data. Other studies identified two to four subtypes of pancreatic cancer, lymphoma, and lung cancer based on mRNA data. Finally, one study identified two subtypes of glioblastoma using miRNA data, and another study identified 11 subtypes of cancer using microarray, RNA sequencing, qPCR, NanoString, and tissue microarray data. Patient characteristics in each subtype were not evaluated.

References

1             Johansson, Å. et al. Precision medicine in complex diseases-Molecular subgrouping for improved prediction and treatment stratification. J Intern Med 294, 378-396 (2023). https://doi.org/10.1111/joim.13640

2             Biswas, S. & Hasija, Y. in Big Data Analytics for Healthcare (ed Pantea Keikhosrokiani) 63-72 (Academic Press, 2022).

3             Muthén, B. & Muthén, L. K. Integrating person-centered and variable-centered analyses: growth mixture modeling with latent trajectory classes. Alcohol Clin Exp Res 24, 882-891 (2000).

4             Nguyen, T., Tagett, R., Diaz, D. & Draghici, S. A novel approach for data integration and disease subtyping. Genome Res 27, 2025-2039 (2017). https://doi.org/10.1101/gr.215129.116

5             Xie, J., Ma, A., Fennell, A., Ma, Q. & Zhao, J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 20, 1449-1464 (2019). https://doi.org/10.1093/bib/bby014

6             Habtewold, T. D. et al. A systematic review and narrative synthesis of data-driven studies in schizophrenia symptoms and cognitive deficits. Transl Psychiatry 10, 244 (2020). https://doi.org/10.1038/s41398-020-00919-x

7             Mori, M., Krumholz, H. M. & Allore, H. G. Using Latent Class Analysis to Identify Hidden Clinical Phenotypes. JAMA 324, 700-701 (2020). https://doi.org/10.1001/jama.2020.2278

8             Charlson, F. J. et al. Global Epidemiology and Burden of Schizophrenia: Findings From the Global Burden of Disease Study 2016. Schizophr Bull 44, 1195-1203 (2018). https://doi.org/10.1093/schbul/sby058

9             Ou, Z. et al. Global Trends in the Incidence, Prevalence, and Years Lived With Disability of Parkinson’s Disease in 204 Countries/Territories From 1990 to 2019. Front Public Health 9, 776847 (2021). https://doi.org/10.3389/fpubh.2021.776847

10           Global, regional, and national burden of diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the Global Burden of Disease Study 2021. Lancet 402, 203-234 (2023). https://doi.org/10.1016/s0140-6736(23)01301-6

11           Juengst, E., McGowan, M. L., Fishman, J. R. & Settersten, R. A., Jr. From “Personalized” to “Precision” Medicine: The Ethical and Social Implications of Rhetorical Reform in Genomic Medicine. Hastings Cent Rep 46, 21-33 (2016). https://doi.org/10.1002/hast.614

12           Ashley, E. A. Towards precision medicine. Nat Rev Genet 17, 507-522 (2016). https://doi.org/10.1038/nrg.2016.86

13           Landry, L. G., Ali, N., Williams, D. R., Rehm, H. L. & Bonham, V. L. Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Affairs 37, 780-785 (2018).

14           Fine, M. J., Ibrahim, S. A. & Thomas, S. B. The role of race and genetics in health disparities research. American journal of public health 95, 2125-2128 (2005). https://doi.org/10.2105/ajph.2005.076588

15           Letai, A. Functional precision cancer medicine-moving beyond pure genomics. Nat Med 23, 1028-1035 (2017). https://doi.org/10.1038/nm.4389

16           Watson, K. S. et al. Adapting a conceptual framework to engage diverse stakeholders in genomic/precision medicine research. Health Expect (2022). https://doi.org/10.1111/hex.13486

17           Geneviève, L. D., Martani, A., Shaw, D., Elger, B. S. & Wangmo, T. Structural racism in precision medicine: leaving no one behind. BMC Med Ethics 21, 17 (2020). https://doi.org/10.1186/s12910-020-0457-8

18           Özdemir, V. et al. Personalized medicine beyond genomics: alternative futures in big data-proteomics, environtome and the social proteome. J Neural Transm (Vienna) 124, 25-32 (2017). https://doi.org/10.1007/s00702-015-1489-y

19           Rattray, N. J. W. et al. Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics 12, 4 (2018). https://doi.org/10.1186/s40246-018-0134-x

20           Denny, J. C. & Collins, F. S. Precision medicine in 2030-seven ways to transform healthcare. Cell 184, 1415-1419 (2021). https://doi.org/10.1016/j.cell.2021.01.015

21           Mostafavi, H. et al. Variable prediction accuracy of polygenic scores within an ancestry group. Elife 9 (2020). https://doi.org/10.7554/eLife.48376

22           Hou, Y. C. et al. Precision medicine integrating whole-genome sequencing, comprehensive metabolomics, and advanced imaging. Proc Natl Acad Sci U S A 117, 3053-3062 (2020). https://doi.org/10.1073/pnas.1909378117

23           Rahman, S. et al. Quo vadis now: Beyond genomics to an era of personalised medicine. J Inherit Metab Dis 45, 129-131 (2022). https://doi.org/10.1002/jimd.12487

24           Snyderman, R. & Spellmeyer, D. Precision medicine: beyond genomics to targeted therapies. Per Med 13, 97-100 (2016). https://doi.org/10.2217/pme.15.48

25           Pfohl, U. et al. Precision Oncology Beyond Genomics: The Future Is Here-It Is Just Not Evenly Distributed. Cells 10 (2021). https://doi.org/10.3390/cells10040928

26           Mercader, J. M., Ng, M. C. Y., Manning, A. K. & Rich, S. S. Predicting diabetes risk in diverse populations: what next? Lancet Diabetes Endocrinol 9, 808-810 (2021). https://doi.org/10.1016/s2213-8587(21)00287-4

27           Madhavan, S., Subramaniam, S., Brown, T. D. & Chen, J. L. Art and Challenges of Precision Medicine: Interpreting and Integrating Genomic Data Into Clinical Practice. Am Soc Clin Oncol Educ Book 38, 546-553 (2018). https://doi.org/10.1200/edbk_200759

28           Hawgood, S., Hook-Barnard, I. G., O’Brien, T. C. & Yamamoto, K. R. Precision medicine: Beyond the inflection point. Sci Transl Med 7, 300ps317 (2015). https://doi.org/10.1126/scitranslmed.aaa9970

29           Beijers, L., Wardenaar, K. J., van Loo, H. M. & Schoevers, R. A. Data-driven biological subtypes of depression: systematic review of biological approaches to depression subtyping. Mol Psychiatry 24, 888-900 (2019). https://doi.org/10.1038/s41380-019-0385-5

30           Lee, S. H. et al. Parkinson’s Disease Subtyping Using Clinical Features and Biomarkers: Literature Review and Preliminary Study of Subtype Clustering. Diagnostics (Basel) 12 (2022). https://doi.org/10.3390/diagnostics12010112

31           Pourzinal, D. et al. Systematic review of data-driven cognitive subtypes in Parkinson disease. Eur J Neurol 29, 3395-3417 (2022). https://doi.org/10.1111/ene.15481

32           van Rooden, S. M. et al. The identification of Parkinson’s disease subtypes using cluster analysis: a systematic review. Mov Disord 25, 969-978 (2010). https://doi.org/10.1002/mds.23116

33           Sarría-Santamera, A., Orazumbekova, B., Maulenkul, T., Gaipov, A. & Atageldiyeva, K. The Identification of Diabetes Mellitus Subtypes Applying Cluster Analysis Techniques: A Systematic Review. Int J Environ Res Public Health 17 (2020). https://doi.org/10.3390/ijerph17249523

34           Zhao, L., Lee, V. H. F., Ng, M. K., Yan, H. & Bijlsma, M. F. Molecular subtyping of cancer: current status and moving toward clinical applications. Brief Bioinform 20, 572-584 (2019). https://doi.org/10.1093/bib/bby026

35           Rouzier, R. et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin Cancer Res 11, 5678-5685 (2005). https://doi.org/10.1158/1078-0432.Ccr-04-2421

36           Prasuhn, J. & Brüggemann, N. Genotype-driven therapeutic developments in Parkinson’s disease. Mol Med 27, 42 (2021). https://doi.org/10.1186/s10020-021-00281-8

37           van Smeden, M., Harrell, F. E., Jr. & Dahly, D. L. Novel diabetes subgroups. Lancet Diabetes Endocrinol 6, 439-440 (2018). https://doi.org/10.1016/s2213-8587(18)30124-4

38           Ghosh, J. & Acharya, A. Cluster ensembles. WIREs Data Mining and Knowledge Discovery 1, 305-315 (2011). https://doi.org/10.1002/widm.32

39           Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning 52, 91-118 (2003).