Synthetic Data Increases Equity for Underrepresented Populations in Cancer Clinical Trials

October 25, 2024 by Elisa Becze BA, ELS, Editor

Technologies such as the synthetic minority oversampling technique (SMOTE) allow cancer scientists to use computer-generated data that closely resemble data from understudied members of society, Laritza Rodriguez, MD, PhD, program director for the National Cancer Institute’s Center for Cancer Health Equity, said in an October 2024 blog post (https://datascience.cancer.gov/news-events/blog/synthetic-data-helps-counter-lack-diversity-data). The computational tool can improve representation and equity in clinical studies.

SMOTE, a type of machine learning, allows researchers to create new, synthetic case samples by incorporating features from the existing class samples. Because it does not just replicate the minority class samples that already exist in the data set, “you have less risk of overfitting—that is, creating data that nearly matches the original data set—which is a known drawback of other oversampling techniques that duplicate existing samples,” Rodriguez said (https://datascience.cancer.gov/news-events/blog/synthetic-data-helps-counter-lack-diversity-data).
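The idea of building new samples from features of existing ones can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly described form of SMOTE (pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them). It is not the code Rodriguez refers to, and the function name smote_sketch, the parameter names, and the toy feature values are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two made-up numeric features per sample.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)
```

Because each synthetic point lies between two real minority samples, it stays within the range of the observed feature values while not being an exact copy of any one of them.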

She said (https://datascience.cancer.gov/news-events/blog/synthetic-data-helps-counter-lack-diversity-data) that SMOTE is particularly useful when creating or testing predictive models, and she illustrated the point with a case study: “For example, a study (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679879/) of an artificial intelligence model trained on data from predominantly White populations wasn’t nearly as robust in women with a prior history of breast cancer or Hispanic women. Oversampling techniques such as SMOTE can help avoid a lack of diversity in the data so you can gain better insight into important features in these groups.”

Additionally, Rodriguez said that by increasing the number of minority class samples and balancing the class distribution, scientists can better detect invisible signals in extremely sparse data sets where prediction without oversampling is not possible (e.g., gene markers present only in a small number of members of a population).
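As a rough illustration of what balancing the class distribution means in practice, the short Python sketch below counts how many synthetic samples each smaller class would need to match the largest class. The function name synthetic_needed and the "marker" and "no_marker" labels are hypothetical, and fully equalizing the classes is an assumption made for the example, not a recommendation from the blog post.

```python
from collections import Counter


def synthetic_needed(labels):
    """Return how many extra samples each smaller class needs to match the largest.

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}


# Made-up data set: a gene marker present in 5 of 100 people.
labels = ["no_marker"] * 95 + ["marker"] * 5
needed = synthetic_needed(labels)
```

In this made-up case, 90 synthetic "marker" samples would bring the two classes to 95 each.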

Some of SMOTE’s limitations include:

As with all new technologies, “keeping the human in the loop is the best way to ensure you’ve created a true gold standard, although this often is the most expensive of the validation options,” Rodriguez said (https://datascience.cancer.gov/news-events/blog/synthetic-data-helps-counter-lack-diversity-data).

SMOTE is available in Python libraries and in other data processing libraries, Rodriguez said. Learn more about SMOTE’s application to cancer research in her full blog post, and check out ONS Voice’s additional coverage about artificial intelligence and other emerging technologies (https://voice.ons.org/topic/technologies) in cancer care.


Copyright © 2024 by the Oncology Nursing Society. User has permission to print one copy for personal or unit-based educational use. Contact pubpermissions@ons.org for quantity reprints or permission to adapt, excerpt, post online, or reuse ONS Voice content for any other purpose.