This summer, SPUR allowed me to take a step far outside my comfort zone. As a student studying finance and minoring in CIS, I usually navigate DCF models, coding assignments, and investment reports. Biology was never on my radar until now. With guidance from Dr. Shuva Gupta (Penn) and Dr. Divyansh Agarwal (MIT), I taught myself the basics of single-cell RNA sequencing (scRNA-seq) and, along the way, discovered an overlooked problem: its environmental cost.
Single-cell RNA sequencing (scRNA-seq) measures the RNA expression of individual cells, allowing researchers to study cellular differences that bulk RNA-seq averages out. This makes it possible to identify rare cell populations, trace developmental processes, and characterize how cells respond differently within the same tissue. These insights are crucial for advancing cancer research, immunology, and other fields where cell-level heterogeneity is significant.
While I was doing this research, news from China hit close to home: mosquito-borne diseases such as chikungunya are spreading farther because of climate change. Climate change is an enduring crisis that affects nearly every aspect of human and ecological life. Without stronger regulations, global temperatures are projected to rise by 2.6 to 3.1 degrees Celsius this century, a shift that would put coastal cities such as Rio de Janeiro, Shanghai, Miami, and Osaka at risk of flooding, drive increases in heat-related illnesses, worsen maternal and infant health outcomes, and exacerbate preexisting medical conditions. The financial toll is expected to reach trillions of dollars, while hundreds of biodiversity-rich habitats face heightened risk. A standard measure of an activity's contribution to climate change is its carbon footprint: the total greenhouse gases it emits, directly or indirectly, expressed in CO2-equivalent (CO2e), the amount of CO2 with the same global warming potential.
Biology, the field working to understand and mitigate these public health threats, also has its own environmental impact. On the computational side of biology research, the algorithms we run to process data consume electricity, and when that electricity is generated from fossil fuels, every computation carries a carbon footprint. For scRNA-seq, which analyzes thousands of cells individually to reveal gene expression, the data is massive, the workflows are complex, and the carbon costs can quietly add up.
My project set out to ask: How much carbon do scRNA-seq workflows emit, and how do scientists’ algorithm choices influence this footprint? This became my gap statement: although the scientific community has built hundreds of tools for different scRNA-seq tasks, few of them report environmental metrics. Researchers can compare tools by speed or accuracy, but not by sustainability.
This summer, I focused on Seurat, one of the most widely used R packages for scRNA-seq. Within Seurat, there are multiple ways to normalize data, reduce dimensionality, and test for differential gene expression. These small user-defined choices can change both the scientific outcome and the environmental impact. So, I implemented 38 distinct Seurat workflows on a publicly available dataset of 2,700 single human cells. For each workflow, I recorded runtime, CPU/GPU usage, and memory, and then ran the numbers through Green Algorithms, a framework that translates computing resources into carbon emissions. I then used analysis of variance (ANOVA), which I learned in STAT 1020 with Professor Gupta, to analyze the resulting carbon footprint dataset. The analysis asked two questions: whether the choice made at each step has a statistically significant effect on carbon emissions, and whether the effect of any one step depends on the choice made at another.
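For readers curious about how recorded resources turn into an emissions estimate, here is a minimal Python sketch of a Green Algorithms-style calculation. It is not the code used in this project: the function name, the default constants (a power usage effectiveness of 1.67 and roughly 0.3725 W per GB of memory, as I recall them from the Green Algorithms publication), and every input value in the example are assumptions to be replaced with figures for your own hardware and location.

```python
def carbon_footprint_g(runtime_h, n_cores, core_power_w, core_usage,
                       memory_gb, carbon_intensity_g_per_kwh,
                       pue=1.67, memory_power_w_per_gb=0.3725):
    """Estimate emissions in grams of CO2e for one computation.

    Energy is runtime times the power drawn by the cores and the memory,
    scaled by the data centre's power usage effectiveness (PUE). Emissions
    are that energy times the carbon intensity of the local electricity.
    The default constants are assumptions, not values from this study.
    """
    power_w = n_cores * core_power_w * core_usage + memory_gb * memory_power_w_per_gb
    energy_kwh = runtime_h * power_w * pue / 1000.0
    return energy_kwh * carbon_intensity_g_per_kwh


# Hypothetical workflow: 6 minutes on 4 cores at 80% usage with 16 GB of
# memory, on a grid emitting 400 g CO2e per kWh. All numbers are made up.
example = carbon_footprint_g(
    runtime_h=0.1, n_cores=4, core_power_w=12.0, core_usage=0.8,
    memory_gb=16, carbon_intensity_g_per_kwh=400.0,
)
print(f"Estimated footprint: {example:.2f} g CO2e")
```

The point of the sketch is that runtime, core usage, and memory each enter the estimate multiplicatively, which is why a slower or more memory-hungry method for the same task carries a larger footprint.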
Our results indicate that the choices of differential expression (DE) test and normalization method both have a statistically significant effect on carbon footprint. The choice of DE test has the most significant effect, with negbinom testing associated with the highest average carbon emissions. Among normalization methods, LogNormalize proved more carbon-friendly than SCTransform. Dimensionality reduction methods showed no significant impact. Additionally, the steps do not significantly interact: the effect of a choice at one step on carbon emissions does not depend on the choice made at another. That means researchers can optimize each step individually without worrying about hidden interactions.
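To make "main effect" and "interaction" concrete, here is a small Python sketch of a balanced two-way ANOVA computed by hand. It is an illustration under stated assumptions, not the analysis from this project: the function name is hypothetical, the design is cut down to two factors (normalization and DE test) for brevity, and the emissions values are made-up toy numbers chosen so that the two factors act purely additively. It reports only F statistics; turning them into p-values needs an F distribution from a statistics library.

```python
from statistics import mean


def two_way_anova_f(cells):
    """F statistics for a balanced two-way ANOVA with replication.

    `cells` maps (level_a, level_b) to a list of replicate measurements.
    Returns F values for factor A, factor B, and the A-by-B interaction.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    assert n >= 2, "need at least two replicates per cell"
    assert all(len(cells[(a, b)]) == n for a in levels_a for b in levels_b), \
        "design must be balanced"

    grand = mean(x for obs in cells.values() for x in obs)
    cell_mean = {key: mean(obs) for key, obs in cells.items()}
    mean_a = {a: mean(cell_mean[(a, b)] for b in levels_b) for a in levels_a}
    mean_b = {b: mean(cell_mean[(a, b)] for a in levels_a) for b in levels_b}

    # Sums of squares for main effects, interaction, and residual error.
    ss_a = len(levels_b) * n * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = len(levels_a) * n * sum((m - grand) ** 2 for m in mean_b.values())
    ss_cells = n * sum((m - grand) ** 2 for m in cell_mean.values())
    ss_ab = ss_cells - ss_a - ss_b
    ss_e = sum((x - cell_mean[key]) ** 2 for key, obs in cells.items() for x in obs)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab = df_a * df_b
    df_e = len(levels_a) * len(levels_b) * (n - 1)
    ms_e = ss_e / df_e
    return {"A": ss_a / df_a / ms_e, "B": ss_b / df_b / ms_e, "AxB": ss_ab / df_ab / ms_e}


# Toy emissions (arbitrary units), NOT measurements from this study.
toy = {
    ("LogNormalize", "wilcox"): [4.0, 6.0],
    ("LogNormalize", "negbinom"): [14.0, 16.0],
    ("SCTransform", "wilcox"): [6.0, 8.0],
    ("SCTransform", "negbinom"): [16.0, 18.0],
}
f_stats = two_way_anova_f(toy)
print(f_stats)
```

In this toy data the gap between the two DE tests is the same under either normalization method, so the interaction F statistic comes out as zero while both main effects are nonzero. That is the pattern "no interaction" describes: each step shifts emissions by its own amount regardless of the other.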
Biological research exists to protect human and planetary health, and the tools we use to pursue these goals shouldn’t undermine them. By showing that different software choices for the same task can vary significantly in emissions, my project adds an environmental dimension to decision-making in bioinformatics.
This work is still just one piece of the puzzle. Moving forward, I plan to expand beyond clustering in Seurat to other tasks like trajectory inference, and later to alternative packages such as Scanpy, scVI, and Monocle3. The long-term goal is to build a comprehensive framework so that biologists can weigh not just speed and accuracy, but also carbon footprint when choosing their workflows. For me, this summer was not only about quantifying carbon emissions. It was about learning to navigate a new field from scratch, blending statistics and computing with biology, and tackling a problem that bridges science, sustainability, and society.
