Abstract: A team of researchers from the Dana-Farber Cancer Institute, the Broad Institute of MIT and Harvard, Google, and Columbia University has created an artificial intelligence model that can predict which genes are expressed in any type of human cell. The model, called EpiBERT, was inspired by BERT, a deep learning model designed to understand and generate human-like language.
EpiBERT was trained in multiple phases on data from hundreds of human cell types. It was fed the genomic sequence, which is 3 billion base pairs long, along with maps of chromatin accessibility, which indicate which of those sequences are unwound from the chromosome and can be read by the cell. The model was first trained to learn the relationship between DNA sequence and chromatin accessibility over large chunks of the genome in a specific cell type. It then used these learned relationships to predict which genes were active in that cell type. It accurately identified regulatory elements (the parts of the genome recognized by transcription factors) and their influence on gene expression across many cell types, building a "grammar" that is generalizable and predictive. This grammar-building process can be compared to the way a large language model, such as ChatGPT, learns to build meaningful sentences and paragraphs from many examples of text. The EpiBERT model can process accessibility data and predict functional bases, as well as RNA expression, for never-before-seen cell types.
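The first training phase described above can be illustrated with a small sketch. It assumes a BERT-style masking objective in which part of the accessibility signal is hidden and the model is scored only on the hidden positions. The function names (`one_hot_dna`, `mask_accessibility`, `masked_mse`), the masking fraction, the use of zero as the mask value, and the toy numbers are all hypothetical choices for illustration, not details of the published EpiBERT implementation.

```python
import random

BASES = "ACGT"

def one_hot_dna(seq):
    """One-hot encode a DNA string; unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def mask_accessibility(track, mask_fraction, rng):
    """Hide a random fraction of accessibility bins (hypothetical masking scheme).

    Returns the masked track and a boolean mask marking the hidden bins.
    """
    n_masked = max(1, round(mask_fraction * len(track)))
    masked_positions = set(rng.sample(range(len(track)), n_masked))
    mask = [i in masked_positions for i in range(len(track))]
    masked = [0.0 if m else v for v, m in zip(track, mask)]
    return masked, mask

def masked_mse(pred, target, mask):
    """Mean squared error computed only over the hidden bins."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors)

# Toy inputs: a short sequence and a made-up accessibility signal per bin.
rng = random.Random(0)
sequence = "ACGTNACGGT"
encoded = one_hot_dna(sequence)
track = [0.0, 2.0, 5.0, 1.0, 0.0, 3.0, 4.0, 0.0]
masked_track, mask = mask_accessibility(track, 0.25, rng)

# A model would receive (encoded, masked_track) and try to reconstruct the
# hidden bins; here the masked track itself stands in for a naive prediction.
loss = masked_mse(masked_track, track, mask)
```

In a second phase, a model pretrained this way would be fine-tuned to predict gene expression from the same inputs; that step is omitted here because the source does not describe its details.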
Significance: Every cell in the body has the same genome sequence, so the difference between two types of cells is not the genes themselves but which genes are turned on, when, and by how much. About 20% of the genome encodes regulatory elements that specify which genes are turned on, but little is known about where these codes sit in the genome, what their instructions look like, or how mutations affect function in a cell. EpiBERT could shed light on how genes are regulated in cells and, potentially, on how a cell's regulatory system can be mutated in ways that lead to diseases such as cancer.
Funding: Broad Institute, Novo Nordisk Foundation, National Human Genome Research Institute, Sharf Green Cancer Research Fund, Richard and Nancy Lubin Family, and American Cancer Society. Access to Tensor Processing Units (TPUs) and support were provided by Google.
Journal reference:
Javed, N., et al. (2025). A multi-modal transformer for cell type-agnostic regulatory predictions. Cell Genomics. doi.org/10.1016/j.xgen.2025.100762.

