Model Overview
This is a TransformerEngine-accelerated version of CodonFM. The code for this model can be found in bionemo-recipes, and this checkpoint is an exact parameter match to the original research work, available at the official CodonFM GitHub repository.
Description:
CodonFM predicts masked codons in mRNA sequences from codon-level context to enable variant effect interpretation and codon optimization as part of NVIDIA’s CodonFM Encodon family. The family comprises four models: three trained with randomly masked tokens at 80 million, 600 million, and 1 billion parameters, and a fourth 1-billion-parameter model trained with codon-frequency-aware masking.
An additional set of TransformerEngine-accelerated checkpoints is also available for use.
This model is ready for commercial/non-commercial use.
License/Terms of Use
Governing Terms: Use of this model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography:
Global
Use Case:
Optimized Expression and Stability for mRNA design: To design mRNAs with codon usage patterns that enhance translation efficiency, protein yield, and transcript stability across specific cell types and tissues.
Variant Interpretation for pathogenicity: To identify and prioritize functional synonymous and missense variants in the context of diseases.
Release Date:
Github 10/27/2025 via https://github.com/NVIDIA-Digital-Bio/CodonFM
Hugging Face 10/27/2025 via:
- Random Mask
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-1B-v1
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-600M-v1
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-80M-v1
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-TE-1B-v1
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-TE-600M-v1
  - https://huggingface.co/nvidia/NV-CodonFM-Encodon-TE-80M-v1
- Codon Frequency Aware Mask
NGC 10/27/2025 via https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/nv_codonfm_encodon
Model Architecture:
The NVIDIA CodonFM Encodon family features Transformer-based architectures tailored for codon-level sequence modeling in mRNA. Each model applies a masked language modeling (MLM) objective to predict masked codons from a surrounding context of up to 2,046 codons, enabling genome-scale codon optimization and synonymous variant interpretation.
| Model Name | Parameters |
|---|---|
| Encodon-80M | 7.68 × 10⁷ |
| Encodon-600M | 6.09 × 10⁸ |
| Encodon-1B | 9.11 × 10⁸ |
| Encodon-Cdwt-1B | 9.11 × 10⁸ |
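To make the MLM interface concrete, below is a minimal masked-codon prediction sketch. The checkpoint name comes from the list above, but the loading path via transformers' `AutoTokenizer`/`AutoModelForMaskedLM` with `trust_remote_code=True` is an assumption; the actual entry points may differ, so consult bionemo-recipes and the CodonFM GitHub repository for the supported loading path.

```python
# Minimal masked-codon prediction sketch. The checkpoint name comes from the
# list above; loading via transformers' AutoModelForMaskedLM with
# trust_remote_code=True is an assumption -- the real entry points may differ.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "nvidia/NV-CodonFM-Encodon-80M-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True).eval()

# A short coding sequence split into codons, with one codon masked.
# (Assumes the tokenizer accepts space-separated codon tokens.)
codons = ["ATG", "GCT", tokenizer.mask_token, "TAA"]
inputs = tokenizer(" ".join(codons), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, positions, codon vocab)

# Probability distribution over codons at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(tokenizer.decode([int(idx)]), f"{p:.3f}")
```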
Input:
Input Type(s): Text (mRNA Sequence)
Input Format: FASTA files converted to memory-mapped arrays (memmaps)
Input Parameters: 1D
Other Properties Related to Input: mRNA sequence represented as a string of codons, with a maximum length of 2,046 codons. Longer sequences are automatically truncated to this length.
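As a concrete illustration of this input convention, the snippet below (plain Python, no CodonFM dependencies) splits a coding sequence into codon tokens and truncates it to the 2,046-codon limit:

```python
MAX_CODONS = 2046  # maximum context length stated above

def to_codons(cds: str, max_codons: int = MAX_CODONS) -> list[str]:
    """Split a coding sequence into codon tokens, truncating long inputs."""
    cds = cds.upper().strip()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be divisible by 3")
    codons = [cds[i : i + 3] for i in range(0, len(cds), 3)]
    return codons[:max_codons]

print(to_codons("ATGGCTGAAAGCTAA"))  # ['ATG', 'GCT', 'GAA', 'AGC', 'TAA']
```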
Output:
Output Type(s): mRNA Sequence
Output Format: Text
Output Parameters: 2D
Other Properties Related to Output: Numeric 2D tensor with floating-point values representing the probability of a given codon at a given position within the sequence
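Concretely, for a sequence of L codon positions the model emits an L × V matrix of logits (V = codon vocabulary size), and a softmax over the last dimension yields the per-position probability distributions. A minimal sketch with made-up shapes (the real vocabulary size may differ):

```python
import torch

# Hypothetical shapes: 10 codon positions, 69-entry vocabulary
# (64 codons plus special tokens; the real size may differ).
logits = torch.randn(10, 69)        # raw model output for one sequence
probs = logits.softmax(dim=-1)      # each row now sums to 1
assert torch.allclose(probs.sum(-1), torch.ones(10))
print(probs.argmax(dim=-1))         # most likely codon index per position
```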
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- PyTorch - 2.5.1
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
- NV-CodonFM-Encodon-80M-v1
- NV-CodonFM-Encodon-600M-v1
- NV-CodonFM-Encodon-1B-v1
- NV-CodonFM-Encodon-Cdwt-1B-v1
- NV-CodonFM-Encodon-TE-80M-v1
- NV-CodonFM-Encodon-TE-600M-v1
- NV-CodonFM-Encodon-TE-1B-v1
- NV-CodonFM-Encodon-TE-Cdwt-1B-v1
Training, Testing, and Evaluation Datasets:
Training Dataset:
Link: RefSeq Data from NCBI
Data Modality:
- Text (mRNA Sequencing data)
Properties: Coding sequences from the NCBI RefSeq database (release 2024-04) were used for training. A total of >130M non-viral protein-coding sequences from >22,000 species were included, comprising >2,000 eukaryotes. Sequences not divisible by three or containing ambiguous bases were removed. Taxonomy-level deduplication using MMSeqs eliminated redundant entries, and coding sequences from bacteria pathogenic to humans were excluded. The resulting dataset was partitioned into nine species groups: primates, archaea, bacteria, fungi, invertebrate, plant, protozoa, non-primate mammals, and non-mammal vertebrates. Sequences were clustered by similarity and then split into training and validation sets with stratification across groups to ensure balanced representation.
Encodon models use codon-level tokenization, processing input sequences of up to 2,046 codons. Each model was trained using a masked language modeling (MLM) objective, where randomly masked codons were predicted from their context. The Encodon pretraining dataset was sorted based on sequence taxonomy to maintain species balance, and sequence subsets could be resampled dynamically.
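As an illustration of the sequence filters described above (length divisible by three, no ambiguous bases), here is a hedged re-implementation in plain Python; it is not the project's actual preprocessing code:

```python
VALID_BASES = set("ACGT")

def passes_filters(cds: str) -> bool:
    """Keep only coding sequences that are a whole number of codons
    and contain no ambiguous IUPAC bases (N, R, Y, ...)."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and set(cds) <= VALID_BASES

print(passes_filters("ATGGCTTAA"))   # True
print(passes_filters("ATGGNTTAA"))   # False: ambiguous base N
print(passes_filters("ATGGCTTA"))    # False: length not divisible by 3
```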
Non-Audio, Image, Text Training Data Size: The NCBI RefSeq genomes FTP directory currently contains over 395,000 genomes, totaling approximately 3.3 terabases (Tb).
Data Collection Method by Dataset:
- Automatic/Sensors
Labeling Method by Dataset:
- Not Applicable
Evaluation Datasets:
| Link | Properties |
|---|---|
| ClinVar Variant Interpretation | This task involves classifying genetic variants from ClinVar, a publicly available database that aggregates information about the clinical significance of human genetic variants, into pathogenic or benign categories based on their coding sequence context |
| De novo variant classification | This task uses variants from the Deciphering Developmental Disorders (DDD) and autism spectrum disorder (ASD) cohort studies, which catalog genetic mutations linked to rare pediatric and developmental diseases, to evaluate classification of pathogenic versus benign variants based on coding sequence context. |
| mRNA Translation Efficiency | This task predicts ribosome profiling signal intensity along coding sequences, evaluating how well models capture translation efficiency and codon-level regulation from sequence context. |
| Protein Abundance | This task predicts fluorescent protein expression levels (mRFP) from coding sequences, testing how accurately models capture codon-dependent effects on translation efficiency and protein abundance. |
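The two variant tasks above are commonly scored zero-shot with a masked language model: mask the variant position, then compare the model's probability of the reference codon with that of the alternate codon via a log-likelihood ratio. The sketch below is a generic illustration of that scoring recipe, not CodonFM's published evaluation code; the probability values are made up, and in practice they would come from a masked prediction like the example earlier in this card.

```python
import math

def llr_score(probs: dict[str, float], ref_codon: str, alt_codon: str) -> float:
    """Zero-shot variant effect score: log P(ref) - log P(alt) at the masked
    variant position. Larger values mean the alternate codon is less
    compatible with the surrounding coding context."""
    return math.log(probs[ref_codon]) - math.log(probs[alt_codon])

# Hypothetical masked-position probabilities, for illustration only.
probs = {"GAA": 0.41, "GAG": 0.37, "TAA": 0.002}
print(llr_score(probs, ref_codon="GAA", alt_codon="TAA"))  # ~5.32
```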
Data Collection Method by Dataset:
- Human
Labeling Method by Dataset:
- Not Applicable
Inference:
Acceleration Engine: None
Test Hardware: NVIDIA A100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.
Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.