Analyses

The following analyses provide a concise overview of the pangenome assembly and its derived proteome. We report genome contiguity and composition, gene-space completeness (BUSCO), ORF, CDS, and protein counts before and after clustering, and summary counts for functional assignments (Gene Ontology, EC, KEGG, COG, eggNOG). Orthogroup and singleton summaries complement these by showing shared and cultivar-specific gene content. Together, these metrics describe the assembly’s structural quality, the reliability of predicted proteins after redundancy reduction, and the breadth of functional annotation available for downstream comparative and functional genomics.

Contig Reduction Through Filtering & Clustering

Raw assembly: 158,817 contigs from unmapped reads

Quality & length filtering: 67,462 contigs retained

MMseqs2 clustering + containment: 2,295 final unique contigs
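The reduction at each stage is easier to judge as a fraction of the previous stage and of the raw assembly. The following is a minimal sketch of that arithmetic using only the three counts listed above; the function name `retention_table` is illustrative and not part of the original pipeline.

```python
# Contig counts at each stage, taken from the summary above.
STAGES = [
    ("Raw assembly", 158_817),
    ("Quality & length filtering", 67_462),
    ("MMseqs2 clustering + containment", 2_295),
]


def retention_table(stages):
    """Return (stage, contigs, fraction of previous stage, fraction of raw) rows."""
    raw = stages[0][1]
    rows = []
    previous = raw
    for name, count in stages:
        rows.append((name, count, count / previous, count / raw))
        previous = count
    return rows


for name, count, vs_prev, vs_raw in retention_table(STAGES):
    print(f"{name:<34} {count:>8,}  {vs_prev:7.2%} of previous  {vs_raw:7.2%} of raw")
```

By this calculation, filtering keeps roughly 42% of the raw contigs, and the final set is roughly 1.4% of the raw assembly.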

Assembly Evaluation (QUAST)

Contiguity metrics

Final contigs: 2,295
N50: 17.65 Mb
Largest contig: 29.45 Mb
GC content: 32.71%

Length & composition

Total assembly length: 394.43 Mb
GC content: 32.71%
N's per 100 kbp: 21

Interpretation:

A large N50 (17.65 Mb) and a very large maximum contig (≈29 Mb) indicate good contiguity. A low N content (21 N's per 100 kbp) and a reasonable GC content (32.71%) support the structural accuracy of the assembly.
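For readers unfamiliar with the metric, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. Higher N50 and lower L50 both point to better contiguity. The sketch below illustrates the standard definition on a toy list of lengths; it is not the QUAST implementation, and the input values are made up.

```python
def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    total = sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length, rank


# Toy example (hypothetical lengths, not the assembly reported here).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50_l50(toy_lengths))
```

In the toy example the total is 54, and the three longest contigs (10, 9, 8) sum to 27, so N50 is 8 and L50 is 3.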

BUSCO Summary

Interpretation:

BUSCO completeness is very high (98.9%), indicating strong gene-space representation.
A large proportion of single-copy BUSCOs (81.3%) suggests low redundancy and effective filtering.
Duplicated BUSCOs (17.6%) likely reflect true paralogs or pangenomic novelty.
Fragmented (0.5%) and missing (0.6%) BUSCOs are exceptionally low, supporting the biological completeness of the assembly.
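The four BUSCO categories should be internally consistent: complete is the sum of single-copy and duplicated, and all categories together should account for 100% of the searched BUSCOs. A small sketch of that check using the percentages quoted above (the dictionary keys and function name are illustrative):

```python
# BUSCO percentages as quoted in the interpretation above.
BUSCO = {"single": 81.3, "duplicated": 17.6, "fragmented": 0.5, "missing": 0.6}


def busco_check(pct):
    """Return (complete %, total %) rounded to one decimal place."""
    complete = round(pct["single"] + pct["duplicated"], 1)
    total = round(sum(pct.values()), 1)
    return complete, total


complete, total = busco_check(BUSCO)
print(f"Complete: {complete}%  Sum of all categories: {total}%")
```

The reported figures pass both checks: 81.3 + 17.6 gives 98.9% complete, and the four categories sum to 100%.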

ORF, CDS, and Protein Prediction Statistics

Counts

Proteins predicted: 97,815
CDS predicted: 97,815
ORFs: 97,816
Stop codons: 93,245

Interpretation:

Raw protein and CDS predictions are identical in number (97,815 each), reflecting consistent coding sequence recovery. The ORF count is higher by only one (97,816). The stop-codon count (93,245) is about 95% of the ORF count, which suggests that most ORFs are well-formed and complete.

Protein Metrics

Non-redundant proteins retained after de-replication vs redundant ORFs removed by clustering.

N50 of proteins: 509 aa

The protein length at which 50% of the total protein length is contained in sequences of this size or longer.

Interpretation:

The representative proteome (38,806 proteins) obtained by CD-HIT clustering is the conservative set recommended for gene-centric analyses.
Redundancy is extensive: the raw ORF predictions (~97.8k) include many duplicates and isoforms, and clustering reduces them to 38,806 representatives (about 40% of the raw set), which reduces noise in downstream analyses.
Transcripts vs ORFs: total transcripts (146,413) exceed the ORF count, reflecting multi-isoform transcripts and non-coding transcripts filtered out at the ORF stage.
A protein N50 of 509 aa indicates a substantial proportion of long, well-formed proteins, supporting reliable functional annotation.
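The representative and redundant counts can be recovered from a CD-HIT cluster report, in which each cluster begins with a `>Cluster N` line followed by one line per member sequence. The sketch below counts clusters and members from such a report; the sample text and sequence names are hypothetical, and the actual CD-HIT parameters used for this proteome are not stated in the source.

```python
def summarize_clstr(lines):
    """Count clusters and member sequences in a CD-HIT .clstr report.

    Returns (clusters, members, fraction of members removed as redundant).
    """
    clusters = 0
    members = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters += 1
        else:
            members += 1
    removed = 1 - clusters / members if members else 0.0
    return clusters, members, removed


# Hypothetical three-sequence report; names are placeholders.
sample = """>Cluster 0
0\t512aa, >protA... *
1\t498aa, >protB... at 97.59%
>Cluster 1
0\t301aa, >protC... *
"""
print(summarize_clstr(sample.splitlines()))
```

Applied to a real report, the first value corresponds to the number of representative proteins and the second to the number of input sequences.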

Functional Annotation Summary

Counts

Gene Ontology (GO): 19,505
Enzyme Commission (EC): 7,859
KEGG pathways: 18,400
COG categories: 34,059
eggNOG orthologous groups: 36,300

Orthogroups & Singletons

Interpretation

Of the 13,111 orthogroups, 3,375 are core (shared by all four cultivars), while the majority (9,736) are dispensable (shared by two or three cultivars). Notably, no cultivar-specific orthogroups were detected, highlighting the strong genetic overlap among the reference cultivars.

Among singleton genes (genes not assigned to any orthogroup), Alphonso (2,588) and Amrapali (2,239) show the highest counts, whereas Neelam (1,071) and Dashehari (725) contribute fewer. This indicates that, while the genetic backbone is largely shared, each cultivar still retains a distinct set of unique genes that may underlie cultivar-specific traits.
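The core/dispensable split described above follows from counting, for each orthogroup, how many of the four cultivars contribute at least one gene. The following is a minimal sketch of that classification on a toy gene-count table; the orthogroup IDs and counts are invented for illustration, and the input layout (orthogroup mapped to per-cultivar gene counts) is an assumption rather than the format used in the original analysis.

```python
from collections import Counter

CULTIVARS = ("Alphonso", "Amrapali", "Dashehari", "Neelam")


def classify_orthogroups(gene_counts):
    """Tally orthogroups as core, dispensable, or cultivar-specific.

    gene_counts maps an orthogroup ID to {cultivar: number of genes}.
    """
    tally = Counter()
    for counts in gene_counts.values():
        present = sum(1 for c in CULTIVARS if counts.get(c, 0) > 0)
        if present == len(CULTIVARS):
            tally["core"] += 1
        elif present >= 2:
            tally["dispensable"] += 1
        elif present == 1:
            tally["cultivar_specific"] += 1
    return tally


# Toy input with hypothetical orthogroup IDs and counts.
toy = {
    "OG1": {"Alphonso": 1, "Amrapali": 2, "Dashehari": 1, "Neelam": 1},
    "OG2": {"Alphonso": 1, "Neelam": 3},
    "OG3": {"Amrapali": 1, "Dashehari": 1, "Neelam": 1},
}
print(classify_orthogroups(toy))
```

With the reported totals, the same tally would give 3,375 core and 9,736 dispensable orthogroups, which sum to the 13,111 orthogroups stated above, with no cultivar-specific groups.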
