Analyses
The following analyses provide a concise overview of the pangenome assembly and its derived proteome. We report genome contiguity and composition, gene-space completeness (BUSCO), ORF, CDS, and protein counts before and after clustering, and summary counts for functional assignments (Gene Ontology, KEGG, eggNOG, COG). Orthogroup and singleton summaries complement these by showing shared and cultivar-specific gene content. Together, these metrics describe the assembly’s structural quality, the reliability of predicted proteins after redundancy reduction, and the breadth of functional annotation available for downstream comparative and functional genomics.
Contig Reduction Through Filtering & Clustering
Assembly Evaluation (QUAST)
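A large N50 and a very large maximum contig (≈29 Mb) indicate good contiguity. Note that L50 is read in the opposite direction: a lower L50 is better, because it means fewer contigs are needed to cover half of the assembly. The low N content (21 Ns per 100 kbp) shows that few gaps remain, and the GC content is plausible, both of which are consistent with a structurally sound assembly.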
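As a reminder of how the two contiguity metrics relate, the following is a minimal sketch of the standard N50/L50 definitions. The function name `n50_l50` and the toy contig lengths are illustrative assumptions and are not taken from this assembly or from QUAST's own code.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50: length of the contig at which the running total of
         length-sorted contigs first reaches half the assembly size.
    L50: number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Toy contig lengths (illustrative only, not from this assembly).
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 80 2
```

A more contiguous assembly pushes N50 up and L50 down at the same time, which is why the two values should not be described as "large" together.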
BUSCO Summary
- BUSCO completeness is very high (98.9%), indicating strong gene-space representation.
- Most BUSCOs are single-copy (81.3%), suggesting that filtering removed much of the redundancy.
- Duplicated BUSCOs (17.6%) may reflect true paralogs or homologous copies from different cultivars that were retained in the pangenome.
- Fragmented (0.5%) and missing (0.6%) BUSCOs are exceptionally low, supporting the biological completeness of the assembly.
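The BUSCO categories above are related by simple identities: complete = single-copy + duplicated, and complete + fragmented + missing = 100%. The sketch below checks them for the reported values. The summary string is assembled here from the percentages in the list, in the style of BUSCO's one-line short summary; the `n:` field is omitted because the number of BUSCO groups is not stated in this report, and `parse_busco` is a hypothetical helper.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO-style one-line summary into a dict of percentages."""
    pattern = r"([CSDFM]):(\d+(?:\.\d+)?)%"
    return {key: float(value) for key, value in re.findall(pattern, summary)}


# Assembled from the percentages reported above (not copied from a BUSCO log).
line = "C:98.9%[S:81.3%,D:17.6%],F:0.5%,M:0.6%"
busco = parse_busco(line)

# Complete = single-copy + duplicated (tolerance covers rounding to 0.1%).
assert abs(busco["S"] + busco["D"] - busco["C"]) < 0.05
# Complete + fragmented + missing should account for all BUSCO groups.
assert abs(busco["C"] + busco["F"] + busco["M"] - 100.0) < 0.05
print(busco)
```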
ORF, CDS, and Protein Prediction Statistics
Raw protein and CDS predictions are nearly identical in number, reflecting consistent coding sequence recovery. The ORF count is only marginally higher, and the stop-codon count is slightly lower than the ORF count, indicating that most ORFs end in a stop codon and only a small fraction are likely truncated.
Protein Metrics
- The representative proteome after CD-HIT clustering (38,806 proteins) is the conservative set recommended for gene-centric analyses.
- Redundancy is extensive: the raw ORF predictions (~97.8k) include many duplicates and isoforms, so clustering to representative sequences reduces noise in downstream analyses.
- Transcripts vs ORFs: total transcripts (146,413) exceed the ORF count, reflecting multiple isoforms per gene and non-coding transcripts that are filtered out at the ORF stage.
- A protein-length N50 of 509 aa means that half of the total amino acid content lies in proteins of at least 509 aa, indicating a substantial share of long, well-formed proteins and supporting reliable functional annotation.
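To put the redundancy reduction in numbers, the short sketch below computes the fraction of raw ORFs retained as representatives and the ORF-to-transcript ratio from the counts above. The raw ORF count is only given approximately (~97.8k), so 97,800 is used as an assumed value and the results are correspondingly approximate; `clustering_summary` is a hypothetical helper.

```python
def clustering_summary(raw_orfs, representatives, transcripts):
    """Summarise how much clustering reduced the predicted ORF set."""
    retained = representatives / raw_orfs
    return {
        "retained": retained,            # fraction kept as representatives
        "collapsed": 1 - retained,       # fraction merged into clusters
        "orfs_per_transcript": raw_orfs / transcripts,
    }


summary = clustering_summary(
    raw_orfs=97_800,        # assumed from "~97.8k"; exact count not given
    representatives=38_806,
    transcripts=146_413,
)
print(f"retained after clustering: {summary['retained']:.1%}")    # ~39.7%
print(f"collapsed as redundant:    {summary['collapsed']:.1%}")   # ~60.3%
print(f"ORFs per transcript:       {summary['orfs_per_transcript']:.2f}")
```

Under these assumptions roughly two in five raw ORFs survive as cluster representatives, which is consistent with the extensive redundancy described in the list.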
Functional Annotation Summary
Orthogroups & Singletons
Of the 13,111 orthogroups, 3,375 are core (shared by all four cultivars), while the majority (9,736) are dispensable (shared by two or three cultivars). Notably, no cultivar-specific orthogroups were detected, highlighting the strong genetic overlap among the reference cultivars.
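The core/dispensable split described above can be sketched as a classification by the number of cultivars represented in each orthogroup. This is a minimal illustration, not the pipeline used for this report: the orthogroup IDs and memberships in `toy_groups` are made up, and `classify` is a hypothetical helper.

```python
from collections import Counter

CULTIVARS = ("Alphonso", "Amrapali", "Dashehari", "Neelam")


def classify(present_in, n_cultivars=len(CULTIVARS)):
    """Label an orthogroup by how many distinct cultivars contribute genes."""
    k = len(set(present_in))
    if k == n_cultivars:
        return "core"
    if k >= 2:
        return "dispensable"
    if k == 1:
        return "cultivar-specific"
    raise ValueError("orthogroup with no members")


# Hypothetical memberships: one cultivar name per member gene.
toy_groups = {
    "OG1": ["Alphonso", "Amrapali", "Dashehari", "Neelam"],
    "OG2": ["Alphonso", "Neelam"],
    "OG3": ["Amrapali", "Dashehari", "Neelam"],
    "OG4": ["Alphonso", "Alphonso"],  # two genes, but a single cultivar
}
counts = Counter(classify(members) for members in toy_groups.values())
print(dict(counts))

# Consistency check on the reported totals: core + dispensable = all groups.
assert 3_375 + 9_736 == 13_111
```

The toy data include a cultivar-specific group only to exercise that branch; in the reported results that category is empty.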
Among singleton genes (genes not assigned to any orthogroup), Alphonso (2,588) and Amrapali (2,239) show the highest counts, whereas Neelam (1,071) and Dashehari (725) contribute fewer. This indicates that while the genetic backbone is largely shared, each cultivar still retains a distinct set of unique genes that may underlie cultivar-specific traits.