Analyses

The following analyses provide a concise overview of the pangenome assembly and its derived proteome. We report genome contiguity and composition, gene-space completeness (BUSCO), ORF, CDS, and protein counts before and after clustering, and summary counts for functional assignments (Gene Ontology, KEGG, eggNOG, COG). Orthogroup and singleton summaries complement these by showing shared and cultivar-specific gene content. Together, these metrics describe the assembly’s structural quality, the reliability of predicted proteins after redundancy reduction, and the breadth of functional annotation available for downstream comparative and functional genomics.


Contig Reduction Through Filtering & Clustering

1. Raw assembly: 158,817 contigs from unmapped reads
2. Quality & length filtering: 67,462 contigs retained
3. MMseqs2 clustering + containment: 2,295 final unique contigs
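
The reduction at each step can be expressed as a retention fraction. The short sketch below only recomputes those fractions from the three contig counts reported above; the function and variable names are illustrative and are not part of the pipeline.

```python
# Contig counts reported for each stage of the filtering workflow.
stages = [
    ("Raw assembly", 158_817),
    ("Quality & length filtering", 67_462),
    ("MMseqs2 clustering + containment", 2_295),
]

def retention_table(stages):
    """Return (name, count, % of previous stage, % of raw) for each stage."""
    raw = stages[0][1]
    prev = raw
    rows = []
    for name, count in stages:
        rows.append((name, count, 100 * count / prev, 100 * count / raw))
        prev = count
    return rows

for name, count, pct_prev, pct_raw in retention_table(stages):
    print(f"{name:35s} {count:>8,d}  {pct_prev:6.2f}% of previous  {pct_raw:6.2f}% of raw")
```

Reporting both columns separates the effect of quality and length filtering from the much stronger effect of clustering and containment removal.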

Assembly Evaluation (QUAST)

Contiguity metrics
  • Final contigs: 2,295
  • N50: 17.65 Mb
  • Largest contig: 29.45 Mb

Length & composition
  • Total assembly length: 394.43 Mb
  • GC content: 32.71%
  • N's per 100 kbp: 21
Interpretation:

A large N50 (17.65 Mb) and a very large maximum contig (29.45 Mb) indicate good contiguity. The low N content (21 Ns per 100 kbp) means few unresolved gaps; together with a plausible GC content (32.71%), this is consistent with a structurally sound assembly.
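
For readers unfamiliar with the metric, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length; L50 is the number of contigs needed to get there. The sketch below is a minimal implementation of that definition on made-up lengths. It is not QUAST's code, and the toy values are not taken from this assembly.

```python
def n50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50; 'rank' is the L50.
            return length, rank

# Hypothetical contig lengths, for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50(toy_lengths)
print(toy_n50, toy_l50)
```

A higher N50 is better, whereas a lower L50 is better, since it means fewer contigs cover half of the assembly.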


BUSCO Summary

Interpretation:
  • BUSCO completeness is very high (98.9%, single-copy plus duplicated), indicating strong gene-space representation.
  • A large proportion of single-copy BUSCOs (81.3%) suggests low redundancy and effective filtering.
  • Duplicated BUSCOs (17.6%) likely reflect true paralogs or additional gene copies contributed by the pangenome sequences.
  • Fragmented (0.5%) and missing (0.6%) BUSCOs are exceptionally low, supporting the biological completeness of the assembly.
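
The four BUSCO categories are mutually exclusive, so the percentages above can be checked for internal consistency: complete equals single-copy plus duplicated, and complete, fragmented, and missing should sum to 100%. The following sketch performs that check on the reported values; the dictionary keys and function name are illustrative.

```python
# BUSCO percentages as reported in the interpretation above.
busco = {"single": 81.3, "duplicated": 17.6, "fragmented": 0.5, "missing": 0.6}

def busco_summary(p):
    """Return (complete %, total %) rounded to one decimal place."""
    complete = round(p["single"] + p["duplicated"], 1)
    total = round(complete + p["fragmented"] + p["missing"], 1)
    return complete, total

complete, total = busco_summary(busco)

# Conventional one-line BUSCO notation built from the same numbers.
notation = (
    f"C:{complete}%[S:{busco['single']}%,D:{busco['duplicated']}%],"
    f"F:{busco['fragmented']}%,M:{busco['missing']}%"
)
print(notation)
```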

ORF, CDS, and Protein Prediction Statistics

Counts
  • Proteins predicted: 97,815
  • CDS predicted: 97,815
  • ORFs: 97,816
  • Stop codons: 93,245
Interpretation:

Raw protein and CDS predictions are identical in number (97,815 each), reflecting consistent coding sequence recovery. ORFs exceed them by only one (97,816), while the stop-codon count (93,245) is about 95% of the ORF count, suggesting that most ORFs are well-formed and complete.
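
The proportion of ORFs that terminate properly can be estimated from the counts above. The sketch below assumes that each counted stop codon belongs to a distinct ORF, which the report does not state explicitly; under that assumption, the difference between the two counts approximates the number of ORFs lacking a stop codon.

```python
# Counts reported in the prediction statistics.
orfs = 97_816
stop_codons = 93_245

def pct(part, whole):
    """Percentage of 'whole' represented by 'part'."""
    return 100 * part / whole

# Assumption: one stop codon per complete ORF.
orfs_without_stop = orfs - stop_codons
pct_with_stop = pct(stop_codons, orfs)

print(f"ORFs with a stop codon:    {pct_with_stop:.1f}%")
print(f"ORFs without a stop codon: {orfs_without_stop:,d}")
```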


Protein Metrics

Non-redundant proteins retained after de-replication vs redundant ORFs removed by clustering.
  • N50 of proteins: 509 aa (the length at which 50% of the total protein length is contained in sequences of this size or longer)
Interpretation:
  • Representative proteome (38,806) after CD-HIT clustering is the conservative set recommended for gene-centric analyses.
  • Redundancy is extensive: raw ORF predictions (~97.8k) include many duplicates/isoforms; clustering reduces noise and improves downstream analyses.
  • Transcripts vs ORFs: total transcripts (146,413) exceed ORF counts, reflecting multi-isoform transcripts and non-coding transcripts filtered at the ORF stage.
  • N50 = 509 aa means that half of the total protein length lies in proteins of at least 509 aa, so long proteins make up a substantial share of the set, which supports reliable functional annotation.
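
The extent of redundancy removal follows directly from the counts quoted in this section. The sketch below takes the 97,815 predicted proteins as the pre-clustering set (an assumption, since the text gives the raw ORF count only as ~97.8k) and the 38,806 CD-HIT representatives as the post-clustering set.

```python
# Counts quoted in the text; 'raw_proteins' as the pre-clustering set is an assumption.
raw_proteins = 97_815
representatives = 38_806
transcripts = 146_413

removed = raw_proteins - representatives
pct_retained = 100 * representatives / raw_proteins
pct_removed = 100 * removed / raw_proteins
# Average number of raw predictions collapsed into each representative.
mean_cluster_size = raw_proteins / representatives

print(f"Retained: {representatives:,d} ({pct_retained:.1f}%)")
print(f"Removed:  {removed:,d} ({pct_removed:.1f}%)")
print(f"Mean raw predictions per representative: {mean_cluster_size:.2f}")
print(f"Transcripts without a predicted protein: {transcripts - raw_proteins:,d}")
```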

Functional Annotation Summary

Counts
  • Gene Ontology (GOs): 19,505
  • Enzyme Commission (EC): 7,859
  • KEGG pathways: 18,400
  • COG categories: 34,059
  • eggNOG orthologous groups: 36,300

Orthogroups & Singletons

Interpretation

Out of 13,111 orthogroups, 3,375 are core (shared by all four cultivars), while the majority (9,736) are dispensable (shared by two to three cultivars). Notably, no cultivar-specific unique orthogroups were detected, highlighting the strong genetic overlap among the reference cultivars.

When looking at singleton genes (unassigned / unique genes), Alphonso (2,588) and Amrapali (2,239) show the highest counts, whereas Dashehari (725) and Neelam (1,071) contribute fewer. This indicates that while the genetic backbone is largely shared, each cultivar still retains a distinct set of unique genes that may underlie cultivar-specific traits.
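
The core/dispensable split described above amounts to counting how many of the four cultivars contribute at least one gene to each orthogroup. The sketch below illustrates that rule on three hypothetical orthogroups (the `OG_toy_*` identifiers and their memberships are invented for illustration) and then checks that the reported totals add up.

```python
from collections import Counter

CULTIVARS = ("Alphonso", "Amrapali", "Dashehari", "Neelam")

def classify(members):
    """Classify an orthogroup by the number of cultivars represented in it."""
    present = set(members) & set(CULTIVARS)
    if len(present) == len(CULTIVARS):
        return "core"
    if len(present) >= 2:
        return "dispensable"
    if len(present) == 1:
        return "cultivar-specific"
    raise ValueError("orthogroup has no members from the listed cultivars")

# Hypothetical orthogroups, for illustration only.
toy_orthogroups = {
    "OG_toy_1": ["Alphonso", "Amrapali", "Dashehari", "Neelam"],
    "OG_toy_2": ["Alphonso", "Neelam"],
    "OG_toy_3": ["Amrapali", "Dashehari", "Neelam"],
}
toy_counts = Counter(classify(m) for m in toy_orthogroups.values())
print(dict(toy_counts))

# Totals reported in the text.
reported = {"core": 3_375, "dispensable": 9_736, "cultivar-specific": 0}
total_orthogroups = sum(reported.values())
singletons = {"Alphonso": 2_588, "Amrapali": 2_239, "Dashehari": 725, "Neelam": 1_071}
total_singletons = sum(singletons.values())
print(total_orthogroups, total_singletons)
```

Singletons are tallied separately because a gene that clusters with nothing else does not form an orthogroup, which is why cultivar-specific genes can exist even though no cultivar-specific orthogroups were detected.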


Orthogroup Sharing