Abstract
Prokaryotic Argonaute proteins represent a mechanistically versatile yet sequence‑divergent family of nucleic‑acid–guided effectors whose functional and evolutionary breadth remains incompletely charted, especially within the vast reservoir of metagenomic “dark” proteins. Here, we present a multitask deep‑learning framework that unifies transformer‑derived residue embeddings, predicted contact topologies, and domain‑aware sequence constraints in a graph attention–conditional random‑field (GAT‑CRF) architecture. Trained on a curated cohort of 6,615 residue‑annotated protein graphs, the model simultaneously classifies global Argonaute identity and delineates PAZ, MID, and PIWI boundaries with residue‑level resolution.
In comparative benchmarks, the network outperforms Foldseek and HMMER, recalling up to 35% more domains at matched precision, and surpasses PSI‑BLAST in family‑level retrieval while operating entirely without pairwise alignment. Deployment across 43.2 million non‑redundant MGnify proteins yields 1,459 high‑confidence Argonaute homologs, expanding the known repertoire by nearly an order of magnitude, and reveals a previously unrecognized monophyletic clade that branches from longA Argonautes but lacks both the catalytic DEDX tetrad and PAZ/APAZ guide‑anchoring modules. Detailed motif analysis confirms 168 catalytically competent longA enzymes, maps MID‑anchor diversification across clades, and uncovers rare aromatic amplifications (YY, FWK) suggestive of enhanced guide affinity. Accessory‑domain scans identify Tudor, Topoisomerase‑I, and S13‑like H2TH fusions, highlighting modular accretion as a driver of functional innovation.
This work demonstrates that graph‑based representation learning can transcend the limitations of alignment‑centric pipelines, resolve residue‑level architectures, and illuminate hidden enzymatic diversity at the metagenomic scale. The resulting atlas of canonical and novel pAgos provides a rich source of candidates for mechanistic exploration and biotechnological development, while the methodological blueprint is broadly transferable to other fast‑evolving protein families residing in the microbial dark proteome.
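To illustrate how a conditional random‑field layer can turn per‑residue scores into contiguous domain boundaries, the following is a minimal sketch of Viterbi decoding over a linear chain. It is not the thesis implementation: the label set, the function names (viterbi, segments), and the score layout are assumptions made for illustration, and the emission scores stand in for whatever a graph attention encoder would produce.

```python
# Hypothetical label set; the abstract names PAZ, MID, and PIWI domains.
LABELS = ["other", "PAZ", "MID", "PIWI"]


def viterbi(emissions, transitions):
    """Return the highest-scoring label path for a linear-chain CRF.

    emissions[t][k]: score of label k at residue t (assumed encoder output).
    transitions[j][k]: score of moving from label j to label k.
    Assumes at least one residue.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            # Best previous label for ending in label k at residue t.
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segments(path):
    """Collapse a label path into (label, start, end) spans, 0-based inclusive."""
    spans, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            if LABELS[path[start]] != "other":
                spans.append((LABELS[path[start]], start, i - 1))
            start = i
    return spans
```

With a penalty on label switches, an isolated residue whose emission weakly favors a different domain is absorbed into the surrounding segment, which is the practical reason for decoding jointly rather than taking a per‑residue argmax.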
School
School of Sciences and Engineering
Department
Biotechnology Program
Degree Name
MS in Biotechnology
Graduation Date
Summer 6-15-2025
Submission Date
5-26-2025
First Advisor
Ahmed Moustafa
Second Advisor
Magdy M. Mahfouz
Committee Member 1
May Bakr
Committee Member 2
Robert Hoehndorf
Committee Member 3
Anwar Abdelnaser
Extent
73 p.
Document Type
Master's Thesis
Institutional Review Board (IRB) Approval
Not necessary for this item
Recommended Citation
APA Citation
Kazlak, A. M. (2025). Mining the Microbial Treasure Trove: Multitasking Deep Learning Framework for Functional Discovery of Novel Prokaryotic Argonaute Proteins in Metagenomes [Master's Thesis, the American University in Cairo]. AUC Knowledge Fountain.
https://fount.aucegypt.edu/etds/2526
MLA Citation
Kazlak, Ahmed Mohamed. Mining the Microbial Treasure Trove: Multitasking Deep Learning Framework for Functional Discovery of Novel Prokaryotic Argonaute Proteins in Metagenomes. 2025. American University in Cairo, Master's Thesis. AUC Knowledge Fountain.
https://fount.aucegypt.edu/etds/2526
