Abstract

Prokaryotic Argonaute proteins represent a mechanistically versatile yet sequence‑divergent family of nucleic‑acid–guided effectors whose functional and evolutionary breadth remains incompletely charted, especially within the vast reservoir of metagenomic “dark” proteins. Here, we present a multitask deep‑learning framework that unifies transformer‑derived residue embeddings, predicted contact topologies, and domain‑aware sequence constraints in a graph attention–conditional random‑field (GAT‑CRF) architecture. Trained on a curated cohort of 6,615 residue‑annotated protein graphs, the model simultaneously classifies global Argonaute identity and delineates PAZ, MID, and PIWI domain boundaries at residue‑level resolution.

In comparative benchmarks, the network outperforms Foldseek and HMMER, recalling up to 35% more domains at matched precision, and surpasses PSI‑BLAST in family‑level retrieval while operating entirely without pairwise alignment. Deployment across 43.2 million non‑redundant MGnify proteins yields 1,459 high‑confidence Argonaute homologs, expanding the known repertoire by nearly an order of magnitude, and reveals a previously unrecognized monophyletic clade that branches from longA Argonautes but lacks both the catalytic DEDX tetrad and the PAZ/APAZ guide‑anchoring modules. Detailed motif analysis confirms 168 catalytically competent longA enzymes, maps MID‑anchor diversification across clades, and uncovers rare aromatic amplifications (YY, FWK) suggestive of enhanced guide affinity. Accessory‑domain scans identify Tudor, Topoisomerase‑I, and S13‑like H2TH fusions, highlighting modular accretion as a driver of functional innovation.

This work demonstrates that graph‑based representation learning can transcend the limitations of alignment‑centric pipelines, resolve residue‑level architectures, and illuminate hidden enzymatic diversity at metagenomic scale. The resulting atlas of canonical and novel prokaryotic Argonautes provides a rich source of candidates for mechanistic exploration and biotechnological development, while the methodological blueprint is broadly transferable to other fast‑evolving protein families residing in the microbial dark proteome.

School

School of Sciences and Engineering

Department

Biotechnology Program

Degree Name

MS in Biotechnology

Graduation Date

Summer 6-15-2025

Submission Date

5-26-2025

First Advisor

Ahmed Moustafa

Second Advisor

Magdy M. Mahfouz

Committee Member 1

May Bakr

Committee Member 2

Robert Hoehndorf

Committee Member 3

Anwar Abdelnaser

Extent

73 p.

Document Type

Master's Thesis

Institutional Review Board (IRB) Approval

Not necessary for this item

Available for download on Wednesday, May 26, 2027
