User guide
cfoldseeker has several search modes and helper tools, each one requiring different prior work to be done with MMseqs and/or FoldSeek.
Tip
Each cfoldseeker command prints logs at the stdout. You can save them in a log file by tee-ing.
cfoldseeker ... | tee log_file.log
Remote mode
The remote mode requires from you a set of protein structural models, and our copy of the UniProt ID mapping table (only the KEGG and GenPept rows of the official mapping table).
Prior work
Get structural models as CIF files of your query proteins, either experimentally or computationally (AlphaFold, ESMFold, OpenFold…). Collect them all in one folder query_models.
The search
Run cfoldseeker in remote mode using our UniProt mapping table uniprot_kegg_genpept.gz and save results in a new folder results. Search the AFDB50, the AFDB-Proteome, the AFDB-SwissProt databases, using 4 workers for the cross-referencing APIs.
cfoldseeker -m remote -q query_models -o results -rdb afdb50 afdb-proteome afdb-swissprot -w 4 -uma uniprot_kegg_genpept.gz
Local mode
The local mode requires from you a set of query protein structures, a genomic context table made using cfoldseeker-cds, and a FoldSeek target database.
Prior work
1. Get structural models as CIF files of your query proteins, either experimentally or computationally (AlphaFold, ESMFold, OpenFold). Collect them all in one folder query_models.
2. Build a genomic context table using our provided helper tool cfoldseeker-cds, which builds a context table directly from a set of NCBI or Bakta GFF files, or from a folder holding an extracted NCBI package of GFF files.
Tip
cfoldseeker-cds usually produces context DBs populated with accession codes. Although using accession codes standardises analysis outputs greatly, they are not very human-friendly. Enable the -tn flag to make cfoldseeker-cds populate the context DB with readily readable taxon names. These names will recur in the outputs of a cfoldseeker analysis.
In both NCBI modes, cfoldseeker-cds will fetch the taxon names from NCBI via the Entrez API. In Bakta mode, it will generate a generic taxon name. In TSV mode, it will trust the user’s inputs.
To produce a compressed genomic context DB ncbi_package_db.gz from a default NCBI package (a folder ncbi_dataset, which has an identically named subfolder):
cfoldseeker-cds -i ncbi-dataset -o ncbi_package_db.gz -m ncbi-package -gz
To produce a compressed context DB ncbi_files_db.gz populated with taxon names from a folder of NCBI GFF files gffs:
cfoldseeker-cds -i gffs -o ncbi_files_db.gz -m ncbi-gff -gz -tn
Note
Keep in mind that cfoldseeker-cds gets the taxa names from the GFF filenames, or from the subfolders in the NCBI package. Give your GFF files a unique name (e.g. NCBI accession ID)!
Tip
cfoldseeker-cds DBs can be concatenated using cat. No need to rerun the builder!
3. Generate the target DB target_DB using foldseek createdb. This prepares a FoldSeek DB from your folder containing the target set of protein structures (input). You probably don’t have thousands of protein structures lingering around, so you will need FoldSeek’s builtin support of ProstT5, a LLM that directly translates amino acid sequences into FoldSeek’s internal 3Di alphabet, skipping protein model prediction. This is the key step that allows searching sequence databases using structural similarity.
First make sure you have downloaded ProstT5’s weights.
foldseek databases ProstT5 <path-to-prostt5-weights> tmp
Then you can build the FoldSeek DB directly from your protein sequences in input.
foldseek createdb input target_DB --prostt5-model <path-to-prostt5-weights>
Warning
This is a time- and computationally demanding task! Consider using a GPU (by adding the --gpu 1 flag if you have the hardware configured).
Tip
FoldSeek offers a command to merge existing DBs foldseek concatdbs. Use it to concat existing target DBs, and save time and computational work.
The search
Run cfoldseeker in local mode using FoldSeek DB target_DB, and context DB cds_db.gz. Save results in a new folder results.
cfoldseeker -m local -q query_models -o results -ldb target_DB/target_DB -cdb cds_db.gz
Local-clustered mode
The local-clustered mode requires a set of query protein structures, a genomic context table made using cfoldseeker-cds, a MMseqs2 clustering TSV file, and a FoldSeek target database of the representative proteins.
Prior work
1. Get structural models as CIF files of your query proteins, either experimentally or computationally (AlphaFold, ESMFold, OpenFold…). Collect them all in one folder query_models.
2. Build a genomic context table using our provided helper tool cfoldseeker-cds, which builds a context table directly from a set of NCBI or Bakta GFF files, or from a folder holding an extracted NCBI package of GFF files. (See the prior work section of local search above)
3. Cluster your target sequences in the folder input_all using mmseqs. You can do this using easy-cluster, or easy-linclust for huge sequence databases. We recommend to use an identity and a coverage threshold of 90 % to ensure all proteins in a protein cluster have identical functions. Use other thresholds at your own risk!
mmseqs easy-linclust input_all clustered tmp --min-seq-id 0.9 -c 0.9
This will, among others, produce a fasta file clustered_rep_seq.fasta containing the amino acid sequences of the representative protein of each cluster.
4. Generate the target DB from the representative sequences using FoldSeek and ProstT5 (see also the prior work section of local search above). Make sure you have downloaded ProstT5’s weights.
foldseek createdb clustered_rep_seq.fasta target_DB --prostt5-model <path-to-prostt5-weights>
The search
Run cfoldseeker in local mode using FoldSeek DB target_DB, context DB cds_db.gz, and preclustering table clustered_table.tsv. Save results in a new folder results.
cfoldseeker -m local_clustered -q query_models -o results -ldb target_DB/target_DB -cdb cds_db.gz -scl clustered_table.tsv
Specifying search options
cfoldseeker offers several filtering thresholds to refine your hit set and find relevant gene clusters.
General options
The general search options are a mix of what cblaster and foldseek offer. These options are listed below.
filter |
Description |
|
Maximum E-value of a Foldseek hit |
|
Minimum bitscore of a Foldseek hit |
|
Minimum sequence identity between a hit and a query (in %) |
|
Minimum query coverage of the hit (in %) |
|
Minimum target coverage of the hit (in %) |
|
Maximum gap between two hits on the same scaffold (in bp) |
|
Maximum length of a cluster (in bp) |
|
Minimum number of hits in a cluster |
|
Minimum number of queries represented in a cluster |
|
Queries that must have a hit in a cluster |
Getting all cluster layouts
Sometimes, a protein may match with multiple query proteins, e.g. when you have two paralogs among your query proteins. This makes it tricky to determine what the query layout of an identified cluster is. For example, if two proteins in a cluster both match with query proteins 1 and 2, cluster layout 12 is equally correct as layout 21. By default, if cfoldseeker encounters the identical cluster passing the filtering thresholds with different layouts, it will only keep the one with the highest cluster score.
If you are interested in all possible cluster layouts passing your filtering thresholds rather than only the highest-scoring one, you can turn on the all-layouts flag and keep all passing configurations.
Using taxon filters
The Foldseek webserver offers an interface to filter hits by a taxonomic filter. cfoldseeker exposes this interface in its remote mode via the -tf or --taxon-filter flag. Foldseek expects NCBI taxon IDs for its taxonomic filter. If you are not sure what the exact taxon ID of your taxa group is, you can check it at the NCBI Taxonomy website.
Another option is to add as a taxonomic filter under settings on the Foldseek webserver. Click then the API button on the top right, which will give you a pop-up with code to submit your query via the terminal, including the taxonomy ID of your taxa group among the settings.
Specifying outputs
cfoldseeker can produce several outputs. By default, it produces only hit and cluster overview tables in TSV format, but several cblaster-style outputs and the raw Foldseek hit tables can be returned on request.
Cluster table
The clusters.tsv file is a tab-separated file summarising the properties of the identified clusters. It comprises the 10 columns described below.
Column |
Description |
number |
Arbitrary unique number |
hits |
IDs of the hits part of this cluster |
start |
Starting coordinate of the entire cluster |
end |
End coordinate of the entire cluster |
length |
Sum of the lengths of all exons part of this cluster |
score |
Sum of the Foldseek bitscores of all cluster members |
scaff |
ID of the scaffold/contig harbouring this cluster |
strand |
Strand location of the cluster |
taxon_name |
Name of the taxon (e.g. NCBI Assembly ID) |
taxon_id |
Unique taxon ID (e.g. NCBI taxon ID) |
Hit table
The hits.tsv file gathers metadata about all hits part of the identified clusters. It contains the 16 columns described below.
Column |
Description |
db_id |
Unique hit ID |
query |
ID of the query with which the hit matches |
scaff |
ID of the scaffold/contig harbouring this hit |
strand |
Strand location of the hit |
coords |
Comma-separated list of coordinate intervals for all exons of this protein |
db |
DB in which this hit was found (for local DBs: local) |
crossref_id |
ID of the cross-referenced record (same as db_id for local DBs) |
crossref_method |
Cross-referencing method (KEGG, GenPept, WGS-GenPept, or local) |
name |
Protein annotation |
taxon_name |
Name of the taxon (e.g. NCBI Assembly ID) |
taxon_id |
Unique taxon ID (e.g. NCBI taxon ID) |
evalue |
Hit e-value |
score |
Hit bitscore |
seqid |
Sequence identity between hit and query protein |
qcov |
Query coverage of the hit (which fraction of the query fits) |
tcov |
Target coverage of the hit (which fraction of the target fits) |
Foldseek output
cfoldseeker can return the raw Foldseek output from which it started. This can be useful if you want to track down why a certain hit was not found being part of a cluster, or how many hits have been found for a certain query protein. In remote mode, cfoldseeker returns the json files it received from the Foldseek webserver. In the local modes, it will return the tab-separated text file returned by the local foldseek call.
cblaster outputs
cfoldseeker has tightly integrated cblaster. All results are cast into a cblaster session, from which familiar outputs can be obtained, such as the summary table, the binary table, the hit plot, and the clinker alignment. See the cblaster documentation for specifics on these outputs.