cfoldseeker

main

cfoldseeker.main.create_parser() ArgumentParser[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:

None

Returns:

An ArgumentParser object holding the CLI ready to collect the arguments when called

Return type:

parser (argparse.ArgumentParser)
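As an illustration of the pattern described above, the following is a minimal sketch of a function that builds and returns an `argparse.ArgumentParser`. The program name and the flags `--query` and `--verbosity` are hypothetical placeholders chosen for this example; they are not taken from the actual cfoldseeker CLI.

```python
import argparse


def create_parser_sketch() -> argparse.ArgumentParser:
    """Build a CLI parser. All flags here are hypothetical placeholders."""
    parser = argparse.ArgumentParser(
        prog="example-tool",
        description="Illustrative parser; not the real cfoldseeker CLI.",
    )
    # Hypothetical: one or more query structure files.
    parser.add_argument("-q", "--query", nargs="+", required=True)
    # Hypothetical: verbosity restricted to the choices 0..4.
    parser.add_argument("-v", "--verbosity", type=int, choices=range(5), default=3)
    return parser


# The parser only collects arguments once parse_args() is called.
args = create_parser_sketch().parse_args(["-q", "a.cif", "b.cif", "-v", "2"])
```

Returning the parser instead of the parsed arguments keeps parser construction separate from parsing, which makes the CLI definition easy to test with an explicit argument list.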

Initialise the correct search class and pass it on the necessary arguments.

Parameters:

parsed_args (dict) – nested dictionary holding the arguments as parsed by parse_and_validate_arguments

Returns:

A Search workflow object ready to run

Return type:

the_run (RemoteSearch | LocalSearch | LocalClusteredSearch)

cfoldseeker.main.main()[source]

Main entry point of cfoldseeker.

Oversees the complete workflow: parses command-line arguments, sets up the logger and the run, and calls the workflow.

cfoldseeker.main.parse_and_validate_arguments(args: Namespace, skip_context_table_check: bool = False) dict[source]

This function validates the parsed arguments given through the command line.

Parameters:
  • args (argparse.Namespace) – A Namespace object with parsed CLI arguments

  • skip_context_table_check (bool) – Skip argument validation for intermediary inputs and outputs in the csuite workflows. For compatibility with the csuite validation checker.

Returns:

A dictionary holding the parsed and validated argument values.

Return type:

parsed_args (dict)

Raises:

ValueError – if an invalid argument value was given.

cfoldseeker.main.run_workflow(parsed_args: dict) None[source]

Execute the complete cfoldseeker workflow.

Initialises the appropriate Run instance, executes it, generates the output, and cleans up the temporary files.

Returns:

None

cfoldseeker.main.setup_logging(verbosity: int) None[source]

Set up the root logger if it has not been set up yet.

Parameters:

verbosity (int) – Verbosity level (choices: 0,1,2,3,4).

Returns:

None
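A minimal sketch of "set up the root logger only if it has not been set up yet" could look as follows. The mapping from verbosity (0 to 4) to logging levels and the message format are assumptions made for this example; the source does not state which level each verbosity value selects.

```python
import logging

# Assumed mapping from verbosity to logging level (not confirmed by the source).
ASSUMED_LEVELS = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def setup_logging_sketch(verbosity: int) -> None:
    """Configure the root logger once; later calls are no-ops."""
    root = logging.getLogger()
    if root.handlers:
        # A handler is already attached, so the logger was set up before.
        return
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    root.addHandler(handler)
    root.setLevel(ASSUMED_LEVELS[verbosity])
```

Checking for existing handlers is one common way to make the setup idempotent, so that calling it from several entry points does not duplicate log lines.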

classes

class cfoldseeker.classes.Cluster(hits, number=0)[source]

Bases: object

Represents a gene cluster containing one or more protein hits.

A cluster groups proximal hits on the same genomic scaffold that meet specified clustering criteria. All hits in a cluster are expected to share the same scaffold and taxon.

hits

List of Hit objects in the cluster.

Type:

list[Hit]

number

Cluster identifier/rank number.

Type:

int

score

Cumulative score of all hits in the cluster.

Type:

int

start

Minimum genomic coordinate across all hits.

Type:

int

end

Maximum genomic coordinate across all hits.

Type:

int

length

Total length in base pairs of all exons across hits.

Type:

int

scaff

Scaffold/contig ID (taken from first hit).

Type:

str

taxon_id

Taxonomic ID (taken from first hit).

Type:

str

taxon_name

Taxonomic name (taken from first hit).

Type:

str

filelabel

Filelabel of local sequence file (taken from first hit).

Type:

str

as_dict() dict[source]

Convert the Cluster object to a dictionary.

Returns:

Dictionary with cluster attributes including comma-separated hit IDs and all genomic coordinates.

Return type:

dict

class cfoldseeker.classes.Hit(db_id, query, crossref_id=[], crossref_method='', name='', taxon_name='', taxon_id=0, db='', filelabel='', evalue=1, score=0, seqid=0, qcov=0, tcov=0, scaff='', coords=[], strand='')[source]

Bases: object

Represents a single protein structure hit from a FoldSeek search.

This class encapsulates information about a homologous protein structure match, including its database identifiers, sequence similarity metrics, genomic location, and taxonomic data.

query

ID of the homologous query protein.

Type:

str

db_id

ID of the hit in its structure database.

Type:

str

db

Structure database the hit was found in.

Type:

str

crossref_id

ID used for cross-referencing (either KEGG or GenPept ID).

Type:

list

crossref_method

Method used for cross-referencing (either KEGG, GenPept, WGS-GenPept, or local).

Type:

str

name

Annotation or description of the hit.

Type:

str

taxon_name

Name of the taxon in which this hit was found.

Type:

str

taxon_id

Identifier of the taxon in which this hit was found.

Type:

int

evalue

E-value of the FoldSeek hit.

Type:

float

score

FoldSeek alignment score.

Type:

int

seqid

Sequence identity percentage with the query protein.

Type:

float

qcov

Query coverage percentage.

Type:

float

tcov

Target coverage percentage.

Type:

float

scaff

RefSeq or GenBank ID of the scaffold encoding the hit.

Type:

str

coords

Genomic coordinates of the encoding gene’s exons.

Type:

list

strand

DNA strand the encoding gene is located on (‘+’ or ‘-’).

Type:

str

as_dict() dict[source]

Convert the Hit object to a dictionary.

Returns:

Dictionary with all Hit attributes; coordinates are formatted as double-dot-separated exon pairs joined by commas.

Return type:

dict
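The coordinate format mentioned above (double-dot-separated exon pairs joined by commas) can be sketched as a small helper. The function name `format_coords` is hypothetical, and the sketch assumes that coordinates are stored as a list of `[start, end]` pairs.

```python
def format_coords(coords: list[list[int]]) -> str:
    """Format exon pairs as 'start..end' strings joined by commas."""
    return ",".join(f"{start}..{end}" for start, end in coords)


# Two exons become one comma-separated string.
formatted = format_coords([[10, 50], [150, 200]])  # "10..50,150..200"
```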

end() int | None[source]

Return the end coordinate of the last exon.

Returns:

Maximum genomic coordinate across all exons, or None if no coordinates are defined.

Return type:

int | None

intergenic_distance(other_hit: Hit) int[source]

Calculate the intergenic distance between this hit and another hit.

For genes on the same scaffold, computes the distance between the end of the upstream gene and the start of the downstream gene. If genes overlap, returns the negative of the length of the overlapping gene.

Parameters:

other_hit (Hit) – The other Hit object to measure distance to.

Returns:

Intergenic distance in base pairs (positive for gaps, negative for overlaps). Returns the negative of the length of the smaller gene in case of a full overlap.

Return type:

int
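The distance rule described above can be sketched on plain `(start, end)` tuples. This is an illustration under stated assumptions, not the library's implementation: coordinates are assumed to be inclusive, both genes are assumed to lie on the same scaffold, and the function name is hypothetical.

```python
def intergenic_distance_sketch(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Distance between two genes given as inclusive (start, end) tuples."""
    upstream, downstream = sorted([a, b])  # order by start coordinate
    if downstream[1] <= upstream[1]:
        # Full overlap: the downstream gene lies inside the upstream one,
        # so return the negative length of the smaller (contained) gene.
        return -(downstream[1] - downstream[0] + 1)
    # Positive for a gap, zero for adjacent genes, negative for partial overlap.
    return downstream[0] - upstream[1] - 1


gap = intergenic_distance_sketch((100, 200), (301, 400))      # 100
overlap = intergenic_distance_sketch((100, 200), (150, 300))  # -51
nested = intergenic_distance_sketch((100, 300), (150, 200))   # -51
```

With inclusive coordinates, the partial overlap of 150..200 spans 51 bases, which is why the second call yields -51.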

length() int[source]

Return the total length in base pairs of all exons.

Returns:

Sum of lengths across all exons, calculated as (end - start + 1) for each exon.

Return type:

int
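As a worked instance of the formula above, the following sketch sums `(end - start + 1)` over a list of exon pairs. The function name is hypothetical and the input layout (a list of `[start, end]` pairs) is an assumption.

```python
def total_exon_length(coords: list[list[int]]) -> int:
    """Sum the inclusive lengths of all exons."""
    return sum(end - start + 1 for start, end in coords)


# Exon 10..50 covers 41 bases and exon 150..200 covers 51 bases.
length = total_exon_length([[10, 50], [150, 200]])  # 92
```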

same_location(other_hit: Hit) bool[source]

Check if two hits are at exactly the same genomic coordinates.

Parameters:

other_hit (Hit) – The other Hit object to compare.

Returns:

True if both hits are on the same scaffold and their genomic coordinates completely overlap, False otherwise.

Return type:

bool

start() int | None[source]

Return the start coordinate of the first exon.

Returns:

Minimum genomic coordinate across all exons, or None if no coordinates are defined.

Return type:

int | None

class cfoldseeker.classes.Search(query, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: ABC

Abstract base class for protein structure searches with cluster identification.

This class manages a FoldSeek-based search workflow (remote or local), including result parsing, cluster identification using graph-based algorithms, and output generation. Subclasses must implement abstract methods for search execution and result parsing. Methods for the cluster identification and output generation are implemented here and shared over all subclasses.

query

List of query protein structure file paths.

Type:

list

params

Search configuration parameters (e.g., max gap, min hits).

Type:

dict

hits

All identified hits from the search.

Type:

list[Hit]

clusters

Identified gene clusters passing filters.

Type:

list[Cluster]

output_flags

Outputs to be generated.

Type:

dict

OUTPUT_DIR

Directory for output files.

Type:

Path

TEMP_DIR

Directory for temporary files.

Type:

Path

generate_cblaster_session() Session[source]

Generate a cblaster Session object from the identified clusters.

Constructs a cblaster-compatible session containing all search results, organised in the same hierarchy as cblaster (by organism, scaffold, and cluster). This object can be saved and reloaded for interactive visualisation and analysis outside of cfoldseeker.

Returns:

Session holding all information about the identified clusters.

Return type:

Session (cblaster.Session)

generate_output()[source]

Generate the requested output files for this search.

Checks which outputs are requested from the parsed output flags, and generates what is necessary using the appropriate methods.

Parameters:

None

Returns:

None

generate_tables(output_folder: Path) None[source]

Save hit and cluster lists as tab-separated value (TSV) tables.

Generates two output files:

  • hits.tsv: Table of all hits with their properties.

  • clusters.tsv: Table of all clusters with their properties.

Parameters:

output_folder (Path) – Directory where output tables will be written.

identify_clusters() None[source]

Identify gene clusters among the hits based on clustering criteria.

This method groups hits by scaffold, calculates intergenic distances, filters based on maximum gap and minimum hit thresholds, and uses a directed graph to identify chains of unique proximal hits. It then applies additional filters for cluster size, query coverage, and length before ranking clusters by score.

The method populates self.clusters with identified Cluster objects and updates self.hits to contain only hits in identified clusters.

Raises:
  • RuntimeError – If no hit groups pass the distance criteria.

  • RuntimeError – If no cluster could be identified among the hit groups.
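The grouping idea can be illustrated with a deliberately simplified sketch. It replaces the directed-graph step described above with a linear sweep over hits sorted by start coordinate, and it ignores the additional filters (query coverage, cluster length, scoring). Hits are modelled as hypothetical `(scaffold, start, end)` tuples, and the function name is made up for this example.

```python
from collections import defaultdict


def chain_hits_sketch(hits, max_gap, min_hits):
    """Group hits per scaffold into chains whose neighbours are <= max_gap apart."""
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit[0]].append(hit)

    clusters = []
    for scaffold_hits in by_scaffold.values():
        scaffold_hits.sort(key=lambda h: h[1])  # sort by start coordinate
        current = [scaffold_hits[0]]
        for hit in scaffold_hits[1:]:
            # Gap between the previous hit's end and this hit's start.
            if hit[1] - current[-1][2] - 1 <= max_gap:
                current.append(hit)
            else:
                if len(current) >= min_hits:
                    clusters.append(current)
                current = [hit]
        if len(current) >= min_hits:
            clusters.append(current)
    return clusters


example_hits = [("s1", 1, 100), ("s1", 150, 300), ("s1", 5000, 5100), ("s2", 10, 50)]
found = chain_hits_sketch(example_hits, max_gap=100, min_hits=2)
```

In this toy input, only the first two hits on `s1` are close enough to form a chain of at least two hits; the remaining hits are discarded as singletons.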

identify_clusters_from_groups(close_groups: list, max_length: int, min_hits: int, min_covered_queries: int, require: set[str], all_layouts: bool)[source]

Identify clusters from the proximal hit groups.

Constructs Cluster objects from hit groups that pass all cluster identification thresholds (max cluster length, minimum no. hits, minimum no. covered queries, required queries). Can also return all cluster layouts that fit the cluster identification thresholds with a less-than-best score.

Returns:

list of Cluster objects that pass all identification thresholds.

Return type:

clusters (list[Cluster])

abstractmethod identify_hits()[source]

Parse FoldSeek output and populate the hits list.

Note

This method must be implemented by subclasses to convert raw FoldSeek results from the webserver or a local command call into a list of Hit objects.

identify_proximal_groups(max_gap: int) list[list[Hit]][source]

Identify proximal groups among the hits.

Calculates the distance between all genes on the same scaffold, discards self-hits and hit pairs that fail the intergenic distance threshold.

Returns:

Hit pairs of proximal hits that pass the intergenic distance threshold.

Return type:

close_groups (list[list[Hit]])

Raises:

RuntimeError – If no hit groups pass the intergenic distance criteria.

abstractmethod run()[source]

Execute the complete search workflow.

Note

This method must be implemented by subclasses to orchestrate the entire search process including FoldSeek execution and result parsing.

abstractmethod run_foldseek()[source]

Run the FoldSeek search tool.

Note

This method must be implemented by subclasses to execute FoldSeek either remotely or locally with the appropriate parameters, input files and target databases.
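The contract formed by the three abstract methods can be sketched with Python's `abc` module. The class names `SearchSketch` and `DummySearch`, and the fake result data, are hypothetical and serve only to show how a subclass fills in the abstract steps.

```python
from abc import ABC, abstractmethod


class SearchSketch(ABC):
    """Toy base class with the same three abstract steps."""

    def __init__(self, query):
        self.query = list(query)
        self.hits = []

    @abstractmethod
    def run_foldseek(self): ...

    @abstractmethod
    def identify_hits(self): ...

    @abstractmethod
    def run(self): ...


class DummySearch(SearchSketch):
    """Hypothetical subclass that fakes the search step."""

    def run_foldseek(self):
        # Stand-in for a real search call; the data here is made up.
        self.raw = [{"target": "hit_1"}, {"target": "hit_2"}]

    def identify_hits(self):
        self.hits = [row["target"] for row in self.raw]

    def run(self):
        self.run_foldseek()
        self.identify_hits()


search = DummySearch(["query.cif"])
search.run()
```

Instantiating the base class directly raises a `TypeError`, which is how the `ABC` machinery enforces that subclasses implement every abstract method.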

remote

class cfoldseeker.remote.RemoteSearch(query, mapping_table_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: Search

Subclass executing the workflow for gene cluster identification from remote protein searches.

Extends the Search base class to perform FoldSeek-based searches against remote databases, parse results, and cross-reference hits to genomic data. Uses a local copy of the UniProt mapping table to retrieve genomic coordinates for each protein with fallback strategies using KEGG and ENA.

New attributes:

mapping_table: Polars LazyFrame containing UniProt ID cross-references to various databases (KEGG, EMBL-CDS, etc.).

Inherits from:

Search: Base class providing the cluster identification and output generation capabilities.

See also

LocalSearch: Sister class providing the search and parsing capabilities for local database searches.

crossref_afdb() list[source]

Cross-reference AFDB hits and retrieve genomic neighbourhood information.

Systematically cross-references AlphaFold DB (AFDB) hits to genomic data through three methods: KEGG IDs from KEGG, GenPept IDs from ENA, and WGS-GenPept IDs from UniSave. For each hit, extracts scaffold IDs, CDS coordinates, and strand information. Updates scaffold IDs to their latest versions using NCBI Entrez.

Returns:

A list of Hit objects with populated genomic information (scaffold, coordinates, and strand data).

crossref_afdb_via_genpept(afdb_hits: list[Hit]) list[Hit][source]

Cross-reference AFDB hits to GenPept IDs in ENA using the UniProt mapping table.

Retrieves the GenPept IDs in ENA for all hits and updates the crossref ID and method if any.

Parameters:

afdb_hits – List of Hit objects without genomic information and cross-references

Mutates:

Hit objects in afdb_hits: Fills cross-reference ID and method attributes

Returns:

list of Hits that have no cross-reference to GenPept in the UniProt mapping table. These need to be processed by a different cross-referencing method.

Return type:

hits_failed_legg (list[Hit])

crossref_afdb_via_kegg(afdb_hits: list[Hit]) list[Hit][source]

Cross-reference AFDB hits to KEGG IDs using the UniProt mapping table.

Retrieves the KEGG IDs for all hits and updates the crossref ID and method if any.

Parameters:

afdb_hits – List of Hit objects without genomic information and cross-references

Mutates:

Hit objects in afdb_hits: Fills cross-reference ID and method attributes

Returns:

list of Hits that have no cross-reference to KEGG in the UniProt mapping table. These need to be processed by a different cross-referencing method.

Return type:

hits_failed_legg (list[Hit])

crossref_afdb_via_wgs_genpept(afdb_hits: list[Hit]) list[Hit][source]

Cross-reference AFDB hits to WGS-GenPept IDs in UniSave using the UniSave API.

Retrieves the WGS-GenPept IDs in UniSave for all hits and updates the crossref ID and method if any.

Parameters:

afdb_hits – List of Hit objects without genomic information and cross-references

Mutates:

Hit objects in afdb_hits: Fills cross-reference ID and method attributes

Returns:

list of Hits that have no cross-reference to WGS-GenPept in UniSave. These need to be processed by a different cross-referencing method.

Return type:

hits_failed_legg (list[Hit])

identify_hits() None[source]

Parse FoldSeek results and create Hit objects for hits passing the predefined criteria.

Extracts hit information from the raw FoldSeek results, and applies filtering thresholds based on e-value, bit score, sequence identity, and coverage. Creates Hit objects for hits meeting all criteria and removes redundant hits found in multiple databases, keeping only the first instance per UniProt ID.

Returns:

None

Raises:

RuntimeError – If the hit list is empty after applying the criteria

Note

Logs a warning when there are no hits for a certain query and DB pair.

passes_criteria(hit: Hit)[source]

Check if a hit passes the criteria set for this search.

Returns:

True if the hit passes all criteria, False otherwise.

Return type:

(bool)

prepare_mapping_dict(all_uniprot_ids: list, db: str) dict[source]

Extract a mapping dictionary from the local UniProt mapping table.

Filters the full UniProt cross-reference LazyFrame to extract IDs for a specific target database, grouping multiple cross-references per UniProt ID into a dictionary of lists.

Parameters:
  • all_uniprot_ids – List of UniProt accession numbers to extract.

  • db – Target database name (e.g., ‘KEGG’, ‘EMBL-CDS’).

Returns:

A dictionary mapping UniProt IDs (keys) to lists of cross-reference IDs in the target database (values).
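The shape of the returned dictionary can be illustrated in plain Python. This sketch swaps the Polars LazyFrame for a list of `(uniprot_id, database, xref_id)` tuples so that it runs without third-party packages; the function name and all identifiers in the sample rows are made up.

```python
from collections import defaultdict


def build_mapping_sketch(rows, wanted_ids, db):
    """Group cross-reference IDs of one target database per UniProt ID."""
    wanted = set(wanted_ids)
    mapping = defaultdict(list)
    for uniprot_id, database, xref_id in rows:
        if database == db and uniprot_id in wanted:
            mapping[uniprot_id].append(xref_id)
    return dict(mapping)


# Made-up rows standing in for the mapping table.
rows = [
    ("U1", "KEGG", "abc:0001"),
    ("U1", "EMBL-CDS", "CDS1"),
    ("U1", "EMBL-CDS", "CDS2"),
    ("U2", "EMBL-CDS", "CDS3"),
    ("U3", "EMBL-CDS", "CDS4"),
]
mapping = build_mapping_sketch(rows, ["U1", "U2"], "EMBL-CDS")
```

UniProt IDs without a cross-reference in the target database are simply absent from the result, which makes it easy to collect the hits that need a fallback method.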

pull_and_parse_genpept_records(afdb_hits: list[Hit]) list[Hit][source]

Pull and parse the (WGS-)GenPept records associated with an AFDB hit. Update the hit attributes.

Pulls the (WGS-)GenPept records associated with an AFDB hit. It immediately extracts the scaffold and strand information, and the genomic coordinates from the record and updates the hit’s attributes accordingly.

Parameters:

afdb_hits – List of Hit objects with cross-references

Mutates:

Hit objects in afdb_hits: Fills scaffold, strand and coordinates attributes

Returns:

Hit objects with updated genomic location attributes retrieved from this cross-referencing method.

Return type:

processed_hits (list[Hit])

pull_and_parse_kegg_records(afdb_hits: list[Hit]) list[Hit][source]

Pull and parse the KEGG records associated with an AFDB hit. Update the hit attributes.

Pulls the KEGG records associated with an AFDB hit. It immediately extracts the scaffold and strand information, and the genomic coordinates from the record and updates the hit’s attributes accordingly.

Parameters:

afdb_hits – List of Hit objects with cross-references

Mutates:

Hit objects in afdb_hits: Fills scaffold, strand and coordinates attributes

Returns:

Hit objects with updated genomic location attributes retrieved from this cross-referencing method.

Return type:

processed_hits (list[Hit])

run() None[source]

Execute the complete remote search workflow.

Orchestrates all processing steps in sequence: running FoldSeek remotely via the webserver, parsing the results, cross-referencing using different sources, pulling the cross-referenced records, parsing the genomic neighbourhood coordinates from these records, and identifying the gene clusters.

Returns:

None

run_foldseek() None[source]

Submit protein queries to FoldSeek and retrieve the results.

Sends all query structures to the FoldSeek webserver in parallel, collects submission tickets, monitors job status, and downloads completed results. Stores raw JSON results in the temporary folder.

Returns:

None

update_version_digits(processed_hits: list[Hit], max_attempts: int = 3) list[Hit][source]

Update the version digits of the scaffold IDs of every hit.

Adds or updates the version digit of the scaffold ID of every hit. Retrieves the latest scaffold ID from the NCBI Entrez API with retry logic, and updates the hit’s scaffold IDs accordingly.

Parameters:
  • processed_hits (list[Hit]) – hits with filled cross-reference and genomic location attributes

  • max_attempts (int) – Maximum number of times to try getting the most recent version digits from Entrez.

Mutates:

Hit objects in processed_hits: Updates the scaffold attribute with the most recent version digit.

Returns:

hits with an updated version digit in the scaffold attribute of every hit.

Return type:

processed_hits (list[Hit])

local

class cfoldseeker.local.LocalSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: Search

Subclass executing the workflow for gene cluster identification from local protein searches using FoldSeek.

Extends the Search base class to perform searches against local FoldSeek databases. Handles FoldSeek execution, result parsing, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds.

db_path

Path to the FoldSeek protein structure target database.

Type:

Path

coord_db

LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.

Type:

polars.LazyFrame

collect_hits(results: DataFrame) None[source]

Collects hit instances from a filtered hit table.

Collects and instantiates Hit objects for every hit in the filtered table, after fetching genomic context data from the context database. Genomic coordinate strings are parsed on-the-fly.

Parameters:

results (polars.DataFrame) – A filtered FoldSeek hits table

Returns:

None

Mutates:

self.hits: Instantiates the list of identified Hit objects.

Note

Stores generated Hit objects in self.hits as a list. Genomic coordinates are parsed from a comma-separated string of joined range pairs (e.g., “10..50”, “join(150..200,250..300)”) into nested lists of integers.
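The coordinate parsing mentioned in the note can be sketched with a regular expression that pulls every `start..end` pair out of the string. The function name is hypothetical, and the sketch assumes plain numeric ranges; it does not handle partial-location markers such as `<` or `>`.

```python
import re

# Matches one 'start..end' range with plain integer bounds.
RANGE_PATTERN = re.compile(r"(\d+)\.\.(\d+)")


def parse_coords_sketch(text: str) -> list[list[int]]:
    """Parse 'a..b' or 'join(a..b,c..d)' into nested lists of integers."""
    return [[int(start), int(end)] for start, end in RANGE_PATTERN.findall(text)]


single = parse_coords_sketch("10..50")                   # [[10, 50]]
joined = parse_coords_sketch("join(150..200,250..300)")  # [[150, 200], [250, 300]]
```

Scanning for ranges rather than splitting on commas means the same helper covers both the single-exon and the `join(...)` form.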

identify_hits() None[source]

Identify hits passing the hit thresholds from the FoldSeek results.

Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.

Returns:

None

parse_foldseek_results() DataFrame[source]

Parse the FoldSeek result table, expand it to include all members of the sequence clusters, and generate Hit objects with filled genomic coordinates.

Reads the FoldSeek result table, expands it with all members of each sequence cluster of which the original FoldSeek hits are the representatives (by joining with the clustering table), applies filtering thresholds (bit score, query coverage, target coverage), removes duplicate hits, and joins results with the CDS coordinates database. Parses genomic coordinates from the coordinate string and creates Hit objects for each match.

The following filtering thresholds are applied:

  1. Sequence identity >= min_seqid

  2. E-value <= max_eval

  3. Bit score >= min_score

  4. Query coverage >= min_qcov (converted to percentage)

  5. Target coverage >= min_tcov (converted to percentage)

Returns:

A filtered FoldSeek hits table

Return type:

results (polars.DataFrame)
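The five thresholds listed above can be expressed as a single predicate. This sketch works on plain dictionaries rather than a Polars table; the row keys and the function name are hypothetical, and all coverage and identity values are assumed to already be in the same unit as their thresholds.

```python
def passes_thresholds_sketch(row: dict, params: dict) -> bool:
    """Return True if a result row meets all five hit-level thresholds."""
    return (
        row["seqid"] >= params["min_seqid"]
        and row["evalue"] <= params["max_eval"]
        and row["score"] >= params["min_score"]
        and row["qcov"] >= params["min_qcov"]
        and row["tcov"] >= params["min_tcov"]
    )


# Made-up thresholds and rows for illustration.
params = {"min_seqid": 30, "max_eval": 1e-3, "min_score": 50, "min_qcov": 70, "min_tcov": 70}
good = {"seqid": 45.0, "evalue": 1e-10, "score": 120, "qcov": 90.0, "tcov": 85.0}
weak = {"seqid": 45.0, "evalue": 1e-10, "score": 120, "qcov": 40.0, "tcov": 85.0}
kept = [row for row in (good, weak) if passes_thresholds_sketch(row, params)]
```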

run() None[source]

Execute the complete local search workflow.

Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.

Returns:

None

run_foldseek() None[source]

Execute a FoldSeek search against the local protein structure database.

Constructs and runs a FoldSeek ‘easy-search’ command with all query structures (in CIF format) against the local database. Applies filters for sequence identity and E-value thresholds. Captures stdout and stderr in real-time via separate threads and logs them appropriately.

Exhaustive search (no database prefiltering) has been enabled to retrieve all hits in the target database.

Returns:

None

Raises:

RuntimeError – If FoldSeek returns a non-zero exit code.

Note

FoldSeek output is written to a temporary TSV file in TEMP_DIR.

local_clustered

class cfoldseeker.local_clustered.LocalClusteredSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'), seq_clust_tsv=PosixPath('.'))[source]

Bases: LocalSearch

Subclass executing the workflow for gene cluster identification from local protein searches in a sequence-preclustered database using FoldSeek.

Extends the LocalSearch base class to expand identified hit sets with cluster members from a premade sequence clustering of the target database. In short, it runs a LocalSearch against a FoldSeek structure database of sequence cluster representatives and adds the sequence cluster members if their representative was identified as a valid hit, before continuing with cross-referencing and gene cluster identification.

Handles FoldSeek execution, result parsing, adding sequence cluster members, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds, and a TSV with the sequence cluster members made beforehand with MMseqs2.

db_path

Path to the FoldSeek protein structure target database.

Type:

Path

coord_db

LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.

Type:

polars.LazyFrame

seq_clust

LazyFrame containing the sequence cluster representative of every sequence in the target database. These representatives then form the FoldSeek target structure database.

Type:

polars.LazyFrame

expand_sequence_clusters(results: DataFrame) DataFrame[source]

Expand a given FoldSeek result table with all members of the original hits’ sequence clusters.

Includes all members of the sequence clusters of which the original FoldSeek hits are the representatives, by joining the FoldSeek result table with the MMseqs2 clustering table.

Drops duplicate protein/query pairs.

Hit metadata are taken over from the representative protein.

Parameters:

results (polars.DataFrame) – Original FoldSeek result table with only the representative proteins

Returns:

Result table with all added non-representative proteins for each representative

Return type:

expanded_results (polars.DataFrame)
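The expansion step can be sketched without Polars by treating the clustering table as a dictionary from representative to members. Everything here is illustrative: the function name, the row keys `query` and `target`, and the sample data are hypothetical.

```python
def expand_clusters_sketch(results, cluster_members):
    """Add one row per cluster member, copying the representative's metadata."""
    expanded, seen = [], set()
    for row in results:
        # A representative without a cluster entry is kept as its own member.
        members = cluster_members.get(row["target"], [row["target"]])
        for member in members:
            key = (row["query"], member)
            if key in seen:
                continue  # drop duplicate protein/query pairs
            seen.add(key)
            expanded.append({**row, "target": member})
    return expanded


# Made-up result rows and clustering table.
results = [{"query": "q1", "target": "repA", "score": 100}]
cluster_members = {"repA": ["repA", "memB", "memC"]}
expanded = expand_clusters_sketch(results, cluster_members)
```

Each added member inherits the score and other metadata of its representative, mirroring the statement that hit metadata are taken over from the representative protein.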

identify_hits() None[source]

Identify hits passing the hit thresholds from the FoldSeek results.

Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.

Returns:

None

Mutates:

self.hits: Instantiates the list of identified Hit objects.

run() None[source]

Execute the complete local search workflow.

Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.

Returns:

None

communication

cfoldseeker.communication.check_query_status(job_id: str) str[source]

Retrieves the current status of a FoldSeek job.

Queries the FoldSeek API to check the processing status of a previously submitted job using its unique job ID.

Parameters:

job_id – The unique identifier for the FoldSeek job.

Returns:

A string indicating the job status (e.g., “COMPLETE”, “RUNNING”, etc.).

cfoldseeker.communication.pull_dict_from_unisave(entries: list, max_workers: int = 1, no_progress: bool = False) dict[source]

Retrieves multiple UniSave records and returns them as a dictionary.

Fetches a list of UniSave entries in parallel and returns them mapped to their original accession numbers. Failed retrievals are filtered out.

Parameters:
  • entries – List of UniProt accession numbers to retrieve.

  • max_workers – Number of worker threads for parallel retrieval. Defaults to 1.

  • no_progress – If True, suppresses the progress bar during retrieval. Defaults to False.

Returns:

A dictionary mapping each successfully retrieved accession number to its corresponding UniSave record as a string. Failed retrievals are excluded from the dictionary.

cfoldseeker.communication.pull_from_ena(entry: str, max_retries: int = 3) None | str[source]

Retrieves a GenPept record from the ENA Browser API.

Attempts to fetch a GenPept sequence record from the European Nucleotide Archive (ENA) with retry logic for rate-limited responses.

Parameters:
  • entry – The accession number or identifier of the GenPept record to retrieve.

  • max_retries – Maximum number of retry attempts for rate-limited requests (error code 429). Defaults to 3. The waiting time between retries is 5 seconds.

Returns:

A string containing the GenPept record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.
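The retry behaviour described above can be sketched independently of any HTTP library by injecting the fetch function. The helper name `fetch_with_retry`, the `(status_code, text)` return convention of `fetch`, and the fake fetcher are assumptions made for this example; no real ENA endpoint is contacted.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], tuple[int, str]],
    entry: str,
    max_retries: int = 3,
    wait_seconds: float = 5.0,
) -> Optional[str]:
    """Retry on HTTP 429; return None on any other failure."""
    for _ in range(max_retries):
        try:
            status, text = fetch(entry)
        except Exception:
            return None  # unexpected error: give up
        if status == 200:
            return text
        if status != 429:
            return None  # not rate-limited, so retrying will not help
        time.sleep(wait_seconds)  # rate-limited: wait before the next attempt
    return None


# Fake fetcher: rate-limited twice, then successful.
responses = iter([(429, ""), (429, ""), (200, "RECORD")])
record = fetch_with_retry(lambda entry: next(responses), "X1", wait_seconds=0)
```

Passing the fetcher as a parameter keeps the retry logic testable without network access.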

cfoldseeker.communication.pull_from_unisave(entry: str, max_retries: int = 3) None | str[source]

Retrieves a UniSave record from the UniProt REST API.

Fetches a protein sequence record from UniSave (UniProt archive) with retry logic for rate-limited responses.

Parameters:
  • entry – The UniProt accession number of the record to retrieve.

  • max_retries – Maximum number of retry attempts for rate-limited requests (error code 429). Defaults to 3. The waiting time between retries is 5 seconds.

Returns:

A string containing the UniSave record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.

cfoldseeker.communication.retrieve_foldseek_results(job_id: str) dict[source]

Waits for a FoldSeek job to complete and retrieves its results.

Polls the job status at regular intervals until completion, then downloads and returns the parsed results from the FoldSeek API.

Parameters:

job_id – The unique identifier for the FoldSeek job.

Returns:

A dictionary containing the parsed results from the completed FoldSeek job.
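The polling loop can be sketched with an injected status function so that it runs without a webserver. The helper name, the polling cap `max_polls`, and the `"ERROR"` status are assumptions made for this example; only `"COMPLETE"` and `"RUNNING"` appear in the documentation above.

```python
import time
from typing import Callable


def wait_for_job_sketch(
    check_status: Callable[[str], str],
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 100,
) -> str:
    """Poll a job until it reports COMPLETE, or fail."""
    for _ in range(max_polls):
        status = check_status(job_id)
        if status == "COMPLETE":
            return status
        if status == "ERROR":  # assumed failure status
            raise RuntimeError(f"Job {job_id} failed")
        time.sleep(interval)  # still running: wait before polling again
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Fake status source: two RUNNING answers, then COMPLETE.
statuses = iter(["RUNNING", "RUNNING", "COMPLETE"])
final_status = wait_for_job_sketch(lambda job_id: next(statuses), "job-1", interval=0)
```

Capping the number of polls is a design choice of this sketch; it prevents an endless loop if a job never leaves the running state.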

cfoldseeker.communication.submit_foldseek_query(query_path: Path, dbs: list, taxfilters: list, max_attempts: int = 3) dict[source]

Submits a structure file to the FoldSeek API for processing.

Sends a protein structure query to the FoldSeek webserver with the specified databases and taxonomic filters. Returns the submission ticket on success. Failed submissions are retried up to max_attempts times, after which an error is raised.

Parameters:
  • query_path (Path) – Path object pointing to the structure file to submit.

  • dbs (list) – List of database names to search against.

  • taxfilters (list) – List of taxonomic filters to apply to the search.

  • max_attempts (int) – Maximum number of submission attempts. Defaults to 3.

Returns:

A dictionary containing the submission ticket and metadata from the FoldSeek API response.

Raises:

RuntimeError – If submission fails too many times.

remote_parsers

cfoldseeker.remote_parsers.extract_genomic_information_ena(record: str) dict[source]

Extracts the genomic information from a pulled ENA GenPept record.

cfoldseeker.remote_parsers.extract_genomic_information_kegg(gene_entry: str) dict[source]

Extracts the genomic information from a pulled KEGG Gene record.

cfoldseeker.remote_parsers.extract_scaffold_mapping_kegg(genome_entry: str) dict[source]

Maps all KEGG scaffold IDs for a Genome entry to the associated GenBank/RefSeq IDs.

build_cds_db

cfoldseeker.build_cds_db.check_duplicate_contigs(cds_db: LazyFrame, parsing_mode: str) LazyFrame[source]

Check for and attempt to fix duplicate contig labels per taxon.

Detects cases where the same contig label appears in multiple taxa. For Bakta GFF files, attempts to prepend the existing locus tag prefix to make contigs unique. For other formats, exits with an error.

Parameters:
  • cds_db (polars.LazyFrame) – A dataframe containing CDS records with ‘contig’, ‘taxon_id’, and ‘gene_tag’ columns.

  • parsing_mode (str) – The format mode used for parsing (‘bakta-gff’, ‘ncbi-gff’, ‘ncbi-package’, or ‘tsv’).

Returns:

The input LazyFrame with modified contig labels if a fix was applied (Bakta mode only).

Return type:

cds_db (polars.LazyFrame)

Mutates:

cds_db (polars.LazyFrame): The input LazyFrame with modified contig labels if a fix was applied (Bakta mode only).

Raises:

RuntimeError – If duplicate contigs are detected and cannot be fixed, or if fix attempt fails.

Note

This function contains a potential local partial materialisation of the LazyFrame. This triggers all files to be parsed.
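The detection step (the same contig label appearing in more than one taxon) can be illustrated in plain Python. This sketch uses `(contig, taxon_id)` tuples instead of a Polars LazyFrame and does not attempt the Bakta-specific fix; the function name and sample records are hypothetical.

```python
from collections import defaultdict


def find_shared_contigs_sketch(records):
    """Return contig labels that occur in more than one taxon."""
    taxa_per_contig = defaultdict(set)
    for contig, taxon_id in records:
        taxa_per_contig[contig].add(taxon_id)
    return sorted(c for c, taxa in taxa_per_contig.items() if len(taxa) > 1)


# Made-up records: 'contig_1' is used by two different taxa.
records = [
    ("contig_1", "taxA"),
    ("contig_1", "taxA"),
    ("contig_1", "taxB"),
    ("contig_2", "taxB"),
]
shared = find_shared_contigs_sketch(records)  # ["contig_1"]
```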

cfoldseeker.build_cds_db.create_parser() ArgumentParser[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:

None

Returns:

An ArgumentParser object holding the CLI ready to collect the arguments when called

Return type:

parser (ArgumentParser)

cfoldseeker.build_cds_db.fetch_taxon_names(taxon_ids: list, max_attempts: int = 5) list[source]

Fetch NCBI taxon names for a given list of NCBI taxon IDs.

Fetches summary objects for a given list of NCBI Taxonomy IDs using BioPython’s Entrez API with retry logic, and extracts the taxon name from each summary.

Parameters:
  • taxon_ids (list) – List of NCBI taxon IDs as strings.

  • max_attempts (int) – Maximum number of attempts to retry fetching in case of a failure. Defaults to 5.

Returns:

List of NCBI taxon names

Return type:

taxon_names (list)

cfoldseeker.build_cds_db.main()[source]

Main entry point for the CDS database construction tool.

Oversees the complete workflow: parses command-line arguments, and calls the workflow.

cfoldseeker.build_cds_db.parse_and_validate_arguments(args: Namespace) dict[source]

This function parses and validates the arguments received through the command line.

Parameters:

args (argparse.Namespace) – A Namespace holder object with the parsed argument values

Returns:

A dictionary holding the parsed and validated argument values.

Return type:

parsed_args (dict)

Raises:

ValueError – if an invalid argument value was given.

cfoldseeker.build_cds_db.parse_files(input_path: Path, parsing_mode: str, n_workers: int = 1, no_progress: bool = False) LazyFrame[source]

Parses all input files and constructs a draft CDS coordinates database.

Dispatches file parsing based on the specified parsing mode, using parallel processing to handle multiple files efficiently. Concatenates all parsed results into a single DataFrame.

Parameters:
  • input_path (Path) – Path to the folder containing input files.

  • parsing_mode (str) – File format mode - one of: ‘ncbi-gff’, ‘ncbi-package’, ‘bakta-gff’, or ‘tsv’.

  • n_workers (int) – Number of worker threads for parallel file parsing. Defaults to 1.

  • no_progress (bool) – If True, suppresses the progress bar during parsing. Defaults to False.

Returns:

A Polars LazyFrame containing concatenated CDS records from all input files with columns: gene_tag, name, contig, coords, strand, taxon_id, and filename (or taxon_name for TSV formats).

cfoldseeker.build_cds_db.run_workflow(parsed_args: dict) None[source]

Execute the complete CDS database construction workflow.

Loads and parses input files, validates contig uniqueness, assigns taxon labels, and writes the final CDS coordinates database to disk as a tab-separated file. Supports optional gzip compression.

Parameters:

parsed_args (dict) – A dictionary holding the parsed and validated argument values.

Returns:

None

Note

This workflow uses a lazy parsing method to limit RAM usage. This comes at the cost of the files being parsed multiple times, which is slower.

cfoldseeker.build_cds_db.set_taxon_labels(cds_db: LazyFrame, fetch_taxa_auto: bool, fetch_taxa_file: Path, parsing_mode: str, batch_size: int = 250, max_attempts: int = 5) LazyFrame[source]

Set taxon labels as either scientific names or filenames.

For NCBI files, optionally fetches the scientific names from NCBI Taxonomy via BioPython’s NCBI Entrez API with retry logic. For Bakta GFF files, generates generic labels or uses filenames. For TSV files, preserves user-provided annotations.

Parameters:
  • cds_db (polars.LazyFrame) – LazyFrame containing CDS records with ‘taxon_id’, ‘filename’, and optionally ‘gene_tag’ columns.

  • fetch_taxa_auto (bool) – If True, uses scientific names (NCBI) or generates generic names (Bakta) to use as taxon names instead of the filenames. If False, the default filenames will be kept, unless a rename file was supplied.

  • fetch_taxa_file (Path | None) – Path to the rename file with the taxon names to replace the current ones sourced from the filenames. Defaults to None.

  • parsing_mode (str) – The format mode used for parsing (‘ncbi-gff’, ‘ncbi-package’, ‘bakta-gff’, or ‘tsv’).

  • batch_size (int) – Number of taxon names to fetch in one batch. Defaults to 250.

  • max_attempts (int) – Maximum number of times to attempt fetching the taxon names using Entrez. Defaults to 5.

Returns:

The input LazyFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.

Return type:

cds_db (polars.LazyFrame)

Mutates:

cds_db (polars.LazyFrame): The input LazyFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.

Note

This function removes the temporary column ‘filename’ if present, as it may have been introduced when parsing NCBI GFF files. This function contains a potential local partial materialisation of the LazyFrame. This triggers all files to be parsed.
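The role of `batch_size` can be illustrated with a small batching helper: taxon IDs are split into chunks of at most `batch_size` items so that each chunk can be sent in one request. The helper name is hypothetical, and the sketch shows only the chunking, not the Entrez calls themselves.

```python
def split_into_batches(items: list, batch_size: int = 250) -> list[list]:
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 600 made-up taxon IDs become batches of 250, 250, and 100.
taxon_ids = [str(n) for n in range(600)]
batches = split_into_batches(taxon_ids, batch_size=250)
batch_sizes = [len(batch) for batch in batches]
```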

cfoldseeker.build_cds_db.setup_logging(verbosity: int) None[source]

Set up the root logger if it has not been set up yet.

Parameters:

verbosity (int) – Verbosity level (choices: 0,1,2,3,4).

Returns:

None

export_sequences