cfoldseeker
main
- cfoldseeker.main.create_parser() ArgumentParser[source]
This function creates a parser object that will collect the arguments given through the command line.
- Parameters:
None
- Returns:
An ArgumentParser object holding the CLI ready to collect the arguments when called
- Return type:
parser (argparse.ArgumentParser)
- cfoldseeker.main.init_search(parsed_args)[source]
Initialise the correct search class and pass it the necessary arguments.
- Parameters:
parsed_args (dict) – nested dictionary holding the arguments as parsed by parse_and_validate_arguments
- Returns:
A Search workflow object ready to run
- Return type:
the_run (RemoteSearch | LocalSearch | LocalClusteredSearch)
- cfoldseeker.main.main()[source]
Main entry point of cfoldseeker.
Oversees the complete workflow: parses command-line arguments, sets up the logger and the run, and calls the workflow.
- cfoldseeker.main.parse_and_validate_arguments(args: Namespace, skip_context_table_check: bool = False) dict[source]
This function validates the parsed arguments given through the command line.
- Parameters:
args (argparse.Namespace) – A Namespace object with parsed CLI arguments
skip_context_table_check (bool) – Skip argument validation for intermediary inputs and outputs in the csuite workflows. For compatibility with the csuite validation checker.
- Returns:
A dictionary holding the parsed and validated argument values.
- Return type:
parsed_args (dict)
- Raises:
ValueError – if an invalid argument value was given.
classes
- class cfoldseeker.classes.Cluster(hits, number=0)[source]
Bases:
object
Represents a gene cluster containing one or more protein hits.
A cluster groups proximal hits on the same genomic scaffold that meet specified clustering criteria. All hits in a cluster are expected to share the same scaffold and taxon.
- number
Cluster identifier/rank number.
- Type:
int
- score
Cumulative score of all hits in the cluster.
- Type:
int
- start
Minimum genomic coordinate across all hits.
- Type:
int
- end
Maximum genomic coordinate across all hits.
- Type:
int
- length
Total length in base pairs of all exons across hits.
- Type:
int
- scaff
Scaffold/contig ID (taken from first hit).
- Type:
str
- taxon_id
Taxonomic ID (taken from first hit).
- Type:
str
- taxon_name
Taxonomic name (taken from first hit).
- Type:
str
- filelabel
Filelabel of local sequence file (taken from first hit).
- Type:
str
- class cfoldseeker.classes.Hit(db_id, query, crossref_id=[], crossref_method='', name='', taxon_name='', taxon_id=0, db='', filelabel='', evalue=1, score=0, seqid=0, qcov=0, tcov=0, scaff='', coords=[], strand='')[source]
Bases:
object
Represents a single protein structure hit from a FoldSeek search.
This class encapsulates information about a homologous protein structure match, including its database identifiers, sequence similarity metrics, genomic location, and taxonomic data.
- query
ID of the homologous query protein.
- Type:
str
- db_id
ID of the hit in its structure database.
- Type:
str
- db
Structure database the hit was found in.
- Type:
str
- crossref_id
ID used for cross-referencing (either KEGG or GenPept ID).
- Type:
list
- crossref_method
Method used for cross-referencing (either KEGG, GenPept, WGS-GenPept, or local).
- Type:
str
- name
Annotation or description of the hit.
- Type:
str
- taxon_name
Name of the taxon in which this hit was found.
- Type:
str
- taxon_id
Identifier of the taxon in which this hit was found.
- Type:
int
- evalue
E-value of the FoldSeek hit.
- Type:
float
- score
FoldSeek alignment score.
- Type:
int
- seqid
Sequence identity percentage with the query protein.
- Type:
float
- qcov
Query coverage percentage.
- Type:
float
- tcov
Target coverage percentage.
- Type:
float
- scaff
RefSeq or GenBank ID of the scaffold encoding the hit.
- Type:
str
- coords
Genomic coordinates of the encoding gene’s exons.
- Type:
list
- strand
DNA strand the encoding gene is located on (‘+’ or ‘-’).
- Type:
str
- as_dict() dict[source]
Convert the Hit object to a dictionary.
- Returns:
Dictionary with all Hit attributes; coordinates are formatted as double-dot-separated exon pairs joined by commas.
- Return type:
dict
- end() int | None[source]
Return the end coordinate of the last exon.
- Returns:
Maximum genomic coordinate across all exons, or None if no coordinates are defined.
- Return type:
int | None
- intergenic_distance(other_hit: Hit) int[source]
Calculate the intergenic distance between this hit and another hit.
For genes on the same scaffold, computes the distance between the end of the upstream gene and the start of the downstream gene. If the genes overlap, returns the negative of the length of the overlapping region.
- Parameters:
other_hit (Hit) – The other Hit object to measure distance to.
- Returns:
Intergenic distance in base pairs (positive for gaps, negative for overlaps). Returns the negative of the length of the smaller gene in case of a full overlap.
- Return type:
int
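The distance rule described above can be sketched as follows. This is a minimal illustration, not the cfoldseeker implementation: the function works on plain start/end integers instead of Hit objects, and the gap convention (bases strictly between the two genes) is an assumption.

```python
def intergenic_distance(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Gap (positive) or overlap (negative) in bp between two genes on one scaffold.

    Hypothetical stand-alone helper; the real method takes another Hit object.
    """
    # Order the two genes so that the first one is the upstream gene.
    (up_start, up_end), (down_start, down_end) = sorted(
        [(start_a, end_a), (start_b, end_b)]
    )
    if down_start > up_end:
        # Assumed convention: count the bases strictly between the genes.
        return down_start - up_end - 1
    # Overlap: shared bases end where the first of the two genes ends.
    # For a fully contained gene this equals the span of the smaller gene.
    overlap = min(up_end, down_end) - down_start + 1
    return -overlap
```

Because the genes are sorted first, the result does not depend on which hit is passed as the first argument.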
- length() int[source]
Return the total length in base pairs of all exons.
- Returns:
Sum of lengths across all exons, calculated as (end - start + 1) for each exon.
- Return type:
int
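As an illustration of the coordinate handling described for as_dict(), end() and length(), the sketch below works on a bare list of [start, end] exon pairs. The helper names are hypothetical and chosen for this example only; the Hit class exposes the behaviour as methods.

```python
from __future__ import annotations


def exon_length(coords: list[list[int]]) -> int:
    """Sum of (end - start + 1) over all exons."""
    return sum(end - start + 1 for start, end in coords)


def last_end(coords: list[list[int]]) -> int | None:
    """Largest end coordinate, or None when no coordinates are defined."""
    return max(end for _, end in coords) if coords else None


def format_coords(coords: list[list[int]]) -> str:
    """Double-dot-separated exon pairs joined by commas."""
    return ",".join(f"{start}..{end}" for start, end in coords)


exons = [[10, 50], [150, 200]]
print(exon_length(exons), last_end(exons), format_coords(exons))
```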
- same_location(other_hit: Hit) bool[source]
Check if two hits are at exactly the same genomic coordinates.
- Parameters:
other_hit (Hit) – The other Hit object to compare.
- Returns:
True if both hits are on the same scaffold and their genomic coordinates completely overlap, False otherwise.
- Return type:
bool
- class cfoldseeker.classes.Search(query, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]
Bases:
ABC
Abstract base class for protein structure searches with cluster identification.
This class manages a FoldSeek-based search workflow (remote or local), including result parsing, cluster identification using graph-based algorithms, and output generation. Subclasses must implement abstract methods for search execution and result parsing. Methods for the cluster identification and output generation are implemented here and shared by all subclasses.
- query
List of query protein structure file paths.
- Type:
list
- params
Search configuration parameters (e.g., max gap, min hits).
- Type:
dict
- output_flags
Outputs to be generated.
- Type:
dict
- OUTPUT_DIR
Directory for output files.
- Type:
Path
- TEMP_DIR
Directory for temporary files.
- Type:
Path
- generate_cblaster_session() Session[source]
Generate a cblaster Session object from the identified clusters.
Constructs a cblaster-compatible session containing all search results, organised in the same hierarchy as cblaster (by organism, scaffold, and cluster). This object can be saved and reloaded for interactive visualisation and analysis outside of cfoldseeker.
- Returns:
Session holding all information about the identified clusters.
- Return type:
Session (cblaster.Session)
- generate_output()[source]
Generate the requested output files for this search.
Checks which outputs are requested from the parsed output flags, and generates what is necessary using the appropriate methods.
- Parameters:
None
- Returns:
None
- generate_tables(output_folder: Path) None[source]
Save hit and cluster lists as tab-separated value (TSV) tables.
Generates two output files:
- hits.tsv: Table of all hits with their properties.
- clusters.tsv: Table of all clusters with their properties.
- Parameters:
output_folder (Path) – Directory where output tables will be written.
- identify_clusters() None[source]
Identify gene clusters among the hits based on clustering criteria.
This method groups hits by scaffold, calculates intergenic distances, filters based on maximum gap and minimum hit thresholds, and uses a directed graph to identify chains of unique proximal hits. It then applies additional filters for cluster size, query coverage, and length before ranking clusters by score.
The method populates self.clusters with identified Cluster objects and updates self.hits to contain only hits in identified clusters.
- Raises:
RuntimeError – If no hit groups pass the distance criteria.
RuntimeError – If no cluster could be identified among the hit groups.
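The grouping idea can be illustrated with a much simpler technique than the directed-graph approach described above: sort the hits per scaffold and chain neighbours whose gap stays within the threshold. The sketch below is that simplified sort-and-chain variant, not the cfoldseeker algorithm. Hits are plain dictionaries with assumed keys, only the max gap and min hits filters are applied, and an empty list is returned where the real method raises RuntimeError.

```python
from collections import defaultdict


def chain_hits(hits: list, max_gap: int, min_hits: int) -> list:
    """Group hits per scaffold into chains of proximal hits, ranked by score.

    Each hit is a dict with the assumed keys 'scaff', 'start', 'end', 'score'.
    """
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit["scaff"]].append(hit)

    clusters = []
    for scaffold_hits in by_scaffold.values():
        scaffold_hits.sort(key=lambda h: h["start"])
        current = [scaffold_hits[0]]
        for hit in scaffold_hits[1:]:
            # Measure the gap from the furthest end seen in the current chain.
            furthest_end = max(h["end"] for h in current)
            if hit["start"] - furthest_end - 1 <= max_gap:
                current.append(hit)
            else:
                clusters.append(current)
                current = [hit]
        clusters.append(current)

    # Keep chains with enough hits and rank them by cumulative score.
    clusters = [c for c in clusters if len(c) >= min_hits]
    clusters.sort(key=lambda c: sum(h["score"] for h in c), reverse=True)
    return clusters
```

A graph-based approach, as used by the method itself, can additionally enforce that each query is represented by a unique hit within a chain, which this sketch does not attempt.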
- identify_clusters_from_groups(close_groups: list, max_length: int, min_hits: int, min_covered_queries: int, require: set[str], all_layouts: bool)[source]
Identify clusters from the proximal hit groups.
Constructs Cluster objects from hit groups that pass all cluster identification thresholds (max cluster length, minimum no. hits, minimum no. covered queries, required queries). Can also return all cluster layouts that fit the cluster identification thresholds with a less-than-best score.
- abstractmethod identify_hits()[source]
Parse FoldSeek output and populate the hits list.
Note
This method must be implemented by subclasses to convert raw FoldSeek results from the webserver or a local command call into a list of Hit objects.
- identify_proximal_groups(max_gap: int) list[list[Hit]][source]
Identify proximal groups among the hits.
Calculates the distance between all genes on the same scaffold, discards self-hits and hit pairs that fail the intergenic distance threshold.
- Returns:
Hit pairs of proximal hits that pass the intergenic distance threshold.
- Return type:
close_groups (list(list[Hit]))
- Raises:
RuntimeError – If there are no hit groups passing the intergenic distance criteria.
remote
- class cfoldseeker.remote.RemoteSearch(query, mapping_table_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]
Bases:
Search
Subclass executing the workflow for gene cluster identification from remote protein searches.
Extends the Search base class to perform FoldSeek-based searches against remote databases, parse results, and cross-reference hits to genomic data. Uses a local copy of the UniProt mapping table to retrieve genomic coordinates for each protein with fallback strategies using KEGG and ENA.
- New attributes:
- mapping_table: Polars LazyFrame containing UniProt ID cross-references to various databases (KEGG, EMBL-CDS, etc.).
- Inherits from:
Search: Base class providing the cluster identification and output generation capabilities.
See also
LocalSearch: Sister class providing the search and parsing capabilities for local database searches
- crossref_afdb() list[source]
Cross-reference AFDB hits and retrieve genomic neighbourhood information.
Systematically cross-references AlphaFold DB (AFDB) hits to genomic data through three methods: KEGG IDs from KEGG, GenPept IDs from ENA, and WGS-GenPept IDs from UniSave. For each hit, extracts scaffold IDs, CDS coordinates, and strand information. Updates scaffold IDs to their latest versions using NCBI Entrez.
- Returns:
A list of Hit objects with populated genomic information (scaffold, coordinates, and strand data).
- crossref_afdb_via_genpept(afdb_hits: list[Hit]) list[Hit][source]
Cross-reference AFDB hits to GenPept IDs in ENA using the UniProt mapping table
Retrieves the GenPept IDs in ENA for all hits and updates the crossref ID and method if any.
- Parameters:
afdb_hits – List of Hit objects without genomic information and cross-references
- Mutates:
Hit objects in afdb_hits: Fills cross-reference ID and method attributes
- Returns:
list of Hits that have no cross-reference to GenPept in the UniProt mapping table. These need to be processed by a different cross-referencing method.
- Return type:
hits_failed_legg (list[Hit])
- crossref_afdb_via_kegg(afdb_hits: list[Hit]) list[Hit][source]
Cross-reference AFDB hits to KEGG IDs using the UniProt mapping table
Retrieves the KEGG IDs for all hits and updates the crossref ID and method if any.
- Parameters:
afdb_hits – List of Hit objects without genomic information and cross-references
- Mutates:
Hit objects in afdb_hits: Fills cross-reference ID and method attributes
- Returns:
list of Hits that have no cross-reference to KEGG in the UniProt mapping table. These need to be processed by a different cross-referencing method.
- Return type:
hits_failed_legg (list[Hit])
- crossref_afdb_via_wgs_genpept(afdb_hits: list[Hit]) list[Hit][source]
Cross-reference AFDB hits to WGS-GenPept IDs in UniSave using the UniSave API.
Retrieves the WGS-GenPept IDs in UniSave for all hits and updates the crossref ID and method if any.
- Parameters:
afdb_hits – List of Hit objects without genomic information and cross-references
- Mutates:
Hit objects in afdb_hits: Fills cross-reference ID and method attributes
- Returns:
list of Hits that have no cross-reference to WGS-GenPept in UniSave. These need to be processed by a different cross-referencing method.
- Return type:
hits_failed_legg (list[Hit])
- identify_hits() None[source]
Parse FoldSeek results and create Hit objects for hits passing the predefined criteria.
Extracts hit information from the raw FoldSeek results, and applies filtering thresholds based on e-value, bit score, sequence identity, and coverage. Creates Hit objects for hits meeting all criteria and removes redundant hits found in multiple databases, keeping only the first instance per UniProt ID.
- Returns:
None
- Raises:
RuntimeError – If the hit list is empty after applying the criteria
Note
Logs a warning when there are no hits for a certain query and DB pair.
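The redundancy removal mentioned above (keeping only the first instance per UniProt ID) can be sketched as an order-preserving deduplication. The dictionary key 'uniprot_id' is an assumption made for this example; the actual method operates on Hit objects.

```python
def keep_first_per_id(hits: list) -> list:
    """Drop repeated hits, keeping the first occurrence of each UniProt ID.

    Each hit is a dict with the assumed key 'uniprot_id'.
    """
    seen = set()
    unique = []
    for hit in hits:
        if hit["uniprot_id"] in seen:
            continue  # Same protein already found in an earlier database.
        seen.add(hit["uniprot_id"])
        unique.append(hit)
    return unique
```

Since the input order is preserved, the database that is parsed first determines which copy of a redundant hit survives.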
- passes_criteria(hit: Hit)[source]
Check if a hit passes the criteria set for this search.
- Returns:
True if the hit passes all criteria, False otherwise.
- Return type:
(bool)
- prepare_mapping_dict(all_uniprot_ids: list, db: str) dict[source]
Extract a mapping dictionary from the local UniProt mapping table.
Filters the full UniProt cross-reference LazyFrame to extract IDs for a specific target database, grouping multiple cross-references per UniProt ID into a dictionary of lists.
- Parameters:
all_uniprot_ids – List of UniProt accession numbers to extract.
db – Target database name (e.g., ‘KEGG’, ‘EMBL-CDS’).
- Returns:
A dictionary mapping UniProt IDs (keys) to lists of cross-reference IDs in the target database (values).
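The grouping step described above can be sketched without Polars. The example assumes the mapping table is available as an iterable of (UniProt accession, database name, cross-reference ID) triples; the function name and this row layout are illustrative assumptions, not the layout of the actual LazyFrame.

```python
from collections import defaultdict


def build_mapping(rows, wanted_ids, db: str) -> dict:
    """Map each wanted UniProt ID to the list of its cross-references in `db`.

    `rows` yields (uniprot_id, id_type, xref_id) triples (assumed layout).
    """
    wanted = set(wanted_ids)
    mapping = defaultdict(list)
    for uniprot_id, id_type, xref_id in rows:
        if id_type == db and uniprot_id in wanted:
            mapping[uniprot_id].append(xref_id)
    # UniProt IDs without a cross-reference in `db` are simply absent.
    return dict(mapping)
```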
- pull_and_parse_genpept_records(afdb_hits: list[Hit]) list[Hit][source]
Pull and parse the (WGS-)GenPept records associated with an AFDB hit. Update the hit attributes.
Pulls the (WGS-)GenPept records associated with an AFDB hit. It immediately extracts the scaffold and strand information, and the genomic coordinates from the record and updates the hit’s attributes accordingly.
- Parameters:
afdb_hits – List of Hit objects with cross-references
- Mutates:
Hit objects in afdb_hits: Fills scaffold, strand and coordinates attributes
- Returns:
Hit objects with updated genomic location attributes retrieved from this cross-referencing method.
- Return type:
processed_hits (list[Hit])
- pull_and_parse_kegg_records(afdb_hits: list[Hit]) list[Hit][source]
Pull and parse the KEGG records associated with an AFDB hit. Update the hit attributes.
Pulls the KEGG records associated with an AFDB hit. It immediately extracts the scaffold and strand information, and the genomic coordinates from the record and updates the hit’s attributes accordingly.
- Parameters:
afdb_hits – List of Hit objects with cross-references
- Mutates:
Hit objects in afdb_hits: Fills scaffold, strand and coordinates attributes
- Returns:
Hit objects with updated genomic location attributes retrieved from this cross-referencing method.
- Return type:
processed_hits (list[Hit])
- run() None[source]
Execute the complete remote search workflow.
Orchestrates all processing steps in sequence: running FoldSeek remotely via the webserver, parsing the results, cross-referencing using different sources, pulling the cross-referenced records, parsing the genomic neighbourhood coordinates from these records, and identifying the gene clusters.
- Returns:
None
- run_foldseek() None[source]
Submit protein queries to FoldSeek and retrieve the results.
Sends all query structures to the FoldSeek webserver in parallel, collects submission tickets, monitors job status, and downloads completed results. Stores raw JSON results in the temporary folder.
- Returns:
None
- update_version_digits(processed_hits: list[Hit], max_attempts: int = 3) list[Hit][source]
Update the version digits of the scaffold IDs of every hit.
Adds or updates the version digit of the scaffold ID of every hit. Retrieves the latest scaffold ID from the NCBI Entrez API with retry logic, and updates the hit’s scaffold IDs accordingly.
- Parameters:
processed_hits (list[Hit]) – hits with filled cross-reference and genomic location attributes
max_attempts (int) – Maximum number of times to try getting the most recent version digits from Entrez.
- Mutates:
Hit objects in processed_hits: Updates the scaffold attribute with the most recent version digit.
- Returns:
hits with an updated version digit in the scaffold attribute of every hit.
- Return type:
processed_hits (list[Hit])
local
- class cfoldseeker.local.LocalSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]
Bases:
Search
Subclass executing the workflow for gene cluster identification from local protein searches using FoldSeek.
Extends the Search base class to perform searches against local FoldSeek databases. Handles FoldSeek execution, result parsing, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds.
- db_path
Path to the FoldSeek protein structure target database.
- Type:
Path
- coord_db
LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.
- Type:
polars.LazyFrame
- collect_hits(results: DataFrame) None[source]
Collects hit instances from a filtered hit table.
Collects and instantiates Hit objects for every hit in the filtered table, after fetching genomic context data from the context database. Genomic coordinate strings are parsed on-the-fly.
- Parameters:
results (polars.DataFrame) – A filtered FoldSeek hits table
- Returns:
None
- Mutates:
self.hits: Instantiates the list of identified Hit objects.
Note
Stores generated Hit objects in self.hits as a list. Genomic coordinates are parsed from a comma-separated string of joined range pairs (e.g., “10..50”, “join(150..200,250..300)”) into nested lists of integers.
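A minimal sketch of the coordinate parsing described in the note, assuming the strings contain only start..end ranges, optionally wrapped in join(...). The regular expression and function name are illustrative; the parser used by collect_hits may differ.

```python
import re

# One "start..end" range, e.g. "150..200".
_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_coords(coord_string: str) -> list:
    """Turn '10..50' or 'join(150..200,250..300)' into nested integer lists."""
    return [[int(start), int(end)] for start, end in _RANGE.findall(coord_string)]
```

Scanning for ranges rather than splitting on commas means the surrounding join(...) wrapper needs no special handling.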
- identify_hits() None[source]
Identify hits passing the hit thresholds from the FoldSeek results.
Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.
- Returns:
None
- parse_foldseek_results() DataFrame[source]
Parse the FoldSeek result table, expand it to include all members of the sequence clusters, and generate Hit objects with filled genomic coordinates.
Reads the FoldSeek result table, expands it with all members of each sequence cluster of which the original FoldSeek hits are the representatives (by joining with the clustering table), applies filtering thresholds (bit score, query coverage, target coverage), removes duplicate hits, and joins results with the CDS coordinates database. Parses genomic coordinates from the coordinate string and creates Hit objects for each match.
The following filtering thresholds are applied:
1. Sequence identity >= min_seqid
2. E-value <= max_eval
3. Bit score >= min_score
4. Query coverage >= min_qcov (converted to percentage)
5. Target coverage >= min_tcov (converted to percentage)
- Returns:
A filtered FoldSeek hits table
- Return type:
results (polars.DataFrame)
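The five thresholds can be expressed as a single predicate. The sketch below uses plain dictionaries instead of a Polars table and assumes that row values and thresholds are already in the same units, so the percentage conversion mentioned above is left out. The row keys are assumptions for illustration.

```python
def passes_thresholds(row: dict, params: dict) -> bool:
    """Return True when a result row meets all five hit-level thresholds."""
    return (
        row["seqid"] >= params["min_seqid"]
        and row["evalue"] <= params["max_eval"]
        and row["score"] >= params["min_score"]
        and row["qcov"] >= params["min_qcov"]
        and row["tcov"] >= params["min_tcov"]
    )


def filter_rows(rows: list, params: dict) -> list:
    """Keep only the rows passing every threshold, preserving their order."""
    return [row for row in rows if passes_thresholds(row, params)]
```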
- run() None[source]
Execute the complete local search workflow.
Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.
- Returns:
None
- run_foldseek() None[source]
Execute a FoldSeek search against the local protein structure database.
Constructs and runs a FoldSeek ‘easy-search’ command with all query structures (in CIF format) against the local database. Applies filters for sequence identity and E-value thresholds. Captures stdout and stderr in real-time via separate threads and logs them appropriately.
Exhaustive search (no database prefiltering) has been enabled to retrieve all hits in the target database.
- Returns:
None
- Raises:
RuntimeError – If FoldSeek returns a non-zero exit code.
Note
FoldSeek output is written to a temporary TSV file in TEMP_DIR.
local_clustered
- class cfoldseeker.local_clustered.LocalClusteredSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'), seq_clust_tsv=PosixPath('.'))[source]
Bases:
LocalSearch
Subclass executing the workflow for gene cluster identification from local protein searches in a sequence-preclustered database using FoldSeek.
Extends the LocalSearch base class to expand identified hit sets with cluster members from a premade sequence clustering of the target database. In essence, it runs a LocalSearch against a FoldSeek structure database of sequence cluster representatives and adds the members of a sequence cluster if their representative was identified as a valid hit, before continuing with cross-referencing and gene cluster identification.
Handles FoldSeek execution, result parsing, adding sequence cluster members, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds, and a TSV with the sequence cluster members made beforehand with MMseqs2.
- db_path
Path to the FoldSeek protein structure target database.
- Type:
Path
- coord_db
LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.
- Type:
polars.LazyFrame
- seq_clust
LazyFrame containing the sequence cluster representative of every sequence in the target database. These representatives then form the FoldSeek target structure database.
- Type:
polars.LazyFrame
- expand_sequence_clusters(results: DataFrame) DataFrame[source]
Expand a given FoldSeek result table with all members of the original hits’ sequence clusters.
Includes all members of the sequence clusters that the original FoldSeek hits represent, by joining the FoldSeek result table with the MMseqs2 clustering table.
Drops duplicate protein/query pairs.
Hit metadata are taken over from the representative protein.
- Parameters:
results (polars.DataFrame) – Original FoldSeek result table with only the representative proteins
- Returns:
Result table with all added non-representative proteins for each representative
- Return type:
expanded_results (polars.DataFrame)
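The expansion can be pictured as a join between the result rows and a two-column (representative, member) clustering table. The sketch below does this with plain Python structures rather than Polars; the row keys 'query' and 'target' are assumptions made for the example.

```python
def expand_results(results: list, cluster_pairs: list) -> list:
    """Add one row per cluster member for every representative hit.

    results: dicts with the assumed keys 'query' and 'target'.
    cluster_pairs: (representative, member) tuples.
    """
    members = {}
    for representative, member in cluster_pairs:
        members.setdefault(representative, []).append(member)

    expanded = []
    seen = set()
    for row in results:
        # A target missing from the clustering table is kept as it is.
        for member in members.get(row["target"], [row["target"]]):
            key = (member, row["query"])
            if key in seen:
                continue  # Drop duplicate protein/query pairs.
            seen.add(key)
            # All other metadata is copied from the representative's row.
            expanded.append({**row, "target": member})
    return expanded
```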
- identify_hits() None[source]
Identify hits passing the hit thresholds from the FoldSeek results.
Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.
- Returns:
None
- Mutates:
self.hits: Instantiates the list of identified Hit objects.
- run() None[source]
Execute the complete local search workflow.
Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.
- Returns:
None
communication
- cfoldseeker.communication.check_query_status(job_id: str) str[source]
Retrieves the current status of a FoldSeek job.
Queries the FoldSeek API to check the processing status of a previously submitted job using its unique job ID.
- Parameters:
job_id – The unique identifier for the FoldSeek job.
- Returns:
A string indicating the job status (e.g., “COMPLETE”, “RUNNING”, etc.).
- cfoldseeker.communication.pull_dict_from_unisave(entries: list, max_workers: int = 1, no_progress: bool = False) dict[source]
Retrieves multiple UniSave records and returns them as a dictionary.
Fetches a list of UniSave entries in parallel and returns them mapped to their original accession numbers. Failed retrievals are filtered out.
- Parameters:
entries – List of UniProt accession numbers to retrieve.
max_workers – Number of worker threads for parallel retrieval. Defaults to 1.
no_progress – If True, suppresses the progress bar during retrieval. Defaults to False.
- Returns:
A dictionary mapping each successfully retrieved accession number to its corresponding UniSave record as a string. Failed retrievals are excluded from the dictionary.
- cfoldseeker.communication.pull_from_ena(entry: str, max_retries: int = 3) None | str[source]
Retrieves a GenPept record from the ENA Browser API.
Attempts to fetch a GenPept sequence record from the European Nucleotide Archive (ENA) with retry logic for rate-limited responses.
- Parameters:
entry – The accession number or identifier of the GenPept record to retrieve.
max_retries – Maximum number of retry attempts for rate-limited requests (error code 429). Defaults to 3. The waiting time between attempts is 5 seconds.
- Returns:
A string containing the GenPept record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.
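The retry behaviour described above can be sketched independently of any HTTP library. In this illustration the network call is injected as a hypothetical `fetch` callable returning a (status code, text) pair, and `max_retries` is interpreted as the total number of attempts; both are assumptions, not details taken from cfoldseeker.

```python
import time


def fetch_with_retries(fetch, entry, max_retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Return the record text, or None after repeated 429s or any other failure.

    `fetch` is a hypothetical callable: fetch(entry) -> (status_code, text).
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    for _ in range(max_retries):
        try:
            status, text = fetch(entry)
        except Exception:
            return None  # Unexpected error: give up without retrying.
        if status == 200:
            return text
        if status == 429:
            sleep(wait_seconds)  # Rate limited: wait, then try again.
            continue
        return None  # Any other status is treated as a failure.
    return None
```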
- cfoldseeker.communication.pull_from_unisave(entry: str, max_retries: int = 3) None | str[source]
Retrieves a UniSave record from the UniProt REST API.
Fetches a protein sequence record from UniSave (UniProt archive) with retry logic for rate-limited responses.
- Parameters:
entry – The UniProt accession number of the record to retrieve.
max_retries – Maximum number of retry attempts for rate-limited requests (429). Defaults to 3. The waiting time between attempts is 5 seconds.
- Returns:
A string containing the UniSave record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.
- cfoldseeker.communication.retrieve_foldseek_results(job_id: str) dict[source]
Waits for a FoldSeek job to complete and retrieves its results.
Polls the job status at regular intervals until completion, then downloads and returns the parsed results from the FoldSeek API.
- Parameters:
job_id – The unique identifier for the FoldSeek job.
- Returns:
A dictionary containing the parsed results from the completed FoldSeek job.
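The polling loop described above might look like the following sketch. The status lookup is injected as a hypothetical `get_status` callable; the "ERROR" status, the polling interval and the poll limit are assumptions added for the example.

```python
import time


def wait_for_completion(get_status, job_id, interval=1.0, max_polls=100, sleep=time.sleep):
    """Poll a job until it reports 'COMPLETE'.

    `get_status` is a hypothetical callable: get_status(job_id) -> str.
    """
    for _ in range(max_polls):
        status = get_status(job_id)
        if status == "COMPLETE":
            return status
        if status == "ERROR":  # Assumed failure status.
            raise RuntimeError(f"Job {job_id} failed")
        sleep(interval)  # Still pending or running: wait and poll again.
    raise TimeoutError(f"Job {job_id} did not complete after {max_polls} polls")
```

Once the loop returns, the results can be downloaded and parsed in a separate step.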
- cfoldseeker.communication.submit_foldseek_query(query_path: Path, dbs: list, taxfilters: list, max_attempts: int = 3) dict[source]
Submits a structure file to the FoldSeek API for processing.
Sends a protein structure query to the FoldSeek webserver with specified databases and taxonomic filters. Returns the submission ticket on success or raises an error on failure with maximum attempt logic.
- Parameters:
query_path (Path) – Path object pointing to the structure file to submit.
dbs (list) – List of database names to search against.
taxfilters (list) – List of taxonomic filters to apply to the search.
max_attempts (int) – Maximum number of submission attempts
- Returns:
A dictionary containing the submission ticket and metadata from the FoldSeek API response.
- Raises:
RuntimeError – If submission fails too many times.
remote_parsers
- cfoldseeker.remote_parsers.extract_genomic_information_ena(record: str) dict[source]
Extracts the genomic information from a pulled ENA GenPept record.
build_cds_db
- cfoldseeker.build_cds_db.check_duplicate_contigs(cds_db: LazyFrame, parsing_mode: str) LazyFrame[source]
Check for and attempt to fix duplicate contig labels per taxon.
Detects cases where the same contig label appears in multiple taxa. For Bakta GFF files, attempts to prepend the existing locus tag prefix to make contigs unique. For other formats, exits with an error.
- Parameters:
cds_db (polars.LazyFrame) – A dataframe containing CDS records with ‘contig’, ‘taxon_id’, and ‘gene_tag’ columns.
parsing_mode (str) – The format mode used for parsing (‘bakta-gff’, ‘ncbi-gff’, ‘ncbi-package’, or ‘tsv’).
- Returns:
The input DataFrame with modified contig labels if a fix was applied (Bakta mode only).
- Return type:
cds_db (polars.LazyFrame)
- Mutates:
cds_db (polars.LazyFrame): The input DataFrame with modified contig labels if a fix was applied (Bakta mode only).
- Raises:
RuntimeError – If duplicate contigs are detected and cannot be fixed, or if fix attempt fails.
Note
This function contains a potential local partial materialisation of the LazyFrame. This triggers all files to be parsed.
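The detection step, finding contig labels that occur in more than one taxon, can be sketched on plain (contig, taxon ID) pairs. The function name is hypothetical and the Bakta-specific fix is not shown.

```python
from collections import defaultdict


def find_shared_contigs(records) -> list:
    """Return the contig labels that appear under more than one taxon ID.

    `records` yields (contig, taxon_id) pairs, one per CDS record.
    """
    taxa_per_contig = defaultdict(set)
    for contig, taxon_id in records:
        taxa_per_contig[contig].add(taxon_id)
    return sorted(
        contig for contig, taxa in taxa_per_contig.items() if len(taxa) > 1
    )
```

An empty result means every contig label belongs to a single taxon, so no fix is needed.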
- cfoldseeker.build_cds_db.create_parser() ArgumentParser[source]
This function creates a parser object that will collect the arguments given through the command line.
- Parameters:
None
- Returns:
An ArgumentParser object holding the CLI ready to collect the arguments when called
- Return type:
parser (ArgumentParser)
- cfoldseeker.build_cds_db.fetch_taxon_names(taxon_ids: list, max_attempts: int = 5) list[source]
Fetch NCBI taxon names for a given list of NCBI taxon IDs.
Fetches summary objects for a given list of NCBI Taxonomy IDs using BioPython’s Entrez API with retry logic, and extracts the taxon name from each summary.
- Parameters:
taxon_ids (list) – List of NCBI taxon IDs as strings.
max_attempts (int) – Maximum number of fetch attempts in case of a failure. Defaults to 5.
- Returns:
List of NCBI taxon names
- Return type:
taxon_names (list)
- cfoldseeker.build_cds_db.main()[source]
Main entry point for the CDS database construction tool.
Oversees the complete workflow: parses command-line arguments, and calls the workflow.
- cfoldseeker.build_cds_db.parse_and_validate_arguments(args: Namespace) dict[source]
This function parses and validates the arguments received through the command line.
- Parameters:
args (argparse.Namespace) – A Namespace holder object with the parsed argument values
- Returns:
A dictionary holding the parsed and validated argument values.
- Return type:
parsed_args (dict)
- Raises:
ValueError – if an invalid argument value was given.
- cfoldseeker.build_cds_db.parse_files(input_path: Path, parsing_mode: str, n_workers: int = 1, no_progress: bool = False) LazyFrame[source]
Parses all input files and constructs a draft CDS coordinates database.
Dispatches file parsing based on the specified parsing mode, using parallel processing to handle multiple files efficiently. Concatenates all parsed results into a single DataFrame.
- Parameters:
input_path (Path) – Path to the folder containing input files.
parsing_mode (str) – File format mode - one of: ‘ncbi-gff’, ‘ncbi-package’, ‘bakta-gff’, or ‘tsv’.
n_workers (int) – Number of worker threads for parallel file parsing. Defaults to 1.
no_progress (bool) – If True, suppresses the progress bar during parsing. Defaults to False.
- Returns:
A Polars LazyFrame containing concatenated CDS records from all input files with columns: gene_tag, name, contig, coords, strand, taxon_id, and filename (or taxon_name for TSV formats).
- cfoldseeker.build_cds_db.run_workflow(parsed_args: dict) None[source]
Execute the complete CDS database construction workflow.
Loads and parses input files, validates contig uniqueness, assigns taxon labels, and writes the final CDS coordinates database to disk as a tab-separated file. Supports optional gzip compression.
- Parameters:
parsed_args (dict) – A dictionary holding the parsed and validated argument values.
- Returns:
None
Note
This workflow uses a lazy parsing method to limit RAM usage. This comes at the cost of the files being parsed multiple times, which is slower.
- cfoldseeker.build_cds_db.set_taxon_labels(cds_db: LazyFrame, fetch_taxa_auto: bool, fetch_taxa_file: Path, parsing_mode: str, batch_size: int = 250, max_attempts: int = 5) LazyFrame[source]
Set taxon labels as either scientific names or filenames.
For NCBI files, optionally fetches the scientific names from NCBI Taxonomy via BioPython’s NCBI Entrez API with retry logic. For Bakta GFF files, generates generic labels or uses filenames. For TSV files, preserves user-provided annotations.
- Parameters:
cds_db (polars LazyFrame) – Dataframe containing CDS records with ‘taxon_id’, ‘filename’, and optionally ‘gene_tag’ columns.
fetch_taxa_auto (bool) – If True, uses scientific names (NCBI) or generates generic names (Bakta) to use as taxon names instead of the filenames. If False, the default filenames will be kept, unless a rename file was supplied.
fetch_taxa_file (Path | None) – Path to the rename file with the taxon names to replace the current ones sourced from the filenames. Defaults to None.
parsing_mode (str) – The format mode used for parsing (‘ncbi-gff’, ‘ncbi-package’, ‘bakta-gff’, or ‘tsv’).
batch_size (int) – Number of taxon names to fetch in one batch. Defaults to 250.
max_attempts (int) – Maximum number of attempts to fetch the taxon names using Entrez. Defaults to 5.
- Returns:
The input DataFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.
- Return type:
cds_db (polars.LazyFrame)
- Mutates:
cds_db (polars.LazyFrame): The input DataFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.
Note
This function removes the temporary column ‘filename’ if present, as it may have been introduced when parsing NCBI GFF files. This function contains a potential local partial materialisation of the LazyFrame. This triggers all files to be parsed.