cfoldseeker

main

cfoldseeker.main.create_parser() → ArgumentParser[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:: None
Returns:: An ArgumentParser object holding the CLI ready to collect the arguments when called
Return type:: parser (argparse.ArgumentParser)

cfoldseeker.main.init_search(parsed_args)[source]

Initialise the correct search class and pass the necessary arguments on to it.

Parameters:: parsed_args (dict) – nested dictionary holding the arguments as parsed by parse_and_validate_arguments
Returns:: A Search workflow object ready to run
Return type:: the_run (RemoteSearch | LocalSearch | LocalClusteredSearch)

cfoldseeker.main.main()[source]

Main entry point of cfoldseeker.

Oversees the complete workflow: parses command-line arguments, sets up the logger and the run, and calls the workflow.

cfoldseeker.main.parse_and_validate_arguments(args: Namespace, skip_context_table_check: bool = False) → dict[source]

This function validates the parsed arguments given through the command line.

Parameters:

args (argparse.Namespace) – A Namespace object with parsed CLI arguments
skip_context_table_check (bool) – Skip argument validation for intermediary inputs and outputs in the csuite workflows. For compatibility with the csuite validation checker.

Returns:

A dictionary holding the parsed and validated argument values.

Return type:

parsed_args (dict)

Raises:

ValueError – if an invalid argument value was given.

cfoldseeker.main.run_workflow(parsed_args: dict) → None[source]

Execute the complete cfoldseeker workflow.

Initialises the appropriate Search instance, executes it, generates the output, and cleans up the temporary files.

Returns:: None

cfoldseeker.main.setup_logging(verbosity: int) → None[source]

Set up the root logger if it has not been set up yet.

Parameters:: verbosity (int) – Verbosity level (choices: 0,1,2,3,4).
Returns:: None
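
A minimal sketch of the "set up once" behaviour described above. The mapping from the five verbosity choices to logging levels, the format string, and the name setup_logging_sketch are assumptions made for illustration; they are not taken from cfoldseeker's source.

```python
import logging

# Assumed mapping of the five verbosity choices to standard logging levels.
LEVELS = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def setup_logging_sketch(verbosity: int) -> None:
    """Configure the root logger once; later calls leave it untouched."""
    root = logging.getLogger()
    if root.handlers:  # the root logger has already been set up
        return
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root.addHandler(handler)
    root.setLevel(LEVELS[verbosity])
```

Checking for existing handlers is one common way to make the call idempotent, so that several entry points can call it safely.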

classes

class cfoldseeker.classes.Cluster(hits, number=0)[source]

Bases: object

Represents a gene cluster containing one or more protein hits.

A cluster groups proximal hits on the same genomic scaffold that meet specified clustering criteria. All hits in a cluster are expected to share the same scaffold and taxon.

hits

List of Hit objects in the cluster.

Type:: list[Hit]

number

Cluster identifier/rank number.

Type:: int

score

Cumulative score of all hits in the cluster.

Type:: int

start

Minimum genomic coordinate across all hits.

Type:: int

end

Maximum genomic coordinate across all hits.

Type:: int

length

Total length in base pairs of all exons across hits.

Type:: int

scaff

Scaffold/contig ID (taken from first hit).

Type:: str

taxon_id

Taxonomic ID (taken from first hit).

Type:: str

taxon_name

Taxonomic name (taken from first hit).

Type:: str

filelabel

Filelabel of local sequence file (taken from first hit).

Type:: str

as_dict() → dict[source]

Convert the Cluster object to a dictionary.

Returns:

Dictionary with cluster attributes, including comma-separated hit IDs and all genomic coordinates.

Return type:

dict
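
The aggregate attributes listed above (score, start, end, length) can be illustrated with a small sketch. MiniHit is a hypothetical stand-in for Hit, and the dictionary keys are assumptions; the real as_dict() output may be laid out differently.

```python
from dataclasses import dataclass


@dataclass
class MiniHit:
    """Hypothetical stand-in for Hit: an ID, a score and [start, end] exons."""
    db_id: str
    score: int
    coords: list


def summarise_cluster(hits: list) -> dict:
    """Derive cluster-level values from the member hits."""
    exons = [exon for hit in hits for exon in hit.coords]
    return {
        "hits": ",".join(hit.db_id for hit in hits),  # comma-separated hit IDs
        "score": sum(hit.score for hit in hits),  # cumulative score
        "start": min(start for start, _ in exons),  # minimum coordinate
        "end": max(end for _, end in exons),  # maximum coordinate
        "length": sum(end - start + 1 for start, end in exons),  # exon bases
    }


summary = summarise_cluster([
    MiniHit("P1", 120, [[100, 199]]),
    MiniHit("P2", 80, [[300, 349], [400, 449]]),
])
```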

class cfoldseeker.classes.Hit(db_id, query, crossref_id=[], crossref_method='', name='', taxon_name='', taxon_id=0, db='', filelabel='', evalue=1, score=0, seqid=0, qcov=0, tcov=0, scaff='', coords=[], strand='')[source]

Bases: object

Represents a single protein structure hit from a FoldSeek search.

This class encapsulates information about a homologous protein structure match, including its database identifiers, sequence similarity metrics, genomic location, and taxonomic data.

query

ID of the homologous query protein.

Type:: str

db_id

ID of the hit in its structure database.

Type:: str

db

Structure database the hit was found in.

Type:: str

crossref_id

ID used for cross-referencing (either KEGG or GenPept ID).

Type:: list

crossref_method

Method used for cross-referencing (either KEGG, GenPept, WGS-GenPept, or local).

Type:: str

name

Annotation or description of the hit.

Type:: str

taxon_name

Name of the taxon in which this hit was found.

Type:: str

taxon_id

Identifier of the taxon in which this hit was found.

Type:: int

evalue

E-value of the FoldSeek hit.

Type:: float

score

FoldSeek alignment score.

Type:: int

seqid

Sequence identity percentage with the query protein.

Type:: float

qcov

Query coverage percentage.

Type:: float

tcov

Target coverage percentage.

Type:: float

scaff

RefSeq or GenBank ID of the scaffold encoding the hit.

Type:: str

coords

Genomic coordinates of the encoding gene’s exons.

Type:: list

strand

DNA strand the encoding gene is located on (‘+’ or ‘-’).

Type:: str

as_dict() → dict[source]

Convert the Hit object to a dictionary.

Returns:

Dictionary with all Hit attributes; coordinates are formatted as double-dot-separated exon pairs joined by commas.

Return type:

dict
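
As a hedged illustration of the coordinate format mentioned above, the helper below (format_coords, a hypothetical name) turns a nested list of exon pairs into a double-dot-separated, comma-joined string.

```python
def format_coords(coords: list) -> str:
    """Join [start, end] exon pairs as 'start..end', separated by commas."""
    return ",".join(f"{start}..{end}" for start, end in coords)


formatted = format_coords([[10, 50], [150, 200]])
```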

end() → int | None[source]

Return the end coordinate of the last exon.

Returns:

Maximum genomic coordinate across all exons, or None if no coordinates are defined.

Return type:

int | None

intergenic_distance(other_hit: Hit) → int[source]

Calculate the intergenic distance between this hit and another hit.

For genes on the same scaffold, computes the distance between the end of the upstream gene and the start of the downstream gene. If the genes overlap, the returned value is negative; for a full overlap, it is the negative of the length of the smaller gene.

Parameters:

other_hit (Hit) – The other Hit object to measure distance to.

Returns:

Intergenic distance in base pairs (positive for gaps, negative for overlaps). Returns the negative of the length of the smaller gene in case of a full overlap.

Return type:

int
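
The sign convention described above can be sketched as follows. This is an illustration under assumptions: genes are plain (start, end) tuples with 1-based inclusive coordinates, and a gap is counted as the number of bases strictly between the two genes. The library's exact arithmetic may differ.

```python
def intergenic_distance_sketch(gene_a: tuple, gene_b: tuple) -> int:
    """Distance between two (start, end) genes on the same scaffold."""
    upstream, downstream = sorted([gene_a, gene_b])
    up_len = upstream[1] - upstream[0] + 1
    down_len = downstream[1] - downstream[0] + 1
    if downstream[1] <= upstream[1]:
        # Full overlap: one gene lies entirely within the other.
        return -min(up_len, down_len)
    # Positive: bases between the genes. Negative: length of the overlap.
    return downstream[0] - upstream[1] - 1


gap = intergenic_distance_sketch((100, 200), (251, 300))
partial = intergenic_distance_sketch((150, 300), (100, 200))
nested = intergenic_distance_sketch((10, 100), (20, 50))
```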

length() → int[source]

Return the total length in base pairs of all exons.

Returns:

Sum of lengths across all exons, calculated as: (end - start + 1) for each exon.

Return type:

int

same_location(other_hit: Hit) → bool[source]

Check if two hits are at exactly the same genomic coordinates.

Parameters:

other_hit (Hit) – The other Hit object to compare.

Returns:

True if both hits are on the same scaffold and their genomic coordinates completely overlap, False otherwise.

Return type:

bool

start() → int | None[source]

Return the start coordinate of the first exon.

Returns:

Minimum genomic coordinate across all exons, or None if no coordinates are defined.

Return type:

int | None
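
The three coordinate helpers of Hit (start, end, length) can be sketched as plain functions over a nested exon list. The function names are hypothetical, and exons are assumed to be [start, end] pairs with inclusive coordinates.

```python
def exon_start(coords: list):
    """Minimum coordinate across all exons, or None without coordinates."""
    return min(start for start, _ in coords) if coords else None


def exon_end(coords: list):
    """Maximum coordinate across all exons, or None without coordinates."""
    return max(end for _, end in coords) if coords else None


def exon_length(coords: list) -> int:
    """Sum of (end - start + 1) over all exons."""
    return sum(end - start + 1 for start, end in coords)


exons = [[150, 200], [250, 300]]
span = (exon_start(exons), exon_end(exons), exon_length(exons))
```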

class cfoldseeker.classes.Search(query, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: ABC

Abstract base class for protein structure searches with cluster identification.

This class manages a FoldSeek-based search workflow (remote or local), including result parsing, cluster identification using graph-based algorithms, and output generation. Subclasses must implement abstract methods for search execution and result parsing. Methods for the cluster identification and output generation are implemented here and shared over all subclasses.

query

List of query protein structure file paths.

Type:: list

params

Search configuration parameters (e.g., max gap, min hits).

Type:: dict

hits

All identified hits from the search.

Type:: list[Hit]

clusters

Identified gene clusters passing filters.

Type:: list[Cluster]

output_flags

Outputs to be generated.

Type:: dict

OUTPUT_DIR

Directory for output files.

Type:: Path

TEMP_DIR

Directory for temporary files.

Type:: Path

generate_cblaster_session() → Session[source]

Generate a cblaster Session object from the identified clusters.

Constructs a cblaster-compatible session containing all search results, organised in the same hierarchy as cblaster (by organism, scaffold, and cluster). This object can be saved and reloaded for interactive visualisation and analysis outside of cfoldseeker.

Returns:: Session holding all information about the identified clusters.
Return type:: Session (cblaster.Session)

generate_output()[source]

Generate the requested output files for this search.

Checks which outputs are requested from the parsed output flags, and generates what is necessary using the appropriate methods.

Parameters:: None
Returns:: None

generate_tables(output_folder: Path) → None[source]

Save hit and cluster lists as tab-separated value (TSV) tables.

Generates two output files:

hits.tsv – Table of all hits with their properties.
clusters.tsv – Table of all clusters with their properties.

Parameters:: output_folder (Path) – Directory where output tables will be written.

identify_clusters() → None[source]

Identify gene clusters among the hits based on clustering criteria.

This method groups hits by scaffold, calculates intergenic distances, filters based on maximum gap and minimum hit thresholds, and uses a directed graph to identify chains of unique proximal hits. It then applies additional filters for cluster size, query coverage, and length before ranking clusters by score.

The method populates self.clusters with identified Cluster objects and updates self.hits to contain only hits in identified clusters.

Raises:

RuntimeError – If no hit groups pass the distance criteria.
RuntimeError – If no cluster could be identified among the hit groups.
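
A simplified sketch of the overall idea: group by scaffold, link proximal hits, filter by the minimum number of hits, and rank by score. It deliberately swaps the directed graph described above for a sorted linear scan and ignores the uniqueness, query coverage and length filters, so it is not the method's actual algorithm. Hits are plain dictionaries with hypothetical keys.

```python
from collections import defaultdict


def chain_hits(hits: list, max_gap: int, min_hits: int) -> list:
    """Chain neighbouring hits per scaffold and rank the chains by score."""
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit["scaff"]].append(hit)

    chains = []
    for scaffold_hits in by_scaffold.values():
        scaffold_hits.sort(key=lambda h: h["start"])
        chain = [scaffold_hits[0]]
        for prev, cur in zip(scaffold_hits, scaffold_hits[1:]):
            if cur["start"] - prev["end"] - 1 <= max_gap:
                chain.append(cur)  # close enough: extend the current chain
            else:
                chains.append(chain)  # gap too large: start a new chain
                chain = [cur]
        chains.append(chain)

    chains = [c for c in chains if len(c) >= min_hits]
    chains.sort(key=lambda c: sum(h["score"] for h in c), reverse=True)
    return chains


demo_hits = [
    {"id": "a1", "scaff": "A", "start": 100, "end": 200, "score": 50},
    {"id": "a2", "scaff": "A", "start": 250, "end": 400, "score": 60},
    {"id": "a3", "scaff": "A", "start": 5000, "end": 5100, "score": 70},
    {"id": "b1", "scaff": "B", "start": 10, "end": 90, "score": 30},
    {"id": "b2", "scaff": "B", "start": 120, "end": 300, "score": 200},
]
ranked = [[h["id"] for h in c] for c in chain_hits(demo_hits, max_gap=100, min_hits=2)]
```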

identify_clusters_from_groups(close_groups: list, max_length: int, min_hits: int, min_covered_queries: int, require: set[str], all_layouts: bool)[source]

Identify clusters from the proximal hit groups.

Constructs Cluster objects from hit groups that pass all cluster identification thresholds (max cluster length, minimum no. hits, minimum no. covered queries, required queries). Can also return all cluster layouts that fit the cluster identification thresholds with a less-than-best score.

Returns:: list of Cluster objects that pass all identification thresholds.
Return type:: clusters (list[Cluster])

abstractmethod identify_hits()[source]

Parse FoldSeek output and populate the hits list.

Note

This method must be implemented by subclasses to convert raw FoldSeek results from the webserver or a local command call into a list of Hit objects.

identify_proximal_groups(max_gap: int) → list[list[Hit]][source]

Identify proximal groups among the hits.

Calculates the distance between all genes on the same scaffold, and discards self-hits and hit pairs that fail the intergenic distance threshold.

Returns:: Hit pairs of proximal hits that pass the intergenic distance threshold.
Return type:: close_groups (list[list[Hit]])

Raises:: RuntimeError – If no hit groups pass the intergenic distance criteria.
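
The pairwise step can be sketched with itertools.combinations. The dictionary keys are hypothetical, and treating two hits at identical coordinates as a self-hit is an assumption about what "self-hits" means here.

```python
from itertools import combinations


def proximal_pairs(hits: list, max_gap: int) -> list:
    """Return [upstream, downstream] pairs that lie within max_gap bases."""
    pairs = []
    for a, b in combinations(hits, 2):
        if a["scaff"] != b["scaff"]:
            continue  # only genes on the same scaffold are compared
        if (a["start"], a["end"]) == (b["start"], b["end"]):
            continue  # same location: treated as a self-hit (assumption)
        up, down = sorted([a, b], key=lambda h: h["start"])
        if down["start"] - up["end"] - 1 <= max_gap:
            pairs.append([up, down])
    if not pairs:
        raise RuntimeError("No hit groups pass the intergenic distance criteria")
    return pairs


pair_hits = [
    {"id": "h1", "scaff": "A", "start": 100, "end": 200},
    {"id": "h2", "scaff": "A", "start": 260, "end": 400},
    {"id": "h3", "scaff": "B", "start": 120, "end": 180},
]
pair_ids = [[h["id"] for h in pair] for pair in proximal_pairs(pair_hits, max_gap=100)]
```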

abstractmethod run()[source]

Execute the complete search workflow.

Note

This method must be implemented by subclasses to orchestrate the entire search process including FoldSeek execution and result parsing.

abstractmethod run_foldseek()[source]

Run the FoldSeek search tool.

Note

This method must be implemented by subclasses to execute FoldSeek either remotely or locally with the appropriate parameters, input files and target databases.

remote

class cfoldseeker.remote.RemoteSearch(query, mapping_table_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: Search

Subclass executing the workflow for gene cluster identification from remote protein searches.

Extends the Search base class to perform FoldSeek-based searches against remote databases, parse results, and cross-reference hits to genomic data. Uses a local copy of the UniProt mapping table to retrieve genomic coordinates for each protein with fallback strategies using KEGG and ENA.

New attributes:

mapping_table: Polars LazyFrame containing UniProt ID cross-references to various databases (KEGG, EMBL-CDS, etc.).

Inherits from:

Search: Base class providing the cluster identification and output generation capabilities.

local

class cfoldseeker.local.LocalSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'))[source]

Bases: Search

Subclass executing the workflow for gene cluster identification from local protein searches using FoldSeek.

Extends the Search base class to perform searches against local FoldSeek databases. Handles FoldSeek execution, result parsing, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds.

db_path

Path to the FoldSeek protein structure target database.

Type:: Path

coord_db

LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.

Type:: polars.LazyFrame

collect_hits(results: DataFrame) → None[source]

Collects hit instances from a filtered hit table.

Collects and instantiates Hit objects for every hit in the filtered table, after fetching genomic context data from the context database. Genomic coordinate strings are parsed on-the-fly.

Parameters:: results (polars.DataFrame) – A filtered FoldSeek hits table
Returns:: None

Mutates:: self.hits: Instantiates the list of identified Hit objects.

Note

Stores generated Hit objects in self.hits as a list. Genomic coordinates are parsed from a comma-separated string of joined range pairs (e.g., “10..50”, “join(150..200,250..300)”) into nested lists of integers.
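
The coordinate parsing described in the note can be approximated with a regular expression. This is a sketch, not the library's parser: parse_coords is a hypothetical name, and partial locations (such as those marked with < or >) are not handled.

```python
import re

# One "start..end" range; works for bare ranges and for ranges inside join(...).
RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_coords(coord_string: str) -> list:
    """Parse a coordinate string into a nested list of [start, end] integers."""
    return [[int(start), int(end)] for start, end in RANGE.findall(coord_string)]


single = parse_coords("10..50")
joined = parse_coords("join(150..200,250..300)")
```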

identify_hits() → None[source]

Identify hits passing the hit thresholds from the FoldSeek results.

Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.

Returns:: None

parse_foldseek_results() → DataFrame[source]

Parse the FoldSeek result table, expand it to include all members of the sequence clusters, and generate Hit objects with filled genomic coordinates.

Reads the FoldSeek result table, expands it with all members of each sequence cluster of which the original FoldSeek hits are the representatives (by joining with the clustering table), applies filtering thresholds (bit score, query coverage, target coverage), removes duplicate hits, and joins results with the CDS coordinates database. Parses genomic coordinates from the coordinate string and creates Hit objects for each match.

The following filtering thresholds are applied:

1. Sequence identity >= min_seqid
2. E-value <= max_eval
3. Bit score >= min_score
4. Query coverage >= min_qcov (converted to percentage)
5. Target coverage >= min_tcov (converted to percentage)
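
The five thresholds can be illustrated on plain dictionaries instead of a Polars table. The sketch assumes that coverages arrive as fractions between 0 and 1 and that the thresholds are given in percent, which is one possible reading of "converted to percentage"; the row keys are hypothetical.

```python
def passes_thresholds(row: dict, params: dict) -> bool:
    """Apply the hit-level filters to a single result row."""
    return (
        row["seqid"] >= params["min_seqid"]
        and row["evalue"] <= params["max_eval"]
        and row["score"] >= params["min_score"]
        and row["qcov"] * 100 >= params["min_qcov"]  # fraction -> percent
        and row["tcov"] * 100 >= params["min_tcov"]  # fraction -> percent
    )


thresholds = {"min_seqid": 30, "max_eval": 1e-3, "min_score": 50,
              "min_qcov": 80, "min_tcov": 80}
rows = [
    {"id": "good", "seqid": 45.0, "evalue": 1e-10, "score": 300, "qcov": 0.9, "tcov": 0.95},
    {"id": "short", "seqid": 45.0, "evalue": 1e-10, "score": 300, "qcov": 0.4, "tcov": 0.95},
    {"id": "weak", "seqid": 45.0, "evalue": 0.5, "score": 20, "qcov": 0.9, "tcov": 0.95},
]
kept = [row["id"] for row in rows if passes_thresholds(row, thresholds)]
```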

Returns:: A filtered FoldSeek hits table
Return type:: results (polars.DataFrame)

run() → None[source]

Execute the complete local search workflow.

Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.

Returns:: None

run_foldseek() → None[source]

Execute a FoldSeek search against the local protein structure database.

Constructs and runs a FoldSeek ‘easy-search’ command with all query structures (in CIF format) against the local database. Applies filters for sequence identity and E-value thresholds. Captures stdout and stderr in real-time via separate threads and logs them appropriately.

Exhaustive search (no database prefiltering) has been enabled to retrieve all hits in the target database.

Returns:: None
Raises:: RuntimeError – If FoldSeek returns a non-zero exit code.

Note

FoldSeek output is written to a temporary TSV file in TEMP_DIR.
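
Capturing stdout and stderr through separate threads, as described above, can be sketched with the standard library. The example runs a small Python command rather than FoldSeek, so that it stays runnable anywhere; run_and_log and the logger name are hypothetical.

```python
import logging
import subprocess
import sys
import threading

logger = logging.getLogger("foldseek_sketch")


def _forward(stream, level: int, sink: list) -> None:
    """Read a pipe line by line, logging and storing each line."""
    for line in stream:
        text = line.rstrip()
        sink.append(text)
        logger.log(level, text)
    stream.close()


def run_and_log(cmd: list) -> tuple:
    """Run a command, stream both pipes, and fail on a non-zero exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    out_lines, err_lines = [], []
    threads = [
        threading.Thread(target=_forward, args=(proc.stdout, logging.INFO, out_lines)),
        threading.Thread(target=_forward, args=(proc.stderr, logging.WARNING, err_lines)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    if proc.wait() != 0:
        raise RuntimeError(f"Command exited with code {proc.returncode}")
    return out_lines, err_lines


demo_cmd = [sys.executable, "-c",
            "import sys; print('hit table written'); print('warning', file=sys.stderr)"]
stdout_lines, stderr_lines = run_and_log(demo_cmd)
```

Reading each pipe in its own thread avoids the deadlock that can occur when one pipe fills up while the other is being read.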

local_clustered

class cfoldseeker.local_clustered.LocalClusteredSearch(query, db_path, coord_db_path, params={}, hits=[], clusters=[], output_flags={}, output_folder=PosixPath('.'), temp_folder=PosixPath('.'), seq_clust_tsv=PosixPath('.'))[source]

Bases: LocalSearch

Subclass executing the workflow for gene cluster identification from local protein searches in a sequence-preclustered database using FoldSeek.

Extends the LocalSearch base class to expand identified hit sets with cluster members from a premade sequence clustering of the target database. In effect, it runs a LocalSearch against a FoldSeek structure database of sequence cluster representatives and adds the sequence cluster members if their representative was identified as a valid hit, before continuing with cross-referencing and gene cluster identification.

Handles FoldSeek execution, result parsing, adding sequence cluster members, Hit object generation, and gene cluster identification. Uses a TSV of CDS coordinates made beforehand with cfoldseeker-cds, and a TSV with the sequence cluster members made beforehand with MMseqs2.

db_path

Path to the FoldSeek protein structure target database.

Type:: Path

coord_db

LazyFrame containing CDS coordinates with columns: gene_tag, name, contig, strand, coords, taxon_id, taxon_name.

Type:: polars.LazyFrame

seq_clust

LazyFrame containing the sequence cluster representative of every sequence in the target database. These representatives then form the FoldSeek target structure database.

Type:: polars.LazyFrame

expand_sequence_clusters(results: DataFrame) → DataFrame[source]

Expand a given FoldSeek result table with all members of the original hits’ sequence clusters.

Includes all members of the sequence clusters of which the original FoldSeek hits are the representatives, by joining the FoldSeek result table with the MMseqs2 clustering table.

Drops duplicate protein/query pairs.

Hit metadata are taken over from the representative protein.

Parameters:: results (polars.DataFrame) – Original FoldSeek result table with only the representative proteins
Returns:: Result table with all added non-representative proteins for each representative
Return type:: expanded_results (polars.DataFrame)
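
The expansion step can be sketched without Polars as a lookup from representative to members. The row keys and function name are hypothetical. The clustering input is assumed to be (representative, member) pairs in which each representative also lists itself as a member.

```python
def expand_with_members(results: list, clustering: list) -> list:
    """Replace each representative hit by all members of its sequence cluster."""
    members = {}
    for representative, member in clustering:
        members.setdefault(representative, []).append(member)

    expanded, seen = [], set()
    for row in results:
        for member in members.get(row["target"], [row["target"]]):
            key = (member, row["query"])
            if key in seen:  # drop duplicate protein/query pairs
                continue
            seen.add(key)
            # Hit metadata are copied from the representative's row.
            expanded.append({**row, "target": member})
    return expanded


clusters_tsv = [("repA", "repA"), ("repA", "m1"), ("repB", "repB")]
hits_table = [{"target": "repA", "query": "q1", "score": 100}]
expanded_rows = expand_with_members(hits_table, clusters_tsv)
```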

identify_hits() → None[source]

Identify hits passing the hit thresholds from the FoldSeek results.

Parses the FoldSeek results table and applies hit-level filtering, then fetches genomic context information for each hit from the context DB, and collects freshly instantiated Hit objects to host all metadata.

Returns:: None

Mutates:: self.hits: Instantiates the list of identified Hit objects.

run() → None[source]

Execute the complete local search workflow.

Orchestrates all processing steps in sequence: running FoldSeek locally against the local database, parsing the FoldSeek results and creating Hit objects for hits passing the hit criteria, and identifying gene clusters from the hits.

Returns:: None

communication

cfoldseeker.communication.check_query_status(job_id: str) → str[source]

Retrieves the current status of a FoldSeek job.

Queries the FoldSeek API to check the processing status of a previously submitted job using its unique job ID.

Parameters:: job_id – The unique identifier for the FoldSeek job.
Returns:: A string indicating the job status (e.g., “COMPLETE”, “RUNNING”, etc.).

cfoldseeker.communication.pull_dict_from_unisave(entries: list, max_workers: int = 1, no_progress: bool = False) → dict[source]

Retrieves multiple UniSave records and returns them as a dictionary.

Fetches a list of UniSave entries in parallel and returns them mapped to their original accession numbers. Failed retrievals are filtered out.

Parameters:

entries – List of UniProt accession numbers to retrieve.
max_workers – Number of worker threads for parallel retrieval. Defaults to 1.
no_progress – If True, suppresses the progress bar during retrieval. Defaults to False.

Returns:

A dictionary mapping each successfully retrieved accession number to its corresponding UniSave record as a string. Failed retrievals are excluded from the dictionary.
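
A sketch of the parallel retrieval pattern, with the network call replaced by an injectable fetch function so that the example runs offline. pull_many and fake_fetch are hypothetical names; the progress bar is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def pull_many(entries: list, fetch, max_workers: int = 1) -> dict:
    """Fetch all entries in parallel and keep only the successful ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        records = list(pool.map(fetch, entries))  # order matches entries
    return {entry: record for entry, record in zip(entries, records)
            if record is not None}


def fake_fetch(entry: str):
    """Stand-in for a single-record retrieval; returns None on failure."""
    return None if entry == "BROKEN" else f"record for {entry}"


pulled = pull_many(["P12345", "BROKEN", "Q67890"], fake_fetch, max_workers=2)
```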

cfoldseeker.communication.pull_from_ena(entry: str, max_retries: int = 3) → None | str[source]

Retrieves a GenPept record from the ENA Browser API.

Attempts to fetch a GenPept sequence record from the European Nucleotide Archive (ENA) with retry logic for rate-limited responses.

Parameters:

entry – The accession number or identifier of the GenPept record to retrieve.
max_retries – Maximum number of retry attempts for rate-limited requests (error code 429). Defaults to 3. Waiting time between attempts is 5 seconds.

Returns:

A string containing the GenPept record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.

cfoldseeker.communication.pull_from_unisave(entry: str, max_retries: int = 3) → None | str[source]

Retrieves a UniSave record from the UniProt REST API.

Fetches a protein sequence record from UniSave (UniProt archive) with retry logic for rate-limited responses.

Parameters:

entry – The UniProt accession number of the record to retrieve.
max_retries – Maximum number of retry attempts for rate-limited requests (error code 429). Defaults to 3. Waiting time between attempts is 5 seconds.

Returns:

A string containing the UniSave record in text format, or None if the retrieval fails after max retries or an unexpected error occurs.
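
The retry logic shared by pull_from_ena and pull_from_unisave can be sketched generically. The HTTP request is abstracted into a callable returning (status_code, text), so no endpoint is assumed. Whether max_retries counts all attempts or only the repeats is not stated above; this sketch treats it as the total number of attempts.

```python
import time


def fetch_with_retries(entry: str, request, max_retries: int = 3, wait: float = 5.0):
    """Return the record text, or None after repeated 429s or any other failure."""
    for _ in range(max_retries):
        try:
            status, text = request(entry)
        except Exception:
            return None  # unexpected error: give up
        if status == 200:
            return text
        if status != 429:
            return None  # not rate-limited, so retrying is unlikely to help
        time.sleep(wait)  # rate-limited: wait before the next attempt
    return None


responses = iter([(429, ""), (200, "ID   DEMO")])
record = fetch_with_retries("P12345", lambda entry: next(responses), wait=0.0)
```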

cfoldseeker.communication.retrieve_foldseek_results(job_id: str) → dict[source]

Waits for a FoldSeek job to complete and retrieves its results.

Polls the job status at regular intervals until completion, then downloads and returns the parsed results from the FoldSeek API.

Parameters:: job_id – The unique identifier for the FoldSeek job.
Returns:: A dictionary containing the parsed results from the completed FoldSeek job.
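
The polling loop can be sketched with the status check passed in as a callable, which keeps the example independent of the web API. The "COMPLETE" and "RUNNING" values come from the description of check_query_status. The "ERROR" status, the poll limit and the function name are assumptions.

```python
import time


def wait_for_job(job_id: str, check_status, interval: float = 1.0, max_polls: int = 100) -> str:
    """Poll a job until it completes, fails, or the poll limit is reached."""
    for _ in range(max_polls):
        status = check_status(job_id)
        if status == "COMPLETE":
            return status
        if status == "ERROR":  # assumed failure status
            raise RuntimeError(f"Job {job_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


statuses = iter(["PENDING", "RUNNING", "COMPLETE"])
final_status = wait_for_job("demo-job", lambda job_id: next(statuses), interval=0.0)
```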

cfoldseeker.communication.submit_foldseek_query(query_path: Path, dbs: list, taxfilters: list, max_attempts: int = 3) → dict[source]

Submits a structure file to the FoldSeek API for processing.

Sends a protein structure query to the FoldSeek webserver with specified databases and taxonomic filters. Returns the submission ticket on success or raises an error on failure with maximum attempt logic.

Parameters:

query_path (Path) – Path object pointing to the structure file to submit.
dbs (list) – List of database names to search against.
taxfilters (list) – List of taxonomic filters to apply to the search.
max_attempts (int) – Maximum number of submission attempts. Defaults to 3.

Returns:

A dictionary containing the submission ticket and metadata from the FoldSeek API response.

Raises:

RuntimeError – If submission fails too many times.

remote_parsers

cfoldseeker.remote_parsers.extract_genomic_information_ena(record: str) → dict[source]

Extracts the genomic information from a pulled ENA GenPept record.

cfoldseeker.remote_parsers.extract_genomic_information_kegg(gene_entry: str) → dict[source]

Extracts the genomic information from a pulled KEGG Gene record.

cfoldseeker.remote_parsers.extract_scaffold_mapping_kegg(genome_entry: str) → dict[source]

Maps all KEGG scaffold IDs for a Genome entry to the associated GenBank/RefSeq IDs.

build_cds_db

cfoldseeker.build_cds_db.check_duplicate_contigs(cds_db: LazyFrame, parsing_mode: str) → LazyFrame[source]

Check for and attempt to fix duplicate contig labels per taxon.

Detects cases where the same contig label appears in multiple taxa. For Bakta files, attempts to prepend the existing locus tag prefix to make contigs unique. For other formats, exits with an error.

Parameters:

cds_db (polars.LazyFrame) – A dataframe containing CDS records with ‘contig’, ‘taxon_id’, and ‘gene_tag’ columns.
parsing_mode (str) – The format mode used for parsing.

Returns:

The input LazyFrame with modified contig labels if a fix was applied (Bakta mode only).

Return type:

cds_db (polars.LazyFrame)

Mutates:

cds_db (polars.LazyFrame): The input LazyFrame with modified contig labels if a fix was applied (Bakta mode only).

Raises:: RuntimeError – If duplicate contigs are detected and cannot be fixed, or if fix attempt fails.

Note

This function contains a potential local partial materialisation of the LazyFrame. This triggers all files to be parsed.
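
The duplicate check and the Bakta-style fix can be sketched on plain dictionaries. Both function names are hypothetical, and deriving the locus tag prefix as the text before the last underscore of gene_tag is an assumption about the tag layout.

```python
from collections import defaultdict


def find_shared_contigs(rows: list) -> set:
    """Return contig labels that occur in more than one taxon."""
    taxa_per_contig = defaultdict(set)
    for row in rows:
        taxa_per_contig[row["contig"]].add(row["taxon_id"])
    return {contig for contig, taxa in taxa_per_contig.items() if len(taxa) > 1}


def prefix_contigs(rows: list) -> list:
    """Prepend the locus tag prefix (assumed: text before the last '_')."""
    return [
        {**row, "contig": f'{row["gene_tag"].rsplit("_", 1)[0]}_{row["contig"]}'}
        for row in rows
    ]


cds_rows = [
    {"contig": "contig_1", "taxon_id": "1", "gene_tag": "ABCDEF_00010"},
    {"contig": "contig_1", "taxon_id": "2", "gene_tag": "UVWXYZ_00010"},
]
before = find_shared_contigs(cds_rows)
after = find_shared_contigs(prefix_contigs(cds_rows))
```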

cfoldseeker.build_cds_db.create_parser() → ArgumentParser[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:: None
Returns:: An ArgumentParser object holding the CLI ready to collect the arguments when called
Return type:: parser (ArgumentParser)

cfoldseeker.build_cds_db.fetch_taxon_names(taxon_ids: list, max_attempts: int = 5) → list[source]

Fetch NCBI taxon names for a given list of NCBI taxon IDs.

Fetches summary objects for a given list of NCBI Taxonomy IDs using BioPython’s Entrez API with retry logic, and extracts the taxon name from each summary.

Parameters:

taxon_ids (list) – List of NCBI taxon IDs as strings.
max_attempts (int) – Maximum number of fetch attempts in case of a failure. Defaults to 5.

Returns:

List of NCBI taxon names

Return type:

taxon_names (list)

cfoldseeker.build_cds_db.main()[source]

Main entry point for the CDS database construction tool.

Oversees the complete workflow: parses command-line arguments, and calls the workflow.

cfoldseeker.build_cds_db.parse_and_validate_arguments(args: Namespace) → dict[source]

This function parses and validates the arguments received through the command line.

Parameters:: args (argparse.Namespace) – A Namespace holder object with the parsed argument values
Returns:: A dictionary holding the parsed and validated argument values.
Return type:: parsed_args (dict)
Raises:: ValueError – if an invalid argument value was given.

cfoldseeker.build_cds_db.parse_files(input_path: Path, parsing_mode: str, temp_cds_db_path: Path, no_progress: bool = False) → LazyFrame[source]

Parses all input files and constructs a temporary CDS coordinates database.

Dispatches file parsing based on the specified parsing mode, and returns a lazy entry point to the temporary TSV file for further processing.

Parameters:

input_path (Path) – Path to the folder containing input files.
parsing_mode (str) – File format mode; one of ‘ncbi-gbff’, ‘ncbi-package’, ‘bakta-gbff’, or ‘tsv’.
temp_cds_db_path (Path) – Path of the temporary TSV file holding the CDS coordinates database.
no_progress (bool) – If True, suppresses the progress bar during parsing. Defaults to False.

Returns:

A Polars LazyFrame containing concatenated CDS records from all input files with columns: gene_tag, name, contig, coords, strand, taxon_id, and filename (or taxon_name for TSV formats).

cfoldseeker.build_cds_db.read_genome(file: str | Path)[source]

Open the appropriate file handle for a genome file.

Automatically distinguishes between compressed and uncompressed files based on the file extension.

Parameters:: file (str | Path) – genome file to open
Returns:: A file handle to open the genome file
Return type:: handle
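
Choosing a file handle by extension can be sketched with gzip and pathlib. This illustration assumes that compressed files end in .gz and that text mode is wanted; open_genome is a hypothetical name.

```python
import gzip
import tempfile
from pathlib import Path


def open_genome(file):
    """Open a genome file in text mode, transparently handling .gz files."""
    path = Path(file)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


# Demonstration on two temporary files with identical content.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "genome.gbff"
    packed = Path(tmp) / "genome.gbff.gz"
    plain.write_text("LOCUS demo\n")
    with gzip.open(packed, "wt") as handle:
        handle.write("LOCUS demo\n")
    with open_genome(plain) as a, open_genome(packed) as b:
        first_lines = (a.readline(), b.readline())
```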

cfoldseeker.build_cds_db.run_workflow(parsed_args: dict) → None[source]

Execute the complete CDS database construction workflow.

Loads and parses input files, validates contig uniqueness, assigns taxon labels, and writes the final CDS coordinates database to disk as a tab-separated file. Supports optional gzip compression.

Parameters:: parsed_args (dict) – A dictionary holding the parsed and validated argument values.
Returns:: None

Note

This workflow uses a lazy parsing method to limit RAM usage. This comes at the cost of the files being parsed multiple times, which is slower.

cfoldseeker.build_cds_db.set_taxon_labels(cds_db: LazyFrame, fetch_taxa_auto: bool, fetch_taxa_file: Path, parsing_mode: str, batch_size: int = 250, max_attempts: int = 5) → LazyFrame[source]

Set taxon labels as either scientific names or filenames.

For NCBI files, optionally fetches the scientific names from NCBI Taxonomy via BioPython’s NCBI Entrez API with retry logic. For Bakta files, generates generic labels or uses filenames. For TSV files, preserves user-provided annotations.

Parameters:

cds_db (polars.LazyFrame) – LazyFrame containing CDS records with ‘taxon_id’, ‘filename’, and optionally ‘gene_tag’ columns.
fetch_taxa_auto (bool) – If True, uses scientific names (NCBI) or generates generic names (Bakta) to use as taxon names instead of the filenames. If False, the default filenames will be kept, unless a rename file was supplied.
fetch_taxa_file (Path | None) – Path to the rename file with the taxon names to replace the current ones sourced from the filenames. Defaults to None.
parsing_mode (str) – The format mode used for parsing.
batch_size (int) – Number of taxon names to fetch in one batch. Defaults to 250.
max_attempts (int) – Maximum number of attempts to fetch the taxon names using Entrez. Defaults to 5.

Returns:

The input LazyFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.

Return type:

cds_db (polars.LazyFrame)

Mutates:

cds_db (polars.LazyFrame): The input LazyFrame with a new ‘taxon_name’ column and the ‘filename’ column removed.

Note

This function contains a potential local partial materialisation of the LazyFrame when fetching taxon names automatically. This triggers all files to be parsed an additional time.

cfoldseeker.build_cds_db.setup_logging(verbosity: int) → None[source]

Set up the root logger if it has not been set up yet.

Parameters:: verbosity (int) – Verbosity level (choices: 0,1,2,3,4).
Returns:: None

extract_sequences

cfoldseeker.extract_sequences.convert_session(session: Session) → Session[source]

Convert a cblaster session in-memory to a cfoldseeker session.

Adds file labels at the sequence fields of each subject in the session, which is necessary to retrieve the correct local genome and proteome files. Sets the internal ID to None to break the connection with cblaster’s own DB.

Parameters:: session (cblaster.Session) – The cblaster session to convert.
Returns:: The converted session.
Return type:: session (cblaster.Session)

Mutates:: session (cblaster.Session): The converted session.

cfoldseeker.extract_sequences.create_parser() → ArgumentParser[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:: None
Returns:: An ArgumentParser object holding the CLI ready to collect the arguments when called
Return type:: parser (ArgumentParser)

cfoldseeker.extract_sequences.locate_nucleotide_sequences(scaffolds: list[Scaffold]) → DataFrame[source]

Locate all nucleotide sequences to be fetched from session information.

Deduces where to get the nucleotide sequences for every cluster in the session file from the location information saved there.

Parameters:

scaffolds (list) – List of cblaster Scaffold objects sourced from the session file

Returns:

DataFrame holding the file location and genomic range of every cluster in the session file.

Return type:

df (polars.DataFrame)

cfoldseeker.extract_sequences.locate_protein_sequences(scaffolds: list[Scaffold]) → DataFrame[source]

Locate all protein sequences to be fetched from session information.

Deduces where to get the protein sequences for every cluster in the session file from location information saved in there.

Parameters:

scaffolds (list) – List of cblaster Scaffold objects sourced from the session file

Returns:

DataFrame holding the file locations and protein IDs of every protein of each cluster in the session file.

Return type:

df (polars.DataFrame)

cfoldseeker.extract_sequences.main()[source]

Main entry point for the sequence export tool.

Oversees the complete workflow: parses command-line arguments, and calls the workflow.

cfoldseeker.extract_sequences.parse_and_validate_arguments(args: Namespace) → dict[source]

This function parses and validates the arguments received through the command line.

Parameters:: args (argparse.Namespace) – A Namespace holder object with the parsed argument values
Returns:: A dictionary holding the parsed and validated argument values.
Return type:: parsed_args (dict)
Raises:: ValueError – if an invalid argument value was given.

cfoldseeker.extract_sequences.read_genome(file: str | Path)[source]

Open the appropriate file handle for a genome file.

Automatically distinguishes between compressed and uncompressed files based on the file extension.

Parameters:: file (str | Path) – genome file to open
Returns:: A file handle to open the genome file
Return type:: handle

cfoldseeker.extract_sequences.run_workflow(parsed_args: dict) → None[source]

Execute the sequence export workflow.

Loads the session, locates the nucleotide and protein sequences in the Genbanks folder, and writes the cluster Genbank files.

Parameters:: parsed_args (dict) – A dictionary holding the parsed and validated argument values.
Returns:: None

cfoldseeker.extract_sequences.select_clusters(session: Session, filters: dict) → tuple[source]

Selects the clusters to process based on a selection of filtering parameters.

Returns the cblaster cluster hierarchy filtered by a maximum number, cluster number, score, organism or scaffold.

Parameters:

session (cblaster.Session) – The cblaster session to get the hierarchy from.
filters (dict) – dictionary of filtering parameter values.

Returns:

Tuple of three checked out groups of objects: clusters, scaffolds and assemblies.

Return type:

cluster_hierarchy (tuple)

Note

This is a wrapper for cblaster’s get_sorted_cluster_hierarchies.

cfoldseeker.extract_sequences.setup_logging(verbosity: int) → None[source]

Set up the root logger if it has not been set up yet.

Parameters:: verbosity (int) – Verbosity level (choices: 0,1,2,3,4).
Returns:: None

cfoldseeker.extract_sequences.write_cluster_genbanks(scaffolds: list[Scaffold], assemblies: list[str], prefix: str, flavour: str, required_genes: list[str], nucl_locations: DataFrame, prot_locations: DataFrame, output_dir: Path, gbffs_path: Path, n_workers: int = 1, no_progress: bool = False) → None[source]

Write Genbank cluster files for each cluster on all scaffolds in the session.

Fetches the nucleotide and protein sequences for each cluster from the previously located files, and writes them to new Genbank files. Supports parallelisation.

Parameters:

scaffolds (list) – List of cblaster Scaffold objects sourced from the session file.
assemblies (list) – List of assembly filenames, sourced from the session.
prefix (str) – String to start the file name of each cluster with.
flavour (str) – Requested flavour of the resulting genbank file. Either regular genbank (‘genbank’), or a genbank with BigScape-specific fields (‘bigscape’).
required_genes (list) – List of strings corresponding with the query genes that were marked required during the search. Sourced from the session.
nucl_locations (polars.DataFrame) – DataFrame with the file location of all nucleotide sequences to be fetched, as determined by locate_nucleotide_sequences.
prot_locations (polars.DataFrame) – DataFrame with the file location of all protein sequences to be fetched, as determined by locate_protein_sequences.
output_dir (pathlib.Path) – Path of the output folder.
gbffs_path (pathlib.Path) – Path of the Genbanks folder.
n_workers (int) – Number of parallel worker threads. Defaults to 1.
no_progress (bool) – Flag to disable showing the progress bar. Defaults to False.

Returns:

None