OpenMS
ACTrie Class Reference

An Aho Corasick trie (a set of nodes with suffix links mainly) More...

#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>

Collaboration diagram for ACTrie:
[legend]

Public Member Functions

 ACTrie (uint32_t max_aaa=0, uint32_t max_mm=0)
 Default C'tor which just creates a root node. More...
 
 ~ACTrie ()
 D'tor. More...
 
void addNeedle (const std::string &needle)
 
void addNeedles (const std::vector< std::string > &needles)
 
void addNeedlesAndCompress (const std::vector< std::string > &needles)
 
void compressTrie ()
 Traverses the trie in BFS order and makes it more compact and efficient to traverse. More...
 
size_t getNeedleCount () const
 How many needles were added to the trie? More...
 
void setMaxAAACount (const uint32_t max_aaa)
 
uint32_t getMaxAAACount () const
 Maximum number of ambiguous amino acids allowed during search. More...
 
void setMaxMMCount (const uint32_t max_mm)
 
uint32_t getMaxMMCount () const
 Maximum number of mismatches allowed during search. More...
 
bool nextHits (ACTrieState &state) const
 
void getAllHits (ACTrieState &state) const
 

Private Member Functions

bool nextHitsNoClear_ (ACTrieState &state) const
 
Index add_ (const Index from, const AA edge)
 
bool addHits_ (Index i, const size_t text_pos, std::vector< Hit > &hits) const
 Add all hits occurring in node i (including all its suffix hits) More...
 
bool addHitsSpawn_ (Index i, const ACSpawn &spawn, const size_t text_pos, std::vector< Hit > &hits, const int current_spawn_depths) const
 same as addHits_, but only follows the suffix chain until the spawn loses its prefix More...
 
Index follow_ (const Index i, const AA edge) const
 
bool followSpawn_ (ACSpawn &spawn, const AA edge, ACTrieState &state) const
 
Index stepMaster_ (const Index i, const AA edge, ACTrieState &state) const
 
bool stepSpawn_ (ACSpawn &spawn, ACTrieState &state) const
 
void createSpawns_ (const Index i, const AA fromAA, const AA toAA, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const
 
void createSubSpawns_ (const ACSpawn &prototype, const AA fromAA, const AA toAA, ACTrieState &state) const
 Create spawns from a spawn with an AAA or MM, using prototype as template, following edges in range fromAA to toAA. More...
 
void createMMSpawns_ (const Index i, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const
 Same as createSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More...
 
void createMMSubSpawns_ (const ACSpawn &prototype, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state) const
 Same as createSubSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More...
 
Index findChildNaive_ (Index parent, AA child_label)
 During needle addition (naive trie), obtain the child with edge child_label from parent; if it does not exist, an invalid Index is returned. More...
 
Index findChildBFS_ (const Index parent, const AA child_label) const
 After compression (BFS trie), obtain the child with edge child_label from parent; if it does not exist, an invalid Index is returned. More...
 

Private Attributes

std::vector< ACNodetrie_
 the trie, in either naive structure or BFS order (after compressTrie) More...
 
uint32_t needle_count_ {0}
 total number of needles in the trie More...
 
uint32_t max_aaa_ {0}
 maximum number of ambAAs allowed More...
 
uint32_t max_mm_ {0}
 maximum number of mismatches allowed More...
 
std::unordered_map< Index, std::vector< uint32_t > > umap_index2needles_
 maps a node to which needles end there (valid for both naive and BFS tree) More...
 
std::unordered_map< Index, std::vector< Index > > umap_index2children_naive_
 maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly More...
 

Detailed Description

An Aho Corasick trie (a set of nodes with suffix links mainly)

Constructor & Destructor Documentation

◆ ACTrie()

ACTrie ( uint32_t  max_aaa = 0,
uint32_t  max_mm = 0 
)

Default C'tor which just creates a root node.

Parameters
max_aaaMaximum number of ambiguous amino acids (B,J,Z,X) allowed in a hit
max_mmMaximum number of mismatched amino acids allowed in a hit

◆ ~ACTrie()

~ACTrie ( )

D'tor.

Member Function Documentation

◆ add_()

Index add_ ( const Index  from,
const AA  edge 
)
private

Insert a new child node into the trie (unless already present) when starting at parent node from and following the edge labeled edge. Return the index of the (new) child node. Note: This operates on the naive trie, not the BFS.

◆ addHits_()

bool addHits_ ( Index  i,
const size_t  text_pos,
std::vector< Hit > &  hits 
) const
private

Add all hits occurring in node i (including all its suffix hits)

Parameters
iThe ACNode where a needle ends (also all its suffices are checked)
text_poscurrent position in query (i.e. end of matched hit)
[out]hitsResult vector which will be expanded with hits (if any)
Returns
true if hits were found

◆ addHitsSpawn_()

bool addHitsSpawn_ ( Index  i,
const ACSpawn spawn,
const size_t  text_pos,
std::vector< Hit > &  hits,
const int  current_spawn_depths 
) const
private

same as addHits_, but only follows the suffix chain until the spawn loses its prefix

◆ addNeedle()

void addNeedle ( const std::string &  needle)

Add a needle to build up the trie. Call compressTrie() after the last needle was added before searching

Exceptions
Exception::InvalidValueif needle contains an invalid amino acid (such as '*')

◆ addNeedles()

void addNeedles ( const std::vector< std::string > &  needles)

Convenience function; adds needles to build up the trie. Call compressTrie() after the last needle was added before searching

Exceptions
Exception::InvalidValueif needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception

◆ addNeedlesAndCompress()

void addNeedlesAndCompress ( const std::vector< std::string > &  needles)

Convenience function which adds needles and immediately compresses the trie (i.e. no more needles can be added)

Exceptions
Exception::InvalidValueif needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception

◆ compressTrie()

void compressTrie ( )

Traverses the trie in BFS order and makes it more compact and efficient to traverse.

Also creates the suffix links.

Call this function after adding all needles, and before searching any queries.

◆ createMMSpawns_()

void createMMSpawns_ ( const Index  i,
const AA  except_fromAA,
const AA  except_toAA,
const AA  except_edge,
ACTrieState state,
const uint32_t  aaa_left,
const uint32_t  mm_left 
) const
private

Same as createSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself.

◆ createMMSubSpawns_()

void createMMSubSpawns_ ( const ACSpawn prototype,
const AA  except_fromAA,
const AA  except_toAA,
const AA  except_edge,
ACTrieState state 
) const
private

Same as createSubSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself.

◆ createSpawns_()

void createSpawns_ ( const Index  i,
const AA  fromAA,
const AA  toAA,
ACTrieState state,
const uint32_t  aaa_left,
const uint32_t  mm_left 
) const
private

Create spawns from an AAA or MM, starting at trie node i, following edges in range fromAA to toAA The number of AAA's/MM's left for the spawn must be passed (these numbers already reflect the original edge label)

◆ createSubSpawns_()

void createSubSpawns_ ( const ACSpawn prototype,
const AA  fromAA,
const AA  toAA,
ACTrieState state 
) const
private

Create spawns from a spawn with an AAA or MM, using prototype as template, following edges in range fromAA to toAA.

◆ findChildBFS_()

Index findChildBFS_ ( const Index  parent,
const AA  child_label 
) const
private

After compression (BFS trie), obtain the child with edge child_label from parent; if it does not exist, an invalid Index is returned.

◆ findChildNaive_()

Index findChildNaive_ ( Index  parent,
AA  child_label 
)
private

During needle addition (naive trie), obtain the child with edge child_label from parent; if it does not exist, an invalid Index is returned.

◆ follow_()

Index follow_ ( const Index  i,
const AA  edge 
) const
private

Starting at node i, find the child with label edge If no child exists, follow the suffix link and try again (until root is reached) Note: This operates on the BFS trie (after compressTrie()), not the naive one.

◆ followSpawn_()

bool followSpawn_ ( ACSpawn spawn,
const AA  edge,
ACTrieState state 
) const
private

Advances spawn by consuming edge; same as follow_(), just for a spawn Returns true if spawn is still alive, false otherwise

◆ getAllHits()

void getAllHits ( ACTrieState state) const

Collects all hits from the current position in the query until the end of the query I.e. similar to while(next(state)) merge(hits_all, state.hits);

◆ getMaxAAACount()

uint32_t getMaxAAACount ( ) const

Maximum number of ambiguous amino acids allowed during search.

◆ getMaxMMCount()

uint32_t getMaxMMCount ( ) const

Maximum number of mismatches allowed during search.

◆ getNeedleCount()

size_t getNeedleCount ( ) const

How many needles were added to the trie?

◆ nextHits()

bool nextHits ( ACTrieState state) const

Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits is cleared & filled and true is returned. If the query ends and there is no hit, false is returned.

◆ nextHitsNoClear_()

bool nextHitsNoClear_ ( ACTrieState state) const
private

Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits is NOT cleared, but filled and true is returned. If the query ends and all spawns are processed, false is returned (but hits might still have changed)

◆ setMaxAAACount()

void setMaxAAACount ( const uint32_t  max_aaa)

Set maximum number of ambiguous amino acids allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.

◆ setMaxMMCount()

void setMaxMMCount ( const uint32_t  max_mm)

Set maximum number of mismatches allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.

◆ stepMaster_()

Index stepMaster_ ( const Index  i,
const AA  edge,
ACTrieState state 
) const
private

Same as follow_, but considers trying mismatches and AAA's if possible (by adding spawns to state)

Returns
The new tree node, where Master is at after consuming edge

◆ stepSpawn_()

bool stepSpawn_ ( ACSpawn spawn,
ACTrieState state 
) const
private

Same as follow_, but considers trying mismatches and AAA's if possible (by adding spawns to state)

Returns
true if spawn is still alive, false otherwise (i.e. did loose its prefix)

Member Data Documentation

◆ max_aaa_

uint32_t max_aaa_ {0}
private

maximum number of ambAAs allowed

◆ max_mm_

uint32_t max_mm_ {0}
private

maximum number of mismatches allowed

◆ needle_count_

uint32_t needle_count_ {0}
private

total number of needles in the trie

◆ trie_

std::vector<ACNode> trie_
private

the trie, in either naive structure or BFS order (after compressTrie)

◆ umap_index2children_naive_

std::unordered_map<Index, std::vector<Index> > umap_index2children_naive_
private

maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly

◆ umap_index2needles_

std::unordered_map<Index, std::vector<uint32_t> > umap_index2needles_
private

maps a node to which needles end there (valid for both naive and BFS tree)