OpenMS
FASTAContainer< TFI_File > Class Reference

FASTAContainer<TFI_File> makes FASTA entries available chunk-wise, from start to end, by loading them from a FASTA file, which avoids holding the full file in memory.

#include <OpenMS/DATASTRUCTURES/FASTAContainer.h>

Public Member Functions

 FASTAContainer ()=delete
 
 FASTAContainer (const String &FASTA_file)
 C'tor with FASTA filename.
 
size_t getChunkOffset () const
 how many entries were read and got swapped out already
 
bool activateCache ()
 Swaps in the background cache of entries, read previously via cacheChunk()
 
bool cacheChunk (int suggested_size)
 Prefetch a new cache in the background, with up to suggested_size entries (or fewer upon reaching end-of-file)
 
size_t chunkSize () const
 number of entries in active cache
 
const FASTAFile::FASTAEntry & chunkAt (size_t pos) const
 Retrieve a FASTA entry at cache position pos (fast)
 
bool readAt (FASTAFile::FASTAEntry &protein, size_t pos)
 Retrieve a FASTA entry at global position pos (must not be behind the currently active chunk, but can be smaller)
 
bool empty ()
 is the FASTA file empty?
 
void reset ()
 resets reading of the FASTA file, enables fresh reading of the FASTA from the beginning
 
size_t size () const
 NOT the number of entries in the FASTA file, but merely the number of already read entries (since we do not know how many are still to come)
 

Private Attributes

FASTAFile f_
 FASTA file connection.
 
std::vector< std::streampos > offsets_
 internal byte offsets into FASTA file for random access reading of previous entries.
 
std::vector< FASTAFile::FASTAEntry > data_fg_
 active (foreground) data
 
std::vector< FASTAFile::FASTAEntry > data_bg_
 prefetched (background) data; will become the next active data
 
size_t chunk_offset_
 number of entries before the current chunk
 
std::string filename_
 FASTA file name.
 

Detailed Description

FASTAContainer<TFI_File> makes FASTA entries available chunk-wise, from start to end, by loading them from a FASTA file. This avoids having to load the full file into memory. While loading, the container records the file offset of each entry, which allows an arbitrary i-th entry to be read again from disk. Where possible, query only entries from the currently cached chunk; re-reading earlier entries goes to disk and is slow.

Internally uses FASTAFile class to read single sequences.
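
The chunk-wise scheme described above can be illustrated with a small standalone sketch. It is not the OpenMS implementation: ChunkedReader, Entry and countEntries are hypothetical names, the parser is deliberately simplistic, and only the method names mirror the ones documented on this page. The sketch assumes that a prefetched background buffer is swapped in to become the active chunk, and that the chunk offset counts the entries swapped out so far.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for FASTAFile::FASTAEntry (illustration only).
struct Entry
{
  std::string header;
  std::string sequence;
};

// Hypothetical miniature of the foreground/background scheme; NOT the OpenMS code.
class ChunkedReader
{
public:
  explicit ChunkedReader(std::istream& in) : in_(in) {}

  // Prefetch up to suggested_size entries into the background buffer.
  bool cacheChunk(int suggested_size)
  {
    bg_.clear();
    Entry e;
    while (static_cast<int>(bg_.size()) < suggested_size && readEntry_(e))
    {
      bg_.push_back(e);
    }
    return !bg_.empty();
  }

  // Make the background buffer the active chunk; the old chunk counts as swapped out.
  bool activateCache()
  {
    chunk_offset_ += fg_.size();
    fg_ = std::move(bg_);
    bg_.clear();
    return !fg_.empty();
  }

  std::size_t chunkSize() const { return fg_.size(); }
  std::size_t getChunkOffset() const { return chunk_offset_; }
  const Entry& chunkAt(std::size_t pos) const { return fg_[pos]; }

private:
  // Read one '>'-headed record; returns false once the input is exhausted.
  bool readEntry_(Entry& e)
  {
    std::string line;
    bool found = false;
    while (std::getline(in_, line))
    {
      if (!line.empty() && line[0] == '>') { found = true; break; }
    }
    if (!found) return false;
    e.header = line.substr(1);
    e.sequence.clear();
    // Sequence lines run until the next header or the end of the input.
    while (in_.peek() != '>' && std::getline(in_, line))
    {
      e.sequence += line;
    }
    return true;
  }

  std::istream& in_;
  std::vector<Entry> fg_;          // active (foreground) chunk
  std::vector<Entry> bg_;          // prefetched (background) chunk
  std::size_t chunk_offset_ = 0;   // entries before the active chunk
};

// Usage: walk the whole input in chunks of two entries and count them.
std::size_t countEntries(const std::string& fasta_text)
{
  std::istringstream in(fasta_text);
  ChunkedReader reader(in);
  std::size_t n = 0;
  while (reader.cacheChunk(2) && reader.activateCache())
  {
    n += reader.chunkSize();
  }
  return n;
}
```

With the real class, the same loop shape would presumably apply: call cacheChunk(), then activateCache(), then iterate over chunkAt(0) to chunkAt(chunkSize() - 1).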

Constructor & Destructor Documentation

◆ FASTAContainer() [1/2]

FASTAContainer ( )
delete

◆ FASTAContainer() [2/2]

FASTAContainer ( const String & FASTA_file)
inline

C'tor with FASTA filename.

Member Function Documentation

◆ activateCache()

bool activateCache ( )
inline

Swaps in the background cache of entries, read previously via cacheChunk()

If you call this function without a prior call to cacheChunk(), the cache will be empty.

Returns
true if cache contains data; false if empty
Note
Should be invoked by a single thread, followed by a barrier to sync access of subsequent calls to chunkAt()

◆ cacheChunk()

bool cacheChunk ( int  suggested_size)
inline

Prefetch a new cache in the background, with up to suggested_size entries (or fewer upon reaching end-of-file)

Call activateCache() afterwards to make the data available via chunkAt() or readAt().

Parameters
suggested_size  Number of FASTA entries to read from disk
Returns
true if new data is available; false if background data is empty

◆ chunkAt()

const FASTAFile::FASTAEntry& chunkAt ( size_t  pos) const
inline

Retrieve a FASTA entry at cache position pos (fast)

Requires prior call to activateCache(). Index pos must be smaller than chunkSize().

Note
Can be used by multiple threads at a time (until activateCache() is called)
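
Since chunkAt() takes a position within the active cache while readAt() takes a global position, it can help to convert between the two. The following is a minimal sketch of that arithmetic, assuming the global index of a cached entry equals getChunkOffset() plus its cache position; toGlobalIndex and toCachePos are hypothetical helpers and not part of OpenMS.

```cpp
#include <cassert>
#include <cstddef>

// Global entry index of the entry at cache position cache_pos.
// chunk_offset is assumed to be the value reported by getChunkOffset().
std::size_t toGlobalIndex(std::size_t chunk_offset, std::size_t cache_pos)
{
  return chunk_offset + cache_pos;
}

// Converts a global index back to a cache position.
// Returns false if global_pos lies outside the active chunk.
bool toCachePos(std::size_t chunk_offset, std::size_t chunk_size,
                std::size_t global_pos, std::size_t& cache_pos)
{
  if (global_pos < chunk_offset || global_pos - chunk_offset >= chunk_size)
  {
    return false;
  }
  cache_pos = global_pos - chunk_offset;
  return true;
}
```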

◆ chunkSize()

size_t chunkSize ( ) const
inline

number of entries in active cache

◆ empty()

bool empty ( )
inline

is the FASTA file empty?

◆ getChunkOffset()

size_t getChunkOffset ( ) const
inline

how many entries were read and got swapped out already

◆ readAt()

bool readAt ( FASTAFile::FASTAEntry & protein,
size_t  pos 
)
inline

Retrieve a FASTA entry at global position pos (must not lie beyond the currently active chunk, but may lie before it)

This query is fast if pos lies within the currently active chunk, and slow (read from disk) for earlier entries. It can be used before reaching the end of the file, since it restores the file position after it is done reading (if reading from disk is required), but it must not be used for entries beyond the active chunk (unseen data).

Parameters
protein  Return value
pos  Absolute entry number in FASTA file
Returns
true if reading was successful; false otherwise (e.g. EOF)
Exceptions
Exception::IndexOverflow  if pos is beyond the active chunk
Note
Not thread-safe (use chunkAt() for concurrent access)!
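
The disk path of readAt() relies on remembered byte offsets. The sketch below shows that general technique with standard streams only: record the stream position before each record, later seek back to it, read the record again, and restore the previous position so that sequential reading can continue. It is an illustration under simplifying assumptions, not the OpenMS code: each record is a single line, readRecord, rereadAt and demoReread are hypothetical names, and an out-of-range index returns false instead of throwing Exception::IndexOverflow.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one record (here: one line) and remember the byte offset where it started.
bool readRecord(std::istream& in, std::vector<std::streampos>& offsets, std::string& record)
{
  const std::streampos start = in.tellg();
  if (!std::getline(in, record)) return false;
  offsets.push_back(start);
  return true;
}

// Re-read an earlier record via its stored offset, then restore the read position.
bool rereadAt(std::istream& in, const std::vector<std::streampos>& offsets,
              std::size_t pos, std::string& record)
{
  if (pos >= offsets.size()) return false;   // never seen: nothing to seek to
  const std::streampos resume = in.tellg();  // save the current position
  in.seekg(offsets[pos]);
  const bool ok = static_cast<bool>(std::getline(in, record));
  in.seekg(resume);                          // continue where we left off
  return ok;
}

// Usage: read two records, jump back to the first, then carry on with the third.
std::string demoReread()
{
  std::istringstream in(">a\n>b\n>c\n");
  std::vector<std::streampos> offsets;
  std::string current;
  readRecord(in, offsets, current);
  readRecord(in, offsets, current);
  std::string again;
  rereadAt(in, offsets, 0, again);
  readRecord(in, offsets, current);
  return again + current;
}
```

Because seeking moves the position of the one underlying stream, concurrent calls would interfere with each other, which is consistent with the note that readAt() is not thread-safe.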

◆ reset()

void reset ( )
inline

resets reading of the FASTA file, enables fresh reading of the FASTA from the beginning

◆ size()

size_t size ( ) const
inline

NOT the number of entries in the FASTA file, but merely the number of already read entries (since we do not know how many are still to come)

Note
Data in the background cache is included here, i.e. access to size()-1 using readAt() might be slow if activateCache() was not called yet.

Member Data Documentation

◆ chunk_offset_

size_t chunk_offset_
private

number of entries before the current chunk

◆ data_bg_

std::vector<FASTAFile::FASTAEntry> data_bg_
private

prefetched (background) data; will become the next active data

◆ data_fg_

std::vector<FASTAFile::FASTAEntry> data_fg_
private

active (foreground) data

◆ f_

FASTAFile f_
private

FASTA file connection.

◆ filename_

std::string filename_
private

FASTA file name.

◆ offsets_

std::vector<std::streampos> offsets_
private

internal byte offsets into FASTA file for random access reading of previous entries.