OpenMS
ProFormaTokenizer Class Reference

Tokenizer for ProForma v2 peptidoform notation. More...

#include <OpenMS/CHEMISTRY/ProFormaTokenizer.h>

Classes

struct  Token
 A single token from the input stream. More...
 

Public Types

enum class  TokenType {
  LBRACKET , RBRACKET , LPAREN , RPAREN ,
  LBRACE , RBRACE , LANGLE , RANGLE ,
  PLUS , MINUS , SLASH , PIPE ,
  HASH , COLON , COMMA , CARET ,
  QUESTION , AT , NUMBER , IDENTIFIER ,
  END
}
 Token types produced by the tokenizer. More...
 

Public Member Functions

 ProFormaTokenizer (std::string_view input, size_t start_pos=0)
 Construct a tokenizer for the given input string.
 
 ~ProFormaTokenizer ()=default
 Default destructor.
 
 ProFormaTokenizer (const ProFormaTokenizer &)=default
 Copy constructor.
 
 ProFormaTokenizer (ProFormaTokenizer &&)=default
 Move constructor.
 
ProFormaTokenizer & operator= (const ProFormaTokenizer &)=default
 Copy assignment operator.
 
ProFormaTokenizer & operator= (ProFormaTokenizer &&)=default
 Move assignment operator.
 
Token next ()
 Consume and return the next token.
 
Token peek ()
 Look at the next token without consuming it.
 
bool hasMore () const
 Check if more tokens are available.
 
size_t position () const
 Get the current position in the input.
 
std::string_view getContext (size_t pos, size_t before=20, size_t after=20) const
 Get a context string around a position for error messages.
 

Static Public Member Functions

static const char * tokenTypeName (TokenType type)
 Get a human-readable name for a token type.
 

Private Member Functions

Token scanToken_ ()
 Scan and return the next token from the current position.
 
Token scanNumber_ ()
 Scan a number token (integer, decimal, optionally signed)
 
Token scanIdentifier_ ()
 Scan an identifier token (letter sequence)
 
bool isAtEnd_ () const
 Check if we have reached the end of input.
 
char current_ () const
 Get the current character (or '\0' if at end)
 
char peek_ (size_t offset) const
 Get the character at offset from current position (or '\0' if out of bounds)
 
char advance_ ()
 Advance to the next character and return the previous one.
 

Static Private Member Functions

static bool isLetter_ (char c)
 Check if a character is a letter (A-Za-z)
 
static bool isDigit_ (char c)
 Check if a character is a digit (0-9)
 

Private Attributes

std::string_view input_
 The input string (must remain valid for tokenizer lifetime)
 
size_t pos_ = 0
 Current position in the input.
 
std::optional< Token > peeked_
 Cached peeked token (if any)
 

Detailed Description

Tokenizer for ProForma v2 peptidoform notation.

This class provides lexical analysis (tokenization) for ProForma strings. It produces tokens suitable for parsing the ProForma grammar, supporting zero-copy operation via std::string_view for performance.

The tokenizer handles:

  • Single-character tokens: [ ] ( ) { } < > + - / | # : , ^ ? @
  • Numbers: integers and decimals (e.g., "123", "15.9949", "+2", "-1")
  • Identifiers: letter sequences (e.g., "UNIMOD", "Oxidation", "PEPTIDE")
  • End of input marker
Note
The tokenizer does not skip whitespace; according to the specification, ProForma strings should not contain any.
The input string must remain valid for the lifetime of the tokenizer since tokens reference slices of the input.
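
The lifetime note can be illustrated with a small standalone sketch that does not use OpenMS. SliceToken, sliceAt and pointsInto are hypothetical names introduced here for illustration only; the sketch merely shows that a std::string_view slice refers to the caller's buffer instead of owning a copy, which is why that buffer has to outlive every token taken from it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a zero-copy token: a view plus its byte offset.
struct SliceToken
{
  std::string_view text;
  std::size_t position;
};

// Return the slice [pos, pos + len) of `input` without copying characters.
SliceToken sliceAt(std::string_view input, std::size_t pos, std::size_t len)
{
  assert(pos <= input.size()); // substr() would throw std::out_of_range otherwise
  return SliceToken{input.substr(pos, len), pos};
}

// True if the token's characters live inside `owner`'s buffer (no copy made).
bool pointsInto(const SliceToken& tok, const std::string& owner)
{
  const char* first = owner.data();
  const char* last = owner.data() + owner.size();
  std::less_equal<const char*> le{}; // well-defined ordering for any pointers
  return le(first, tok.text.data()) && le(tok.text.data() + tok.text.size(), last);
}
```

In this sketch, slicing "EM[UNIMOD:35]K" at offset 3 with length 6 yields a view of "UNIMOD" that still points into the original std::string. Destroying or reassigning that string would leave the view dangling, so the string should be a named object with a longer lifetime than the tokens, not a temporary.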

Usage example:

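#include <OpenMS/CHEMISTRY/ProFormaTokenizer.h>
#include <iostream>
#include <string>

// `input` must outlive the tokenizer: every token is a view into it.
std::string input = "EM[UNIMOD:35]K";
ProFormaTokenizer tokenizer(input);
while (tokenizer.hasMore())
{
  ProFormaTokenizer::Token token = tokenizer.next();
  std::cout << "Token: " << token.text << " at position " << token.position << std::endl;
}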
Members of ProFormaTokenizer::Token used above:

size_t position
 Byte offset in the original input (0-indexed)
std::string_view text
 View into the original input (zero-copy)

Member Enumeration Documentation

◆ TokenType

enum class TokenType
strong

Token types produced by the tokenizer.

Enumerator
LBRACKET 

Left square bracket: [.

RBRACKET 

Right square bracket: ].

LPAREN 

Left parenthesis: (.

RPAREN 

Right parenthesis: )

LBRACE 

Left curly brace: {.

RBRACE 

Right curly brace: }.

LANGLE 

Left angle bracket: <.

RANGLE 

Right angle bracket: >

PLUS 

Plus sign: +.

MINUS 

Minus sign: -.

SLASH 

Forward slash: /.

PIPE 

Vertical bar (pipe): |.

HASH 

Hash/pound sign: #.

COLON 

Colon: :

COMMA 

Comma: ,.

CARET 

Caret: ^.

QUESTION 

Question mark: ?

AT 

At sign: @.

NUMBER 

Numeric literal (integer or decimal, possibly with leading +/-)

IDENTIFIER 

Letter sequence (A-Za-z)

END 

End of input.
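
The first 18 enumerators are listed in the same order as the single-character tokens in the detailed description. The following standalone sketch uses that observation to classify a character. singleCharTokenIndex is a hypothetical helper, not part of OpenMS, and it assumes the enumerators carry no explicit values, so that the returned index equals the enumerator's position in the list above.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: position of `c` among the single-character tokens,
// written in the documented enumerator order LBRACKET ... AT.
// Returns std::nullopt for any character that is not one of them.
std::optional<std::size_t> singleCharTokenIndex(char c)
{
  constexpr std::string_view kSingleCharTokens = "[](){}<>+-/|#:,^?@";
  const std::size_t idx = kSingleCharTokens.find(c);
  if (idx == std::string_view::npos) return std::nullopt;
  return idx;
}
```

A '+' or '-' may also start a NUMBER token, so a real scanner would presumably need to inspect the following character before settling on PLUS or MINUS. The sketch leaves that decision to the caller.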

Constructor & Destructor Documentation

◆ ProFormaTokenizer() [1/3]

ProFormaTokenizer ( std::string_view  input,
size_t  start_pos = 0 
)
explicit

Construct a tokenizer for the given input string.

Parameters
input      The ProForma string to tokenize. Must remain valid for the lifetime of this tokenizer.
start_pos  Optional starting position (default 0). Used for efficient lookahead without re-scanning from the beginning.

◆ ~ProFormaTokenizer()

~ProFormaTokenizer ( )
default

Default destructor.

◆ ProFormaTokenizer() [2/3]

ProFormaTokenizer ( const ProFormaTokenizer &  )
default

Copy constructor.

◆ ProFormaTokenizer() [3/3]

ProFormaTokenizer ( ProFormaTokenizer &&  )
default

Move constructor.

Member Function Documentation

◆ advance_()

char advance_ ( )
private

Advance to the next character and return the previous one.

◆ current_()

char current_ ( ) const
private

Get the current character (or '\0' if at end)

◆ getContext()

std::string_view getContext ( size_t  pos,
size_t  before = 20,
size_t  after = 20 
) const

Get a context string around a position for error messages.

Returns a substring of the input centered around the given position, useful for providing context in error messages.

Parameters
pos     The position to center the context around
before  Maximum number of characters to include before pos
after   Maximum number of characters to include after pos
Returns
A view of the context substring
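
One way to compute such a window is sketched below as a standalone free function. contextAround is a hypothetical name, and the clamping rules are assumptions, since this page does not specify them: pos is clamped to the input length, and the character at pos counts towards the "after" budget.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical free-function sketch; not the OpenMS implementation.
std::string_view contextAround(std::string_view input, std::size_t pos,
                               std::size_t before = 20, std::size_t after = 20)
{
  pos = std::min(pos, input.size());                      // clamp into [0, size]
  const std::size_t start = pos - std::min(pos, before);  // no unsigned underflow
  const std::size_t remaining = input.size() - pos;
  const std::size_t end = pos + std::min(remaining, after);
  return input.substr(start, end - start);                // view, no copy
}
```

Taking the minimum before subtracting keeps the size_t arithmetic from wrapping around when pos is closer to the start of the input than before characters. The result is again a view, so it is subject to the same lifetime requirement as the input.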

◆ hasMore()

bool hasMore ( ) const

Check if more tokens are available.

Returns
true if there are more tokens (i.e., next() would not return END)

◆ isAtEnd_()

bool isAtEnd_ ( ) const
private

Check if we have reached the end of input.

◆ isDigit_()

static bool isDigit_ ( char  c)
staticprivate

Check if a character is a digit (0-9)

◆ isLetter_()

static bool isLetter_ ( char  c)
staticprivate

Check if a character is a letter (A-Za-z)

◆ next()

Token next ( )

Consume and return the next token.

Advances the tokenizer position past the returned token.

Returns
The next token from the input stream

◆ operator=() [1/2]

ProFormaTokenizer & operator= ( const ProFormaTokenizer &  )
default

Copy assignment operator.

◆ operator=() [2/2]

ProFormaTokenizer & operator= ( ProFormaTokenizer &&  )
default

Move assignment operator.

◆ peek()

Token peek ( )

Look at the next token without consuming it.

Multiple calls to peek() without intervening next() calls return the same token.

Returns
The next token that would be returned by next()
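
The peek()/next() contract, together with the std::optional cache listed under Private Attributes, can be sketched on a simplified reader that treats every character as one token. PeekableReader is a hypothetical class written for this illustration; how the real tokenizer implements the cache is an assumption here.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical one-character-per-token reader, used only to show the caching.
class PeekableReader
{
public:
  explicit PeekableReader(std::string_view input) : input_(input) {}

  // Return the next character without consuming it; '\0' marks end of input.
  char peek()
  {
    if (!peeked_) peeked_ = scan_(); // scan once, then serve from the cache
    return *peeked_;
  }

  // Consume and return the next character, draining the cache first.
  char next()
  {
    if (peeked_)
    {
      const char c = *peeked_;
      peeked_.reset();
      return c;
    }
    return scan_();
  }

  // True if next() would not return the end marker.
  bool hasMore() const
  {
    if (peeked_) return *peeked_ != '\0';
    return pos_ < input_.size();
  }

private:
  char scan_() { return pos_ < input_.size() ? input_[pos_++] : '\0'; }

  std::string_view input_;     // must outlive the reader
  std::size_t pos_ = 0;        // offset of the next character to scan
  std::optional<char> peeked_; // cached lookahead, if any
};
```

Because peek() has to fill the cache, it cannot be const, which matches the non-const peek() signature above. hasMore() consults the cache first, since a cached token has already moved the scan position past itself.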

◆ peek_()

char peek_ ( size_t  offset) const
private

Get the character at offset from current position (or '\0' if out of bounds)

◆ position()

size_t position ( ) const

Get the current position in the input.

Returns
The byte offset of the next character to be scanned

◆ scanIdentifier_()

Token scanIdentifier_ ( )
private

Scan an identifier token (letter sequence)

◆ scanNumber_()

Token scanNumber_ ( )
private

Scan a number token (integer, decimal, optionally signed)
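
A minimal standalone sketch of scanning such a number is given below. numberLength is a hypothetical helper, and the grammar it accepts, an optional sign, one or more digits, and an optional fraction that needs at least one digit after the dot, is an assumption derived from the examples "123", "15.9949", "+2" and "-1". It is not the OpenMS implementation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: length of the number starting at `pos`,
// or 0 if no number starts there.
// Assumed grammar: [+-]? digit+ ( '.' digit+ )?
std::size_t numberLength(std::string_view input, std::size_t pos)
{
  auto isDigit = [](char c) { return c >= '0' && c <= '9'; };
  // Bounds-checked access that yields '\0' past the end of the input.
  auto at = [&](std::size_t i) { return i < input.size() ? input[i] : '\0'; };

  std::size_t i = pos;
  if (at(i) == '+' || at(i) == '-') ++i;   // optional sign
  if (!isDigit(at(i))) return 0;           // at least one digit is required
  while (isDigit(at(i))) ++i;              // integer part
  if (at(i) == '.' && isDigit(at(i + 1)))  // fraction only if a digit follows
  {
    ++i;
    while (isDigit(at(i))) ++i;
  }
  return i - pos;
}
```

Under these assumptions a sign that is not followed by a digit is not part of a number and would be left for the PLUS or MINUS token. The '\0' sentinel plays the same role as the out-of-bounds value documented for current_() and peek_().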

◆ scanToken_()

Token scanToken_ ( )
private

Scan and return the next token from the current position.

◆ tokenTypeName()

static const char * tokenTypeName ( TokenType  type)
static

Get a human-readable name for a token type.

Parameters
type  The token type
Returns
A string describing the token type

Member Data Documentation

◆ input_

std::string_view input_
private

The input string (must remain valid for tokenizer lifetime)

◆ peeked_

std::optional<Token> peeked_
private

Cached peeked token (if any)

◆ pos_

size_t pos_ = 0
private

Current position in the input.