Tokenizer for ProForma v2 peptidoform notation.

#include <OpenMS/CHEMISTRY/ProFormaTokenizer.h>

Classes
struct	Token
	A single token from the input stream.

Public Types
enum class	TokenType { LBRACKET , RBRACKET , LPAREN , RPAREN , LBRACE , RBRACE , LANGLE , RANGLE , PLUS , MINUS , SLASH , PIPE , HASH , COLON , COMMA , CARET , QUESTION , AT , NUMBER , IDENTIFIER , END }
	Token types produced by the tokenizer.

Public Member Functions
	ProFormaTokenizer (std::string_view input, size_t start_pos=0)
	Construct a tokenizer for the given input string.

	~ProFormaTokenizer ()=default
	Default destructor.

	ProFormaTokenizer (const ProFormaTokenizer &)=default
	Copy constructor.

	ProFormaTokenizer (ProFormaTokenizer &&)=default
	Move constructor.

ProFormaTokenizer &	operator= (const ProFormaTokenizer &)=default
	Copy assignment operator.

ProFormaTokenizer &	operator= (ProFormaTokenizer &&)=default
	Move assignment operator.

Token	next ()
	Consume and return the next token.

Token	peek ()
	Look at the next token without consuming it.

bool	hasMore () const
	Check if more tokens are available.

size_t	position () const
	Get the current position in the input.

std::string_view	getContext (size_t pos, size_t before=20, size_t after=20) const
	Get a context string around a position for error messages.

Static Public Member Functions
static const char *	tokenTypeName (TokenType type)
	Get a human-readable name for a token type.

Private Member Functions
Token	scanToken_ ()
	Scan and return the next token from the current position.

Token	scanNumber_ ()
	Scan a number token (integer, decimal, optionally signed)

Token	scanIdentifier_ ()
	Scan an identifier token (letter sequence)

bool	isAtEnd_ () const
	Check if we have reached the end of input.

char	current_ () const
	Get the current character (or '\0' if at end)

char	peek_ (size_t offset) const
	Get the character at offset from current position (or '\0' if out of bounds)

char	advance_ ()
	Advance to the next character and return the previous one.

Static Private Member Functions
static bool	isLetter_ (char c)
	Check if a character is a letter (A-Za-z)

static bool	isDigit_ (char c)
	Check if a character is a digit (0-9)

Private Attributes
std::string_view	input_
	The input string (must remain valid for tokenizer lifetime)

size_t	pos_ = 0
	Current position in the input.

std::optional< Token >	peeked_
	Cached peeked token (if any)

Detailed Description

Tokenizer for ProForma v2 peptidoform notation.

This class provides lexical analysis (tokenization) for ProForma strings. It produces tokens suitable for parsing the ProForma grammar, supporting zero-copy operation via std::string_view for performance.

The tokenizer handles:

Single-character tokens: [ ] ( ) { } < > + - / | # : , ^ ? @
Numbers: integers and decimals (e.g., "123", "15.9949", "+2", "-1")
Identifiers: letter sequences (e.g., "UNIMOD", "Oxidation", "PEPTIDE")
End of input marker

Note: The tokenizer does not skip whitespace. According to the specification, ProForma strings should not contain whitespace.

Note: The input string must remain valid for the lifetime of the tokenizer, since tokens reference slices of the input.
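
The token categories listed above can be illustrated with a small stand-in lexer. This is a minimal sketch, not the OpenMS implementation: `SketchKind`, `SketchToken`, and `sketchTokenize` are hypothetical names, all single-character tokens are collapsed into one `Symbol` kind, and the rule that a `+` or `-` is folded into a number only when a digit follows directly is an assumption about how signed numbers are told apart from the PLUS and MINUS tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in types; these are NOT the OpenMS definitions.
enum class SketchKind { Symbol, Number, Identifier, End };

struct SketchToken
{
  SketchKind kind;
  std::string_view text;  // slice of the caller's buffer, no copy is made
  std::size_t position;   // byte offset of the token's first character
};

inline bool isLetter(char c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
inline bool isDigit(char c) { return c >= '0' && c <= '9'; }

std::vector<SketchToken> sketchTokenize(std::string_view input)
{
  std::vector<SketchToken> out;
  std::size_t pos = 0;
  while (pos < input.size())
  {
    const std::size_t start = pos;
    const char c = input[pos];
    // Assumption: a sign belongs to a number only if a digit follows it directly.
    const bool signed_number =
        (c == '+' || c == '-') && pos + 1 < input.size() && isDigit(input[pos + 1]);
    if (isDigit(c) || signed_number)
    {
      if (signed_number) ++pos;
      while (pos < input.size() && isDigit(input[pos])) ++pos;
      // Accept a fractional part only if a digit follows the dot.
      if (pos + 1 < input.size() && input[pos] == '.' && isDigit(input[pos + 1]))
      {
        ++pos;
        while (pos < input.size() && isDigit(input[pos])) ++pos;
      }
      out.push_back({SketchKind::Number, input.substr(start, pos - start), start});
    }
    else if (isLetter(c))
    {
      while (pos < input.size() && isLetter(input[pos])) ++pos;
      out.push_back({SketchKind::Identifier, input.substr(start, pos - start), start});
    }
    else
    {
      // Everything else, whitespace included, becomes a one-character token.
      ++pos;
      out.push_back({SketchKind::Symbol, input.substr(start, 1), start});
    }
  }
  // Explicit end-of-input marker with an empty text slice.
  out.push_back({SketchKind::End, input.substr(input.size()), input.size()});
  return out;
}
```

Calling `sketchTokenize("EM[UNIMOD:35]K")` splits the string into identifier, symbol, and number slices followed by the end marker. Because whitespace is not skipped, a stray space would surface as its own token rather than disappear silently.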

Usage example:

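#include <OpenMS/CHEMISTRY/ProFormaTokenizer.h>
#include <iostream>
#include <string>

// The string must outlive the tokenizer: tokens are views into it.
std::string input = "EM[UNIMOD:35]K";
ProFormaTokenizer tokenizer(input);

while (tokenizer.hasMore())
{
    ProFormaTokenizer::Token token = tokenizer.next();
    std::cout << "Token: " << token.text << " at position " << token.position << std::endl;
}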

Member Enumeration Documentation

◆ TokenType

enum class TokenType

strong

Token types produced by the tokenizer.

Enumerator
LBRACKET	Left square bracket: [
RBRACKET	Right square bracket: ]
LPAREN	Left parenthesis: (
RPAREN	Right parenthesis: )
LBRACE	Left curly brace: {
RBRACE	Right curly brace: }
LANGLE	Left angle bracket: <
RANGLE	Right angle bracket: >
PLUS	Plus sign: +
MINUS	Minus sign: -
SLASH	Forward slash: /
PIPE	Vertical bar (pipe): |
HASH	Hash/pound sign: #
COLON	Colon: :
COMMA	Comma: ,
CARET	Caret: ^
QUESTION	Question mark: ?
AT	At sign: @
NUMBER	Numeric literal (integer or decimal, possibly with leading +/-)
IDENTIFIER	Letter sequence (A-Za-z)
END	End of input.

Constructor & Destructor Documentation

◆ ProFormaTokenizer() [1/3]

ProFormaTokenizer	(	std::string_view	input,
		size_t	start_pos = `0`
	)

explicit

Construct a tokenizer for the given input string.

Parameters

input	The ProForma string to tokenize. Must remain valid for the lifetime of this tokenizer.
start_pos	Optional starting position (default 0). Used for efficient lookahead without re-scanning from the beginning.
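
The lifetime requirement on `input` follows from how `std::string_view` works: a view stores only a pointer and a length. The sketch below illustrates this with two hypothetical helpers (`leadingLetters` and `aliases` are not part of OpenMS): a slice taken from a view points into the owner's storage, so destroying or reallocating the owner would leave the slice dangling.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the leading run of uppercase letters as a view.
std::string_view leadingLetters(std::string_view input)
{
  std::size_t n = 0;
  while (n < input.size() && input[n] >= 'A' && input[n] <= 'Z') ++n;
  return input.substr(0, n);  // no allocation, no copy
}

// Hypothetical helper: true if `slice` lies entirely inside `buffer`'s storage.
bool aliases(std::string_view slice, std::string_view buffer)
{
  std::less_equal<const char*> le;  // total order over pointers
  return le(buffer.data(), slice.data()) &&
         le(slice.data() + slice.size(), buffer.data() + buffer.size());
}
```

With `std::string owner = "EM[UNIMOD:35]K";`, the view returned by `leadingLetters(owner)` aliases `owner`, whereas `std::string copy(view)` owns separate storage. Passing a temporary `std::string` to a tokenizer constructor that takes a `std::string_view` would therefore be unsafe once the temporary is destroyed.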

◆ ~ProFormaTokenizer()

~ProFormaTokenizer ( )

default

Default destructor.

◆ ProFormaTokenizer() [2/3]

ProFormaTokenizer ( const ProFormaTokenizer & )

default

Copy constructor.

◆ ProFormaTokenizer() [3/3]

ProFormaTokenizer ( ProFormaTokenizer && )

default

Move constructor.

Member Function Documentation

◆ advance_()

char advance_ ( )

private

Advance to the next character and return the previous one.

◆ current_()

char current_ ( ) const

private

Get the current character (or '\0' if at end)

◆ getContext()

std::string_view getContext	(	size_t	pos,
		size_t	before = `20`,
		size_t	after = `20`
	)		const

Get a context string around a position for error messages.

Returns a substring of the input centered around the given position, useful for providing context in error messages.

Parameters

pos	The position to center the context around
before	Maximum number of characters to include before pos
after	Maximum number of characters to include after pos

Returns: A view of the context substring
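
A context window of this kind can be computed with clamped offsets. The following is a minimal sketch under stated assumptions, not the OpenMS implementation: `contextAround` is a hypothetical free function, out-of-range positions are clamped to the end of the input, and the character at `pos` is counted as the first of the `after` characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper; the clamping rules are assumptions made for this sketch.
std::string_view contextAround(std::string_view input, std::size_t pos,
                               std::size_t before = 20, std::size_t after = 20)
{
  pos = std::min(pos, input.size());                       // clamp pos into range
  const std::size_t start = pos - std::min(pos, before);   // cannot underflow
  const std::size_t end = pos + std::min(after, input.size() - pos);
  return input.substr(start, end - start);                 // view, not a copy
}
```

For example, `contextAround("ABCDEFGHIJ", 5, 2, 3)` yields the view `"DEFGH"`. The result is itself a view into the input, so it is subject to the same lifetime rule as the tokens.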

◆ hasMore()

bool hasMore ( ) const

Check if more tokens are available.

Returns: true if there are more tokens (i.e., next() would not return END)

◆ isAtEnd_()

bool isAtEnd_ ( ) const

private

Check if we have reached the end of input.

◆ isDigit_()

static bool isDigit_ ( char c )

static private

Check if a character is a digit (0-9)

◆ isLetter_()

static bool isLetter_ ( char c )

static private

Check if a character is a letter (A-Za-z)

◆ next()

Token next ( )

Consume and return the next token.

Advances the tokenizer position past the returned token.

Returns: The next token from the input stream

◆ operator=() [1/2]

ProFormaTokenizer & operator= ( const ProFormaTokenizer & )

default

Copy assignment operator.

◆ operator=() [2/2]

ProFormaTokenizer & operator= ( ProFormaTokenizer && )

default

Move assignment operator.

◆ peek()

Token peek ( )

Look at the next token without consuming it.

Multiple calls to peek() without intervening next() calls return the same token.

Returns: The next token that would be returned by next()
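
The repeatable-peek behavior, together with the `std::optional< Token > peeked_` member, suggests a one-slot cache. The sketch below shows that pattern on a simplified stream whose "tokens" are single characters. `PeekableChars` is a hypothetical class, and treating `'\0'` as the end marker is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical single-character stream used only to show the caching pattern.
class PeekableChars
{
public:
  explicit PeekableChars(std::string_view input) : input_(input) {}

  // Scan at most once; repeated calls return the cached value.
  char peek()
  {
    if (!peeked_) peeked_ = scan_();
    return *peeked_;
  }

  // Hand out the cached value if there is one, otherwise scan a fresh one.
  char next()
  {
    if (peeked_)
    {
      const char c = *peeked_;
      peeked_.reset();
      return c;
    }
    return scan_();
  }

private:
  // Returns '\0' once the input is exhausted (end marker of this sketch).
  char scan_() { return pos_ < input_.size() ? input_[pos_++] : '\0'; }

  std::string_view input_;
  std::size_t pos_ = 0;
  std::optional<char> peeked_;
};
```

With this design, `peek()` followed by `next()` scans the input only once, and any number of consecutive `peek()` calls observe the same value.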

◆ peek_()

char peek_ ( size_t offset ) const

private

Get the character at offset from current position (or '\0' if out of bounds)

◆ position()

size_t position ( ) const

Get the current position in the input.

Returns: The byte offset of the next character to be scanned

◆ scanIdentifier_()

Token scanIdentifier_ ( )

private

Scan an identifier token (letter sequence)

◆ scanNumber_()

Token scanNumber_ ( )

private

Scan a number token (integer, decimal, optionally signed)

◆ scanToken_()

Token scanToken_ ( )

private

Scan and return the next token from the current position.

◆ tokenTypeName()

static const char * tokenTypeName ( TokenType type )

static

Get a human-readable name for a token type.

Parameters

type	The token type

Returns: A string describing the token type
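
A name lookup of this shape is commonly written as a `switch` over the enumeration that returns string literals, which matches the `const char *` return type. The sketch below uses a hypothetical subset of the token types. The returned strings are assumptions chosen for illustration; the names the library actually returns are not documented here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical subset of the token types; not the OpenMS enumeration.
enum class SketchTokenType { LBRACKET, RBRACKET, NUMBER, IDENTIFIER, END };

// The strings below are illustrative assumptions, not the library's output.
const char* sketchTokenTypeName(SketchTokenType type)
{
  switch (type)
  {
    case SketchTokenType::LBRACKET:   return "'['";
    case SketchTokenType::RBRACKET:   return "']'";
    case SketchTokenType::NUMBER:     return "number";
    case SketchTokenType::IDENTIFIER: return "identifier";
    case SketchTokenType::END:        return "end of input";
  }
  return "unknown";  // reached only for values outside the enumeration
}
```

Returning literals avoids allocation and keeps the function usable while building error messages, for example "expected ']' but found end of input".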

Member Data Documentation

◆ input_

std::string_view input_

private

The input string (must remain valid for tokenizer lifetime)

◆ peeked_

std::optional<Token> peeked_

private

Cached peeked token (if any)

◆ pos_

size_t pos_ = 0

private

Current position in the input.
