@yozora/core-tokenizer
Defines the shape of Yozora Tokenizer and life cycle methods, as well as some utility functions to assist in resolving tokens.
Install
- npm: npm install --save @yozora/core-tokenizer
- Yarn: yarn add @yozora/core-tokenizer
- pnpm: pnpm add @yozora/core-tokenizer
Usage
According to the Parse Strategy, there are two types of tokenizers: Block Tokenizer and Inline Tokenizer.
Block Tokenizer
The parsing steps of the block tokenizer are divided into three life cycles:
- match-block: Match a block node and get a BlockToken
- post-match-block: Filter or merge block-level nodes at the same level (currently only used in @yozora/tokenizer-list)
- parse-block: Parse a BlockToken into a YAST node
match-block phase
In the process of parsing block nodes, the content is read line by line. The block-level node has a nested structure:
> This is a blockquote
> - This is a list item in blockquote
> - # This is an ATX heading in the list item of the blockquote
> - > ...
As the second line of the example above shows, when a ListItem is parsed it
cannot start from the first character of the original document line. It has to
wait until its ancestor elements along the existing nesting structure (such as
the Blockquote above) have finished matching, and only then gets its own chance
to match. To let the tokenizers work with each other transparently, the life
cycle methods of the block-level tokenizer in the match-block stage were
designed so that the parsing logic of the nested structure is lifted into
@yozora/core-parser, and a special data structure called PhrasingContentLine is
used as the actual parsing unit of a line:
export interface PhrasingContentLine {
  /**
   * Start index of interval in nodePoints.
   */
  startIndex: number
  /**
   * End index of interval in nodePoints.
   */
  endIndex: number
  /**
   * Array of NodePoint which contains all the contents of this line.
   */
  nodePoints: ReadonlyArray<NodePoint>
  /**
   * The index of the first non-blank character in the rest of the current line.
   */
  firstNonWhitespaceIndex: number
  /**
   * The count of preceding spaces; one tab counts as four spaces.
   * @see https://github.github.com/gfm/#tabs
   */
  countOfPrecedeSpaces: number
}
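As a rough illustration of how the two derived fields relate to the raw content, the sketch below fills them in for a single line. It is not taken from @yozora/core-parser: `SimpleNodePoint`, `SimpleLine`, `describeLine` and `toNodePoints` are hypothetical names, the node point is reduced to a bare code point, the interval is assumed to be half-open (`endIndex` exclusive), and a tab is simply counted as four spaces as the comment above states.

```typescript
// Hypothetical, simplified stand-in for NodePoint: only the code point is modelled.
interface SimpleNodePoint {
  codePoint: number
}

// Same fields as PhrasingContentLine, but built on the simplified node point.
interface SimpleLine {
  startIndex: number
  endIndex: number
  nodePoints: ReadonlyArray<SimpleNodePoint>
  firstNonWhitespaceIndex: number
  countOfPrecedeSpaces: number
}

const SPACE = 0x20
const TAB = 0x09

// Hypothetical helper: describe the line stored in nodePoints[startIndex, endIndex).
function describeLine(
  nodePoints: ReadonlyArray<SimpleNodePoint>,
  startIndex: number,
  endIndex: number,
): SimpleLine {
  let i = startIndex
  let countOfPrecedeSpaces = 0
  for (; i < endIndex; i += 1) {
    const c = nodePoints[i].codePoint
    if (c === SPACE) countOfPrecedeSpaces += 1
    else if (c === TAB) countOfPrecedeSpaces += 4 // assumption: one tab counts as four spaces
    else break
  }
  return {
    startIndex,
    endIndex,
    nodePoints,
    firstNonWhitespaceIndex: i,
    countOfPrecedeSpaces,
  }
}

// Hypothetical helper: turn a string into simplified node points.
const toNodePoints = (text: string): SimpleNodePoint[] =>
  Array.from(text, ch => ({ codePoint: ch.codePointAt(0)! }))

// Two spaces and one tab precede the list marker.
const sampleLine = describeLine(toNodePoints('  \t- item'), 0, 9)
console.log(sampleLine.firstNonWhitespaceIndex, sampleLine.countOfPrecedeSpaces) // 3 6
```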
The life cycle methods at this stage are subdivided as follows (see match-block for the type definition details):
- eatOpener: (Required) Try to match a new block node.
- eatAndInterruptPreviousSibling: (Optional) Try to interrupt the previous sibling node and match a new block node.
- eatContinuationText: (Optional) Try to match the continuation text of the current block node, that is, consume the current PhrasingContentLine with the current block node. Several outcomes are possible at this stage, distinguished by the value of status in the returned result:
  - notMatched: Not matched.
  - closing: Matched, and this is the last line of the current block node. That is, the current block node is saturated and is closing.
  - opening: Matched, and not closing yet.
  - failedAndRollback: The match failed, and the content of the previous lines has to be rolled back. For convenience, it is assumed that the rollback operation does not affect the previously satisfied nested structure.
  - closingAndRollback: The match failed, but only the last line needs to be rolled back; the current node is still valid and will be closed soon.
- eatLazyContinuationText: (Optional) Try to match lazy continuation text. Currently only @yozora/tokenizer-paragraph needs to implement this method; see step 3 of https://github.github.com/gfm/#phase-1-block-structure for details.
- onClose: (Optional) Called when the current node is closed; used to perform some cleanup operations.
- extractPhrasingContentLines: (Optional) Convert a Block Token generated by the current tokenizer to PhrasingContentLines[]. This method is only needed when a matched node of this type may be rolled back.
- buildBlockToken: (Optional) Convert PhrasingContentLines[] into a Block Token. This method is only needed when a matched node of this type may be rolled back.
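To make the interplay between eatOpener and the status returned by eatContinuationText more concrete, here is a toy sketch. It is only an illustration under loud assumptions: it works on plain strings instead of PhrasingContentLine, the method signatures are invented for this example and do not match the real ones in match-block, and `FenceToken`, `fenceTokenizer` and `matchFence` are hypothetical names. Only the opening and closing statuses are exercised.

```typescript
// The five status values named above, modelled as a hypothetical union type.
type ContinuationStatus =
  | 'notMatched'
  | 'opening'
  | 'closing'
  | 'failedAndRollback'
  | 'closingAndRollback'

// Hypothetical token for a block delimited by lines of three or more tildes.
interface FenceToken {
  marker: string
  lines: string[]
}

// Toy tokenizer with simplified, assumed signatures (plain strings as lines).
const fenceTokenizer = {
  // Try to open a new block node on this line.
  eatOpener(line: string): FenceToken | null {
    const m = /^(~{3,})\s*$/.exec(line)
    return m === null ? null : { marker: m[1], lines: [] }
  },
  // Try to consume one more line with the currently open node.
  eatContinuationText(line: string, token: FenceToken): ContinuationStatus {
    if (line.trim() === token.marker) return 'closing' // last line of the node
    token.lines.push(line)
    return 'opening' // matched, node stays open
  },
}

// Hypothetical driver standing in for the coordination done by the parser.
function matchFence(lines: ReadonlyArray<string>): FenceToken | null {
  const token = fenceTokenizer.eatOpener(lines[0] ?? '')
  if (token === null) return null
  for (let i = 1; i < lines.length; i += 1) {
    const status = fenceTokenizer.eatContinuationText(lines[i], token)
    // 'opening' keeps the node open; any other status ends this toy loop.
    if (status !== 'opening') break
  }
  return token
}

const fenceToken = matchFence(['~~~', 'a', 'b', '~~~', 'after'])
console.log(fenceToken?.lines) // [ 'a', 'b' ]
```

A real tokenizer that can return failedAndRollback or closingAndRollback would additionally need extractPhrasingContentLines and buildBlockToken, so that already consumed lines can be handed back for re-matching.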
post-match-block phase
The life cycle methods at this stage are subdivided as follows (for the complete type definitions, see post-match-block):
- transformMatch: (Required) Convert the sibling nodes at a given level of the tree obtained in the match-block stage into a new list of block nodes. Currently this life cycle method is only implemented in @yozora/tokenizer-list.
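The idea of turning one list of sibling tokens into another can be sketched as follows. This is a hedged illustration, not the implementation in @yozora/tokenizer-list: the token shapes (`ListItemToken`, `ListToken`, `ParagraphToken`) and the simplified `transformMatch` signature are assumptions made up for the example. It merges runs of adjacent list items of the same kind into a single list token and passes everything else through.

```typescript
// Hypothetical token shapes, for illustration only.
interface ListItemToken {
  type: 'listItem'
  ordered: boolean
  text: string
}
interface ListToken {
  type: 'list'
  ordered: boolean
  children: ListItemToken[]
}
interface ParagraphToken {
  type: 'paragraph'
  text: string
}
type SiblingToken = ListItemToken | ParagraphToken
type ResultToken = ListToken | ParagraphToken

// Simplified, assumed signature: siblings of one level in, new siblings out.
function transformMatch(siblings: ReadonlyArray<SiblingToken>): ResultToken[] {
  const results: ResultToken[] = []
  for (const token of siblings) {
    if (token.type !== 'listItem') {
      results.push(token) // non-list tokens are passed through unchanged
      continue
    }
    const last = results[results.length - 1]
    if (last !== undefined && last.type === 'list' && last.ordered === token.ordered) {
      last.children.push(token) // extend the list that is currently open
    } else {
      results.push({ type: 'list', ordered: token.ordered, children: [token] })
    }
  }
  return results
}

const merged = transformMatch([
  { type: 'listItem', ordered: false, text: 'a' },
  { type: 'listItem', ordered: false, text: 'b' },
  { type: 'paragraph', text: 'p' },
  { type: 'listItem', ordered: true, text: 'c' },
])
console.log(merged.map(t => t.type)) // [ 'list', 'paragraph', 'list' ]
```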
parse-block phase
The life cycle methods at this stage are subdivided as follows (see parse-block for the complete type definition):
- parseBlock: Convert a Block Token into a YAST node
Inline Tokenizer
The parsing steps of the inline tokenizer are divided into two life cycles:
- match-inline: Match the inline contents and get an InlineToken
- parse-inline: Parse an InlineToken into a YAST node
match-inline phase
After a block node is closed, we can start matching inline nodes, so when we
match inline nodes we work on a continuous text without the concept of "line".
Inline nodes, however, have priorities. For example, link has a higher priority
than emphasis (see https://github.github.com/gfm/#example-529). To let the
tokenizers coordinate transparently, the life cycle methods of the inline
tokenizer in the match-inline phase were designed so that the priority-related
logic is handled in @yozora/core-parser. Each tokenizer only provides
delimiters of four types: opener, both, closer and full. The processor in
@yozora/core-parser then completes the coordination work.
The life cycle methods at this stage are subdivided as follows (see match-inline for the complete type definition):
- findDelimiter: (Required) Find a delimiter
- isDelimiterPair: (Optional) Check whether the given two delimiters can match
- processDelimiterPair: (Optional) Process the two matched delimiters, as in @yozora/tokenizer-emphasis
- processSingleDelimiter: (Optional) Process a single delimiter, as in @yozora/tokenizer-text
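The sketch below illustrates how a tokenizer might report delimiters and leave pairing to a coordinator. It is an assumption-laden toy, not the API of @yozora/core-tokenizer: the `Delimiter` shape and both function signatures are invented for this example, the text is a plain string rather than node points, and the open/close test is a crude whitespace check rather than the flanking rules of the GFM spec.

```typescript
// The four delimiter types named above, as a hypothetical union type.
type DelimiterType = 'opener' | 'closer' | 'both' | 'full'

// Hypothetical delimiter shape: a typed, half-open interval [startIndex, endIndex).
interface Delimiter {
  type: DelimiterType
  startIndex: number
  endIndex: number
}

// Simplified, assumed signature: find the next run of '*' at or after startIndex.
function findDelimiter(text: string, startIndex: number): Delimiter | null {
  for (let i = startIndex; i < text.length; i += 1) {
    if (text[i] !== '*') continue
    let j = i + 1
    while (j < text.length && text[j] === '*') j += 1
    const before = i > 0 ? text[i - 1] : ' '
    const after = j < text.length ? text[j] : ' '
    const canOpen = after.trim() !== '' // crude rule: followed by non-whitespace
    const canClose = before.trim() !== '' // crude rule: preceded by non-whitespace
    if (!canOpen && !canClose) {
      i = j - 1 // an isolated run is not a delimiter, keep scanning
      continue
    }
    const type: DelimiterType = canOpen && canClose ? 'both' : canOpen ? 'opener' : 'closer'
    return { type, startIndex: i, endIndex: j }
  }
  return null
}

// Simplified, assumed signature: can these two delimiters form a pair?
function isDelimiterPair(opener: Delimiter, closer: Delimiter): boolean {
  const sameLength =
    opener.endIndex - opener.startIndex === closer.endIndex - closer.startIndex
  return opener.type !== 'closer' && closer.type !== 'opener' && sameLength
}

const inlineText = 'a *b* c'
const firstDelimiter = findDelimiter(inlineText, 0)
const secondDelimiter = firstDelimiter === null ? null : findDelimiter(inlineText, firstDelimiter.endIndex)
console.log(firstDelimiter?.type, secondDelimiter?.type) // opener closer
```

In this toy, `full` is never produced; it would fit a tokenizer whose delimiter already covers a complete node, so no pairing step is needed.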
parse-inline phase
The life cycle methods at this stage are subdivided as follows (see parse-inline for the complete type definition):
- processToken: (Required) Convert an Inline Token into a YAST node.
Related
- @yozora/template-tokenizer: For creating a Yozora Tokenizer.
- Block Tokenizer Lifecycle
- Inline Tokenizer Lifecycle