@yozora/core-tokenizer
Defines the shape of a Yozora tokenizer and its life cycle methods, and provides utility functions that assist in resolving tokens.
Install
- npm: npm install --save @yozora/core-tokenizer
- Yarn: yarn add @yozora/core-tokenizer
- pnpm: pnpm add @yozora/core-tokenizer
Usage
According to the Parse Strategy, there are two types of tokenizers: Block Tokenizer and Inline Tokenizer.
Block Tokenizer
The parsing steps of the block tokenizer are divided into two life cycles:
- match-block: processes literal text and produces BlockTokens.
- parse-block: processes BlockTokens into AST nodes.
match-block phase
In the process of parsing block nodes, the content is read line by line. The block-level node has a nested structure:
> This is a blockquote
> - This is a list item in blockquote
> - # This is an ATX heading in the list item of the blockquote
> - > ...
As the second line of the example above shows, a ListItem tokenizer cannot start
matching at the first character of the original document line. It has to wait
until its ancestors along the existing nesting structure (such as the Blockquote
above) have finished matching, and only then gets a chance to match. To let the
tokenizers cooperate transparently, the life cycle methods of the match-block
stage leave the handling of nested structures to @yozora/core-parser and use a
special data structure, PhrasingContentLine, as the actual parsing unit of a
line:
export interface PhrasingContentLine {
  /**
   * Start index of the interval in nodePoints.
   */
  startIndex: number
  /**
   * End index of the interval in nodePoints.
   */
  endIndex: number
  /**
   * Array of NodePoint which contains all the contents of this line.
   */
  nodePoints: ReadonlyArray<INodePoint>
  /**
   * The index of the first non-blank character in the rest of the current line.
   */
  firstNonWhitespaceIndex: number
  /**
   * The count of preceding spaces; one tab is counted as four spaces.
   * @see https://github.github.com/gfm/#tabs
   */
  countOfPrecedeSpaces: number
}
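To make the fields concrete, here is a minimal sketch that builds such a line from a plain string. SimpleNodePoint, SimpleLine and toLine are hypothetical stand-ins invented for this illustration (they are not exported by any Yozora package), endIndex is assumed to be exclusive, and tab handling follows the simplification stated in the comment above (one tab counts as four spaces).

```typescript
// Hypothetical, simplified stand-in for INodePoint: only the code point is kept.
interface SimpleNodePoint {
  codePoint: number
}

// Same field names as PhrasingContentLine, but built on SimpleNodePoint.
interface SimpleLine {
  startIndex: number
  endIndex: number // assumption: exclusive end of the interval
  nodePoints: ReadonlyArray<SimpleNodePoint>
  firstNonWhitespaceIndex: number
  countOfPrecedeSpaces: number
}

const SPACE = 0x20
const TAB = 0x09

// Build a line covering the whole of `text` (assumed to contain no line break).
function toLine(text: string): SimpleLine {
  const nodePoints: SimpleNodePoint[] = Array.from(text, ch => ({
    codePoint: ch.codePointAt(0) as number,
  }))

  let firstNonWhitespaceIndex = 0
  let countOfPrecedeSpaces = 0
  for (; firstNonWhitespaceIndex < nodePoints.length; ++firstNonWhitespaceIndex) {
    const c = nodePoints[firstNonWhitespaceIndex].codePoint
    if (c === SPACE) countOfPrecedeSpaces += 1
    else if (c === TAB) countOfPrecedeSpaces += 4 // simplification: one tab = four spaces
    else break
  }

  return {
    startIndex: 0,
    endIndex: nodePoints.length,
    nodePoints,
    firstNonWhitespaceIndex,
    countOfPrecedeSpaces,
  }
}

// Two spaces and one tab precede the blockquote marker.
const line = toLine('  \t> quote')
console.log(line.firstNonWhitespaceIndex, line.countOfPrecedeSpaces)
```

With a line like this, a tokenizer can jump straight to firstNonWhitespaceIndex instead of rescanning the leading whitespace itself.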
The life cycle methods at this stage are as follows (see [match-block][lifecycle-match-block] for the type definition details):
- isContainingBlock: (Required) Indicates whether the block is a container block.
- eatOpener: (Required) Try to match a new block node.
- eatAndInterruptPreviousSibling: (Optional) Try to interrupt the previous sibling node and match a new block node.
- eatContinuationText: (Optional) Try to match the continuation text of the current block node, that is, consume the current PhrasingContentLine with the current block node. The outcome is distinguished by the value of status in the returned result:
  - notMatched: Not matched.
  - closing: Matched, and this is the last line of the current block node. That is, the current block node is saturated and is closing.
  - opening: Matched, and not closing yet.
  - failedAndRollback: The match failed, and the content of the previous lines must be rolled back. For convenience, the rollback is assumed not to affect the nesting structure that has already been matched.
  - closingAndRollback: The match failed, but only the last line needs to be rolled back; the current node is still valid and will be closed soon.
- eatLazyContinuationText: (Optional) Try to match lazy continuation text. In practice only @yozora/tokenizer-paragraph needs to implement this method; see step 3 of https://github.github.com/gfm/#phase-1-block-structure for details.
- onClose: (Optional) Called when the current node is closed; used to perform cleanup.
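The interplay between eatOpener and eatContinuationText can be sketched as follows. This is an illustration under simplifying assumptions, not the real API: the line type, the token type, the method signatures and the small driver loop are all hypothetical, and the real signatures are the ones defined in [match-block][lifecycle-match-block]. Only the status values are taken from the list above.

```typescript
// Hypothetical, simplified types; NOT the real @yozora/core-tokenizer signatures.
type ContinuationStatus =
  | 'notMatched'
  | 'closing'
  | 'opening'
  | 'failedAndRollback'
  | 'closingAndRollback'

interface SimpleLine {
  text: string
  startIndex: number // where the not-yet-consumed content of the line begins
}

interface QuoteToken {
  type: 'blockquote'
  lineCount: number
}

// Index of the first non-space character at or after `from`.
function skipSpaces(text: string, from: number): number {
  let i = from
  while (i < text.length && text[i] === ' ') i += 1
  return i
}

const blockquoteSketch = {
  isContainingBlock: true,

  // Try to open a new blockquote on the remaining part of the line.
  eatOpener(line: SimpleLine): { token: QuoteToken; nextIndex: number } | null {
    const i = skipSpaces(line.text, line.startIndex)
    if (line.text[i] !== '>') return null
    return { token: { type: 'blockquote', lineCount: 1 }, nextIndex: i + 1 }
  },

  // Try to continue an already opened blockquote with the next line.
  eatContinuationText(
    line: SimpleLine,
    token: QuoteToken,
  ): { status: ContinuationStatus; nextIndex?: number } {
    const i = skipSpaces(line.text, line.startIndex)
    if (line.text[i] !== '>') return { status: 'notMatched' }
    token.lineCount += 1
    return { status: 'opening', nextIndex: i + 1 }
  },
}

// A tiny driver standing in for the parser, for a single nesting level only.
const lines = ['> first', '> second', 'outside']
const tokens: QuoteToken[] = []
const statuses: string[] = []
let current: QuoteToken | null = null
for (const text of lines) {
  const line: SimpleLine = { text, startIndex: 0 }
  if (current === null) {
    const opened = blockquoteSketch.eatOpener(line)
    if (opened !== null) {
      current = opened.token
      tokens.push(opened.token)
    }
    statuses.push(opened === null ? 'noOpener' : 'opened')
  } else {
    const result = blockquoteSketch.eatContinuationText(line, current)
    statuses.push(result.status)
    if (result.status === 'notMatched') current = null
  }
}
console.log(statuses, tokens)
```

The sketch ignores lazy continuation, interruption and rollback; it only shows how the returned status tells the driver whether the current node stays open.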
parse-block phase
The life cycle method at this stage is (see [parse-block][lifecycle-parse-block] for the complete type definition):
- parse: Processes a list of tokens of a specified type into a list of AST nodes.
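As a rough illustration of what such a parse step does, the sketch below maps heading tokens to heading nodes one by one. The token and node shapes are hypothetical and much simpler than the real ones; in particular, the heading text is placed directly into a text node here, whereas in a real pipeline the inline content would presumably go through the inline phases.

```typescript
// Hypothetical, simplified token and node shapes, for illustration only.
interface HeadingToken {
  type: 'heading'
  depth: number
  text: string
}

interface HeadingNode {
  type: 'heading'
  depth: number
  children: Array<{ type: 'text'; value: string }>
}

// Sketch of a parse step: each token of the tokenizer's own type becomes one AST node.
function parse(tokens: ReadonlyArray<HeadingToken>): HeadingNode[] {
  return tokens.map(
    (token): HeadingNode => ({
      type: 'heading',
      depth: token.depth,
      children: [{ type: 'text', value: token.text.trim() }],
    }),
  )
}

const nodes = parse([
  { type: 'heading', depth: 1, text: ' Title ' },
  { type: 'heading', depth: 2, text: 'Section' },
])
console.log(JSON.stringify(nodes))
```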
Additional methods in BlockTokenizer
- extractPhrasingContentLines: (Optional) Convert a Block Token generated by the current tokenizer to PhrasingContentLines[]. This method is only needed when a matched node of this type may be rolled back.
- buildBlockToken: (Optional) Convert PhrasingContentLines[] into a Block Token. This method is only needed when a matched node of this type may be rolled back.
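The two methods are the two halves of a rollback: the first hands back the lines a failed token had consumed, the second rebuilds a token from such lines. A minimal sketch, assuming simplified line and token shapes and a made-up candidate token type (none of these shapes or signatures come from the real package):

```typescript
// Hypothetical, simplified shapes; the real tokens and lines carry more data.
interface SimpleLine {
  text: string
}

// A token of some made-up type whose match turned out to be invalid.
interface CandidateToken {
  type: 'candidate'
  lines: SimpleLine[]
}

interface ParagraphToken {
  type: 'paragraph'
  lines: SimpleLine[]
}

// Side of the tokenizer being rolled back: give the consumed lines back.
function extractPhrasingContentLines(token: CandidateToken): SimpleLine[] {
  return token.lines.slice()
}

// Side of the fallback tokenizer: rebuild a token from the returned lines.
function buildBlockToken(lines: ReadonlyArray<SimpleLine>): ParagraphToken | null {
  const kept = lines.filter(line => line.text.trim().length > 0)
  if (kept.length === 0) return null
  return { type: 'paragraph', lines: kept }
}

const failed: CandidateToken = {
  type: 'candidate',
  lines: [{ text: '[foo]:' }, { text: 'not a destination after all' }],
}
const rebuilt = buildBlockToken(extractPhrasingContentLines(failed))
console.log(rebuilt)
```

Choosing the paragraph as the fallback is an assumption of this sketch; it mirrors the usual Markdown behavior where content that fails a more specific match ends up as paragraph text.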
Inline Tokenizer
The parsing steps of the inline tokenizer are divided into two life cycles:
- match-inline: processes literal text and produces InlineTokens.
- parse-inline: processes InlineTokens into AST nodes.
match-inline phase
After a block node is closed, inline matching can begin, so the inline
tokenizers receive one continuous piece of text with no notion of "line".
Inline nodes do, however, have priorities. For example, link has a higher
priority than emphasis (see https://github.github.com/gfm/#example-529). To let
the tokenizers cooperate transparently, the life cycle methods of the
match-inline phase leave the priority-related logic to @yozora/core-parser.
Each tokenizer only provides delimiters of four types: opener, both, closer,
and full. The processor in @yozora/core-parser then completes the coordination.
The life cycle methods at this stage are as follows (see [match-inline][lifecycle-match-inline] for the complete type definition):
- findDelimiter: (Required) Find a delimiter.
- isDelimiterPair: (Optional) Check whether the two given delimiters can be paired.
- processDelimiterPair: (Optional) Process two matched delimiters, as in @yozora/tokenizer-emphasis.
- processSingleDelimiter: (Optional) Process a single delimiter, as in @yozora/tokenizer-text.
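The division of labor between a tokenizer and the coordinating processor can be sketched with a strikethrough-like tokenizer. Everything below is a simplified assumption: the delimiter and token shapes, the method signatures and the stack-based matchInline coordinator are hypothetical, and the real processor in @yozora/core-parser also handles priorities between several tokenizers, which this sketch leaves out.

```typescript
// Hypothetical, simplified shapes; NOT the real @yozora/core-tokenizer types.
type DelimiterType = 'opener' | 'both' | 'closer' | 'full'

interface Delimiter {
  type: DelimiterType
  startIndex: number
  endIndex: number
}

interface DeleteToken {
  type: 'delete'
  startIndex: number
  endIndex: number
}

const deleteSketch = {
  // Find the next `~~` at or after `from`; it may open or close, hence 'both'.
  findDelimiter(text: string, from: number): Delimiter | null {
    const i = text.indexOf('~~', from)
    if (i < 0) return null
    return { type: 'both', startIndex: i, endIndex: i + 2 }
  },

  // Two delimiters pair up only if there is content between them.
  isDelimiterPair(opener: Delimiter, closer: Delimiter): boolean {
    return closer.startIndex > opener.endIndex
  },

  // Turn a matched pair into a token covering both delimiters.
  processDelimiterPair(opener: Delimiter, closer: Delimiter): DeleteToken {
    return { type: 'delete', startIndex: opener.startIndex, endIndex: closer.endIndex }
  },
}

// Minimal coordinator standing in for the parser: pairs delimiters with a stack.
function matchInline(text: string): DeleteToken[] {
  const tokens: DeleteToken[] = []
  const stack: Delimiter[] = []
  let from = 0
  for (;;) {
    const delimiter = deleteSketch.findDelimiter(text, from)
    if (delimiter === null) break
    from = delimiter.endIndex
    const opener = stack[stack.length - 1]
    if (opener !== undefined && deleteSketch.isDelimiterPair(opener, delimiter)) {
      stack.pop()
      tokens.push(deleteSketch.processDelimiterPair(opener, delimiter))
    } else {
      stack.push(delimiter)
    }
  }
  return tokens
}

const inlineTokens = matchInline('a ~~b~~ c')
console.log(inlineTokens)
```

The tokenizer itself never decides which delimiters end up paired; it only reports delimiters and answers questions about them, which is what keeps the priority logic in one place.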
parse-inline phase
The life cycle method at this stage is (see [parse-inline][lifecycle-pase-inline] for the complete type definition):
- parse: Processes a list of tokens of a specified type into a list of AST nodes.
Related
- @yozora/template-tokenizer: For creating a Yozora Tokenizer.
- Block Tokenizer Lifecycle
  - match-block: IMatchBlockPhaseApi, IMatchBlockHook
  - parse-block: IParseBlockPhaseApi, IParseBlockHook
- Inline Tokenizer Lifecycle
  - match-inline: IMatchInlinePhaseApi, IMatchInlineHook
  - parse-inline: IParseInlinePhaseApi, IParseInlineHook