cmark

My personal build of CMark ✏️

Commit
6dcd2beafdfbc9f694916bcdfa822b896aa44177
Parent
d272645dc32f01e73c1ac0f7f1dd6f34e834e9e0
Author
John MacFarlane <jgm@berkeley.edu>
Date

Updated spec.

Diffstat

1 file changed, 157 insertions, 8 deletions

Status | File Name | N° Changes | Insertions | Deletions
Modified | test/spec.txt | 165 | 157 | 8
diff --git a/test/spec.txt b/test/spec.txt
@@ -2715,7 +2715,7 @@ So, we explain what counts as a block quote or list item by explaining
 how these can be *generated* from their contents. This should suffice
 to define the syntax, although it does not give a recipe for *parsing*
 these constructions.  (A recipe is provided below in the section entitled
-[A parsing strategy](#appendix-a-a-parsing-strategy).)
+[A parsing strategy](#appendix-a-parsing-strategy).)
 
 ## Block quotes
 
@@ -7940,7 +7940,10 @@ Multiple     spaces
 
 <!-- END TESTS -->
 
-# Appendix A: A parsing strategy {-}
+# Appendix: A parsing strategy {-}
+
+In this appendix we describe some features of the parsing strategy
+used in the CommonMark reference implementations.
 
 ## Overview {-}
 
@@ -7957,8 +7960,6 @@ are parsed into sequences of Markdown inline elements (strings,
 code spans, links, emphasis, and so on), using the map of link
 references constructed in phase 1.
 
-## The document tree {-}
-
 At each point in processing, the document is represented as a tree of
 **blocks**.  The root of the tree is a `document` block.  The `document`
 may have any number of other blocks as **children**.  These children
@@ -7982,7 +7983,7 @@ marked by arrows:
              "aliquando id"
 ```
 
-## How source lines alter the document tree {-}
+## Phase 1: block structure {-}
 
 Each line that is processed has an effect on this tree.  The line is
 analyzed and, depending on its contents, the document may be altered
@@ -7997,6 +7998,36 @@ in one or more of the following ways:
 Once a line has been incorporated into the tree in this way,
 it can be discarded, so input can be read in a stream.
 
+For each line, we follow this procedure:
+
+1. First we iterate through the open blocks, starting with the
+root document, and descending through last children down to the last
+open block.  Each block imposes a condition that the line must satisfy
+if the block is to remain open.  For example, a block quote requires a
+`>` character.  A paragraph requires a non-blank line.
+In this phase we may match all or just some of the open
+blocks.  But we cannot close unmatched blocks yet, because we may have a
+[lazy continuation line].
+
+2.  Next, after consuming the continuation markers for existing
+blocks, we look for new block starts (e.g. `>` for a block quote).
+If we encounter a new block start, we close any blocks unmatched
+in step 1 before creating the new block as a child of the last
+matched block.
+
+3.  Finally, we look at the remainder of the line (after block
+markers like `>`, list markers, and indentation have been consumed).
+This is text that can be incorporated into the last open
+block (a paragraph, code block, header, or raw HTML).
+
+Setext headers are formed when we detect that the second line of
+a paragraph is a setext header line.
+
+Reference link definitions are detected when a paragraph is closed;
+the accumulated text lines are parsed to see if they begin with
+one or more reference link definitions.  Any remainder becomes a
+normal paragraph.
+
 We can see how this works by considering how the tree above is
 generated by four lines of Markdown:
 
@@ -8094,7 +8125,7 @@ We thus obtain the final tree:
              "aliquando id"
 ```
 
-## From block structure to the final document {-}
+## Phase 2: inline structure {-}
 
 Once all of the input has been parsed, all open blocks are closed.
 
@@ -8125,5 +8156,123 @@ Notice how the [line ending] in the first paragraph has
 been parsed as a `softbreak`, and the asterisks in the first list item
 have become an `emph`.
 
-The document can be rendered as HTML, or in any other format, given
-an appropriate renderer.
+### An algorithm for parsing nested emphasis and links {-}
+
+By far the trickiest part of inline parsing is handling emphasis,
+strong emphasis, links, and images.  This is done using the following
+algorithm.
+
+When we're parsing inlines and we hit either
+
+- a run of `*` or `_` characters, or
+- a `[` or `![`
+
+we insert a text node with these symbols as its literal content, and we
+add a pointer to this text node to the [delimiter stack](@delimiter-stack).
+
+The [delimiter stack] is a doubly linked list.  Each
+element contains a pointer to a text node, plus information about
+
+- the type of delimiter (`[`, `![`, `*`, `_`)
+- the number of delimiters,
+- whether the delimiter is "active" (all are active to start), and
+- whether the delimiter is a potential opener, a potential closer,
+  or both (which depends on what sort of characters precede
+  and follow the delimiters).
+
+When we hit a `]` character, we call the *look for link or image*
+procedure (see below).
+
+When we hit the end of the input, we call the *process emphasis*
+procedure (see below), with `stack_bottom` = NULL.
+
+#### *look for link or image* {-}
+
+Starting at the top of the delimiter stack, we look backwards
+through the stack for an opening `[` or `![` delimiter.
+
+- If we don't find one, we return a literal text node `]`.
+
+- If we do find one, but it's not *active*, we remove the inactive
+  delimiter from the stack, and return a literal text node `]`.
+
+- If we find one and it's active, then we parse ahead to see if
+  we have an inline link/image, reference link/image, compact reference
+  link/image, or shortcut reference link/image.
+
+  + If we don't, then we remove the opening delimiter from the
+    delimiter stack and return a literal text node `]`.
+
+  + If we do, then
+
+    * We return a link or image node whose children are the inlines
+      after the text node pointed to by the opening delimiter.
+
+    * We run *process emphasis* on these inlines, with the `[` opener
+      as `stack_bottom`.
+
+    * We remove the opening delimiter.
+
+    * If we have a link (and not an image), we also set all
+      `[` delimiters before the opening delimiter to *inactive*.  (This
+      will prevent us from getting links within links.)
+
+#### *process emphasis* {-}
+
+Parameter `stack_bottom` sets a lower bound to how far we
+descend in the [delimiter stack].  If it is NULL, we can
+go all the way to the bottom.  Otherwise, we stop before
+visiting `stack_bottom`.
+
+Let `current_position` point to the element on the [delimiter stack]
+just above `stack_bottom` (or the first element if `stack_bottom`
+is NULL).
+
+We keep track of the `openers_bottom` for each delimiter
+type (`*`, `_`).  Initialize this to `stack_bottom`.
+
+Then we repeat the following until we run out of potential
+closers:
+
+- Move `current_position` forward in the delimiter stack (if needed)
+  until we find the first potential closer with delimiter `*` or `_`.
+  (This will be the potential closer closest
+  to the beginning of the input -- the first one in parse order.)
+
+- Now, look back in the stack (staying above `stack_bottom` and
+  the `openers_bottom` for this delimiter type) for the
+  first matching potential opener ("matching" means same delimiter).
+
+- If one is found:
+
+  + Figure out whether we have emphasis or strong emphasis:
+    if both closer and opener spans have length >= 2, we have
+    strong, otherwise regular.
+
+  + Insert an emph or strong emph node accordingly, after
+    the text node corresponding to the opener.
+
+  + Remove any delimiters between the opener and closer from
+    the delimiter stack.
+
+  + Remove 1 (for regular emph) or 2 (for strong emph) delimiters
+    from the opening and closing text nodes.  If they become empty
+    as a result, remove them and remove the corresponding element
+    of the delimiter stack.  If the closing node is removed, reset
+    `current_position` to the next element in the stack.
+
+- If none is found:
+
+  + Set `openers_bottom` to the element before `current_position`.
+    (We know that there are no openers for this kind of closer up to and
+    including this point, so this puts a lower bound on future searches.)
+
+  + If the closer at `current_position` is not a potential opener,
+    remove it from the delimiter stack (since we know it can't
+    be a closer either).
+
+  + Advance `current_position` to the next element in the stack.
+
+After we're done, we remove all delimiters above `stack_bottom` from the
+delimiter stack.
+
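
The *process emphasis* procedure added by this commit can be sketched in Python as follows. This is an illustrative sketch only, not the cmark implementation. The names `Delim`, `tokenize`, `find_opener`, `process_emphasis`, and `render` are hypothetical. A plain Python list stands in for the doubly linked delimiter stack, `stack_bottom` is always treated as NULL, links and images are ignored, and `tokenize` decides opener/closer status with a crude whitespace-only test rather than the spec's full rules.

```python
import re
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity: each run is a distinct stack entry
class Delim:
    char: str         # '*' or '_'
    count: int        # delimiters left in this run
    can_open: bool    # potential opener
    can_close: bool   # potential closer


def tokenize(text):
    """Split text into str nodes and Delim runs (ASSUMED, simplified rules)."""
    nodes = []
    for m in re.finditer(r"\*+|_+|[^*_]+", text):
        run = m.group()
        if run[0] in "*_":
            before = text[m.start() - 1] if m.start() > 0 else " "
            after = text[m.end()] if m.end() < len(text) else " "
            nodes.append(Delim(run[0], len(run),
                               can_open=not after.isspace(),
                               can_close=not before.isspace()))
        else:
            nodes.append(run)
    return nodes


def find_opener(stack, pos, bottom):
    """Look back from stack[pos] for a matching opener, stopping at bottom."""
    for i in range(pos - 1, -1, -1):
        if stack[i] is bottom:
            return None
        if stack[i].char == stack[pos].char and stack[i].can_open:
            return i
    return None


def process_emphasis(nodes):
    nodes = list(nodes)
    stack = [n for n in nodes if isinstance(n, Delim)]
    openers_bottom = {"*": None, "_": None}  # None plays the role of NULL
    pos = 0
    while pos < len(stack):
        closer = stack[pos]
        if not closer.can_close:
            pos += 1
            continue
        i = find_opener(stack, pos, openers_bottom[closer.char])
        if i is None:
            # No opener below this closer: record a lower bound for later searches.
            openers_bottom[closer.char] = stack[pos - 1] if pos > 0 else None
            if closer.can_open:
                pos += 1
            else:
                del stack[pos]  # the next element slides into pos
            continue
        opener = stack[i]
        strong = opener.count >= 2 and closer.count >= 2
        used = 2 if strong else 1
        # Wrap everything between opener and closer in an emph/strong node.
        a, b = nodes.index(opener), nodes.index(closer)
        nodes[a + 1:b] = [("strong" if strong else "emph", nodes[a + 1:b])]
        del stack[i + 1:pos]  # delimiters in between can no longer match
        pos = i + 1           # the closer now sits right above the opener
        opener.count -= used
        closer.count -= used
        if opener.count == 0:
            nodes.remove(opener)
            del stack[i]
            pos -= 1
        if closer.count == 0:
            nodes.remove(closer)
            del stack[pos]
    return nodes


def render(nodes):
    """Render to HTML; unmatched delimiters come out as literal text."""
    out = []
    for n in nodes:
        if isinstance(n, Delim):
            out.append(n.char * n.count)
        elif isinstance(n, tuple):
            tag = "em" if n[0] == "emph" else "strong"
            out.append(f"<{tag}>{render(n[1])}</{tag}>")
        else:
            out.append(n)
    return "".join(out)
```

Under these simplified rules, `render(process_emphasis(tokenize("*foo **bar** baz*")))` gives `<em>foo <strong>bar</strong> baz</em>`. Because `openers_bottom` only bounds the backward search, the sketch still finds the same matches if the recorded element has since been removed from the stack; the search then simply runs to the bottom.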