One of the more powerful concepts I’ve stumbled across recently is the idea of abstract syntax trees, or ASTs. If you’ve ever studied alchemy, you may recall that the whole motivation for alchemists was to discover some way to transform not-gold into gold through scientific or arcane methods.

ASTs are kind of like that. Using ASTs, we can transform Markdown into HTML, JSX into JavaScript, and so much more.

Why are ASTs useful?

Early in my career, I tried to change files using a find-and-replace method. This ended up being fairly complicated, so I tried using regular expressions. I ended up abandoning the idea because it was so brittle; the app broke all the time because someone would enter text in a way I hadn’t anticipated and it would break my regular expressions causing the whole app to fall down.

The reason this was so hard is that HTML is flexible. That makes it extremely hard to parse using regular expressions. String-based replacement like this is prone to breaking because it might miss a match, match too much, or do something weird that results in invalid markup that leaves the page looking janky.

ASTs, on the other hand, turn HTML into something far more structured, which makes it much simpler to dive into a text node and do replacements on only that text, or to mess with elements without needing to deal with the text at all.

This makes AST transformation safer and less error-prone than a purely string-based solution.

What are ASTs used for?

To start, let’s take a look at a minimal document using a couple lines of Markdown. This will be saved as a file called home.md, which we’ll save in the content folder of our website.

# Hello World!

![cardigan corgi](<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>) An adorable corgi!

Some more text goes here.

Assuming we know Markdown, we can infer that when this Markdown is parsed, it’ll end up being an <h1> that says, “Hello World!” and a <p> that says, “This is some Markdown.”

But how does it get transformed from Markdown to HTML?

That’s where ASTs come in!

Because it supports multiple languages, we’re going to use the unist syntax tree specification and, more specifically, the project unified.

Install the dependencies

First, we need to install the dependencies required to parse the Markdown into an AST and convert it to HTML. To do that, we need to make sure we’ve initialized the folder as a package. Run the following command in your terminal:

# make sure you’re in your root folder (where `content` is)
# initialize this folder as an npm package
npm init

# install the dependencies
npm install unified remark-parse remark-html

If we assume our Markdown is stored in home.md, we can get the AST with the following code:

const fs = require('fs');
const unified = require('unified');
const markdown = require('remark-parse');
const html = require('remark-html');

const contents = unified()
  .use(markdown)
  .use(html)
  .processSync(fs.readFileSync(`${process.cwd()}/content/home.md`))
  .toString();

console.log(contents);

This code takes advantage of Node’s built-in fs module, which allows us to access and manipulate the filesystem. For more information on how this works, check out the official docs.

If we save this as src/index.js and use Node to execute this script from the command line, we’ll see the following in our terminal:

$ node src/index.js 
<h1>Hello World!</h1>
<p><img src="<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>" alt="cardigan corgi"> An adorable corgi!</p>
<p>Some more text goes here.</p>

We tell unified to use remark-parse to turn the Markdown file into an AST, then to use remark-html to turn the Markdown AST into a HTML — or, more specifically, it turns it into something called a VFile. Using the toString() method turns that AST into an actual string of HTML we can display in the browser!

Thanks to the hard work of the open-source community, remark does all the hard work of turning Markdown into HTML for us. (See the diff)

Next, let’s look at how this actually works.

What does an AST look like?

To see the actual AST, let’s write a tiny plugin to log it:

const fs = require('fs');
const unified = require('unified');
const markdown = require('remark-parse');
const html = require('remark-html');

const contents = unified()
	.use(markdown)
  .use(() => tree => console.log(JSON.stringify(tree, null, 2)))
	.use(html)
	.processSync(fs.readFileSync(`${process.cwd()}/content/home.md`))
	.toString();

The output of running the script will now be:

{
  "type": "root",
  "children": [
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Hello World!",
          "position": {}
        }
      ],
      "position": {}
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "image",
          "title": null,
          "url": "<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>",
          "alt": "cardigan corgi",
          "position": {}
        },
        {
          "type": "text",
          "value": " An adorable corgi!",
          "position": {}
        }
      ],
      "position": {}
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Some more text goes here.",
          "position": {}
        }
      ],
      "position": {}
    }
  ],
  "position": {}
}

Note that the position values have been truncated to save space. They contain information about where the node is in the document. For the purposes of this tutorial, we won’t be using this information. (See the diff)

This is a little overwhelming to look at, but if we zoom in we can see that each part of the Markdown becomes a type of node with a text node inside it.

For example, the heading becomes:

{
  "type": "heading",
  "depth": 1,
  "children": [
    {
      "type": "text",
      "value": "Hello World!",
      "position": {}
    }
  ],
  "position": {}
}

Here’s what this means:

  • The type tells us what kind of node we’re dealing with.
  • Each node type has additional properties that describe the node. The depth property on the heading tells us what level heading it is — a depth of 1 means it’s an <h1> tag, 2 means <h2>, and so on.
  • The children array tells us what’s inside this node. In both the heading and the paragraph, there’s only text, but we could also see inline elements here, like <strong>.

This is the power of ASTs: We’ve now described the Markdown document as an object that a computer can understand. If we want to print this back to Markdown, a Markdown compiler would know that a “heading” node with a depth of 1 starts with #, and a child text node with the value “Hello” means the final line should be # Hello.

How AST transformations work

Transforming an AST is usually done using the visitor pattern. It‘s not important to know the ins and outs of how this works to be productive, but if you’re curious, JavaScript Design Patterns for Humans by Soham Kamani has a great example to help explain how it works. The important thing to know is that the majority of resources on AST work will talk about “visiting nodes,” which roughly translates to “find part of the AST so we can do stuff with it.” The way this works practice is that we write a function that will be applied to AST nodes matching our criteria.

A few important notes about how it works:

  • ASTs can be huge, so for performance reasons we will mutate nodes directly. This runs counter to how I would usually approach things — as a general rule I don’t like to mutate global state — but it makes sense in this context.
  • Visitors work recursively. That means that if we process a node and create a new node of the same type, the visitor will run on the newly created node as well unless we explicitly tell the visitor not to.
  • We’re not going to go too deep in this tutorial, but these two ideas will help us understand what’s going on as we start to mess with the code.

How do I modify the HTML output of the AST?

What if we want to change the output of our Markdown, though? Let’s say our goal is to wrap image tags with a figure element and supply a caption, like this:

<figure>
  <img
    src="<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>"
    alt="cardigan corgi"
  />
  <figcaption>An adorable corgi!</figcaption>
</figure>

To accomplish this, we’ll need transform the HTML AST — not the Markdown AST — because Markdown doesn’t have a way of creating figure or figcaption elements. Fortunately, because unified is interoperable with multiple parsers, we can do that without writing a bunch of custom code.

Convert a Markdown AST to an HTML AST

To convert the Markdown AST to an HTML AST, add remark-rehype and switch to rehype-stringify for turning the AST back to HTML.

npm install remark-rehype rehype-stringify

Make the following changes in src/index.js to switch over to rehype:

const fs = require('fs');
const unified = require('unified');
const markdown = require('remark-parse');
const remark2rehype = require('remark-rehype');
const html = require('rehype-stringify');

const contents = unified()
	.use(markdown)
  .use(remark2rehype)
	.use(() => tree => console.log(JSON.stringify(tree, null, 2)))
	.use(html)
	.processSync(fs.readFileSync('corgi.md'))
	.toString();

console.log(contents);

Note that the HTML variable changed from remark-html to rehype-stringify — both turn the AST into a format that can be stringified to HTML

If we run the script, we can see the image element now looks like this in the AST:

{
  "type": "element",
  "tagName": "img",
  "properties": {
    "src": "https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg",
    "alt": "cardigan corgi"
  },
  "children": [],
  "position": {}
}

This is the AST for the HTML representation of the image, so we can start changing it over to use the figure element. (See the diff)

Write a plugin for unified

To wrap our img element with a figure element, we need to write a plugin. In unified, plugins are added with the use() method, which accepts the plugin as a first argument and any options as a second argument:

.use(plugin, options)

The plugin code is a function (called an “attacher” in unified jargon) that receives option. These options are used to create a new function (called a “transformer”) that receives the AST and does work to, er, transform it. For more details on plugins, check out the plugin overview in the unified docs.

The function it returns will receive the entire AST as its argument, and it doesn’t return anything. (Remember, ASTs are mutated globally.) Create a new file called img-to-figure.js in the same folder as index.js, then put the following inside:

module.exports = options => tree => {
  console.log(tree);
};

To use this, we need to add it to src/index.js:

const fs = require('fs');
const unified = require('unified');
const markdown = require('remark-parse');
const remark2rehype = require('remark-rehype');
const html = require('rehype-stringify');
const imgToFigure = require('./img-to-figure');

const contents = unified()
  .use(markdown)
  .use(remark2rehype)
  .use(imgToFigure)
  .processSync(fs.readFileSync('corgi.md'))
  .toString();

console.log(contents);

If we run the script, we’ll see the whole tree logged out in the console:

{
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [Array],
      position: [Object]
    },
    { type: 'text', value: '\n' },
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [Array],
      position: [Object]
    }
  ],
  position: {
    start: { line: 1, column: 1, offset: 0 },
    end: { line: 4, column: 1, offset: 129 }
  }
}

(See the diff)

Add a visitor to the plugin

Next, we need to add a visitor. This will let us actually get at the code. Unified takes advantage of a number of utility packages, all prefixed with unist-util-*, that allow us to do common things with our AST without writing custom code.

We can use unist-util-visit to modify nodes. This gives us a visit helper that takes three arguments:

  • The entire AST we’re working with
  • A predicate function to identify which nodes we want to visit
  • A function to make any changes to the AST we want to make

To install, run the following in your command line:

npm install unist-util-visit

Let’s implement a visitor in our plugin by adding the following code:

const visit = require('unist-util-visit');

  module.exports = options => tree => {
    visit(
      tree,
      // only visit p tags that contain an img element
      node =>
        node.tagName === 'p' && node.children.some(n => n.tagName === 'img'),
      node => {
        console.log(node);
      }
    );
};

When we run this, we can see there’s only one paragraph node logged:

{
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [
    {
      type: 'element',
      tagName: 'img',
      properties: [Object],
      children: [],
      position: [Object]
    },
    { type: 'text', value: ' An adorable corgi!', position: [Object] }
  ],
  position: {
    start: { line: 3, column: 1, offset: 16 },
    end: { line: 3, column: 102, offset: 117 }
  }
}

Perfect! We’re getting only the paragraph node that has the image we want to modify. Now we can start to transform the AST!

(See the diff)

Wrap the image in a figure element

Now that we have the image attributes, we can start to change the AST. Remember, because ASTs can be really large, we mutate them in place to avoid creating lots of copies and potentially slowing our script down.

We start by changing the node’s tagName to be a figure instead of a paragraph. The rest of the details can stay the same for now.

Make the following changes in src/img-to-figure.js:

const visit = require('unist-util-visit');

module.exports = options => tree => {
  visit(
    tree,
    // only visit p tags that contain an img element
    node =>
    node.tagName === 'p' && node.children.some(n => n.tagName === 'img'),
    node => {
      node.tagName = 'figure';
    }
  );
};

If we run our script again and look at the output, we can see that we’re getting closer!

<h1>Hello World!</h1>
<figure><img src="<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>" alt="cardigan corgi">An adorable corgi!</figure>
<p>Some more text goes here.</p>

(See the diff)

Use the text next to the image as a caption

To avoid needing to write custom syntax, we’re going to use any text passed inline with an image as the image caption.

We can make an assumption that usually images don’t have inline text in Markdown, but it’s worth noting that this could 100% cause unintended captions to appear for people writing Markdown. We’re going to take that risk in this tutorial. If you’re planning to put this into production, make sure to weigh the trade-offs and choose what’s best for your situation.

To use the text, we’re going to look for a text node inside our parent node. If we find one, we want to grab its value as our caption. If no caption is found, we don’t want to transform this node at all, so we can return early.

Make the following changes to src/img-to-figure.js to grab the caption:

const visit = require('unist-util-visit');

module.exports = options => tree => {
  visit(
    tree,
    // only visit p tags that contain an img element
    node =>
    node.tagName === 'p' && node.children.some(n => n.tagName === 'img'),
    node => {
      // find the text node
      const textNode = node.children.find(n => n.type === 'text');
 
      // if there’s no caption, we don’t need to transform the node
      if (!textNode) return;
 
      const caption = textNode.value.trim();
 
      console.log({ caption });
      node.tagName = 'figure';
    }
  );
};

Run the script and we can see the caption logged:

{ caption: 'An adorable corgi!' }

(See the diff)

Add a figcaption element to the figure

Now that we have our caption text, we can add a figcaption to display it. We could do this by creating a new node and deleting the old text node, but since we’re mutating in place it’s a little less complicated to just change the text node into an element.

Elements don’t have text, though, so we need to add a new text node as a child of the figcaption element to display the caption text.

Make the following changes to src/img-to-figure.js to add the caption to the markup:

const visit = require('unist-util-visit');

module.exports = options => tree => {
  visit(
    tree,
    // only visit p tags that contain an img element
    node =>
      node.tagName === 'p' && node.children.some(n => n.tagName === 'img'),
    node => {
      // find the text node
      const textNode = node.children.find(n => n.type === 'text');

      // if there’s no caption, we don’t need to transform the node
      if (!textNode) return;

      const caption = textNode.value.trim();
      // change the text node to a figcaption element containing a text node
      textNode.type = 'element';
      textNode.tagName = 'figcaption';
      textNode.children = [
        {
          type: 'text',
          value: caption
        }
      ];

      node.tagName = 'figure';
    }
  );
};

If we run the script again with node src/index.js, we see the transformed image wrapped in a figure element and described with a figcaption!

<h1>Hello World!</h1>
<figure><img src="<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>" alt="cardigan corgi"><figcaption>An adorable corgi!</figcaption></figure>

<p>Some more text goes here.</p>

(See the diff)

Save the transformed content to a new file

Now that we’ve made a bunch of transformations, we want to save those adjustments to an actual file so we can share them.

Since the Markdown doesn’t include a full HTML document, we’re going to add one more rehype plugin called rehype-document to add the full document structure and a title tag.

Install by running:

npm install rehype-document

Next, make the following changes to src/index.js:

const fs = require('fs');
const unified = require('unified');
const markdown = require('remark-parse');
const remark2rehype = require('remark-rehype');
const doc = require('rehype-document');
const html = require('rehype-stringify');

const imgToFigure = require('./img-to-figure');

const contents = unified()
	.use(markdown)
	.use(remark2rehype)
	.use(imgToFigure)
    .use(doc, { title: 'A Transformed Document!' })
	.use(html)
	.processSync(fs.readFileSync(`${process.cwd()}/content/home.md`))
	.toString();

 const outputDir = `${process.cwd()}/public`;

  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir);
  }
 
  fs.writeFileSync(`${outputDir}/home.html`, contents);

Run the script again and we’ll be able to see a new folder in root called public, and inside that we’ll see home.html. Inside, our transformed document is saved!

<!doctype html><html lang="en">
<head>
<meta charset="utf-8">
<title>A Transformed Document!</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
	<h1>Hello World!</h1>
	<figure><img src="<https://images.dog.ceo/breeds/corgi-cardigan/n02113186_1030.jpg>" alt="cardigan corgi"><figcaption>An adorable corgi!</figcaption></figure>
	<p>Some more text goes here.</p>
</body>
</html>

(See the diff)

If we open public/home.html in a browser, we can see our transformed Markdown rendered as a figure with a caption.

Holy buckets! Look at that adorable corgi! And we know it’s adorable because the caption tells us so.

What to do next

Transforming files using ASTs is extremely powerful — with it, we’re able to create pretty much anything we can imagine in a safe way. No regexes or string parsing required!

From here, you can dig deeper into the ecosystem of plugins for remark and rehype to see more of what’s possible and get more ideas for what you can do with AST transformation, from building your own Markdown-powered static site generator; to automating performance improvements by modifying code in-place; to whatever you can imagine!

AST transformation is a coding superpower. Get started by checking out this demo’s source code — I can’t wait to see what you build with it! Share your projects with me on Twitter.





Source link

Write A Comment