Making Your Own JavaScript Linter (part 1)

A comprehensive tutorial

Joana Borges Late
CodeX

--

A linter running

This is the first part of a comprehensive tutorial on constructing a JavaScript linter. A linter is a tool for checking source code. We will see the fundamental concepts and the source code of a real linter written in JavaScript, including algorithms for the tricky parts.

Note: dirtyrat is the linter used to produce this tutorial.

How a linter works internally

How can we make a computer clever enough to understand a thousand-line file written under the complex set of rules of JavaScript? How do we deal with such complexity? The answer is always the same.

We break the big, complex system into small, simple components that have simple relations with the other components.

In the case of a linter, these small, simple components are called tokens. So the first step of linting is to produce a collection of tokens from a text. That is exactly how we, human beings, read a text. The difference is that, instead of tokens, we speak of words, numbers and punctuation symbols.

Usually, the module that produces tokens from a text is called a tokenizer. It may produce raw tokens or filter them and serve only the meaningful ones. It may tokenize a whole file at once or produce tokens one by one, on demand.

Let’s see an example. In the code below, how many raw tokens does the first line have?
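Suppose, just for illustration, that the first line is the following (a hypothetical snippet, not dirtyrat’s original example):

    const greeting = "hello";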

Is five your guess? Well, there’s more than meets the eye.

Raw tokens
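For the hypothetical line above, the raw tokens would look roughly like this (the kind names are an assumption, chosen to resemble dirtyrat’s, not copied from it):

    keyword        const
    whitespace     ' '
    name           greeting
    whitespace     ' '
    =              =
    whitespace     ' '
    string         "hello"
    ;              ;
    end-of-line

Nine raw tokens, of which only five are meaningful.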

Considering that some kinds are named “=” and “;” instead of assignment and semicolon, you may realize that the name dirtyrat is somewhat appropriate.

Besides producing all the raw tokens, the tokenizer (or another module) eliminates the meaningless ones: blank lines, comments and whitespace.

The point of eliminating useless tokens early is that it greatly simplifies the checks that come later. Look at the partial examples below. They all have valid syntax for the start of an if statement. They all come down to the same two meaningful tokens (the keyword if and the opening parenthesis).
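For instance, fragments like these (illustrative, showing only the opening of each statement):

    if (
    if        (
    if /* a comment here */ (
    if
        (

All of them reduce to the keyword if followed by the token “(”.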

When the linter is analyzing filtered (meaningful only) tokens and it sees a token with the keyword if, all it has to say is: “Now give me a token whose value is ‘(’ or else I will shout ‘ERROR’!”.
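In code, that check can be as simple as the sketch below (eatValue is an assumed helper name, in the spirit of the scanner functions presented later; dirtyrat’s actual code may differ):

    function parseIfStatement() {
        eatValue("if");   // consumes the keyword "if"
        eatValue("(");    // demands "(" as the next meaningful token, or reports an error
        // ... the condition and the block are parsed here
    }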

When the tokenizer does not filter the tokens… well, I wouldn’t like to imagine how bloated, ugly and confusing this linter would be internally!

Which kinds of token are useless depends on the rules of the programming language of the target text. C does not care about EOLs (ends of lines); it relies on the semicolon. Python counts the whitespace at the start of each line to delimit code blocks.

If the linter is checking style, then whitespace tokens must be preserved. If the tokenizer is part of a documentation tool, comments must be preserved. And a tokenizer is not just for linters and similar tools: compilers and interpreters also have one.

So, basically, linting is all about creating a list of meaningful tokens from the source code and then analyzing the relation of each token with its next neighbor.

The token object

For a linter, finding errors, even flawlessly, is not enough. It should point out exactly where each error occurred and what the error was. The “where” part is easy, as long as we record the position of each token (file name, row and column). The “what” part is hard because there are too many possibilities to cover; that is why the simple and general error message unexpected token is so widely used.

So the job of the tokenizer must be more than turning a big string (the source code) into a list of small strings (the tokens). The tokenizer must create a list of token objects.
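A token object can be a plain record; the exact fields below are an assumption, but they carry what the linter needs (kind, value and position):

    // a sketch of the token object (field names are illustrative)
    function createToken(kind, value, row, col) {
        return {
            kind:  kind,   // "name", "number", "string", "keyword", "=", ";", ...
            value: value,  // the exact text of the token, e.g. "greeting"
            row:   row,    // line where the token starts
            col:   col     // column where the token starts
        };                 // the file name could also be stored here, or once per token list
    }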

The field kind is especially useful. Very often, the linter does not check the field value: if it is waiting for a number, it does not matter whether that number is 0, 33 or 44. It is enough that the token is of the kind “number”.

In this tutorial, from now on, token (alone) means a token object.

How the tokenizer works

The tokenizer is basically a big loop.
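In rough terms, it looks like the sketch below (the sentinel values END_OF_FILE and END_OF_LINE and the helper functions are assumptions, some of them sketched in the next paragraphs; this is not dirtyrat’s exact source):

    function tokenizeSource() {
        while (true) {
            const c = eatCharacter();
            if (c == END_OF_FILE)          { break; }
            if (c == END_OF_LINE)          { createEndOfLineToken(); continue; }
            if (c == " " || c == "\t")     { createWhiteSpaceToken(c); continue; }
            if (isDigit(c))                { createNumberToken(c); continue; }
            if (isLetter(c) || c == "_")   { createNameToken(c); continue; }
            if (c == "\"" || c == "'")     { createStringToken(c); continue; }
            createSymbolToken(c);          // "=", ";", "(", ")", "+", "{", ...
        }
    }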

Notice that the function eatCharacter is the heart of the tokenizer, just as the tokenizer is the heart of the linter. Beyond its basic duty of serving the next character, this small and simple function controls the end of the line, the end of the file, valid characters and the position of each token.
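A sketch of eatCharacter, following the “shrinking string” approach discussed further below (isValidCharacter and fatalError are assumed helpers):

    const END_OF_FILE = "";     // sentinel values, just for this sketch
    const END_OF_LINE = "\n";

    let source = "";            // the whole source code, shrinking as it is eaten
    let row = 1;
    let col = 0;

    function eatCharacter() {
        if (source == "") { return END_OF_FILE; }                    // end of the file
        const c = source[0];
        source = source.slice(1);                                    // "eats" (shrinks) the source
        if (c == "\n") { row += 1; col = 0; return END_OF_LINE; }    // end of the line
        col += 1;                                                    // position of the token
        if (! isValidCharacter(c)) { fatalError("invalid character", row, col); }
        return c;
    }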

Notice how it is impossible to have two consecutive whitespace tokens: once createWhiteSpaceToken starts, it only stops when the next character in the source code is not whitespace. This is a very simple case. Other functions, like createStringToken, createNumberToken, createNameToken, etc., are more complex, but the basic functionality is the same.
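A sketch of that behavior, reusing the variables and helpers assumed above (c is the whitespace character the main loop has just eaten):

    function createWhiteSpaceToken(c) {
        const startRow = row;
        const startCol = col;
        let value = c;
        while (source[0] == " " || source[0] == "\t") { value += eatCharacter(); }
        tokens.push(createToken("whitespace", value, startRow, startCol));   // tokens is the list being built
    }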

There is no function createKeywordToken. It is the function createNameToken that checks whether token.value belongs to the keywords list.
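Something like this, at the end of createNameToken (a sketch; the keyword list below is abbreviated):

    // inside createNameToken, after the whole word has been consumed into value
    const KEYWORDS = [ "break", "const", "else", "for", "function", "if", "let", "return", "while" ];

    const kind = KEYWORDS.includes(value) ? "keyword" : "name";
    tokens.push(createToken(kind, value, startRow, startCol));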

The partial code above does not handle the case where a symbol has more than one character, like the equality symbol (“==”). The source code of dirtyrat is available on GitHub if you want to know more details.
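A common way to handle it is to peek at the next character before closing the token (a sketch; dirtyrat may do it differently):

    function createSymbolToken(c) {
        const startCol = col;
        let value = c;
        if (c == "=" && source[0] == "=") {                        // "==", and possibly "==="
            value += eatCharacter();
            if (source[0] == "=") { value += eatCharacter(); }
        }
        // "!=", "<=", ">=", "&&", "||", "+=", ... would be handled the same way
        tokens.push(createToken(value, value, row, startCol));     // for symbols, kind and value coincide
    }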

As for string interpolation, there are three ways to deal with it.

  • The good way: the tokenizer ignores interpolation and delivers a standard string token. The interpolation will be processed later by a special module (see the sketch after this list).
  • The bad way: the complexity and size of the function createStringToken are greatly increased in order to produce all the tokens of the interpolation.
  • The terrible way: the tokenizer ignores interpolation and delivers a standard string token. But instead of creating a special module to process the interpolation, we consider the interpolation a source code file itself and (trying to obey the DRY principle) we refactor the tokenizer to be able to handle this “source code file”. What a mess!
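To illustrate the good way (the values below are made up): the tokenizer delivers a template literal as one ordinary string token, and a dedicated module digs into it afterwards.

    // what the tokenizer delivers:
    //     { kind: "string", value: "`total: ${a + b} items`" }
    //
    // what the dedicated module extracts later and checks like any other expression:
    //     a + b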

You may think that the program would be more performant if, instead of eating (shrinking) a big string (the source code) hundreds of times, there were a variable memorizing the position of the current character in the source code, like a cursor. This works, but it complicates the code.

In fact, dirtyrat uses a third way that joins simplicity and performance: first it splits the code into lines, then it tokenizes line by line.
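Something along these lines (tokenizeLine is an assumed name):

    // a sketch of the line-by-line approach
    const lines = sourceCode.split("\n");

    for (let row = 1; row <= lines.length; row += 1) {
        tokenizeLine(lines[row - 1], row);   // each line is a small string, so shrinking it is cheap
    }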

The scanner

The tokenizer creates a list of meaningful tokens. Now we have to analyze the relations among those tokens. This is done in the parser module (and its helper modules). As parsing expressions and function bodies is a bit complex, we create the scanner module, which contains a few simple but very helpful functions whose purpose is to serve tokens to the other modules. Let’s see some of these functions.

All the function names in the scanner start with “see” or “eat”. The “see” functions fit when we need to know about the next token but are not going to consume it now.
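Two sketches of such functions; the see/eat naming comes from dirtyrat, but the bodies below are assumptions:

    let tokens = [];   // the list of meaningful tokens produced by the tokenizer
    let index = 0;     // position of the next token to be served

    function seeValue(value) {                 // looks at the next token without consuming it
        return tokens[index].value == value;
    }

    function eatValue(value) {                 // consumes the next token, demanding a specific value
        const token = tokens[index];
        if (token.value != value) { unexpected(token); }
        index += 1;
        return token;
    }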

The function seeEndOfBlockOrTrueEndOfLine doesn’t even return a token! It just checks if the next token conforms to some conditions.

So, what kind of end of line is not a true end of line? The semicolon (“;”).
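A sketch of how such a check could look, assuming end-of-line tokens are kept among the meaningful tokens and the kind names used earlier:

    function seeEndOfBlockOrTrueEndOfLine() {
        const token = tokens[index];
        if (token.kind == "}") { return true; }              // end of block
        if (token.kind == "end-of-line") { return true; }    // a true end of line
        return false;                                        // anything else, including ";"
    }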

The scanner is the only module that accesses the tokens list after it is ready.

Dirtyrat always stops execution when it finds an error; otherwise it could point to dozens of errors that don’t really exist, just because the first error breaks the structure of the following lines of code. For example, eatString either returns a string token or the linter exits after showing the proper error message.
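A sketch of that “return or exit” behavior (assumed body):

    function eatString() {
        const token = tokens[index];
        if (token.kind != "string") { unexpected(token); }   // unexpected never returns: it shows the error and exits
        index += 1;
        return token;
    }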

The module Error

Since we have been talking about errors, let’s look at the module error.

As you can see, the idea is that each module is simple and makes things very easy for the other modules. For example, when the module expression finds a bad token, all it has to do is call unexpected(token), and a precise error message will be displayed.
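A minimal sketch of such a function (the message format and the filePath variable are illustrative):

    function unexpected(token) {
        console.log("ERROR: unexpected token (" + token.kind + "): " + token.value);
        console.log("    " + filePath + " | row " + token.row + " | col " + token.col);
        process.exit(1);
    }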

Note: in dirtyrat, the module error is part of the module helper.

To be continued

You can read the second part of the tutorial here.
