
Writing a Compiler - Part 1 - Defining The Language

Programming languages are tools that we developers use on a daily basis, and we depend on them fully: on their features, performance, learning curve, etc. Yet most of us have probably never thought about what hides behind the programming languages we use, or how much effort has gone into their development.

Furthermore, compilers are essential components of most development toolchains. Although they are so common and most of us use them daily, we still know little about how they work or what the development of a simple compiler might look like.

In the Writing a Compiler series of blog posts, I will try to demystify the process of defining our own, rather simple, programming language as well as creating a compiler for it. We will cover the following topics:

  • Defining the programming language
  • Writing a lexer
  • Building up an Abstract Syntax Tree (AST)
  • Introduction to LLVM and its Intermediate Representation (IR) code
  • Generating IR code for the given AST
  • LLVM optimizations on the generated IR code
  • Creating a compiler using the LLVM C++ API
  • Creating a custom LLVM backend

The above list is subject to change, as I will be adding and/or removing topics as the series progresses. Therefore, the series might end up a bit longer or shorter than expected.

DISCLAIMER: As this is my first compiler, I might not be doing everything the right way, but I am open to suggestions, modifications, or even guest blog posts from expert compiler developers out there.

Our Language Goals

Before we jump into things, let’s first define a wish list for our language:

  1. It should be beginner-friendly.

    • When I say beginners, I mean people who don’t have a lot of technical knowledge or at least software development knowledge, but might want to understand what a programming language is and how to use it.
  2. Its syntax should be as light and easy to use as possible.

    • Considering Python’s reputation as the best programming language for beginners, let’s take some of its syntax and modify it to our needs a bit.
  3. It should support very basic types - a number and a string.

    • Whether a number is u64 / i64, u32 / i32 etc. is not that important to beginners. So, let’s hide implementation details about the size of the number from our users.
  4. It should be “partially” statically typed.

    • When declaring variables, the user has to specify the type, initialize the variable, or do both:

         let var_1 = 5                  # a number
         let var_2 = "Hello, world!"    # a string
      

      or

         let var_1: num    # a default number: 0
         let var_2: str    # a default string: ""
      

      or both

         let var_1: num = 10                 # a number
         let var_2: str = "Hello, world!"    # a string
      
    • On the other hand, when declaring a function signature, specifying parameter types and return type is mandatory, as this improves code readability:

        def func(a: num, b: num) -> num:
            return a + b
      
  5. The “main” function that developers usually define themselves in other programming languages should not exist. Instead, the root indentation level is considered the starting point for program execution.

     def add(a: num, b: num) -> num:
         return a + b
    
     # Following instructions are executed on program start
     let var = add(4, 5)
     print(var)
    

Please note that I won’t be sharing any language grammar specification here as it’s mostly the same as Python’s, besides a few differences that you may have already spotted (like let for variable declaration).

Now that we have specified some basic language goals, let’s get to some code.

Getting Ready For The Lexer

Writing the lexer won’t be part of this blog post, but we still need to prepare the ground for what is coming in the next one.

So, what is a lexer anyway? It’s a component that splits a stream of characters into tokens, i.e. categorized groups of words such as numbers, strings, keywords (specific to each programming language), operators, etc.

For example, the following line of code:

let var: num = 0

contains several tokens. Let’s inspect what these tokens are:

  • let, :, num and = belong to the group of so-called reserved tokens / keywords, i.e. tokens that are specific to our language and that users are forbidden to use as identifiers
  • var belongs to the group of identifiers (used for variables, functions, classes, etc.)
  • 0 is a plain number value

Now that we have covered the general concept of a lexer and tokens, let’s define our reserved tokens in actual code.

Defining Reserved Tokens

As we already said, most of the reserved tokens of our programming language are taken from Python, so we will mostly focus on how to represent those reserved tokens in our compiler’s code. A natural way to define them is with an enum:

enum class ReservedToken : std::uint8_t
{
    Unknown,

    OpParenthesisOpen,
    OpParenthesisClose,
    OpSubscriptOpen,
    OpSubscriptClose,

    OpAdd,
    OpDiv,
    OpExp,
    // ... the rest of the operators

    // Keywords
    KwAnd,
    KwAs,
    KwAssert,
    KwBreak,
    KwClass,
    KwContinue,
    KwDef,
    KwDel,
    KwElif,
    KwElse,
    // ... the rest of the keywords

    // Number of possible reserved tokens
    Count
};
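As a side note, the trailing Count entry is a common C++ idiom (not something specific to this post): since the enumerators are assigned consecutive values starting from zero, Count automatically equals the number of real tokens and can be used to size compile-time lookup tables. A minimal sketch with a trimmed-down version of the enum:

```cpp
#include <array>
#include <cstdint>

enum class ReservedToken : std::uint8_t
{
    Unknown,
    OpAdd,
    KwDef,

    // Number of possible reserved tokens
    Count
};

// The sentinel gives us the table size for free.
constexpr std::size_t tokenCount{ static_cast< std::size_t >( ReservedToken::Count ) };

// E.g. one counter per reserved token, indexable by the enum value.
std::array< int, tokenCount > usageStats{};
```

If a new token is later inserted anywhere before Count, the table resizes automatically.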

Besides just defining reserved tokens, we should also map each reserved token to its corresponding string representation.

Since all the reserved tokens are known to us at compile time, let’s make use of constexpr. First, we’ll split our tokens into two groups: keywords and operators. Both groups should be sorted, as this will help us match a stream of characters from the source file to a particular reserved token.

struct StringifiedToken
{
    std::string_view tokenStr;
    ReservedToken    token{ ReservedToken::Unknown };

    [[ nodiscard ]]
    constexpr bool operator<( StringifiedToken const & rhs ) const
    {
        return tokenStr < rhs.tokenStr;
    }
};

template< typename T, std::size_t N >
[[ nodiscard ]]
constexpr std::array< T, N > sort( std::array< T, N > arr ) noexcept
{
    std::sort( std::begin( arr ), std::end( arr ) );
    return arr;
}

constexpr std::array keywords
{
    sort
    (
        std::array
        {
            StringifiedToken{ "False" , ReservedToken::KwFalse },
            StringifiedToken{ "None"  , ReservedToken::KwNone  },
            StringifiedToken{ "True"  , ReservedToken::KwTrue  },

            // ... the rest of the mappings

            StringifiedToken{ "with"  , ReservedToken::KwWith  },
            StringifiedToken{ "yield" , ReservedToken::KwYield }
        }
    )
};

constexpr auto operators
{
    sort
    (
        std::array
        {
            StringifiedToken{ "!"  , ReservedToken::OpLogicalNot },
            StringifiedToken{ "!=" , ReservedToken::OpNotEqual   },
            StringifiedToken{ "%"  , ReservedToken::OpMod        },
            StringifiedToken{ "%=" , ReservedToken::OpAssignMod  },

            // ... the rest of the mappings

            StringifiedToken{ "&"  , ReservedToken::OpBitwiseAnd      },
            StringifiedToken{ "|"  , ReservedToken::OpBitwiseOr       },
            StringifiedToken{ "|=" , ReservedToken::OpAssignBitwiseOr },
            StringifiedToken{ "||" , ReservedToken::OpLogicalOr       },
            StringifiedToken{ "~"  , ReservedToken::OpBitwiseNot      }
        }
    )
};

If you wish to experiment a bit, play with the code on godbolt. There you can also check all the keywords and operators that we will be using.
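To illustrate why sorting the arrays matters, here is a lookup sketch of my own (the findKeyword helper is hypothetical, not part of the post’s code): a sorted array lets us use binary search via std::lower_bound instead of a linear scan.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string_view>

enum class ReservedToken : std::uint8_t { Unknown, KwDef, KwElse, KwReturn, Count };

struct StringifiedToken
{
    std::string_view tokenStr;
    ReservedToken    token{ ReservedToken::Unknown };

    [[ nodiscard ]]
    constexpr bool operator<( StringifiedToken const & rhs ) const
    {
        return tokenStr < rhs.tokenStr;
    }
};

// A small, already sorted subset of the keywords array.
constexpr std::array keywords
{
    StringifiedToken{ "def"   , ReservedToken::KwDef    },
    StringifiedToken{ "else"  , ReservedToken::KwElse   },
    StringifiedToken{ "return", ReservedToken::KwReturn }
};

// Binary search: O(log N) comparisons instead of scanning every keyword.
[[ nodiscard ]]
ReservedToken findKeyword( std::string_view const word ) noexcept
{
    auto const it
    {
        std::lower_bound
        (
            std::begin( keywords ), std::end( keywords ),
            StringifiedToken{ word }
        )
    };
    if ( it != std::end( keywords ) && it->tokenStr == word )
    {
        return it->token;
    }
    return ReservedToken::Unknown;
}
```

For example, findKeyword( "def" ) yields ReservedToken::KwDef, while an unknown word falls through to ReservedToken::Unknown.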

Besides reserved tokens, source code consists of other tokens as well. Let’s see how to handle all these other token types in our code.

General Token Definition

Now that the reserved tokens are defined, let’s list all the other token categories that we will support:

  • Identifier - as already mentioned, identifiers are used for identifying variables, functions, classes, etc.
  • Indentation - since our language is derived from Python, where indentation indicates a block of code, indentation tokens take a special place among our tokens
  • Newline, Eof, Number, String - represent exactly what their names say

The next step is to decide how to represent all the tokens in a single, generic way. Probably the best way to do this is to use std::variant, like:

struct Identifier
{
    std::string name;
};

struct Indentation {};
struct Newline {};
struct Eof {};

using Number = std::uint32_t;
using String = std::string;

class Token
{
public:
    using ValueType = std::variant< ReservedToken, Identifier, Number, String, Indentation, Newline, Eof >;

    Token() noexcept = default;

    Token( ValueType value ) : value_{ std::move( value ) } {}

    template< typename T >
    [[ nodiscard ]] bool is() const noexcept
    {
        return std::holds_alternative< T >( value_ );
    }

    template< typename T >
    [[ nodiscard ]] T const & get() const noexcept
    {
        assert( is< T >() ); // use only in tests
        return std::get< T >( value_ );
    }

    [[ nodiscard ]] ValueType const & value() const noexcept
    {
        return value_;
    }

private:
    ValueType value_;
};

Here is the full code, if you would like to take a look.
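To see how these tokens behave in practice, here is a short usage sketch of my own (using a simplified Token holding only two of the alternatives, so it stays self-contained):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <variant>

// Simplified versions of the types defined above.
struct Identifier { std::string name; };
using Number = std::uint32_t;

class Token
{
public:
    using ValueType = std::variant< Identifier, Number >;

    Token() noexcept = default;
    Token( ValueType value ) : value_{ std::move( value ) } {}

    template< typename T >
    [[ nodiscard ]] bool is() const noexcept
    {
        return std::holds_alternative< T >( value_ );
    }

    template< typename T >
    [[ nodiscard ]] T const & get() const noexcept
    {
        return std::get< T >( value_ );
    }

private:
    ValueType value_;
};

// A lexer processing "var = 5" might emit, among others, tokens like:
Token const name { Identifier{ "var" } };
Token const value{ Number{ 5 } };
```

The caller first checks the alternative with is<T>() and only then extracts it with get<T>(), which keeps the variant access safe.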

Finishing Up

By now you should understand what our language goals are and how to define both the reserved tokens and all the other token categories.

In the next part we’ll learn how to write a lexer for our programming language, what Maximal Munch is and much more. If you come across any issues, please let me know in the comments!

See you in the next blog post!
