Building a Simple HTML Lexer and Parser in Java

The first step in building the recursive decent parser is to build a class that enables tokens to be stored. In order to do this, a Token Class was built, with enum types of DIGIT, LETTER, STRING, KEYWORD, EOI, and INVALID. These can be extracted from the following definitions of tokens.

Simplified HTML

Token Class

Token Class

Select Methods from the Token Class

Once the token class is created, a lexicographic analyzer is needed to take the input string and turn it into a sequence of tokens. The heart of the lexer, is the next token method, which returns the token which corresponds to the next sequence of characters. In order to accomplish this, other helper methods are used, to consume input. These methods are greedy and attempt to consume as much input as possible and still produce a valid token. If they fail to produce a valid token, an INVALID token is returned.

nextToken Method

Snippet of Code from Lexer Class

When designing the parser, the grammar must be considered, which in this case is represented in Extended Backus – Naur Form (E-BNF), which is a form of expressing a Context-Free Grammar.

The general syntax of E-BNF is:

BNF Syntax

E-BNF Syntax for Simplified HTML

E-BNF Grammar of Simplified HTML

E-BNF allows for zero or more symbols through the use of ‘{‘ and ‘}’. These are known as meta-symbols for the language. For example, the { TEXT } translates into zero or more TEXT symbols.

The technique I used when designing the parser, was to implement each block as a method. For example, Following lexical analysis, the webpage method is called which consumes the body tags, along with greedily consuming TEXT symbols between the start and end body tags. When consuming the TEXT symbols, the text method is called, which can consume either a STRING, bold tags, italicize tags, or unordered list tags. Each method handles the construction of a particular block, through indirect recursion between other methods. If the block can be constructed, then the input is part of the grammar; otherwise, the program exits on the invalid sequence of tokens. The output of the parser is a tabulated version of the input if it is a valid statement.

Webpage method from Parser Class

Webpage method from Parser Class