ADAPTABLE PARSING EXPRESSION
GRAMMARS

LEONARDO VIEIRA DOS SANTOS REIS
ADAPTABLE PARSING EXPRESSION
GRAMMARS
Tese apresentada ao Programa de Pós-
-Graduação em Ciência da Computação do
Instituto de Ciências Exatas da Universi-
dade Federal de Minas Gerais como requi-
sito parcial para a obtenção do grau de
Doutor em Ciência da Computação.
Orientador: Roberto da Silva Bigonha
Coorientador: Vladimir Oliveira Di Iorio
Belo Horizonte
Novembro de 2014

LEONARDO VIEIRA DOS SANTOS REIS
ADAPTABLE PARSING EXPRESSION
GRAMMARS
Thesis presented to the Graduate Program
in Computer Science of the Universidade
Federal de Minas Gerais in partial fulfill-
ment of the requirements for the degree of
Doctor in Computer Science.
Advisor: Roberto da Silva Bigonha
Co-Advisor: Vladimir Oliveira Di Iorio
Belo Horizonte
November 2014
© 2014, Leonardo Vieira dos Santos Reis.
Todos os direitos reservados.
Reis, Leonardo Vieira dos Santos
R375a Adaptable Parsing Expression Grammars / Leonardo
Vieira dos Santos Reis. — Belo Horizonte, 2014
xxvi, 90 f. : il. ; 29cm
Tese (doutorado) — Universidade Federal de Minas
Gerais — Departamento de Ciência da Computação
Orientador: Roberto da Silva Bigonha
Coorientador: Vladimir Oliveira Di Iorio
1. Computação – Teses. 2. Linguagens de
Programação (Computadores) - Sintaxe – Teses.
3. Compiladores (Computadores) – Teses.
I. Orientador. II. Coorientador. III. Título.
CDU 519.6*33(043)
UNIVERSIDADE FEDERAL DE MINAS GERAIS
INSTITUTO DE CIENCIAS EXA TAS
PROGRAMA DE P6S-GRADUA<;AO EM CIENCIA DA COMPUT A<;AO
FOLHA DE APROVA<;Ao
Adaptableparsingexpressiongrammars
LEONARDO VIEIRA DOSSANTOSREIS
TesedefendidaeaprovadapelabancaexaminadoraconstituidapelosSenhores:
BeloHorizonte,17denovembrode2014.

to my wife Helen and my son Emmanuel
ix

Acknowledgments
I dedicate a special thanks to my parents, Maria Cristina and Ednaldo, who have given
me love and did everything possible to give me the best education. My thanks to my
grandmother Anizia, my uncles Nilton, Móises, Pedro, Jóse, Rosival, my aunts Néia,
Idma, Marlene, Argentina, Vera, my brothers Diego e Milton and my sister Maiza.
Thank you all my friends. My roommates and friends, Henrique (Ratinho),
Hussin, Fillipe and Diorgenes, for the good relationship. All my work friends, Thi-
ago Augusto, Wagner, Mônica, Elton, Rodrigo, Alex, Filipe, Welbert and Guilherme
Knop, for they make the work more enjoyable. Thank you my doctorate friends, Sérgio,
Eliseu, Isabel and Glauber, that allow me a better environment to study.
Thank you to all people who made part of my life in Diamantina city. I specially
thank Mariana and my Taekwondo friends: Noel, Gui, Dayvson, Nagila, Camila, Vito
and Alexandre. You are a second family for me.
My sincere thanks to my advisors: Prof. Roberto Bigonha and Prof. Vladimir
Di Iorio. Thanks for the patience and the learning, I really learned a lot with you two.
I am very grateful to my wife Helen Cristina and my son Emmanuel. Without
your love, patience and support I never would go so far. I love you!
xi

“Why program by hand in five days what you can spend five years of your life
automating?”
(Terrence Parr)
xiii

Resumo
Geradores automáticos de analisadores sintáticos têm sido usados por mais de 50 anos.
Ferramentas tais como o YACC automaticamente geram um analisador sintático a par-
tir de uma definição formal da sintaxe da linguagem, que usualmente é baseada em
uma gramática livre do contexto. A principal motivação para geradores automáticos de
analisadores sintáticos é garantir que o compilador está correto e que reconhece todas
as sentenças da linguagem que se pretende especificar, visto que com uma implemen-
tação manual é muito difícil de garantir que todos os programas de uma linguagem
serão corretamente analisados. Apesar das vantagens mencionadas acima, geradores
automáticos de analisadores sintáticos ainda não dão suporte a linguagens que per-
mitem modificar o seu próprio conjunto de regras dinamicamente. Faltam modelos
apropriados para descrever tais linguagens, assim como geradores automáticos de anal-
isadores sintáticos eficientes. Portanto, os analisadores sintáticos dessas linguagens são
manualmente implementados. Nesta tese, é apresentado o projeto e modelo formal
de Adaptable Parsing Expression Grammars (APEG), uma extensão de Parsing Ex-
pression Grammar (PEG) que permite a manipulação do conjunto de regras sintáticas
durante a análise do programa de entrada. Mostramos, também, que APEG é poderoso
o suficiente para definir linguagens que exigem a modificação de seu conjunto de regras
dinamicamente e analisadores sintáticos gerados a partir do modelo são eficientes para
serem usados na prática.
Palavras-chave: APEG, PEG, gramáticas adaptáveis, análise sintática, sintaxe
xv

Abstract
Parser generators have been used for more than 50 years. Tools like YACC can auto-
matically build a parser from a formal definition of the syntax of a language, usually
based on context-free grammars (CFG). The main motivation for automatic parser
generation is compiler correctness and recognition completeness, since with manual
implementation it is very difficult to guarantee that all programs in a given language
will be correctly analysed. Despite the advantagens mentioned above, the technology
of automatic parser generation is still not available for languages that allow on-the-fly
modifications on their own set of grammar rules. There is a lack of appropriate formal
models for describing the syntax of these languages, therefore efficient parsers may not
be automatically generated, requiring handwritten code. In this thesis, we present the
design and formal definition of Adaptable Parsing Expression Grammars (APEG), an
extension to the Parsing Expression Grammar (PEG) model that allows the modifi-
cation of production rules during the analysis of an input string. We also show that
APEG is capable to define languages that require on-the-fly modifications and allows
automatic generation of parsers that are reasonably efficient to be used in practice.
Keywords: APEG, PEG, adaptable grammars, parsing, syntax
xvii

List of Figures
1.1 An example of a program in the language Foo. . . . . . . . . . . . . . . . . 3
1.2 Grammar describing the Foo language. . . . . . . . . . . . . . . . . . . . . 3
1.3 Examples of programs in extensible Foo . . . . . . . . . . . . . . . . . . . 5
2.1 Context Free Grammar (CFG) for the toy language . . . . . . . . . . . . . 9
2.2 Illegal program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 A W-grammar for generating the language {anbncn} . . . . . . . . . . . . . 10
2.4 Affix grammar for the language {anbncn | n > 0}. . . . . . . . . . . . . . . 12
2.5 CFG for generating binary numbers . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Derivation tree of the string 1001 . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 Decorated derivation tree of the string 1001 . . . . . . . . . . . . . . . . . 14
2.8 Attribute Grammar (AG) for binary numbers . . . . . . . . . . . . . . . . 14
2.9 AG for the language {anbncn | n ≥ 0}. . . . . . . . . . . . . . . . . . . . . 15
2.10 Extended Attribute Grammar (EAG) for binary numbers . . . . . . . . . . 16
2.11 EAG for the language {anbncn | n ≥ 0}. . . . . . . . . . . . . . . . . . . . 16
2.12 Parsing Expression Grammar (PEG) for the language {anbncn | n ≥ 0} . . 22
3.1 Attribute PEG for binary numbers . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Semantics of Adaptable PEG - Part I. . . . . . . . . . . . . . . . . . . . . 35
3.3 Semantics of Adaptable PEG - Part II. . . . . . . . . . . . . . . . . . . . . 36
3.4 An Attribute PEG for a data dependent language. . . . . . . . . . . . . . 38
3.5 Using adaptability in the PEG of Figure 3.4. . . . . . . . . . . . . . . . . . 38
3.6 Syntax of block with declaration and use of variables (simplified). . . . . . 39
3.7 Adaptable PEG for declaration and use of variables. . . . . . . . . . . . . . 39
4.1 Concrete syntax for the example of Figure 3.4. . . . . . . . . . . . . . . . . 47
4.2 APEG with user-defined functions and character classes. . . . . . . . . . . 48
4.3 Example with the set of production rules changed at parse time. . . . . . . 48
xix
4.4 Data structures representing a copy of an APEG with a modified production
rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Example with the set of production rules changed at parse time. . . . . . . 56
4.6 Example of code generated by a PEG. . . . . . . . . . . . . . . . . . . . . 57
4.7 Generated code for the block language. . . . . . . . . . . . . . . . . . . . . 59
5.1 Processing of a SugarJ top-level declaration (borrowed from [Erdweg et al.,
2011]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 A definition of sugar library for Pairs in SugarJ. . . . . . . . . . . . . . . . 65
5.3 Use of the pair syntax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Syntax definition of sugar libraries. . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Syntax definition of compilation units. . . . . . . . . . . . . . . . . . . . . 68
5.6 Composition of more than one sugar library. . . . . . . . . . . . . . . . . . 69
5.7 Definition of the closure syntax. . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8 Definition of a for loop in Fortress. . . . . . . . . . . . . . . . . . . . . . . 71
5.9 APEG formalization of Fortress language. . . . . . . . . . . . . . . . . . . 72
5.10 Combining grammars. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1 Simplified version of the APEG grammar . . . . . . . . . . . . . . . . . . . 83
xx
List of Tables
1.1 Parse table for language Foo. . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Rules generated by metarules and hyperrules . . . . . . . . . . . . . . . . . 11
2.2 Rules generated by affixes and rules schema for generating the string aaabbbccc 12
5.1 Time in milliseconds for parsing programs written in the SugarJ language.
The performance of the original SugarJ compiler and the APEG version are
compared. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
xxi

List of Acronyms
AMG Adaptive Multi-Pass Grammar
APEG Adaptable Parsing Expression Grammar
AG Attribute Grammar
AST Abstract Syntax Tree
CFG Context Free Grammar
DFA Deterministic Finite Automaton
DSL Domain-Specific Language
EAG Extended Attribute Grammar
LGI Language Generator by Instil
PEG Parsing Expression Grammar
RAG Recursive Adaptable Grammar
TLG Two-Level Grammar
xxiii

Contents
Acknowledgments xi
Resumo xv
Abstract xvii
List of Figures xix
List of Tables xxi
List of Acronyms xxiii
1 On-the-Fly Grammar Modification 1
1.1 Parsing Adaptability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Toward new Models to Describe Syntax 7
2.1 From Context-Free to Extended Attribute Grammars . . . . . . . . . . 8
2.1.1 W-Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Affix Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Attribute Grammar and Extended Attribute Grammar . . . . . 13
2.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Adaptable Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Imperative Adaptable Models . . . . . . . . . . . . . . . . . . . 17
2.2.2 Declarative Adaptable Models . . . . . . . . . . . . . . . . . . . 19
2.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Parsing Expression Grammar . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Data-Dependent Grammar . . . . . . . . . . . . . . . . . . . . . . . . . 25
xxv
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Adaptable Parsing Expression Grammar 29
3.1 Attribute Parsing Expression Grammar . . . . . . . . . . . . . . . . . . 30
3.2 Adaptable Parsing Expression Grammar . . . . . . . . . . . . . . . . . 34
3.3 APEG in Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Data Dependent Languages . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Static Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Implementation of Adaptable Parsing Expression Grammars 43
4.1 PEG-related implementations . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Implementing an Interpreter for APEG . . . . . . . . . . . . . . . . . . 46
4.2.1 Examples Showing the Concrete Syntax . . . . . . . . . . . . . 46
4.2.2 Implementing PEG with Attributes . . . . . . . . . . . . . . . . 49
4.2.3 Implementing Adaptability . . . . . . . . . . . . . . . . . . . . . 51
4.3 APEG Properties Associated with Memoization . . . . . . . . . . . . . 52
4.4 Mixing Code Generation and Interpretation – An Initial Approach . . . 56
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Evaluation and Validation 63
5.1 An APEG Implementation of an Extensible Language . . . . . . . . . . 64
5.2 The Syntax of SugarJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 The Syntax of Fortress . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 Conclusion and Future Work 79
6.1 Adaptability at a Low Complexity Cost . . . . . . . . . . . . . . . . . . 79
6.2 A Reasonably Efficient Implementation . . . . . . . . . . . . . . . . . . 81
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A Adaptable Parsing Expression Grammar 83
Bibliography 85
xxvi
Chapter 1
On-the-Fly Grammar Modification
Program source code readability is far more important than writability: every code
is written once, and read many times, mainly during the debugging phase of software
development. Clarity and legibility are more important that brevity [Rahien, 2010].
Readability can be achieved by many ways, e.g., better software organization, conscious
use of modularity, careful coding. The use of Domain-Specific Languages (DSLs) has
been considered a good way to improve readability [Kosar et al., 2012; Barišić et al.,
2011; Kieburtz et al., 1996].
Fowler [2010] describes DSLs as computer languages of limited expressiveness
focused on a particular domain. In this definition, it is clear that a DSL is a language
to be used by humans to instruct a computer to do something. However, this language,
differently from a general-purpose language as Java, only has a minimum set of features
needed to support its domain. Therefore, a DSL is not a language to be used for
building an entire system; rather, a DSL may be useful for only a single aspect of
a system. There is an increasing use of DSLs in software development, because they
bring about a lot of benefits, such as improving development productivity [Kosar et al.,
2012, 2010] and maintainability [Kosar et al., 2010; van Deursen and Klint, 1998].
Among the various methods for implementing DSLs, extensible languages, which
are languages that have features that allow adding new constructions to them-
selves [Wilson, 2004], seem to have several advantages over other approaches [Erdweg
et al., 2011; Tobin-Hochstadt et al., 2011; Kosar et al., 2008]. One of the advantages is
the possibility of implementing DSLs in a modular way. In this respect, Erdweg et al.
[2011] show how DSLs can be implemented using the extensible language SugarJ, by
means of syntax units designated as sugar libraries, which specify a new construction
for a domain concept. Tobin-Hochstadt et al. [2011] also discuss the advantages of
implementing DSLs by means of libraries.
1
2 Chapter 1. On-the-Fly Grammar Modification
The implementation of extensible languages requires adapting the parser every
time the language is extended with a new construction. Unfortunately, most popular
tools designed for automatic generation of syntactic analyzers do not address resources
for the specification and implementation of extensible languages. Moreover, impor-
tant advances in parsing theory, such as LL(*) [Parr and Fisher, 2011], Adaptative
LL(*) [Parr et al., 2014], Indentation-Sensitive Grammars [Adams, 2013] and Pars-
ing Expression Grammars (PEGs) [Ford, 2004] do not include features for allowing
on-the-fly modification on the rules of the grammar language, changing the parser dy-
namically. Therefore, extensible languages have been implemented in an ad-hoc way
by regenerating a static grammar which accomplishes the new changes, compiling this
grammar and using it for parsing the program [Erdweg et al., 2011; Ryu, 2009; Reis
et al., 2009]. The development of techniques to enable parser generation to cope with
on-the-fly modifications of the parsing grammar seems very important to make the use
of DSLs more popular, since it may allow an easier definition of DSLs using extensible
languages.
In this thesis, we focus on models and parsing algorithms which allow modifica-
tions on the grammar rules on the fly. Our goal is to develop a model that is powerful
enough to specify the syntax of extensible languages, as SugarJ [Erdweg et al., 2011],
Fortress [Allen et al., 2008, 2009; Ryu, 2009] and XAJ [Reis et al., 2009]. The model
must also allow automatic generation of parsers for the defined languages. As a result
of the work, we developed Adaptable Parsing Expression Grammar (APEG), which is
an adaptable model based on PEG.
The organization of the remaining of this chapter is as follows: Section 1.1 dis-
cusses the problem we are addressing in more details. Section 1.2 presents the thesis
objective and sections 1.3 and 1.4 present the contributions and the organization of
this thesis, respectively.
1.1 Parsing Adaptability
One of the most important tasks in a compiler is checking if and how the text repre-
senting a program (source code) is a valid sentence of the language, i.e. parsing the
program. Generally, a language designer implements this task based on a Context Free
Grammar (CFG) for defining the syntax of the language, and generates a parse table
from this grammar to drive the parsing process. For example, suppose a language
designer is creating a language Foo, which is a simple language with expressions on
prefixed notation and statements. Figure 1.1 shows an example of a program in this
1.1. Parsing Adaptability 3
Figure 1.1 An example of a program in the language Foo.
program:
variables x, y;
x := 1;
y := + x 2;
end
Figure 1.2 Grammar describing the Foo language.
start→ “program” “:” var stm “end” r1
var → “variables” id vars r2
vars→ “,” id vars r3
vars→ “;” r4
stm→ id “:=” expr “;” stm r5
stm→ λ r6
expr → num r7
expr → “+” expr expr r8
expr → “*” expr expr r9
id→ [“a”−“z”] id r10
id→ λ r11
num→ [“0”−“9”] num r12
num→ λ r13
language.
To implement a parser for the Foo language, first the language designer may
specify the formal syntax of the language using a CFG and, next, uses this grammar for
generating a parser. Ignoring details, a parser algorithm generated is, basically, a table
that guides the derivation relation of CFGs. This table tells what rule must be used
for the derivation of the leftmost nonterminal considering a lookahed of k ≥ 1 symbols
of the current input. Figure 1.2 shows a grammar which describes the syntax of the
Foo language, and Table 1.1 presents a parse table generated from this grammar, using
a top-down approach with k = 1. Every rule of the grammar presented in Figure 1.2
has a label, ri, which is used for referencing it on Table 1.1.
The parse table presented in Table 1.1 is used for verifying the program in Fig-
ure 1.1 as follows: the recognition process begins with the nonterminal start and if
the first symbol on input is the token program, the table defines that the rule to be
used is start → “program” “:” var stm “end”. As the two first terminals of this rule,
“program” and “:”, match the two input tokens, they are consumed. Now, the next
token on input is the token variables and the leftmost nonterminal is var. Again,
looking at line var and column variables on the table, the rule of var that is used is
4 Chapter 1. On-the-Fly Grammar Modification
[a− z] [0− 9] “+” “*” “;” “,” “:=” “program” “variables” “end”
start r1
var r2
vars r4 r3
stm r5 r6
expr r7 r8 r9
id r10 r11 r11 r11
num r12 r13 r13 r13
Table 1.1: Parse table for language Foo.
var → “variables” id vars. This process progresses until either an error is found (blank
cell in table) or the input ends.
CFGs and their parsing algorithms can handle a large class of languages which
do not perform changes in their set of rules during the parsing. In fact, CFGs are
not suitable for languages which allows on-the-fly modifications in their rules. For
illustrating this situation, suppose an extensible version of the Foo language, which
has a statement to add new rules to its own grammar. Such a statement can be
represented by the following rule, added to the CFG of Figure 1.2:
stm→ “#” id “ => ” rhs “; ”
A statement that extends the language begins with the symbol # followed by the name
of a nonterminal, the symbol =>, the right side of a rule (defined by nonterminal rhs
whose rules are not presented) and ends with a semicolon. When this rule is applied
in a derivation, its semantics is to extend the grammar, adding a rule id → rhs, such
as the program of Figure 1.3a, which adds a rule for division expressions. However, if
the parsing table is not updated immediately, the valid program in Figure 1.3b is not
correctly parsed, because this program uses the extension just added.
This problem happens because the Foo language defines how to parse itself, but
traditional parsing algorithms use a static parser table and do not allow to modify it
during the input analyses. The same problem arises when a bottom-up approach is
taken. The example above is a toy language, but it illustrates a feature required in
extensible languages such as SugarJ, Fortress, Racket [Tobin-Hochstadt et al., 2011]
and Lisp.
The challenge is to design a parsing algorithm that performs on-the-fly changes
on the parse table (or an equivalent structure used to drive the parser) at parsing time
and add an information to use new rules, such as expr → “/” expr expr.
1.2. Thesis Objective 5
Figure 1.3 Examples of programs in extensible Foo
program:
variables x, y;
x := 4;
# expr => “/” expr expr;
y := + x 2;
end
(a) A program which does not
use new syntax
program:
variables x, y;
x := 4;
# expr => “/” expr expr;
y := / x 2;
end
(b) A program which uses a new
syntax.
1.2 Thesis Objective
Our goal is the development of an adaptable model which must satisfy the following
requirements:
• to offer facilities for adapting the grammar during the parsing process, without
adding too much complexity to the base model;
• to assure that grammar extensions take effect immediately;
• to allow an implementation of an automatic parser generator with reasonable
efficiency.
The model developed in this thesis is based on PEG and is called Adaptable
Parsing Expression Grammar (APEG). The proposed model has focus on the syntactic
issues only, so we only discuss in details how to define the syntax of extensible languages
using APEG. However, the model does not impose any restriction on how the semantics
issues may be dealt with. Similarly to other models, for example, an Abstract Syntax
Tree (AST) can be built after parsing the program, and the semantics can be defined
by operations over this AST.
1.3 Contributions
The contributions of this work are in implementations of programming languages. The
specific contributions are:
• the design of an adaptable model based on PEG;
• a careful formalization of the model;
6 Chapter 1. On-the-Fly Grammar Modification
• on-the-fly grammar and parser adaptability;
• an automatic parsing generation for APEG;
• evaluation of the using of APEG in the implementation of real extensible lan-
guages.
1.4 Thesis Organization
This thesis is organized as follows. Chapter 2 presents a revision of formalisms for the
specification of the syntax of languages. Our definition model and formalization are
presented in Chapter 3, together with examples of use and a comparative discussion
with other models. Chapter 4 explains the implementation of APEG, discussing how we
combine PEG and attributes and how adaptability is achieved in the model. Chapter 5
shows how the model can be used for describing the syntax of real extensible languages
and an experimental evaluation of our implementation with respect to execution time.
Finally, Chapter 6 concludes this work and discusses the future work.
Chapter 2
Toward new Models to Describe
Syntax
A language is a set of sentences over an alphabet. To define the syntax of languages,
Chomsky [1956, 1959] proposed grammars, which is a system that generates every
sentence of the language that is intended to describe.
Formally, a grammar is a 4-tuple (VN , VT , R, S), where V = VN ∪ VT is the
vocabulary, R is a finite set of rewrite rules and S ∈ VN is the initial sentential form.
VN and VT are disjoint set of nonterminals and terminals, respectively. A rule has the
form ϕ → ψ, where ϕ and ψ are strings of symbols of the vocabulary. A sentence of
the language is generated from the initial sentential form S and applying the rewrite
rules until reaching a string with only terminal symbols.
For example, suppose a grammar containing the rules: S → AB, A→ C, CB →
Cb and C → a. A derivation of the sentence ab is showed below
S
S→AB⇒ AB A→C⇒ CB CB→Cb⇒ Cb C→a⇒ ab
where the derivation begins with nonterminal S, then the rule S → AB is applied to
get the sentence AB. Next, the rule A → C is applied to the sentence, resulting CB.
The symbol ⇒ is used to represent the derivation relation. In order to show all details
of the process, the rule used in every derivation step is showed on top of the derivation
symbol ⇒.
After his seminal paper in 1956 [Chomsky, 1956], Chomsky made restrictions on
the form of the rules of a grammar to obtain better devices for reasoning about the
structure of natural languages [Chomsky, 1959]. The most important restriction is the
one that allows only a nonterminal symbol on the left hand side of a rule, which creates
7
8 Chapter 2. Toward new Models to Describe Syntax
the Context Free Grammars (CFGs) that had demonstrated to be of great importance
for describing the syntax of programming languages since its first use in the definition
of the syntax of ALGOL. The restrictions made by Chomsky is what we now know
as Hierarchy of Chomsky with four types of grammars: regular grammars (type 3),
context-free grammars (type 2), context-sensitive grammars (type 1) and unrestricted
grammars (type 0).
Although the main purpose of grammars was to study the structure of the En-
glish language, CFGs (type 2 grammars) have been used for describing the syntax of
programming languages, because of the simplicity of the generative mechanism and
the existence of efficient programs to check if a string is generated by a particular
CFG. However, there are some programming language features, such as the relation
between declaration and use of variables or on-the-fly grammar modification, that can-
not be formalized by a CFG. Therefore, various augmentations of CFG or new models
have been proposed for describing the full syntax of programming languages, such as
Extended Attribute Grammar (EAG) [Watt and Madsen, 1983], Christiansen’ Adapt-
able Grammar [Christiansen, 1990] and Recursive Adaptable Grammar (RAG) [Shutt,
1998].
After this brief historical introduction on grammars, this chapter presents a review
of the various models that have been proposed for describing the syntax of programming
languages. First, Section 2.1 discusses several augmentations to CFGs, which are not
adaptable, i.e., it is impossible to explicitly manipulate the set of the grammar rules.
Next, Section 2.2 discusses adaptable models. In Sections 2.3 and 2.4, we discuss two
other models: Parsing Expression Grammar (PEG) and Data-Dependent Grammars .
We have decided to discuss them in separate sections because they are important
advances in theory of programming languages and our model presented in Chapter 3
is strongly inspired by PEG and has features of data-dependent grammars. Finally,
Section 2.5 concludes this chapter.
2.1 From Context-Free to Extended Attribute
Grammars
Consider a toy language in which a program begins with a list of declarations of vari-
ables followed by a list of update commands. The type of the variables can be integer
or boolean, and an update command is composed by a variable on the left hand side
and an expression on the right hand side. Moreover, this language has the following
constraints: 1) a variable cannot be declared more than once; 2) a variable cannot be
2.1. From Context-Free to Extended Attribute Grammars 9
Figure 2.1 CFG for the toy language
Prog → program Decl Stmt end
Decl → Type Vars ; | Type Vars ; Decl
Type → integer | boolean
Vars → Identifier | Vars , Identifier
Stmt → Identifier = Expr ; | Identifier = Expr ; Stmt
Expr → Identifier | Number | Boolean
Identifier → Char | Char Identifier
Char → a | b | . . . | z
Number → Digit | Digit Number
Digit → 0 | 1 | . . . | 9
Boolean → true | false
Figure 2.2 Illegal program.
program
boolean c;
c = 5;
end
used if it was not declared; and 3) the expression on the right hand side of an update
command must have the same type of the variable on the left hand side.
Figure 2.1 describes the syntax of the language using a CFG, except for the
constraints, because they cannot be specified with a CFG, since these constraints are
not context-free. Therefore, illegal programs can be derived from this grammar, such
as the program in Figure 2.2. According to Slonneger and Kurtz [1995], language
implementers say that infractions, like that in Figure 2.2, which involves the context,
belongs to the static semantics of a language. This name, static semantics, is because
detecting an error involves the meaning of symbols and it can be statically determined
from the text of the program. However, Slonneger and Kurtz [1995] argue that static
errors belong to the syntax, not the semantics of a language.
Although Slonneger and Kurtz [1995] claim that “static semantics” are related to
syntax, CFG cannot describe them, therefore, “static semantics” must be specified by
other way, possibly, using other formalisms more powerful than CFG. Context-sensitive
grammars are powerful enough to describe all the aspects of programming languages,
but these grammars are unsuitable for several reasons: 1) the expansion of a node in
the derivation tree may depend on sibling nodes; 2) the direct hierarchical relationships
between nonterminals that furnish a basis for semantics descriptions is lost; 3) formal
context-sensitive grammars are difficult to construct and understand; and 4) parsing
algorithms for context-sensitive grammars are, in general, not efficient, thus, are not
10 Chapter 2. Toward new Models to Describe Syntax
Figure 2.3 A W-grammar for generating the language {anbncn}
N ` u | uN
L ` a | b | c
(a) metarules.
N 7→ Na, Nb, Nc
NuL 7→ NL, L
uL 7→ L
(b) hyperrules.
suitable for compiler implementation [Slonneger and Kurtz, 1995]. The remaining
of this section discusses augmentations of CFGs which are capable to describe these
aspects of the syntax of programming languages and, at the end of this section, we
discuss the strength and weakness of them focusing in our main problem: adapting the
grammar on the fly.
2.1.1 W-Grammars
The ALGOL 68 report [Wijngaarden, 1969] presents a formalism called W-Grammars ,
formerly known as Two-Level Grammar (TLG), for defining the syntax and semantics
of this language [de Chastellier and Colmerauer, 1969]. TLGs are a formalism which
uses two levels of rules in a way that the first rules generate a scheme of nonterminals
which are used in the second group of rules to create a possible infinite number of rules.
For example, to generate the language {anbncn | n ≥ 0} the metarules (as called the
first level of rules) in Figure 2.3a act generating a set of strings from every nonterminal
which is used to rewrite this nonterminal in the hyperrules (the second level of rules)
in Figure 2.3b. The metarules are similar to CFG, but the hyperrules use a comma as
a separate symbol.
In this example, for deriving the string a,a,b,b,c,c, first we derive the string uu
from the start symbol N as follows (we use the symbol |= for the derivation relation
on the metarules to distingue of the symbol ⇒ used for deriving strings)
N |= uN |= uu
which combined with the hyperrule N 7→ Na,Nb,Nc generates the rule uu →
uua, uub, uuc. So, this rule is used in a derivation process from the axiom uu (an axiom
is every string that can be derived from the start symbol using only the metarules) as
uu⇒ uua, uub, uuc
In a similar way, the rules used in the derivation are built by the combination
of the metarules and hyperrules. Following, we show all steps in the derivation of the
2.1. From Context-Free to Extended Attribute Grammars 11
number rule metarules hyperrule rule
1 N
∗
|= uu N 7→ Na,Nb,Nc uu→ uua, uub, uuc
2 N ` u and L ` a NuL 7→ NL,L uua→ ua, a
3 L ` a uL 7→ L ua→ a
4 N ` u and L ` b NuL 7→ NL,L uub→ ub, b
5 L ` b uL 7→ L ub→ b
6 N ` u and L ` c NuL 7→ NL,L uuc→ uc, c
7 L ` c uL 7→ L uc→ c
Table 2.1: Rules generated by metarules and hyperrules
string a,a,b,b,c,c. The number of the rules used in every step of the derivation is on
top of the derivation symbol⇒ and Table 2.1 shows how the metarules and hyperrules
are used to generate them with their respective number rules.
uu
1⇒ uua, uub, uuc 2⇒ ua, a, uub, uuc 3⇒ a, a, uub, uuc 4⇒ a, a, ub, b, uuc
5⇒ a, a, b, b, uuc 6⇒ a, a, b, b, uc, c 7⇒ a, a, b, b, c, c
As discussed, the idea of TLG is to combine two grammars to generate an infinite
set of rules, in which strings generated by a nonterminal (represented by capital letter)
of metarules can be replaced in hyperrules to form rules. TLGs can generate any
recursively enumerable language [Sintzoff, 1967], but there are not efficient automatic
parser generators based on TLGs. Also, although the model allows generating an
infinite set of rules, it does not have features to easily adapt the set of rules during the
parsing of a program. However, the idea introduced by TLGs of generating rules by
replacing some symbols is used in other models.
2.1.2 Affix Grammars
Affix Grammars is an augmentation of CFG proposed by Koster [1970]. In this model,
nonterminals have affixes which can assume values over a certain domain and act like
parameters of the nonterminals. Affix Grammars are very similar to W-Grammars and
have a two-level grammar; the first level generates the domain of affixes and the second
is the rule scheme with nonterminals and its affixes. As an example, the language,
{anbncn | n ≥ 1}, generated in Section 2.1.1 using a W-Grammar, can be specified
using Affix Grammar as in Figure 2.4a and 2.4b.
The affix rules presented in Figure 2.4a generate the domain of affix and, in this
example, define every natural number in a notation of successor. A string over the
domain of an affix (generated by a CFG) is used for forming a rule by substituting it
12 Chapter 2. Toward new Models to Describe Syntax
Figure 2.4 Affix grammar for the language {anbncn | n > 0}.
N ` 1 | s(N)
(a) Rules for generating the domain
of affixes.
S 7→ A(N) B(N) C(N)
A(1) 7→ a
A(s(N)) 7→ a A(N)
B(1) 7→ b
B(s(N)) 7→ b B(N)
C(1) 7→ c
C(s(N)) 7→ c C(N)
(b) Rule schema.
in the rule schema. For example, substituting the string s(s(s(1))), which is generated
from N , in the rule schema S 7→ A(N) B(N) C(N), we obtain the rule S→ A(s(s(s(1))))
B(s(s(s(1)))) C(s(s(s(1)))). Using the affix and rule schema for generating new rules, it
is possible to generate every string of the language {anbncn | n > 0}. Table 2.2 shows
the rules used for generating the string aaaabbbbcccc.
The affixes can be of two types: inherited or derived . The inherited affixes are the
input parameters of the nonterminals and the derived ones are the output parameters.
The values of the derived affixes are calculated by functions called in productions rules.
In the example above, the affixes of the nonterminals are inherited.
Affix Grammars are very similar to W-Grammars, but they are more suitable
for parsing than W-Grammars and have been used for describing natural languages
[Koster, 1991a,b]. As W-Grammars, Affix Grammars do not have features to modify
the set of rules during parsing.
Affix rule Rule schema rule
N
∗
|= s(s(s(1))) S 7→ A(N) B(N) C(N) S → A(s(s(s(1)))) B(s(s(s(1)))) C(s(s(s(1))))
N
∗
|= s(s(s(1)))
A(s(N)) 7→ a A(N)
B(s(N)) 7→ b B(N)
C(s(N)) 7→ c C(N)
A(s(s(s(1)))) → a A(s(s(1)))
B(s(s(s(1)))) → b B(s(s(1)))
C(s(s(s(1)))) → c C(s(s(1)))
N
∗
|= s(s(1))
A(s(N)) 7→ a A(N)
B(s(N)) 7→ b B(N)
C(s(N)) 7→ c C(N)
A(s(s(1))) → a A(s(1))
B(s(s(1))) → b B(s(1))
C(s(s(1))) → c C(s(1))
N
∗
|= s(1)
A(s(N)) 7→ a A(N)
B(s(N)) 7→ b B(N)
C(s(N)) 7→ c C(N)
A(s(1)) → a A(1)
B(s(1)) → b B(1)
C(s(1)) → c C(1)
Table 2.2: Rules generated by affixes and rules schema for generating the string
aaabbbccc
2.1. From Context-Free to Extended Attribute Grammars 13
Figure 2.5 CFG for generating binary numbers
S → T
T → B T
T → B
B → 0
B → 1
2.1.3 Attribute Grammar and Extended Attribute Grammar
Differently from the other formalisms presented here, which were designed for de-
scribing the full syntax of a programming language, Attribute Grammars (AGs) was
designed by Knuth [1968] to specify the meaning (semantics) of strings generated by
CFGs. An AG is a CFG where every symbol of the vocabulary has a set of attributes
and there are a set of functions for every rule in the grammar that compute the semantic
values of the attributes.
In an AG, the semantics of a string is given from its derivation tree by computing
the value of the attributes of every node in the derivation tree using the corresponding
semantic equations . There are two types of attributes: inherited and synthesized .
Inherited attributes are those whose values are based on the attributes of the ancestor
or sibling symbols and synthesized attributes are the ones based on attributes of the
descendants.
To illustrate how AG works, let’s consider the grammar of Figure 2.5, which
generates binary numbers and the derivation tree for the string 1001, in Figure 2.6. In
order to give the semantic of a binary number generated by this grammar, we add the
following attributes to the nonterminals:
• S has a synthesized attribute named v(S). This attribute, over the domain of
integer numbers, represents the semantic value of the binary number;
• T has two attributes, a synthesized attribute v(T ) and an inherited attribute
p(T ), both over the domain of integer numbers. The attribute v(T ) represents
the semantic value of the binary number and p(T ) will represent the value of the
string generated before T in the derivation tree;
• B has only one synthesized attribute, v(B). This attribute, over the domain of
integer numbers, represents the semantic value of the string generated from B.
The attribute v(T ) is used to flow the semantic value of the expression to the
root, and the inherited attribute p(T ) is used to pass the intermediate value of the
expression. Figure 2.8 shows the AG for binary numbers with the semantic equations
14 Chapter 2. Toward new Models to Describe Syntax
Figure 2.6 Derivation tree of the string 1001
S
T
B T
1 B T
0 B T
0 B
1
Figure 2.7 Decorated derivation tree of the string 1001
S (v = 9)
T (v = 9; p = 0)
B (v = 1) T (v = 9; p = 1)
1 B (v = 0) T (v = 9; p = 2)
0 B (v = 0) T (v = 9; p = 4)
0 B (v = 1)
1
Figure 2.8 AG for binary numbers
S → T v(S)← v(T ) p(T )← 0
T → B T1 v(T )← v(T1) p(T1)← 2 ∗ p(T ) + v(B)
T → B v(T )← 2 ∗ p(T ) + v(B)
B → 0 v(B)← 0
B → 1 v(B)← 1
for each rule of the grammar. In this grammar, we use a subscript in a nonterminal
to avoid ambiguities when an attribute is referenced. Figure 2.7 shows the derivation
tree for the string 1001 decorated with the values of the attributes.
An AG can also be used to specify the context-sensitive aspects of the syntax
of a programming language. The idea is to generate the language without checking
the context-sensitive aspects with the CFG facilities and use the attributes, semantic
2.1. From Context-Free to Extended Attribute Grammars 15
Figure 2.9 AG for the language {anbncn | n ≥ 0}.
S → A B condition: n(A) = n(B)
A → a A1 b n(A)← n(A1) + 1
A → λ n(A)← 0
B → c B1 n(B)← n(B1) + 1
B → λ n(B)← 0
equations and conditions to check the context-sensitive aspects on the derivation tree.
For example, the language {anbncn | n ≥ 0} can be recognized by an AG by generating
the context-free language {anbncm | n,m ≥ 0} and using attributes for checking whether
n is equal to m. Figure 2.9 shows the AG that generates this language. The strings
aabbcc and abcc are generated by the CFG presented in Figure 2.9, but only the former
satisfies the attribute conditions. So, the second string is not in the language defined
by the AG presented in Figure 2.9.
An Extended Attribute Grammar (EAG) is a model for formalizing context-
sensitive features of programming languages based on AG and Affix Grammar and
was designed to be more elegant, readable and generative in nature [Watt and Madsen,
1983]. Essentially, an EAG is a AG where the conditions and semantic equations are
embedded in the rules. EAG, for improving the readability, also uses the up arrow and
down arrow symbols for synthesized and inherited attributes, respectively.
Inherited attributes on the left hand side and synthesized attributes on the right
hand side of a rule are said to be in defining positions . Synthesized attributes on the
left hand side and inherited attributes on the right hand side of a rule are said to be
in applied positions . An improvement of EAG over AG is the use of an expression to
calculate the values of the attributes in the applied positions, allowing a more concise
specification of the semantic equations. For example, Figure 2.10 shows the EAG
deriving from the AG presented in Figure 2.8 for binary numbers. In the first rule, there
is an implicit semantic equation which copy the value of the first synthesized attribute
of the nonterminal T , on the right hand side of the rule, to the synthesized attribute of
S. This is done by using the same variable name for both attributes. There is another
semantic equation in the first rule that defines the value of the inherited attribute of
the nonterminal T as 0. In the second rule, the value of the inherited attribute of the
nonterminal T , on the right hand side, is calculated using the equation 2∗p+v1, which
uses the values of the inherited attribute p of the nonterminal T of the left hand side
of the rule and the synthesized attribute v1 of the nonterminal B. The other semantic
equations are defined in a similar way.
In general, defining positions have a variable name. Constraints may be specified
16 Chapter 2. Toward new Models to Describe Syntax
Figure 2.10 EAG for binary numbers
〈S ↑ v〉 → 〈T ↑ v ↓ 0〉
〈T ↑ v ↓ p〉 → 〈B ↑ v1〉 〈T ↑ v ↓ 2 ∗ p+ v1〉
〈T ↑ 2 ∗ p+ v ↓ p〉 → 〈B ↑ v〉
〈B ↑ 0〉 → 0
〈B ↑ 1〉 → 1
Figure 2.11 EAG for the language {anbncn | n ≥ 0}.
〈S〉 → 〈A ↑ n〉 〈B ↑ n〉
〈A ↑ n+ 1〉 → a 〈A ↑ n〉 b
〈A ↑ 0〉 → λ
〈B ↑ n+ 1〉 → c 〈B ↑ n〉
〈B ↑ 0〉 → λ
using the same name of variable or expressions in defining positions of a rule. For
example, Figure 2.11 shows an EAG for the language {anbncn | n ≥ 0} where there is
a constraint defining that the value of the synthesized attribute of the nonterminals A
and B must be the same. This constraint is specified using the same name, n, in the
defining position.
None of EAGs or AGs allow changing the set of rules during the parsing, but
there are many parser generators based on AGs and efficient parsing algorithms. Our
model developed in this work, Adaptable Parsing Expression Grammar (APEG), uses
the idea of attributes and also has a notation similar to EAGs.
2.1.4 Discussion
So far, we presented some formalisms to specify context-sensitive aspects of program-
ming languages. As CFG has demonstrated to be elegant, readable, intuitive and
simple, all these formalisms augment the CFG kernel with distinct facilities. Besides
the formalisms presented here, there are dozen of others which augment the CFG
kernel [Aho, 1968; Salomaa, 1972; Mayer, 1972; Rozenberg and Wood, 1980]. AG
and EAG have proven to be the most successful formalisms to describe the context-
sensitive aspects of programming languages, because of their simplicity and suitability
for automatic compiler construction.
However, AG and EAG, and also the other formalisms presented in this section,
are not suitable for describing languages whose syntax can be changed during parsing,
because their model is based on an immutable grammar during the derivation process.
As these languages may change their own set of rules during the derivation process,
other formalisms which allow modify their grammar may be more appropriate. The
2.2. Adaptable Models 17
next section presents this type of formalism.
2.2 Adaptable Models
Some authors [Burshteyn, 1990b; Christiansen, 1990; Shutt, 1998] argue that a way
of interpreting context-sensitive features of programming languages, such as variable
declarations, is to see them as adding new grammatical constructs to the language.
The developed models based on this idea, called adaptable grammars , explicitly provide
mechanisms to manipulate the set of rules.
In an adaptable grammar, the task of checking whether a variable used in an
expression has been previously defined may be performed as follows. Instead of having
a general rule like variable -> identifier, each variable declaration may add a new
rule to the grammar. For example, the declaration of a variable with name x adds the
following production rule:
variable -> "x".
The nonterminal variable will then generate only the declared variables, and not a
general identifier.
The main purpose of adaptable grammars was initially to give a syntactic treat-
ment for the definition of context-sensitive features of programming languages, but
they are powerful enough even for the definition of advanced extensibility mechanisms
of programming languages, like the one presented in Figure 1.3b in Section 1.1. The
purpose of this section is to review adaptable models. For this, we use the classifica-
tion of adaptable models adopted by Shutt [1998]. Based on the way the set of rules is
manipulated, Shutt classifies adaptable models as imperative or declarative, which we
discuss in the sequel.
2.2.1 Imperative Adaptable Models
Imperative models are inherently dependent on the parsing algorithm. The set of rules
is treated as a global entity that is modified while derivations are processed. Thus, the
grammar designer must know exactly the order of decisions made by the parser.
To the best of our knowledge, it seems that Wegbreit [1970] was the first to
formalize the idea of grammars that allow the manipulation of their own set of rules, so
the idea has been around for at least 40 years. Wegbreit proposed Extensible Context
Free Grammars , consisting of a CFG together with a finite state transducer. The
finite state transducer is used to modify the set of rules as follows: in every step of
18 Chapter 2. Toward new Models to Describe Syntax
a derivation, xAα, the transducer gets the terminal string x and outputs the rules to
change the grammar; the new set of productions is used in the remaining derivation
process to replace the nonterminal A. The derivation must be leftmost.
In [Burshteyn, 1990b], the author presents the idea of modifiable grammar , which
is an Attribute Grammar where the semantic equations add new rules to the grammar.
The inherited and synthesized attributes are used to carry values that are used for
compounding the new rules. The rules are inserted in the grammar when a reduction
is applied on parsing. In a following work, Burshteyn [1990a] formalized this idea
and presented two versions of modifiable grammar: top-down modifiable grammar and
bottom-up modifiable grammar . In both formalizations, a modifiable grammar consists
of a CFG and a Turing transducer, with instructions that may define a list of rules to
be added, and another to be deleted. Burshteyn presents these two versions because
of the dependency on the parser algorithm. The former is for a top-down and the later
for a bottom-up parsing.
Cabasino et al. [1992] propose evolving grammars , which are grammars that
evolve the set of rules and nonterminals over time. In their definition, the gram-
mar evolution is specified by a sequence of grammars, Gk, Gk+1, . . . , Gn, where every
Gi+1 grammar differs from previous one, Gi, by an addition of new productions and
nonterminals. Every production of the CFG may have an associated rule which defines
the set of new nonterminals and production that are added to the current grammar to
create the next grammar in the evolving sequence. The derivation must be rightmost
and the associated parser bottom-up. Although evolving grammars are capable to
define context-sensitive conditions like declare-use of variables, Cabasino et al. [1992]
reported problems to deal with syntactic scope.
Boullier [1994] defines Dynamic Grammars , a formalism with a recursively enu-
merated set of grammars which is very similar to Evolving Grammar of Cabasino et al.
[1992]. A Dynamic Grammar is a tuple (M, G0), where M is a Turing machine and
G0 is the initial grammar. The Turing machine generates the next grammar of the re-
cursive set from the current configuration on the derivation and the current grammar.
Differently from Evolving Grammar, the Turing machine can also generate grammars
by deleting rules. The derivation must be rightmost and the associated parser must be
bottom-up, such as Evolving Grammar.
Carmi [2010] argues that existing adaptable formalisms do not handle forward
references well, such as goto statements that precede label declarations, and extensible
languages with features like macro syntax and its expansion. Thus, he proposes a new
model, called Adaptive Multi-Pass Grammar (AMG). As in Dynamic Grammars and
Evolving Grammars, the formalism proposed by Carmi [2010] is driven by the parsing
2.2. Adaptable Models 19
algorithm and the derivation must be rightmost. Differently from these models, the
nonterminals symbols of AMGs may have annotation strings and a special type of rule,
a multi-pass rule. A multi-pass rule is similar to a simple rule, however, when the
parser reduces using this type of rule, the annotated string of the rule is put as a prefix
of the input to be parsed. The multi-pass rules together with nonterminal’s anotations
allow parsing a prefix of the input string and putting it back in the input to be parsed
again, possible with other set of rules, allowing handling forward references and macros
definition and expansion.
2.2.2 Declarative Adaptable Models
Imperative models have the disadvantage of the inherent dependency of the parsing
algorithm, thus the grammar designer must know exactly the order of decisions made
by the parser. Declarative models, on the other hand, are relatively independent of
the parsing algorithm and do not impose restrictions on the derivation order. Below
we discuss the declarative models.
Christiansen [1987] proposes Generative Grammars , that is essentially an EAG
in which the first attribute of every nonterminal symbol is inherited and represents the
language attribute. The language attribute contains the set of rules allowed in each
derivation. The initial grammar works as the language attribute for the root node of the
parse tree, and new language attributes may be built and used in different nodes. Each
grammar adaptation is restricted to a specific branch of the parse tree. One advantage
of this approach is that it is easy to define statically scope dependent relations, such
as the block structure declarations of several programming languages. In his survey of
approaches for adaptable grammar formalisms, Christiansen [1990] introduces a new
syntax for Generative Grammar , called Generative Clause Grammars , whose syntax
is more related with the implementation of the model in Prolog. Later, he proposed
an equivalent approach, using definite clause grammars [Christiansen, 2009]. Here we
refer to the first formalization presented by Christiansen.
Shutt [1998] argues that, in AG and other extensions for CFGs, the clarity of
the original base CFG model is undermined by the power of the extending facilities.
As an example, Shutt cites that Christiansen gives an example of an AG for ADA, in
which a single rule representing function calls needs two and a half pages to describe
the context conditions associated to it. Shutt claims that adaptable grammars should
be a natural way of describe context-sensitive elements of a programming language.
However, he observes that even the generative grammars of Christiansen inherit the
non orthogonality of AGs, with two different models competing. The CFG kernel is
20 Chapter 2. Toward new Models to Describe Syntax
simple, generative, but computationally weak. The augmenting facility is obscure and
computationally strong. Therefore, he proposes a declarative model called RAG.
Shutt defines RAG as a one-sorted algebra, where a single domain combines
the syntactic elements (terminals), meta-syntactic (nonterminals and the language at-
tribute) and semantic values (all other attributes). A RAG has a rule function which
maps nonterminals to a set of unbound rules , where an unbound rule is a rule of the
form
〈v0, e0〉 → t0〈e1, v1〉t1 . . . 〈en−1, vn−1〉tn−1〈en, tn〉tn
and every ti are terminals, vi are distinct variables and ei are elements of a term algebra
extended with the variables set {v0, v1, . . . , vn}. Bound rules are formed by substituting
the variables by elements of the algebra. The bound rules are used in the derivation
process. For clarity, consider the following example taken from [Shutt, 1998]. The
RAG with the rule function present below generates the language {anbn | n ≥ 0}
ρ(S) =
{
〈v0, λ〉 → λ
〈v0, λ〉 → a〈v0, v1〉b
}
where S is the start symbol.
RAG imposes the restriction that the leftmost variable v0 of a rule r must be
bound only to some element, a, such that r ∈ ρ(a). Therefore, the only possible
assignment to v0 is S. For deriving any string, we must begin using a bound rule of the
start symbol S. For example, bounding v0 to S and v1 to λ we can start the derivation
with the rule 〈S, λ〉 → a〈S, λ〉b , thus
〈S, λ〉 ⇒ a〈S, λ〉b
Following this idea, the derivation of the string aaabbb is
〈S, λ〉 ⇒ a〈S, λ〉b⇒ aa〈S, λ〉bb⇒ aaa〈S, λ〉bbb⇒ aaabbb
Note that the only bindings to the variables which yields into a useful rule are v0 to S
and v1 to λ, in this example. The only possible value to v0 is S because the restriction
imposed by RAG, as discussed above, and any value bounded to v1 different from λ
results in a nonterminal, in the rule 〈S, λ〉 ← a〈S, v1〉b, that is impossible to rewrite in
the derivation process.
For defining more complex languages, a RAG designer must understand well the
algebraic machinery defined for RAG and uses nonterminals as n-arity operators and
several pre-defined operators. Although the main idea of RAG is the simple substitu-
2.2. Adaptable Models 21
tion of variables to make rules, the operators that must be defined to create complex
languages can become a hard task to create RAGs.
Stansifer and Wand [2011] define Parsing Reflective Grammars , whose purpose is
to give a formal definition of the syntax of extensible languages. New rules are added
to a parsing reflective grammar through a special meta-symbol, R, which indicates the
point that begins an extension of the grammar. The designer of the language specifies
the points of extension using this special symbol on the right hand side of any rule.
The set of the rules is manipulated in a declarative way, since there is no order imposed
on derivation. However, the model is very limited and imposes a special symbol before
the insertion of new rules. Moreover, a power model capable of defining the syntax of
extensible languages is also capable to define static semantics by changing the grammar,
and Parsing Reflective Grammars are not powerful enough for this task.
2.2.3 Discussion
Christiansen [1990] enumerates several difficulties of adaptable models for describing
static semantics of programming languages: removing rules at block exit; delayed or
indirect declarations; preventing multiple declarations. These concerns are also relevant
for extensible languages, because they, in general, deal with scope of the new rules and
it is important to describe languages as libraries, such as in SugarJ [Erdweg et al.,
2011], Fortress [Allen et al., 2008] and Racket [Tobin-Hochstadt et al., 2011].
Imperative adaptable models have problems to deal with scope, because the gram-
mar is global. Some imperative models, such as Wegbreit’s model or Evolving Grammar
of Cabasino, do not allow rule removals, which makes it impossible to work with scope.
The strategy of the declarative model of Christiansen, where each grammar adaptation
is restricted to a specific branch of the parse tree, makes easy to define scope depen-
dent relations. RAG of Shutt also has these same properties of Christiansen Adaptable
Model and can check for multiple declarations, which is very difficult in Christiansen
Adaptable Model.
As Christiansen and Shutt declarative models deal well with scope, they seem
better than the others presented here for describing extensible languages. The Shutt
model, however, is difficult to learn and to use it well, it is necessary a high level of
mathematics background. When working with languages that allow the user to add
new rules on the language grammar, it also important to deal with lexical extensions.
All adaptable models studied do not treat lexical extensions. Also, we do not have any
evidence indicating that parsers based on the adaptable models studied are reasonable
efficient to define the syntax of real languages. Indeed, Erdweg et al. [2011] suggest
22 Chapter 2. Toward new Models to Describe Syntax
Figure 2.12 PEG for the language {anbncn | n ≥ 0}
S ← !X a∗ B !.
A ← a A b / λ
B ← b B c / λ
X ← !(A !b)
adaptable models as an alternative to implement the SugarJ language, but questioned
their efficiency. The model developed in this work incorporates several ideas of Chris-
tiansen’s Adaptable Grammars.
2.3 Parsing Expression Grammar
In a generative system, the syntax of languages is defined by a set of rules that are
applied recursively to generate strings. In contrast, a recognition system has rules and
predicates that decide if the string belongs to the language. To illustrate it, Ford [2004]
gives an example of a language of a’s whose length is even. In a generative system
style, this language is defined as {(aa)n | n ≥ 0} and in a recognition system style as
{s ∈ a∗ | (|s| mod 2 = 0)}. The former definition generates strings by concatenation of
two a’s , while the latter tests if a string of a’s is in the language by checking whether
the rest of the division of its lenght by two is zero.
Ford [2004] argues that the most practical language applications in computer
science involve recognition and structural decomposition of strings, however, the formal
description of languages uses generative models instead of recognition. Ford developed
Parsing Expression Grammar (PEG) to fill this gap.
PEG is a new model for describing the syntax of programming languages, which
uses a recognition-based approach instead of a generative system. Similarly to CFG,
formally a PEG is a 4-tuple (VN , VT , R, S) where VN , VT and S are as in a CFG and
R is a rule function which maps nonterminals to parsing expressions .
A parsing expression is very similar to the right hand side of a CFG in the
extended Backus-Naur form with the addition of new operators, the not-predicate (!)
and the prioritized choice (/). The not-predicate operator checks if the string matches
some syntax without consuming the input and the prioritized choice lists alternative
patterns to be tested in order. As an example, Figure 2.12 shows a PEG that recognizes
the language {anbncn | n ≥ 0}.
In order to understand how PEG works, let’s see the recognition of the string
aabbcc. The recognition process starts with the initial parsing expression S and the
2.3. Parsing Expression Grammar 23
entire input aabbcc. As S is a nonterminal, it is replaced by the corresponding parsing
expression using the function R. So we have the pair (!X a∗ B !., aabbcc), representing
the current parsing expression and the current input to be recognized. We use the
notation B
A
from proof trees to express that, to satisfy A, one must first satisfy B.
Then, our first goal is
(!X a∗ B !., aabbcc)
(S, aabbcc)
The parsing expression !X a∗ B !. is a sequence of simpler parsing expressions,
therefore we try to recognize the input following each smaller parsing expression, in
order. So, first, we try to parse the input aabbcc with the parsing expression !X, and,
next, the remaining input with the other parsing expression, a∗ B !.. We increment our
notation to (e, i) ⇒ o which says that the parsing expression e matches or consumes
the prefix o of the input string i in the recognition process, then we can represent our
goal as
(!X, aabbcc)⇒ o1 (a∗ B !., x)⇒ o
(!X a∗ B !., aabbcc)⇒ o1 o
Note that the input aabbcc must be the concatenation of the prefix consumed by the
parsing expression !X, o1, with it remaining, x.
The symbol ! is the not-predicate operator and the parsing expression !X will suc-
ceed in this input if the parsing expression X fails in the same input. The not-predicate
only tests if the input matches the syntax and does not consume input symbols, so the
output of the not-predicate is the empty string λ. Our proof tree representing the
recognition process improves to
(X, aabbcc)⇒ f
(!X, aabbcc)⇒ λ (a∗ B !., aabbcc)⇒ o
(!X a∗ B !., aabbcc)⇒ o
where the symbol f represents a failure.
Next, to reach the goal (X, aabbcc) ⇒ f , we replace X by its corresponding
parsing expression !(A !b). Again, we have the not-predicate operator, then the parsing
expression A !b must succeed and consume some prefix of the input, then the overall
parsing expression with the not-predicator operator will fail. The following proof tree
shows this process
(A !b, aabbcc)⇒ o2
(!(A !b), aabbcc)⇒ f
(X, aabbcc)⇒ f
24 Chapter 2. Toward new Models to Describe Syntax
The goal (A !b, aabbcc)⇒ o2 has a sequence of simpler parsing expressions, then
it is divided into A and !b. Next, we try to match A and replace it by its parsing
expression a A b / λ, resulting in the following proof tree
(a A b / λ, aabbcc)⇒ o1
(A, aabbcc)⇒ o1 (!b, i1)⇒ o
(A !b, aabbcc)⇒ o2
The parsing expression a A b / λ uses the prioritized choice operator, /. Then,
we try to match the input with the first choice a A b and the second is tried only if the
first reaches a failure. As the first parsing expression is a sequence, the proof tree for
this goal is
(a, aabbcc)⇒ a (A b, abbcc)⇒ o1
(a A b, aabbcc)⇒ ao1
(a A b / λ, aabbcc)⇒ ao1
Note that when the parsing expression is a terminal, it succeeds only if the first symbol
of the input is the same terminal and consumes it, otherwise fails. The interpretation
(a, aabbcc)⇒ a shows this. The same idea explained above is used to parse the branch
(A b, abbcc)⇒ o1, resulting in the following proof tree
(a, abbcc)⇒ a
(a, bbcc)⇒ f
(a A b, bbcc)⇒ f (λ, bbcc)⇒ λ
(a A b /λ, bbcc)⇒ λ
(A, bbcc)⇒ λ (b, bbcc)⇒ b
(A b,bbcc)⇒ b
(a A b, abbcc)⇒ ab
(a A b /λ, abbcc)⇒ ab
(A, abbcc)⇒ ab (b, bcc)⇒ b
(A b, abbcc)⇒ abb
Observe that, as the first symbol of the input is b in (a, bbcc)⇒ f , the parsing expression
fails and then the parsing expression a A b, which is a sequence, also fails. Therefore,
the second alternative of a A b /λ is tested and it succeeds, because the parsing
expression λ always succeeds and does not consume any symbol of the input. As a
result, the parsing expression A b does not fail and consumes the prefix abb of the
input.
Now, we can finish the proof tree of the parsing expression a A b /λ on input
aabbcc, which succeeds and consumes the prefix aabb of the input. Below, we show the
proof tree.
2.4. Data-Dependent Grammar 25
(a, aabbcc)⇒ a (A b, abbcc)⇒ abb
(a A b, aabbcc)⇒ aabb
(a A b / λ, aabbcc)⇒ aabb
So, the proof tree of (A !b, aabbcc)⇒ o2 is
(a A b / λ, aabbcc)⇒ aabb
(A, aabbcc)⇒ aabb (!b, cc)⇒ λ
(A !b, aabbcc)⇒ aabb
The (!b, cc)⇒ λ succeeds because the first symbol of the input is c and the expression
b fails, so the not-predicate succeeds and does not consume any symbol of the input.
Therefore, the proof tree of the start expression is showed below
(A !b, aabbcc)⇒aabb
(!(A !b), aabbcc)⇒f
(X, aabbcc)⇒f
(!X, aabbcc)⇒λ
(a∗, aabbcc)⇒ aa
...
(B, bbcc)⇒ bbcc (!., λ)⇒ λ
(B !., bbcc)⇒ bbcc
(a∗ B !., aabbcc)⇒ aabbcc
(!X a∗ B !., aabbcc)⇒ aabbcc
(S, aabbcc)⇒ aabbcc
The branch (a∗, aabbcc) ⇒ aa matches aa because the operator star is greedy and
matches all a’s until reaching the first b. Similar to the nonterminal A, B recognizes
a sequence of b’s followed by c’s of the same length, so (B !., bbcc) ⇒ bbcc. PEG does
not require that the entire input is consumed in order to recognize it. It is possible
to succeed matching only a prefix of the input. So the parsing expression !. explicitly
tests if there are additional symbols or if it is the end of the input (the dot symbol
matches any input symbol).
In the example above, the language recognized, {anbncn | n ≥ 0}, is not context-
free and, therefore, there is not a CFG that generates this language. Although PEG
can recognize languages that are not context-free, left recursion is a problem in PEG,
producing expressions that recognize no string at all. For example, the parsing expres-
sion A a / λ does not recognize any string. This happens because the alternative is
prioritized.
2.4 Data-Dependent Grammar
Different from almost all formalisms presented here, which have been designed to for-
mally describe the syntax of languages, YAKKER was developed to be a better tool
for parser generation [Jim et al., 2010]. Jim et al. [2010] argue that there are several
applications that require parser generated by hand, because parser generators are not
26 Chapter 2. Toward new Models to Describe Syntax
suitable for them. Jim et al. [2010] give as example the SSL message data format and
HTML 5.0 where the designers tried to use a parser generator, but they were forced to
abandon this idea and built a parser by hand.
In their paper, Jim et al. [2010] present some languages that require special
features usually not implemented by parser generators. These examples involve data
formats, networking and web systems. The authors describe what are the required
features and present the formalization of YAKKER. YAKKER is a extension of CFG
with the addition of the following features:
• Regular right hand side. The right hand side of a rule is a regular expression.
So, the rules of the YAKKER are a function that maps nonterminals to regular
expressions;
• binding . YAKKER allows the storage of portions of the generated string in
variables and uses this information in the derivation process;
• semantic predicates and constraints . The grammar also allows the definition of
constraints which it is possible to test properties on the variables;
• parameterized nonterminals . The nonterminals may have one parameter, which
acts like an inherited attribute. This parameter can be combined with predicates
and variables to implement context-sensitive aspects.
We show these features by an example. Jim et al. [2010] give an example of a data
language where the length of the data comes before the data. Below, we show a
simplified version
literal8→ { x := number (+ | λ) } {n := string2int(x); }
([n > 0] OCTET {n := n− 1})∗ [n = 0]
The terminal symbols are the braces in bold and the plus symbol. The names number
and OCTET are nonterminals that generate a decimal number and a number in octal,
respectively. This example illustrates several features of YAKKER. First, the variable
x binds to the string derived by the nonterminal number, then the semantic predicate
uses the string value of x to convert it into a number (function string2int) and binds
the result to the variable n. The semantic predicates are in braces. The value of
2.5. Conclusion 27
the variable n is used in the constraint (within square brackets) to generate an octal
number with n digits.
Another feature of YAKKER is parameterized nonterminals, allowing nontermi-
nals to have one parameter. This parameter is similar to an inherited attribute and
Jim et al. [2010] claim that it can be used to support modular reuse. For example, a
fixed-width string may be defined as
stringFW (n)→ ([n > 0] CHAR {n := n− 1})∗ [n = 0]
2.5 Conclusion
In this chapter, we have presented a survey on formalisms for describing the syntax of
programming languages. The main motivation of these formalisms are the description
of the full syntax of programming languages, including context-sensitive features. The
models presented in Section 2.1 extend CFGs with facilities for expressing context-
sensitive aspects of programming languages. The most successful model is AG and
its extension EAG. They have been used for specifying the syntax of programming
languages and in automatic compiler generation.
However, AG, EAG and the other models presented in Section 2.1 require a global
and immutable grammar. Therefore, they are not suitable for describing the syntax
of languages which require changes on their own set of rules during parsing, let alone
for automatic parser generation for these languages. Section 2.2 surveys adaptable
grammars, which are models with explicit manipulation of the grammar set of rules
and that can handle these kinds of languages.
We have seen no evidence that the adaptable grammars studied are good for effi-
cient parser generator. As an example, Erdweg et al. [2011] cite adaptable grammars as
an alternative to implement the parser of the language SugarJ, but they question their
efficiency. Adaptable grammars were initially proposed to give only a syntax treatment
for context-sensitive dependencies of programming languages. The idea came from the
observation that context-sensitive features of programming languages add new syntac-
tic constructions to the language. However, adaptable grammars have been around
at least 30 years and they have not been used in the definition of real programming
languages. There are two reasons for that. First, there are some static semantic which
is difficult to define with adaptable grammars, as checking for multiple declaration of
variables, although this task can be easily accomplished using AGs and a symbol table.
A second reason regards their unfitness for automatic parser generation and the low
efficiency of parsers based on adaptable grammars.
28 Chapter 2. Toward new Models to Describe Syntax
Sections 2.3 and 2.4 presented PEG and YAKKER, which are new models for
describing programming languages. Although PEG and YAKKER are not suitable
for languages which modify themelves during parsing, they have been designed for
parsing programming languages and have added features to deal with data, web and
networking languages. The parsing algorithm for PEG and YAKKER are scannerless
and have other features related to new requirements for parsing. Thus, we based our
model, APEG, on PEG and Christiansen’s Adaptable Grammar. We also borrow some
features of YAKKER. Our decision to choose PEG as the base model is based on some
standard features of PEG that are helpful for the description of context dependency,
and for the efficiency of the implementation:
• PEG defines operators for checking an arbitrarily long prefix of the input, with-
out consuming it. We show in Chapter 5 that this feature may allow a simple
solution for specifying forward reference, which is important to define the Fortress
language;
• The choice operator of PEG is ordered, giving more control of which alternative
will be used. Our implementation relies on this feature to provide an implemen-
tation that does not need to build complex parsing tables whenever a rule is
updated, favouring efficiency.
Chapter 3
Adaptable Parsing Expression
Grammar
In this chapter, we present the formalization of our proposed model, Adaptable Parsing
Expression Grammar (APEG).The design goals of the model are:
• it must be expressive enough for describing the syntax of any extensible language;
• it must favour legibility and simplicity; and
• it must be suitable for automatic generation of efficient syntactic analyzers.
APEG is an adaptable model based on PEG. It also uses the idea of attributes
of AGs [Watt and Madsen, 1983] to guide parsing. APEG are L-attributed and their
purposes are syntactic and not semantic as in AGs. The adaptability of APEG is pro-
vided in a way similar to Christiansen Adaptable Grammar. So, APEG also possesses
a special attribute called language attribute, which represents the set of rules that are
currently used. Language attribute values can be defined by means of embedded se-
mantic actions and can be passed to different branches of the parse tree. This allows
a formal representation of a set of syntactic rules that can be modified on-the-fly, i.e.,
during the parsing of an input string.
We have decided to define APEG according to these formalisms based on some of
the reasons explained below. PEG has several features that facilitate parsing program-
ming languages, such as unlimited lookahead, it is a recognition-based approach and it
is scanerless. PEG also has a prioritized operator which allows free manipulation of the
production rules without the insertion of undesirable ambiguities. Attribute systems
have been used in compiler construction and language design, and they have shown to
be a simple and elegant system with well known solutions for several context-sensitive
29
30 Chapter 3. Adaptable Parsing Expression Grammar
Figure 3.1 Attribute PEG for binary numbers
〈S ↑ x0〉 ← 〈T ↑ x0〉
〈T ↑ x0〉 ← 〈B ↑ x0〉 (〈B ↑ x1〉 [x0 = 2 ∗ x0 + x1])∗
〈B ↑ x1〉 ← (0 [x1 = 0]) / (1 [x1 = 1])
problems of programming languages. So, the adaptablility of the model can be com-
bined with the attribute system to define real extensible programming languages.
The remaining of this chapter explains APEG. In order to understand the for-
mal definition of the model, it is necessary to know how attributes are evaluated and
how constraints over them can be defined. First, we discuss our design decisions on
how to combine PEG and attributes, called Attribute Parsing Expression Grammar
or Attribute PEG (Section 3.1) and, next, we present how adaptability is achieved in
the model (Section 3.2). In Section 3.3, we illustrate how the model works defining
interesting features of programming languages using APEG. We end the chapter with
the conclusions, in Section 3.4.
3.1 Attribute Parsing Expression Grammar
Attribute Parsing Expression Grammar (Attribute PEG) is an extension of PEG with
attributes. The Attribute PEG uses the same notation of EAG and also includes
explicit evaluation rules. Attribute PEG includes explicit rules for manipulating the
attributes because the attribute expressions embedded in the rules are not powerful
enough to replace all uses of explicit evaluation rules in PEGs. In PEGs, the use
of recursion is frequently replaced by the use of the repetition operator “*”, giving
definitions more closer to an imperative model. Therefore, we allow that evaluation
rules in Attribute PEGs to update the values of the attribute variables, treating them
as variables of an imperative language.
As an example for an informal introduction, Figure 3.1 shows an Atribute PEG
which recognizes binary numbers and calculates their value. Expressions in brackets
are explicit evaluation rules. In the third line, each of the options of the ordered choice
has its own evaluation rule, defining that the value of the variable x1 is either 0 (if
the input is “0”) or 1 (if the input is “1”). It is not possible to replace these evaluation
rules with attribute expressions, in the same way of EAGs, because the options are
defined in a single parsing expression and it uses ordered choices. In the second line,
the value of variable x0 is initially defined on the first use of the nonterminal B. Then
it is cumulatively updated by the evaluation rule [x0 = 2 ∗ x0 + x1].
In Attribute PEGs, we also allow the use of constraints, as predicates defined in
3.1. Attribute Parsing Expression Grammar 31
any position on the right hand side of a rule. If a predicate fails, the evaluation of
the parsing expression also fails. The use of attributes as variables of an imperative
language and predicate evaluation are similar to the approach adopted for the formal
definition of YAKKER. We also include the “bind expression”, featured defined by
YAKKER, in Atribute PEG, that allows binding a prefix of the input string to a
variable. This allows concise specification of languages.
Another improvement provided by EAG is the possibility of using the same at-
tribute variable in more than one defining position in a rule. Repeated occurence of
the variable name of an attribute defines an implicit constraint, requiring the variable
to have the same value in all instances. In our proposition for Attribute PEG, we
do not adopt this last improvement of EAG, because it would not be consistent with
our design decision of allowing attributes to be updated as variables of an imperative
language. In the following, we formally define the syntax of Attribute PEG.
Definition 1 (Attribute System). An attribute system is a 4-tuple (VN , A, VA, F ),
where VN is a set of nonterminal names, A is a function that maps a symbol of VN to
N× N, VI is a set of variable names disjoint from VN , and F is a set of functions.
An attribute system defines a specification of the attributes of every nonterminal,
defining the numbers of inherited and synthesized attributes each nonterminal has, in
a definition of an Attribute PEG. For example, the definition of the function A for the
Attribute PEG in the example of binary numbers presented in Figure 3.1 is
A(S) = (0, 1)
A(T ) = (0, 1)
A(B) = (0, 1)
which says that every nonterminal has only one synthesized attribute. If the nonter-
minal S had two attributes where one is inherited and another synthesized, then the
definition of the function A for S would be A(S) = (1, 1). We also define arity(N) as
the number of attributes of a nonterminal N , i.e., the sum of the two numbers of the
tuple A(N).
From an attribute system, we also define a simple, untyped language of attribute
expressions , which are expressions over the attribute domains, using functions from
the set F and variables. The attribute expression language includes variables from VA,
boolean, integer and string literals. If f ∈ F is a function of arity n and ε1, . . . , εn are
attribute expressions, then f(ε1, . . . , εn) is also an attribute expression.
32 Chapter 3. Adaptable Parsing Expression Grammar
The representation of a nonterminal in Attribute PEG is done in angular brackets
using the name of the nonterminal followed by a list of attribute expressions with a
symbol to identify if the attribute is inherited (↓) or synthesized (↑). As an example,
the nonterminal T , which has only one synthesized attribute, is used at the rule of the
nonterminal S in Figure 3.1, represented by 〈T ↑ x0〉 (the nonterminal name followed
by the up arrow and one attribute expression, x0).
Definition 2 (Parsing Expression). Suppose an attribute system (VN , A, VA, F ), and
a set of terminals, VT , disjoint from VN ∪ VA. Let e, e1 and e2 be parsing expressions,
εi attribute expressions, † an element of the set {↓, ↑} and ϑ ∈ VA a variable name.
Then, the set of valid parsing expressions (Pe) is recursively defined as:
λ ∈ Pe (empty expression)
a ∈ Pe, for every a ∈ VT (terminal expression)
〈A † ε1 . . . † εp〉 ∈ Pe, for every A ∈ VN
and p = arity(A)
(nonterminal expression)
e1e2 ∈ Pe (sequence expression)
e1/e2 ∈ Pe (ordered choice expression)
e∗ ∈ Pe (zero-or-more repetition expression)
!e ∈ Pe (not-predicate expression)
[ϑ = ε] ∈ Pe (update expression)
[ε] ∈ Pe (constraint expression)
ϑ = e ∈ Pe (bind expression)
The first seven types of parsing expressions are standard to PEG. Only nonter-
minal expressions are different because of the use of attributes. To this set of parsing
expressions , we add three new types of expressions. Update expressions have the for-
mat [ϑ = ε], where ϑ is a variable name and ε is an attribute expression. They are
used to update the value of variables in an environment. Constraint expressions with
the format [ε], where ε is an attribute expression that evaluates to a boolean value,
are used to test for predicates over the attributes. Bind expressions have the format
ϑ = e, where ϑ is a variable and e is a parsing expression. It assigns to ϑ the part of
the input that the parsing expression e matches. If there is no match, the variable ϑ
is unbounded.
Definition 3 (Attribute PEG). Let H be the set of all n-tuples of attribute expressions.
An Attribute PEG is a 7-tuple (VN , VT , A, R, S, VA, F ), where VN and VT are finite sets
of nonterminals and terminals, respectively. (VN , A, VA, F ) forms an attribute system
3.1. Attribute Parsing Expression Grammar 33
and R : VN → Pe ×H is a total rule function which maps every nonterminal name to
a pair of a parsing expression and the attribute expressions of the nonterminal. S is
the initial parsing expression.
Figure 3.1 shows the rule function in a sugared syntactic notation. In this nota-
tion, we put together the nonterminal and its attributes on the left side and the parsing
expression on the right side. We will use this notation throughout the text. The ex-
ample of Figure 3.1 can be expressed formally as G = ({S, T,B}, {0,1}, {(S, (0, 1)),
(T, (0, 1)), (B, (0, 1))}, R, S, {x0, x1}, {+, ∗}), where R is formally defined as
R(S) = (〈T ↑ x0〉, x0)
R(T ) = (〈B ↑ x0〉(〈B ↑ x1〉 [x0 = 2 ∗ x0 + x1])∗, x0)
R(B) = ((0 [x1 = 0]) / (1 [x1 = 1]), x1)
We make the restriction that the attribute expressions in defining positions (in-
herited attributes on the left side of a rule and synthesized attributes on the right side
of a rule) are always represented by a single variable. Without loss of generality, we
will also assume that all inherited attributes are represented in a nonterminal before
its synthesized attributes. So, suppose that e is the parsing expression associated with
nonterminal A, p is its number of inherited attributes and q the number of synthesized
attributes. Then 〈A ↓ ϑ1 ↓ ϑ2 . . . ↓ ϑp ↑ ε1 ↑ . . . ↑ εq〉 ← e represents the rule for A
and its attributes.
In this thesis, we use the following convention for attribute expressions, parsing
expressions, strings and values:
• the Greek letters ε, ϑ and κ represent attribute expressions. The letters ϑ and κ
are used only in defining positions, which must be a variable name;
• the letter e represents parsing expressions;
• the letters x, y and w represent strings; and
• letters v and u are used for the semantic domain of attribute expressions, i.e.,
representing the value of the evaluation of an attribute expression.
All symbols above may be decorated with prime and/or an integer index. We also use
the simplified notation ~vn to denote a sequence v of n expressions. As defining positions
of a rule are represented by the letters ϑ and κ and applying positions by the letters ε,
the 〈A ↓ ϑ1 ↓ . . . ↓ ϑp ↑ ε1 ↑ . . . ↑ εq〉 ← e may be replaced by 〈A ↓ ~ϑp ↑ ~εq〉 ← e. On
the other hand, if the same symbol is used in the right side of a rule, then the notation
34 Chapter 3. Adaptable Parsing Expression Grammar
for a nonterminal expression 〈A ↓ ε′1 ↓ . . . ↓ ε′p ↑ κ1 ↑ . . . ↑ κq〉 may be replaced by
〈A ↓ ~ε′p ↑ ~κq〉.
3.2 Adaptable Parsing Expression Grammar
The adaptability of APEGs is achieved by means of an attribute associated with every
nonterminal, representing the current grammar. Therefore, syntactically, APEG is
an Attribute PEG in which the first attribute of all nonterminals is inherited and
represents the language attribute (the set of parsing expression rules that may be used).
In this section, we define the semantics of APEG. For this purpose, first we define the
notion of an environment.
Definition 4 (Environment). An environment is a function that maps variables to
values. The following notation is used for constructing environments:
• . (a dot) represents an empty environment, i.e., all variables map to the unbound
value;
• [ϑ1/v1, . . . , ϑn/vn] maps ϑi to vi, 1 ≤ i ≤ n, and other variables to unbound;
• E[ϑ1/v1, . . . , ϑn/vn] is an environment which is equal to E, except for the values
of ϑi that map to vi, 1 ≤ i ≤ n.
We write E[[ε]] to indicate the value of the attribute expression ε evaluated in
the environment E. For example, suppose an environment E = [x/5; y/7; z/2],
then E[[x]] is equal to 5 and E[[x ∗ z + y]] is equal to 17. If we define the environment
Y = E[x/3; w/6], then Y [[x]] is equal to 3 and Y [[w ∗ x+ z]] is 20. Note that the
evaluation of a variable that is not defined results in the unbound value, for example
E[[w]] = unbound.
Figures 3.2 and 3.3 present the semantics of an APEG. Almost all the for-
malization is related to Attribute PEG. Only the last equation in Figure 3.3 defines
adaptability. Figures 3.2 and 3.3 define the judgement E ` (e, x) ⇒ o ` E ′, which
says that the interpretation of the parsing expression e, for the input string x, in an
environment E, results in o, and produces a new environment E ′. The result o ∈ V ∗T
indicates the prefix of x that is consumed, if the expression succeeds, or f 6∈ V ∗T , if it
fails.
The equations in Figure 3.2 describe the semantics of regular PEG operations,
and their relation to environments. Note that changes in an environment are discarded
when an expression fails. For example, in a sequence expression, a new environment
3.2. Adaptable Parsing Expression Grammar 35
Figure 3.2 Semantics of Adaptable PEG - Part I.
E ` (e, x)⇒ o ` E′
Empty
E ` (λ, x)⇒ λ ` E a ∈ VTTerm
E ` (a, ax)⇒ a ` E
a, b ∈ VT a 6= b¬Term1
E ` (a, bx)⇒ f ` E
a ∈ VT¬Term2
E ` (a, λ)⇒ f ` E
E1 ` (e1, x1x2y)⇒ x1 ` E2 E2 ` (e2, x2y)⇒ x2 ` E3Seq
E1 ` (e1e2, x1x2y)⇒ x1x2 ` E3
E1 ` (e1, xy)⇒ x ` E2 E2 ` (e2, y)⇒ f ` E3¬Seq1
E1 ` (e1e2, xy)⇒ f ` E1
E ` (e1, x)⇒ f ` E′¬Seq2
E ` (e1e2, x)⇒ f ` E
E ` (e1, xy)⇒ x ` E′Choice1
E ` (e1/e2, xy)⇒ x ` E′
E ` (e1, xy)⇒ f ` E1 E ` (e2, xy)⇒ x ` E2Choice2
E ` (e1/e2, xy)⇒ x ` E2
E ` (e1, x)⇒ f ` E1 E ` (e2, x)⇒ f ` E2¬Choice
E ` (e1/e2, x)⇒ f ` E
E1 ` (e, x1x2y)⇒ x1 ` E2 E2 ` (e∗, x2y)⇒ x2 ` E3Rep
E1 ` (e∗, x1x2y)⇒ x1x2 ` E3
E1 ` (e, x)⇒ f ` E2¬Rep
E1 ` (e∗, x)⇒ λ ` E1
E ` (e, xy)⇒ x ` E′
Neg
E ` (!e, xy)⇒ f ` E′
E ` (e, x)⇒ f ` E′¬Neg
E ` (!e, x)⇒ λ ` E′
36 Chapter 3. Adaptable Parsing Expression Grammar
Figure 3.3 Semantics of Adaptable PEG - Part II.
E ` (e, x)⇒ o ` E′
v = E[[ε]]
Atrib
E ` ([κ = ε], x)⇒ λ ` E[κ/v]
unbound = E[[ε]]¬Atrib
E ` ([κ = ε], x)⇒ f ` E
E ` (e, xy)⇒ x ` E′
Bind
E ` (κ = e, xy)⇒ x ` E′[κ/x]
E ` (e, x)⇒ f ` E′¬Bind
E ` (κ = e, x)⇒ f ` E
true = E[[ε]]
True
E ` ([ε], x)⇒ λ ` E
false = E[[ε]]
False
E ` ([ε], x)⇒ f ` E
〈A ↓ ϑ1 ↓ . . . ↓ ϑp ↑ ε′1 ↑ . . . ↑ ε′q〉 ← e ∈ E[[ε1]], where E[[ε1]] ≡ language attribute
vi = E[[εi]], 1 ≤ i ≤ p uj = E′[[ε′j ]], 1 ≤ j ≤ q
[ϑ1/v1, . . . , ϑp/vp] ` (e, x)⇒ o ` E′Adapt
E ` (〈A ↓ ε1 ↓ . . . ↓ εp ↑ κ1 ↑ . . . ↑ κq〉, x)⇒ o ` E[κ1/u1, . . . , κq/uq]
is computed when it succeeds, a situation represented by rule Seq. If the first or
the second subexpression of a sequence expression fails, the changes are discarded and
the environment used is the one before the sequence expression. These situations are
represented by rules ¬Seq1 and ¬Seq2. A similar behaviour is defined for ¬Term1
and ¬Term2, when a terminal expression fails, for ¬Rep, when a repetition fails, and
for Choice2, when the first alternative of a choice fails. Rules Neg and ¬Neg are the
only one that change the enviroment in the outer level, even in a fail situation. The
change in the Neg rule allows arbitrary lookahead, calculating some information about
the prefix and using it later. To allow simulating positive predicate which changes in
the enviroment (similar to the semantics of the not-predicate), we allow the ¬Neg
rule to change the enviroment. Although this decision enables a collateral effect in the
outer level of a not-predicate, this effect is not propagated outside a sequence parsing
expression because the other rules do not allow changing in the environment in a fail
situation.
The equations in Figure 3.3 describe the semantics of the operations that our
model added to the PEG model, namely update, constraint and bind expressions,
and also the semantics of the evaluation of nonterminal expressions . Rules Atrib and
¬Atrib define the behaviour for update expression, and rulesTrue and False represent
predicate evaluation in constraint expressions. Rules Bind and ¬Bind match the
prefix of the input with the given parsing expression and store this prefix in a variable,
if the match succeeds. The most interesting rule is Adapt. It defines how nonterminal
expressions are evaluated, considering attributes and also the current set of production
3.3. APEG in Action 37
rules, represented by the language attribute. Attribute values are associated with
variables using an approach similar to EAG, but in a way more operational; it is also
similar to parametrized nonterminals of YAKKER described in [Jim et al., 2010], but
allowing several return values instead of just one. When a nonterminal is processed, the
value of its inherited attributes are calculated considering the current environment. The
corresponding parsing expression is fetched from the current set of production rules,
defined by the language attribute, which is always the first attribute of the symbol. In
rule Adapt, the language attribute is represented by expression e1, and Re1 represents
the map from nonterminals to parsing expressions from the adaptable PEG defined
by e1. Rule Adapt is the only point in all the rules of Figures 3.2 and 3.3 that is
associated with the property of adaptability.
We define the language accepted by an APEG as follows. Let G = (VN , VT , A,
R, S, VA, F ) be an APEG. Then
L(G) = {w ∈ V ∗T | . ` (〈S ↓ G〉, w)⇒ w′ ` E ′ ∧ w′ 6= f}
The derivation process begins using an empty environment, with the starting parsing
expression S matching the input string w. The original grammar G is used as the value
for the inherited language attribute of S. If the process succeeds, w′ is the prefix of w
matched and E ′ is the resulting environment. The language L(G) is the set of words
w that do not produce f (failure).
3.3 APEG in Action
In this section, we present two examples of use of APEG. The first example is a defini-
tion of context-dependent constraints commonly required in binary data specification.
The second illustrates a specification of static semantics of programming languages.
3.3.1 Data Dependent Languages
As a motivating example of a context-sensitive language specification, Jim et al. [2010]
present a data format language in which an integer number is used to define the length
of the expression that follows it, between brackets. For instance, a valid sentential form
is “6[abcdef]”.
Figure 3.4 shows how a similar language may be defined in an Attribute (non
adaptable) PEG. The nonterminal number has a synthesized attribute, whose value is
used in the constraint expression that controls the length of text to be parsed in the
38 Chapter 3. Adaptable Parsing Expression Grammar
Figure 3.4 An Attribute PEG for a data dependent language.
〈literal〉 ← 〈number ↑ n〉 "[" 〈strN ↓ n〉 "]"
〈strN ↓ n〉 ← ([n > 0] CHAR [n = n− 1])∗ [n == 0]
〈number ↑ x2〉 ← 〈digit ↑ x2〉 (〈digit ↑ x1〉[x2 = x2 ∗ 10 + x1])∗
〈digit ↑ x1〉 ← 0 [x1 = 0] / 1 [x1 = 1] / . . . / 9 [x1 = 9]
Figure 3.5 Using adaptability in the PEG of Figure 3.4.
〈literal ↓ g〉 ← 〈number ↓ g ↑ n〉 "["
[g1 = g ⊕ rule(“〈strN ↓ g〉 ←” + rep(“CHAR ”, n))]
〈strN ↓ g1〉 "]"
sequel. In line 2, the terminal CHAR represents any single character. The value of the
inherited attribute n is used in the update expression [n = n − 1], inside a repetition
expression. Each time a character is read, the value of n is decreased. There are
also two constraint expressions. The first one checks if the value of n is non negative,
inside the repetition expression, ensuring that the loop will end when n is zero. The
second constraint expression, after the repetition expression, ensures that the number
of characters read is exactly the value initially set to n, when 〈strN ↓ n〉 was evaluated.
Using features from APEG in the same language, we could replace the first two
rules of Figure 3.4 by the rule in Figure 3.5. In an APEG, every nonterminal has the
language attribute as its first inherited attribute. The attribute g of the start symbol is
initialized with the original APEG, but when nonterminal strN is used, a new grammar
g1 is considered. The symbol “⊕” represents an operator for adding rules to a grammar
and function rep produces a string repeatedly concatenated. The formalization of
these functions, specialy how to change an existing grammar, is discussed in Chapter 4
which treats about the implementation of the APEG model. Using these functions,
g1 will be equal to g together with a new rule that indicates that strN can generate
a string with length n. For example, if n is 3, then the following rule will be added:
〈strN ↓ g〉 ← CHAR CHAR CHAR.
3.3.2 Static Semantics
Figure 3.6 presents a PEG definition of a toy block structured language in which a block
consists of a list of declarations of integer variables, followed by a list of assignment
statements. An assignment statement consists of a variable on the left side and a
variable on the right side. For simplicity, the whitespaces are not considered.
Suppose that the context dependent constraints are: a variable cannot be used
3.3. APEG in Action 39
Figure 3.6 Syntax of block with declaration and use of variables (simplified).
block ← { dlist slist } decl ← int id ;
dlist ← decl decl∗ stmt ← id = id ;
slist ← stmt stmt∗ id ← alpha alpha∗
Figure 3.7 Adaptable PEG for declaration and use of variables.
〈block ↓ g〉 ← { 〈dlist ↓ g ↑ g1〉 〈slist ↓ g1〉 }
〈dlist ↓ g ↑ g1〉 ← 〈decl ↓ g ↑ g1〉 [g = g1] (〈decl ↓ g ↑ g1〉 [g = g1])∗
〈decl ↓ g ↑ g1〉 ← !( int 〈var ↓ g〉 ) int 〈id ↓ g ↑ n〉 ;
[g1 = g ⊕ rule(“〈var ↓ g〉 ← #n !〈alpha ↓ g〉”)]
〈slist ↓ g〉 ← 〈stmt ↓ g〉 〈stmt ↓ g〉∗
〈stmt ↓ g〉 ← 〈var ↓ g〉 = 〈var ↓ g〉 ;
〈id ↓ g ↑ s〉 ← s = (〈alpha ↓ g〉 〈alpha ↓ g〉∗)
〈alpha ↓ g〉 ← a / . . . / Z / 0 / . . . / 9 / _
if it has not been declared before, and a variable cannot be declared more than once.
The APEG in Figure 3.7 implements these context-dependent constraints.
In this example, the idea of implementing the context-dependent constraints is
to adapt the nonterminal var on the fly in order to allow only declared variables to
be recognized. Note that, in the beginning, the nonterminal var does not recognize
any symbols (there is no definition to it at the beginning). However, when a variable
is declared (nonterminal decl), a new grammar rule is produced by the addition of a
new choice in the definition of nonterminal var, which allows the recognition of the
new variable name. The resulting new grammar is passed as the language attribute,
in the definition of the nonterminal block, to the nonterminal slist, and, in the sequel,
to stmt. As a result, the nonterminal stmt now can recognize the declared variable.
For example, suppose the input string {int a;int b;a=b;b=a;}. The recognition of
this string starts with the nonterminal block and its language attribute is the grammar
in Figure 3.7. After recognizing the first symbol, {, the parser proceeds to recognize a
list of declarations (nonterminal dlist), passing down the same grammar as the language
attribute to the nonterminal dlist. During the recognition of the nonterminal dlist, it
first tries to match a variable declaration through the nonterminal decl, passing to it
the same language attribute. The nonterminal decl first checks if the variable is already
declared using the parsing expression “ !( int 〈var ↓ g〉 )”. The not-lookahead operator,
!, succeeds if the expression enclosed in parentheses fails, and it does not consume any
symbol from the input. In order to check whether the variable a has already been
declared, the parsing expression enclosed in parentheses matches the int string, but
40 Chapter 3. Adaptable Parsing Expression Grammar
the nonterminal var does not recognize the variable a, because it does not have any
rule for it yet. In the sequel, the parsing expression “int 〈id ↓ g ↑ n〉 ;” recognizes the
declaration of variable a. Note the use of the nonterminal id instead of the nonterminal
var. The nonterminal id is used here to recognize any valid variable name, since it is
a new one. The symbol “#” indicates that the identifier following it (n, in this case)
must be treated as a variable inside a string, then, a new grammar is built from the
current grammar by the addition of a new choice, 〈var ↓ g〉 ← a !〈alpha ↓ g〉, on the
definition of nonterminal var. This new grammar becomes the value of the synthesized
attribute g1.
The grammar synthesized by the nonterminal decl is used in the nonterminal
dlist as language attribute of other calls of the nonterminal decl. Proceeding, the next
variable declaration will be recognized, and the nonterminal dlist synthesizes a new
grammar with these two choices, in this order, “a !〈alpha ↓ g〉” and “b !〈alpha ↓ g〉”, for
the nonterminal var. This new grammar is used by the nonterminal block to pass it as
the language attribute of the nonterminal slist. As a result, the two statements, a = b
and b = a, can be recognized, because the nonterminal var in the language attribute
passed to the nonterminal stmt has rules to recognize the variables a and b.
The use of the operator “!” on the rule defining decl prevents multiple declara-
tions, a problem reported as very difficult to solve when using adaptable models based
on CFG [Christiansen, 1990; Carmi, 2010]. The new rule added to the current APEG
ensures that a variable may be used only if it was previously declared. The symbol
block may be part of a larger APEG, with the declarations restricted to the static scope
defined by the block.
3.4 Conclusion
In order to define APEG, an adaptable model based on PEG, we specified how to
combine PEG and attributes, which we called Attribute PEG. When we have developed
our formalization for Attribute PEG, we were unaware of the Mercer [2008] work which
also gave a formalization to combine attributes and PEG. The semantics of both
formalization are equivalent, although when we published our first formalization of
APEG [Reis et al., 2012], the semantics of the not-predicate was different and did not
allow to propagate changes to the outer level environment. Defining the syntax of the
extensible language Fortress, we realize it is important to allow the not-predicate to
propagate changes. So, we have modified the semantics of the not-predicate, resulting
in a semantics equivalent to the one on Mercer [2008]’work.
3.4. Conclusion 41
We defined APEG on top of Attribute PEG and achieved adaptability using a
special inherited attribute present at every nonterminal that represents the current set
of rules which can be used in every step of the recognition of the input.
APEG keeps some of the most important advantages of declarative adaptable
models, such as an easy definition of context-dependent aspects associated to static
scope and nested blocks. We specified two examples using APEG in order to illustrate
how it works. We showed that the use of PEG as the basis for the model allowed a very
simple solution for the problem of checking for multiple declarations of an identifier.
This problem is reported as very difficult to solve with adaptable models based on
CFG [Christiansen, 1990; Carmi, 2010]. Although we do not intend to define static
semantics of programming languages using APEG, the example in Section 3.3.2 shows
that the lookahead operator could be useful when forward analysis of the input is
required. Indeed, we take advantages of this feature to simulate a multi-pass approach
to parse the Fortress language.
In order to know exactly the adaptations performed by an APEG, a developer
must be aware that it works as a top down parser. It could be considered as a dis-
advantage when compared to declarative models, but any PEG developer is already
prepared to deal with this feature, since PEG is, by definition, a description of a top
down parser.

Chapter 4
Implementation of Adaptable
Parsing Expression Grammars
In this chapter, we discuss issues related to the difficulties of implementing the APEG
model. Compared to the standard PEG model, the main new features of APEG are:
• introduction of inherited and synthesized attributes, creating “PEG with at-
tributes”;
• introduction of update and constraint parsing expressions to compute and test
attribute values;
• adaptability of the model, allowing changes in the grammar while the input is
processed.
Ford developed packrat parsers [Ford, 2002b], which can be used for efficient im-
plementation of PEGs. Packrat parsers allow unlimited lookahead with linear time
processing, at the cost of additional storage for memoization. We have developed an
implementation for the APEG model that is also based on the features of packrat
parsers, but we had to deal with challenges not faced by a standard PEG. For in-
stance, the use of attributes requires a more sophisticated memoization mechanism,
since a nonterminal symbol may be called with different values for inherited attributes,
resulting in different behaviours. Adaptability is another challenge, because changes
in the set of rules may invalidate some memoization results.
In the following, we present a discussion on tools implementing ideas related
to the PEG model (Section 4.1). We describe details of our own implementation of
an interpreter for APEG model, analysing the impacts of the new proposed features
(Section 4.2). We prove that the mechanism of memoization can be adapted to the
43
44
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
APEG model (Section 4.3). We discuss a mixed approach of code generation and
interpretation for APEG grammars (Section 4.4). Then, we conclude this chapter
(Section 4.5).
4.1 PEG-related implementations
As stated before, Ford developed packrat parsing , a method for implementing linear-
time top-down parsers [Ford, 2002b]. Packrat parsing provides better composition
properties than LL/LR parsing, making it more suitable for dynamic or extensible
languages. The main disadvantage of the method is the storage cost, which may be
a product of the number of nonterminal symbols by the total input size, rather than
being proportional to the height of a syntax tree built during the parsing process.
Ford implemented a packrat parser generator named Pappy [Ford, 2002a], which takes
declarative parser specifications and generates packrat parsers in Haskell. The specifi-
cation language accepted by Pappy allows computing semantic values with the help of
embedded code fragments written in Haskell. It also allows the definition of semantic
predicates , with parsing decisions depending on semantic values and not just on syn-
tactic rules. An important disadvantage of this implementation is that the generated
parsers need to have the entire input available up-front, thus making them unusable
for interactive applications. No features for conveniently extending the set of rules are
provided.
Redziejowski [2009] developed Mouse, a parser generator based on PEG. Using
a PEG, Mouse generates Java code for a recursive descent parser. It offers limited
resources for memoization, hence the name Mouse instead of Packrat. Mouse does
not allow the explicit use of attributes. Semantic actions are represented by tags
that can only be inserted at the end of each alternative of a rule. The code for the
semantic actions is written as a method in a separate Java class. Methods are bound
to semantic actions using the same name as the tags. Inside these semantic actions, it
is possible to associate values with the symbols of a rule. This scheme allows a peculiar
implementation of synthesized attributes, but not inherited attributes.
In Language Generator by Instil (LGI) [nez Guzmán, 2009], instead of generating
a descent recursive parser, a PEG is represented as an Abstract Syntax Tree (AST)
that is interpreted. Interpretation allows more flexibility, but it is less efficient than
generated code. LGI implements full memoization, but again the memory management
is very inefficient. The input must be completely read before processing and stored in
memory. Memoization is implemented using a two-dimensional array, with a line for
4.1. PEG-related implementations 45
each nonterminal, and a column for each input symbol. There is no way to define values
for attributes, nor semantic actions. The result of processing an input is a syntax tree
created to represent this input.
Instead of generating a descent recursive parser, LPEG [Ierusalimschy, 2009]
translates parsing expressions to a virtual parsing machine which interprets them.
LPEG can capture text and execute semantic actions, but does not allow explicit use
of attributes. LPEG also does not use memoization of intermediate results, so it can
have an exponential behavior. However, as LPEG is intended to be used in pattern
matching applications, the exponential behavior is not expected to be common in
practice. Moreover, the simplicity of the parsing machine allows the implemenation of
several optimizations.
ANTLR [Parr and Quong, 1994] is a parser generator that introduced a novel
parsing approach, called LL(*) [Parr and Fisher, 2011]. It is not exactly PEG-based,
but it implements some PEG and packrat parser features. For instance, it offers a
lookahead operator and backtracking with memoization, which can be controlled by
the programmer. ANTLR allows the use of inherited and synthesized attributes and
semantic actions as embedded code. It generates efficient code, representing each
nonterminal symbol as a function, in a recursive descent style. Inherited attributes are
implemented as function parameters, and synthesized attributes are implemented as
function return values. When combining backtracking and memoization with arbitrary
semantic actions, an important issue arises: a memoized result is no longer valid if the
semantic actions have collateral effects. ANTLR deals with this problem separating
the process of defining which rule will be used in two steps. First, semantic actions
and memoization are turned off and it uses lookahead to define the correct rule to be
used. Then semantic actions are turned on and the parsing rules are processed again.
Rats! [Grimm, 2006] is another PEG based tool for parser generation, building
packrat parsers with full memoization. The input is treated as a stream of characters,
so it does not suffer from the important limitations imposed by Pappy. All optimiza-
tions introduced by Pappy and several others are implemented in Rats!, producing
efficient parsers. A global state object can be managed by semantic actions without
violating memoization, thus preserving linear time performance. In order to accom-
plish this, Rats! requires that all state modifications be performed within possibly
nested, lightweight transactions, with some additional restrictions. Rats! allows the
construction of extensible and modular grammars that may offer a great level of reuse.
To provide reuse of grammars, it is possible to import modules and extend their pro-
duction rules. Extensibility is carried out only statically. As all other tools analysed
in this section, it is not possible to modify the set of production rules at parse time.
46
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
Katahdin [Seaton, 2007] is a language that allows its own syntax and semantics to
be modified at runtime. It is not exactly a parser generator, but it is mentioned here be-
cause it has two features closely related to our work: the specification of new constructs
is based on PEGs, and the syntax of these constructs may be dynamically defined and
immediately used, at runtime, applying techniques of just-in-time compilation in the
implementation. Some changes on the PEG model make the Katahdin approach differ
from the other tools described here. For example, the PEG ordered choice operator
“/” is replaced by an operator “|”, which gives priority to longest match when choosing
among alternatives, instead of applying the order of alternatives given. According to
the authors, this decision may allow better composition of grammars.
The first version of our implementation, described in the next sections, was devel-
oped as an interpreter for the APEG model. An abstract representation of an APEG
is interpreted, similarly to the LGI tool. The interpreter allows the use of any number
of synthesized and inherited attributes, with a concrete syntax inspired by ANTLR.
The input is treated as a stream of characters, following the approach of Rats! and
ANTLR, and avoiding the restrictions imposed by Pappy and LGI. Similarly to Pappy
and Rats!, full memoization is provided. The use of embedded code, formally defined
by the model as conditional and assignment expressions, is completely implemented in
the interpreter. The most important novelty, when compared with the tools described
in this section, is obviously the possibility of manipulating the set of production rules
at parse time.
4.2 Implementing an Interpreter for APEG
The implementation of an interpreter for APEG was developed using the ANTLR
parser generator [Parr and Quong, 1994], with embedded code written in Java. The
syntax analysis is combined with the static semantics analysis in one single pass. It
is possible to use a nonterminal before declaring its production rule, so all uses of
nonterminals are collected during the analysis single pass and verified later using Java
code and data structures. An AST tree is generated using the operators provided by
ANTLR for AST construction. The generated AST is then interpreted, implementing
the semantics of the model.
4.2.1 Examples Showing the Concrete Syntax
A grammar for the concrete syntax adopted is presented in Appendix A, inspired on
the syntax of ANTLR. Inherited attributes are defined as parameters for nonterminal
4.2. Implementing an Interpreter for APEG 47
Figure 4.1 Concrete syntax for the example of Figure 3.4.
1 apeg datadependent;
2
3 literal locals[int n]: number <n> ’[’ strN <n> ’]’ !. ;
4
5 strN[int n]: ( {? n > 0 } CHAR { n = n - 1; } )* {? n == 0 } ;
6
7 number returns[int x2] locals[int x1]:
8 digit <x2> ( digit <x1> { x2 = x2 * 10 + x1; } )* ;
9
10 digit returns [int x1]:
11 ’0’ { x1 = 0; }
12 / ’1’ { x1 = 1; }
13 / ’2’ { x1 = 2; }
14 / ’3’ { x1 = 3; }
15 / ’4’ { x1 = 4; }
16 / ’5’ { x1 = 5; }
17 / ’6’ { x1 = 6; }
18 / ’7’ { x1 = 7; }
19 / ’8’ { x1 = 8; }
20 / ’9’ { x1 = 9; } ;
21
22 CHAR : . ;
symbols. Synthesized attributes are defined as a list of return values. We also defined
local attributes, which are synthesized attributes whose values are manipulated only
locally. All attributes are typed, anticipating our desire for an efficient code generation.
Update expressions are defined between {...} and constraint expressions are defined
between {?...}. Desugaring syntax for optional, and-predicate, positive Kleene closure
and character classes are also available.
In Figure 4.1, the APEG of Figure 3.4 is written using the concrete syntax pro-
posed. The specification starts with the name of the grammar followed by a list of
production rules. Rule strN shows an example of use of an inherited attribute n, in
line 5. Rule digit defines a synthesized attribute x1 after the keyword “returns”, in line
9. Rule literal defines a local variable n after the keyword “locals”, in line 3. When
a nonterminal is referenced inside a PEG expression, the values of the inherited and
synthesized attributes are listed between <...>, with all inherited attributes coming
first. Character sequences are defined inside single quotes (e.g. ’[’ and ’]’, in line 3).
The dot operator (“.”) matches any single character. All other elements in Figure 4.1
follow the definitions of the APEG model and may be easily understood.
There is no fixed start symbol for the APEG. When a user applies an APEG
to an input file, the name of the chosen start symbol must be provided, together with
values for its inherited attributes. So any of the nonterminal symbols of Figure 4.1
48
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
Figure 4.2 APEG with user-defined functions and character classes.
1 apeg datadependent2;
2 functions StringFunctions;
3
4 literal locals[int n]: number <n> ’[’ strN <n> ’]’ !. ;
5
6 strN[int n]: ( {? n > 0 } CHAR { n = n - 1; } )* {? n == 0 } ;
7
8 number returns[int x2] locals[String t]: t=[0 -9]+ {x2 = strToInt(t);};
9
10 CHAR: . ;
Figure 4.3 Example with the set of production rules changed at parse time.
1 apeg adapdatadependent3;
2 options { isAdaptable = true; }
3 functions AdaptableFunctions , StringFunctions;
4
5 literal[Grammar g] locals[Grammar g1]:
6 number <n>
7 {g1 = addRule(copyGrammar(g), concat(
8 concat(’strN␣:␣’, concatN(’CHAR␣’, n)), ’;’));} ’[’ strN <g1> ’]’
9 ;
10
11 strN[Grammar g]: {? false } ;
12
13 number returns[int r] locals[String t] : t=[0 -9]+ r = strToInt(t);};
14
15 CHAR : . ;
could, in principle, be used as a start symbol, each one describing a different language.
Figure 4.2 presents a simplified version of the same APEG discussed above, show-
ing other features of the implementation. In line 2, the “functions” directive is used,
allowing the definition of one or more files with the code for user-defined functions. As
we implemented the interpreter in Java, the files are associated to Java classes, and the
functions available for the APEG are all static functions defined inside these classes.
One example is the function “strToInt”, used in line 8, which converts a string into
its integer representation. Also in line 8, there is a use of a character class , defining
the symbols ’0’ ... ’9’, and a use of the feature for binding a variable to the input
matched by an APEG expression (“t=...”).
Figure 4.3 shows an example in which the set of production rules is changed at
parse time. It is the concrete syntax equivalent to the APEG described in Figure 3.5.
The option isAdaptable is set to true, indicating for the interpreter that the APEG
itself must be provided as a value for the first inherited attribute of the start symbol.
4.2. Implementing an Interpreter for APEG 49
The model defined in Chapter 3 requires that every nonterminal must have an APEG
(type Grammar) as its first inherited attribute. Some nonterminal symbols do not
need to mention the APEG attribute, like number and CHAR in Figure 4.3, so the user
does not need to declare it
Consider the APEG expression in lines 6 to 8, in Figure 4.3, and suppose that the
input file contains “3[abc]”. The attribute n will be assigned the integer value 3. Then
a new APEG g1 will be constructed, adding to the current APEG g a rule defined by
the string “strN : CHAR CHAR CHAR;”. This string is built using the functions concat
and concatN, defined in file StringFunctions. Then the nonterminal symbol strN will
be processed having g1 as its set of valid production rules. Note that this nonterminal
is referenced in line 8, even before the actual rule is created. The interpreter requires
that all nonterminal symbols used must be statically declared, otherwise an error is
detected at analysis time, so a stub rule was added for strN in line 11. The function
addRule, used in line 7, adds a new set of rules to the given APEG, following the steps
below:
• The textual representation of the new rules is analysed. If a syntax or static
semantics error is detected, an exception is raised.
• If the symbol on the left side of a rule to be added is a new nonterminal, then a
new rule is added to the APEG.
• If the symbol on the left side of a rule to be added is already defined, the
APEG expressions are combined using the ordered choice operator. For example,
the rule for the nonterminal symbol strN in g1 will have the following format:
“strN[Grammar g] : {? false} / CHAR CHAR CHAR;”. It is not possible to
change the list of declared attributes of an existing rule.
4.2.2 Implementing PEG with Attributes
In this section, we discuss how attributes are implemented by the interpreter, and how
they affect the memoization mechanism of the parser.
In a standard packrat parser, the application of nonterminals is memoized as a
mapping from pairs 〈nonterminal, position〉 to results, where a result is either fail
or an integer newPos. Because of backtracking, a nonterminal may be applied to
the same position of the input file more than once. On the second application and
on the following ones, the memoized result is used, allowing linear time processing.
The memoized result indicates that the application will either fail again (fail), or may
50
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
succeed and so it provides the new position (newPos) of the input file that must be
considered.
As discussed in Section 4.1, tools like LGI implement memoization in a naive and
expensive way, in a form of a two-dimensional table with one line for each nonterminal
symbol, and one column for each position of the input file. This representation is
expensive because the table is frequently very sparse. On the tool Rats!, on the other
hand, the columns of this table are stored as several chunks, representing only the cases
that really needed memoization.
When inherited attributes are taken into account, the naive approach is even in-
feasible. A nonterminal symbol may be used with different values for the inherited at-
tributes. A resulting memoization associated with a pair 〈nonterminal, position〉 is not
enough, because the behaviour may be different for different inherited attributes, even
for a same position of the input file. We have implemented memoization as a mapping
from a 3-tuple 〈nonterminal, inhAttr, position〉 to either fail or 〈synAttr, newPos〉.
In this case, inhAttr is a list of values for the inherited attributes and synAttr is a list
of synthesized attributes calculated on the first application of the given 3-tuple. If the
application succeeds, the list of synthesized attributes is reused, and the input file is
repositioned on newPos. This memoization mapping is currently implemented using
hash tables, distributed on the nonterminal symbols defined by the APEG. Each non-
terminal symbol stores a hash table mapping elements 〈inhAttr, position〉 to results of
the form fail or 〈synAttr, newPos〉.
It is important to note that we consider that a set of values for a 3-tuple
〈nonterminal, inhAttr, position〉 will always produce the same result during parsing,
even though the model offers assignment expressions that may change the environment.
This is true because we consider that expressions use only values defined locally, no
global state is available. When user-defined functions are used, we assume that they
also produce the same results when given the same list of arguments. If this property is
not valid, our mechanism for memoization would not work. In Section 4.3, we formally
define and prove this property.
The stack of environments may be used also during the application of several
standard PEG operators. As discussed in Section 3.2, some equations of Figure 3.2
define that changes in an environment are discarded when an expression fails. So our
interpreter is supposed to store a copy of the environment in several cases, not only
when a new instance of a nonterminal symbol is used. We believe that this semantics
is easy to understand, but it may be also very expensive and we are not sure if the
costs of these benefits are worthwhile. Currently, the interpreter offers a directive that
allows the user to control this semantics. The default behaviour is the one defined
4.2. Implementing an Interpreter for APEG 51
in Figure 3.2, but it is also possible to adopt a semantics in which the changes on
the environment produced during the evaluation of the expressions are not discarded,
resulting in a more efficient processing.
4.2.3 Implementing Adaptability
In the APEG model, the set of productions may change at parsing time. This is
the feature that presents the most important challenges for an implementation. The
decision of building an interpreter instead of generating code to simulate the production
rules was directly affected by this feature. Code generation may produce a more efficient
parser, but if production rules are changed at parse time, parts of the generated code
may be invalidated. So we decided that the first implementation for the model would
be an interpreter.
Two other important implementation issues are directly associated with the
adaptability of the model. The first one is the cost of storing the set of rules as
an inherited attribute at every nonterminal symbol. The second one is the impact on
the memoization caused by changes on the production rules. We discuss these two
issues in the following.
In Section 4.2.1, the APEG of Figure 4.3 shows an example in which a copy of the
current APEG was produced, and a production rule of this copy was modified. We have
designed data structures that implement these potentially very expensive operations
in an efficient way. An APEG is represented internally by the interpreter as a class
named Grammar , which contains a set of NonTerminal objects. Each nonterminal
contains information about its attributes and stores an AST representing the APEG
expression associated to it, built during the analysis phase.
When a copy g1 of an APEG g is generated, g1 shares with g the set of nonterminal
symbols, so the copy operation is very efficient. When a production rule is added to g1
and the nonterminal symbol is new, only the new NonTerminal object is added to g1. A
more interesting situation happens when a production rule of g1 is modified, involving
a nonterminal symbol that is already defined, and its definition may be shared with g.
An example is the addition of a rule for the nonterminal symbol strN in the APEG of
Figure 4.3. A copy of the NonTerminal object associated with strN is generated for g1
and a new AST is created for it. Our approach for modifying existing rules, combining
expressions with the ordered choice operator, allows a very efficient implementation.
The objects shared after the creation of the rule for strN are represented in Figure 4.4.
Note that it was necessary to create only a copy of the modified NonTerminal object
and a node to represent the ordered choice operator. All other structures are shared
52
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
Figure 4.4 Data structures representing a copy of an APEG with a modified produc-
tion rule.
by the two Grammar objects.
In Section 4.2.2, we described our approach for memoization, but we did not
consider situations with changes on the set of production rules. If a production rule
changes at parse time, some of the memoization results stored may be invalidated. Cur-
rently, our implementation creates copies of the memoization table, discarding memo-
ized values associated to symbols whose rules have changed, and also to other symbols
depending on the symbols whose rules have changed. We are studying mechanisms for
optimizing this process.
4.3 APEG Properties Associated with Memoization
The following property is a requirement for the correct operation of the memoization
scheme proposed in Section 4.2.2: the parser must have always the same behaviour
when evaluating a nonterminal symbol for a given position of the input file, if it is
given the same values for the set of inherited attributes. When backtrack occurs and a
nonterminal symbol must be evaluated a second time using these same parameters, the
memoized value can be used instead. In the following, we formally define and prove
the desired property. First, an auxiliary lemma is presented.
4.3. APEG Properties Associated with Memoization 53
Lemma 1. In any APEG, if there is an interpretation of a parsing expression e, for
an input string x, in an environment E, it is unique.
Proof. Suppose there is an interpretation E ` (e, x)⇒ o ` E ′. We prove by induction
on the heights of the proof trees (see Figures 3.2 and 3.3) for the antecedents that this
interpretation is unique, i.e., there is only one result for o and E ′. It is easy to see
that when the height of the proof tree is 1, the interpretation is unique. We divide the
proof of the indutive step in the following cases:
• Case e is a sequence e1 e2: By hypothesis, the interpretation E ` (e, x)⇒ o ` E ′
exists, so it must have at least a proof tree, which has one of the forms Seq, ¬Seq1
or ¬Seq2. Using the induction hypothesis, the interpretation E ` (e1, x)⇒ o1 `
E1 is unique. Therefore, if o1 = f , then the interpretation E ` (e, x) ⇒ o ` E ′
must have the form of ¬Seq2 and o is f and E1 is equal to E. Otherwise, the
parsing expression e1 suceeds for input x in the environment E and the proof
tree is Seq or ¬Seq1. Using the induction hypothesis on the interpretation of
the parsing expression e2 for the remaining input in the resulting environment,
we have that it is unique, E1 ` (e2, y) ⇒ o2 ` E2 with x = o1y. So, if o2 is a
failure, f , then the proof tree must be ¬Seq1 and o is f and E ′ is equal to E.
Otherwise, the interpretation suceeds, having the form of Seq, and o is o1o2 and
E ′ is equal to E2.
• Case e is a choice e1/e2: Using the induction hypothesis, the interpretation of
the parsing expression e1 for the input x in the environment E is unique. If
it is not a failure, then the proof tree of E ` (e, x) ⇒ o ` E ′ has the form
Choice1 and the prefix consumed and the resulted enviroment are the same as
the interpretation of e1 for the input x in the environment E. Otherwise, using
the induction hypothesis we have that the interpretation E ` (e2, x) ⇒ o1 ` E1
is unique. If it is a failure, then the proof tree must have the form ¬Choice and
the values of o and E ′ are f and E, respectively. Otherwise, the form of the proof
tree is Choice2 and o is equal to o1 and E ′ is E1.
• Case e is a repetition e∗1: Using the induction hypothesis, the interpretation
E ` (e1, x) ⇒ o1 ` E1 is unique. If o1 is a failure, then the form of the proof
tree is ¬Rep and the results o is λ and E ′ is E. Otherwise, the form of the
proof tree must be Rep. Using the induction hypothesis, we guarantee that the
interpretation E1 ` (e∗1, y) ⇒ o2 ` E2, with x = o1y and y 6= λ, is unique.
Therefore the values of o and E ′ is o1o2 and E2, respectively.
54
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
• Case e is a not-predicate !e1: Using the induction hypothesis, the interpretation
E ` (e1, x)⇒ o1 ` E1 is unique. If the o1 is f , then the form of the proof tree is
¬Neg and o is λ and E ′ is E1. If o1 is not a failure, then the proof tree has the
form of Neg and the values of o and E ′ are f and E1, respectively.
• Case e is a bind expression κ = e1: Using the induction hypothesis, the inter-
pretation E ` (κ = e1, x) ⇒ o1 ` E1 is unique. If the o1 is f , then the form of
the proof tree is ¬Bind and o is f and E ′ is E. If o1 is not a failure, then the
proof tree has the form of Bind and the values of o and E ′ are o1 and E1[κ/o1],
respectively.
• Case e is a nonterminal expression 〈A ↓ ε1 ↓ . . . ↓ εp ↑ κ1 ↑ . . . ↑ κq〉: As
the only possibility for the proof tree has the form of Adapt and, Using the
induction hypothesis, the interpretation [ϑ1/v1, . . . , ϑp/vp] ` (e1, x) ⇒ o1 ` E1
with vi = E[[εi]], 1 ≤ i ≤ p and 〈A ↓ ϑ1 ↓ . . . ↓ ϑp ↑ ε′1 ↑ . . . ↑ ε′q〉 ← e1 ∈ E[[ε1]] is
unique, then the only possible values to o is o1 and to E ′ is E[κ1/u1, . . . , κq/uq],
in which uj = E1[[ε′j]], 1 ≤ j ≤ q
This lemma states that the model is deterministic. Suppose an APEG (VN , VT ,
A, R, S, F ) and the judgement E ` (e, x)⇒ o ` E ′. Given values for e, x and E, if it
is possible to calculate the values for o and E ′, they are uniquely defined. Note that
the evaluation of an expression can fall into infinite loop and have no interpretation.
For example, if the evaluation of the expression e fails for input x in some environment,
the evaluation of the expression (!e)∗ will fall into infinite loop for the same input.
In order to guarantee that Lemma 1 is valid, the implementation of the interpreter
developed assumes that the user-defined functions, representing the set F of an APEG
(VN , VT , A, R, S, F ), are referentially transparent. This lemma will be used to help
proving the desired property, discussed in the beginning of this section.
In the proof of the next theorem, we use the simplified notation ~vn to denote a
sequence v of n expressions. We adopt the following convention: letters ϑ, κ and κ′
are used for defining positions of a rule and ε, ε′ and ε′′ are used for applying positions
of a rule. For example, if a nonterminal symbol A with p inherited attributes and q
synthesized attributes is used in the left side of a rule, then 〈A ↓ ϑ1 ↓ . . . ↓ ϑp ↑ ε1 ↑
. . . ↑ εq〉 may be replaced by 〈A ↓ ~ϑp ↑ ~εq〉. On the other hand, if the same symbol
is used in the right side of a rule, then the notation for the nonterminal expression
〈A ↓ ε′1 ↓ . . . ↓ ε′p ↑ κ1 ↑ . . . ↑ κq〉 may be replaced by 〈A ↓ ~ε′p ↑ ~κq〉. Defining positions
are always represented by a single variable.
4.3. APEG Properties Associated with Memoization 55
Theorem 1. In any APEG, if there is an interpretation of a nonterminal expression
for an input string in a given environment, there is also another interpretation of
the same nonterminal expression in any other environment, provided that the same
values are used for the set of inherited attributes of the nonterminal symbol. The input
consumed by both interpretations will be the same, and also the values calculated for
the synthesized attributes in both interpretations will be the same.
Proof. Consider two environments E and E ′, an integer number p ≥ 1 and expressions
ε1, . . . , εp, ε′1, . . . ε′p. Suppose that the value of the expression εi evaluated in the
environment E is the same of the expression ε′i evaluated in the environment E ′ for
every 1 ≤ i ≤ p, i.e., E[[εi]] = E ′[[ε′i]].
Let A be a nonterminal symbol of the APEG and consider the evaluation of a
nonterminal expression 〈A ↓ ~εp ↑ ~ϑq〉 in the environment E or 〈A ↓ ~ε′p ↑ ~ϑ′q〉 in E ′. The
language attribute (attribute that represents the current set of rules) is the same in
environments E and E ′, because E[[ε1]] = E ′[[ε′1]]. So the rule for nonterminal A is the
same for both environments: 〈A ↓ ~κp ↑ ~ε′′q〉 ← e. Let vi be the value of the attribute
expression εi evaluated in environment E or the attribute expression ε′i evaluated in
environment E ′, i.e., vi = E[[εi]] = E ′[[ε′i]] for every 1 ≤ i ≤ p. Consider that the in-
terpretation, if exists, of the parsing expression e in the environment [κ1/v1, . . . , κp/vp]
for a input string x is given by [κ1/v1, . . . , κp/vp] ` (e, x) ⇒ o ` E ′′. Using Lemma 1,
we may guarantee that this interpretation is unique.
Let the uj be the value of the expression ε′′j , for every 1 ≤ j ≤ q, evaluated
in environment E ′′. We can build the following proof trees to the interpretation of
〈A ↓ ~εp ↑ ~ϑq〉 in the environment E and 〈A ↓ ~ε′p ↑ ~ϑ′q〉 in the environment E ′, for the
same input x:
〈A ↓ ~κp ↑ ~ε′′q 〉 ← e ∈ E[[ε1]], where E[[ε1]] ≡ language attribute
vi = E[[εi]], 1 ≤ i ≤ p uj = E′′[[ε′′j ]], 1 ≤ j ≤ q
[κ1/v1, . . . , κp/vp] ` (e, x)⇒ o ` E′′
Proof Tree 1
E ` (〈A ↓ ~εp ↑ ~ϑq〉, x)⇒ o ` E[ϑ1/u1, . . . , ϑq/uq]
〈A ↓ ~κp ↑ ~ε′′q 〉 ← e ∈ E′[[ε′1]], where E′[[ε′1]] ≡ language attribute
vi = E
′[[ε′i]], 1 ≤ i ≤ p uj = E′′[[ε′′j ]], 1 ≤ j ≤ q
[κ1/v1, . . . , κp/vp] ` (e, x)⇒ o ` E′′
Proof Tree 2
E′ ` (〈A ↓ ~ε′p ↑ ~ϑ′q〉, x)⇒ o ` E′[ϑ′1/u1, . . . , ϑ′q/uq]
The interpretations above consume the same prefix o of the input x. The resulting
environments may be different, but the same values uj are produced for the synthesized
56
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
attributes. If there is no interpretation of the expression e for input string x in the
environment [κ1/v1, . . . , κp/vp], there will be no interpretation for (〈A ↓ ~εp ↑ ~ϑq〉, x) in
the environment E, or for (〈A ↓ ~ε′p ↑ ~ϑ′q〉, x) in E ′.
In order to understand the results presented above, it is important to note the
subtle differences between Lemma 1 and Theorem 1. Lemma 1 states the uniqueness
of the interpretation of a parsing expression, when an environment and an input string
are defined. This result is used to prove Theorem 1, which says that, for a given input
string, it is only necessary to have the same set of values for the inherited attributes of
a nonterminal symbol, in order to guarantee that the interpretation of this nonterminal
will consume the same input and produce the same values for the synthesized attributes.
The consequence of Theorem 1 is that the memoization algorithm used in pack-
rat parsers can be adapted to the APEG model, using the approach described in
Section 4.2.2.
4.4 Mixing Code Generation and Interpretation –
An Initial Approach
Figure 4.5 Example with the set of production rules changed at parse time.
1 block[Grammar g]:
2 ’{’ dlist <g, g1 > slist <g1> ’}’ !.;
3
4 dlist[Grammar g] returns[Grammar g1]:
5 decl <g, g1 > {g = g1;} (decl <g, g1> {g = g1 ;})*;
6
7 decl[Grammar g] returns[Grammar g1]:
8 !(’int␣’ var) ’int␣’ id<s> ’;’
9 {g1 = g + ’var:␣\'’ + s + ’\' !alpha <g, ch >;’;};
10
11 var[Grammar g]:
12 {? false };
13
14 slist[Grammar g]:
15 stmt <g> stmt <g>*;
16
17 stmt[Grammar g]:
18 var <g> ’=’ var <g> ’;’;
19
20 id[Grammar g] returns[String s]:
21 alpha <g, ch1 > {s = ch1;} (alpha <g, ch2 > {s = s + ch2 ;})*;
22
23 alpha[Grammar g] returns[String ch]:
24 ch=[a-zA-Z0 -9_];
4.4. Mixing Code Generation and Interpretation – An Initial
Approach 57
Usually, parser generators for PEG produce a top-down recursive descent parser.
Every nonterminal is implemented by a function whose body is a code mapped from its
parsing expression. The return value of each function is an integer value representing
the position on the input that it has got moved on or the value −1 if it fails. It is
straightforward to extend this idea to include attributes: the inherited attributes are
implemented as parameters of functions and synthesized attributes as return values.
For example, Figure 4.5 shows a concrete syntax of the example of the block languages
presented in Section 3.3 and Figure 4.61 shows the code generated for the nonterminal
var of Figure 4.5. The function var has one parameter, the language attribute, and
returns an object of type Result , which must contain fields representing the portion of
the input consumed and the values of the synthesized attributes, when specified.
Complications with this scheme arise when the base grammar is dynamically
extended during parsing. When new choices are added to the var nonterminal, the
function of Figure 4.6 does not represent anymore the correct code for this nonterminal,
then this function must be updated. However, it is cumbersome regenerating all the
parser code on the fly to reflect these small changes. In cases where the grammar
changes several times, as in extensible languages, the on-the-fly regeneration of all the
parser is very expensive (we discuss this in Chapter 5). An alternative solution is to
interpret the whole grammar directly, as we discussed in Section 4.2. However, this
may decrease parsing efficiency. Following, we discuss an approach to efficiently adapt
the grammar, generating the code from the base grammar and include hooks to jump
from the generated code to interpret the parts that have been added dynamically. Our
strategy is based on the assumption that the code for the base grammar is expected to
be large and used many times.
Since APEG only allows changes in the definition of nonterminal symbols by
insertion of new choices at the end of the rules, we generate a recursive descent parser
from an APEG grammar, so there is a function for each nonterminal and, whenever
necessary, we place at the end of the body of these functions a call to the interpreter.
1For simplification, we ommit the local attribute declarations
Figure 4.6 Example of code generated by a PEG.
1 Result var(Grammar g) {
2 if(false) {
3 // do nothing
4 } else
5 return new Result (-1); // a fail result
6 }
58
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
Figure 4.7 shows a scratch of the code generated for the grammar of Figure 4.5.
As shown, we generate a Java class which has a function to each nonterminal definition
on the grammar. The generated class extends the predefined class Grammar that
has the implementation of standard functions, such as the function interpretChoice to
interpret an AST and functions to add rules to the grammar or to clone the grammar
itself.
The vector adapt (line 3 in Figure 4.7) stores a possible new choice for each
nonterminal. Notice the hook at the end of the function body of each nonterminal
(lines 29 and 40) to call the interpreter with its possible choice. This hook will be
reached only if its preceding code fails, indicating that we must interpret the new
choice. For example, if the code representing the parsing expression '{'dlist<g, g1>
slist<g1> '}' on lines 6 to 24 fails, then we call the interpreter passing its new choice.
So, the action of adapting a grammar is just an action of including a new choice rule
on the vector adapt .
Using this strategy, all the base code for the grammar is generated and compiled,
and only the choices that are added dynamically must be interpreted. Our strategy is
based on the assumption that the code for the base grammar is expected to be large
and used many times. Therefore, the expected result shall be a faster parser than the
interpreter we have implemented and discussed in Section 4.2, and will still allow an
efficient method for managing syntactic extensions.
In the APEG model, a parsing expression of a nonterminal is fetched from its
language attribute. Using different language attributes, it is possible to get different
parsing expressions for the same nonterminal, thus effectively adapting the grammar.
To have this behaviour, each function generated from a nonterminal has the language
attribute as a parameter. The type of this parameter is the type of the grammar
generated. In our example, Figure 4.7 shows the language attribute, whose type is
BlockLanguage, of the functions block (line 5) and var (line 33).
We use the dot notation to call a nonterminal function associated with its correct
language attribute. For example, the nonterminals dlist and slist on the definition
of the function block are called as g.dlist(g) (line 11) and g1.slist(g1) (line 15). Note
that, as the language attribute passed to each nonterminal is different, we call each
nonterminal function from a different language attribute. We must call slist from the
object g1 instead of g because the vector adapt of g1 has a different value of choices
for the function var . So, the interpreter is called, the new choice is passed, allowing
the use of the variables that have been declared.
A restriction to this approach is that as we use the generated class as the lan-
guage attribute type, e.g. the BlockLanguage type in Figure 4.7, it is not possible to
4.4. Mixing Code Generation and Interpretation – An Initial
Approach 59
Figure 4.7 Generated code for the block language.
1 public class BlockLanguage extends Grammar {
2 // vector of new choices
3 private CommonTree [] adapt = new CommonTree [8];
4 ...
5 public Result block(BlockLanguage g) {
6 BlockLanguage g1; // local attribute
7 Result result;
8 int position = g.match("{", currentPos );
9 if(position > 0) {
10 g.currentPos = position;
11 result = g.dlist(g);
12 if(! result.isFail ()) {
13 g1 = (BlockLanguage) result.getAttribute (0);
14 g1.currentPos = result.getNext_pos ();
15 result = g1.slist(g1);
16 if(! result.isFail ()) {
17 position = g.match("}", result.getNext_pos ());
18 if(position > 0) {
19 char ch = g.read(position );
20 if(APEGInputStream.isEOF(ch))
21 return new Result(position );
22 }
23 }
24 }
25 }
26 Environment env;
27 ... // set the environment to start the interpreter
28
29 // interpreter the choice of block (index 0)
30 return g.interpretChoice(adapt [0], env);
31 }
32
33 public Result var(BlockLanguage g) {
34 if(false) {
35 // do nothing
36 }
37 Environment env;
38 // set the environment to start the interpreter
39
40 // interpreter the choice of var (index 3)
41 return g.interpretChoice(adapt [3], env);
42 }
43 ... // other functions
44 }
60
Chapter 4. Implementation of Adaptable Parsing Expression
Grammars
pass a different grammar which is not subtype of the generated class, as the language
attribute. For example, suppose a grammar with other definitions for the same nonter-
minals presented in Figure 4.7. One may want to pass as the language attribute this
grammar in a specific context on the definition of the block language of Figure 4.7.
However, as this grammar is not a subclass of BlockLanguage, there will be a type
error. Instead of using the generated class as the language attribute, we could use the
base type, Grammar , as the language attribute and use reflection on runtime to invoke
the nonterminal functions. However, as the use of reflection may result in a slower
program than the use of the dot notation to call functions, we avoid this solution.
During the interpretation process of a parsing expression, it is possible to en-
counter a reference to a predefined nonterminal. In this case, the interpreter must
execute the function code of this nonterminal. For example, suppose an input to the
block language example of Figure 4.5 which adds the choice var : 'a' !alpha〈ch〉; to the
definition of the nonterminal var . The nonterminal referenced in this choice, alpha,
is the one defined in Figure 4.5 and has a code generated for it. So, when the inter-
preter reaches this nonterminal, it must stop interpreting and invoke the function of
this nonterminal. We implemented this feature using the reflection mechanism of the
Java language. Whenever interpreting a nonterminal, the interpreter checks whether
the nonterminal is a method of the language attribute object, and if so, the interpreter
invokes the method code by reflection. Otherwise, it continues the interpretation.
The code presented in Figure 4.7 was not automatically generated. In order to test
our approach, we first produced handwritten code for some APEG specifications, such
as the one presented in Figure 4.5 and another for the example of the data-dependent
presented in Section 3.3. For the interpretation, we modified the prototype interpreter
we had developed and discussed in Section 4.2. One of the main modifications was the
code for calling, from the interpreter, functions on the generated code. This feature
was implemented using reflection in the Java language, as previously discussed. After
our mixed approach proposal be proved useful, we will write the code generator to
automatically produce a recursive descent parser from an APEG specification.
4.5 Conclusion
In this chapter, we discussed our design decisions to implement the APEG model. We
showed an implementation of PEG with attributes, as an LL-attributed approach, and
how the adaptability can be efficiently managed. We also proved that the memoization
mechanism can be adapted to work with APEG. As a proof of concept, we implemented
4.5. Conclusion 61
an interpreter with all the features discussed here.
Automatic generation code from an APEG grammar is difficult because changes
on the nonterminal rules may invalidate the generated parser code. However, code
generation is known to be faster than interpretation. In order to ameliorate this prob-
lem, we also propose a novel mixed approach to generate an extensible parser from an
APEG grammar, which combines compilation with interpretation. The greatest virtue
of this proposal is its simplicity, which comes from the APEG model.

Chapter 5
Evaluation and Validation
Extensible languages are languages which have features that allow extending its own
set of constructs [Wilson, 2004; Tobin-Hochstadt et al., 2011]. Extensible languages
seem to have several advantages to implement Domain-Specific Languages (DSLs) over
other approaches. One of the advantages is the possibility of implementing DSLs in a
modular way. For example, Erdweg et al. [2011] show how DSLs can be implemented
using the extensible language SugarJ, by means of syntax units designated as sugar
libraries, which specify a new construct for a domain concept. Tobin-Hochstadt et al.
[2011] also discuss the advantages of implementing DSLs by means of libraries.
An interesting application of adaptable models, such as APEG, is to provide a
formal mechanism to specify the syntax of extensible languages, including their exten-
sibility mechanism. Using formal specifications, it may be possible to automatically
parse programs from these languages. In this chapter, we use extensible languages to
evaluate the APEG model because they require dynamic modifications on their own
set of rules. In our evaluation, we are trying to answer:
• is APEG able to define the syntax of extensible languages and automatically
parse them?
• is an APEG parser efficient enough to be used in practice for defining extensible
languages?
To answer the first question, we used APEG to specify the syntax of the languages
SugarJ and Fortress and we compared the results with the original implementations.
We have chosen these languages because they do not impose restrictions on the syntax
of new constructions and they also require the parser to be dynamically modified during
parsing every time a new construction is used. Before we discuss the implementation
63
64 Chapter 5. Evaluation and Validation
Figure 5.1 Processing of a SugarJ top-level declaration (borrowed from [Erdweg et al.,
2011]).
of these language in Sections 5.2 and 5.3, we discuss, in Section 5.1, the complete cycle
of the implementation of an extensible language and how APEG can be used to make
the implementation easier.
To find out if APEG is feasible to be used in practice, we compared the execution
time to parse programs in the SugarJ language using the original parser implementation
and the one using APEG. Section 5.4 shows this evaluation. Finally, Section 5.5
sumarizes this chapter conclusions.
5.1 An APEG Implementation of an Extensible
Language
It is common to divide the implementation of a programming language in phases:
lexical analysis; parsing; semantic analysis; code generation and optimization. An
implementation of an extensible language uses the same division of phases, but it has
to deal with extensions to the language. An extension (or DSL) modifies all phases
providing information to them, informing how to parse and give the semantics for the
extension.
As the most important current tools for automatic generation of lexers and parsers
do not have resources to implement this behavior of extensible languages, language
designers usually handwrite code to emulate it. As an example, Figure 5.1 shows the
SugarJ compiler strategy to modify the parser when using an extension and to give
its semantics. Observe that when parsing a program which uses an extension (box
SugarJ + extensions), the parser builds a mixed AST containing nodes of the original
SugarJ (blue circles) and extension nodes (red squares). After that, the desugar module
5.2. The Syntax of SugarJ 65
Figure 5.2 A definition of sugar library for Pairs in SugarJ.
1 package syntactic;
2
3 public sugar Pair {
4 context -free syntax
5 '(' type ',' type ')' -> type;
6 '(' expr ',' expr ')' -> expr;
7 }
translates extension nodes into nodes of the base language (SugarJ in this case). The
SugarJ compiler takes an incremental approach and the task presented in this scheme is
done for every top-level entry (the SugarJ compiler splits the program file in a sequence
of logical parts called top-level entries, otherwise its approach would not work). Note
that after analysing a top-level entry, the SugarJ compiler may modify the code of the
parser and the desugar modules, allowing parsing the next top-level entries.
APEG was planned to describe the syntax of extensible languages and auto-
matically parse them, including the mechanism of on-the-fly grammar modification
presented in these languages. Therefore, instead of handwriting an incremental mech-
anism for parsing parts of the input, adapting the grammar and resuming the parser
for the remaining input, the designer only specify the language syntax in APEG and
automatically generates an extensible parser which is capable to parsing a program of
this language. The generated parser also may produce a mixed AST, containing base
language and extension nodes. The semantics of the extension nodes can be given using
the same strategy applied by the SugarJ compiler, desugaring the extension nodes into
base language nodes, or using any other desired technique.
In the following, we discuss only the syntactic aspects of the definition of extensi-
ble languages. As APEG does not provide any special support for giving semantics, the
reader may suppose that the semantics may be given by translating the AST extension
nodes into base language nodes after parsing the program using APEG.
5.2 The Syntax of SugarJ
SugarJ [Erdweg et al., 2011] is a language developed by Erdweg et alii to experiment
and validate their idea of sugar libraries. The main aim of sugar libraries is to encap-
sulate the definition of extensions for the Java language in units that may be imported
or composed for creating other extensions, in a modular way. Figure 5.21 shows an
1All examples of programs in SugarJ languages presented here are borrowed from [Erdweg et al.,
2011]
66 Chapter 5. Evaluation and Validation
Figure 5.3 Use of the pair syntax.
1 import syntactic.Pair;
2
3 public class Test {
4
5 private (String , Integer) p = (''12'', 34);
6
7 }
example of a definition of a sugar library for a new syntax for pairs, creating two new
rules: a rule for the definition of pair types in line 5, type → ‘(’ type ‘, ’ type; ‘)’, and
a rule for using pair expressions in line 6, expr → ‘(’ expr ‘, ’ expr; ‘)’. Note that the
definition of a rule in SugarJ is in an order that is reverse to the one commonly used
in context-free grammars.
A definition of a sugar library does not immediately extend the language, an
extension is only created when a module or file imports a sugar library. As an example,
Figure 5.3 shows a program that imports the sugar library Pair in line 1. After
this import statement, the parser effectively extends the language, adding the two
rules defined by the sugar library. The rules added are used for correctly parsing the
attribute p of the class Test in line 5.
We have defined the syntax of SugarJ in APEG and used an experimental ver-
sion of an interpreter of the model to automatically perform parsing. As APEG is
based on PEG, we adapted an implementation of the Java grammar for the Mouse
project [Redziejowski, 2009], which is also based on PEG, and extended it to allow the
definition of sugar libraries. Figure 5.4 shows the syntax definition of sugar libraries.
As a definition of a sugar library does not extend immediately the grammar, the non-
terminal sugar_decl only collects the name of the sugar library and the rules in a
single string. This information is passed through the rules of Figure 5.4 as synthesized
attributes and is used later in an import statement to extend the grammar. Differently
from the implementation of SugarJ, which defines the rules in SDF [Heering et al.,
1989] syntax, we have decided to use the PEG style for defining the rules of SugarJ,
because of the base model. Otherwise, we would have to translate the context-free
rules to PEG and this would add a complexity that is out of the scope of this thesis.
We have also modified the nonterminal that represents type declarations to allow
declarations of sugar libraries. Therefore, the definition rule for this nonterminal has
5.2. The Syntax of SugarJ 67
a new choice:
type_declaration[String pack, Map m] returns[Map m1]:
. . . / sugar_decl<s,r> {m1 = add(m,pack,s,r);}
The nonterminal type_declaration has two inherited attributes, the package name
and a map from names to rules, and one synthesized attribute, a map from sugar
names to their corresponding definitions. So, when a sugar library is defined by the
user, a type_declaration returns a new map associating the sugar library to its rules.
Figure 5.5 shows a new syntax definition for a compilation unit, highlighting the pos-
sible changes on the grammar rules. The nonterminal compilation_unit receives a
map of sugar libraries and passes it to the nonterminal import_decl. The nonterminal
import_decl checks if the file is importing a sugar library and adapts the grammar, if
necessary, using the function adapt . The adaptable grammar is returned as a synthe-
sized attribute and passed to the nonterminal type_declaration, which may use the
new syntax.
Every file is parsed by the nonterminal compilation_unit. So, for parsing our
examples of Figures 5.2 and 5.3, the compiler parses the definition in Figure 5.2 with
the nonterminal compilation_unit, which receives the initial grammar of the SugarJ
language and an empty map without any definition of sugar libraries. As a result, the
nonterminal compilation_unit returns a new map that has an entry for the new sugar
library Pair. This new map is used in the import declaration for parsing the program
text in Figure 5.3, so that the grammar is modified with the new rules defining Pair
syntax.
Composing sugar libraries
Sugar libraries are composed by importing more than one sugar library into the same
file. As an example, Figure 5.6 shows a program that uses the syntax of pairs and
closures. The compiler extends the grammar with the rules of the syntax of closures
defined in Figure 5.7 when parsing the first import statement, in line 1. Next, the
grammar is also changed with the syntax of pairs when parsing the import declaration
in line 2. The modified grammar, which has the syntactic rules of pairs and closures,
is used for parsing the class Partial.
The implementation of SugarJ uses SDF [Heering et al., 1989] and it may be
necessary to write disambiguation rules when composing various grammars. However,
it is impossible to prevent all the possibilities of ambiguities and conflicts, consequently
composing two or more sugar libraries is not always possible. APEG avoids ambiguities
68 Chapter 5. Evaluation and Validation
Figure 5.4 Syntax definition of sugar libraries.
1 sugar_decl r e tu rn s [ S t r ing name , S t r ing r u l e s ] :
2 'sugar ' name=Id '{' def ining_syntax<ru l e s > '}'
3 ;
4 def in ing_syntax r e tu rn s [ S t r ing r u l e s ] :
5 'context−f r e e syntax ' peg_rule<ru l e s >
6 ;
7 peg_rule r e tu rn s [ S t r ing ru l e ] :
8 { ru l e = '' ; } ( peg_expr<s> '−>' id=Id ' ; '
9 { ru l e += id + ' : ' + s + ' ; ' ; } )∗
10 ;
11 peg_expr r e tu rn s [ S t r ing ru l e ] :
12 peg_seq<ru le> ( '/' peg_seq<r> { ru l e += ' / ' + r ; } )∗
13 ;
14 peg_seq r e tu rn s [ S t r ing s ] :
15 peg_predicate<s> ( peg_predicate<s1> { s += ' ' + s1 ; } )∗
16 / { s = '' ; }
17 ;
18 peg_predicate r e tu rn s [ S t r ing r ] :
19 ' ! ' peg_unary_op<s> { r = ' ! ' + s ; }
20 / '&' peg_unary_op<s> { r = '&' + s ; } / peg_unary_op<r>
21 ;
22 peg_unary_op re tu rn s [ S t r ing r ] :
23 peg_factor<s> '∗' { r = s + '∗' ; }
24 / peg_factor<s> '+' { r = s + '+' ; }
25 / peg_factor<s> '?' { r = s + '?' ; }
26 / peg_factor<r>
27 ;
28 peg_factor r e tu rn s [ S t r ing r ] :
29 r=( p eg_ l i t e r a l / Id / ' . ')
30 / '( ' peg_expr<s> ') ' { r = '( ' + s + ') '}
31 ;
Figure 5.5 Syntax definition of compilation units.
1 compi lat ion_unit [Grammar g , Map m] r e tu rn s [Map m1 ] :
2 package_decl<p>? ( import_decl<g , m, g1> {g=g1 ; } )∗
3 ( type_dec larat ion<g , p ,m,m1> {m=m1; } )∗
4 ;
5 import_decl [Grammar g , Map m] r e tu rn s [Grammar g1 ] :
6 'import' n=qua l i f i e d_ id ' ; ' {g1=adapt (g ,m. get (n ) ) ; }
7 ;
5.2. The Syntax of SugarJ 69
Figure 5.6 Composition of more than one sugar library.
1 import javaclosure.Closure;
2 import syntactic.Pair;
3
4 public class Partial {
5 public static <R,X,Y> #R(Y)
6 invoke(final #R((X,Y)) f, final X x) {
7 return #R(Y y) {
8 return f.invoke ((x,y));
9 };
10 }
11 }
Figure 5.7 Definition of the closure syntax.
1 package javaclosure;
2
3 public sugar Closure {
4 context -free syntax
5 '#' type '(' type ')' -> type;
6 '#' type formal_param block -> expr;
7 }
using ordered choice, so composition is, in principle, always possible using APEG. In
fact, if there is some overlapping between the rules of two or more extensions, the first
option on the ordered choice clause will prevail. As new choices are always inserted at
the end of a rule definition, a user may change the priority altering the order of the
import declarations. It seems a simple task, but it is not always easy to understand
the interactions between overlapping rules.
Syntax × Semantics Paradox
Erdweg et al. [2011] claim that it is not clear how to support “local” imports, which
may extend the language. They give an example of extending the language with the
statement s1 after s2 whose semantics is to swap the execution order of the statements
s1 and s2. They argue that the code
(“12”, 34) after import syntactic.Pair
is a paradox, because only after swapping the two statements, the import statement
comes before the expression (“12”, 34), so it becomes a valid expression. However,
they claim that the parser should already know how to parse the pair expression (“12”,
34), before it can even consider parsing the import.
70 Chapter 5. Evaluation and Validation
We claim that this is not a paradox. In fact, it is an error situation and the
doubts arise only because of the lack of formalization of the language and a confusion
between syntax and semantics. Given the definition of the syntax in APEG, which
parses the program from left to right, it is possible to answer this question. Initially,
the grammar has the rule statement → expr ‘after’ expr, then the parser tries to
use this rule to parse the statement. Next, the parser tries to parse the first expression
with the current grammar and fails, because the current grammar was not extended yet
and there is not a rule for correctly parsing the pair expression (“12”, 34). Note that
the meaning of the statement (“12”, 34) after import syntactic.Pair was not
considered because the objective of the parser is only to check if the program conforms
with the grammar rules available at the moment and the semantics of any expression
is considered afterwards only if the program is valid.
5.3 The Syntax of Fortress
The main goals of the design of the Fortress language were to emulate mathematical
syntax and to be extensible [Allen et al., 2009]. These two goals impose additional
difficulties to build a parser for the language. However, defining the extensibility sys-
tem in a formalism like PEG, which supports unlimited lookahead would bring some
advantages [Allen et al., 2009; Ryu, 2009].
Figure 5.8 shows an example of the definition of an extension in Fortress. Line 1
defines a new grammar, called ForLoop, which may use symbols of two other grammars,
Expression and Identifier. The Fortress language has two types of nonterminal
specifications: the extension of an existing nonterminal, using the symbol |:= (line 2)
or the definition of a new one (line 4). The right hand side of a rule has two parts, a
parsing expression and an action. The parsing expression defines the syntax of the new
construct in a PEG style and the action part specifies how to translate the syntax into
the core language. The action part is everything after the symbol ⇒. It is possible
to use aliases associated with terminal or nonterminal symbols, creating references for
them, which can be used in the action part. Figure 5.8 shows an example in which the
nonterminal forStart is referenced by b in line 2.
Figure 5.9 shows part of an APEG syntax definition of the Fortress language.
Similarly to the SugarJ definition, the nonterminal gram_def defines the syntax of an
extension in Fortress and returns a map with the new entry for it. However, differently
from the SugarJ definition, a grammar definition in Fortress allows recursion and may
use the new syntax in the action part. Therefore, it is necessary to collect the grammar
5.3. The Syntax of Fortress 71
Figure 5.8 Definition of a for loop in Fortress.
1 grammar ForLoop extends{Express ion , I d e n t i f i e r }
2 Expr | := for b : f o r S t a r t => <[ b ]>
3
4 f o r S t a r t : :=
5 i : Id <− e : Expr d : doFront => <[ . . . ]>
6 | e : Expr d : doFront => <[ . . . ]>
7 . . .
8 end
rules before parsing the code. We use the and-predicate operator “&” to specify this,
collecting the grammar rules while ignoring the action part. Next, we reparse the
program with the modified grammar. Note that, when collecting the grammar rules
using the and-predicate operator, the action part is parsed as a string, ignoring every
symbol between ‘<[’ and ‘]>’ (nonterminal syn). After collecting the rule definitions,
we adapt the grammar and generate a new grammar g1. This new grammar is passed
to the nonterminal nonterm_def, which passes it to its children, allowing parsing the
action part (nonterminal syntax). Therefore, the action part may use the new syntax
being defined. The use of the and-operator, which allows an infinite lookahead, was
very important to handle recursion, a kind of forward reference.
Combining Grammars
Figure 5.10 shows an example of composition of grammars in Fortress. Grammar A
defines a new nonterminal Nt, and grammar B extends grammar A. Fortress allows the
use of the syntax of A in the action part of B, as in line 6. Grammar C extends B and
can use its syntax, however, C cannot use the syntax of A because it does not explicitly
extend grammar A. In [Allen et al., 2009], the authors report that they need to resolve
the set of extensions (for example, in grammar C it may use syntax defined in C or B,
but not in A) to generate the table for parsing the action part and this is not an easy
task.
Using the APEG model, defining the task described above is simple and clear. We
adapt the grammar, adding the rules of the grammars specified in the extends part.
For example, parsing the grammar B, we add only the rules of A and when parsing the
grammar C, we add only the rules of B.
Another difficulty reported in [Allen et al., 2009] is how to compose the rules with
multiple extensions, as defined in grammar D. In APEG, to have the same behaviour
of the original Fortress implementation, we must adapt the grammar in the following
72 Chapter 5. Evaluation and Validation
Figure 5.9 APEG formalization of Fortress language.
1 gram_def [Grammar g , Map m] r e tu rn s [Map m1 ] :
2 'grammar' n=id gram_ext<m, l >? &collect_gram<r>
3 {g1 = adapt (g , r + a l lRu l e s ( l ) ) ; } nonterm_def<g1>∗ 'end'
4 {m1 = put (m, n , r ) ; }
5 ;
6 gram_ext [Map m] r e tu rn s [ L i s t l ] :
7 'extends' qual i f ied_names<m, l>
8 ;
9 col lect_gram re tu rn s [ S t r ing r ] :
10 { r = '' ; } ( non_def<n , r> { r += 'n : r ; ' ; } )∗
11 ;
12 non_def return [ S t r ing n , S t r ing r ] :
13 n=id ' | := ' syn<r> ( '/' syn<r1> { r += '/' + r1 ; } )∗
14 / n=id ' : := ' syn<r> ( '/' syn<r1> { r += '/' + r1 ; } )∗
15 ;
16 syn r e tu rn s [ S t r ing r ] :
17 peg_seq<r> '=>' '<[' ! ']>' . ']>'
18 ;
19 nonterm_def [Grammar g ] :
20 id ' | := ' syntax ( '/' syntax )∗ / id ' : := ' syntax ( '/' syntax )∗
21 ;
22 syntax :
23 peg_seq<r> '=>' '<[' expr ']>'
24 ;
order: first, we add the rules of the grammar which is currently being defined (rules
of D in the example), next the grammars in the extends part in the same order that
is specified (first, it adds rules of B and in the sequel, rules of C, for the example of
Figure 5.10).
The combination of extensions is difficult in the Fortress implementation because
it must generate an entire grammar which must contain the definitions of all grammars
used. As in the APEG model the grammar is changed locally and only as needed,
combining grammars is easy and clear.
Other Features
When we were defining the Fortress language in APEG, we have noted that the lan-
guage is powerful enough for defining other aspects of itself. For example, an operator
name in Fortress must be defined by a sequence containing only uppercase letters and
underscore. This sequence must not begin or end with underscore, and must have
at least two different symbols. It is not simple to define this in a CFG formalism,
5.3. The Syntax of Fortress 73
Figure 5.10 Combining grammars.
1 grammar A
2 Nt : := macroA => . . .
3 end
4
5 grammar B extends A
6 Nt | := macroB => < [ . . . macroA . . . ] >
7 end
8
9 grammar C extends B
10 Nt | := macroC => < [ . . . macroB . . . ] >
11 end
12
13 grammar D extends {B,C}
14 Nt | := macroD => < [ . . . macroB macroC . . . ] >
15 end
leading to a very large grammar. Using the APEG model, it is possible to define this
syntactically, as follows.
op_name:
ch1=[A-Z] ('_'* ch2=[A-Z] {? ch1 != ch2}
tail_op_name / '_'* &[A-Z] op_name);
tail_op_name:
![A-Za-z0-9_]
/ ('_'* [A-Z])+ ![A-Za-z0-9_];
The nonterminal op_name tests if the first two uppercase letters are different
and if so, the nonterminal tail_op_name is called. Otherwise, it calls itself ignoring all
symbols until the second uppercase letter. The nonterminal tail_op_name recognizes
a sequence of uppercase letters and underscore, in which the last symbol is an uppercase
letter.
The parser available for Fortress treats an operator name as an identifier and
calls an external function to check if it is a valid operator name. This procedure could
also be implemented in Rats! using a definition similar to the one presented above, but
the developers apparently favoured efficiency at the expense of clarity. This example is
presented only to show the power of APEG, using it for problems other than adapting
grammars.
74 Chapter 5. Evaluation and Validation
DSL Implementation With SDF Implementation With APEG
xml
closure
pair
n-xml
n-closure
n-pair
closure-pair-xml
Adapt Parse
17948 280
17858 84
22920 41
- -
22975 378
29680 111
39537 401
Adapt Parse
16 1048
3 582
3 368
70 19029
33 1261
37 1130
52 1604
Table 5.1: Time in milliseconds for parsing programs written in the SugarJ language.
The performance of the original SugarJ compiler and the APEG version are compared.
5.4 Performance Evaluation
In order to verify if our approach can be efficient enough to be used in a real implemen-
tation of extensible languages, we have implemented a parser for the SugarJ language
[Erdweg et al., 2011] in our model. The SugarJ language has the adequate features for
testing our approach, because the language must be dynamically extended whenever
a sugar library is imported by a module. The SugarJ compiler implemented by the
developers of the language serves as basis for a performance comparison. We have run
performance tests comparing the parser generated by our implementation, based on
the APEG SugarJ description, and the parser provided by the SugarJ developers. The
results are summarized in Table 5.1.
The most interesting features to be evaluated are the ones related to the exten-
sibility mechanisms, so we have tested the two parsers with three DSLs (specified as
sugar libraries): xml , closure and pair . These DSLs are presented in [Erdweg et al.,
2011] and available in the SugarJ website. They define, respectively, a XML language
embedded in the Java language, Java extended with closures, and Java extended with
pair type. In the original SugarJ implementation, the syntax of each DSL is defined
using the SDF style and in the APEG version the syntax is defined in the PEG style.
The programs tested as input for both SugarJ parsers are identical, except for the codes
related to the definition of the syntax of DSLs, in Sugar libraries, which are defined in
the PEG style in the APEG version and in the SDF style in the original version.
We have measured the time for syntax analysis of programs written in the se-
lected DSLs. The parsers first process the rules that specify the DSL, and then parse
the code that may include the new constructs. The results in Table 5.1 detail the time
spent into adapting the grammar (adapt) and actually parsing the program, including
new constructs (parse). In the original SugarJ compiler, the adapt time is the time
5.4. Performance Evaluation 75
for compiling the generated SDF file which has the modified grammar, generating a
new parse table, and for loading this new parse table. In our APEG implementation,
the adapt time is the time for interpreting the rules and changing the grammar. Ev-
erything else is considered as parse time. As our main focus is adaptability, we have
not computed the time for the desugaring phase, after parsing the programs. We have
performed the experiments in a 64-bit, 2.4GHz Intel Core i5 running Ubuntu 12.04
with 6GB of RAM. We have repeated the execution 30 times in a row and measured
the average for adapt and parse time.
We may analyse the result considering three types of users: DSL library designers;
users of the DSL libray; and system end-users. For a system end-user, it does not matter
how the system was implemented, using APEG or SDF, because the executable code is,
pontentially, the same. For the other two types of users, however, how the compiler is
implemented does matter, because each one will use the compiler several times during
the development cycle (write-compile-execute).
We expect a DSL library designer to edit several times the code of the DSL
being developed and use this DSL in small programs to test it, so the adapt task may
be performed several times in a row. The first three lines, xml, closure and pair, of
Table 5.1 give an idea of this common work of a DSL designer. These lines present
the parsing of a program that imports and uses only one DSL, namely the one listed
in the first column, giving an idea of the overhead to generate a new parser table
which includes the rules of the DSL. The original SugarJ compiler takes a long time
to perform adaptation because the compiler generates an entire parse table and loads
it every time a program uses a sugar library (DSL). On the other hand, the time for
adapting the grammar in our APEG implementation is the time taken for parsing the
string representing the sugar library rules, which is proportional to the length of this
string, and the time for setting a few pointers that will change the language grammar.
Therefore, for a DSL library designer the APEG implementation presents a better
performance.
The original SugarJ compiler uses a caching system for parse tables, avoiding
the generation of a same parse table several times. A DSL user would take advantage
of this feature, because it only edits programs that use the DSL library and it pays
the adapt price only once. In order to evaluate the impact of this feature, we have
performed tests with several modules using the same DSL. In Table 5.1, the lines
labeled n-closure and n-pair present results of testing 20-module programs using only
the DSL closure and DSL pair , respectively. The overhead for adapting the grammar
occurs only when parsing the first module of a program, but this overhead is so large
that the results with the APEG parser are still better. However, if, in a day of work,
76 Chapter 5. Evaluation and Validation
the programs which use the DSL were recompiled several times (perhaps, more than
200 times would be enough), the overall performance using SDF would be better.
We also collected large XML examples from a XML repository2 and built large
programs for testing the performance of the compilers in this situation. We built a
test, namely n-xml , with data we have collected containing 10 modules using embedded
XML. The total size of the programs is 5.4 MB of code. Our APEG implementation
parses all the code for the n-xml example in a reasonable time, however we could not
compare with the original SugarJ compiler because it was not able to process the large
files even after several minutes of execution. The examples with results on lines n-pair ,
n-closure and n-xml were designed with the goal of creating situations which could
minimize the overhead for adapting the grammar on the original SugarJ compiler,
simulating a situation more often faced by a DSL library user. These tests show that
although the parse task in the APEG implementation is slower than the SDF one, the
time taken by our approach is also reasonable to be used in practice.
Finally, we tested the performance of the implementations when the input pro-
grams use different DSLs. In a real system, it would use several DSLs and combinations
of them in different parts of the system, requiring to adapt the grammar many times
with different rules. The last line of Table 5.1 presents the parse time of a program
which is a collection of modules that use different combinations of the three DSLs,
testing this situation. The first six tests impose few modifications in the grammar and
this last one requires several modifications, but the results are similar.
The tests show that the adaptation time in the original SugarJ compiler is re-
sponsible for more than 93% of the execution time and the adaptation time in our
APEG implementation is responsible for less than 2% of the execution time, in av-
erage. As we expected, the original SugarJ compiler executed faster than our APEG
prototype interpreter, when parsing the programs after the grammar was changed. It
shows that the time for adapting the grammar in our approach is not significant and
the major time is the parse time. Parsing the program after adapting the grammar
takes a time similar to the execution of the regular PEG algorithm3 which has a linear
complexity. Moreover, our interpreter does not implement any optimization, so we
could apply several techniques to optimize the parse task, getting results near to the
SDF performance. Therefore, it is clear that our approach is feasible to use in real
implementations of extensible languages.
Besides the performance of the parsers, there are other points to highlight when
2http://www.cs.washington.edu/research/xmldatasets/www/repository.html
3We assume that the values of the nonterminal inherited attributes when backtracking will often
repeat, allowing memoization
5.5. Conclusion 77
comparing the two SugarJ implementations. Using APEG, the formal definition of the
language’s syntax is totally specified, whereas it is only partially defined in the original
implementation that uses SDF. In the latter, the extensibility mechanism is provided
by additional programming to direct the parser to use the current parse table for
processing the input until a sugar library import is found. At this point, the library is
to be parsed to generate the SDF file grammar, which will be used to produce the parse
table from this new grammar. Next, the parser resumes the analysis of the remaining
input string. In APEG, a language designer does not need to be concerned with all
these procedures, because the adaptation mechanism is automatically performed.
5.5 Conclusion
We showed the definition of the extensible languages SugarJ and Fortress using APEG.
The specification of these languages shows that APEG is a powerful formalism which
permits a clear definition of what rules are available at a given moment during parsing.
The implementation of these languages uses a mixed approach of handwriting code
and automatic code generation. Compared with APEG, the syntax of these languages
is fully defined and programs are automaticaly parsed. Using an automatic approach,
we may assure correctness and recognition completeness, since with the manual imple-
mentation it is very difficult to guarantee that all programs will be correctly analysed.
The formal specification of the SugarJ language allows us to resolve the false paradox
about the local imports raised by Erdweg et al. [2011]. Also, the semantics of the com-
bination of Fortress grammars is clear in the APEG specification, explicitly showing
what set of rules is to be used when a grammar extends another. Although we have
discussed only how to define the syntax of these languages using APEG, the semantics
of the constructs can be done as usual in other models, such as based on an AST built
after parsing the program.
Forward reference is reported as difficult to be handled with adaptable mod-
els [Carmi, 2010]. The definition of grammars in Fortress has a kind of forward ref-
erence, in which the action part may use syntax that is defined later. Therefore, it is
necessary to use a multi-pass approach. We showed that the predicate operator & of
APEG allows simulating a multi-pass parser, handling forward reference properly.
In order to evaluate the feasibility of APEG to be used in pratice, we performed
an experimental evaluation on parsing programs of the SugarJ languages. The re-
sults indicate that APEG may significantly improve the performance of parsing such
programs, when compared to the original implementation built with SDF.

Chapter 6
Conclusion and Future Work
The main motivation for this work was the current lack of appropriate models for the
definition and implementation of features for allowing on-the-fly modification of the
grammar rules. In order to solve this problem, we have proposed a new adaptable
model based on PEG. The main goals of the proposed model are:
1. to offer facilities for adapting the grammar during the parsing process, without
adding too much complexity to the PEG model;
2. to assure that grammar extensions take effect immediately;
3. to allow an implementation with reasonable efficiency.
Below we present arguments intended to convince the reader that we have suc-
ceeded.
6.1 Adaptability at a Low Complexity Cost
Our model allows changing the grammar at parse time and it has a syntax as clear as
Christiansen’s Adaptable Grammars, because the same principles are used. In order to
explore the full power of the model, it is enough for a developer to be familiar with AGs
and PEGs. So it is reasonable to say that not much complexity was added to the PEG
model, or at least, we used principles that are well known by most language designers
(AG). Additionally, we keep some of the most important advantages of declarative
adaptable models, such as an easy definition of context dependent aspects associated
with static scope and nested blocks. We showed that the use of PEG as the basis
for the model allowed a very simple solution for handling forward reference, which is
79
80 Chapter 6. Conclusion and Future Work
important to define the syntax of languages like Fortress. Forward reference is reported
as very difficult issue to be solved with adaptable models based on CFG.
Our model also has some similarities with the imperative approaches of adaptable
grammars. PEG may be viewed as a formal description of a top-down parser, so the
order in which the productions are used is important to determine the adaptations our
model performs. However, this is not a disadvantage as it is for imperative adaptable
models based on CFG. Even for standard PEG (non adaptable), designers must be
aware of the top-down nature of the model, so adaptability is not a significant increase
on the complexity of the model.
When defining the syntax of extensible languages, the use of APEG has some
advantages. Extending a language specification may require the extension of the set of
its lexemes. APEG is scannerless so the extension of the set of lexemes in a language is
performed with the same features used for the extension of the syntax of the language.
APEG enables the use of automatic parser generation, reducing the complexity of
building parsers for such language. It makes easier to correct bugs and to change the
concrete syntax of the language during the development phase. As the implementation
conforms with the specification, if the specification is correct, then also the generated
parser will be. Moreover, our specifications clearly define what rules are available at
a given moment during the parsing. It allows avoiding confusions about the language
syntax, as the false paradox regarding local import raised by Erdweg et al. [2011].
In [Kats et al., 2010], the authors indicate that pure and declarative syntax defi-
nitions have significant advantages over parser definitions. They claim that, using the
PEG model, it is not possible to be completely oblivious about the parser implementa-
tion. A language developer must be aware of the effects of the ordering of alternatives
in production rules, which are especially complex for larger, modular grammars with
injection productions. In our work, changes to a language definition are restricted to
the insertion of new rules, or the extension of a rule by adding an alternative at the end
of this rule. These restrictions prevent most problems associated with subtle changes
on a language definition, caused by the PEG semantics for ordering of alternatives.
The examples in sections 5.2, 5.3, and 5.4 present evidences that the restrictions are
not so severe to the point of causing difficulties for the definition of extensions. In fact,
this semantics provides a simple mechanisnm to resolve ambiguities and conflits when
importing several sugar libraries in programs written in SugarJ, by just swapping the
order of the imports statements. Perhaps, it would be less annoying to the user than
subtle parser table conflits errors that may arise when using several libraries.
6.2. A Reasonably Efficient Implementation 81
6.2 A Reasonably Efficient Implementation
Another goal was to show that the model allows building an implementation efficient
enough to be used in practice. This was another reason for choosing PEG as the base
model.
An adaptable parser must deal with changes of the production rules at parse time.
If a model such as LR, LL or SGLR is used, it may be necessary to carry on a large
amount of computations when a rule is changed, recalculating lookahead functions and
updating parse tables. Considering the restrictions we defined for the extension of
production rules, it is possible to efficiently produce new PEG rules at parse time. In
this case, the PEG semantics for ordering of alternatives is an advantage.
In Chapter 4, we have presented an interpreter for our model. The tests de-
scribed in Section 5.4 indicate that this interpreter may have a better performance
than the available compiler for the extensible language SugarJ. These results show
initial evidences that our approach may be used in practice.
Automatic generation of an adaptable model is difficult because extensions on
the syntax may invalidate the code of the generated parser. In Section 4.4, we also
propose a mixed approach to generate an extensible parser from an APEG grammar,
combining compilation with interpretation. This approach has the virtue of being very
simple, as opposed to other models based on CFG, since LL, LR and SGLR would
require to recalculate parser table decisions.
6.3 Future Work
As explained in Section 4.4, we have not yet developed a code generator that may
automatically produce a recursive descent parser for the static part of the language
specification written in APEG. The examples used here were handwritten and served
only as a proof of concept of the proposed approach. Our next steps include the
implementation of this code generator, which will make possible to test for the entire
syntax of real extensible languages and evaluate the performance of the generated code
when parsing real programs. Our strategy is based on the assumption that the code
for the base grammar is expected to be large and used many times than the extensions.
We still have to prove this assumption. Several optimizations on the generated parsers
may also be introduced. For example, we can apply techniques to generate code for
rules that are used more frequently during interpretation.
The semantics of APEG, described in Figures 3.2 and 3.3, defines that, for some
operations, changes on the environment must be discarded when a failure occurs. We
82 Chapter 6. Conclusion and Future Work
are not sure that this semantics is really useful, but it is clear that it may be expen-
sive. Some investigation is still necessary in order to reach a better conclusion. The
interpreter developed allows turning on or switching off this semantics, so it may be
an appropriate tool for testing the utility of such semantics.
In order to implement memoization for the APEG model, for each nonterminal
symbol, it is necessary to store the values of the inherited attributes used for each
position of the input string. As the range of the inherited attributes can be unbound,
it is not practical to create a simple two dimensional table, such as in packrats parsers.
Because of this, we cannot ensure that the algorithm is linear-time. The approach taken
by Becket and Somogyi [2008] for memoization on Definite Clause Grammars is similar
to ours and they suggest that a super-linear behaviour does not happen in practice.
They also show that, in general, it is better memoing none or few nonterminals than
all nonterminals. This subject requires more investigation.
Currently, rules inserted in a nonterminal at parse time are represented as plain
strings. In the future, we will evaluate the use of metaprogramming techniques for
the definition of the set of new rules. A promising approach may be based on the
techniques used in tools like MetaAspectJ [Zook et al., 2004].
As APEG has a flexible mechanism for changing the grammar during the parsing,
it opens several possibilities to compose grammars and define grammars in a modular
way. It may enable the generation of parsers from APEG grammars and to distribute
them as libraries. We have plans to investigate this.
There are some problems related to the PEG approach which are also inherited
by our approach, such as left-recursion grammars [Medeiros et al., 2012] and error
reporting [Maidl et al., 2013]. Therefore, we have to study how to deal with these
problem in APEG, specially when adapting the grammar.
We also plan to investigate if it would be beneficial to use a parsing machine, such
as LPeg [Ierusalimschy, 2009; Medeiros and Ierusalimschy, 2008], instead of generating
a recursive descent parser from an APEG grammar.
6.4 Publications
The development of this thesis has produced publications at the 16th Brazilian Sym-
posium on Programming Languages [Reis et al., 2012], the 18th Brazilian Symposium
on Programming Languages [Reis et al., 2014c], the 29th Annual ACM Symposium on
Applied Computing [Reis et al., 2014b] and at the Science of Computer Programming
journal [Reis et al., 2014a].
Appendix A
Adaptable Parsing Expression
Grammar
We present below, in Figure A.1, a simplified version of our grammar for APEG, written
in ANTLR. The terminal symbol ID represents an identifier and ALPHA is any character.
The definitions of the nonterminal symbols cond and exp are omitted. They represent
a condition and an expression, respectively, written in the embedded language.
Figure A.1 Simplified version of the APEG grammar
1 grammarDef: ’apeg’ ID ’;’ functions? rule+ ;
2 functions : ’functions ’ (ID)+ ’;’ ;
3
4 rule : ID decls? (’returns ’ decls)? (’locals ’ declas )? ’:’
5 peg_expr ’;’ ;
6 decls : ’[’ varDecl (’,’ varDecl )* ’]’ ;
7
8 varDecl : type ID ;
9 type : ID ;
10
11 peg_expr : peg_seq (’/’ peg_expr | ) ;
12 peg_seq : ( (ID ’=’)? peg_unary_op )+ ;
13
14 peg_unary_op :
15 peg_factor (’?’ | ’*’ | ’+’)?
16 | ’&’ peg_factor | ’!’ peg_factor
17 | ’{?’ cond ’}’ | ’{’ (ID ’=’ expr ’;’)+ ’}’
18 ;
19 peg_factor :
20 ’\’’ ALPHA* ’\’’ | ID (’<’ actPars ’>’)?
21 | ’[’ (ALPHA ’-’ ALPHA)+ ’]’ | ’.’
22 | ’(’ peg_expr ’)’
23 ;
24 actPars: (expr (’,’ expr )* )? ;
83

Bibliography
Adams, M. D. (2013). Principled parsing for indentation-sensitive languages: revisiting
landin’s offside rule. In Proceedings of the 40th annual ACM SIGPLAN-SIGACT
symposium on Principles of programming languages, POPL ’13, pages 511--522, New
York, NY, USA. ACM.
Aho, A. V. (1968). Indexed grammars – an extension of context-free grammars. J.
ACM, 15(4):647--671. ISSN 0004-5411.
Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryu, S., Steele Jr.,
G. L., and Tobin-Hochstadt, S. (2008). The Fortress Language Specification Version
1.0.
Allen, E., Culpepper, R., Nielsen, J. D., Rafkind, J., and Ryu, S. (2009). Growing a
syntax. In International Workshop on Foundations of Object-Oriented Languages.
Barišić, A., Amaral, V., Goulão, M., and Barroca, B. (2011). Quality in use of domain-
specific languages: A case study. In Proceedings of the 3rd ACM SIGPLAN Workshop
on Evaluation and Usability of Programming Languages and Tools, PLATEAU’11,
pages 65--72, New York, NY, USA. ACM.
Becket, R. and Somogyi, Z. (2008). DCGs + Memoing = Packrat Parsing but Is It
Worth It? In Hudak, P. and Warren, D., editors, Practical Aspects of Declara-
tive Languages, volume 4902 of Lecture Notes in Computer Science, pages 182–196.
Springer Berlin Heidelberg.
Boullier, P. (1994). Dynamic grammars and semantic analysis. Rapport de recherche
RR-2322, INRIA. Projet CHLOE.
Burshteyn, B. (1990a). Generation and recognition of formal languages by modifiable
grammars. SIGPLAN Not., 25:45--53. ISSN 0362-1340.
85
86 Bibliography
Burshteyn, B. (1990b). On the modification of the formal grammar at parse time.
SIGPLAN Not., 25:117--123. ISSN 0362-1340.
Cabasino, S., Paolucci, P. S., and Todesco, G. M. (1992). Dynamic parsers and evolving
grammars. SIGPLAN Not., 27:39--48. ISSN 0362-1340.
Carmi, A. (2010). Adaptive multi-pass parsing. Master’s thesis, Israel Institute of
Technology.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on
Information Theory, 2:113--124. http://www.chomsky.info/articles/195609--.
pdf – last visited 14th January 2009.
Chomsky, N. (1959). On certain formal properties of grammars. Information and
Control, 2(2):137 – 167. ISSN 0019-9958.
Christiansen, H. (1987). The Syntax and Semantics of Extensible Languages. Roskilde
datalogiske skrifter. Computer Science, Roskilde University Centre.
Christiansen, H. (1990). A survey of adaptable grammars. SIGPLAN Not., 25:35--44.
ISSN 0362-1340.
Christiansen, H. (2009). Adaptable grammars for non-context-free languages. In Pro-
ceedings of IWANN’09, pages 488--495. Springer-Verlag.
de Chastellier, G. and Colmerauer, A. (1969). W-grammar. In Proceedings of the 1969
24th national conference, ACM ’69, pages 511--518, New York, NY, USA. ACM.
Erdweg, S., Rendel, T., Kästner, C., and Ostermann, K. (2011). SugarJ: library-based
syntactic language extensibility. In Proceedings of the 2011 ACM international con-
ference on Object oriented programming systems languages and applications, OOP-
SLA ’11, pages 391--406, New York, NY, USA. ACM.
Ford, B. (2002a). Packrat Parsing: a Practical Linear-Time Algorithm with Backtrack-
ing. PhD thesis, Massachusetts Institute of Technology.
Ford, B. (2002b). Packrat parsing:: Simple, powerful, lazy, linear time, functional
pearl. In Proceedings of the Seventh ACM SIGPLAN International Conference on
Functional Programming, ICFP ’02, pages 36--47, New York, NY, USA. ACM.
Ford, B. (2004). Parsing expression grammars: a recognition-based syntactic founda-
tion. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages, POPL ’04, pages 111--122, New York, NY, USA. ACM.
Bibliography 87
Fowler, M. (2010). Domain-Specific Languages. Addilson-Wesley, Boston, USA. ISBN
0321712943.
Grimm, R. (2006). Better extensibility through modular syntax. In Proceedings of the
2006 ACM SIGPLAN conference on Programming language design and implemen-
tation, PLDI’06, pages 38--51, New York, NY, USA. ACM.
Heering, J., Hendriks, P. R. H., Klint, P., and Rekers, J. (1989). The syntax definition
formalism sdf–reference manual–. SIGPLAN Not., 24(11):43--75. ISSN 0362-1340.
Ierusalimschy, R. (2009). A text pattern-matching tool based on parsing expression
grammars. Software – Practice and Experience, 39(3):221--258. ISSN 0038-0644.
Jim, T., Mandelbaum, Y., and Walker, D. (2010). Semantics and algorithms for data-
dependent grammars. In Proceedings of the 37th annual ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL’10, pages 417--430, New
York, NY, USA. ACM.
Kats, L. C. L., Visser, E., and Wachsmuth, G. (2010). Pure and declarative syntax
definition: paradise lost and regained. In Cook, W. R., Clarke, S., and Rinard,
M. C., editors, Proceedings of OOPSLA 2010, pages 918–932. ACM.
Kieburtz, R. B., McKinney, L., Bell, J. M., Hook, J., Kotov, A., Lewis, J., Oliva, D. P.,
Sheard, T., Smith, I., and Walton, L. (1996). A software engineering experiment in
software component generation. In ICSE ’96: Proceedings of the 18th International
Conference on Software Engineering, pages 542--552, Washington, DC, USA. IEEE
Computer Society.
Knuth, D. (1968). Semantics of context-free languages. Mathematical Systems Theory,
2(2):127–145. ISSN 0025-5661.
Kosar, T., López, P. E. M., Barrientos, P. A., and Mernik, M. (2008). A preliminary
study on various implementation approaches of domain-specific language. Informa-
tion and Software Technology, 50(5):390 – 405. ISSN 0950-5849.
Kosar, T., Mernik, M., and Carver, J. C. (2012). Program comprehension of domain-
specific and general-purpose languages: comparison using a family of experiments.
Empirical Software Engineering, 17(3):276–304. ISSN 1382-3256.
Kosar, T., Oliveira, N., Marjan, M., Pereira, M. J. V., Črepinšek, M., da Cruz, D., and
Henriques, P. R. (2010). Comparing general-purpose and domain-specific languages:
An empirical study. Computer Science and Information Systems, 7:247–264.
88 Bibliography
Koster, C. (1991a). Affix grammars for natural languages. In Alblas, H. and Melichar,
B., editors, Attribute Grammars, Applications and Systems, volume 545 of Lecture
Notes in Computer Science, pages 469–484. Springer Berlin Heidelberg.
Koster, C. (1991b). Affix grammars for programming languages. In Alblas, H. and
Melichar, B., editors, Attribute Grammars, Applications and Systems, volume 545 of
Lecture Notes in Computer Science, pages 358–373. Springer Berlin Heidelberg.
Koster, C. H. A. (1970). Affix-grammars. In Peck, J. E. L., editor, ALGOL 68 Imple-
mentation: Proceedings of the IFIP Working Conference on ALGOL 68 Implemen-
tation, Munich, Germany, July 20-24, 1970, pages 95–109. North-Holland.
Maidl, A., Mascarenhas, F., and Ierusalimschy, R. (2013). Exception handling for error
reporting in parsing expression grammars. In Du Bois, A. and Trinder, P., editors,
Programming Languages, volume 8129 of Lecture Notes in Computer Science, pages
1–15. Springer Berlin Heidelberg.
Mayer, O. (1972). Some restrictive devices for context-free grammars. Information and
Control, 20(1):69 – 92. ISSN 0019-9958.
Medeiros, S. and Ierusalimschy, R. (2008). A parsing machine for pegs. In Proceedings
of the 2008 Symposium on Dynamic Languages, DLS ’08, pages 2:1--2:12, New York,
NY, USA. ACM.
Medeiros, S., Mascarenhas, F., and Ierusalimschy, R. (2012). Left recursion in parsing
expression grammars. In Proceedings of the 16th Brazilian Conference on Program-
ming Languages, SBLP’12, pages 27--41, Berlin, Heidelberg. Springer-Verlag.
Mercer, D. B. (2008). Attributed parsing expression grammars. Master’s thesis, Uni-
versity of South Alabama.
nez Guzmán, E. A. D. (2009). LGI (Language Generator by Instil).
http://sourceforge.net/projects/instil-lang/.
Parr, T. and Fisher, K. (2011). LL(*): the foundation of the antlr parser generator.
In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’11, pages 425--436, New York, NY, USA. ACM.
Parr, T., Harwell, S., and Fisher, K. (2014). Adaptive LL(*) parsing: The power
of dynamic analysis. In Proceedings of the 2014 ACM International Conference on
Object Oriented Programming Systems Languages &#38; Applications, OOPSLA’14,
pages 579--598, New York, NY, USA. ACM.
Bibliography 89
Parr, T. J. and Quong, R. W. (1994). ANTLR: A predicated-LL(k) parser generator.
Software Practice and Experience, 25:789--810.
Rahien, A. (2010). DSLs in BOO: Domain-Specific Languages in .NET. Manning,
Stanford, USA. ISBN 978-1-933988-60-3.
Redziejowski, R. R. (2009). Mouse: from parsing expressions to a practical parser. In
Proceedings of the CS&P 2009 Workshop, pages 514--525. Warsaw University.
Reis, L. V. S., Bigonha, R. S., Di Iorio, V. O., and Amorim, L. E. S. (2012). Adaptable
parsing expression grammars. In Carvalho Junior, F. and Barbosa, L. S., editors,
Programming Languages, volume 7554 of Lecture Notes in Computer Science, pages
72–86. Springer Berlin Heidelberg.
Reis, L. V. S., Bigonha, R. S., Di Iorio, V. O., and Amorim, L. E. S. (2014a). The for-
malization and implementation of adaptable parsing expression grammars. Science
of Computer Programming, 96, Part 2:191–210. ISSN 0167-6423.
Reis, L. V. S., Di Iorio, V. O., and Bigonha, R. S. (2014b). Defining the syntax of
extensible languages. In Proceedings of the 29th Annual ACM Symposium on Applied
Computing, SAC ’14, pages 1570--1576, New York, NY, USA. ACM.
Reis, L. V. S., Di Iorio, V. O., and Bigonha, R. S. (2014c). A mixed approach for
building extensible parsers. In Quintão Pereira, F. M., editor, Programming Lan-
guages, volume 8771 of Lecture Notes in Computer Science, pages 1–15. Springer
International Publishing.
Reis, L. V. S., Di Iorio, V. O., Bigonha, R. S., Bigonha, M. A. S., and Ladeira, R. C.
(2009). XAJ: An extensible aspect-oriented language. In Proceedings of the III
Latin American Workshop on Aspect-Oriented Software Development, pages 57--62.
Universidade Federal do Ceará.
Rozenberg, G. and Wood, D. (1980). Context-free grammars with selective rewriting.
Acta Informatica, 13(3):257–268. ISSN 0001-5903.
Ryu, S. (2009). Parsing fortress syntax. In Proceedings of the 7th International Con-
ference on Principles and Practice of Programming in Java, PPPJ ’09, pages 76--84,
New York, NY, USA. ACM.
Salomaa, A. (1972). Matrix grammars with a leftmost restriction. Information and
Control, 20(2):143 – 149. ISSN 0019-9958.
90 Bibliography
Seaton, C. (2007). A programming language where the syntax and semantics are
mutable at runtime. Technical report CSTR-07-005, University of Bristol.
Shutt, J. N. (1998). Recursive adaptable grammars. Master’s thesis, Worchester Poly-
technic Institute.
Sintzoff, M. (1967). Existence of a van wijngaarden syntax for every recursively enu-
merable set. Annales de la Société Scientifique de Bruxelles, 81(2):115–118.
Slonneger, K. and Kurtz, B. L. (1995). Formal Syntax and Semantics of Programming
Languages: a Laboratory Based Approach. Addison-Wesley. ISBN 0201656973.
Stansifer, P. and Wand, M. (2011). Parsing reflective grammars. In Proceedings of the
Eleventh Workshop on Language Descriptions, Tools and Applications, LDTA ’11,
pages 10:1--10:7, New York, NY, USA. ACM.
Tobin-Hochstadt, S., St-Amour, V., Culpepper, R., Flatt, M., and Felleisen, M. (2011).
Languages as libraries. In Proceedings of the 32nd ACM SIGPLAN conference on
Programming language design and implementation, PLDI’11, pages 132--141, New
York, NY, USA. ACM.
van Deursen, A. and Klint, P. (1998). Little languages: Little maintenance. Journal
of Software Maintenance, 10(2):75--92. ISSN 1040-550X.
Watt, D. A. and Madsen, O. L. (1983). Extended attribute grammars. Comput. J.,
26(2):142–153.
Wegbreit, B. (1970). Studies in Extensible Programming Languages. Outstanding
Dissertations in the Computer Sciences. Garland Publishing, New York. ISBN 0-
8240-4423-1.
Wijngaarden, A. v. (1969). Report on the algorithmic language ALGOL 68. Printing
by the Mathematisch Centrum. ISBN B0007IUUXM.
Wilson, G. V. (2004). Extensible programming for the 21st century. Queue, 2(9):48--57.
ISSN 1542-7730.
Zook, D., Huang, S. S., and Smaragdakis, Y. (2004). Generating AspectJ programs
with Meta-AspectJ. In Generative Programming and Component Engineering: Third
International Conference, GPCE 2004, volume 3286 of LNCS, pages 1--19. Springer.