Introduction

Lex是一个lexical analyzer或者scanner。它把输入转换成tokens。

Yacc是一个syntax analyzer或者parser。它分析tokens，生成syntax tree。

Lex

Lex使用正则表达式匹配字符串。lex文件的后缀通常为.l，lex输入.l文件，输出.c文件。

它的格式如下：

definitions

%%

rules

%%

subroutines

3个部分用%%分开。如果subroutines为空，则第二个%%可省略。definitions和rules均可以为空。但第一个%%不能省略。

rules中，每一条rule单独占用一行。rule行必须从第一个字符开始。如果第一个字符是空白字符，此行不是rule，它会被原本的拷贝到输出.c文件中。

rule由pattern和action组成，用空白字符隔开。pattern是正则表达式。action是可选的。action可以是single C statement，也可以是multiple C statement，后者使用大括号括起来，C语句中可以使用yytext, yyleng, yyout等变量。yytext是匹配的字符串，yyleng字符串的长度，yyout是输出文件FILE *。

Lex Predefined Variables

Name	Function
int yylex(void)	call to invoke lexer, return token
char *yytext	pointer to matched string
yyleng	length of matched string
yylval	value assoicated with token
int yywrap(void)	wrapup, return 1 if done, 0 if not done
FILE *yyout	outputfile
FILE *yyin	input file
INITIAL	initial start condition
BEGIN	condition switch start condition
ECHO	write matched string

definitions section由substitutions, code和start states组成。

substituions类似一个宏，定义一个pattern，然后在rule中使用大括号来使用，如下：

digit [0-0]

letter [A-Za-z]

%%

{letter}({letter}|{digit})*

code是c语言，会原本的拷贝到输出.c文件中，code使用%{ 和 %}包起来，如下：

%{

int yylineno++;

%}

%%

subroutines是纯C代码。将被拷贝到输出.c文件中。

一个简单的例子：

$ ls

a.l

$ cat a.l

%%

. ECHO;

\n ECHO;

%%

int yywrap(void) {

return 1;

}

int main(void) {

yylex();

return 0;

}

$ flex a.l

$ ls

a.l lex.yy.c

$ cc lex.yy.c

$ ./a.out

1

如上的lex，.匹配任何非换行符，\n匹配换行符。ECHO;是一个预定义宏，输出匹配的行。使用flex编译，生成一个lex.yy.c文件，编译后运行。输入任何行，都原本输出。

Yacc

yacc的语法是Backus Naur Form (BNF) 的变种。BNF语法可以用表达context-free language。现代语言的大多数结构可以使用BNF表达。

例如，乘法和加法的表达式语法：

E -> E + E (r1)

E -> E * E (r2)

E -> id (r3)

出现在左边的(left-hand side)的是nonterminals，例如上述id的term是terminals，由lex产生的token。只能出现right-hand side (rhs)。

根据上面3个rule，我们可以产生表达式

E -> E + E (r2)

-> E * z (r3)

-> E + E * z (r1)

-> E + y * z (r3)

-> x + y * z (r4)

每一步扩展一个term，将lhs替换为rhs。

为了parse一个表达式，我们需要do the reverse operation. 方法是bottom-up或者shift-reduce，使用栈来存储terms。

1 . x + y * z shift

2 x . + y * z reduce(r3)

3 E . + y * z shift

4 E + . y * z shift

5 E + y . * z reduce(r3)

6 E + E . * z shift

7 E + E * . z shift

8 E + E * z . reduce(r3)

9 E + E * E . reduce(r2) emit multiply

10 E + E . reduce(r1) emit add

11 E . accept

每次左移1个，然后分解。第6步，分解并使用r1，导致加法的优先级高于乘法，这叫做shift-reduce conflict。

下面是reduce-reduce conflict，因为id可以分解为两种T或E

E -> T

E -> id

T -> id

Pratice Part 1

yacc的输入也分成3个部分

definitions

%%

rules

%%

subroutines

definitions section包含token declarations和C code，C code使用 %{ 和 %} 括起来。

BNF语法放在rules section。

user subroutines 放到subroutines section。

definitions中的token声明如下：

1	`%token INTEGER`

yacc将产生一个.tab.h的头文件，包含token类型等的声明

#ifndef YYSTYPE

#define YYSTYPE int

#endif

#define INTEGER 258

extern YYSTYPE yylval;

lex文件可以包含该头文件，来使用INTEGER定义，来返回一个该类型的token给yacc，yacc调用yylex函数来获取一个token，其返回值是token的类型。

[0-9]+ {

yylval = atoi(yytext);

return INTEGER;

}

直接先展示calc1.l和calc1.y

calc1.l

/* calculator #1 */

%{

#include "y.tab.h"

#include <stdlib.h>

void yyerror(char *);

%}

%%

[0-9]+ {

yylval = atoi(yytext);

return INTEGER;

}

[-+\n] { return *yytext; }

[ \t] ; /* skip whitespace */

. yyerror("Unknown character");

%%

int yywrap(void) {

return 1;

}

calc1.y

%{

#include <stdio.h>

int yylex(void);

void yyerror(char *);

%}

%token INTEGER

%%

program:

program expr '\n' { printf("%d\n", $2); }

|

;

expr:

INTEGER

| expr '+' expr { $$ = $1 + $3; }

| expr '-' expr { $$ = $1 - $3; }

;

%%

void yyerror(char *s) {

fprintf(stderr, "%s\n", s);

}

int main(void) {

yyparse();

return 0;

}

yacc内部维护两个栈，一个parse stack，一个value stack。这两个栈总是同步的。$$是lhs的值。$n是第n个term的值。

编译：

# bison -d calc1.y produces calc1.tab.c

# bison -y -d calc1.y produces y.tab.c (the standard yacc output)

# calc1

bison -y -d calc1.y

flex calc1.l

gcc -c y.tab.c lex.yy.c

gcc y.tab.o lex.yy.o -o calc1.exe

参考

Tom Niemann. Lex & Yacc Tutorial.

https://www.epaperpress.com/lexandyacc/

https://www.epaperpress.com/lexandyacc/download/LexAndYaccCode.zip

ILD

Introduction

Lex

Yacc

Pratice Part 1