.nr LL 7i .nr PO .75i .OH ''SDCL'' .EH ''SDCL'' .EF ''-%-'' .OF ''-%-'' .ND .LP .sp 2 .ce .cu CS409 Spring 1985 Semester Project .sp 1 .PP We can always work on specific programs that illustrate the usage of a language in a programming environment. I chose to have you write a compiler (preprocessor) for structured DCL (Digital Control Language) for the following reasons: .IP a. You will use C to develop a significantly sized program. .IP b. The methods used in developing the lexical analyser, the recursive descent parser, and the code generator are applicable in a number of other compiler type applications. .IP c. You would develop a very useful tool, which, I hope, you will cherish. .PP The command language in the VAX/VMS environment is called DCL. It provides fairly elaborate facilties for executing commands. One can write command procedures to execute a set of commands. The grave deficiency however, is the fact that in writing DCL command procedures, the only control struture provided is the .sp 1 .ti +5 IF condition THEN command .sp 1 If one wants to execute more that one command based on the condition, one has to use GOTO's (yugh!). For example, a command procedure to swap the contents of two files in DCL could be coded as follows: .DS $ IF (( p1 .eqs. "" ) .or. ( p2 .eqs. "" )) then goto usage $ copy 'p1' tmpfile $ copy 'p2' 'p1' $ copy tmpfile 'p2' $ exit $ usage: write sys$output "Usage: swap file1 file2" $ exit .DE p1 and p2 are first and second args to the command procedure. 'p1' and 'p2' are akin to UNIX shell's $1 and $2, i.e, the values of the parameters. The construct \fBusage:\fP is label that can be targeted by \fBgoto\fP's (yugh!). We need not get into any details of DCL. The point to note is that we had to resolve to using GOTO's to get the task done. What we want to do is to write a program that would allow one to write the same command procedure as follows: .DS if (( p1 .nes. "" ) .and. ( p2 .nes. "" )) { copy 'p1' tmpfile copy 'p2' 'p1' copy tmpfile 'p2' } else write sys$output "Usage: swap file1 file2" .DE I think you can easily infer our goal. We want an extension to DCL which will provide such control structures. Others we may want are the while and the for loops, break and next statements, and more if we become adventurous. We are really providing a small programming language. The compiler will then do a simple translation from the structured DCL to the form which VMS DCL recognizes. .PP One convenient way to describe a programming language is the Backus-Naur Form (BNF), which is a formal specification of the grammer of a language, that is, the set of rules by which a legal program in the langauge is written or recognized. Given a grammer, one can write a program that will analyse or .ul parse programs in that language. Fortunately, the grammer for s-DCL (the name is still debatable) is small: .KS .DS program : statement | program statement statement : \fBif (\fP condition \fB)\fP statement | \fBif (\fP condition \fB)\fP statement \fBelse\fP statement | \fBwhile (\fP condition \fB)\fP statement | \fBdo\fP statement \fBwhile (\fP condition \fB)\fP | \fBfor (\fP intialize \fB;\fP condition \fB;\fP reinitialze \fB)\fP statement | \fBbreak\fP | \fBnext\fP | \fB{\fP program \fB}\fP | \fBother\fP .DE .KE The first two lines say that a program is a statement, or a program followed by a statement. In other words, a program consists of one or more statements. A statement in turn is one of a handful of constructs; the vertical bar | indicates a choice of alternatives. Most statements are straightforward enough, standing for an occurence of the particular keyword. For example, a statement can consist of the keyword \fBif\fP followed by a parenthesized \fBcondition\fP and a \fBstatement\fP. (The definition is recursive, as is the definition of \fBprogram\fP.) A group of statements in braces can be used anywhere a single statement can, becuase of the rule .sp 1 .DS statement : { program } .DE .sp 1 This is the so-called \fBcompound\fP statement. \fBbreak\fP causes an exit from one level of a \fBwhile\fP or \fBfor\fP loop. \fBnext\fP causes the next iteration of the loop to begin. In a \fBwhile\fP it goes immediately to the \fBcondition\fP part; in a \fBfor\fP, it goes to the \fBreinitialize\fP step. .PP The last grammetical type is an \fBother\fP, which is anything that wasn't recognized as many of the preceding types. This category actually encompasses most of the DCL. For example, the statement .DS copy 'p1' tmpfile .DE is nothing recognizable by s-DCL, and is thus an \fBother\fP. .PP Type \fBother\fP is an important simplification, for it frees s-DCL from having to know very much about DCL. If a statement is encountered which does not begin with one of the keywords (or a left brace), it .ul must be an \fBother\fP, and no real processing is needed on it, other than emitting it with a $ preceeding it. The price of this simplication is that the error detecting abilities of s-DCL are not as good as they might be with a more comprehensive grammer. This is not a serious drawback, however, since we are translating into DCL, and DCL is perfectly capable of detecting any syntax errors that escape the preprocessor. .PP In principle, associated with each rule of the grammer is a \fBsemantic action\fP which states what is to be done when that particular construction is recognized in the program being translated. In s-DCL, the semantic actions are usually very simple, involving reformatting the incoming text and occasionally interspersing \fBif\fP's, \fBgoto\fP's and labels to translate the control flow statements. .PP The preprocessor is organized as follows. The top level is a controlling routine called the .ul parser, so called because it controls by analysing (parsing) the grammatically structure of the input it sees. For instance, when an .ul if is seen the parser calls a routine that handles .ul if statements. That routine in turn isolates the .ul condition part and generates the test, a semantic action. The parser must also remember that an .ul if has been seen so that when the end of the .ul statement part is reached, the correct terminating code for an .ul if can be produced. This may include dealing with an .ul else if it is prsent. Furthermore, all this is inherently recursive, as the BNF grammar indicates, because constructions can be nested as in .sp 1 .DS .KS for( i = 1; i .le. n; i = i + 1 ) for( j = 1; j .le. n; j = j + 1 ) if( p1 .eqs. "copy" ) append 'i' 'p2' .KE .DE .PP At the beginning of each statement, the parser calls a "lexical analysis" routine to classify it into one of the types specified in the grammer. The lexical routine returns the first token of the statement, which will determine the statement type. When the statement has been classified, the parser calls the appropriate code generation routine. .PP We will begin by describing the the lexical analyser, since it is essentially independent of everything else. Then we can present parsing and code generation more easily. .NH LEXICAL ANALYSIS .PP The purpose of the this stage is to provide a procedure that returns the next token and its classification (tokencode) from the input. Create a function called .ul lex which would begin as follows .sp 1 .DS lex(token) char *token; { .... .... return(tokencode); } .DE The formal parameter "token" will be loaded with the actual string that makes up a token, e.g, "if", "else", "*", "34587" etc. The value returned by the function "lex" will the tokencode for the token. These tokencode classify the input into catagories that are of interest to the parser. In your case, this classification consists of ID, integers, singlechar, comment, whitespace, newline, EOF and string. Some of the ID's (identifiers) are special: they are keywords e.g, "if", "else", "while", "for", "next" and "break". So if an identifier is found, you will need check whether it is a keyword or not. The classfication "singlechar" also emcompasses a number of single character tokens, e.g "(", "{", "*", "+", ".", "," ";" etc. If one of these is found, you will again need to check whether it happens to be one of the special one character tokens e.g, "(", ")", "{", "}", "#" etc. .PP A very convenient scheme to implement a table driven lexical analyser appeared in the following paper .DS Dedourek J. M., Gujar U. G., "Scanner Design", Software-Practice and Experience, Vol. 10., 959-972 (1980) .DE The authors of this paper present a step by step method for developing a lexical anaylser once a state diagram is at hand. Use the method outlined in the paper. Here are some guidelines for you to follow, if you wish, when you begin to write the "lex" function. .PP You may be tempted to define a new enumerated type that contains the character classes and use this to declare your two dimensional array. The only problem is that in C array indicies can only be integers. Unlike pascal, you .ul cannot use enumerated types to do things like .DS enum charclass { letter, digit, onechar, slash, whitespace, newline, eof, dqoute, star } enum charclass nextstate[0..12][letter..star]; .DE .PP Assuming that you got your tables developed by hand, here's what "lex" might look like. .sp 1 .DS /* lex -- return token and tokencode from standard input The lex routine is table driven. The two tables "nextstate" and "output" were developed by hand. Because no other routine needs to know what's in the tables, they are declared "static". */ /* first, defines to avoid using numbers for charclasses */ #define LETTER 0 /* [A-Za-z_$] (notice _ and $ are considered to be characters */ #define DIGIT 1 /* [0-9] */ #define ONECHAR 2 /* (){}[]'.;:#%?\e */ #define SLASH 3 /* / */ #define STAR 4 /* * */ #define WHITESPACE 5 /* blank,tab,non-printing chars, except NEWLINE. */ #define EOL 6 /* \en */ #define DQUOTE 7 /* " */ #define ENDFILE 8 /* don't use EOF */ /* declare and initialize the tables here. Use the correct size, the following is just an example */ static int nextstate[14][8] = { { 0, 1, 4, ...................... , 12 }, /* first row */ { 1, 1, 4, ...................... , 13 }, /* second row */ . . . { 2, 2, 4, ...................... , 12 } /* last row */ }; static int output[14][8] = { .... .... }; /* Use defines to set up token codes. It may better to put the following defines in a file, say "tcodes.h" and include it here because you would need the codes in your parsing routine also. So either */ #define IF 0 /* keywords first, if */ #define ELSE 1 #define WHILE 2 #define FOR 3 #define BREAK 4 #define NEXT 5 #define DO 6 #define ID 7 /* if identifier is not a keyword */ #define INTEGER 8 /* special single characters */ #define OPAREN 9 #define CPAREN 10 #define OBRACE 11 #define CBRACE 12 #define SCHAR 13 /* single character otherwise */ #define COMMENT 14 #define WSPACE 15 /* blanks,tab,non-printing characters */ #define NEWLINE 16 #define FILEEND 17 /* can't use EOF */ #define STRING 18 /* "...." */ /* or, if codes are in file "tcodes.h" */ #include "tcodes.h" /* variables that need to retain their values across calls to lex. */ static int state = 0; static int nextchar; lex(token) char *token; { int i, out; /* other declarations as needed */ /* here's the psuedo code. See the flow chart on p967 */ /* i = 0 token[i] = '\e0' loop if state != 0 then if i < MAXTOKENLEN then token[i] = nextchar i = i + 1 endif endif nextchar = getchar() class = findclass(nextchar) out = output[state][class] state = nextstate[state][class] exit loop if out == 0 endloop token[i] = '\e0' /* mark end of string */ /* out contains the tokencode. If it is ID, check for keywords by searching table of keywords. */ if out == ID then out = iskeyword( out, token ) else if out == SCHAR /* single character */ out = isspecialonechar( out, token ) endif return( out ) /* here's the token code */ } /* end of lex */ /* your keyword lookup routine and single character checking routines go here */ .DE .NH ASSIGNMENT 1 .PP Once you have your lexical part, test it out. As the first part of the project assignment, write a driver that calls your lex function. Feed it input from .ul /users/sohail/hwk/bun.vms. Your driver will be set up as follows .DS main() /* test of my lex */ { int tokecode; char token[81]; /* or whatever maximum you choose */ extern int lex(); /* because lex is on a separate file, you need this declaration */ while ( (tokencode = lex( token ) ) != FILEEND ) printf( "%3d: %s\en", tokencode, token ) } .DE Submit your driver routine, your lex files and the output as the first part of the assignment. .NH PARSER .PP Given the BNF grammar, you will write a .cu recursive descent parser that will parse the input language (s-dcl) and emit DCL. First let me describe how such a parser is developed; I'll come to code-generation later. .PP A few notes about terminolgy. The grammar consists of \fBproduction rules\fP, which essentially says that the left hand side is made up by gathering the stuff on the right hand side, e.g .DS statement : \fBif (\fP condition \fB)\fP statement .DE This is a grammar rule that says that a \fBstatement\fP is made up of \fBif\fP, followed by open-paren, followed by a condition, followed by a close-paren followed by a statement. The components of a grammar rule can be divided into two catagories: \fBterminals\fP and \fBnon-terminals\fP. Things that are atomic, e.g keywords like \fBif, else, while, next\fP etc are \fBterminals\fP because these cannot be broken down any further. Things like \fBstatement, program, condition\fP are called \fBnon-terminals\fP because they consists of further components, and this breakdown is indicated by a grammar rule. What the parser has to do is to gather up terminals by calling lex, and attempt to determine whether these terminals (tokens) can be grouped to match a grammar rule. The highest non-terminal that is ultimately matched is \fBprogram\fP. A recursive descent parser really begins with the highest level and works it way down. The actual coding works as follows: for each \fBnon-terminal\fP, you write a procedure. So in our case, you'll need procedures .ul program, .ul statement .ul condition .ul initialize .ul reinitialize and .ul other. These procedures will be invoked at various stages based upon the \fBterminals\fP that are seen. Let's assume that we write another function called .ul scan that calls lex. Scan will check to see if the token returned is a \fBCOMMENT, WSPACE\fP or \fBNEWLINE\fP. If so, it will call lex again until it returns a token that is not one of the above. Such a function comes in handy when your parser is looking for keywords or special characters. The function lex will still be available to the parser. Given this, here's how your parser will look like. I have not put in statements that will generate the actual code. Once you see the structure of the parser, you'll immediately realize how and where the actual code generation will take place. .DS #include #include "tcodes.h" /* Because a number of routines will want tokencode and token, make them global. */ int tokencode; char token[133]; scan() /* call lex and return a token that is not COMMENT, WSPACE or NEWLINE. */ { do { tokencode = lex( token ); } while ( tokencode == COMMENT || tokencode == WSPACE ||tokencode == NEWLINE ); } main() /* This is the the top level routine. Because the rotuines it needs are either ahead in the file, or somewhere else, it need to have extern declarations. */ { extern int statement(); /* keep going until EOF but first, get the first non-blank token */ scan(); while( tokencode != FILEEND ) statement(); } statement() /* based on keyword seen or not seen, invoke the appropriate routine to process a statement. */ { extern int ifstmt(), whilestmt(), dowhilestmt(); extern int forstmt(); extern int breakstmt(), nextstmt(), program(); extern other(); switch (tokencode) { case IF: ifstatement(); /* process rest of if-stmt */ break; case WHILE: whilestmt(); break; case DO; dowhilestmt(); break; case FOR: forstmt(); break; case BREAK; breakstmt(); /* this would be tricky */ break; case NEXT: nextstmt(); case OBRACE: program(); /* process { stmt stmt ... stmt } */ break; default: otherstmt(); break; } if (tokencode != FILEEND) return(true); /* still more to come */ return(false); /* no more statements */ } ifstmt() /* process an if stmt. Look for an else part too before returning */ { extern int condition(), statement(); scan(); /* get the next token */ if( tokencode == OPAREN ){ /* skip over OPAREN so conditon() gets the next token */ scan(); condition(); /* process the condition part */ } else /* missing "(", try anyway */ condition(); /* The tokencode should be CPAREN here. */ if( tokencode == CPAREN){ scan(); /* it was, skip it */ /* here's a recursive call */ statement(); } /* check for an else part */ if (tokencode == ELSE ){ scan(); /* get next token */ statement(); /* more recursion */ } return; } .DE .bp .NH Assignment 2 .PP It is easy to see that the parser is nothing more than the grammar rules coded in C. For the second part of the project I want you to write the recursive descent parser using the grammar previously specified in the handout. Don't worry about error checking or code generation. Pattern your functions similar to the examples in the previous handout. When any of your function recognizes a statement, have it print out a message indicating so. For example, right before the closing curly brace that marks the end of the "ifstmt()" function of the first handout, such a message can be produced with: .DS printf("Parsed an if statement\en"); .DE .PP There are a few consideration that you will have to keep in mind when you code you parser. .IP 1. When parsing a statement that belongs to the category \fBother\fP, the function you invoke will need to gather tokens by calling .ul scan or .ul lex until a \fBnewline\fP. We will have the s-dcl programs follow the syntax that a newline will mark the end of a statement. The C code may look like this: .DS other() /* Parse statement that is none of the special statements. Newline will mark the end. Note that when we will come to code generation, the blanks or tabs seen will be important. We can't just eat them up. Otherwise statements like copy 'p1' 'p2' which is legal DCL, will come out as copy'p1''p2' which is not legal DCL. */ { extern int lex(); while( tokencode != NEWLINE && tokencode != FILEEND ) tokencode = lex(token); /* skip over the NEWLINE that made us stop. */ scan(); printf("Parsed and other statement\en"); } .DE .IP 2. The function .ul program() is responsible for parsing compound statements. A compound statement according to the grammar is statements enclosed in balanced curly braces. This lends itself to the arrangement, as far as the function .ul program() is concerned, as follows: .DS program() /* Parse a compound statement. When this function is invoked the token is a {. Fire up a loop that terminates until the closing brace that matches the opening brace is seen. Notice that if the compound statement contains another compound statement, recursion will take care of it. */ { /* skip over the { */ scan(); while( tokencode != CBRACE ){ if( tokencode == FILEEND ){ printf("Fatal error. Premature EOF\en"); /* the c library function exit forces immediate abort of the program. */ exit(1); } /* otherwise, call stmt(). */ stmt(); } /* skip over the closing curly brace that stopped the while loop. */ scan(); } .DE .bp .NH Assignment 3\ --\ Code\ Generation .PP If your recursive descent parser can recognize the statements and classify them as an \fBif, while, for, other\fP etc, a few additions can now be made to generate code. Remember that VMS DCL does not have all the control structures your parser can recognize. Its job now is to take these control structures and by using \fBIF-THEN-GOTO\fP construct, which DCL does recognize, translate the s-dcl code into DCL code. In this assignment, you will implement the code generator. This also happens to be the last phase of the project. Turn in a listing of the entire program. This includes the lexical analyzer. Make sure that you organize the program in a modular fashion. C supports independent compilation; I expect that you would put this facility to use. The data for your parser is in "/users/sohail/hwk/bun.vms". Turn in the generated code. .NH Code Generation Rules .NH 1 \fBif\fP statement .PP The translation of .DS if ( condition ) statement .DE is essentially .DS if ( condition is not true ) go around statement .DE Thus when an \fBif\fP is encountered, we must .DS isolate the \fBcondition\fP part generate and save some unique label L output " \fBif( .not. (\fP condition )) \fBthen goto L\fP .DE In DCL, \fB.not.\fP inverts the truth value of the condition. When we get to the end of the statement that follows the \fBif\fP, there are two possibilities. If there is no \fBelse\fP following, we need output only .DS $ L: .DE If an \fBelse\fP follows, however, we must generate another label \fBL1\fP and output .DS $ goto L1 $ L: .DE to branch around the \fBelse\fP part, and then, after whatever statement follows the \fBelse\fP, .DS $ L1: .DE to terminate the \fBif-else\fP construction. In summary the code generation for .DS \fBif( \fP condition \fB)\fP statement .DE is .DS \fB$ if(.not.( \fPcondition \fB)) then goto L\fP $ statement \fB$ L:\fP .DE and for .DS \fBif( \fP condition \fB)\fP statement1 \fBelse\fP statement2 .DE is .DS \fB$ if(.not.( \fPcondition \fB)) then goto L\fP $ statement1 \fB$ goto L+1\fP \fB$ L:\fP $ statement2 \fB$ L+1:\fP .DE In a DCL command procedure, every statement must begin with a $ sign. Notice that this is opposite of what shell command procedures require. That is why every line we generate, we must output a $ as the first thing. If a line does not begin with a $ sign, DCL takes that to mean the this line is a .ul continuation of the previous line. You will run across cases where the generated code may turn out to be too long to fit on a 80-character line. It will better to break up the line into two or more lines. To indicate continuation, each line will need to end with minus sign "-" and the next line which contains the continuation must not begin with a $ sign. So the following will be taken as a single statement by DCL. .DS $ if( .not. (p1 .eqs. "" .and. - p2 .eqs. ""))- then goto 23002 .DE In DCL, labels are specified according the following regular expression .DS [a-zA-Z0-9_$][a-zA-Z0-9_$]*: .DE that is, a label can consist of a string of letters or digits or underscore or the dollar sign followed by a colon. It is the colon that differentiates a label from other entities. We need to generate unique labels when we do the code generation. To do so, the best choice is to use labels that are made up of integers, starting at 23000 or so. For a new label, all we need to do is increment the current value of the label and output this incremented value as a string of digits, appended with a colon. I will come to the details in a little while. First, let's finish the code generation scheme. .NH 1 \fBwhile\fP statement .PP The code for .DS \fBwhile( \fPcondition \fB)\fP statement .DE is .DS \fB$ L: if(.not.(\fP condition \fB)) then goto L+1\fP $ statement \fB$ goto L $ L+1:\fP .DE \fBL+1\fP means the value of the label here is \fBone plus\fP the value held by L. If the \fBwhile\fP contains a \fBnext\fP statement, it would merely goto L. L+1 will serve as the target for \fBbreak\fP statement. .NH 1 \fBdo while\fP statement .PP The code for .DS \fBdo\fP statement \fBwhile (\fP condition \fB)\fP .DE is .DS $ \fBL:\fP $ statement $ \fBL+1: if (\fP condition \fB) then goto L\fP $ \fBL+2:\fP .DE \fBL+1\fP will serve as the target for any \fBnext\fP statment. \fBL+2\fP will the target for any \fBbreak\fP statement. .NH 1 \fBfor\fP statement .PP The code for .DS \fBfor( \fPinitialize\fB;\fP condition\fB;\fP reinitialize\fB )\fP statement .DE is .DS $ initialize \fB$ L: if(.not.(\fP condition \fB)) then goto L+2\fP $ statement $ \fBL+1:\fP reinitialize $ \fBgoto L $ L+2:\fP .DE As with C, the intialize and reinitialize parts can only be a single statement. Compound statements are not allowed. The reason for generating L+1 is that if the statement part includes a \fBnext\fP statement, it would be translated into \fBgoto L+1\fP, i.e, L+1 is the target for any \fBnext\fP statement. A \fBbreak\fP statement will generate a \fBgoto L+2\fP. Note that the \fBfor\fP the initialize, condition and reinitialize parts are all optional. The following is an infinite loop. .DS for( ; ; ) stmt .DE In which case the generated code would simply be .DS $ L: $ stmt $ L+1: $ goto L $ L+2: .DE Notice that all the labels needed for a \fBfor\fP statment are still generated because the body of the loop could include \fBbreak\fP and \fBnext\fP statements. However, the initialize part, the "if(.not.(.." and the reinitialize portions are missing. A simple way to determine whether any of these parts are missing is as follows: after the open paren, get a non blank token. If it is semicolon, then don't emit any initialize statement. Then function .ul condition is normally invoked but before calling condition, check what the next non-blank token is. If it is another semicolon, donot call condition and neither generate the "if(.not.((" before condition nor generate the "then goto L+2" after the condition. .NH 1 Code Generation example .PP The following illustrates how the translation scheme will work. .DS /* this s-dcl program is an example */ if (p1 .eq. p2){ if (p1 .eqs. "greet") write sys$output "hello world" count = 20 while (count .gt. 0) write 'count' } else for (count = 30; count .gt. 0; count=count-1){ if (count .ge. 10) next /* goto reinit part */ write sys$output "Count ''count'" } .DE Here is what should be generated by your compiler. .DS $ if(.not.(p1.eq.p2)) then goto 23000 $ if(.not.(p1.eqs."greet")) then goto 23002 $ write sys$output "hello world" $ 23002: $ count = 20 $ 23003: if(.not.(count.gt.0)) then goto 23004 $ write 'count' $ goto 23003 $ 23004: $ goto 23001 $ 23000: $ count = 30 $ 23005: if(.not.(count.gt.0)) then goto 23007 $ if(.not.(count.ge.10)) then goto 23008 $ goto 23006 $ 23008: $ write sys$output "Count ''count'" $ 23006: count=count-1 $ goto 23005 $ 23007: $ 23001: .DE How you place the labels and the statements that follow is a matter of choice. Keep in mind that every line generated must begin with a $ sign and that the label must terminate with a colon. If the generated line is too long to fit in 80 columns, put a minus "-" at the end and continue on the next line of output \fBwithout a $\fP. .NH 1 \fBOther \fPstatement .PP If a s-dcl line does not belong to any of the above constructs, it must \fBother\fP, i.e, it holds no special meaning to your compiler. In this case all you need to do is emit a $ sign followed by the rest of the tokens found until the end of the line is seen. Few considerations you must keep in mind. Whitespaces are significant when you find a line that belongs to the \fBother\fP class. So call .ul lex and \fBnot\fP .ul scan to get the tokens until end of line. If you look at the example code I gave you in a previous handout for the function .ul other you'll notice that I call .ul lex. One liberty you may wish to take is that multiple blanks or tabs may be compressed into a single blank or tab. So before emitting the token string, you could give token to a function, say compress(token), that would compress token into a string containing a single blank or tab. If the token happens to a comment, do not emit it. .PP Sometimes, a need may arise where one may want to pass a line through your compiler .ul untouched. To do so, such a line will begin with a pound "#" sign. Surely this line of input will first be classified as \fBother\fP. In your function .ul other check to see if the tokencode happens to be POUND. If so, eat up the pound sign and emit the rest of without touching, i.e, no $ sign at the beginning, no blank of tab compression. Just call lex repeatedly and output whatever token is received until end of line. .NH Handling \fBnext\fP and \fBbreak\fP .PP When a \fBbreak\fP or a \fBnext\fP statement is seen, you need to generate the right goto statements that will transfer control to the correct point. The catch is: how do find out whether you are within the body of a loop (while, for) and how far deep? If for example, a while statement is nested within another while statement, then a \fBbreak\fP statement will cause exit from the inner while loop. The \fBbreak\fP statement will get translated into a goto statement. But where is this goto going to go? The point is that along with code generation, you will need to write code that remembers all the loops that were seen and the labels associated with the loops. Any time a \fBbreak\fP or a \fBnext\fP statement is seen, this code could be called upon to determine whether you are in a loop and what labels have been generated for the loop. A convenient way to do this is to use a .ul stack that stacks the loop type and the associated label. Here is how it may work. Every time you see a while or a for loop, stack it with the label that preceeds the condition "if(.not.(.." In the example above, such labels are 23003 and 23005. If while parsing the statement that forms the body of the loop you see a \fBnext\fP or \fBbreak\fP, peek at the top element of the stack. If there is one then look at the associated label. Let's call this label L. In case of a while loop, the next statement will imply that you generate "$ goto L" and break will generate "$ goto L+1". If the stack element is a for loop then next will generate "$ goto L+1" and break will generate "$ goto L+2". Notice that you will peek at the top element not pop it. When you parsed the body of the loop, pop the top element of the stack. .PP The function .ul whilestmt() could be coded as follows. .DS whilestmt() { extern int stmt(), push(), pop() extern int genlab(), errmsg(); extern int emitlabel(), emitstring(); int lab, looptype, clabel, junk; scan(); /* skip the key word */ lab = genlab(); /* get a unique label */ /* The target of the goto after the condition part will be label lab+1. Need to increment the next available label from genlab() such lab+1 is also used up. */ junk = genlab(); /* first char on the line has to be $ */ emitstring("$ "); /* emitlab will output "xxxxx: " where xxxxx is the value passed thru lab. */ emitlabel( lab ); emitstring(" if(.not.(" ); /* the condition part */ /* get the condition part within parenthesis. */ if (tokencode == OPAREN ) scan(); else errmsg("Missing ( in while condition" ); condition(); /* the token should a close paren at this point. */ if (tokencode == CPAREN ){ scan(); /* skip it */ else errmsg("Missing ) following condition in while"); .DE .DS /* emit "then goto lab+1" any way */ emitstring( "))then goto " ); emitlabel( lab+1 ); emitstring("\en"); /* newline to finish off line */ /* push a new entry onto the loopstack for this loop */ push( WHILE, lab ); /* and now parse the loop body */ stmt(); /* things at the bottom of the loop */ emitstring( "$ goto "); emitlabel( lab ); emitstring( "\en" ); emitstring("$ "); emitlabel ( lab+1 ); emitstring("\en"); /* pop the stack. Notice that the addresses are being passed so that pop() can return values through its parameters. */ pop( &looptype, &clabel ); /* values not used here */ } .DE .PP Develop a separate module that implements the stack routines. You would need .ul push, .ul pop, and .ul peek. There are a number of ways to define the data structure for the stack. However, I am going to ask you to set up the stack as singly-linked list. Here are the types and variable definitions that will do the job. .DS struct nodetype { int looptype; /* WHILE or FOR */ int label; /* the value of associated label */ struct nodetype *prev; /* pointer to the previous node */ }; /* keep stack private global. top=0 means empty stack */ static struct nodetype *top = 0; /* The following defines a macro the yields the size of structure "nodetype". */ #define NODESIZE sizeof( struct nodetype ) .DE The function push and pop can be coded as follows. .DS push( ltype, looplab ) int ltype, looplab; { struct nodetype *ptr; extern char *malloc(); /* just like pascal's NEW procedure, need to allocate storage for the node and have "ptr" point to it. Notice that malloc returns pointer to char. Need to coerce into a pointer to struct nodetype. */ ptr = (struct nodetype *)malloc( NODESIZE ); if( ptr == 0 ){ errmsg("Out of memory in stack push - abort"); exit(1); } /* store values and set up the links. */ ptr->looptype = ltype; ptr->label = looplab; ptr->prev = top; top = ptr; /* the new top elt */ } pop( pltype, plab ) int *pltype, *plab; { struct nodetype *ptr; if ( top == 0 ){ /* empty stack */ errmsg("Attempt to pop empty stack" ); return( 0 ); /* return false */ } *pltype = top->looptype; *plab = top->label; ptr = top; /* save current top for deallocation */ top = top->prev; /* the new top elt */ free( (char *)ptr ); /* free up memory */ return( 1 ); /* return true */ } .DE I am assuming that you can write the function "errmsg" which will print out its argument followed by newline. Make sure that you understand all the casting that is going on to convert pointers to the correct type. The function .ul malloc and .ul free are part of the standard C library. Malloc's argument is an .ul unsigned int which indicates the number of bytes you want allocated. It returns a char pointer to the first byte. If these bytes are going to used for other than storing characters (a character takes up a byte) you will have to coerce the pointer returned. The function .ul free takes a character pointer and frees the memory pointed. .NH 1 Continuation in s-dcl code .PP We should provide a facility in s-dcl such that source code lines can be continued on more than one line by using a continuation character. The scheme used in C pre-processor, for example, is that if a line ends with a backslash (\e), then the next line is taken to be part of the current line. Have your parser follow the same convention, i.e, ignore the newline character at the end of a line of input if the last token on the line is a backslash. The use of this feature might be something like this. .DS /* example of a long s-dcl line */ if( p1 .eq. p2 ){ mail/subject="news on ada compiler for vax"\e user3$:[engr.thesis.ada]news.txt\e opus::vaxa::clipper::sysmanager } .DE The source code line that begins with "mail" will be classified as \fBother\fP. The code to be generated is simply the source line copied out to output (with the $ of course) but because the token immediately before the newline is a backslash, your parser will simply retrieve the newline and discard it. .PP Please make the addition to your lex routine to return backslash as a separate token instead of claiming that it is just one of the single character token. .NH 1 Output Routines .PP You noticed that I used calls to routines .ul emitlabel and .ul emitstring to generate the output code. Having a set of routines to generate the actual output is always beneficial in that all the nitty gritty details of output formatting can be hidden away. The parser does not need to worry about things like whether a line of output has gone past 80th column, how an integer label will be put as a string of characters with a colon appended. Create a module with a set of output routines. The arrangement best suited for such a module will be to have the output lines collect character at a time in char array and a pointer that acts as an index into this array. Here is how the module may be coded. .DS /* OutputRoutines. This module exports routines that handle the actual text that is destined to be the output code. */ #define MAXCOL 80 #define MAXBUF MAXCOL+1 /* one more for NULL */ static char outbuf[MAXBUF] /* output lines collected here */ static int outp = (-1); /* last position filled in outbuf */ outdone() /* outdone flushes outbuf and resets outp to -1. It is called at the end of various statements, and is also called by outch to flush a line prior to starting a continuation. outdone is the only routine that actually produces output. */ { outbuf[++outp] = '\en'; outbuf[++outp] = '\e0'; printf("%s", outbuf ); outp = (-1); } outch(c) char c; /* outch enters characters into outbuf. It is responsible for handling continuation lines according to DCL conventions. If the character c happens to newline, outch calls outdone to flush the current line. */ { if( c == '\en' ) outdone(); else if( outp == MAXCOL-2 ){ /* 0..78 filled */ outbuf[++outp] = '-'; /* DCL continuation */ outdone(); /* start the next line without the $ sign. */ outbuf[++outp] = c; } else outbuf[++outp] = c; } emitstring(s) char *s; /* output the entire string s by repeated calls to outch. */ { char c; while( (c = *s++ ) ) outch(c); } emitlabel( lab ) int lab; /* convert integer label to a string of digits, append a colon and output the label. */ { char s[8]; /* 5 for label, 1 for :, 1 for blank and 1 for NULL */ /* call itoa to convert lab to string. We had this routine as an example in one of the lectures. It is listed on page 60 of the C book by K&R. */ iota( lab, s ); /* append a colon, blank, and then null terminate. */ s[5] = ':'; s[6] = ' '; s[7] = '\e0'; emitstring( s ); /* here goes the label */ } .DE .PP When emitting a quoted string, it may be desirable that the string is not broken across the output line. Write a special function .ul emitqstring that will be called to handle quoted strings only. It will check to see if the quoted string will fit on the current line of output. If not, it will append "-" to the outbuf array, flush it out through outdone, and put the quoted string afresh in the character buffer. .bp .PP Here is a listing of "bun.vms" that you will use as input to your program. You can invent your own input when you are testing your program. .DS /* Bun -- VMS DCL command procedure to bundle files into */ /* distribution package which can then be unbundled */ /* using UNIX shell. The output will be placed on the */ /* on the file given as the arg to this procedure */ if( p1 .eqs "" ){ write sys$output\e "Usage: bundle outfile (outfile will receive bundle)" exit /* DCL exit */ } /* if the file exists, open it, otherwise create it */ open/write/err=out_err fout 'p1' exist := "TRUE" out_err: if( exist .nes. "TRUE" ){ create 'p1' open/write/err=give_up fout 'p1' } q := "'" for( rc = 0; ; ){ /* no condition, no reinit */ inquire infile "File? " if( infile .eqs. "" ) break /* time to wrapup */ open/read/err=infile_err inf 'infile' write fout "echo ''infile' 1>&2" write fout "cat >''infile' <<''q'END OF ''infile'''q'" rc = rc + 2 done = 0 while( done .eq. 0 ){ read/end=eof inf line write fout line rc = rc + 1 } .DE .DS eof: close inf write fout "END OF ''infile'" rc = rc + 1 next /* come here if trouble opening 'infile' */ infile_err: write sys$output \e "error opening ''infile'" } if( rc .gt. 0 ){ write sys$output "''rc' records written to ''p1'" close fout } else write sys$output "0 records written out" exit .DE