CSC 304
PROGRAM TRANSLATORS
A program translator is a computer program that translates a program written in one programming language into a functionally equivalent program in a different computer language, without losing the functional or logical structure of the original code (“the essence” of the program).
These include translations between high-level, human-readable computer languages such as C++, Java and COBOL, intermediate-level languages such as Java bytecode, low-level languages such as assembly language and machine code, and between similar levels of language on different computing platforms, as well as from any of these to any other of these.
They also include translators between software implementations and hardware/ASIC microchip implementations
of the same program, and from software descriptions of a microchip to the logic gates needed to build it.
COMPILERS
A compiler is a computer program (or set of programs) that transforms source code written in a programming
language (the source language) into another computer language (the target language, often having a binary form
known as object code).
The most common reason for converting source code is to create an executable program.
The name “compiler” is primarily used for programs that translate source code from a higher-level programming
language to a lower-level language (e.g., assembly language or machine code).
If the compiled program can run on a computer whose CPU or operating system is different from the one on which
the compiler runs, the compiler is known as a cross-compiler. More generally, compilers are a specific type of
translators.
A program that translates from a low-level language to a higher level one is a decompiler.
A program that translates between high-level languages is usually called a source-to-source compiler or transpiler.
A language rewriter is usually a program that translates the form of expressions without a change of language.
The term compiler-compiler is sometimes used to refer to a parser generator, a tool often used to help create the
lexer and parser.
A compiler is likely to perform many or all of the following operations:
1. Lexical analysis
2. Preprocessing
3. Parsing
4. Semantic analysis (syntax-directed translation)
5. Code generation and code optimization
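As an illustration of how these phases fit together, here is a minimal sketch in Python that takes the expression 2 + 3 * 4 through lexical analysis, parsing, and code generation for an invented stack machine (the token set, tree shape and instruction names are assumptions made for this example, not any real compiler's):

```python
import re

TOKEN = re.compile(r"\s*(\d+|[+*])")

def lex(src):
    """Lexical analysis: break the source text into tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(tokens):
    """Parsing: build a tree that respects * over + precedence."""
    def term(i):
        node, i = int(tokens[i]), i + 1
        while i < len(tokens) and tokens[i] == "*":
            rhs, i = int(tokens[i + 1]), i + 2
            node = ("*", node, rhs)
        return node, i
    node, i = term(0)
    while i < len(tokens) and tokens[i] == "+":
        rhs, i = term(i + 1)
        node = ("+", node, rhs)
    return node

def codegen(node, out):
    """Code generation: emit instructions for an invented stack machine."""
    if isinstance(node, int):
        out.append(("PUSH", node))
    else:
        op, lhs, rhs = node
        codegen(lhs, out)
        codegen(rhs, out)
        out.append(("ADD",) if op == "+" else ("MUL",))
    return out

code = codegen(parse(lex("2 + 3 * 4")), [])
# → [("PUSH", 2), ("PUSH", 3), ("PUSH", 4), ("MUL",), ("ADD",)]
```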
Program faults caused by incorrect compiler behavior can be very difficult to track down and work around;
therefore, compiler implementors invest significant effort to ensure compiler correctness.
Before the development of FORTRAN, the first higher-level language, in the 1950s, machine-dependent assembly language was widely used.
While assembly language provides more abstraction than machine code on the same architecture, just as with machine code, it has to be modified or rewritten if the program is to be executed on a different computer hardware architecture.
With the advent of high-level programming languages that followed FORTRAN, such as COBOL, C and BASIC,
programmers could write machine-independent source programs. A compiler translates the high-level source
programs into target programs in machine languages for the specific hardware. Once the target program is
generated, the user can execute the program.
INTERPRETERS
In computer science, an interpreter is a computer program that directly executes, i.e., performs, instructions
written in a programming or scripting language, without previously compiling them into a machine language
program.
An interpreter is a program that reads in as input a source program, along with data for the program, and
translates the source program instruction by instruction.
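As a sketch of this instruction-by-instruction execution, the following Python fragment interprets a tiny invented three-instruction language (SET, ADD and PRINT are assumptions made for illustration, not a real language):

```python
def interpret(source, data):
    """Execute a toy program one instruction at a time."""
    env = dict(data)          # the program's input data becomes initial state
    output = []
    for line in source.splitlines():
        op, *args = line.split()
        if op == "SET":       # SET x 5  -> x = 5
            env[args[0]] = int(args[1])
        elif op == "ADD":     # ADD x y  -> x = x + y
            env[args[0]] += env[args[1]]
        elif op == "PRINT":   # PRINT x  -> emit the value of x
            output.append(env[args[0]])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output

program = "SET x 2\nADD x y\nPRINT x"
result = interpret(program, {"y": 40})   # → [42]
```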
EXAMPLE
The Java interpreter (the java command of the Java virtual machine) executes a .class file, the bytecode produced by the javac compiler, on the underlying machine.
The program VirtualPC interprets programs written for the Intel Pentium architecture (IBM-PC clone) on the PowerPC architecture (Macintosh). This enables Macintosh users to run Windows programs on their computers.
An interpreter generally uses one of the following strategies for program execution:
1. Parse the source code and perform its behavior directly.
2. Translate the source code into some efficient intermediate representation and immediately execute it.
3. Explicitly execute stored precompiled code made by a compiler which is part of the interpreter system.
APPLICATIONS
1. Interpreters are frequently used to execute command languages and glue languages, since each operation executed in a command language is usually an invocation of a complex routine such as an editor or compiler.
2. Self-modifying code can easily be implemented in an interpreted language. This relates to the origins of interpretation in Lisp and artificial intelligence research.
3. Virtualization. Machine code intended for one hardware architecture can be run on another using a virtual
machine, which is essentially an interpreter.
4. Sandboxing: an interpreter or virtual machine is not compelled to actually execute all the instructions of the source code it is processing. In particular, it can refuse to execute code that violates any security constraints it is operating under.
ADVANTAGES OF INTERPRETER
1. No separate compilation step is needed; changes to the source code can be run immediately.
2. Errors can be reported in terms of the source as each instruction is translated and executed, which makes debugging easier.
3. The same source program can run on any machine that has a suitable interpreter, aiding portability.
DISADVANTAGES OF INTERPRETER
1. Source code is required for the program to be executed, and this source code can be read, making it insecure.
2. Interpreters are generally slower than compiled programs due to the per-line translation method.
ASSEMBLERS
An assembler translates assembly language into machine code. An assembler is a program that creates object
code by translating combinations of mnemonics and syntax for operations and addressing modes into their
numerical equivalents.
Assembly language
It consists of mnemonics for machine opcodes, so assemblers perform a 1:1 translation from mnemonics to machine instructions.
An assembly language (or assembler language) is a low-level programming language for a computer, or
other programmable device, in which there is a very strong (generally one-to-one) correspondence
between the language and the architecture’s machine code instructions.
Each assembly language is specific to a particular computer architecture, in contrast to most high-level programming languages, which are generally portable across multiple architectures but require interpreting or compiling.
Assembly language is converted into executable machine code by a utility program referred to as an
assembler; the conversion process is referred to as assembly, or assembling the code.
For example, the x86 mnemonic MOV AL, 61h assembles to the single machine instruction B0 61 (load the hexadecimal value 61 into the AL register).
Conversely, one statement in a high-level language will translate to one or more instructions at machine level.
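The 1:1 mnemonic-to-opcode translation can be sketched as follows. This Python fragment handles only three instructions, using their real x86 encodings (MOV AL, imm8 = B0 ib; INT imm8 = CD ib; NOP = 90), and is a simplification rather than a real assembler:

```python
def assemble(lines):
    """Translate a few x86 mnemonics into their numerical equivalents."""
    code = bytearray()
    for line in lines:
        mnemonic, *operands = line.replace(",", " ").split()
        if mnemonic == "MOV" and operands[0] == "AL":
            code += bytes([0xB0, int(operands[1], 16)])   # MOV AL, imm8
        elif mnemonic == "INT":
            code += bytes([0xCD, int(operands[0], 16)])   # INT imm8
        elif mnemonic == "NOP":
            code.append(0x90)                             # NOP
        else:
            raise ValueError(f"unsupported mnemonic: {mnemonic}")
    return bytes(code)

machine_code = assemble(["MOV AL, 61", "NOP"])   # → b'\xb0\x61\x90'
```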
TYPES OF ASSEMBLERS
There are two types of assemblers based on how many passes through the source are needed to produce the
executable program.
1. One-pass assemblers go through the source code once. Any symbol used before it is defined will require “errata” at the end of the object code (or, at least, no earlier than the point where the symbol is defined) telling the linker or the loader to “go back” and overwrite a placeholder which had been left where the as-yet undefined symbol was used.
2. Multi-pass assemblers create a table with all symbols and their values in the first passes; then use the table in
later passes to generate code.
In both cases, the assembler must be able to determine the size of each instruction on the initial passes in order to
calculate the addresses of subsequent symbols.
This means that if the size of an operation referring to an operand defined later depends on the type or distance
of the operand, the assembler will make a pessimistic estimate when first encountering the operation, and if
necessary pad it with one or more “no operation” instructions in a later pass or the errata. In an assembler with
peephole optimization, addresses may be recalculated between passes to allow replacing pessimistic code with
code tailored to the exact distance from the target.
The original reason for the use of one-pass assemblers was speed of assembly – often a second pass would require
rewinding and rereading a tape or rereading a deck of cards.
With modern computers this has ceased to be an issue. The advantage of the multi-pass assembler is that the
absence of errata makes the linking process (or the program load if the assembler directly produces executable
code) faster.
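The multi-pass strategy can be sketched as follows; the fixed two-byte instruction size and the JMP/label syntax are invented for illustration:

```python
def assemble_two_pass(lines):
    """Pass 1 records label addresses; pass 2 resolves forward references."""
    symbols, address = {}, 0
    # Pass 1: compute each label's address (every instruction is 2 bytes here).
    for line in lines:
        if line.endswith(":"):
            symbols[line[:-1]] = address
        else:
            address += 2
    # Pass 2: emit code, looking symbols up in the table built by pass 1.
    code = []
    for line in lines:
        if line.endswith(":"):
            continue
        op, *args = line.split()
        if op == "JMP":
            code.append(("JMP", symbols[args[0]]))   # forward reference resolved
        else:
            code.append((op,))
    return code

prog = ["JMP end", "NOP", "end:", "NOP"]
code = assemble_two_pass(prog)   # the JMP resolves to address 4
```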
APPLICATIONS OF ASSEMBLERS
1. Assembly language is typically used in a system’s boot code, the low-level code that initializes and tests the system hardware prior to booting the operating system, and is often stored in ROM (the BIOS on IBM-compatible PC systems and CP/M are examples).
2. Some compilers translate high-level languages into assembly first before fully compiling, allowing the
assembly code to be viewed for debugging and optimization purposes.
3. Relatively low-level languages, such as C, allow the programmer to embed assembly language directly in
the source code. Programs using such facilities, such as the Linux kernel, can then construct abstractions
using different assembly language on each hardware platform. The system’s portable code can then use
these processor-specific components through a uniform interface.
4. Assembly language is useful in reverse engineering. Many programs are distributed only in machine code form, which is straightforward to translate into assembly language but more difficult to translate into a higher-level language. Tools such as the Interactive Disassembler (IDA) make extensive use of disassembly for such a purpose.
5. Assemblers can be used to generate blocks of data, with no higher-level language overhead, from
formatted and commented source code, to be used by other code.
ADVANTAGES OF ASSEMBLER
1. Gives the programmer direct, fine-grained control over the hardware.
2. Assembled code is compact and executes quickly, with no run-time translation overhead.
STRUCTURE OF COMPILER
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler verifies code
syntax, generates efficient object code, performs run-time organization, and formats the output according to
assembler and linker conventions.
PHASES OF COMPILATION
1. Line reconstruction:
Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser.
The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase.
Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a line reconstruction phase.
2. Lexical analysis
It breaks the source code text into small pieces called tokens. Each token is a single atomic unit of
the language, for instance a keyword, identifier or symbol name.
The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it.
This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical
analyzer or scanner.
This may not be a separate step – it can be combined with the parsing step in scannerless parsing,
in which case parsing is done at the character level, not the token level.
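A scanner of this kind can be sketched with regular expressions, which Python’s re module compiles into a matching automaton (the token names and token set here are invented for illustration):

```python
import re

# Token syntax given as regular expressions, one named group per token kind.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Break source text into (kind, lexeme) tokens, dropping whitespace."""
    for m in SCANNER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(tokenize("count = count + 1"))
```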
3. Preprocessing
Some languages, e.g., C, require a preprocessing phase which supports macro substitution and
conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic
analysis: e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic
forms. However, some languages such as Scheme support macro substitutions based on syntactic
forms.
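A minimal sketch of token-level macro substitution, loosely in the spirit of the C preprocessor (the macro table and source fragment are invented, and a real preprocessor handles far more, e.g. conditional compilation and function-like macros):

```python
def preprocess(tokens, macros):
    """Replace macro names by their definitions, expanding recursively."""
    out = []
    for tok in tokens:
        if tok in macros:
            # A macro body may itself mention another macro, so recurse.
            out.extend(preprocess(macros[tok], macros))
        else:
            out.append(tok)
    return out

macros = {"MAX": ["100"], "LIMIT": ["MAX", "-", "1"]}
expanded = preprocess(["x", "<", "LIMIT"], macros)
# → ["x", "<", "100", "-", "1"]
```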
4. Syntax analysis
It involves parsing the token sequence to identify the syntactic structure of the program.
This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree
structure built according to the rules of a formal grammar which define the language’s syntax.
The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
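As a sketch, a recursive-descent parser for the toy grammar expr -> term ('+' term)*, term -> NUMBER | '(' expr ')' can build such a tree from a token sequence (the grammar and the tuple-based tree shape are assumptions for this example):

```python
def parse(tokens):
    """Build a nested-tuple parse tree by recursive descent."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())
        return node
    def term():
        nonlocal pos
        if peek() == "(":
            pos += 1                    # consume "("
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1                    # consume ")"
            return node
        node = ("num", tokens[pos])
        pos += 1
        return node
    return expr()

tree = parse(["(", "1", "+", "2", ")", "+", "3"])
```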
5. Semantic analysis
It is the phase in which the compiler adds semantic information to the parse tree and builds the
symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect
programs or issuing warnings.
Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows
the parsing phase, and logically precedes the code generation phase, though it is often possible to
fold multiple phases into one pass over the code in a compiler implementation.
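A minimal sketch of such a semantic check: walk a parse tree, consult a symbol table, and reject type errors (the node shapes and the two-type system are invented for this example):

```python
def type_of(node, symbols):
    """Infer the type of an expression node, raising on type errors."""
    kind = node[0]
    if kind == "lit":                    # ("lit", 3) or ("lit", "hi")
        return "int" if isinstance(node[1], int) else "str"
    if kind == "var":                    # ("var", "x") -> look up its binding
        return symbols[node[1]]
    if kind == "+":                      # both operands must have the same type
        left = type_of(node[1], symbols)
        right = type_of(node[2], symbols)
        if left != right:
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node kind: {kind}")

symbols = {"x": "int"}
t = type_of(("+", ("var", "x"), ("lit", 1)), symbols)   # → "int"
```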
Machine-dependent optimization improves the target code’s utilization of the hardware, for example by figuring out how to keep parallel execution units busy and by filling delay slots.
Although most algorithms for optimization are NP-hard, heuristic techniques are well-developed.
The main phases of the back end include the following:
1. Analysis: This is the gathering of program information from the intermediate representation derived from the input; data-flow analysis is used to build use-define chains, together with dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase.
2. Optimization: The intermediate language representation is transformed into functionally equivalent but
faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination, constant
propagation, loop transformation, register allocation and even automatic parallelization.
3. Code generation: The transformed intermediate language is translated into the output language, usually
the native machine language of the system. This involves resource and storage decisions, such as deciding
which variables to fit into registers and memory and the selection and scheduling of appropriate machine
instructions along with their associated addressing modes (see also Sethi-Ullman algorithm). Debug data
may also need to be generated to facilitate debugging.
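Two of the optimizations named above, constant propagation/folding and dead code elimination, can be sketched on an invented three-address IR (the (dest, op, a, b) instruction format and the live_out set are assumptions for this example):

```python
def optimize(ir, live_out):
    """Constant-fold additions, then drop instructions whose result is unused."""
    consts, folded = {}, []
    for dest, op, a, b in ir:
        a, b = consts.get(a, a), consts.get(b, b)    # propagate known constants
        if op == "add" and isinstance(a, int) and isinstance(b, int):
            consts[dest] = a + b                     # fold at compile time
            folded.append((dest, "const", a + b, None))
        else:
            folded.append((dest, op, a, b))
    # Dead-code elimination: sweep backwards, keeping only needed results.
    needed, out = set(live_out), []
    for dest, op, a, b in reversed(folded):
        if dest in needed:
            out.append((dest, op, a, b))
            needed.update(x for x in (a, b) if isinstance(x, str))
    return list(reversed(out))

ir = [("t1", "add", 2, 3),      # t1 = 2 + 3  -> folded to the constant 5
      ("t2", "add", "t1", 4),   # t2 = t1 + 4 -> folded to the constant 9
      ("t3", "add", "x", 1)]    # t3 is never used -> eliminated
result = optimize(ir, live_out={"t2"})
# → [("t2", "const", 9, None)]
```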
Compiler analysis is the prerequisite for any compiler optimization and they tightly work together. For example,
dependence analysis is crucial for loop transformation.
TYPES OF COMPILERS
1. Native compilers, which produce machine code for the same platform on which they run.
2. Cross-compilers, which produce code for a platform other than the one on which the compiler itself runs.
3. Source-to-source compilers (transpilers), which translate between high-level languages.
ADVANTAGES OF COMPILER
1. Source code is not included, therefore compiled code is more secure than interpreted code.
2. Tends to produce faster code than interpreting source code.
3. Produces an executable file, and therefore the program can be run without need of the source code.
DISADVANTAGES OF COMPILER
1. Object code needs to be produced before a final executable file; this can be a slow process.
2. The source code must be free of compile-time errors before an executable file can be produced.