Introduction of Lexical Analysis

Last Updated : 27 Jan, 2025

Lexical analysis, also known as scanning, is the first phase of a compiler. It involves reading the source program character by character, from left to right, and organizing the characters into tokens: meaningful sequences of characters. A programming language usually has only a small number of token classes, including constants (such as integers, doubles, characters, and strings), operators (arithmetic, relational, and logical), punctuation marks, and reserved keywords.

Lexical Analysis
  • The lexical analyzer takes a source program as input, and produces a stream of tokens as output.

What is a Token?

A lexical token is a sequence of characters that can be treated as a single unit in the grammar of a programming language.

Categories of Tokens

  • Keywords: In C programming, keywords are reserved words with specific meanings used to define the language's structure, such as if, else, for, and void. They cannot be used as variable names or identifiers; doing so causes compilation errors. C has a total of 32 keywords.
  • Identifiers: Identifiers in C are names for variables, functions, arrays, or other user-defined items. They must start with a letter or an underscore (_) and can include letters, digits, and underscores. C is case-sensitive, so uppercase and lowercase letters are distinct. Identifiers cannot be the same as keywords like if, else, or for.
  • Constants: Constants, also known as literals, are fixed values that cannot change during a program's execution. In C, constants include integers, floating-point numbers, characters, and strings.
  • Operators: Operators are symbols in C that perform actions on variables or other data items, called operands.
  • Special Symbols: Special symbols in C are punctuation tokens used for specific purposes, such as separating code elements or delimiting constructs. Examples include ; (semicolon) to end statements, , (comma) to separate values, {} (curly braces) for code blocks, and [] (square brackets) for arrays. These symbols play a crucial role in the program's structure and syntax.

Read more about Tokens.

What is a Lexeme?

A lexeme is the actual string of characters that matches a pattern and generates a token.
For example: "float", "abs_zero_Kelvin", "=", "-", "273", ";".

Lexemes and Tokens Representation

Lexemes and tokens for the statement while (a >= b) a = a - 2;:

| Lexeme | Token      | Lexeme | Token      |
|--------|------------|--------|------------|
| while  | WHILE      | a      | IDENTIFIER |
| (      | LPAREN     | =      | ASSIGNMENT |
| a      | IDENTIFIER | a      | IDENTIFIER |
| >=     | COMPARISON | -      | ARITHMETIC |
| b      | IDENTIFIER | 2      | INTEGER    |
| )      | RPAREN     | ;      | SEMICOLON  |

How Does a Lexical Analyzer Work?

Tokens in a programming language can be described using regular expressions. A scanner, or lexical analyzer, uses a Deterministic Finite Automaton (DFA) to recognize these tokens, as DFAs are designed to identify regular languages. Each final state of the DFA corresponds to a specific token type, allowing the scanner to classify the input. The process of creating a DFA from regular expressions can be automated, making it easier to handle token recognition efficiently.

Read more about Working of Lexical Analyzer in Compiler.

The lexical analyzer also detects errors with the help of the automaton and the grammar of the given language (such as C or C++), and reports the row number and column number of the error.

Suppose we pass the statement a = b + c; through the lexical analyzer.

It will generate a token sequence like id = id + id ;, where each id refers to its variable's entry in the symbol table, which holds all of its details. For example, consider the program:

int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

Above are the valid tokens. You can observe that comments are omitted, since the lexical analyzer discards them rather than turning them into tokens. As another example, a printf statement such as printf("GFG!"); contains 5 valid tokens: printf, (, "GFG!", ), and ;.

Exercise 1: Count number of tokens:

int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}
Answer: Total number of token: 27.

Exercise 2: Count number of tokens:

int max(int i);

  • The lexical analyzer first reads int, finds it to be a valid keyword, and accepts it as a token.
  • max is read next and, on reading (, is recognized as a valid function name.
  • int is also a token, then i is another token, and finally ;.

Answer: Total number of tokens is 7: int, max, (, int, i, ), ;

Advantages

  • Simplifies Parsing: Breaking down the source code into tokens makes it easier for computers to understand and work with the code. This helps programs like compilers or interpreters to figure out what the code is supposed to do. It's like breaking down a big puzzle into smaller pieces, which makes it easier to put together and solve.
  • Error Detection: Lexical analysis detects lexical errors, such as misspelled keywords or invalid symbols, early in the compilation process. Identifying errors sooner rather than later improves the overall efficiency of the compiler or interpreter.
  • Efficiency: Once the source code is converted into tokens, subsequent phases of compilation or interpretation can operate more efficiently. Parsing and semantic analysis become faster and more streamlined when working with tokenized input.

Disadvantages

  • Limited Context: Lexical analysis operates based on individual tokens and does not consider the overall context of the code. This can sometimes lead to ambiguity or misinterpretation of the code's intended meaning especially in languages with complex syntax or semantics.
  • Overhead: Although lexical analysis is necessary for the compilation or interpretation process, it adds an extra layer of overhead. Tokenizing the source code requires additional computational resources which can impact the overall performance of the compiler or interpreter.
  • Debugging Challenges: Lexical errors detected during the analysis phase may not always provide clear indications of their origins in the original source code. Debugging such errors can be challenging especially if they result from subtle mistakes in the lexical analysis process.

For Previous Year Questions on Lexical Analysis, refer to https://www.geeksforgeeks.org/lexical-analysis-gq/.

