SOEN 380

Python Programming Language

Python is a widely used programming language that offers several unique features and
advantages compared to languages like Java and C++. Python is a general-purpose, dynamically
typed, high-level, garbage-collected language that is compiled to bytecode and then interpreted.
It is object-oriented, and it also supports procedural and functional programming.

Python Environment Setup Using the Python 3.9 Executable and an IDE
(PyCharm, Anaconda, JupyterLab, Jupyter Notebook, Spyder, Atom, Thonny, Eclipse)

Python Features

Python provides many useful features that make it popular and distinguish it from other
programming languages. It supports object-oriented and procedural programming approaches
and provides dynamic memory allocation. A few essential features are listed below.

1) Easy to Learn and Use

Python is easy to learn compared to other programming languages. Its syntax is
straightforward and reads much like English. There are no semicolons or curly brackets;
indentation defines the code blocks. It is the recommended programming language for beginners.

2) Expressive Language

Python can perform complex tasks using only a few lines of code. For example, the hello world
program is simply print("Hello World"), a single line, while Java or C requires multiple lines.

3) Interpreted Language

Python is an interpreted language, which means a Python program is executed one statement at a
time. Being interpreted makes debugging easier and the code portable.

4) Cross-platform Language

Python runs equally well on different platforms such as Windows, Linux, UNIX, and macOS,
so we can say that Python is a portable language. It enables programmers to develop software
for several competing platforms by writing a program only once.

5) Free and Open Source

Python is freely available to everyone from its official website, www.python.org. It has a large
community across the world that is dedicated to creating new Python modules and functions, and
anyone can contribute to it. Open source means anyone can download the source code without
paying a penny.

6) Object-Oriented Language

Python supports object-oriented programming, so the concepts of classes and objects are
available. It supports inheritance, polymorphism, encapsulation, etc. The object-oriented
approach helps the programmer write reusable code and develop applications with less code.

7) Extensible

This implies that code written in other languages such as C/C++ can be compiled and then used
from our Python code. Python itself converts a program into bytecode, and any platform with a
Python interpreter can run that bytecode.

8) Large Standard Library

Python provides a vast range of libraries for various fields such as machine learning, web
development, and scripting. There are several machine learning libraries, such as
TensorFlow, Pandas, NumPy, Keras, and PyTorch. Django, Flask, and Pyramid are popular
frameworks for Python web development.

9) GUI Programming Support

A Graphical User Interface (GUI) is used for developing desktop applications. PyQt5, Tkinter,
and Kivy are libraries used for developing such applications.

10) Integrated

Python can be easily integrated with languages like C, C++, and Java. Because Python executes
code line by line, it is easy to debug.

11. Embeddable

Code written in other programming languages can be used within Python source code, and
Python source code can likewise be embedded in programs written in other languages.

12. Dynamic Memory Allocation

In Python, we don't need to specify the data type of a variable. When we assign a value to
a variable, the memory is allocated automatically at run time. For example, to assign the
integer value 15 to x, we don't need to write int x = 15; we just write x = 15.

Java vs. Python
Python is an excellent choice for rapid development and scripting tasks, whereas Java
emphasizes a strong static type system and object-oriented programming.

Printing 'Hello World!'

Python Code:

print("Hello World!")

Java Code:

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}

While both programs give the same output, we can notice the syntax difference in the print
statement.

o In Python, it is easy to learn and write code, while Java requires more code to
perform certain tasks.
o Python is dynamically typed, meaning we do not need to declare variable types, whereas
Java is statically typed, meaning we must declare the variable type.
o Python is suitable for various domains such as Data Science, Machine Learning, Web
development, and more, whereas Java is suitable for web development, mobile app
development (Android), and more.
The Python Software Foundation (PSF) was established in 2001 to promote, protect, and
advance the Python programming language and its community.

Reasons for Learning Python Language


Python provides many useful features to the programmer. These features make it one of the most
popular and widely used languages. We have listed below a few essential reasons for learning Python.

o Easy to use and Learn: Python has a simple and easy-to-understand syntax, unlike
traditional languages like C, C++, Java, etc., making it easy for beginners to learn.
o Expressive Language: It allows programmers to express complex concepts in just a few
lines of code or reduces Developer's Time.
o Interpreted Language: Python does not require compilation, allowing rapid
development and testing. It uses Interpreter instead of Compiler.
o Object-Oriented Language: It supports object-oriented programming, making writing
reusable and modular code easy.
o Open-Source Language: Python is open-source and free to use, distribute and modify.
o Extensible: Python can be extended with modules written in C, C++, or other languages.

o Large Standard Library: Python's standard library contains many modules and
functions that can be used for various tasks, such as string manipulation, web
programming, and more.
o GUI Programming Support: Python provides several GUI frameworks, such
as Tkinter and PyQt, allowing developers to create desktop applications easily.
o Integrated: Python can easily integrate with other languages and technologies, such as
C/C++, Java, and .NET.
o Embeddable: Python code can be embedded into other applications as a scripting
language.
o Dynamic Memory Allocation: Python automatically manages memory allocation,
making it easier for developers to write complex programs without worrying about
memory management.
o Wide Range of Libraries and Frameworks: Python has a vast collection of libraries
and frameworks, such as NumPy, Pandas, Django, and Flask, that can be used to solve a
wide range of problems.
o Versatility: Python is a universal language in various domains such as web
development, machine learning, data analysis, scientific computing, and more.
o Large Community: Python has a vast and active community of developers contributing
to its development and offering support. This makes it easy for beginners to get help and
learn from experienced developers.
o Career Opportunities: Python is a highly popular language in the job market. Learning
Python can open up several career opportunities in data science, artificial intelligence,
web development, and more.
o High Demand: With the growing demand for automation and digital transformation, the
need for Python developers is rising. Many industries seek skilled Python developers to
help build their digital infrastructure.
o Increased Productivity: Python has a simple syntax and powerful libraries that can help
developers write code faster and more efficiently. This can increase productivity and save
time for developers and organizations.
o Big Data and Machine Learning: Python has become the go-to language for big data
and machine learning. Python has become popular among data scientists and machine
learning engineers with libraries like NumPy, Pandas, Scikit-learn, TensorFlow, and
more.
o Use of Python in academics: Python is now treated as a core programming language in schools
and colleges because of its countless uses in Artificial Intelligence, Deep Learning, Data Science,
etc. It has become such a fundamental part of the development world that schools and colleges
cannot afford not to teach it. This in turn produces more Python developers and programmers,
further expanding its growth and popularity.
o Automation: Python can help a lot in the automation of tasks, as there are many tools and
modules available that make things much more comfortable. One can reach an advanced level of
automation easily using only basic Python code. Python is also an excellent performance booster
in the automation of software testing; it is remarkable how little time and how few lines of code
are required to write automation tools.

Application Areas of Python
Python is a general-purpose, popular programming language, and it is used in almost every
technical field. The various areas of Python use are given below.

o Data Science: Data Science is a vast field, and Python is an important language for this
field because of its simplicity, ease of use, and availability of powerful data analysis and
visualization libraries like NumPy, Pandas, and Matplotlib.
o Desktop Applications: PyQt and Tkinter are useful libraries for building GUI
(Graphical User Interface) based desktop applications. Other languages may be better suited
to this field, but Python can also be combined with them to build such applications.
o Console-based Applications: Python is also commonly used to create command-line or
console-based applications because of its ease of use and support for advanced features
such as input/output redirection and piping.
o Mobile Applications: While Python is not commonly used for creating mobile
applications, it can still be combined with frameworks like Kivy or BeeWare to create
cross-platform mobile applications.
o Software Development: Python is considered one of the best languages for building
software. It works well for projects of any size, from small scripts to large-scale
applications.
o Artificial Intelligence: AI is an emerging Technology, and Python is a perfect language
for artificial intelligence and machine learning because of the availability of powerful
libraries such as TensorFlow, Keras, and PyTorch.
o Web Applications: Python is commonly used for the backend of web development with
frameworks like Django and Flask, alongside front-end technologies such as JavaScript,
HTML, and CSS.
o Enterprise Applications: Python can be used to develop large-scale enterprise
applications with features such as distributed computing, networking, and parallel
processing.
o 3D CAD Applications: Python can be used for 3D computer-aided design (CAD)
applications through libraries such as Blender.
o Machine Learning: Python is widely used for machine learning due to its simplicity,
ease of use, and availability of powerful machine learning libraries.
o Computer Vision or Image Processing Applications: Python can be used for computer
vision and image processing applications through powerful libraries such as OpenCV and
Scikit-image.
o Speech Recognition: Python can be used for speech recognition applications through
libraries such as SpeechRecognition and PyAudio.
o Scientific computing: Libraries like NumPy, SciPy, and Pandas provide advanced
numerical computing capabilities for tasks like data analysis, machine learning, and
more.
o Education: Python's easy-to-learn syntax and availability of many resources make it an
ideal language for teaching programming to beginners.
o Testing: Python is used for writing automated tests, providing frameworks like unittest
and pytest that help write test cases and generate reports.

o Gaming: Python has libraries like Pygame, which provide a platform for developing
games using Python.
o IoT: Python is used in IoT for developing scripts and applications for devices
like Raspberry Pi, Arduino, and others.
o Networking: Python is used in networking for developing scripts and applications for
network automation, monitoring, and management.
o DevOps: Python is widely used in DevOps for automation and scripting of infrastructure
management, configuration management, and deployment processes.
o Finance: Python has libraries like Pandas, Scikit-learn, and Statsmodels for financial
modeling and analysis.
o Audio and Music: Python has libraries like PyAudio, which is used for audio processing,
synthesis, and analysis, and Music21, which is used for music analysis and generation.
o Writing scripts: Python is used for writing utility scripts to automate tasks like file
operations, web scraping, and data processing.

Python Popular Frameworks and Libraries


Python has a wide range of libraries and frameworks widely used in various fields such as
machine learning, artificial intelligence, and web applications. Some popular frameworks and
libraries of Python are listed below.

o Web development (Server-side) - Django, Flask, Pyramid, CherryPy
o GUIs based applications - Tkinter, PyGTK, PyQt, PyJS, etc.
o Machine Learning - TensorFlow, PyTorch, Scikit-learn, Matplotlib, Scipy, etc.
o Mathematics - NumPy, Pandas, etc.
o BeautifulSoup: a library for web scraping and parsing HTML and XML
o Requests: a library for making HTTP requests
o SQLAlchemy: a library for working with SQL databases
o Kivy: a framework for building multi-touch applications
o Pygame: a library for game development
o Pytest: a testing framework for Python
o Django REST framework: a toolkit for building RESTful APIs
o FastAPI: a modern, fast web framework for building APIs
o Streamlit: a library for building interactive web apps for machine learning and data
science
o NLTK: a library for natural language processing

Python print() Function
Python print() function is used to display output to the console or terminal. It allows us to display
text, variables and other data in a human readable format.

Syntax:

print(object(s), sep=separator, end=end, file=file, flush=flush)
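
As a brief illustration of the parameters above, here is a minimal sketch (the names and values are arbitrary):

Example:

name = "Ada"
score = 95
# sep is inserted between the objects, end replaces the default newline
print("Name:", name, sep=" ", end=" | ")
print("Score:", score)

Output:

Name: Ada | Score: 95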

Python Conditional Statements


Conditional statements help us execute a particular block of code when a particular condition is
met. Python provides the if and else keywords to set up logical conditions, and the elif keyword
for checking additional conditions.

Example code for if..else statement

x = 10
y = 5

if x > y:
    print("x is greater than y")
else:
    print("y is greater than or equal to x")

Python Loops
Sometimes we may need to alter the flow of the program. The execution of a specific code may
need to be repeated several times. For this purpose, the programming languages provide various
loops capable of repeating some specific code several times.

Python For Loop

fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x, end=" ")

Python While Loop

i = 1
while i < 5:
    print(i, end=" ")
    i += 1

Python Data Structures
Lists

o Lists are ordered collections of data elements of possibly different data types.
o Lists are mutable, meaning a list can be modified at any time.
o Elements can be accessed using indices.
o They are defined using square brackets '[]'.
Example:

# Create a list
fruits = ['apple', 'banana', 'cherry']
print("fruits[1] =", fruits[1])

# Modify list
fruits.append('orange')
print("fruits =", fruits)

num_list = [1, 2, 3, 4, 5]
# Calculate sum
sum_nums = sum(num_list)
print("sum_nums =", sum_nums)
Output:

fruits[1] = banana
fruits = ['apple', 'banana', 'cherry', 'orange']
sum_nums = 15

Tuples

o Tuples are also ordered collections of data elements of possibly different data types, similar to
lists.
o Elements can be accessed using indices.
o Tuples are immutable, meaning tuples can't be modified once created.
o They are defined using parentheses '()'.
Example:

# Create a tuple
point = (3, 4)
x, y = point
print("(x, y) =", x, y)

# Create another tuple
tuple_ = ('apple', 'banana', 'cherry', 'orange')
print("Tuple =", tuple_)

Output:

(x, y) = 3 4
Tuple = ('apple', 'banana', 'cherry', 'orange')

Sets

o Sets are unordered collections of unique, immutable (hashable) data elements of different data types.
o The set itself is mutable: elements can be added or removed.
o Elements can't be accessed using indices.
o Sets do not contain duplicate elements.
o They are defined using curly braces '{}'.

# Create a set
set1 = {1, 2, 2, 1, 3, 4}
print("set1 =", set1)

# Create another set
set2 = {'apple', 'banana', 'cherry', 'apple', 'orange'}
print("set2 =", set2)
Output:

set1 = {1, 2, 3, 4}
set2 = {'apple', 'cherry', 'orange', 'banana'}

Dictionaries

o Dictionaries store key-value pairs that allow you to associate values with unique keys.
o They are defined using curly braces '{}' with key-value pairs separated by colons ':'.
o Dictionaries are mutable.
o Elements can be accessed using keys.
Example:

# Create a dictionary
person = {'name': 'Umesh', 'age': 25, 'city': 'Noida'}
print("person =", person)
print(person['name'])

# Modify Dictionary
person['age'] = 27
print("person =", person)
Output:

person = {'name': 'Umesh', 'age': 25, 'city': 'Noida'}
Umesh
person = {'name': 'Umesh', 'age': 27, 'city': 'Noida'}

Python Functional Programming


In this section we describe some important tools related to functional programming in Python,
such as lambda and recursive functions, which are very efficient for accomplishing complex
tasks. We also define a few important functions, such as reduce, map, and filter; a short sketch
tying these together follows the list below. Python provides the functools module, which
includes various functional programming tools.

Recent versions of Python have introduced features that make functional programming more
concise and expressive. For example, the "walrus operator" := allows for inline variable
assignment in expressions, which can be useful when working with nested function calls or
comprehensions; a small sketch follows.
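
A minimal sketch of the walrus operator (the data and threshold are arbitrary):

data = [4, 11, 25, 3]
# := assigns sq and uses it inside the comprehension's condition
big_squares = [sq for n in data if (sq := n ** 2) > 50]
print(big_squares)

Output:

[121, 625]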

Python Function

1. Lambda Function - A lambda function is a small, anonymous function that can take
any number of arguments but can only have one expression. Lambda functions are often
used in functional programming to create functions "on the fly" without defining a named
function.
2. Recursive Function - A recursive function is a function that calls itself to solve a
problem. Recursive functions are often used in functional programming to perform
complex computations or to traverse complex data structures.
3. Map Function - The map() function applies a given function to each item of an iterable
and returns a new iterable with the results. The input iterable can be a list, tuple, or other.
4. Filter Function - The filter() function returns an iterator from an iterable for which the
function passed as the first argument returns True. It filters out the items from an iterable
that do not meet the given condition.
5. Reduce Function - The reduce() function applies a function of two arguments
cumulatively to the items of an iterable, from left to right, reducing it to a single value (see the sketch after this list).
6. functools Module - The functools module in Python provides higher-order functions that
operate on other functions, such as partial() and reduce().
7. Currying Function - A currying function is a function that takes multiple arguments and
returns a sequence of functions that each take a single argument.
8. Memoization Function - Memoization is a technique used in functional programming to
cache the results of expensive function calls and return the cached Result when the same
inputs occur again.
9. Threading Function - Threading is a technique used in functional programming to run
multiple tasks simultaneously to make the code more efficient and faster.
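
Here is a minimal sketch combining the lambda, map, filter, and reduce tools described above (the numbers are arbitrary):

Example:

from functools import reduce

nums = [1, 2, 3, 4, 5]

squares = list(map(lambda n: n ** 2, nums))        # [1, 4, 9, 16, 25]
evens = list(filter(lambda n: n % 2 == 0, nums))   # [2, 4]
total = reduce(lambda a, b: a + b, nums)           # 15

print(squares, evens, total)

Output:

[1, 4, 9, 16, 25] [2, 4] 15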

Python Modules
Python modules are program files that contain Python code or functions. Python has two types of
modules: user-defined modules and built-in modules. A module defined by the user, i.e., Python
code saved with the .py extension, is treated as a user-defined module.

Built-in modules are predefined modules of Python. To use the functionality of a module, we
need to import it into our current program.

Python modules are essential to the language's ecosystem since they offer reusable code and
functionality that can be imported into any Python program. Here are a few examples of several
Python modules, along with a brief description of each:

Math: Gives users access to mathematical functions and constants such as pi, along with trigonometric functions.

Datetime: Provides classes for a simpler way of manipulating dates, times, and periods.

OS: Enables interaction with the base operating system, including administration of processes
and file system activities.

Random: Offers tools for generating random integers and picking random items from a list.

JSON: JSON is a data format that can be encoded and decoded and is frequently used in
online APIs and data exchange. This module allows working with JSON.

Re: Supports regular expressions, a powerful text-search and text-manipulation tool.

Collections: Provides alternative data structures such as ordered dictionaries, default dictionaries,
and named tuples.

NumPy: NumPy is a core toolkit for scientific computing that supports numerical operations on
arrays and matrices.

Pandas: It provides high-level data structures and operations for dealing with time series and
other structured data types.

Requests: Offers a simple user interface for web APIs and performs HTTP requests.
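
As a brief illustration of importing and using a few of the built-in modules listed above (the values are arbitrary):

Example:

import math
import random
import json
from datetime import date

print(math.pi)                          # 3.141592653589793
print(date(2024, 1, 15).isoformat())    # 2024-01-15
print(random.randint(1, 6))             # a random integer between 1 and 6
print(json.dumps({"name": "Ada"}))      # {"name": "Ada"}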

Python File I/O - Read and Write Files
In Python, the io module provides support for three types of I/O: raw binary files, buffered
binary files, and text files. The canonical way to create a file object is by using the
open() function.

Any file operation can be performed in the following three steps:

1. Open the file to get a file object using the built-in open() function. There are different access
modes which you can specify while opening a file using the open() function.
2. Perform read, write, or append operations using the file object returned by the open()
function.
3. Close and dispose of the file object (a with-statement sketch that does this automatically follows these steps).
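
The with statement handles step 3 automatically by closing the file when the block ends, even if an exception occurs. A minimal sketch (the file name is just an example):

Example:

# A raw string (r'...') avoids any backslash escape issues in the Windows path
with open(r'C:\myfile.txt') as f:
    contents = f.read()
print(contents)
# The file is already closed here; no explicit f.close() is needed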

Reading File

File object includes the following methods to read data from the file.

 read(chars): reads the specified number of characters starting from the current position.
 readline(): reads the characters starting from the current reading position up to a newline
character.
 readlines(): reads all lines until the end of file and returns a list object.

The following C:\myfile.txt file will be used in all the examples of reading and writing files.

C:\myfile.txt
This is the first line.
This is the second line.
This is the third line.

The following example performs the read operation using the read(chars) method.

Example: Reading a File


>>> f = open('C:\myfile.txt') # opening a file
>>> lines = f.read() # reading a file
>>> lines
'This is the first line. \nThis is the second line.\nThis is the third line.'
>>> f.close() # closing file object

Above, f = open('C:\myfile.txt') opens the myfile.txt in the default read mode from the
current directory and returns a file object. f.read() function reads all the content until EOF as a
string. If you specify the char size argument in the read(chars) method, then it will read that
many chars only. f.close() will flush and close the stream.

Reading a Line

The following example demonstrates reading a line from the file.

Example: Reading Lines


>>> f = open('C:\myfile.txt') # opening a file
>>> line1 = f.readline() # reading a line
>>> line1
'This is the first line. \n'
>>> line2 = f.readline() # reading a line
>>> line2
'This is the second line.\n'
>>> line3 = f.readline() # reading a line
>>> line3
'This is the third line.'
>>> line4 = f.readline() # reading a line
>>> line4
''
>>> f.close() # closing file object

As you can see, we have to open the file in 'r' mode. The readline() method will return the
first line, and then will point to the second line in the file.

Reading All Lines

The following reads all lines using the readlines() function.

Example: Reading a File


>>> f = open('C:\myfile.txt') # opening a file
>>> lines = f.readlines() # reading all lines
>>> lines
['This is the first line. \n', 'This is the second line.\n', 'This is the third line.']
>>> f.close() # closing file object

The file object has an inbuilt iterator. The following program reads the given file line by line
until StopIteration is raised, i.e., the EOF is reached.

Example: File Iterator


f = open('C:\myfile.txt')
while True:
    try:
        line = next(f)
        print(line)
    except StopIteration:
        break
f.close()

Use the for loop to read a file easily.

Example: Read File using the For Loop
f = open('C:\myfile.txt')
for line in f:
    print(line)
f.close()
Output
This is the first line.
This is the second line.
This is the third line.

Writing to a File

The file object provides the following methods to write to a file.

 write(s): Write the string s to the stream and return the number of characters written.
 writelines(lines): Write a list of lines to the stream. Each line must have a separator at the end of
it.

Create a new File and Write

The following creates a new file if it does not exist or overwrites an existing file.

Example: Create or Overwrite to Existing File


>>> f = open('C:\myfile.txt','w')
>>> f.write("Hello") # writing to file
5
>>> f.close()

# reading file
>>> f = open('C:\myfile.txt','r')
>>> f.read()
'Hello'
>>> f.close()

In the above example, the f = open('C:\myfile.txt', 'w') statement opens myfile.txt in write
mode; the open() method returns the file object and assigns it to the variable f. 'w' specifies that
the file should be writable. Next, f.write("Hello") overwrites the existing content of the
myfile.txt file and returns the number of characters written to the file, which is 5 in the above
example. In the end, f.close() closes the file object.

Appending to an Existing File

The following appends the content at the end of the existing file by passing 'a' or 'a+' mode in
the open() method.

Example: Append to Existing File

>>> f = open('C:\myfile.txt','a')
>>> f.write(" World!")
7
>>> f.close()

# reading file
>>> f = open('C:\myfile.txt','r')
>>> f.read()
'Hello World!'
>>> f.close()
Write Multiple Lines

Python provides the writelines() method to save the contents of a list object in a file. Since
the newline character is not automatically written to the file, it must be provided as a part of the
string.

Example: Write Lines to File


>>> lines=["Hello world.\n", "Welcome to TutorialsTeacher.\n"]
>>> f=open("D:\myfile.txt", "w")
>>> f.writelines(lines)
>>> f.close()

Opening a file with "w" mode or "a" mode can only be written into and cannot be read from.
Similarly "r" mode allows reading only and not writing. In order to perform simultaneous
read/append operations, use "a+" mode.

Writing to a Binary File

The open() function opens a file in text format by default. To open a file in binary format, add
'b' to the mode parameter. Hence the "rb" mode opens the file in binary format for reading,
while the "wb" mode opens the file in binary format for writing. Unlike text files, binary files are
not human-readable. When opened using any text editor, the data is unrecognizable.

The following code stores a list of numbers in a binary file. The list is first converted into a byte
array before writing. The built-in function bytearray() returns a byte representation of the object.

Example: Write to a Binary File


f=open("binfile.bin","wb")
num=[5, 10, 15, 20, 25]
arr=bytearray(num)
f.write(arr)
f.close()
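
To verify the write, the file can be read back in binary mode; a minimal sketch continuing the example above:

Example: Read from a Binary File

f = open("binfile.bin", "rb")
data = f.read()
print(list(data))    # [5, 10, 15, 20, 25]
f.close()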

Exception Handling in Python
The cause of an exception is often external to the program itself. For example, an incorrect input,
a malfunctioning IO device etc. Because the program abruptly terminates on encountering an
exception, it may cause damage to system resources, such as files. Hence, the exceptions should
be properly handled so that an abrupt termination of the program is prevented.

Python uses try and except keywords to handle exceptions. Both keywords are followed by
indented blocks.

Syntax:
try:
    # statements in try block
except:
    # executed when error in try block

The try: block contains one or more statements which are likely to encounter an exception. If the
statements in this block are executed without an exception, the subsequent except: block is
skipped.

If the exception does occur, the program flow is transferred to the except: block. The statements
in the except: block are meant to handle the cause of the exception appropriately. For example,
returning an appropriate error message.

You can specify the type of exception after the except keyword. The subsequent block will be
executed only if the specified exception occurs. There may be multiple except clauses with
different exception types in a single try block. If the type of exception doesn't match any of the
except blocks, it will remain unhandled and the program will terminate.

The rest of the statements after the except block will continue to be executed, regardless of
whether the exception is encountered or not.

The following example will throw an exception when we try to divide an integer by a string.

Example: try...except blocks


try:
    a = 5
    b = '0'
    print(a/b)
except:
    print('Some error occurred.')

print("Out of try except blocks.")

Output
Some error occurred.

Out of try except blocks.

You can mention a specific type of exception in front of the except keyword. The subsequent
block will be executed only if the specified exception occurs. There may be multiple except
clauses with different exception types in a single try block. If the type of exception doesn't match
any of the except blocks, it will remain unhandled and the program will terminate.

Example: Catch Specific Error Type


try:
    a = 5
    b = '0'
    print(a + b)
except TypeError:
    print('Unsupported operation')

print("Out of try except blocks")

Output
Unsupported operation

Out of try except blocks

As mentioned above, a single try block may have multiple except blocks. The following example
uses two except blocks to process two different exception types:

Example: Multiple except Blocks


try:
    a = 5
    b = 0
    print(a/b)
except TypeError:
    print('Unsupported operation')
except ZeroDivisionError:
    print('Division by zero not allowed')

print('Out of try except blocks')

Output
Division by zero not allowed

Out of try except blocks

However, if the variable b is set to '0' (a string), a TypeError will be encountered and processed
by the corresponding except block.

else and finally


In Python, keywords else and finally can also be used along with the try and except clauses.
While the except block is executed if the exception occurs inside the try block, the else block
gets processed if the try block is found to be exception free.

Syntax:
try:
    # statements in try block
except:
    # executed when error in try block
else:
    # executed if try block is error-free
finally:
    # executed irrespective of whether an exception occurred or not

The finally block consists of statements which should be processed regardless of an exception
occurring in the try block or not. As a consequence, the error-free try block skips the except
clause and enters the finally block before going on to execute the rest of the code. If, however,
there's an exception in the try block, the appropriate except block will be processed, and the
statements in the finally block will be processed before proceeding to the rest of the code.

The example below accepts two numbers from the user and performs their division. It
demonstrates the uses of else and finally blocks.

Example: try, except, else, finally blocks


try:
    print('try block')
    x = int(input('Enter a number: '))
    y = int(input('Enter another number: '))
    z = x/y
except ZeroDivisionError:
    print("except ZeroDivisionError block")
    print("Division by 0 not accepted")
else:
    print("else block")
    print("Division = ", z)
finally:
    print("finally block")
    x = 0
    y = 0

print("Out of try, except, else and finally blocks.")

The first run is a normal case. The output of the else and finally blocks is displayed because the
try block is error-free.

Output
try block

Enter a number: 10

Enter another number: 2

else block

Division = 5.0

finally block

Out of try, except, else and finally blocks.

The second run is a case of division by zero, hence, the except block and the finally block are
executed, but the else block is not executed.

Output
try block

Enter a number: 10

Enter another number: 0

except ZeroDivisionError block

Division by 0 not accepted

finally block

Out of try, except, else and finally blocks.

In the third run case, an uncaught exception occurs. The finally block is still executed, but the
program terminates and does not execute the code after the finally block.

Output
try block

Enter a number: 10

Enter another number: xyz

finally block

Traceback (most recent call last):

File "C:\python36\codes\test.py", line 3, in <module>

y=int(input('Enter another number: '))

ValueError: invalid literal for int() with base 10: 'xyz'

Typically, the finally clause is the ideal place for cleanup operations in a process, for example,
closing a file irrespective of errors in read/write operations. This is dealt with further in the
File I/O sections.

Raise an Exception

Python also provides the raise keyword to be used in the context of exception handling. It
causes an exception to be generated explicitly. Built-in errors are raised implicitly. However, a
built-in or custom exception can be forced during execution.

The following code accepts a number from the user. The try block raises a ValueError exception
if the number is outside the allowed range.

Example: Raise an Exception


try:
    x = int(input('Enter a number upto 100: '))
    if x > 100:
        raise ValueError(x)
except ValueError:
    print(x, "is out of allowed range")
else:
    print(x, "is within the allowed range")

Output
Enter a number upto 100: 200

200 is out of allowed range

Enter a number upto 100: 50

50 is within the allowed range

Here, the raised exception is of type ValueError. However, you can define your own custom
exception type to be raised.

Python File I/O


Files are used to store data on a computer disk. Here, we explain Python's built-in file object.
We can open a file using a Python script and perform various operations such as writing, reading,
and appending. There are various ways of opening a file, explained with relevant examples. We
will also learn to perform read/write operations on binary files.

Python's file input/output (I/O) system allows programs to communicate with files stored on a
disk. Python's built-in methods for the file object let us carry out actions like reading, writing,
and appending data to files.

The open() method in Python makes a file object when working with files. The name of the file
to be opened and the mode in which the file is to be opened are the two parameters required by
this function. The mode is chosen according to the work that needs to be done with the file, such
as "r" for reading, "w" for writing, or "a" for appending.

After successfully creating a file object, different methods can be used according to our needs. If
we want to write to the file, we can use the write() function; if we want to add to the end of an
existing file, we open it in append mode ("a"); and if we only want to read the content of the file,
we can use the read() function. Python can also work with binary files, which contain data in a
binary rather than a text format and are not directly human-readable. The "rb" and "wb" modes
read and write binary data in binary files.

Python Exceptions
An exception can be defined as an unusual condition in a program that results in an interruption
in the flow of the program.

Whenever an exception occurs, the program stops execution, and the remaining code is not
executed. An exception is therefore a run-time error that the Python script cannot handle on its
own; it is represented as a Python object that describes the error.

Python Exceptions are an important aspect of error handling in Python programming. When a
program encounters an unexpected situation or error, it may raise an exception, which can
interrupt the normal flow of the program.

In Python, exceptions are represented as objects containing information about the error,
including its type and message. The most common type of Exception in Python is the Exception
class, a base class for all other built-in exceptions.

To handle exceptions in Python, we use the try and except statements. The try statement is used
to enclose the code that may raise an exception, while the except statement is used to define a
block of code that should be executed when an exception occurs.

For example, consider the following code:

try:
    x = int(input("Enter a number: "))
    y = 10 / x
    print("Result:", y)
except ZeroDivisionError:
    print("Error: Division by zero")
except ValueError:
    print("Error: Invalid input")
Output:

Enter a number: 0
Error: Division by zero
In this code, we use the try statement to read a number and perform a division. If either of these
operations raises an exception, the matching except block is executed.

Python also provides many built-in exceptions that can be raised in similar situations. Some
common built-in exceptions include IndexError, TypeError, and NameError. Also, we can
define our custom exceptions by creating a new class that inherits from the Exception class.
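
A minimal sketch of such a custom exception (the class name and message are illustrative):

class InvalidAgeError(Exception):
    """Raised when an age value is outside the accepted range."""
    pass

try:
    age = -5
    if age < 0:
        raise InvalidAgeError("age cannot be negative")
except InvalidAgeError as e:
    print("Error:", e)

Output:

Error: age cannot be negative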

Python CSV
CSV stands for "comma-separated values", a simple file format that uses specific structuring to
arrange tabular data. It stores tabular data, such as spreadsheets or databases, in plain text and is
a common format for data interchange. A CSV file opens in a spreadsheet program such as
Excel, with the rows and columns defining the standard format.

We can use the csv.reader() function to read a CSV file. This function returns a reader object that
we can use to iterate over the rows in the CSV file. Each row is returned as a list of values, where
each value corresponds to a column in the CSV file.

For example, consider the following code:

import csv

with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Here, we open the file data.csv in read mode and create a csv.reader object using
the csv.reader() function. We then iterate over the rows in the CSV file using a for loop and
print each row to the console.

We can use the csv.writer() function to write data to a CSV file. It returns a writer object that we
can use to write rows to the CSV file by calling its writerow() method.

For example, consider the following code:

import csv

data = [['Name', 'Age', 'Country'],
        ['Alice', '25', 'USA'],
        ['Bob', '30', 'Canada'],
        ['Charlie', '35', 'Australia']]

# newline='' is recommended by the csv module docs to avoid extra blank rows
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for row in data:
        writer.writerow(row)
In this program, we create a list of lists called data, where each inner list represents a row of data.
We then open the file data.csv in write mode and create a csv.writer object using the
csv.writer() function. We then iterate over the rows in data using a for loop and write each row to
the CSV file using the writerow() method.

Python Sending Mail


We can send or read mail using a Python script. Python's standard library modules are useful
for handling protocols such as POP3 and IMAP. Python provides the smtplib module for
sending emails using SMTP (Simple Mail Transfer Protocol) through a mail provider's SMTP
server.
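
A minimal sketch using smtplib and email.message from the standard library; the server address, port, credentials, and addresses below are placeholders, not real values:

Example:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Test mail'
msg['From'] = 'sender@example.com'
msg['To'] = 'receiver@example.com'
msg.set_content('Hello from Python!')

# Placeholder SMTP server and credentials; replace with your provider's details
with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()    # upgrade the connection to TLS
    server.login('sender@example.com', 'app-password')
    server.send_message(msg)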

Python Magic Methods

A Python magic method is a special method that adds "magic" to a class. Its name starts and ends
with double underscores, for example, __init__ or __str__.

The built-in classes define many magic methods. The dir() function can be used to see the magic
methods inherited by a class. Each has two leading and two trailing underscores in the method
name.

o Python magic methods are also known as dunder methods, short for "double
underscore" methods because their names start and end with a double underscore.
o Magic methods are automatically invoked by the Python interpreter in certain situations,
such as when an object is created, compared to another object, or printed.
o Magic methods can be used to customize the behavior of classes, such as defining how
objects are compared, converted to strings, or accessed as containers.
o Some commonly used magic methods include __init__ for initializing an object, __str__ for
converting an object to a string, __eq__ for comparing two objects for equality,
and __getitem__ and __setitem__ for accessing items in a container object.
For example, the __str__ magic method can define how an object should be represented as a string.
Here's an example

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"{self.name} ({self.age})"

person = Person('Vikas', 22)
print(person)
Output:

Vikas (22)
In this example, the __str__ method is defined to return a formatted string representation of the
Person object with the person's name and age.

Another commonly used magic method is __eq__, which defines how objects should be compared
for equality. Here's an example:
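
A minimal sketch reusing the Person class above; comparing on name and age is an assumption chosen for illustration:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Two Person objects are considered equal if name and age match
        return self.name == other.name and self.age == other.age

print(Person('Vikas', 22) == Person('Vikas', 22))   # True
print(Person('Vikas', 22) == Person('Asha', 30))    # False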

Python Oops Concepts


Everything in Python is treated as an object, including integer values, floats, functions, classes,
and None. Apart from that, Python supports all the major object-oriented concepts. Below is a
brief introduction to the OOP concepts of Python.

o Classes and Objects - Python classes are the blueprints of the Object. An object is a
collection of data and methods that act on the data.

o Inheritance - An inheritance is a technique where one class inherits the properties of
other classes.
o Constructor - Python provides a special method __init__() which is known as a
constructor. This method is automatically called when an object is instantiated.
o Data Member - A variable that holds data associated with a class and its objects.
o Polymorphism - Polymorphism is a concept where an object can take many forms. In
Python, polymorphism can be achieved through method overloading and method
overriding.
o Method Overloading - In Python, method overloading is achieved through default
arguments, where a method can be defined with multiple parameters. The default values
are used if some parameters are not passed while calling the method.
o Method Overriding - Method overriding is a concept where a subclass implements a
method already defined in its superclass.
o Encapsulation - Encapsulation is wrapping data and methods into a single unit. In
Python, encapsulation is achieved through access modifiers, such as public, private, and
protected. However, Python does not strictly enforce access modifiers, and the naming
convention indicates the access level.
o Data Abstraction: A technique to hide the complexity of data and show only essential
features to the user. It provides an interface to interact with the data. Data abstraction
reduces complexity and makes code more modular, allowing developers to focus on the
program's essential features.

You need the following OOPs knowledge in detail

o Python Oops Concepts - In Python, the object-oriented paradigm means designing the program
using classes and objects. An object corresponds to a real-world entity such as a book, house, or
pencil, and its class defines its properties and behaviours (see the sketch after this list).
o Python Objects and classes - In Python, objects are instances of classes, and classes are
blueprints that define the structure and behaviour of data.
o Python Constructor - A constructor is a special method in a class that is used to initialize
the object's attributes when the object is created.
o Python Inheritance - Inheritance is a mechanism in which a new class (subclass or child
class) inherits the properties and behaviours of an existing class (superclass or parent
class).
o Python Polymorphism - Polymorphism allows objects of different classes to be treated as
objects of a common superclass, enabling different classes to be used interchangeably
through a common interface.
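
A minimal sketch tying these concepts together (the class names and attributes are illustrative): a constructor, inheritance, and method overriding demonstrating polymorphism.

class Animal:
    def __init__(self, name):          # constructor
        self.name = name               # data member

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):                     # inheritance: Dog is a subclass of Animal
    def speak(self):                   # method overriding
        return f"{self.name} barks"

# Polymorphism: both objects are used through the common speak() interface
for animal in [Animal("Generic"), Dog("Rex")]:
    print(animal.speak())

Output:

Generic makes a sound
Rex barks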

Python Advanced Topics
Python includes many advanced and useful concepts that help the programmer solve complex
tasks. These concepts are given below.

Python Iterator

An iterator is simply an object that can be iterated upon and returns one object at a time. It can be
implemented using the two special methods __iter__() and __next__().

Iterators in Python are objects that allow iteration over a collection of data. They process each
collection element individually without loading the entire collection into memory.

For example, let's create an iterator that returns the squares of numbers up to a given limit:

class Squares:
    def __init__(self, limit):
        self.limit = limit
        self.n = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= self.limit:
            square = self.n ** 2
            self.n += 1
            return square
        else:
            raise StopIteration

numbers = Squares(5)
for n in numbers:
    print(n)
Output:

0
1
4
9
16
25

In this example, we have created a class Squares that acts as an iterator by implementing the
__iter__() and __next__() methods. The __iter__() method returns the object itself, and the
__next__() method returns the next square until the limit is reached.

Python Generators

Python generators are functions that return iterators and produce a sequence of values using a
yield statement rather than a return statement. A generator suspends the function's execution
while keeping its local state, and picks up right where it left off when it is resumed. Because we
don't have to implement the iterator protocol ourselves, this feature makes writing iterators much
simpler. Here is an illustration of a straightforward generator function that produces squares of
numbers:

# Generator Function
def square_numbers(n):
    for i in range(n):
        yield i**2

# Create a generator object
generator = square_numbers(5)

# Print the values generated by the generator
for num in generator:
    print(num)
Output:

0
1
4
9
16

Python Decorators
Python decorators are functions used to modify the behaviour of another function. They allow
adding functionality to an existing function without modifying its code directly. Decorators are
applied using the @ symbol followed by the name of the decorator function. They can be used
for logging, timing, caching, etc.

Here's an example of a decorator function that adds timing functionality to another function:

import time
from math import factorial

# Decorator to calculate time taken by
# the function
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end-start:.5f} seconds to run.")
        return result
    return wrapper

@time_it
def my_function(n):
    time.sleep(2)
    print(f"Factorial of {n} = {factorial(n)}")

my_function(25)
Output:

In the above example, the time_it decorator function takes another function as an argument and
returns a wrapper function. The wrapper function calculates the time to execute the original
function and prints it to the console. The @time_it decorator is used to apply the time_it function
to the my_function function. When my_function is called, the decorator is executed, and the
timing functionality is added.

Introduction to Natural Language
Processing
You can perform text analysis using a Python library called the Natural Language Toolkit
(NLTK). Before proceeding to the concepts of NLTK, let us understand the relation between
text analysis and web scraping.

Analyzing the words in a text can tell us which words are important, which words are unusual,
and how words are grouped. This analysis eases the task of web scraping.

Getting started with NLTK

The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially
for identifying and tagging parts of speech found in text of a natural language like English.

Installing NLTK

You can use the following command to install NLTK in Python −

pip install nltk

If you are using Anaconda, then a conda package for NLTK can be built by using the following
command −

conda install -c anaconda nltk

Downloading NLTK’s Data

After installing NLTK, we have to download preset text repositories. But before downloading
text preset repositories, we need to import NLTK with the help of import command as follows −

import nltk

Now, with the help of following command NLTK data can be downloaded −

nltk.download()

Installation of all available packages of NLTK will take some time, but it is always
recommended to install all the packages.

Installing Other Necessary packages

We also need some other Python packages, such as gensim and pattern, for doing text analysis as
well as building natural language processing applications by using NLTK.

gensim − A robust semantic modeling library which is useful for many applications. It can be
installed by the following command −

pip install gensim

pattern − Used to make gensim package work properly. It can be installed by the following
command −

pip install pattern


Tokenization

The process of breaking the given text into smaller units called tokens is known as
tokenization. These tokens can be words, numbers, or punctuation marks. It is also called
word segmentation.

Example

NLTK module provides different packages for tokenization. We can use these packages as per
our requirement. Some of the packages are described here −

sent_tokenize package − This package will divide the input text into sentences. You can use the
following command to import this package −

from nltk.tokenize import sent_tokenize

word_tokenize package − This package will divide the input text into words. You can use the
following command to import this package −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package − This package will divide the input text into words, treating the
punctuation marks as separate tokens. You can use the following command to import this package −

from nltk.tokenize import WordPunctTokenizer
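
A minimal sketch of sentence and word tokenization (it assumes the NLTK 'punkt' tokenizer data has already been downloaded with nltk.download()) −

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Python is easy. NLTK makes text analysis simple."
print(sent_tokenize(text))   # ['Python is easy.', 'NLTK makes text analysis simple.']
print(word_tokenize(text))   # ['Python', 'is', 'easy', '.', 'NLTK', 'makes', 'text', 'analysis', 'simple', '.']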


Stemming

In any language, there are different forms of a word. A language includes many variations for
grammatical reasons. For example, consider the words democracy, democratic, and
democratization. For machine learning as well as for web scraping projects, it is important for
machines to understand that these different words have the same base form. Hence, it can be
useful to extract the base forms of the words while analyzing the text.

This can be achieved by stemming, which may be defined as the heuristic process of extracting
the base forms of words by chopping off their endings.

NLTK module provides different packages for stemming. We can use these packages as per our
requirement. Some of these packages are described here −

PorterStemmer package − Porter's algorithm is used by this Python stemming package to
extract the base form. You can use the following command to import this package −

from nltk.stem.porter import PorterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be
the word ‘write’ after stemming.

LancasterStemmer package − Lancaster's algorithm is used by this Python stemming package
to extract the base form. You can use the following command to import this package −

from nltk.stem.lancaster import LancasterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer then the output would
be the word ‘writ’ after stemming.

SnowballStemmer package − Snowball's algorithm is used by this Python stemming package
to extract the base form. You can use the following command to import this package −

from nltk.stem.snowball import SnowballStemmer

For example, after giving the word ‘writing’ as the input to this stemmer then the output would
be the word ‘write’ after stemming.
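
A minimal sketch comparing the three stemmers on the same word (the outputs match the descriptions above) −

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = 'writing'
print(PorterStemmer().stem(word))              # write
print(LancasterStemmer().stem(word))           # writ
print(SnowballStemmer('english').stem(word))   # write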

Lemmatization

Another way to extract the base form of words is lemmatization, which normally aims to remove
inflectional endings by using vocabulary and morphological analysis. The base form of any word
after lemmatization is called its lemma.

NLTK module provides following packages for lemmatization −

WordNetLemmatizer package − It will extract the base form of the word depending upon
whether it is used as a noun or as a verb. You can use the following command to import this package
from nltk.stem import WordNetLemmatizer
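
A minimal sketch (it assumes the NLTK 'wordnet' data has been downloaded; pos='v' tells the lemmatizer to treat the word as a verb) −

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('writing', pos='v'))   # write
print(lemmatizer.lemmatize('words'))              # word (noun is the default part of speech)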


Chunking

Chunking, which means dividing the data into small chunks, is one of the important processes in
natural language processing; it is used to identify parts of speech and short phrases like noun
phrases. Chunking labels groups of tokens, and with the help of the chunking process we can get
the structure of the sentence.

Example

In this example, we are going to implement Noun-Phrase chunking by using NLTK Python
module. NP chunking is a category of chunking which will find the noun phrases chunks in the
sentence.

Steps for implementing noun phrase chunking

We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition

In the first step we will define the grammar for chunking. It would consist of the rules which we
need to follow.

Step 2 − Chunk parser creation

Now, we will create a chunk parser. It would parse the grammar and give the output.

Step 3 − The Output

In this last step, the output would be produced in a tree format.

First, we need to import the NLTK package as follows −

import nltk

Next, we need to define the sentence. Here DT: the determinant, VBP: the verb, JJ: the adjective,
IN: the preposition and NN: the noun.

sentence = [("a", "DT"),("clever","JJ"),("fox","NN"),("was","VBP"),


("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]

Next, we are giving the grammar in the form of regular expression.

grammar = "NP:{<DT>?<JJ>*<NN>}"

Now, next line of code will define a parser for parsing the grammar.

parser_chunking = nltk.RegexpParser(grammar)

Now, the parser will parse the sentence.

parser_chunking.parse(sentence)

Next, we store the output in a variable.

output = parser_chunking.parse(sentence)

With the help of the following code, we can draw the output in the form of a tree.

output.draw()

Bag of Word (BoW) Model Extracting and converting the Text into
Numeric Form

The Bag of Words (BoW) model, a useful model in natural language processing, is basically used
to extract features from text. After extracting the features from the text, they can be used in
machine learning algorithms, because raw text cannot be used directly in ML applications.

Working of BoW Model

Initially, the model extracts a vocabulary from all the words in the document. Later, using a
document-term matrix, it builds a model. In this way, the BoW model represents the document
as a bag of words only, and the order or structure is discarded.

Example

Suppose we have the following two sentences −

Sentence1 − This is an example of Bag of Words model.

Sentence2 − We can extract features by using Bag of Words model.

Now, by considering these two sentences, we have the following 14 distinct words −

 This

 is
 an
 example
 bag
 of
 words
 model
 we
 can
 extract
 features
 by
 using

Building a Bag of Words Model in Python

Let us look into the following Python script, which builds a BoW model using scikit-learn's CountVectorizer.

First, import the following package −

from sklearn.feature_extraction.text import CountVectorizer

Next, define the set of sentences −

Sentences = ['This is an example of Bag of Words model.',
             'We can extract features by using Bag of Words model.']
vector_count = CountVectorizer()
features_text = vector_count.fit_transform(Sentences).todense()
print(vector_count.vocabulary_)

Output

It shows that we have 14 distinct words in the above two sentences −

{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using':11
}
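
To inspect the document-term matrix alongside the vocabulary, the fitted vectorizer can also be queried; get_feature_names_out assumes scikit-learn 1.0 or newer (older releases use get_feature_names).

print(vector_count.get_feature_names_out())   # the 14 vocabulary words, in column order
print(features_text)                          # one row per sentence, one column of counts per word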
Topic Modeling: Identifying Patterns in Text Data

Generally, documents are grouped into topics, and topic modeling is a technique to identify the patterns in a text that correspond to a particular topic. In other words, topic modeling is used to uncover abstract themes or hidden structure in a given set of documents.

You can use topic modeling in following scenarios −

Text Classification

Classification can be improved by topic modeling because it groups similar words together rather
than using each word separately as a feature.

Recommender Systems

We can build recommender systems by using similarity measures.

Topic Modeling Algorithms

We can implement topic modeling by using the following algorithms −

Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms; it uses probabilistic graphical models for implementing topic modeling.

Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based upon Linear Algebra and uses the concept of SVD (Singular Value Decomposition) on the document-term matrix.

Non-Negative Matrix Factorization (NMF) − It is also based upon Linear Algebra, like LSA.

The above mentioned algorithms would have the following elements, illustrated in the sketch below −

 Number of topics: Parameter
 Document-Word Matrix: Input
 WTM (Word Topic Matrix) & TDM (Topic Document Matrix): Output
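
A minimal LDA sketch using scikit-learn is shown below; the tiny corpus and the choice of two topics are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["Python is great for machine learning",
        "Football and cricket are popular sports",
        "Machine learning models need lots of data",
        "The cricket match was very exciting"]

cv = CountVectorizer(stop_words='english')
dtm = cv.fit_transform(docs)                    # Document-Word Matrix (input)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # number of topics (parameter)
lda.fit(dtm)

# Word-topic information (output): print the top words of each topic
words = cv.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-3:]]
    print("Topic", idx, ":", top_words)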

Sentiment Analysis Using Python
In today’s digital age, platforms like Twitter, Goodreads, and Amazon
overflow with people’s opinions, making it crucial for organizations to
extract insights from this massive volume of data. Sentiment Analysis in
Python offers a powerful solution to this challenge. This technique, a subset
of Natural Language Processing (NLP), involves classifying texts into
sentiments such as positive, negative, or neutral. By employing
various Python libraries and models, analysts can automate this process
efficiently.

Sentiment Analysis is a use case of Natural Language Processing


(NLP) and comes under the category of text classification. To put it
simply, Sentiment Analysis involves classifying a text into various
sentiments, such as positive or negative, Happy, Sad or Neutral, etc. Thus,
the ultimate goal of sentiment analysis is to decipher the underlying mood,
emotion, or sentiment of a text. This is also referred to as Opinion Mining.

How Does Sentiment Analysis Work?

Sentiment analysis in Python typically works by employing natural language

processing(NLP) techniques to analyze and understand the sentiment expressed in text. The

process involves several steps:

 Text Preprocessing: The text cleaning process involves removing irrelevant

information, such as special characters, punctuation, and stopwords, from the text

data.

 Tokenization: The text is divided into individual words or tokens to facilitate

analysis.

 Feature Extraction: The text extraction process involves extracting relevant

features from the text, such as words, n-grams, or even parts of speech.

 Sentiment Classification: Machine learning algorithms or pre-trained models are

used to classify the sentiment of each text instance. Researchers achieve this

through supervised learning, where they train models on labeled data, or through

pre-trained models that have learned sentiment patterns from large datasets.

 Post-processing: The sentiment analysis results may undergo additional

processing, such as aggregating sentiment scores or applying threshold rules to

classify sentiments as positive, negative, or neutral.

 Evaluation: Researchers assess the performance of the sentiment analysis model

using evaluation metrics, such as accuracy, precision, recall, or F1 score.

Types of Sentiment Analysis

Various types of sentiment analysis can be performed, depending on the specific

focus and objective of the analysis. Some common types include:

 Document-Level Sentiment Analysis: This type of analysis determines the overall

sentiment expressed in a document, such as a review or an article. It aims to

classify the entire text as positive, negative, or neutral.

 Sentence-Level Sentiment Analysis: Here, the sentiment of each sentence within

a document is analyzed. This type provides a more granular understanding of the

sentiment expressed in different text parts.

 Aspect-Based Sentiment Analysis: This approach focuses on identifying and

extracting the sentiment associated with specific aspects or entities mentioned in

the text. For example, in a product review, the sentiment towards different features

of the product (e.g., performance, design, usability) can be analyzed separately.

 Entity-Level Sentiment Analysis: This type of analysis identifies the sentiment

expressed towards specific entities or targets mentioned in the text, such as people,

companies, or products. It helps understand the sentiment associated with different

entities within the same document.

 Comparative Sentiment Analysis: This approach involves comparing the

sentiment between different entities or aspects mentioned in the text. It aims to

identify the relative sentiment or preferences expressed towards various entities or

features.

Sentiment Analysis Use Cases

We just saw how sentiment analysis can empower organizations with insights that

can help them make data-driven decisions. Now, let’s peep into some more use

cases of sentiment analysis:

 Social Media Monitoring for Brand Management: Brands can use sentiment

analysis to gauge their Brand’s public outlook. For example, a company can gather

all Tweets with the company’s mention or tag and perform sentiment analysis to

learn the company’s public outlook.

 Product/Service Analysis: Brands/Organizations can perform sentiment analysis

on customer reviews to see how well a product or service is doing in the market and

make future decisions accordingly.

 Stock Price Prediction: Predicting whether the stocks of a company will go up or down is

crucial for investors. One can determine the same by performing sentiment analysis on News

Headlines of articles containing the company’s name. If the news headlines pertaining to a particular

organization happen to have a positive sentiment — its stock prices should go up and vice-versa.
Ways to Perform Sentiment Analysis in Python

Python is one of the most powerful tools when it comes to performing data science

tasks — it offers a multitude of ways to perform sentiment analysis in Python. The

most popular ones are enlisted here:

 Using Text Blob

 Using Vader

 Using Bag of Words Vectorization-based Models

 Using LSTM-based Models

 Using Transformer-based Models

Note: For the purpose of demonstrating methods 3 & 4 (Using Bag of Words Vectorization-based Models and Using LSTM-based Models), a labelled dataset has been used. It comprises more than 5,000 texts labelled as positive, negative or neutral. The dataset lies under the Creative Commons license.

Using Text Blob

Text Blob is a Python library for Natural Language Processing. Using Text Blob for

sentiment analysis is quite simple. It takes text as an input and can return polarity

and subjectivity as outputs.

 Polarity determines the sentiment of the text. Its values lie in [-1,1] where -1

denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

 Subjectivity determines whether a text input is a factual information or a personal

opinion. Its value lies between [0,1] where a value closer to 0 denotes a piece of

factual information and a value closer to 1 denotes a personal opinion.

Here are the steps to perform sentiment analysis using Python, along with the corresponding sentiment analysis code.

Step1: Installation
pip install textblob
Step2: Importing Text Blob
from textblob import TextBlob
Step3: Code Implementation for Sentiment Analysis Using Text Blob

Writing code for sentiment analysis using TextBlob is fairly simple. Just import the

TextBlob object and pass the text to be analyzed with appropriate attributes as

follows:

from textblob import TextBlob

text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."

# Determining the Polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity

# Determining the Subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity

print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)

Output

Polarity of Text 1 is 1.0
Polarity of Text 2 is -1.0
Subjectivity of Text 1 is 1.0
Subjectivity of Text 2 is 1.0

Using VADER

VADER (Valence Aware Dictionary and Sentiment Reasoner) is a rule-based

sentiment analyzer that has been trained on social media text. Just like Text Blob,

its usage in Python is pretty simple. We’ll see its usage in code implementation with

an example in a while.

Step1: Installation
pip install vaderSentiment
Step2: Importing SentimentIntensityAnalyzer class from Vader
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Step3: Code for Sentiment Analysis Using Vader

Firstly, we need to create an object of the SentimentIntensityAnalyzer class; then

we need to pass the text to the polarity_scores() function of the object as follows:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Output:

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

As we can see, a VaderSentiment object returns a dictionary of sentiment scores for

the text to be analyzed.
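
As a small follow-up, the compound score is often turned into a label; the ±0.05 thresholds below are the commonly suggested convention, but they remain an adjustable design choice.

def classify(compound):
    # compound >= 0.05 -> positive, <= -0.05 -> negative, otherwise neutral
    if compound >= 0.05:
        return "Positive"
    elif compound <= -0.05:
        return "Negative"
    return "Neutral"

print(classify(sent_1['compound']))   # Positive
print(classify(sent_2['compound']))   # Negative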

Using Bag of Words Vectorization-Based Models

In the two approaches discussed so far, i.e. Text Blob and VADER, we have simply used Python libraries to perform sentiment analysis. Now we'll discuss an approach

wherein we’ll train our own model for the task. The steps involved in performing

sentiment analysis using the Bag of Words Vectorization method are as follows:

 Pre-Process the text of training data (Text pre-processing involves Normalization,

Tokenization, Stopwords Removal, and Stemming/Lemmatization.)

 Create a Bag of Words for the pre-processed text data using the Count

Vectorization or TF-IDF Vectorization approach.

 Train a suitable classification model on the processed data for sentiment

classification.

Code for Sentiment Analysis using Bag of Words Vectorization Approach:

To build a sentiment analysis model in Python using the BOW Vectorization approach, we need a labeled dataset. As stated earlier, the dataset used for this demonstration has been obtained from Kaggle. We have simply used sklearn's count vectorizer to create the BOW. After that, we trained a Multinomial Naive Bayes classifier, for which an accuracy score of about 0.91 was obtained (see the output below).

# Loading the Dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')

# Pre-Processing and Bag of Words Vectorization using Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1),
                     tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['sentences'])

# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'],
                                                    test_size=0.25, random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
print("Accuracy Score: ", accuracy_score)

Output:

Accuracy Score: 0.9111675126903553

The trained classifier can be used to predict the sentiment of any given text input.

Using LSTM-Based Models

Though we were able to obtain a decent accuracy score with the Bag of Words

Vectorization method, it might fail to yield the same results when dealing with larger

datasets. This gives rise to the need to employ deep learning-based models for training the sentiment analysis model.

For NLP tasks, we generally use RNN-based models since they are designed to

deal with sequential data. Here, we’ll train an LSTM (Long Short Term Memory)

model using TensorFlow with Keras. The steps to perform sentiment analysis using

LSTM-based models are as follows:

 Pre-Process the text of training data (Text pre-processing involves Normalization,

Tokenization, Stopwords Removal, and Stemming/Lemmatization.)

 Tokenizer is imported from Keras.preprocessing.text and created, fitting it to the

entire training text. Text embeddings are generated using texts_to_sequence() and

stored after padding to equal length. Embeddings are numerical/vectorized

representations of text, not directly fed to the model.

 The model is built using TensorFlow, including input, LSTM, and dense layers.

Dropouts and hyperparameters are adjusted for accuracy. In inner layers, we use

ReLU or LeakyReLU activation functions to avoid vanishing gradient problems, while

in the output layer, we use Softmax or Sigmoid activation functions.

Code for Sentiment Analysis Using LSTM-based Model

Here, we have used the same dataset as we used in the case of the BOW

approach. A training accuracy of 0.90 was obtained.

# Importing necessary libraries
import nltk
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Loading the dataset (the same Kaggle dataset as the BOW example,
# with 'sentences' and 'feedback' columns)
data = pd.read_csv('Finance_data.csv')

# Pre-Processing the text
def cleaning(df, stop_words):
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(x.lower() for x in x.split()))
    # Removing the digits/numbers
    df['sentences'] = df['sentences'].str.replace(r'\d', '', regex=True)
    # Removing stop words
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(x for x in x.split() if x not in stop_words))
    # Lemmatization
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))
    return df

stop_words = stopwords.words('english')
data_cleaned = cleaning(data, stop_words)

# Generating Embeddings using tokenizer
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['sentences'].values)
X = tokenizer.texts_to_sequences(data_cleaned['sentences'].values)
X = pad_sequences(X)

# Encoding the three sentiment labels and splitting into train and test sets
y = to_categorical(LabelEncoder().fit_transform(data_cleaned['feedback']))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Model Building
model = Sequential()
model.add(Embedding(500, 120, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Model Training
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

# Model Testing
model.evaluate(X_test, y_test)

Using Transformer-Based Models

Transformer-based models are one of the most advanced Natural Language

Processing Techniques. They follow an Encoder-Decoder-based architecture and

employ the concepts of self-attention to yield impressive results. Though one can

always build a transformer model from scratch, it is quite tedious a task. Thus, we

can use pre-trained transformer models available on Hugging Face. Hugging Face

is an open-source AI community that offers a multitude of pre-trained models for

NLP applications. You can use these models as they are or fine-tune them for

specific tasks.

Step1: Installation
pip install transformers
Step2: Importing the transformers module
import transformers

Step3: Code for Sentiment Analysis Using Transformer based models

To perform any task using transformers, we first need to import the pipeline function

from transformers. Then, an object of the pipeline function is created and the task to

be performed is passed as an argument (i.e sentiment analysis in our case). We can

also specify the model that we need to use to perform the task. Here, since we have not mentioned the model to be used, the distilbert-base-uncased-finetuned-sst-2-english model is used by default for sentiment analysis.

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "It was the worst of times."]
sentiment_pipeline(data)

Output

[{'label': 'POSITIVE', 'score': 0.999457061290741},
 {'label': 'NEGATIVE', 'score': 0.9987301230430603}]
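
The text above mentions that the model to use can also be specified explicitly. A hedged sketch naming the same default checkpoint would look like this:

from transformers import pipeline

# Passing the checkpoint name explicitly gives the same behaviour as the default pipeline
sentiment_pipeline = pipeline("sentiment-analysis",
                              model="distilbert-base-uncased-finetuned-sst-2-english")
print(sentiment_pipeline(["The lectures were very engaging."]))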
What is the best Python library for sentiment analysis?

There is no single best library for sentiment analysis in Python; it depends on your needs.

Here’s a quick comparison:

NLTK: Powerful, versatile, good for multiple NLP tasks, but complex for sentiment

analysis.

TextBlob: Beginner-friendly, simple interface for sentiment analysis (polarity,

subjectivity).

Pattern: More comprehensive analysis (comparatives, superlatives, fact/opinion),

steeper learning curve.

Polyglot: Fast, multilingual support (136+ languages), ideal for multiple languages.

Key Takeaways

 Python provides a versatile environment for performing sentiment analysis tasks due

to its rich ecosystem of libraries and frameworks.

 We explored multiple approaches including Text Blob, VADER, Bag of Words,

LSTM, and Transformer-based models to analyze sentiment in textual data.

 The process involves text preprocessing, tokenization, feature extraction, and

applying machine learning or deep learning models to classify sentiments.

 We applied these methods to real-world examples like customer reviews and social

media data to classify sentiments as positive, negative, or neutral.

 Sentiment analysis helps organizations monitor brand perception, analyze customer

feedback, and make data-driven decisions.

 With advancements in natural language processing, sentiment analysis in Python

continues to evolve, offering more accurate and sophisticated methods for

understanding textual sentiment.

GUI BUILDING IN PYTHON USING TKINTER
Modern computer applications are user-friendly. User interaction is not restricted to console-
based I/O. They have a more ergonomic graphical user interface (GUI) thanks to high speed
processors and powerful graphics hardware. These applications can receive inputs through
mouse clicks and can enable the user to choose from alternatives with the help of radio buttons,
dropdown lists, and other GUI elements (or widgets).

Such applications are developed using one of various graphics libraries available. A graphics
library is a software toolkit having a collection of classes that define a functionality of various
GUI elements. These graphics libraries are generally written in C/C++.

GUI elements and their functionality are defined in the Tkinter module. The following code
demonstrates the steps in creating a UI.

from tkinter import *

window=Tk()

# add widgets here

window.title('Hello Python')

window.geometry("300x200+10+20")

window.mainloop()

First of all, import the tkinter module. After importing, set up the application object by calling
the Tk() function. This will create a top-level window (root) having a frame with a title bar,
control box with the minimize and close buttons, and a client area to hold other widgets. The
geometry() method defines the width, height and coordinates of the top left corner of the frame as below (all values are in pixels):

window.geometry("widthxheight+XPOS+YPOS")

The application object then enters an event listening loop by calling the mainloop() method. The
application is now constantly waiting for any event generated on the elements in it. The event
could be text entered in a text field, a selection made from the dropdown or radio button,
single/double click actions of mouse, etc. The application's functionality involves executing
appropriate callback functions in response to a particular type of event. The event loop will
terminate as and when the close button on the title bar is clicked. The above code will create the
following window:

Python-Tkinter Window

All Tkinter widget classes are inherited from the Widget class. Let's add the most commonly
used widgets.

Button

The button can be created using the Button class. The Button class constructor requires a
reference to the main window and to the options.

Signature: Button(window, attributes)

You can set the following important properties to customize a button:

 text : caption of the button


 bg : background colour
 fg : foreground colour
 font : font name and size
 image : to be displayed instead of text
 command : function to be called when clicked

Example: Button
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()

Label

A label can be created in the UI in Python using the Label class. The Label constructor requires
the top-level window object and options parameters. Option parameters are similar to the Button
object.

The following adds a label in the window.

Example: Label

from tkinter import *

window=Tk()

lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))

lbl.place(x=60, y=50)

window.title('Hello Python')

window.geometry("300x200+10+10")

window.mainloop()

Here, the label's caption will be displayed in red colour using Helvetica font of 16 point size.

Entry

This widget renders a single-line text box for accepting the user input. For multi-line text input
use the Text widget. Apart from the properties already mentioned, the Entry class constructor
accepts the following:

 bd : border size of the text box; default is 2 pixels.


 show : to convert the text box into a password field, set show property to "*".

The following code adds the text field.

txtfld=Entry(window, text="This is Entry Widget", bg='black',fg='white', bd=5)

The following example creates a window with a button, label and entry field.

Example: Create Widgets


from tkinter import *

window=Tk()

btn=Button(window, text="This is Button widget", fg='blue')

btn.place(x=80, y=100)

lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))

lbl.place(x=60, y=50)

txtfld=Entry(window, text="This is Entry Widget", bd=5)

txtfld.place(x=80, y=150)

window.title('Hello Python')

window.geometry("300x200+10+10")

window.mainloop()

The above example will create the following window.

Create UI Widgets in Python-Tkinter

Selection Widgets

Radiobutton: This widget displays a toggle button having an ON/OFF state. There may be more
than one button, but only one of them will be ON at a given time.

Checkbutton: This is also a toggle button. A rectangular check box appears before its caption.
Its ON state is displayed by the tick mark in the box which disappears when it is clicked to OFF.

Combobox: This class is defined in the ttk module of the tkinter package. It populates the drop-down with data from a collection data type, such as a tuple or a list, passed as the values parameter.

Listbox: Unlike Combobox, this widget displays the entire collection of string items. The user
can select one or multiple items.

The following example demonstrates the window with the selection widgets: Radiobutton,
Checkbutton, Listbox and Combobox:

Example: Selection Widgets


from tkinter import *
from tkinter.ttk import Combobox
window=Tk()
var = StringVar()
var.set("one")
data=("one", "two", "three", "four")

cb=Combobox(window, values=data)
cb.place(x=60, y=150)

lb=Listbox(window, height=5, selectmode='multiple')
for num in data:
    lb.insert(END, num)
lb.place(x=250, y=150)

v0=IntVar()
v0.set(1)
r1=Radiobutton(window, text="male", variable=v0,value=1)
r2=Radiobutton(window, text="female", variable=v0,value=2)
r1.place(x=100,y=50)
r2.place(x=180, y=50)

v1 = IntVar()
v2 = IntVar()
C1 = Checkbutton(window, text = "Cricket", variable = v1)
C2 = Checkbutton(window, text = "Tennis", variable = v2)
C1.place(x=100, y=100)
C2.place(x=180, y=100)

window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()

Create UI in Python-Tkinter

Event Handling

An event is a notification received by the application object from various GUI widgets as a result
of user interaction. The Application object is always anticipating events as it runs an event
listening loop. User's actions include mouse button click or double click, keyboard key pressed
while control is inside the text box, certain element gains or goes out of focus etc.

Events are expressed as strings in <modifier-type-qualifier> format.

Many events are represented just as qualifier. The type defines the class of the event.

The following table shows how the Tkinter recognizes different events:

Event Modifier Type Qualifier Action

<Button-1> Button 1 Left mouse button click.

<Button-2> Button 2 Middle mouse button click.

<Destroy> Destroy Window is being destroyed.

<Double-Button-1> Double Button 1 Double-click of the left mouse button.

<Enter> Enter Cursor enters window.

<Expose> Expose Window fully or partially exposed.

<KeyPress-a> KeyPress a The 'a' key has been pressed.

<KeyRelease> KeyRelease Any key has been released.

<Leave> Leave Cursor leaves window.

<Print> Print PRINT key has been pressed.

<FocusIn> FocusIn Widget gains focus.

<FocusOut> FocusOut Widget loses focus.

An event should be registered with one or more GUI widgets in the application. If it's not, it will
be ignored. In Tkinter, there are two ways to register an event with a widget. First way is by
using the bind() method and the second way is by using the command parameter in the widget
constructor.

Bind() Method

The bind() method associates an event to a callback function so that, when the event occurs, the
function is called.

Syntax:
Widget.bind(event, callback)

For example, to invoke the MyButtonClicked() function on left button click, use the following
code:

Example: Event Binding

from tkinter import *

def MyButtonClicked(event):   # the callback receives the event object
    print('Button clicked')

window=Tk()
btn = Button(window, text='OK')
btn.bind('<Button-1>', MyButtonClicked)

The event object is characterized by many properties such as source widget, position coordinates,
mouse button number and event type. These can be passed to the callback function if required.

Command Parameter

Each widget primarily responds to a particular type. For example, Button is a source of the
Button event. So, it is by default bound to it. Constructor methods of many widget classes have
an optional parameter called command. This command parameter is set to callback the function
which will be invoked whenever its bound event occurs. This method is more convenient than
the bind() method.

btn = Button(window, text='OK', command=myEventHandlerFunction)

In the example given below, the application window has two text input fields and another one to
display the result. There are two button objects with the captions Add and Subtract. The user is
expected to enter the number in the two Entry widgets. Their addition or subtraction is displayed
in the third.

The first button (Add) is configured using the command parameter. Its value is the add() method
in the class. The second button uses the bind() method to register the left button click with the
sub() method. Both methods read the contents of the text fields by the get() method of the
Entry widget, parse to numbers, perform the addition/subtraction and display the result in third
text field using the insert() method.

Example:
from tkinter import *

class MyWindow:
    def __init__(self, win):
        self.lbl1=Label(win, text='First number')
        self.lbl2=Label(win, text='Second number')
        self.lbl3=Label(win, text='Result')
        self.t1=Entry(bd=3)
        self.t2=Entry()
        self.t3=Entry()
        self.lbl1.place(x=100, y=50)
        self.t1.place(x=200, y=50)
        self.lbl2.place(x=100, y=100)
        self.t2.place(x=200, y=100)
        self.b1=Button(win, text='Add', command=self.add)
        self.b2=Button(win, text='Subtract')
        self.b2.bind('<Button-1>', self.sub)
        self.b1.place(x=100, y=150)
        self.b2.place(x=200, y=150)
        self.lbl3.place(x=100, y=200)
        self.t3.place(x=200, y=200)

    def add(self):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1+num2
        self.t3.insert(END, str(result))

    def sub(self, event):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1-num2
        self.t3.insert(END, str(result))

window=Tk()
mywin=MyWindow(window)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()

The above example creates the following UI.

UI in Python-Tkinter

Thus, you can create the UI using TKinter in Python.

CustomTkinter: It is an extension of the Tkinter module in Python. It provides additional UI elements compared to Tkinter, and they can be customized in various possible ways. For example, when we customize a button using CustomTkinter, we can make customizations like adding an image, making the edges round, adding borders around it, etc.
To install the customtkinter module in Python execute the below command in the
terminal:
pip install customtkinter

Designing the form

The design phase is an important phase in the development cycle of an application. To design the layout of our form we can use a paint application or any other online tool to sketch the page as given below.

Building the initial application


Here we are going to create a simple blank application page using the customtkinter module, setting its appearance to match the system (dark or light mode). We define an App class in which the __init__ function sets the title and the window size.

# Import customtkinter module
import customtkinter as ctk

# Sets the appearance mode of the application
# "System" sets the appearance same as that of the system
ctk.set_appearance_mode("System")

# Sets the color of the widgets
# Supported themes: green, dark-blue, blue
ctk.set_default_color_theme("green")

# Create App class
class App(ctk.CTk):
    # Layout of the GUI will be written in the init itself
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Sets the title of our window to "App"
        self.title("App")
        # Dimensions of the window will be 200x200
        self.geometry("200x200")

if __name__ == "__main__":
    app = App()
    # Runs the app
    app.mainloop()

Designing the form using the customtkinter module in Python

Here we are going to build the form that we have designed above. We are going to use labels using the CTkLabel() function, text fields using the CTkEntry() function, radio buttons using the CTkRadioButton() function, etc.
All the widgets are created and placed in the following manner
 We create a widget by using the ctk.CTk<widget_name>() (For example,
ctk.CTkLabel() creates a label)
 And then we pass the arguments to it depending on the type of the widget.
 .grid() is used to specify the position, alignment, padding, and other
dimensions of the widget in our window
Syntax of grid():
grid( grid_options )
Parameters:
 row: Specifies the row at which the widget must be placed.
 column: Specifies the column at which the widget must be placed.
 rowspan: Specifies the height of the widget (number of rows the widget
spans).
 columnspan: Specifies the length of the widget (number of columns the
widget spans).
 padx, pady: Specifies the padding of the widget along x and y axes
respectively.
 sticky: Specifies how the widget elongates with respect to the changes in
its corresponding row and column.

# Python program to create a basic GUI
# application using the customtkinter module

import customtkinter as ctk
import tkinter as tk

# Basic parameters and initializations
# Supported modes : Light, Dark, System
ctk.set_appearance_mode("System")

# Supported themes : green, dark-blue, blue
ctk.set_default_color_theme("green")

appWidth, appHeight = 600, 700

# App Class
class App(ctk.CTk):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

self.title("GUI Application")
self.geometry(f"{appWidth}x{appHeight}")

# Name Label

self.nameLabel = ctk.CTkLabel(self,

text="Name")
self.nameLabel.grid(row=0, column=0,
padx=20, pady=20,
sticky="ew")

# Name Entry Field


self.nameEntry = ctk.CTkEntry(self,
placeholder_text="Teja")
self.nameEntry.grid(row=0, column=1,
columnspan=3, padx=20,
pady=20, sticky="ew")

# Age Label
self.ageLabel = ctk.CTkLabel(self,

text="Age")
self.ageLabel.grid(row=1, column=0,
padx=20, pady=20,
sticky="ew")

# Age Entry Field


self.ageEntry = ctk.CTkEntry(self,
placeholder_text="18")
self.ageEntry.grid(row=1, column=1,
columnspan=3, padx=20,
pady=20, sticky="ew")

# Gender Label
self.genderLabel = ctk.CTkLabel(self,

text="Gender")
self.genderLabel.grid(row=2, column=0,
padx=20, pady=20,
sticky="ew")

# Gender Radio Buttons


self.genderVar = tk.StringVar(value="Prefer\

not to say")

self.maleRadioButton = ctk.CTkRadioButton(self,
text="Male",

variable=self.genderVar,

value="He is")
self.maleRadioButton.grid(row=2, column=1, padx=20,
pady=20,
sticky="ew")

self.femaleRadioButton = ctk.CTkRadioButton(self,

text="Female",

variable=self.genderVar,

value="She is")
self.femaleRadioButton.grid(row=2, column=2,
padx=20,
pady=20,
sticky="ew")

self.noneRadioButton = ctk.CTkRadioButton(self,

text="Prefer not to say",

variable=self.genderVar,

value="They are")
self.noneRadioButton.grid(row=2, column=3,
padx=20,
pady=20,
sticky="ew")

# Choice Label
self.choiceLabel = ctk.CTkLabel(self,

text="Choice")
self.choiceLabel.grid(row=3, column=0,
padx=20, pady=20,
sticky="ew")

# Choice Check boxes


self.checkboxVar = tk.StringVar(value="Choice 1")

self.choice1 = ctk.CTkCheckBox(self, text="choice 1",

variable=self.checkboxVar,

onvalue="choice1",

offvalue="c1")
self.choice1.grid(row=3, column=1, padx=20,
pady=20, sticky="ew")

self.choice2 = ctk.CTkCheckBox(self, text="choice 2",

variable=self.checkboxVar,

onvalue="choice2",

offvalue="c2")
self.choice2.grid(row=3, column=2, padx=20, pady=20,
sticky="ew")

# Occupation Label
self.occupationLabel = ctk.CTkLabel(self,

text="Occupation")
self.occupationLabel.grid(row=4, column=0,
padx=20,
pady=20,
sticky="ew")

# Occupation combo box


self.occupationOptionMenu = ctk.CTkOptionMenu(self,
values=["Student", "Working Professional"])
self.occupationOptionMenu.grid(row=4, column=1,
padx=20,
pady=20,

columnspan=2, sticky="ew")

# Generate Button
self.generateResultsButton = ctk.CTkButton(self,

text="Generate Results")
self.generateResultsButton.grid(row=5, column=1,

columnspan=2,

padx=20, pady=20,

sticky="ew")

# Text Box
self.displayBox = ctk.CTkTextbox(self, width=200,

height=100)
self.displayBox.grid(row=6, column=0, columnspan=4,
padx=20, pady=20,
sticky="nsew")

if __name__ == "__main__":
app = App()
app.mainloop()

Complete application using customtkinter module in Python

In the previous step we designed our form UI, and now we are going to add some more functionality to generate results and print all the information entered by the user in the result block. For that we are going to create the two functions given below.

 generateResults(): This function is used to put the text into the textbox.
 createText(): Based on the selected preferences and entered text, this
function generates a text variable that sums up all the entries and returns it.

# Python program to create a basic form
# GUI application using the customtkinter module
import customtkinter as ctk
import tkinter as tk

# Sets the appearance of the window


# Supported modes : Light, Dark, System
# "System" sets the appearance mode to
# the appearance mode of the system
ctk.set_appearance_mode("System")

# Sets the color of the widgets in the window


# Supported themes : green, dark-blue, blue
ctk.set_default_color_theme("green")

# Dimensions of the window


appWidth, appHeight = 600, 700

# App Class
class App(ctk.CTk):
# The layout of the window will be written
# in the init function itself
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

# Sets the title of the window to "App"


self.title("GUI Application")
# Sets the dimensions of the window to 600x700
self.geometry(f"{appWidth}x{appHeight}")

# Name Label
self.nameLabel = ctk.CTkLabel(self,
text="Name")
self.nameLabel.grid(row=0, column=0,
padx=20, pady=20,
sticky="ew")

# Name Entry Field


self.nameEntry = ctk.CTkEntry(self,
placeholder_text="Teja")
self.nameEntry.grid(row=0, column=1,
columnspan=3, padx=20,
pady=20, sticky="ew")

# Age Label
self.ageLabel = ctk.CTkLabel(self, text="Age")
self.ageLabel.grid(row=1, column=0,
padx=20, pady=20,
sticky="ew")

# Age Entry Field


self.ageEntry = ctk.CTkEntry(self,
placeholder_text="18")
self.ageEntry.grid(row=1, column=1,
columnspan=3, padx=20,
pady=20, sticky="ew")

# Gender Label
self.genderLabel = ctk.CTkLabel(self,
text="Gender")
self.genderLabel.grid(row=2, column=0,
padx=20, pady=20,
sticky="ew")

# Gender Radio Buttons


self.genderVar = tk.StringVar(value="Prefer \
not to say")

self.maleRadioButton = ctk.CTkRadioButton(self,
text="Male",
variable=self.genderVar,
value="He is")
self.maleRadioButton.grid(row=2, column=1,
padx=20, pady=20,
sticky="ew")

self.femaleRadioButton = ctk.CTkRadioButton(self,
text="Female",
variable=self.genderVar,
value="She is")
self.femaleRadioButton.grid(row=2, column=2,
padx=20, pady=20,
sticky="ew")

self.noneRadioButton = ctk.CTkRadioButton(self,
text="Prefer not to say",
variable=self.genderVar,
value="They are")
self.noneRadioButton.grid(row=2, column=3, padx=20,
pady=20, sticky="ew")

# Choice Label
self.choiceLabel = ctk.CTkLabel(self,
text="Choice")
self.choiceLabel.grid(row=3, column=0,
padx=20, pady=20,
sticky="ew")

# Choice Check boxes
self.checkboxVar = tk.StringVar(value="Choice 1")

self.choice1 = ctk.CTkCheckBox(self,
text="choice 1",
variable=self.checkboxVar,
onvalue="choice1", offvalue="c1")
self.choice1.grid(row=3, column=1,
padx=20, pady=20,
sticky="ew")

self.choice2 = ctk.CTkCheckBox(self,
text="choice 2",
variable=self.checkboxVar,
onvalue="choice2",
offvalue="c2")

self.choice2.grid(row=3, column=2,
padx=20, pady=20,
sticky="ew")

# Occupation Label
self.occupationLabel = ctk.CTkLabel(self,
text="Occupation")
self.occupationLabel.grid(row=4, column=0,
padx=20, pady=20,
sticky="ew")

# Occupation combo box


self.occupationOptionMenu = ctk.CTkOptionMenu(self,
values=["Student",
"Working Professional"])
self.occupationOptionMenu.grid(row=4, column=1,
padx=20, pady=20,
columnspan=2, sticky="ew")

# Generate Button
self.generateResultsButton = ctk.CTkButton(self,
text="Generate Results",

command=self.generateResults)
self.generateResultsButton.grid(row=5, column=1,
columnspan=2, padx=20,
pady=20, sticky="ew")

# Text Box
self.displayBox = ctk.CTkTextbox(self,
width=200,
height=100)
self.displayBox.grid(row=6, column=0,
columnspan=4, padx=20,
pady=20, sticky="nsew")

# This function is used to insert the
# details entered by users into the textbox
def generateResults(self):
self.displayBox.delete("0.0", "200.0")
text = self.createText()
self.displayBox.insert("0.0", text)

# This function is used to get the selected


# options and text from the available entry
# fields and boxes and then generates
# a prompt using them
def createText(self):
checkboxValue = ""

# .get() is used to get the value of the checkboxes and entryfields

if self.choice1._check_state and self.choice2._check_state:


checkboxValue += self.choice1.get() + " and " + self.choice2.get()
elif self.choice1._check_state:
checkboxValue += self.choice1.get()
elif self.choice2._check_state:
checkboxValue += self.choice2.get()
else:
checkboxValue = "none of the available options"

# Constructing the text variable


text = f"{self.nameEntry.get()} : \n{self.genderVar.get()} {self.ageEntry.get()} years old and
prefers {checkboxValue}\n"
text += f"{self.genderVar.get()} currently a {self.occupationOptionMenu.get()}"

return text

if __name__ == "__main__":
app = App()
# Used to run the application
app.mainloop()

Model Building, Identification and Evaluation in
Python
Predictive Model
As the name implies, predictive modeling is used to determine a certain output
using historical data. For example, you can build a recommendation
system that calculates the likelihood of developing a disease, such as
diabetes, using some clinical & personal data such as:

 Age
 Gender
 Weight
 Average glucose level
 Daily calories

This way, doctors are better prepared to intervene with medications or


recommend a healthier lifestyle.

Another use case for predictive models is forecasting sales. Using time series
analysis, you can collect and analyze a company’s performance to estimate
what kind of growth you can expect in the future.

Essentially, with predictive programming, you collect historical data, analyze it,
and train a model that detects specific patterns so that when it encounters
new data later on, it’s able to predict future results.

There are different predictive models that you can build using different
algorithms. Popular choices include regressions, neural networks, decision
trees, K-means clustering, Naïve Bayes, and others.

Predictive Modelling Applications
There are many ways to apply predictive models in the real world. Most
industries use predictive programming either to detect the cause of a problem
or to improve future results. Applications include but are not limited to:

 Fraud detection
 Sales forecasting
 Natural disaster relief
 Business performance growth
 Speech recognition
 News categorization
 Vehicle maintenance

As the industry develops, so do the applications of these models. Companies


are constantly looking for ways to improve processes and reshape the world
through data. In a few years, you can expect to find even more diverse ways
of implementing Python models in your data science workflow.

We’ll build a binary logistic regression model step-by-step to predict floods based on the
monthly rainfall index for each year in Kerala, India.

Step 1: Import Python Libraries


First and foremost, import the necessary Python libraries. In our case, we’ll be
working with pandas, NumPy, matplotlib, seaborn, and scikit-learn.

To import them, use the following code:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Step 2: Read the Dataset
We use pandas to display the first 5 rows in our dataset:
df= pd.read_csv('kerala.csv')
df.head(5)

Step 3: Explore the Dataset


It’s important to know your way around the data you’re working with so you
know how to build your predictive model. For this reason, Python has
several functions that will help you with your explorations.

info()

The info() function shows us the data type of each column, number of
columns, memory usage, and the number of records in the dataset:
df.info()

shape

The shape function displays the number of records and columns:


df.shape

describe()

The describe() function summarizes the dataset’s statistical properties, such


as count, mean, min, and max:
df.describe()

corr()

The corr() function displays the correlation between different variables in our
dataset:
df.corr()

Step 4: Feature Selection


In this step, we choose several features that contribute most to the target
output. So, instead of training the model using every column in our dataset,
we select only those that have the strongest relationship with the predicted
variable.

Use the SelectKBest library to run a chi-squared statistical test and select the
top 3 features that are most related to floods.


Start by importing the SelectKBest library:


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

After, define X & Y:


X = df.iloc[:, 1:14]   # all features
Y = df.iloc[:, -1]     # target output (floods)

Select the top 3 features:


best_features= SelectKBest(score_func=chi2, k=3)
fit= best_features.fit(X,Y)

Now we create data frames for the features and the score of each feature:
df_scores= pd.DataFrame(fit.scores_)

df_columns= pd.DataFrame(X.columns)

Finally, we’ll combine all the features and their corresponding scores in one
data frame:
features_scores= pd.concat([df_columns, df_scores], axis=1)
features_scores.columns= ['Features', 'Score']
features_scores.sort_values(by = 'Score')

Step 5: Build the Model


Now it’s time to get our hands dirty. First, split the dataset into X and Y:
X = df[['SEP', 'JUN', 'JUL']]   # the top 3 features
Y = df[['FLOODS']]              # the target output

Second, split the dataset into train and test:


X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.4,
random_state=100)

Third, create a logistic regression model:


logreg= LogisticRegression()
logreg.fit(X_train,y_train)

Finally, we predict the likelihood of a flood using the logistic regression model
we created:
y_pred=logreg.predict(X_test)
print (X_test) #test dataset
print (y_pred) #predicted values

Step 6: Evaluate the Model’s Performance


As a final step, we’ll evaluate how well our Python model performed predictive
analytics by running a classification report and a ROC curve.

Classification Report

A classification report is a performance evaluation report that is used to


evaluate the performance of machine learning models by the following 5
criteria:

 Accuracy is a score used to evaluate the model’s performance. The higher it


is, the better.
 Recall measures the model’s ability to correctly predict the true positive
values.
 Precision is the ratio of true positives to the sum of both true and false
positives.
 F-score combines precision and recall into one metric. Ideally, its value should be as close to 1 as possible.
 Support is the number of actual occurrences of each class in the dataset.

Call these scores by inserting these lines of code:


from sklearn import metrics
from sklearn.metrics import classification_report
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
print('Recall: ', metrics.recall_score(y_test, y_pred, zero_division=1))
print("Precision:", metrics.precision_score(y_test, y_pred, zero_division=1))
print("CL Report:", metrics.classification_report(y_test, y_pred, zero_division=1))

ROC Curve

The receiver operating characteristic (ROC) curve is used to display the


sensitivity and specificity of the logistic regression model by calculating the
true positive and false positive rates.

From the ROC curve, we can calculate the area under the curve (AUC) whose
value ranges from 0 to 1. You’ll remember that the closer to 1, the better it is
for our predictive modeling.

To determine the ROC curve, first define the metrics:


y_pred_proba = logreg.predict_proba(X_test)[:, 1]

Then, calculate the true positive and false positive rates:


false_positive_rate, true_positive_rate, _ = metrics.roc_curve(y_test,
y_pred_proba)

Next, calculate the AUC to see the model's performance:


auc= metrics.roc_auc_score(y_test, y_pred_proba)

Finally, plot the ROC curve:


plt.plot(false_positive_rate, true_positive_rate, label="AUC="+str(auc))
plt.title('ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()

The AUC is 0.94, meaning that the model did a great job:

Gradient Boosting Regressor:


Boosting is an ensemble technique in which the predictors are not made independently, but
sequentially. Gradient Boosting uses decision trees as weak learners.

Boosting is a method of converting weak learners into strong learners by training many models
in a gradual, additive and sequential manner and minimizing Loss function (i.e squared error
for Regression problems) in the final model.

GBR often achieves better accuracy than other regression models because of its boosting technique. It is one of the most used regression algorithms in competitions.

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
criterion='friedman_mse',init=None, learning_rate=0.1, loss='ls',
max_depth=6,max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None,
presort='deprecated',random_state=None, subsample=1.0,
tol=0.0001,validation_fraction=0.1, verbose=0, warm_start=False)
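
A minimal sketch of fitting and scoring such a model is shown below; it assumes a train/test split of a regression dataset already exists as X_train, X_test, y_train, y_test, and the hyperparameter values are illustrative.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Illustrative hyperparameters; tune them for your own data
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=6, random_state=0)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R score:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))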

The result from GBR is as below:

Mean Squared Error: 1388.8979420780786
R score: 0.9579626971080454
Mean Absolute Error: 23.81293483364058

The Gradient Boosting Regressor gives us the best R² value of 0.957. However, this model is very difficult to interpret.

Interpret Ensemble Model:

Ensemble models definitely fall into the category of “Black Box” models since they are
composed of many potentially complex individual models.

Each tree is trained sequentially on bagged data using a random selection of features, so
gaining a full understanding of the decision process by examining each individual tree is
infeasible.

Model’s Goodness of Fit Test

Both the KMO and Bartlett’s test of sphericity are commonly used to verify the feasibility of the
data for Exploratory Factor Analysis (EFA); a short code sketch follows the list below.

 Kaiser-Meyer Olkin (KMO) model tests sampling adequacy by measuring the proportion
of variance in the items that may be common variance. Values ranging between .80 and
1.00 indicate sampling adequacy (Cerny & Kaiser, 1977).
 Bartlett’s test of sphericity examines whether a correlation matrix is significantly
different to the identity matrix, in which diagonal elements are unities and all off-
diagonal elements are zeros (Bartlett, 1950). Significant results indicate that variables in
the correlation matrix are suitable for factor analysis.
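
A minimal sketch of running both checks with the third-party factor_analyzer package is shown below; df is assumed to be a pandas DataFrame holding the observed items.

# pip install factor_analyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

chi_square_value, p_value = calculate_bartlett_sphericity(df)   # Bartlett's test of sphericity
kmo_per_item, kmo_model = calculate_kmo(df)                     # KMO per item and overall

print("Bartlett's test: chi2 =", chi_square_value, ", p =", p_value)
print("Overall KMO =", kmo_model)   # .80 to 1.00 indicates sampling adequacy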

Classification of fit indices: Absolute and Comparative

 The logic behind absolute fit indices is essentially to test how well the model specified by the researcher reproduces the observed data. Commonly used absolute fit statistics include the χ² fit statistic, RMSEA and SRMR.
 In contrast, comparative fit indices are based on a different logic, i.e. they assess how
well a model specified by a researcher fits the observed sample data relative to a null
model (i.e., a model that is based on the assumption that all observed variables are not
correlated) (Miles & Shevlin, 2007). Popular comparative model fit indices are the CFI
and TLI.

The χ² fit statistic

 The χ² measures the discrepancy between the observed and the implied covariance matrices.
 The χ² fit statistic is very popular and frequently reported in both CFA and SEM studies.
 However, it is notoriously sensitive to large sample sizes and increased model complexity (i.e. models with a large number of indicators and degrees of freedom). Therefore, the current practice is to report it mostly for historical reasons, and it is rarely used to make decisions about the adequacy of model fit.

The RMSEA

 The Root Mean Square Error of Approximation (RMSEA) provides information as to how well the model, with unknown but optimally chosen parameter estimates, would fit the population covariance matrix (Byrne, 1998).
 It is a very commonly used fit statistic.
 One of its key advantages is that the RMSEA calculates confidence intervals around its value.
 Values below .060 indicate close fit (Hu & Bentler, 1999). Values up to .080 are commonly accepted as adequate.

The SRMR
 The Standardized Root Mean Residual (SRMR) is the square root of the difference between the residuals of the sample covariance matrix and the hypothesized covariance model.
 As SRMR is standardized, its values range between 0 and 1. Commonly, models with values below the .05 threshold are considered to indicate good fit (Byrne, 1998). Also, values up to .08 are acceptable (Hu & Bentler, 1999).

The CFI and TLI


 Two comparative fit indices commonly reported are the Comparative Fit Index (CFI) and the Tucker Lewis Index (TLI). The indices are similar; however, note that the CFI is normed while the TLI is not. Therefore, the CFI’s values range between zero and one, whereas the TLI’s values may fall below zero or be above one (Hair et al., 2013).
 For CFI and TLI, values above .95 are indicative of good fit (Hu & Bentler, 1999). In practice, CFI and TLI values from .90 to .95 are considered acceptable.
 Note that the TLI is non-normed, so its values can go above 1.00.

Note:
Further to the aforementioned information, Hoyle (2012) provides an excellent succinct
summary of numerous fit indices. This table includes, for example, information on the indices'
theoretical range, sensitivity to varying sample size and model complexity. Note that, in contrast
to the indices introduced above, a great number of other indices exist, as illustrated in Hoyle's
table. Yet, the frequency of their use is decreasing for various reasons. For example, RMR is
non-normed and thus it is hard to interpret. Here these indices are shown below simply for
everyone's general awareness, i.e. the fact that they exist, who developed them and what their
statistical properties are.

# Simple orientation to programming, basic mathematical packages

import statistics
import math

print("Welcome to the world of data science! Artificial Intelligence, Machine Learning and Deep Learning")
data = [34, 67, 89, 12, 43, 23, 123]
x = statistics.mean(data)
print("The mean value here is : ", x)

print("Feel most welcome! Volume of the cylinder loading...")
# Enter values of the input variables
radius = float(input("Kindly enter the radius of the cylinder: "))
height = float(input("Kindly enter the height of the cylinder: "))
# Let's implement the process here
volume = math.pi * pow(radius, 2) * height
print("The volume of the cylinder is %.2f" % volume)

MS. SQL Server Database Development and Connectivity in Python
Python is a powerful programming language that is widely used in various industries,
including data science, web development, and automation. One of the key strengths of
Python is its ability to connect to various databases, including SQL Server. Here, we will
explore how to connect to SQL Server in Python.

Before we dive into the code, let’s briefly discuss what SQL Server is and why it is
important. SQL Server is a relational database management system (RDBMS)
developed by Microsoft that stores and retrieves data for various applications. It is
widely used in enterprise environments due to its scalability, security features, and
robustness.

Python provides several libraries for connecting to SQL Server, including pyodbc and
pymssql. These libraries allow Python developers to interact with SQL Server
databases by sending queries and receiving results.

Import pyodbc Module

To connect to SQL Server using Python, we need to use a module called pyodbc. This
module provides an interface between Python and Microsoft SQL Server, allowing us to
execute SQL statements and retrieve data from the database.

To import the pyodbc module, we first need to install it. We can do this using pip, which
is a package manager for Python. Open your command prompt or terminal and run the
following command:
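The library is published on PyPI under the name pyodbc, so the install command is:

pip install pyodbc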

Once you have installed pyodbc, you can import it into your Python script using the following code. This will make all of the functions and classes provided by the pyodbc module available in your script.
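For example:

import pyodbc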

Establish Connection to SQL Server

Now that we have installed the necessary libraries and have the server credentials, we
can establish a connection to our SQL Server using Python. We will be using the
pyodbc library to connect to our SQL Server.

Here’s an example code snippet that shows how to establish a connection:

import pyodbc

server = 'your_server_name'
database = 'your_database_name'
username = 'your_username'
password = 'your_password'

# Establishing a connection to the SQL Server

cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};\
SERVER='+server+';\
DATABASE='+database+';\
UID='+username+';\
PWD='+ password)

cursor = cnxn.cursor()

In this example, we imported the `pyodbc` library and defined our server name,
database name, username, and password. Then, we used the `connect()` method from
`pyodbc` to establish a connection to the SQL Server by passing in the necessary
parameters.

Once we have established a connection, we create a cursor object using `cnxn.cursor()`. The cursor object allows us to execute SQL statements on our server.

Note that the specific driver you use may differ depending on your system configuration.
You can find out which driver you need by checking your ODBC Data Source
Administrator.

In summary, establishing a connection to SQL Server using Python is fairly straightforward with the help of the `pyodbc` library. Once we have established a connection and created a cursor object, we can execute SQL statements on our server.

Create Cursor Object and Execute SQL Queries

After establishing a connection to the SQL Server in Python, the next step is to create a
cursor object. A cursor object allows you to execute SQL queries against the database
and retrieve data.

To create a cursor object, you can use the `cursor()` method of the connection object.
Here’s an example:

cursor = cnxn.cursor()

Once you have a cursor object, you can use it to execute SQL queries by calling its `execute()`
method. The `execute()` method takes an SQL query as an argument and executes it against
the database. Here’s an example:
query = "SELECT * FROM employees"
cursor.execute(query)

In this example, we are executing a simple SELECT query that retrieves all rows from
the `employees` table. Note that we are passing the query as a string to the `execute()`
method.
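As a side note, pyodbc also supports parameterized queries using question-mark placeholders, which is the safer option whenever part of the query comes from user input. A minimal sketch, assuming the `employees` table has a `department` column:

# Values are passed as parameters rather than formatted into the SQL string
query = "SELECT * FROM employees WHERE department = ?"
cursor.execute(query, ('Sales',))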

After executing a query, you can retrieve the results using one of the fetch methods of
the cursor object. The most common fetch methods are `fetchone()`, which retrieves
one row at a time, and `fetchall()`, which retrieves all rows at once. Here’s an example:

# Fetch one row at a time


row = cursor.fetchone()
while row:
    print(row)
    row = cursor.fetchone()

# Fetch all rows at once

rows = cursor.fetchall()
for row in rows:
    print(row)

In this example, we first use a while loop and the `fetchone()` method to retrieve one
row at a time and print it. We keep looping until there are no more rows to fetch. Then,
we use the `fetchall()` method to retrieve all rows at once and print them using a for
loop.

It’s important to note that after executing a query, you should always close the cursor
object using its `close()` method:

cursor.close()

Closing the cursor releases any resources that it was holding, such as locks on the database. It
also frees up memory on the client side.
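A simple way to make sure the cursor is closed even if a query raises an error is a try/finally block; this is just a sketch reusing the `cnxn` connection from the earlier example:

cursor = cnxn.cursor()
try:
    cursor.execute("SELECT * FROM employees")
    rows = cursor.fetchall()
finally:
    # Runs whether or not the query raised an error
    cursor.close()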

Retrieve Data from SQL Server

Now that we have successfully connected to the SQL Server database, we can retrieve
data from it using Python. There are several ways to retrieve data from SQL Server in
Python, but we will be using the `pandas` library in this demonstration.

`pandas` is a popular data manipulation library that provides data structures for
efficiently storing and analyzing large datasets. It also has built-in functions for reading
and writing data to various file formats, including SQL databases.

To retrieve data from SQL Server using `pandas`, we first need to write a SQL query
that specifies which data we want to extract. We can then use the `read_sql_query()`
function from `pandas` to execute the query and store the results in a `DataFrame`.

Here’s an example of how to retrieve all records from a table called `employees` in our
SQL Server database:

import pandas as pd
import pyodbc

# Set up connection
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;DATABASE=mydatabase;UID=username;PWD=password')

# Define SQL query


query = 'SELECT * FROM employees'

# Execute query and store results in a DataFrame


df = pd.read_sql_query(query, cnxn)

# Print first few rows of DataFrame


print(df.head())

In this example, we first import the necessary libraries (`pandas` and `pyodbc`) and set up the database connection using the same kind of parameters as in the earlier connection example.

Next, we define our SQL query as a string variable called `query`. This query simply
selects all columns (`*`) from the `employees` table.

We then use the `pd.read_sql_query()` function to execute the query and store the
results in a DataFrame called `df`. This function takes two arguments: the SQL query
and the database connection object (`cnxn`).

Finally, we print out the first few rows of the DataFrame using the `head()` function to
verify that we have successfully retrieved the data.

Of course, this is just a simple example. You can modify the SQL query to retrieve
specific columns or filter the data using conditions. Once you have the data in a
DataFrame, you can use all the powerful data manipulation and analysis functions
provided by `pandas`.
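For instance, `read_sql_query()` also accepts a `params` argument, so a filtered query can be written without building the SQL string by hand. A small sketch, assuming the `employees` table has a `salary` column:

# With pyodbc, placeholders in the SQL use the question-mark style
query = 'SELECT * FROM employees WHERE salary > ?'
df_filtered = pd.read_sql_query(query, cnxn, params=[50000])
print(df_filtered.head())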

Close Connection to SQL Server

After we have executed our queries and retrieved the necessary data from the SQL
Server, it is important to close the connection to the server. This is done using the
`close()` method of the connection object.

Here’s an example:

import pyodbc

# Establishing a connection to SQL Server


connection = pyodbc.connect('Driver={SQL Server};'

'Server=server_name;'
'Database=database_name;'
'Trusted_Connection=yes;')

# Creating a cursor object


cursor = connection.cursor()

# Executing SQL queries


cursor.execute('SELECT * FROM table_name')
data = cursor.fetchall()

# Closing the connection


connection.close()

In the above example, we have established a connection to the SQL Server, created a
cursor object, executed a SQL query, retrieved the data using `fetchall()` method, and
finally closed the connection using the `close()` method.

It is important to close the connection after we are done with our work as it releases any
resources that were being used by our program. Leaving connections open can also
cause issues with other applications trying to access the same server.

Closing the connection to SQL Server should always be done after executing queries
and retrieving data. This ensures that our program runs efficiently and does not cause
any issues for other applications accessing the same server.

Conclusion

Here, we have learned how to use Python to connect to SQL Server. We started by
installing the necessary packages and libraries such as pyodbc and pandas. We then
created a connection string with the required credentials to establish a connection
between our Python code and SQL Server.

After establishing the connection, we executed SQL queries using the execute() method
of the cursor object in pyodbc. We also saw how to retrieve data from the database
using fetchall() and fetchone() methods.

We also explored how to use pandas library to read data from SQL Server into a
pandas DataFrame, which can be further manipulated and analyzed using pandas
functions.

It is important to note that connecting to a database requires proper access credentials and permissions. It is recommended to keep these credentials secure and not hardcode them in your Python code.

Overall, Python provides a powerful and flexible way for data professionals to connect
to SQL Server databases and perform various data manipulation tasks. You should now be able to connect to SQL Server databases in Python and start exploring your data using the power of Python.

82
Testing Linear Regression Assumptions in Python
Checking model assumptions is like commenting code. Everybody should be doing it often, but
it sometimes ends up being overlooked in reality. A failure to do either can result in a lot of time
spent confused, going down rabbit holes, and can have pretty serious consequences from the
model not being interpreted correctly.

Linear regression is a fundamental tool that has distinct advantages over other regression
algorithms. Due to its simplicity, it’s an exceptionally quick algorithm to train, which typically
makes it a good baseline algorithm for common regression scenarios. More importantly, models
trained with linear regression are the most interpretable kind of regression models available -
meaning it’s easier to take action from the results of a linear regression model. However, if the
assumptions are not satisfied, the interpretation of the results will not always be valid. This can
be very dangerous depending on the application.

This post contains code for tests on the assumptions of linear regression and examples with both
a real-world dataset and a toy dataset.

The Data
For our real-world dataset, we’ll use the Boston house prices dataset from the late 1970’s. The
toy dataset will be created using scikit-learn’s make_regression function which creates a dataset
that should perfectly satisfy all of our assumptions.

One thing to note is that I’m assuming outliers have been removed in this blog post. This is an
important part of any exploratory data analysis (which isn’t being performed in this post in order
to keep it short) that should happen in real world scenarios, and outliers in particular will cause
significant issues with linear regression. See Anscombe’s Quartet for examples of outliers
causing issues with fitting linear regression models.

Here are the variable descriptions for the Boston housing dataset straight from the
documentation:

 CRIM: Per capita crime rate by town

 ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.

 INDUS: Proportion of non-retail business acres per town.

 CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

 NOX: Nitric oxides concentration (parts per 10 million)

 RM: Average number of rooms per dwelling

 AGE: Proportion of owner-occupied units built prior to 1940

 DIS: Weighted distances to five Boston employment centers

 RAD: Index of accessibility to radial highways

 TAX: Full-value property-tax rate per $10,000

 PTRATIO: Pupil-teacher ratio by town

 B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

 LSTAT: % lower status of the population

 MEDV: Median value of owner-occupied homes in $1,000’s

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import datasets

%matplotlib inline

"""

Real-world data of Boston housing prices

Additional Documentation:
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Attributes:

data: Features/predictors

label: Target/label/response variable

feature_names: Abbreviations of names of features

"""

boston = datasets.load_boston()
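Note that load_boston() was deprecated and has been removed in recent versions of scikit-learn (1.2 and later). If you are on a newer release, one workaround is to pull an equivalent copy of the data from OpenML; this is only a sketch, assuming the OpenML dataset named "boston" is available:

# Alternative loading for newer scikit-learn versions (assumes the OpenML copy of the dataset)
from sklearn.datasets import fetch_openml
boston_openml = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston_openml.data, boston_openml.target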

"""

Artificial linear data using the same number of features and observations as
the

Boston housing prices dataset for assumption test comparison

"""

linear_X, linear_y = datasets.make_regression(n_samples=boston.data.shape[0],
                                              n_features=boston.data.shape[1],
                                              noise=75, random_state=46)

# Setting feature names to x1, x2, x3, etc. if they are not defined
linear_feature_names = ['X'+str(feature+1) for feature in range(linear_X.shape[1])]

Now that the data is loaded in, let’s preview it:

df = pd.DataFrame(boston.data, columns=boston.feature_names)

df['HousePrice'] = boston.target

df.head()

Initial Setup

Before we test the assumptions, we’ll need to fit our linear regression models. I have a master
function for performing all of the assumption testing at the bottom of this post that does this
automatically, but to abstract the assumption tests out to view them independently we’ll have to
re-write the individual tests to take the trained model as a parameter.

from sklearn.linear_model import LinearRegression

# Fitting the model


boston_model = LinearRegression()
boston_model.fit(boston.data, boston.target)

# Returning the R^2 for the model


boston_r2 = boston_model.score(boston.data, boston.target)
print('R^2: {0}'.format(boston_r2))

R^2: 0.7406077428649428
# Fitting the model
linear_model = LinearRegression()
linear_model.fit(linear_X, linear_y)

# Returning the R^2 for the model


linear_r2 = linear_model.score(linear_X, linear_y)
print('R^2: {0}'.format(linear_r2))
R^2: 0.873743725796525
def calculate_residuals(model, features, label):
"""
Creates predictions on the features with the model and calculates
residuals
"""
predictions = model.predict(features)
df_results = pd.DataFrame({'Actual': label, 'Predicted': predictions})
df_results['Residuals'] = abs(df_results['Actual']) - abs(df_results['Predicted'])

return df_results

We’re all set, so onto the assumption testing!

The Assumptions
I) Linearity Assumption

This assumes that there is a linear relationship between the predictors (e.g. independent variables
or features) and the response variable (e.g. dependent variable or label). This also assumes that
the predictors are additive.

Why it can happen: There may not just be a linear relationship among the data. Modeling is
about trying to estimate a function that explains a process, and linear regression would not be a
fitting estimator (pun intended) if there is no linear relationship.

What it will affect: The predictions will be extremely inaccurate because our model is
underfitting. This is a serious violation that should not be ignored.

How to detect it: If there is only one predictor, this is pretty easy to test with a scatter plot. Most
cases aren’t so simple, so we’ll have to modify this by using a scatter plot to see our predicted
values versus the actual values (in other words, view the residuals). Ideally, the points should lie
on or around a diagonal line on the scatter plot.

How to fix it: Adding polynomial terms to some of the predictors or applying nonlinear transformations may help. If those do not work, try adding additional variables to help capture the relationship between the predictors and the label.

def linear_assumption(model, features, label):
    """
    Linearity: Assumes that there is a linear relationship between the predictors and
               the response variable. If not, either a quadratic term or another
               algorithm should be used.
    """
    print('Assumption 1: Linear Relationship between the Target and the Feature', '\n')
    print('Checking with a scatter plot of actual vs. predicted.',
          'Predictions should follow the diagonal line.')

    # Calculating residuals for the plot
    df_results = calculate_residuals(model, features, label)

    # Plotting the actual vs predicted values
    sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False, size=7)

    # Plotting the diagonal line
    line_coords = np.arange(df_results.min().min(), df_results.max().max())
    plt.plot(line_coords, line_coords,  # X and y points
             color='darkorange', linestyle='--')
    plt.title('Actual vs. Predicted')
    plt.show()

We’ll start with our linear dataset:

linear_assumption(linear_model, linear_X, linear_y)


Assumption 1: Linear Relationship between the Target and the Feature

Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.
We can see a relatively even spread around the diagonal line.

Now, let’s compare it to the Boston dataset:

linear_assumption(boston_model, boston.data, boston.target)


Assumption 1: Linear Relationship between the Target and the Feature

Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.
We can see in this case that there is not a perfect linear relationship. Our predictions are biased
towards lower values in both the lower end (around 5-10) and especially at the higher values
(above 40).

II) Normality of the Error Terms

More specifically, this assumes that the error terms of the model are normally distributed. Linear
regressions other than Ordinary Least Squares (OLS) may also assume normality of the
predictors or the label, but that is not the case here.

Why it can happen: This can actually happen if either the predictors or the label are
significantly non-normal. Other potential reasons could include the linearity assumption being
violated or outliers affecting our model.

What it will affect: A violation of this assumption could cause issues with either shrinking or
inflating our confidence intervals.

How to detect it: There are a variety of ways to do so, but we’ll look at both a histogram and the
p-value from the Anderson-Darling test for normality.

How to fix it: It depends on the root cause, but there are a few options. Nonlinear
transformations of the variables, excluding specific variables (such as long-tailed variables), or
removing outliers may solve this problem.

def normal_errors_assumption(model, features, label, p_value_thresh=0.05):


"""
Normality: Assumes that the error terms are normally distributed. If they
are not,
nonlinear transformations of variables may solve this.

This assumption being violated primarily causes issues with the confidence
intervals
"""
from statsmodels.stats.diagnostic import normal_ad
print('Assumption 2: The error terms are normally distributed', '\n')

# Calculating residuals for the Anderson-Darling test


df_results = calculate_residuals(model, features, label)

print('Using the Anderson-Darling test for normal distribution')

# Performing the test on the residuals


p_value = normal_ad(df_results['Residuals'])[1]
print('p-value from the test - below 0.05 generally means non-normal:',
p_value)

# Reporting the normality of the residuals


if p_value < p_value_thresh:
print('Residuals are not normally distributed')
else:
print('Residuals are normally distributed')

# Plotting the residuals distribution


plt.subplots(figsize=(12, 6))
plt.title('Distribution of Residuals')
sns.distplot(df_results['Residuals'])
plt.show()

print()
if p_value > p_value_thresh:
print('Assumption satisfied')
else:
print('Assumption not satisfied')
print()
print('Confidence intervals will likely be affected')
print('Try performing nonlinear transformations on variables')

As with our previous assumption, we’ll start with the linear dataset:

normal_errors_assumption(linear_model, linear_X, linear_y)


Assumption 2: The error terms are normally distributed

Using the Anderson-Darling test for normal distribution
p-value from the test - below 0.05 generally means non-normal: 0.335066045847
Residuals are normally distributed

Assumption satisfied

Now let’s run the same test on the Boston dataset:

normal_errors_assumption(boston_model, boston.data, boston.target)


Assumption 2: The error terms are normally distributed

Using the Anderson-Darling test for normal distribution


p-value from the test - below 0.05 generally means non-normal: 7.78748286642e-25
Residuals are not normally distributed

Assumption not satisfied

Confidence intervals will likely be affected


Try performing nonlinear transformations on variables

This isn’t ideal, and we can see that our model is biased towards under-estimating.

III) No Multicollinearity among Predictors

This assumes that the predictors used in the regression are not correlated with each other. This
won’t render our model unusable if violated, but it will cause issues with the interpretability of
the model.

Why it can happen: A lot of data is just naturally correlated. For example, if trying to predict a
house price with square footage, the number of bedrooms, and the number of bathrooms, we can
expect to see correlation between those three variables because bedrooms and bathrooms make
up a portion of square footage.

What it will affect: Multicollinearity causes issues with the interpretation of the coefficients.
Specifically, you can interpret a coefficient as “an increase of 1 in this predictor results in a
change of (coefficient) in the response variable, holding all other predictors constant.” This
becomes problematic when multicollinearity is present because we can’t hold correlated
predictors constant. Additionally, it increases the standard error of the coefficients, which results
in them potentially showing as statistically insignificant when they might actually be significant.

How to detect it: There are a few ways, but we will use a heatmap of the correlation as a visual
aid and examine the variance inflation factor (VIF).

How to fix it: This can be fixed by either removing predictors with a high variance inflation factor (VIF) or performing dimensionality reduction.

def multicollinearity_assumption(model, features, label, feature_names=None):
    """
    Multicollinearity: Assumes that predictors are not correlated with each other. If there is
                       correlation among the predictors, then either remove predictors with high
                       Variance Inflation Factor (VIF) values or perform dimensionality reduction

                       This assumption being violated causes issues with interpretability of the
                       coefficients and the standard errors of the coefficients.
    """
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    print('Assumption 3: Little to no multicollinearity among predictors')

    # Plotting the heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(pd.DataFrame(features, columns=feature_names).corr(), annot=True)
    plt.title('Correlation of Variables')
    plt.show()

    print('Variance Inflation Factors (VIF)')
    print('> 10: An indication that multicollinearity may be present')
    print('> 100: Certain multicollinearity among the variables')
    print('-------------------------------------')

    # Gathering the VIF for each variable
    VIF = [variance_inflation_factor(features, i) for i in range(features.shape[1])]
    for idx, vif in enumerate(VIF):
        print('{0}: {1}'.format(feature_names[idx], vif))

    # Gathering and printing total cases of possible or definite multicollinearity
    possible_multicollinearity = sum([1 for vif in VIF if vif > 10])
    definite_multicollinearity = sum([1 for vif in VIF if vif > 100])
    print()
    print('{0} cases of possible multicollinearity'.format(possible_multicollinearity))
    print('{0} cases of definite multicollinearity'.format(definite_multicollinearity))
    print()

    if definite_multicollinearity == 0:
        if possible_multicollinearity == 0:
            print('Assumption satisfied')
        else:
            print('Assumption possibly satisfied')
            print()
            print('Coefficient interpretability may be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')
    else:
        print('Assumption not satisfied')
        print()
        print('Coefficient interpretability will be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')

Starting with the linear dataset:

multicollinearity_assumption(linear_model, linear_X, linear_y,


linear_feature_names)
Assumption 3: Little to no multicollinearity among predictors

Variance Inflation Factors (VIF)


> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables

-------------------------------------
X1: 1.030931170297102
X2: 1.0457176802992108
X3: 1.0418076962011933
X4: 1.0269600632251443
X5: 1.0199882018822783
X6: 1.0404194675991594
X7: 1.0670847781889177
X8: 1.0229686036798158
X9: 1.0292923730360835
X10: 1.0289003332516535
X11: 1.052043220821624
X12: 1.0336719449364813
X13: 1.0140788728975834

0 cases of possible multicollinearity


0 cases of definite multicollinearity

Assumption satisfied

Everything looks peachy keen. Onto the Boston dataset:

multicollinearity_assumption(boston_model, boston.data, boston.target,


boston.feature_names)
Assumption 3: Little to no multicollinearity among predictors

Variance Inflation Factors (VIF)
> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables
-------------------------------------
CRIM: 2.0746257632525675
ZN: 2.8438903527570782
INDUS: 14.484283435031545
CHAS: 1.1528909172683364
NOX: 73.90221170812129
RM: 77.93496867181426
AGE: 21.38677358304778
DIS: 14.699368125642422
RAD: 15.154741587164747
TAX: 61.226929320337554
PTRATIO: 85.0273135204276
B: 20.066007061121244
LSTAT: 11.088865100659874

10 cases of possible multicollinearity


0 cases of definite multicollinearity

Assumption possibly satisfied

Coefficient interpretability may be problematic

Consider removing variables with a high Variance Inflation Factor (VIF)

This isn’t quite as egregious as our normality assumption violation, but there is possible
multicollinearity for most of the variables in this dataset.

IV) No Autocorrelation of the Error Terms

This assumes no autocorrelation of the error terms. Autocorrelation being present typically
indicates that we are missing some information that should be captured by the model.

Why it can happen: In a time series scenario, there could be information about the past that we
aren’t capturing. In a non-time series scenario, our model could be systematically biased by
either under or over predicting in certain conditions. Lastly, this could be a result of a violation
of the linearity assumption.

What it will affect: This will impact our model estimates.

How to detect it: We will perform a Durbin-Watson test to determine if either positive or
negative correlation is present. Alternatively, you could create plots of residual autocorrelations.

How to fix it: A simple fix of adding lag variables can fix this problem. Alternatively,
interaction terms, additional variables, or additional transformations may fix this.

def autocorrelation_assumption(model, features, label):
    """
    Autocorrelation: Assumes that there is no autocorrelation in the residuals. If there is
                     autocorrelation, then there is a pattern that is not explained due to
                     the current value being dependent on the previous value.
                     This may be resolved by adding a lag variable of either the dependent
                     variable or some of the predictors.
    """
    from statsmodels.stats.stattools import durbin_watson
    print('Assumption 4: No Autocorrelation', '\n')

    # Calculating residuals for the Durbin-Watson test
    df_results = calculate_residuals(model, features, label)

    print('\nPerforming Durbin-Watson Test')
    print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
    print('0 to 2< is positive autocorrelation')
    print('>2 to 4 is negative autocorrelation')
    print('-------------------------------------')
    durbinWatson = durbin_watson(df_results['Residuals'])
    print('Durbin-Watson:', durbinWatson)
    if durbinWatson < 1.5:
        print('Signs of positive autocorrelation', '\n')
        print('Assumption not satisfied')
    elif durbinWatson > 2.5:
        print('Signs of negative autocorrelation', '\n')
        print('Assumption not satisfied')
    else:
        print('Little to no autocorrelation', '\n')
        print('Assumption satisfied')

Testing with our ideal dataset:

autocorrelation_assumption(linear_model, linear_X, linear_y)


Assumption 4: No Autocorrelation

Performing Durbin-Watson Test


Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the
data
0 to 2< is positive autocorrelation
>2 to 4 is negative autocorrelation
-------------------------------------
Durbin-Watson: 2.00345051385
Little to no autocorrelation

Assumption satisfied

And with our Boston dataset:

autocorrelation_assumption(boston_model, boston.data, boston.target)


Assumption 4: No Autocorrelation

Performing Durbin-Watson Test


Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the
data
0 to 2< is positive autocorrelation
>2 to 4 is negative autocorrelation
-------------------------------------
Durbin-Watson: 1.0713285604
Signs of positive autocorrelation

Assumption not satisfied

We’re having signs of positive autocorrelation here, but we should expect this since we know our
model is consistently under-predicting and our linearity assumption is being violated. Since this
isn’t a time series dataset, lag variables aren’t possible. Instead, we should look into either
interaction terms or additional transformations.

V) Homoscedasticity/Heteroscedasticity

This assumes homoscedasticity, which is the same variance within our error terms.
Heteroscedasticity, the violation of homoscedasticity, occurs when we don’t have an even
variance across the error terms.

Why it can happen: Our model may be giving too much weight to a subset of the data,
particularly where the error variance was the largest.

What it will affect: Significance tests for coefficients due to the standard errors being biased.
Additionally, the confidence intervals will be either too wide or too narrow.

How to detect it: Plot the residuals and see if the variance appears to be uniform.

How to fix it: Heteroscedasticity (can you tell I like the scedasticity words?) can be solved either
by using weighted least squares regression instead of the standard OLS or transforming either the
dependent or highly skewed variables. Performing a log transformation on the dependent
variable is not a bad place to start.

def homoscedasticity_assumption(model, features, label):


"""
Homoscedasticity: Assumes that the errors exhibit constant variance
"""
print('Assumption 5: Homoscedasticity of Error Terms', '\n')

print('Residuals should have relative constant variance')

# Calculating residuals for the plot


df_results = calculate_residuals(model, features, label)

# Plotting the residuals


plt.subplots(figsize=(12, 6))
ax = plt.subplot(111) # To remove spines
plt.scatter(x=df_results.index, y=df_results.Residuals, alpha=0.5)
plt.plot(np.repeat(0, df_results.index.max()), color='darkorange',
linestyle='--')
ax.spines['right'].set_visible(False) # Removing the right spine
ax.spines['top'].set_visible(False) # Removing the top spine
plt.title('Residuals')
plt.show()

Plotting the residuals of our ideal dataset:

homoscedasticity_assumption(linear_model, linear_X, linear_y)


Assumption 5: Homoscedasticity of Error Terms

Residuals should have relative constant variance

There don’t appear to be any obvious problems with that.

Next, looking at the residuals of the Boston dataset:

homoscedasticity_assumption(boston_model, boston.data, boston.target)


Assumption 5: Homoscedasticity of Error Terms

Residuals should have relative constant variance

We can’t see a fully uniform variance across our residuals, so this is potentially problematic.
However, we know from our other tests that our model has several issues and is under predicting
in many cases.

Conclusion
We can clearly see that a linear regression model on the Boston dataset violates a number of
assumptions which cause significant problems with the interpretation of the model itself. It’s not
uncommon for assumptions to be violated on real-world data, but it’s important to check them so
we can either fix them and/or be aware of the flaws in the model for the presentation of the
results or the decision making process.

It is dangerous to make decisions on a model that has violated assumptions because those
decisions are effectively being formulated on made-up numbers. Not only that, but it also
provides a false sense of security due to trying to be empirical in the decision making process.
Empiricism requires due diligence, which is why these assumptions exist and are stated up front.
Hopefully this code can help ease the due diligence process and make it less painful.

Code for the Master Function


This function performs all of the assumption tests listed in this blog post:

def linear_regression_assumptions(features, label, feature_names=None):


"""

Tests a linear regression on the model to see if assumptions are being met
"""
from sklearn.linear_model import LinearRegression

# Setting feature names to x1, x2, x3, etc. if they are not defined
if feature_names is None:
feature_names = ['X'+str(feature+1) for feature in
range(features.shape[1])]

print('Fitting linear regression')


# Multi-threading if the dataset is a size where doing so is beneficial
if features.shape[0] < 100000:
model = LinearRegression(n_jobs=-1)
else:
model = LinearRegression()

model.fit(features, label)

# Returning linear regression R^2 and coefficients before performing diagnostics
r2 = model.score(features, label)
print()
print('R^2:', r2, '\n')
print('Coefficients')
print('-------------------------------------')
print('Intercept:', model.intercept_)

for feature in range(len(model.coef_)):


print('{0}: {1}'.format(feature_names[feature], model.coef_[feature]))

print('\nPerforming linear regression assumption testing')

# Creating predictions and calculating residuals for assumption tests


predictions = model.predict(features)
df_results = pd.DataFrame({'Actual': label, 'Predicted': predictions})
df_results['Residuals'] = abs(df_results['Actual']) - abs(df_results['Predicted'])

def linear_assumption():
"""
Linearity: Assumes there is a linear relationship between the
predictors and
the response variable. If not, either a polynomial term or
another
algorithm should be used.
"""
print('\n=======================================================================================')
print('Assumption 1: Linear Relationship between the Target and the Features')

print('Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.')

# Plotting the actual vs predicted values

sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False,
size=7)

# Plotting the diagonal line


line_coords = np.arange(df_results.min().min(),
df_results.max().max())
plt.plot(line_coords, line_coords, # X and y points
color='darkorange', linestyle='--')
plt.title('Actual vs. Predicted')
plt.show()
print('If non-linearity is apparent, consider adding a polynomial term')

def normal_errors_assumption(p_value_thresh=0.05):
"""
Normality: Assumes that the error terms are normally distributed. If
they are not,
nonlinear transformations of variables may solve this.

This assumption being violated primarily causes issues with the


confidence intervals
"""
from statsmodels.stats.diagnostic import normal_ad
print('\n=======================================================================================')
print('Assumption 2: The error terms are normally distributed')
print()

print('Using the Anderson-Darling test for normal distribution')

# Performing the test on the residuals


p_value = normal_ad(df_results['Residuals'])[1]
print('p-value from the test - below 0.05 generally means non-normal:', p_value)

# Reporting the normality of the residuals


if p_value < p_value_thresh:
print('Residuals are not normally distributed')
else:
print('Residuals are normally distributed')

# Plotting the residuals distribution


plt.subplots(figsize=(12, 6))
plt.title('Distribution of Residuals')
sns.distplot(df_results['Residuals'])
plt.show()

print()
if p_value > p_value_thresh:
print('Assumption satisfied')
else:
print('Assumption not satisfied')
print()
print('Confidence intervals will likely be affected')
print('Try performing nonlinear transformations on variables')

def multicollinearity_assumption():
"""
Multicollinearity: Assumes that predictors are not correlated with
each other. If there is
correlation among the predictors, then either
remove predictors with high
Variance Inflation Factor (VIF) values or perform
dimensionality reduction

This assumption being violated causes issues with


interpretability of the
coefficients and the standard errors of the
coefficients.
"""
from statsmodels.stats.outliers_influence import variance_inflation_factor
print('\n=======================================================================================')
print('Assumption 3: Little to no multicollinearity among predictors')

# Plotting the heatmap


plt.figure(figsize = (10,8))
sns.heatmap(pd.DataFrame(features, columns=feature_names).corr(),
annot=True)
plt.title('Correlation of Variables')
plt.show()

print('Variance Inflation Factors (VIF)')


print('> 10: An indication that multicollinearity may be present')
print('> 100: Certain multicollinearity among the variables')
print('-------------------------------------')

# Gathering the VIF for each variable


VIF = [variance_inflation_factor(features, i) for i in
range(features.shape[1])]
for idx, vif in enumerate(VIF):
print('{0}: {1}'.format(feature_names[idx], vif))

# Gathering and printing total cases of possible or definite multicollinearity
possible_multicollinearity = sum([1 for vif in VIF if vif > 10])
definite_multicollinearity = sum([1 for vif in VIF if vif > 100])
print()
print('{0} cases of possible multicollinearity'.format(possible_multicollinearity))
print('{0} cases of definite multicollinearity'.format(definite_multicollinearity))
print()

if definite_multicollinearity == 0:
if possible_multicollinearity == 0:
print('Assumption satisfied')
else:
print('Assumption possibly satisfied')

print()
print('Coefficient interpretability may be problematic')
print('Consider removing variables with a high Variance Inflation Factor (VIF)')
else:
print('Assumption not satisfied')
print()
print('Coefficient interpretability will be problematic')
print('Consider removing variables with a high Variance Inflation Factor (VIF)')

def autocorrelation_assumption():
"""
Autocorrelation: Assumes that there is no autocorrelation in the
residuals. If there is
autocorrelation, then there is a pattern that is not
explained due to
the current value being dependent on the previous
value.
This may be resolved by adding a lag variable of
either the dependent
variable or some of the predictors.
"""
from statsmodels.stats.stattools import durbin_watson
print('\n=======================================================================================')
print('Assumption 4: No Autocorrelation')
print('\nPerforming Durbin-Watson Test')
print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
print('0 to 2< is positive autocorrelation')
print('>2 to 4 is negative autocorrelation')
print('-------------------------------------')
durbinWatson = durbin_watson(df_results['Residuals'])
print('Durbin-Watson:', durbinWatson)
if durbinWatson < 1.5:
print('Signs of positive autocorrelation', '\n')
print('Assumption not satisfied', '\n')
print('Consider adding lag variables')
elif durbinWatson > 2.5:
print('Signs of negative autocorrelation', '\n')
print('Assumption not satisfied', '\n')
print('Consider adding lag variables')
else:
print('Little to no autocorrelation', '\n')
print('Assumption satisfied')

def homoscedasticity_assumption():
"""
Homoscedasticity: Assumes that the errors exhibit constant variance
"""
print('\n=======================================================================================')

print('Assumption 5: Homoscedasticity of Error Terms')
print('Residuals should have relative constant variance')

# Plotting the residuals


plt.subplots(figsize=(12, 6))
ax = plt.subplot(111) # To remove spines
plt.scatter(x=df_results.index, y=df_results.Residuals, alpha=0.5)
plt.plot(np.repeat(0, df_results.index.max()), color='darkorange',
linestyle='--')
ax.spines['right'].set_visible(False) # Removing the right spine
ax.spines['top'].set_visible(False) # Removing the top spine
plt.title('Residuals')
plt.show()
print('If heteroscedasticity is apparent, confidence intervals and predictions will be affected')

linear_assumption()
normal_errors_assumption()
multicollinearity_assumption()
autocorrelation_assumption()
homoscedasticity_assumption()

Summary of Assumptions of Linear Regression
1. Linearity – There should be a linear relationship between the dependent and independent variables. This is a very logical and essential assumption of Linear Regression. Visually it can be checked by making a scatter plot between the dependent and independent variable.

2. Homoscedasticity – Constant Error Variance, i.e., the variance of the error term is the same across all values of the independent variable. It can be easily checked by making a scatter plot between the Residuals and the Fitted Values. If there is no trend, then the variance of the error term is constant. (In the snippet below, `result` is assumed to be a DataFrame holding the fitted/expected values and the residuals of an already-fitted model.)

import seaborn as sns


sns.lmplot(x ="expected",
y = "residual", data = result)

A close observation of the above plot shows that the variance of residual term is relatively more
for higher fitted values. Note: In many real-life scenarios, it is practically difficult to ensure all
assumptions of linear regression will hold 100%

3. Normal Error – The error term should be normally distributed. A QQ plot is a good way of checking for normality. If the plot forms a line that is roughly straight, then we can assume there is normality.

import statsmodels.api as sm

sm.qqplot(result["residual"], ylabel = "Residual Quantiles" )

4. No Autocorrelation of residuals – This is typically applicable to time series data. Autocorrelation means the current value Yt is dependent on a historic value Yt-n, with n as the lag period. The Durbin-Watson test is a quick way to find out if there is any autocorrelation.

5. No Perfect Multi-Collinearity – Multi-collinearity is a phenomenon where two or more independent variables are highly correlated. Multi-collinearity is checked by the Variance Inflation Factor (VIF). There should be no variable in the model having a VIF above 2. (…for more details see our blog on Multi-Collinearity)

6. Exogeneity – Exogeneity is a standard assumption of regression, and it means that each X variable does not depend on the dependent variable Y; rather, Y depends on the Xs and on the error (e). In simple terms, X is completely unaffected by Y.

7. Sample Size – In linear regression, it is desirable that the number of records should be at least
10 or more times the number of independent variables to avoid the curse of dimensionality.

Skewness & Kurtosis
What is Skewness and how do we detect it?

If you ask Mother Nature what her favorite probability distribution is, the answer will be ‘Normal’, and the reason behind it is the existence of chance/random causes that influence every known variable on earth. What if a process is under the influence of assignable/significant causes as well? This is surely going to distort the shape of the distribution, and that’s when we need a measure like skewness to capture it. Below is a normal distribution visual, also known as a bell curve. It is a symmetrical graph with all measures of central tendency in the middle.

(Author, 2021)

But what if we encounter an asymmetrical distribution, how do we detect the extent of asymmetry? Let’s see visually what happens to the measures of central tendency when we encounter such graphs.

( Author, 2021)

Notice how these central tendency measures tend to spread when the normal distribution is
distorted. For the nomenclature just follow the direction of the tail — For the left graph since the
tail is to the left, it is left-skewed (negatively skewed) and the right graph has the tail to the right,
so it is right-skewed (positively skewed).

How about deriving a measure that captures the horizontal distance between the Mode and the
Mean of the distribution? It’s intuitive to think that the higher the skewness, the more apart these
measures will be. So let’s jump to the formula for skewness now:
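The standard formulation (Pearson's first coefficient of skewness) is:

Skewness = (Mean - Mode) / Standard Deviation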

Division by the standard deviation enables relative comparison among distributions on the same standard scale. Since calculating the mode as a central tendency for small data sets is not recommended, to arrive at a more robust formula for skewness we will replace the mode with a value derived from the median and the mean.
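The empirical relationship used for this substitution is:

Mode ≈ 3 × Median - 2 × Mean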

*approximately for skewed distributions

Replacing the value of mode in the formula of skewness, we get:
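Skewness ≈ 3 × (Mean - Median) / Standard Deviation

(This is also known as Pearson's second coefficient of skewness.)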

( Author, 2021)

What is Kurtosis and how do we capture it?

Think of punching or pulling the normal distribution curve from the top, what impact will it have
on the shape of the distribution? Let’s visualize:

(Author, 2021)

So there are two things to notice — the peak of the curve and the tails of the curve; the kurtosis measure is responsible for capturing this phenomenon. The formula for kurtosis is complex (it is based on the fourth moment), so we will stick to the concept and its visual intuition. A normal distribution has a kurtosis of 3 and is called mesokurtic. Distributions with kurtosis greater than 3 are called leptokurtic and those with kurtosis less than 3 are called platykurtic, so the greater the value, the more the peakedness. Kurtosis ranges from 1 to infinity. As the kurtosis measure for a normal distribution is 3, we can calculate excess kurtosis by taking the normal distribution as the zero reference. Excess kurtosis then varies from -2 to infinity.

Excess kurtosis for the normal distribution = 3 - 3 = 0

The lowest value of excess kurtosis occurs when kurtosis is 1: 1 - 3 = -2
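In Python, both measures are easy to compute with scipy; a small sketch on simulated data:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.normal(loc=0, scale=1, size=10000)

print("Skewness:", skew(data))                    # close to 0 for normal data
print("Excess kurtosis:", kurtosis(data))         # Fisher definition, close to 0
print("Kurtosis:", kurtosis(data, fisher=False))  # Pearson definition, close to 3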

(Author, 2021)

The topic of kurtosis has been controversial for decades. The basis of kurtosis all these years has been linked with peakedness, but the ultimate verdict is that outliers (fatter tails) govern the kurtosis effect far more than the values near the mean (the peak).

So we can conclude from the above discussion that the horizontal push or pull distortion of a normal distribution curve gets captured by the skewness measure, and the vertical push or pull distortion gets captured by the kurtosis measure. Also, it is the impact of outliers that dominates the kurtosis effect, which has its roots in the fourth-order moment-based formula. I hope this blog helped clarify the idea of Skewness & Kurtosis in a simplified manner; watch out for more similar blogs in the future.

Matplotlib Violin Plot - violinplot() Function


We will cover the Violin Plot and how to create a violin plot using the violinplot() function in the Matplotlib library.

The Violin Plot is used to indicate the probability density of data at different values and it is
quite similar to the Matplotlib Box Plot.

 These plots are mainly a combination of Box Plots and Histograms.


 The violin plot usually portrays the distribution, median, and interquartile range of the data.
 Here, the interquartile range and median are the statistical information provided by the box plot, whereas the distribution is provided by the histogram.
 The violin plots are also used to represent the comparison of a variable distribution
across different "categories"; like the Box plots.
 The Violin plots are more informative as they show the full distribution of the data.

Here is a figure showing common components of the Box Plot and Violin Plot:

Creation of the Violin Plot

The violinplot() method is used for the creation of the violin plot.

The syntax required for the method is as follows:

violinplot(dataset, positions, vert, widths, showmeans, showextrema, showmedians, quantiles, points, bw_method, *, data)

Parameters

The description of the Parameters of this function is as follows:

 dataset

This parameter denotes the array or sequence of vectors. It is the input data.

 positions

This parameter is used to set the positions of the violins. The ticks and limits are set automatically in order to match the positions. It is array-like data with a default of [1, 2, …, n].

 vert

This parameter contains a boolean value. If it is set to True, the method will create a vertical plot; otherwise, it will create a horizontal plot.

 showmeans

This parameter contains a boolean value with false as its default value. If the value of this
parameter is True, then it will toggle the rendering of the means.

 showextrema

This parameter contains a boolean value with True as its default value. It toggles the rendering of the extrema (the minimum and maximum values).

 showmedians

This parameter contains a boolean value with False as its default value. If the value of this parameter is True, then it will toggle the rendering of the medians.

 quantiles

This is an array-like data structure having None as its default value. If the value of this parameter is not None, it should be a list of floats in the interval [0, 1] for each violin, which then stands for the quantiles that will be rendered for that violin.

 points

It is scalar in nature and is used to define the number of points to evaluate each of the Gaussian
kernel density estimations.

 bw_method

This parameter selects the method used to calculate the estimator bandwidth, for which there are many different options. The default rule used is Scott's Rule, but you can choose ‘silverman’, a scalar constant, or a callable.

Now it’s time to dive into some examples in order to clarify the concepts:

Violin Plot Basic Example:

Below we have a simple example where we will create violin plots for a different collection of
data.

import matplotlib.pyplot as plt


import numpy as np

np.random.seed(10)

collectn_1 = np.random.normal(120, 10, 200)
collectn_2 = np.random.normal(150, 30, 200)
collectn_3 = np.random.normal(50, 20, 200)
collectn_4 = np.random.normal(100, 25, 200)

data_to_plot = [collectn_1, collectn_2, collectn_3, collectn_4]

fig = plt.figure()

ax = fig.add_axes([0,0,1,1])

bp = ax.violinplot(data_to_plot)
plt.show()

The output will be as follows:
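To see how the parameters described above change the plot, the same collections can be drawn again with a few options switched on; a short sketch reusing data_to_plot from the example above:

# Re-plotting the same collections with means, medians and quartile markers enabled
fig2, ax2 = plt.subplots()
ax2.violinplot(data_to_plot,
               showmeans=True,
               showmedians=True,
               quantiles=[[0.25, 0.75]] * len(data_to_plot))
ax2.set_title('Violin plot with means, medians and quartiles')
plt.show()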

AI vs ML vs Deep Learning vs Data Science

Artificial Intelligence (AI)


A group of scientists founded the field of artificial intelligence as an academic discipline in 1956. Yes, this is not a new technology; it is a very old concept. But if you look at the history of how people arrived at this concept, who were those people? A handful of scientists from different fields of study thought about creating an artificial brain. It may surprise you to hear that the group of scientists who founded the field of artificial intelligence as an academic discipline in 1956 came from diverse fields such as mathematics, psychology, engineering, economics, and political science.

Sounds interesting, right? I have always believed that the origin of all discovery in technology comes from only one source: philosophy.

About (AI)
The core analogy of this field is to mimic the human brain. In the AI field we write programs that simulate human intelligence in a machine and program it to think and behave like humans. This field involves developing algorithms that analyze data and perform human-like actions, for example understanding natural language like a human, or recognizing images to understand the things inside an image.

If I asked you, from this image, how many objects are there and what are those objects? You could easily identify them with your bare eyes, because you have seen these objects before and know what cats and dogs look like.

But to ask the same question of a machine, you need to feed some intelligence into the machine so that it can identify these objects. Similarly, making a machine talk like a human, and the process involved in doing so, comes under the artificial intelligence domain.

There are other examples, like an application that can answer questions (such as IBM Watson), or a decision-making system that can take marketing budget decisions: where to spend and where not to spend. And the list goes on and on. I will cover this in some other blog where I will only discuss AI.

Machine Learning (ML)


The term machine learning started getting popular in 1959, and all the credit goes to Arthur Samuel. According to him, machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as “the field of study that gives computers the ability to learn without explicitly being programmed.”

The 80s and 90s were the phases when machine learning came into the mainstream, and people started recognizing it as a separate field. In the early days machine learning was focused on solving AI problems, but after 1990 the focus shifted toward statistical models, fuzzy logic, and probability theory.

The difference between AI and ML is frequently misunderstood. People had the mindset that machine learning learns and predicts based on passive learning, or you could say learning from the past history of data, while AI (artificial intelligence) uses an agent that interacts with the environment to learn and takes action to maximize its chance of success. We know this technique as reinforcement learning. I have just introduced some jargon keywords which we will see in another blog.


Saying “ML is a subset of AI” is a very loose description. Machine learning is a subfield of AI through which we can train algorithms to do AI work. In layman’s terms, ML provides methods (algorithms) that make AI systems more powerful, because based on these methods we can put human intelligence into machines. The machine learning technique is the way to create a human-intelligence system, or AI system.

Deep Learning (DL)

History
So, a very quick history: this all started in 1962 when Frank Rosenblatt published “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms” at Cornell University. Then the natural progression happened: people started exploring new ideas around this and began working on other deep learning architectures to solve computer vision problems.

According to Wikipedia, “The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986”, and then in 1989, Yann LeCun et al. applied the standard backpropagation algorithm.

Things kept progressing and new architectures started emerging. The ANN (artificial neural network), introduced in 1943, did not see wide use until around 2000, and the SVM (support vector machine) was the popular choice in the 1990s and 2000s.

The advancement of hardware since 2009 has driven renewed interest in deep learning, as deep learning models can now be trained on Nvidia GPUs. And now, in 2023, we can see a lot of advancement in new deep learning architectures popping up.

About (DL)
To understand deep learning you have to understand neural networks, because DL is just a multi-layer neural network. The term neural network comes from the brain’s neurons. I will cover this in detail in another blog, but the overall concept of a neural network is to create a network of neurons, where a neuron is nothing but a cell that receives an input signal in the form of data (for humans, the eyes see things, and once a human sees something they can easily identify the object) and passes the processed information on to another neuron.

But if I define this in AI terms, DL is a subfield of machine learning through which you can train an AI system.
Deep learning is designed to learn from large amounts of data and extract information from raw data automatically.
The best use of deep learning is when you are working with image data, speech data, and natural language processing.
Data Science (DS)

This topic is one of the buzz terms on the internet, and all companies want to exploit the area. So let’s define this term in an easy way so everyone can understand.
It is an interdisciplinary field that involves various methods, tools, and techniques to analyze and extract knowledge and insight from data.
Data science combines various fields, for example statistics, computer science, domain-specific knowledge, visualization, data engineering techniques, data quality techniques, and data profiling techniques, to analyze and interpret complex data sets.
Data science involves tasks such as data cleaning, data preprocessing, data analysis, data visualization, and data interpretation.
Basically, as a data scientist, you should have all the core knowledge needed to analyze and interpret data, and to create an AI system with the help of machine learning techniques.
Analysis
Now coming back to our original discussion: why do people often get confused by these terms? The simple answer is that they do not make a connection between all these terms. Okay, let me help with this.
You want to create an AI system, so AI is our application layer. Now first figure out what type of method will solve the purpose: can we use a machine learning algorithm, or do we have to use a deep learning technique? Once we finalize that, we need to analyze and interpret the data with data science techniques.
In summary, AI is the broader field of developing intelligent machines, ML is a subset of AI that involves training algorithms to learn from data, DL is a subset of ML that uses ANNs to model complex patterns in data, and DS is an interdisciplinary field that involves extracting insights from data.

As we can see, deep learning is a subfield of machine learning and machine learning is a subfield of AI. On the other hand, data science is an interdisciplinary field that draws on all of these techniques to analyze and extract information from data.
