SOEN 380 (1)
Python is a widely used programming language that offers several unique features and
advantages compared to languages like Java and C++. Python is a general-purpose, dynamically
typed, high-level, interpreted (with an intermediate bytecode compilation step), garbage-collected,
object-oriented programming language that supports procedural, object-oriented, and functional
programming styles.
Python provides many useful features that make it popular and set it apart from other
programming languages. It supports object-oriented and procedural programming approaches
and provides dynamic memory allocation. A few essential features are listed below.
2) Expressive Language
Python can perform complex tasks using a few lines of code. For a simple example, to write the
hello world program you simply type print("Hello World"). It takes only one line, while Java
or C takes multiple lines.
3) Interpreted Language
Python is an interpreted language, which means the Python program is executed one line at a time.
Being interpreted makes debugging easier and the code more portable.
4) Cross-platform Language
Python runs equally well on different platforms such as Windows, Linux, UNIX, and Macintosh.
So, we can say that Python is a portable language. It enables programmers to develop software
for several competing platforms by writing the program only once.
5) Free and Open Source
Python is freely available to everyone on its official website, www.python.org. It has a large
community across the world that is dedicated to making new Python modules and functions.
Anyone can contribute to the Python community. Open source means anyone can download its
source code without paying a penny.
6) Object-Oriented Language
Python supports object-oriented programming, so the concepts of classes and objects come into
play. It supports inheritance, polymorphism, encapsulation, etc. The object-oriented approach
helps the programmer write reusable code and develop applications with less code.
7) Extensible
This means that code written in other languages such as C/C++ can be compiled and then used
from our Python code. Python also compiles programs into bytecode, which can run on any
platform.
8) Large Standard Library
Python provides a vast range of libraries for various fields such as machine learning, web
development, and scripting. There are various machine learning libraries, such as
TensorFlow, Pandas, NumPy, Keras, and PyTorch. Django, Flask, and Pyramid are popular
frameworks for Python web development.
9) GUI Programming Support
A graphical user interface is used for developing desktop applications. PyQt5, Tkinter, and Kivy
are libraries that are used for developing such applications.
10) Integrated
Python can be easily integrated with languages like C, C++, and Java. Python executes code line
by line, which makes it easy to debug.
11) Embeddable
Code from other programming languages can be used within Python source code, and Python
source code can be embedded in programs written in other languages as well.
12) Dynamic Memory Allocation
In Python, we don't need to specify the data type of a variable. When we assign a value to a
variable, memory is automatically allocated for it at run time. Suppose we assign the integer
value 15 to x; then we don't need to write int x = 15. We just write x = 15.
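As a brief illustration of dynamic typing (a minimal sketch; the variable names are arbitrary):
x = 15            # x refers to an int; no type declaration needed
print(type(x))    # <class 'int'>

x = "fifteen"     # the same name can later refer to a str
print(type(x))    # <class 'str'>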
Java vs. Python
Python is an excellent choice for rapid development and scripting tasks, whereas Java
emphasizes a strong type system and object-oriented programming. For example, printing a
message requires a full class with a main method in Java, but only a single print() call in Python.
While both programs give the same output, we can notice the syntax difference in the print
statement.
o In Python, it is easy to learn and write code. While in Java, it requires more code to
perform certain tasks.
o Python is dynamically typed, meaning we do not need to declare the variable type, whereas
Java is statically typed, meaning we need to declare the variable type.
o Python is suitable for various domains such as Data Science, Machine Learning, Web
development, and more. Whereas Java is suitable for web development, mobile app
development (Android), and more.
The Python Software Foundation (PSF) was established in 2001 to promote, protect, and
advance the Python programming language and its community.
o Easy to use and Learn: Python has a simple and easy-to-understand syntax, unlike
traditional languages like C, C++, Java, etc., making it easy for beginners to learn.
o Expressive Language: It allows programmers to express complex concepts in just a few
lines of code or reduces Developer's Time.
o Interpreted Language: Python does not require compilation, allowing rapid
development and testing. It uses Interpreter instead of Compiler.
o Object-Oriented Language: It supports object-oriented programming, making writing
reusable and modular code easy.
o Open-Source Language: Python is open-source and free to use, distribute and modify.
o Extensible: Python can be extended with modules written in C, C++, or other languages.
o Large Standard Library: Python's standard library contains many modules and
functions that can be used for various tasks, such as string manipulation, web
programming, and more.
o GUI Programming Support: Python provides several GUI frameworks, such
as Tkinter and PyQt, allowing developers to create desktop applications easily.
o Integrated: Python can easily integrate with other languages and technologies, such as
C/C++, Java, and .NET.
o Embeddable: Python code can be embedded into other applications as a scripting
language.
o Dynamic Memory Allocation: Python automatically manages memory allocation,
making it easier for developers to write complex programs without worrying about
memory management.
o Wide Range of Libraries and Frameworks: Python has a vast collection of libraries
and frameworks, such as NumPy, Pandas, Django, and Flask, that can be used to solve a
wide range of problems.
o Versatility: Python is a universal language in various domains such as web
development, machine learning, data analysis, scientific computing, and more.
o Large Community: Python has a vast and active community of developers contributing
to its development and offering support. This makes it easy for beginners to get help and
learn from experienced developers.
o Career Opportunities: Python is a highly popular language in the job market. Learning
Python can open up several career opportunities in data science, artificial intelligence,
web development, and more.
o High Demand: With the growing demand for automation and digital transformation, the
need for Python developers is rising. Many industries seek skilled Python developers to
help build their digital infrastructure.
o Increased Productivity: Python has a simple syntax and powerful libraries that can help
developers write code faster and more efficiently. This can increase productivity and save
time for developers and organizations.
o Big Data and Machine Learning: Python has become the go-to language for big data
and machine learning. Python has become popular among data scientists and machine
learning engineers with libraries like NumPy, Pandas, Scikit-learn, TensorFlow, and
more.
o Use of Python in academics: Python is now treated as a core programming language in schools
and colleges due to its countless uses in Artificial Intelligence, Deep Learning, Data Science, etc.
It has become such a fundamental part of the development world that schools and colleges cannot
afford not to teach it. This produces more Python developers and programmers and thus further
expands its growth and popularity.
o Automation: Python can help a lot in automating tasks, as there are lots of tools and modules
available that make things much more comfortable. It is incredible how easily one can reach an
advanced level of automation using basic Python code. Python is also an excellent performance
booster for automating software testing. One will be amazed at how little time and how few
lines of code are required to write automation tools.
Application Areas of Python
Python is a general-purpose, popular programming language, and it is used in almost every
technical field. The various areas of Python use are given below.
o Data Science: Data Science is a vast field, and Python is an important language for this
field because of its simplicity, ease of use, and availability of powerful data analysis and
visualization libraries like NumPy, Pandas, and Matplotlib.
o Desktop Applications: PyQt and Tkinter are useful libraries that can be used to build GUI
(Graphical User Interface) based desktop applications. Other languages may be better suited to
this field, but Python can still be used, alone or alongside them, for building desktop applications.
o Console-based Applications: Python is also commonly used to create command-line or
console-based applications because of its ease of use and support for advanced features
such as input/output redirection and piping.
o Mobile Applications: While Python is not commonly used for creating mobile
applications, it can still be combined with frameworks like Kivy or BeeWare to create
cross-platform mobile applications.
o Software Development: Python is considered one of the best software-making
languages. Python is easily compatible with both from Small Scale to Large Scale
software.
o Artificial Intelligence: AI is an emerging Technology, and Python is a perfect language
for artificial intelligence and machine learning because of the availability of powerful
libraries such as TensorFlow, Keras, and PyTorch.
o Web Applications: Python is commonly used in web development on the backend with
frameworks like Django and Flask, together with front-end tools like JavaScript, HTML,
and CSS.
o Enterprise Applications: Python can be used to develop large-scale enterprise
applications with features such as distributed computing, networking, and parallel
processing.
o 3D CAD Applications: Python can be used for 3D computer-aided design (CAD)
applications through libraries such as Blender.
o Machine Learning: Python is widely used for machine learning due to its simplicity,
ease of use, and availability of powerful machine learning libraries.
o Computer Vision or Image Processing Applications: Python can be used for computer
vision and image processing applications through powerful libraries such as OpenCV and
Scikit-image.
o Speech Recognition: Python can be used for speech recognition applications through
libraries such as SpeechRecognition and PyAudio.
o Scientific computing: Libraries like NumPy, SciPy, and Pandas provide advanced
numerical computing capabilities for tasks like data analysis, machine learning, and
more.
o Education: Python's easy-to-learn syntax and availability of many resources make it an
ideal language for teaching programming to beginners.
o Testing: Python is used for writing automated tests, providing frameworks like unittest
and pytest that help write test cases and generate reports.
o Gaming: Python has libraries like Pygame, which provide a platform for developing
games using Python.
o IoT: Python is used in IoT for developing scripts and applications for devices
like Raspberry Pi, Arduino, and others.
o Networking: Python is used in networking for developing scripts and applications for
network automation, monitoring, and management.
o DevOps: Python is widely used in DevOps for automation and scripting of infrastructure
management, configuration management, and deployment processes.
o Finance: Python has libraries like Pandas, Scikit-learn, and Statsmodels for financial
modeling and analysis.
o Audio and Music: Python has libraries like Pyaudio, which is used for audio processing,
synthesis, and analysis, and Music21, which is used for music analysis and generation.
o Writing scripts: Python is used for writing utility scripts to automate tasks like file
operations, web scraping, and data processing.
Python print() Function
Python print() function is used to display output to the console or terminal. It allows us to display
text, variables and other data in a human-readable format. Its basic form is
print(object(s), sep=' ', end='\n'). The following example uses print() inside an if/else statement:
x = 10
y = 5

if x > y:
    print("x is greater than y")
else:
    print("y is greater than or equal to x")
Python Loops
Sometimes we may need to alter the flow of the program, and the execution of a specific block of
code may need to be repeated several times. For this purpose, programming languages provide
various loops capable of repeating specific code several times.
i = 1
while i < 5:
    print(i, end=" ")
    i += 1
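Python also provides a for loop for iterating over a sequence; a minimal sketch equivalent to the
while loop above:
for i in range(1, 5):
    print(i, end=" ")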
Python Data Structures
Lists
# Create a list
fruits = ['apple', 'banana', 'cherry']
print("fruits[1] =", fruits[1])

# Modify list
fruits.append('orange')
print("fruits =", fruits)

num_list = [1, 2, 3, 4, 5]
# Calculate sum
sum_nums = sum(num_list)
print("sum_nums =", sum_nums)
Output:
fruits[1] = banana
fruits = ['apple', 'banana', 'cherry', 'orange']
sum_nums = 15
Tuples
o Tuples are also ordered collections of data elements of different data types, similar to lists.
o Elements can be accessed using indices.
o Tuples are immutable, meaning they can't be modified once created.
o They are defined using parentheses '()'.
Example:
# Create a tuple
point = (3, 4)
x, y = point
print("(x, y) =", x, y)

# Create another tuple
tuple_ = ('apple', 'banana', 'cherry', 'orange')
print("Tuple =", tuple_)
Output:
(x, y) = 3 4
Tuple = ('apple', 'banana', 'cherry', 'orange')
Sets
o Sets are unordered collections of immutable (hashable) data elements of different data types.
o Sets themselves are mutable: elements can be added or removed.
o Elements can't be accessed using indices.
o Sets do not contain duplicate elements.
o They are defined using curly braces '{}'.
# Create a set
set1 = {1, 2, 2, 1, 3, 4}
print("set1 =", set1)

# Create another set
set2 = {'apple', 'banana', 'cherry', 'apple', 'orange'}
print("set2 =", set2)
Output:
set1 = {1, 2, 3, 4}
set2 = {'apple', 'cherry', 'orange', 'banana'}
Dictionaries
o Dictionaries are collections of key-value pairs that allow you to associate values with unique keys.
o They are defined using curly braces '{}' with key-value pairs separated by colons ':'.
o Dictionaries are mutable.
o Elements can be accessed using keys.
Example:
# Create a dictionary
person = {'name': 'Umesh', 'age': 25, 'city': 'Noida'}
print("person =", person)
print(person['name'])

# Modify Dictionary
person['age'] = 27
print("person =", person)
Output:
person = {'name': 'Umesh', 'age': 25, 'city': 'Noida'}
Umesh
person = {'name': 'Umesh', 'age': 27, 'city': 'Noida'}
Recent versions of Python have introduced features that make functional programming more
concise and expressive. For example, the "walrus operator":= allows for inline variable
assignment in expressions, which can be useful when working with nested function calls or list
comprehensions.
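A minimal sketch of the walrus operator in use (the variable names and data are arbitrary):
numbers = [4, 9, 16]

# Assign and test in one expression
if (count := len(numbers)) > 2:
    print(f"{count} elements")      # 3 elements

# Reuse an intermediate value inside a list comprehension
halves = [half for n in numbers if (half := n / 2) > 3]
print(halves)                        # [4.5, 8.0]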
Python Function
1. Lambda Function - A lambda function is a small, anonymous function that can take
any number of arguments but can only have one expression. Lambda functions are often
used in functional programming to create functions "on the fly" without defining a named
function.
2. Recursive Function - A recursive function is a function that calls itself to solve a
problem. Recursive functions are often used in functional programming to perform
complex computations or to traverse complex data structures.
3. Map Function - The map() function applies a given function to each item of an iterable
and returns a new iterable with the results. The input iterable can be a list, tuple, or other.
4. Filter Function - The filter() function returns an iterator from an iterable for which the
function passed as the first argument returns True. It filters out the items from an iterable
that do not meet the given condition.
5. Reduce Function - The reduce() function applies a function of two arguments
cumulatively to the items of an iterable from left to right to reduce it to a single value.
6. functools Module - The functools module in Python provides higher-order functions that
operate on other functions, such as partial() and reduce().
7. Currying Function - A currying function is a function that takes multiple arguments and
returns a sequence of functions that each take a single argument.
8. Memoization Function - Memoization is a technique used in functional programming to
cache the results of expensive function calls and return the cached Result when the same
inputs occur again.
9. Threading Function - Threading is a technique used in functional programming to run
multiple tasks simultaneously to make the code more efficient and faster.
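A minimal sketch illustrating several of the functions listed above (lambda, map(), filter(), and
reduce()); the data is arbitrary:
from functools import reduce

nums = [1, 2, 3, 4, 5]

square = lambda x: x ** 2                           # lambda function
squares = list(map(square, nums))                   # map: apply to each item
evens = list(filter(lambda x: x % 2 == 0, nums))    # filter: keep matching items
total = reduce(lambda a, b: a + b, nums)            # reduce: fold to a single value

print(squares)   # [1, 4, 9, 16, 25]
print(evens)     # [2, 4]
print(total)     # 15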
Python Modules
Python modules are program files that contain Python code or functions. Python has two
types of modules: user-defined modules and built-in modules. A module that the user defines,
i.e., our own Python code saved with the .py extension, is treated as a user-defined module.
Built-in modules are predefined modules of Python. To use the functionality of the modules, we
need to import them into our current working program.
Python modules are essential to the language's ecosystem since they offer reusable code and
functionality that can be imported into any Python program. Here are a few examples of several
Python modules, along with a brief description of each:
Math: Gives users access to mathematical constants such as pi and to mathematical functions such as the trigonometric functions.
Datetime: Provides classes for a simpler way of manipulating dates, times, and periods.
OS: Enables interaction with the base operating system, including administration of processes
and file system activities.
Random: Offers tools for generating random integers and picking random items from a list.
JSON: JSON is a data structure that can be encoded and decoded and is frequently used in
online APIs and data exchange. This module allows dealing with JSON.
Re: Supports regular expressions, a potent text-search and text-manipulation tool.
Collections: Provides alternative data structures such as sorted dictionaries, default dictionaries,
and named tuples.
NumPy: NumPy is a core toolkit for scientific computing that supports numerical operations on
arrays and matrices.
Pandas: It provides high-level data structures and operations for dealing with time series and
other structured data types.
Requests: Offers a simple user interface for web APIs and performs HTTP requests.
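A minimal sketch showing a few of these modules in use (the values are arbitrary):
import math
import random
import json
from datetime import date

print(math.pi)                        # mathematical constant
print(random.randint(1, 10))          # random integer between 1 and 10
print(date.today())                   # today's date
print(json.dumps({"name": "Alice"}))  # encode a dict as a JSON string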
Python File I/O - Read and Write Files
In Python, the IO module provides methods of three types of IO operations; raw binary files,
buffered binary files, and text files. The canonical way to create a file object is by using the
open() function.
1. Open the file to get the file object using the built-in open() function. There are different access
modes, which you can specify while opening a file using the open() function.
2. Perform read, write, append operations using the file object retrieved from the open()
function.
3. Close and dispose the file object.
Reading File
File object includes the following methods to read data from the file.
read(chars): reads the specified number of characters starting from the current position.
readline(): reads the characters starting from the current reading position up to a newline
character.
readlines(): reads all lines until the end of file and returns a list object.
The following C:\myfile.txt file will be used in all the examples of reading and writing files.
C:\myfile.txt
This is the first line.
This is the second line.
This is the third line.
The following example performs the read operation using the read(chars) method.
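The example itself is not reproduced here; a minimal sketch consistent with the description that
follows (using the same file path) would be:
f = open('C:\myfile.txt')   # default mode is 'r'
print(f.read())             # reads the entire content until EOF as a string
f.close()                   # flushes and closes the stream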
Above, f = open('C:\myfile.txt') opens the myfile.txt in the default read mode from the
current directory and returns a file object. f.read() function reads all the content until EOF as a
string. If you specify the char size argument in the read(chars) method, then it will read that
many chars only. f.close() will flush and close the stream.
Reading a Line
As shown below, we open the file in 'r' mode. The readline() method returns the
first line and then moves the reading position to the second line in the file.
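A minimal sketch of reading line by line with readline(), assuming the same C:\myfile.txt as above:
f = open('C:\myfile.txt', 'r')
print(f.readline())   # This is the first line.
print(f.readline())   # This is the second line.
f.close()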
The file object has an inbuilt iterator. The following program reads the given file line by line
until StopIteration is raised, i.e., the EOF is reached.
Example: Read File using the For Loop
f = open('C:\myfile.txt')
for line in f:
    print(line)
f.close()
Output
This is the first line.
This is the second line.
This is the third line.
Writing to a File
write(s): Write the string s to the stream and return the number of characters written.
writelines(lines): Write a list of lines to the stream. Each line must have a separator at the end of
it.
The following creates a new file if it does not exist or overwrites to an existing file.
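The write example itself is not shown here; a minimal sketch consistent with the read-back below
(the file ends up containing 'Hello') would be:
>>> f = open('C:\myfile.txt','w')
>>> f.write('Hello')
5
>>> f.close()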
# reading file
>>> f = open('C:\myfile.txt','r')
>>> f.read()
'Hello'
>>> f.close()
The following appends the content at the end of the existing file by passing 'a' or 'a+' mode in
the open() method.
>>> f = open('C:\myfile.txt','a')
>>> f.write(" World!")
7
>>> f.close()
# reading file
>>> f = open('C:\myfile.txt','r')
>>> f.read()
'Hello World!'
>>> f.close()
Write Multiple Lines
Python provides the writelines() method to save the contents of a list object in a file. Since
the newline character is not automatically written to the file, it must be provided as a part of the
string.
A file opened with "w" mode or "a" mode can only be written to and cannot be read from.
Similarly, "r" mode allows reading only, not writing. In order to perform simultaneous
read/append operations, use the "a+" mode.
The open() function opens a file in text format by default. To open a file in binary format, add
'b' to the mode parameter. Hence the "rb" mode opens the file in binary format for reading,
while the "wb" mode opens the file in binary format for writing. Unlike text files, binary files are
not human-readable. When opened using any text editor, the data is unrecognizable.
The following code stores a list of numbers in a binary file. The list is first converted in a byte
array before writing. The built-in function bytearray() returns a byte representation of the object.
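The code itself is not reproduced here; a minimal sketch of the idea, assuming a file name of
binfile.bin (the numbers are arbitrary):
numbers = [10, 20, 30, 40, 50]

f = open('binfile.bin', 'wb')      # open in binary write mode
arr = bytearray(numbers)           # byte representation of the list
f.write(arr)
f.close()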
Exception Handling in Python
The cause of an exception is often external to the program itself: for example, incorrect input or
a malfunctioning IO device. Because the program terminates abruptly on encountering an
exception, the termination may damage system resources, such as files. Hence, exceptions should
be properly handled so that an abrupt termination of the program is prevented.
Python uses try and except keywords to handle exceptions. Both keywords are followed by
indented blocks.
Syntax:
try:
    # statements in try block
except:
    # executed when error in try block
The try: block contains one or more statements which are likely to encounter an exception. If the
statements in this block are executed without an exception, the subsequent except: block is
skipped.
If the exception does occur, the program flow is transferred to the except: block. The statements
in the except: block are meant to handle the cause of the exception appropriately. For example,
returning an appropriate error message.
You can specify the type of exception after the except keyword. The subsequent block will be
executed only if the specified exception occurs. There may be multiple except clauses with
different exception types in a single try block. If the type of exception doesn't match any of the
except blocks, it will remain unhandled and the program will terminate.
The rest of the statements after the except block will continue to be executed, regardless if the
exception is encountered or not.
The following example will throw an exception when we try to divide an integer by a string.

try:
    a = 5
    b = '0'
    print(a/b)
except:
    print('Some error occurred.')
Output
Some error occurred.
You can mention a specific type of exception in front of the except keyword. The subsequent
block will be executed only if the specified exception occurs. There may be multiple except
clauses with different exception types in a single try block. If the type of exception doesn't match
any of the except blocks, it will remain unhandled and the program will terminate.
try:
    a = 5
    b = '0'
    print(a + b)
except TypeError:
    print('Unsupported operation')
Output
Unsupported operation
As mentioned above, a single try block may have multiple except blocks. The following example
uses two except blocks to process two different exception types:
try:
    a = 5
    b = 0
    print(a/b)
except TypeError:
    print('Unsupported operation')
except ZeroDivisionError:
    print('Division by zero not allowed')
print('Out of try except blocks')
Output
Division by zero not allowed
Out of try except blocks
A try block can also have optional else and finally clauses.
Syntax:
try:
    # statements in try block
except:
    # executed when error in try block
else:
    # executed if the try block is error-free
finally:
    # executed irrespective of whether an exception occurred or not
The finally block consists of statements which should be processed regardless of an exception
occurring in the try block or not. As a consequence, the error-free try block skips the except
clause and enters the finally block before going on to execute the rest of the code. If, however,
there's an exception in the try block, the appropriate except block will be processed, and the
statements in the finally block will be processed before proceeding to the rest of the code.
The example below accepts two numbers from the user and performs their division. It
demonstrates the use of the else and finally blocks.

try:
    print('try block')
    x = int(input('Enter a number: '))
    y = int(input('Enter another number: '))
    z = x / y
except ZeroDivisionError:
    print('except ZeroDivisionError block')
else:
    print('else block')
    print('Division = ', z)
finally:
    print('finally block')
    x = 0
    y = 0

The first run is a normal case. The output of the else and finally blocks is displayed because the try
block is error-free.
Output
try block
Enter a number: 10
Enter another number: 2
else block
Division =  5.0
finally block
The second run is a case of division by zero, hence, the except block and the finally block are
executed, but the else block is not executed.
Output
try block
Enter a number: 10
Enter another number: 0
except ZeroDivisionError block
finally block
In the third run, an uncaught exception occurs. The finally block is still executed, but the
program then terminates and does not execute the code after the finally block.
Output
try block
Enter a number: 10
finally block
Typically, the finally clause is the ideal place for cleanup operations in a process, for
example closing a file irrespective of errors in read/write operations. This will be dealt with
in the next chapter.
Raise an Exception
Python also provides the raise keyword to be used in the context of exception handling. It
causes an exception to be generated explicitly. Built-in errors are raised implicitly. However, a
built-in or custom exception can be forced during execution.
The following code accepts a number from the user. The try block raises a ValueError exception
if the number is outside the allowed range.
try:
    x = int(input('Enter a number upto 100: '))
    if x > 100:
        raise ValueError(x)
except ValueError:
    print(x, "is out of allowed range")
else:
    print(x, "is within the allowed range")
Output
Enter a number upto 100: 200
200 is out of allowed range
Here, the raised exception is a ValueError type. However, you can define your own custom
exception type to be raised.
Python's file input/output (I/O) system lets programs communicate with files stored on a
disk. Python's built-in methods on the file object let us carry out actions like reading, writing,
and appending data to files.
The open() method in Python creates a file object when working with files. The name of the file
to be opened and the mode in which the file is to be opened are the two parameters required by
this function. The mode is chosen according to the work that needs to be done with the file, such
as "r" for reading, "w" for writing, or "a" for appending.
After successfully creating the object, different methods can be used according to our needs. If we
want to write to the file, we can use the write() function; if we want to add content to the end of an
existing file, we open it in append mode; and in cases where we only want to read the content of
the file, we can use the read() function. Binary files, containing data in a binary rather than a text
format, can also be worked with in Python. Binary files are written in a manner that humans
cannot directly read. The "rb" and "wb" modes read and write binary data in binary files.
Python Exceptions
An exception can be defined as an unusual condition in a program resulting in an interruption in
the flow of the program.
Whenever an exception occurs, the program stops execution, and the remaining code is not
executed. Therefore, an exception is a run-time error that the Python script is unable to handle.
An exception is a Python object that represents an error.
Python Exceptions are an important aspect of error handling in Python programming. When a
program encounters an unexpected situation or error, it may raise an exception, which can
interrupt the normal flow of the program.
In Python, exceptions are represented as objects containing information about the error,
including its type and message. The most common type of Exception in Python is the Exception
class, a base class for all other built-in exceptions.
To handle exceptions in Python, we use the try and except statements. The try statement is used
to enclose the code that may raise an exception, while the except statement is used to define a
block of code that should be executed when an exception occurs.
try:
    x = int(input("Enter a number: "))
    y = 10 / x
    print("Result:", y)
except ZeroDivisionError:
    print("Error: Division by zero")
except ValueError:
    print("Error: Invalid input")
Output:
Enter a number: 0
Error: Division by zero
In this code, we use the try statement to attempt to convert the input to an integer and perform a
division operation. If either of these operations raises an exception, the matching except block is
executed.
Python also provides many built-in exceptions that can be raised in similar situations. Some
common built-in exceptions include IndexError, TypeError, and NameError. Also, we can
define our custom exceptions by creating a new class that inherits from the Exception class.
Python CSV
CSV stands for "comma-separated values", which is defined as a simple file format that uses
specific structuring to arrange tabular data. It stores tabular data such as spreadsheets or
databases in plain text and provides a common format for data interchange. A CSV file opens in
a spreadsheet program such as Excel, and its rows and columns define the standard format.
We can use the csv.reader() function to read a CSV file. This function returns a reader object that
we can use to iterate over the rows in the CSV file. Each row is returned as a list of values, where
each value corresponds to a column in the CSV file.
import csv

with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Here, we open the file data.csv in read mode and create a csv.reader object using
the csv.reader() function. We then iterate over the rows in the CSV file using a for loop and
print each row to the console.
We can use the csv.writer() function to write data to a CSV file. It returns a writer object we
can use to write rows to the CSV file. We can write rows by calling the writerow() method on the
writer object.
import csv

data = [['Name', 'Age', 'Country'],
        ['Alice', '25', 'USA'],
        ['Bob', '30', 'Canada'],
        ['Charlie', '35', 'Australia']
       ]

# newline='' avoids blank rows being inserted on Windows
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for row in data:
        writer.writerow(row)
In this program, we create a list of lists called data, where each inner list represents a row of data.
We then open the file data.csv in write mode and create a csv.writer object using the
csv.writer() function. We then iterate over the rows in data using a for loop and write each row to
the CSV file using the writerow() method.
Python Magic Methods
A Python magic method is a special method that adds "magic" to a class. Magic method names
start and end with double underscores, for example, __init__ or __str__.
The built-in classes define many magic methods. The dir() function can be used to see the
magic methods inherited by a class. Each magic method name has two underscores as both a
prefix and a suffix.
o Python magic methods are also known as dunder methods, short for "double
underscore" methods because their names start and end with a double underscore.
o Magic methods are automatically invoked by the Python interpreter in certain situations,
such as when an object is created, compared to another object, or printed.
o Magic methods can be used to customize the behavior of classes, such as defining how
objects are compared, converted to strings, or accessed as containers.
o Some commonly used magic methods include __init__ for initializing an object, __str__ for
converting an object to a string, __eq__ for comparing two objects for equality,
and __getitem__ and __setitem__ for accessing items in a container object.
For example, the __str__ magic method can define how an object should be represented as a string.
Here's an example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"{self.name} ({self.age})"

person = Person('Vikas', 22)
print(person)
Output:
Vikas (22)
In this example, the __str__ method is defined to return a formatted string representation of the
Person object with the person's name and age.
Another commonly used magic method is __eq__, which defines how objects should be compared for
equality. Here's an example:
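The example itself does not appear here; a minimal sketch, reusing the Person class from the
previous example, might look like this:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Two Person objects are equal when their name and age match
        return self.name == other.name and self.age == other.age

p1 = Person('Vikas', 22)
p2 = Person('Vikas', 22)
print(p1 == p2)   # True: compared by value, not by identity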
o Classes and Objects - Python classes are the blueprints of the Object. An object is a
collection of data and methods that act on the data.
o Inheritance - An inheritance is a technique where one class inherits the properties of
other classes.
o Constructor - Python provides a special method __init__() which is known as a
constructor. This method is automatically called when an object is instantiated.
o Data Member - A variable that holds data associated with a class and its objects.
o Polymorphism - Polymorphism is a concept where an object can take many forms. In
Python, polymorphism can be achieved through method overloading and method
overriding.
o Method Overloading - In Python, method overloading is achieved through default
arguments, where a method can be defined with multiple parameters. The default values
are used if some parameters are not passed while calling the method.
o Method Overriding - Method overriding is a concept where a subclass implements a
method already defined in its superclass.
o Encapsulation - Encapsulation is wrapping data and methods into a single unit. In
Python, encapsulation is achieved through access modifiers, such as public, private, and
protected. However, Python does not strictly enforce access modifiers, and the naming
convention indicates the access level.
o Data Abstraction: A technique to hide the complexity of data and show only essential
features to the user. It provides an interface to interact with the data. Data abstraction
reduces complexity and makes code more modular, allowing developers to focus on the
program's essential features.
o Python OOPs Concepts - In Python, the object-oriented paradigm means designing the program
using classes and objects. An object is related to real-world entities such as a book, house,
pencil, etc., and the class defines its properties and behaviours.
o Python Objects and Classes - In Python, objects are instances of classes, and classes are
blueprints that define the structure and behaviour of data.
o Python Constructor - A constructor is a special method in a class that is used to initialize
the object's attributes when the object is created.
o Python Inheritance - Inheritance is a mechanism in which new class (subclass or child
class) inherits the properties and behaviours of an existing class (super class or parent
class).
o Python Polymorphism - Polymorphism allows objects of different classes to be treated as
objects of a common superclass, enabling different classes to be used interchangeably
through a common interface.
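A minimal sketch tying several of these concepts together (the class names are arbitrary): a
constructor, inheritance, and method overriding.
class Animal:
    def __init__(self, name):        # constructor
        self.name = name             # data member

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):                   # inheritance
    def speak(self):                 # method overriding (polymorphism)
        return f"{self.name} says Woof"

print(Animal("Generic").speak())     # Generic makes a sound
print(Dog("Rex").speak())            # Rex says Woof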
Python Advanced Topics
Python includes many advanced and useful concepts that help the programmer solve complex
tasks. These concepts are given below.
Python Iterator
An iterator is simply an object that can be iterated upon. It returns one Object at a time. It can be
implemented using the two special methods, __iter__() and __next__().
Iterators in Python are objects that allow iteration over a collection of data. They process each
collection element individually without loading the entire collection into memory.
For example, let's create an iterator that returns the squares of numbers up to a given limit:
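The iterator's code is not reproduced here; a minimal sketch consistent with the description and
the output below might look like this:
class Squares:
    def __init__(self, limit):
        self.limit = limit
        self.n = 0

    def __iter__(self):
        return self                  # the iterator object is the object itself

    def __next__(self):
        if self.n > self.limit:
            raise StopIteration      # signals the end of iteration
        result = self.n ** 2
        self.n += 1
        return result

for square in Squares(5):
    print(square)
Output: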
0
1
4
9
16
25
In this example, we have created a class Squares that acts as an iterator by implementing the
__iter__() and __next__() methods. The __iter__() method returns the Object itself, and the
__next__() method returns the next square of the number until the limit is reached.
Python Generators
Python generators are functions that return iterators, producing a sequence of values using a yield
statement rather than a return. A generator suspends the function's execution while keeping its
local state, and picks up right where it left off when it is resumed. Because we don't have to
implement the iterator protocol ourselves, this feature makes writing iterators simpler.
Here is an illustration of a straightforward generator function that produces squares of numbers:
# Generator Function
def square_numbers(n):
    for i in range(n):
        yield i**2

# Create a generator object
generator = square_numbers(5)

# Print the values generated by the generator
for num in generator:
    print(num)
Output:
0
1
4
9
16
Python Decorators
Python Decorators are functions used to modify the behaviour of another function. They allow
adding functionality to an existing function without modifying its code directly. Decorators are
defined using the @ symbol followed by the name of the decorator function. They can be used
for logging, timing, caching, etc.
Here's an example of a decorator function that adds timing functionality to another function:
import time
from math import factorial

# Decorator to calculate time taken by
# the function
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end-start:.5f} seconds to run.")
        return result
    return wrapper

@time_it
def my_function(n):
    time.sleep(2)
    print(f"Factorial of {n} = {factorial(n)}")

my_function(25)
Output:
In the above example, the time_it decorator function takes another function as an argument and
returns a wrapper function. The wrapper function calculates the time to execute the original
function and prints it to the console. The @time_it decorator is used to apply the time_it function
to the my_function function. When my_function is called, the decorator is executed, and the
timing functionality is added.
Introduction to Natural Language
Processing
You can perform text analysis using the Python library called the Natural Language Toolkit
(NLTK). Before proceeding into the concepts of NLTK, let us understand the relation between
text analysis and web scraping.
Analyzing the words in a text can tell us which words are important, which words are unusual,
and how words are grouped. This analysis eases the task of web scraping.
The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for
identifying and tagging parts of speech found in text of a natural language like English.
Installing NLTK
If you are using Anaconda, then NLTK can be installed by using the following conda
command −
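The command itself is missing here; a common form (assuming the anaconda channel) is:
conda install -c anaconda nltk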
After installing NLTK, we have to download preset text repositories. But before downloading
text preset repositories, we need to import NLTK with the help of import command as follows −
import nltk
Now, with the help of following command NLTK data can be downloaded −
nltk.download()
Installation of all available packages of NLTK will take some time, but it is always
recommended to install all the packages.
We also need some other Python packages, such as gensim and pattern, for doing text analysis as
well as for building natural language processing applications using NLTK.
gensim − A robust semantic modeling library which is useful for many applications. It can be
installed by the following command −
pattern − Used to make gensim package work properly. It can be installed by the following
command −
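The installation commands themselves are missing above; assuming pip is available, they would
be along these lines:
pip install gensim
pip install pattern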
Tokenization
The process of breaking the given text into smaller units called tokens is called tokenization.
These tokens can be words, numbers, or punctuation marks. It is also called word segmentation.
Example
NLTK module provides different packages for tokenization. We can use these packages as per
our requirement. Some of the packages are described here −
sent_tokenize package − This package will divide the input text into sentences. You can use the
following command to import this package −
word_tokenize package − This package will divide the input text into words. You can use the
following command to import this package −
WordPunctTokenizer package − This package will divide the input text as well as the
punctuation marks into words. You can use the following command to import this package −
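The import commands are missing above; they would look like this:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer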
In any language, there are different forms of a word. A language includes lots of variations due
to grammatical reasons. For example, consider the words democracy, democratic, and
democratization. For machine learning as well as for web scraping projects, it is important for
machines to understand that these different words have the same base form. Hence, it can be
useful to extract the base forms of the words while analyzing the text.
Stemming
This can be achieved by stemming, which may be defined as the heuristic process of extracting
the base forms of words by chopping off their endings.
The NLTK module provides different packages for stemming. We can use these packages as per our
requirement; they differ in how aggressively they chop word endings.
For example, given the word 'writing' as input, one stemmer outputs 'write', a more aggressive
stemmer outputs the shorter stem 'writ', and a third again outputs 'write'.
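The specific stemmer packages are not named above; a minimal sketch assuming NLTK's
PorterStemmer, LancasterStemmer, and SnowballStemmer (whose outputs match the examples)
would be:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

print(PorterStemmer().stem('writing'))             # write
print(LancasterStemmer().stem('writing'))          # writ
print(SnowballStemmer('english').stem('writing'))  # write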
Lemmatization
Another way to extract the base form of words is lemmatization, which normally aims to remove
inflectional endings by using vocabulary and morphological analysis. The base form of any word
after lemmatization is called its lemma.
WordNetLemmatizer package − It will extract the base form of the word depending upon
whether it is used as a noun or as a verb. You can use the following command to import this
package −
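The import command is missing above; a minimal sketch of the import and a quick usage example
(the words are arbitrary):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('words'))             # word (treated as a noun by default)
print(lemmatizer.lemmatize('writing', pos='v'))  # write (treated as a verb)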
Chunking
Chunking, which means dividing the data into small chunks, is one of the important processes in
natural language processing; it is used to identify parts of speech and short phrases like noun phrases.
Chunking labels groups of tokens. We can get the structure of the sentence with the help
of the chunking process.
Example
In this example, we are going to implement Noun-Phrase chunking by using NLTK Python
module. NP chunking is a category of chunking which will find the noun phrases chunks in the
sentence.
We need to follow the steps given below for implementing noun-phrase chunking −
In the first step we will define the grammar for chunking. It would consist of the rules which we
need to follow.
Now, we will create a chunk parser. It would parse the grammar and give the output.
import nltk
Next, we need to define the sentence as a list of (word, tag) pairs, where DT is the determiner,
VBP the verb, JJ the adjective, IN the preposition, and NN the noun, as shown in the sketch below.
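The sentence definition is missing above; a minimal sketch (the example sentence is an
assumption, since any POS-tagged word list works):
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"),
            ("was", "VBP"), ("jumping", "VBP"), ("high", "JJ")]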
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, the next line of code will define a parser for parsing the grammar.
parser_chunking = nltk.RegexpParser(grammar)
parser_chunking.parse(sentence)
Next, we store the parser's output in a variable.
output = parser_chunking.parse(sentence)
With the help of the following code, we can draw the output in the form of a tree.
output.draw()
Bag of Word (BoW) Model Extracting and converting the Text into
Numeric Form
Bag of Words (BoW) is a useful model in natural language processing, basically used to extract
features from text. After extracting the features from the text, they can be used for modeling in
machine learning algorithms, because raw text cannot be used directly in ML applications.
Initially, the model extracts a vocabulary from all the words in the documents. Later, using a
document-term matrix, it builds a model. In this way, the BoW model represents the document
as a bag of words only, and the order or structure is discarded.
Example
Consider the following two sentences:
This is an example of bag of words model.
We can extract features by using bag of words model.
Considering these two sentences, we have the following 14 distinct words −
this, is, an, example, of, bag, words, model, we, can, extract, features, by, using
Let us look into the following Python script which will build a BoW model in NLTK.
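The script itself is not reproduced here; a minimal sketch that produces the vocabulary shown
below, assuming scikit-learn's CountVectorizer is used to build the BoW model:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is an example of bag of words model.",
             "We can extract features by using bag of words model."]

vectorizer = CountVectorizer()
vectorizer.fit_transform(sentences)   # builds the document-term matrix
print(vectorizer.vocabulary_)         # word -> column index mapping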
Output
{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using':11
}
Topic Modeling: Identifying Patterns in Text Data
Generally, documents are grouped into topics, and topic modeling is a technique to identify the
patterns in a text that correspond to a particular topic. In other words, topic modeling is used to
uncover abstract themes or hidden structure in a given set of documents.
Text Classification
Classification can be improved by topic modeling because it groups similar words together rather
than using each word separately as a feature.
Recommender Systems
Topic modeling can also support recommender systems by suggesting documents that share
similar topics.
Two popular algorithms for topic modeling are:
Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms; it uses
probabilistic graphical models for implementing topic modeling.
Non-Negative Matrix Factorization (NMF) − It is based upon linear algebra.
Sentiment Analysis Using Python
In today’s digital age, platforms like Twitter, Goodreads, and Amazon
overflow with people’s opinions, making it crucial for organizations to
extract insights from this massive volume of data. Sentiment Analysis in
Python offers a powerful solution to this challenge. This technique, a subset
of Natural Language Processing (NLP), involves classifying texts into
sentiments such as positive, negative, or neutral. By employing
various Python libraries and models, analysts can automate this process
efficiently.
Sentiment analysis uses natural language processing (NLP) techniques to analyze and understand
the sentiment expressed in text. A typical workflow looks like this:
Text Preprocessing: Removing irrelevant information, such as special characters, punctuation,
and stopwords, from the text data.
Feature Extraction: Extracting relevant features from the text, such as words, n-grams, or even
parts of speech, to be used in the analysis.
Sentiment Classification: Machine learning algorithms or pre-trained models are
used to classify the sentiment of each text instance. Researchers achieve this
through supervised learning, where they train models on labeled data, or through
pre-trained models that have learned sentiment patterns from large datasets.
Aspect-based sentiment analysis identifies the sentiment expressed towards specific aspects of
the text. For example, in a product review, the sentiment towards different features of the product
may differ. Entity-level sentiment analysis identifies the sentiment expressed towards specific
entities or targets mentioned in the text, such as people, products, or organizations, rather than
towards individual features.
We just saw how sentiment analysis can empower organizations with insights that can help them
make data-driven decisions. Now, let's peep into some more use cases of sentiment analysis.
Social Media Monitoring for Brand Management: Brands can use sentiment analysis to gauge
their brand's public outlook. For example, a company can gather all tweets with the company's
mention or tag and perform sentiment analysis to learn how the public perceives the brand.
Product/Service Analysis: Brands or organizations can perform sentiment analysis on customer
reviews to see how well a product or service is doing in the market and make decisions accordingly.
Stock Price Prediction: Predicting whether the stocks of a company will go up or down is
crucial for investors. One can determine the same by performing sentiment analysis on News
Headlines of articles containing the company’s name. If the news headlines pertaining to a particular
organization happen to have a positive sentiment — its stock prices should go up and vice-versa.
Ways to Perform Sentiment Analysis in Python
Python is one of the most powerful tools when it comes to performing data science tasks, and it
offers several ways to perform sentiment analysis: using Text Blob, using Vader, using Bag of
Words vectorization, and using transformer-based models.
Note: For the purpose of the demonstrations of methods 3 & 4 (using Bag of Words vectorization
and an LSTM), a labelled dataset obtained from Kaggle has been used. It comprises more than
5000 texts labelled as positive, negative or neutral.
Using Text Blob
Text Blob is a Python library for Natural Language Processing. Using Text Blob for
sentiment analysis is quite simple. It takes text as an input and can return polarity and
subjectivity as outputs.
Polarity determines the sentiment of the text. Its values lie in [-1, 1], where -1 denotes a highly
negative sentiment and 1 denotes a highly positive sentiment. Subjectivity determines whether a
text is factual information or a personal opinion. Its value lies in [0, 1], where a value closer to 0
denotes factual information and a value closer to 1 denotes a personal opinion.
Here are the steps to perform sentiment analysis in Python using Text Blob:
Step1: Installation
pip install textblob
Step2: Importing Text Blob
from textblob import TextBlob
Step3: Code Implementation for Sentiment Analysis Using Text Blob
Writing code for sentiment analysis using TextBlob is fairly simple. Just import the
TextBlob object and pass the text to be analyzed with appropriate attributes as
follows:
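The code itself is missing here; a minimal sketch (the example sentences are placeholders, not the
ones used in the original demonstration):
from textblob import TextBlob

text_1 = "The movie was awesome, I loved it."
text_2 = "The food here tastes terrible."

# polarity is in [-1, 1]; subjectivity is in [0, 1]
print("Sentiment of text 1:", TextBlob(text_1).sentiment)
print("Sentiment of text 2:", TextBlob(text_2).sentiment)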
Output
Using VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based
sentiment analyzer that has been trained on social media text. Just like Text Blob,
its usage in Python is pretty simple. We'll see its usage in code implementation with
an example in a while.
Step1: Installation
pip install vaderSentiment
Step2: Importing SentimentIntensityAnalyzer class from Vader
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Step3: Code for Sentiment Analysis Using Vader
After creating a SentimentIntensityAnalyzer object, we need to pass the text to the
polarity_scores() function of the object as follows:
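The code itself is missing here; a minimal sketch (the two example texts are placeholders; the
scores printed below correspond to the texts used in the original demonstration):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."

print("Sentiment of text 1:", analyzer.polarity_scores(text_1))
print("Sentiment of text 2:", analyzer.polarity_scores(text_2))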
Output:
Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}
As we can see, a VaderSentiment object returns a dictionary of sentiment scores for the text to
be analyzed.
Using Bag of Words Vectorization
In the two approaches discussed so far, i.e. Text Blob and Vader, we have simply used Python
libraries to perform sentiment analysis. Now we'll discuss an approach wherein we train our own
model for the task. The steps involved in performing sentiment analysis using the Bag of Words
vectorization method are as follows:
Pre-process the text of the training data.
Create a Bag of Words for the pre-processed text data using a count vectorizer.
Train a suitable model on the resulting features for classification.
For this approach we need a labeled dataset. As stated earlier, the dataset used for this
demonstration has been obtained from Kaggle. We have simply used sklearn's count vectorizer
to create the BOW. After that, we trained a Multinomial Naive Bayes classifier, as shown below.
# 'data' is assumed to be a pandas DataFrame with 'sentences' and 'feedback' columns
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['sentences'])

# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'],
                                                    test_size=0.25, random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
print("Accuracy Score: ", accuracy_score)
Output:
The trained classifier can be used to predict the sentiment of any given text input.
Though we were able to obtain a decent accuracy score with the Bag of Words vectorization
method, it might fail to yield the same results when dealing with larger datasets. This gives rise
to the need to employ deep learning-based models for this task.
For NLP tasks, we generally use RNN-based models since they are designed to deal with
sequential data. Here, we'll train an LSTM (Long Short Term Memory) model using TensorFlow
with Keras. The steps to perform sentiment analysis using an LSTM are as follows:
Pre-process the text of the training data (text pre-processing involves normalization,
tokenization, stopword removal, and lemmatization).
Fit a tokenizer on the entire training text. Text embeddings are generated using
texts_to_sequences(), and the sequences are padded to a uniform length.
The model is built using TensorFlow, including input, LSTM, and dense layers. Dropouts and
hyperparameters are adjusted for accuracy. In the inner layers we use the 'LeakyReLU' activation
function and 'softmax' in the output layer.
Here, we have used the same dataset as in the case of the BOW approach.
# Assumed imports for this snippet (the surrounding script is not shown in full)
from textblob import Word
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

def cleaning(df, stop_words):
    # ... (earlier cleaning steps such as lowercasing and stopword removal are not shown)
    # Lemmatization
    df['sentences'] = df['sentences'].apply(
        lambda x: ' '.join([Word(w).lemmatize() for w in x.split()]))
    return df

stop_words = stopwords.words('english')
data_cleaned = cleaning(data, stop_words)

# Generating Embeddings using tokenizer
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['verified_reviews'].values)
X = tokenizer.texts_to_sequences(data_cleaned['verified_reviews'].values)
X = pad_sequences(X)

# Model Building
model = Sequential()
model.add(Embedding(500, 120, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
# Note: depending on the Keras version, 'LeakyReLU' may need to be written as 'leaky_relu'
model.add(Dense(352, activation='LeakyReLU'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Model Training (X_train, y_train, X_test, y_test come from a train/test split not shown here)
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

# Model Testing
model.evaluate(X_test, y_test)
Using Transformer-Based Models
Transformer-based models are among the most recent NLP techniques; they employ the concept
of self-attention to yield impressive results. Though one can always build a transformer model
from scratch, it is quite a tedious task. Thus, we can use pre-trained transformer models available
on Hugging Face, which offers a large collection of pre-trained models for NLP applications.
You can use these models as they are or fine-tune them for specific tasks.
Step1: Installation
pip install transformers
Step2: Importing the transformers library
import transformers
Step3: Code for Sentiment Analysis Using Transformer based models
To perform any task using transformers, we first need to import the pipeline function from
transformers. Then, an object of the pipeline function is created, and the task to be performed is
passed as an argument. We can also specify the model that we need to use to perform the task;
here, since we have not specified one, the default model for sentiment analysis is used.

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "It was the worst of times."]
sentiment_pipeline(data)
Output
There is no single best library for sentiment analysis in Python; the choice depends on your needs.
NLTK: Powerful and versatile, good for multiple NLP tasks, but complex for sentiment analysis.
TextBlob: Simple and beginner-friendly, good for quick polarity and subjectivity scores.
VADER: Rule-based and tuned for social media text.
Transformers (Hugging Face): State-of-the-art accuracy, but heavier and with a steeper learning curve.
Polyglot: Fast, multilingual support (136+ languages), ideal for multiple languages.
Key Takeaways
Python provides a versatile environment for performing sentiment analysis tasks due to its rich
ecosystem of libraries. We applied these methods to real-world examples like customer reviews
and social media posts.
GUI BUILDING IN PYTHON USING TKINTER
Modern computer applications are user-friendly. User interaction is not restricted to console-
based I/O. Such applications have a more ergonomic graphical user interface (GUI) thanks to
high-speed processors and powerful graphics hardware. These applications can receive inputs
through mouse clicks and can let the user choose from alternatives with the help of radio buttons,
dropdown lists, and other GUI elements (or widgets).
Such applications are developed using one of various graphics libraries available. A graphics
library is a software toolkit having a collection of classes that define a functionality of various
GUI elements. These graphics libraries are generally written in C/C++.
GUI elements and their functionality are defined in the Tkinter module. The following code
demonstrates the steps in creating a UI.
from tkinter import *
window=Tk()
window.title('Hello Python')
window.geometry("300x200+10+20")
window.mainloop()
First of all, import the tkinter module. After importing, set up the application object by calling
the Tk() function. This will create a top-level window (root) having a frame with a title bar, a
control box with the minimize and close buttons, and a client area to hold other widgets. The
geometry() method defines the width, height and coordinates of the top left corner of the frame
as follows (all values are in pixels):
window.geometry("widthxheight+XPOS+YPOS")
The application object then enters an event listening loop by calling the mainloop() method. The
application now constantly waits for any event generated on the elements in it. The event could
be text entered in a text field, a selection made from a dropdown or radio button, single/double
click actions of the mouse, etc. The application's functionality involves executing appropriate
callback functions in response to a particular type of event. The event loop terminates as and
when the close button on the title bar is clicked. The above code will create the following window:
Python-Tkinter Window
All Tkinter widget classes are inherited from the Widget class. Let's add the most commonly
used widgets.
Button
The button can be created using the Button class. The Button class constructor requires a
reference to the main window and option parameters.
Example: Button
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Label
A label can be created in the UI in Python using the Label class. The Label constructor requires
the top-level window object and options parameters. Option parameters are similar to the Button
object.
Example: Label
from tkinter import *
window=Tk()
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
lbl.place(x=60, y=50)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Here, the label's caption will be displayed in red colour using Helvetica font of 16 point size.
Entry
This widget renders a single-line text box for accepting user input. For multi-line text input,
use the Text widget. Apart from the properties already mentioned, the Entry class constructor
accepts a few additional options, such as bd (border size) and show (to mask the typed characters, as in a password field).
The following example creates a window with a button, label and entry field.
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
txtfld=Entry(window, bd=5)
btn.place(x=80, y=100)
lbl.place(x=60, y=50)
txtfld.place(x=80, y=150)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Selection Widgets
Radiobutton: This widget displays a toggle button having an ON/OFF state. There may be more
than one button, but only one of them will be ON at a given time.
Checkbutton: This is also a toggle button. A rectangular check box appears before its caption.
Its ON state is displayed by the tick mark in the box which disappears when it is clicked to OFF.
Combobox: This class is defined in the ttk module of the tkinter package. It populates the drop-down
with data from a collection data type, such as a tuple or a list, passed as the values parameter.
Listbox: Unlike Combobox, this widget displays the entire collection of string items. The user
can select one or multiple items.
The following example demonstrates the window with the selection widgets: Radiobutton,
Checkbutton, Listbox and Combobox:
from tkinter import *
from tkinter.ttk import Combobox
window=Tk()
data=("one", "two", "three", "four")
cb=Combobox(window, values=data)
cb.place(x=60, y=150)
lb=Listbox(window, height=4)
for item in data:
    lb.insert(END, item)
lb.place(x=250, y=150)
v0=IntVar()
v0.set(1)
r1=Radiobutton(window, text="male", variable=v0,value=1)
r2=Radiobutton(window, text="female", variable=v0,value=2)
r1.place(x=100,y=50)
r2.place(x=180, y=50)
v1 = IntVar()
v2 = IntVar()
C1 = Checkbutton(window, text = "Cricket", variable = v1)
C2 = Checkbutton(window, text = "Tennis", variable = v2)
C1.place(x=100, y=100)
C2.place(x=180, y=100)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()
Create UI in Python-Tkinter
Event Handling
An event is a notification received by the application object from various GUI widgets as a result
of user interaction. The Application object is always anticipating events as it runs an event
listening loop. User's actions include mouse button click or double click, keyboard key pressed
while control is inside the text box, certain element gains or goes out of focus etc.
An event is described by a string of the form <modifier-type-detail>; many events are written with
just the type qualifier, which defines the class of the event. For example, <Button-1> denotes a left
mouse click, <Double-Button-1> a double click, and <Return> the Enter key.
An event should be registered with one or more GUI widgets in the application. If it's not, it will
be ignored. In Tkinter, there are two ways to register an event with a widget. First way is by
using the bind() method and the second way is by using the command parameter in the widget
constructor.
Bind() Method
The bind() method associates an event to a callback function so that, when the event occurs, the
function is called.
Syntax:
Widget.bind(event, callback)
For example, to invoke the MyButtonClicked() function on left button click, use the following
code:
from tkinter import *

def MyButtonClicked(event):
    print('Button clicked')

window=Tk()
btn = Button(window, text='OK')
btn.bind('<Button-1>', MyButtonClicked)
btn.place(x=80, y=100)
window.mainloop()
The event object is characterized by many properties such as source widget, position coordinates,
mouse button number and event type. These can be passed to the callback function if required.
Command Parameter
Each widget primarily responds to a particular type of event. For example, a Button is a source of
the button-click event, so it is bound to it by default. Constructor methods of many widget classes have
an optional parameter called command. This command parameter is set to the callback function,
which will be invoked whenever the bound event occurs. This method is more convenient than
the bind() method.
In the example given below, the application window has two text input fields and another one to
display the result. There are two button objects with the captions Add and Subtract. The user is
expected to enter the number in the two Entry widgets. Their addition or subtraction is displayed
in the third.
The first button (Add) is configured using the command parameter. Its value is the add() method
in the class. The second button uses the bind() method to register the left button click with the
sub() method. Both methods read the contents of the text fields by the get() method of the
Entry widget, parse to numbers, perform the addition/subtraction and display the result in third
text field using the insert() method.
Example:
from tkinter import *

class MyWindow:
    def __init__(self, win):
        self.lbl1=Label(win, text='First number')
        self.lbl2=Label(win, text='Second number')
        self.lbl3=Label(win, text='Result')
        self.t1=Entry(bd=3)
        self.t2=Entry()
        self.t3=Entry()
        self.btn1 = Button(win, text='Add')
        self.btn2=Button(win, text='Subtract')
        self.lbl1.place(x=100, y=50)
        self.t1.place(x=200, y=50)
        self.lbl2.place(x=100, y=100)
        self.t2.place(x=200, y=100)
        self.b1=Button(win, text='Add', command=self.add)
        self.b2=Button(win, text='Subtract')
        self.b2.bind('<Button-1>', self.sub)
        self.b1.place(x=100, y=150)
        self.b2.place(x=200, y=150)
        self.lbl3.place(x=100, y=200)
        self.t3.place(x=200, y=200)

    def add(self):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1+num2
        self.t3.insert(END, str(result))

    def sub(self, event):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1-num2
        self.t3.insert(END, str(result))

window=Tk()
mywin=MyWindow(window)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()
UI in Python-Tkinter
CustomTkinter is a Python UI library built on top of Tkinter that provides modern-looking widgets and lets us
make customizations like adding an image, making the edges round, adding borders
around it, etc.
To install the customtkinter module in Python execute the below command in the
terminal:
pip install customtkinter
# Import customtkinter module
import customtkinter as ctk

if __name__ == "__main__":
    app = App()
    # Runs the app
    app.mainloop()
Here we are going to build the form that we designed above. We are going to create
labels using the CTkLabel() function, text fields using the CTkEntry() function, radio
buttons using the CTkRadioButton() function, etc.
All the widgets are created and placed in the following manner:
We create a widget by using the ctk.CTk<widget_name>() (For example,
ctk.CTkLabel() creates a label)
And then we pass the arguments to it depending on the type of the widget.
.grid() is used to specify the position, alignment, padding, and other
dimensions of the widget in our window
Syntax of grid():
grid( grid_options )
Parameters:
row: Specifies the row at which the widget must be placed.
column: Specifies the column at which the widget must be placed.
rowspan: Specifies the height of the widget (number of rows the widget
spans).
columnspan: Specifies the length of the widget (number of columns the
widget spans).
padx, pady: Specifies the padding of the widget along x and y axes
respectively.
sticky: Specifies how the widget elongates with respect to the changes in
its corresponding row and column.
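As a minimal, self-contained sketch of these options (the widget and window names here are only illustrative):

import customtkinter as ctk

app = ctk.CTk()
nameLabel = ctk.CTkLabel(app, text="Name")
# Row 0, column 1, 10px padding on both axes, stretched east-west ("ew") within its cell
nameLabel.grid(row=0, column=1, padx=10, pady=10, sticky="ew")
app.mainloop()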
# App Class
class App(ctk.CTk):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.title("GUI Application")
self.geometry(f"{appWidth}x{appHeight}")
# Name Label
self.nameLabel = ctk.CTkLabel(self,
text="Name")
self.nameLabel.grid(row=0, column=0,
padx=20, pady=20,
sticky="ew")
# Age Label
self.ageLabel = ctk.CTkLabel(self,
text="Age")
self.ageLabel.grid(row=1, column=0,
padx=20, pady=20,
sticky="ew")
# Gender Label
self.genderLabel = ctk.CTkLabel(self,
text="Gender")
self.genderLabel.grid(row=2, column=0,
padx=20, pady=20,
sticky="ew")
not to say")
self.maleRadioButton = ctk.CTkRadioButton(self,
text="Male",
variable=self.genderVar,
value="He is")
self.maleRadioButton.grid(row=2, column=1, padx=20,
pady=20,
sticky="ew")
self.femaleRadioButton = ctk.CTkRadioButton(self,
text="Female",
variable=self.genderVar,
value="She is")
self.femaleRadioButton.grid(row=2, column=2,
padx=20,
pady=20,
sticky="ew")
self.noneRadioButton = ctk.CTkRadioButton(self,
text="Prefer not to say",
variable=self.genderVar,
value="They are")
self.noneRadioButton.grid(row=2, column=3,
padx=20,
pady=20,
sticky="ew")
# Choice Label
self.choiceLabel = ctk.CTkLabel(self,
text="Choice")
self.choiceLabel.grid(row=3, column=0,
padx=20, pady=20,
sticky="ew")
# Choice Check Boxes
self.checkboxVar = tk.StringVar(value="Choice 1")
self.choice1 = ctk.CTkCheckBox(self,
text="choice 1",
variable=self.checkboxVar,
onvalue="choice1",
offvalue="c1")
self.choice1.grid(row=3, column=1, padx=20,
pady=20, sticky="ew")
self.choice2 = ctk.CTkCheckBox(self,
text="choice 2",
variable=self.checkboxVar,
onvalue="choice2",
offvalue="c2")
self.choice2.grid(row=3, column=2, padx=20, pady=20,
sticky="ew")
# Occupation Label
self.occupationLabel = ctk.CTkLabel(self,
text="Occupation")
self.occupationLabel.grid(row=4, column=0,
padx=20,
pady=20,
sticky="ew")
values=["Student",
"
Working Professional"])
self.occupationOptionMenu.grid(row=4, column=1,
padx=20,
pady=20,
columnspan=2, sticky="ew")
# Generate Button
self.generateResultsButton = ctk.CTkButton(self,
text="Generate Results")
self.generateResultsButton.grid(row=5, column=1,
columnspan=2,
padx=20, pady=20,
sticky="ew")
# Text Box
self.displayBox = ctk.CTkTextbox(self, width=200,
height=100)
self.displayBox.grid(row=6, column=0, columnspan=4,
padx=20, pady=20,
sticky="nsew")
if __name__ == "__main__":
app = App()
app.mainloop()
In the previous step we designed our form UI; now we are going to add the functionality
to generate a result and print all the information entered by the user in the result block.
For that, we are going to create the two functions described below.
generateResults(): This function is used to put the text into the textbox.
createText(): Based on the selected preferences and entered text, this
function generates a text variable that sums up all the entries and returns it.
# App Class
class App(ctk.CTk):
# The layout of the window will be written
# in the init function itself
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Name Label
self.nameLabel = ctk.CTkLabel(self,
text="Name")
self.nameLabel.grid(row=0, column=0,
padx=20, pady=20,
sticky="ew")
# Age Label
self.ageLabel = ctk.CTkLabel(self, text="Age")
self.ageLabel.grid(row=1, column=0,
padx=20, pady=20,
sticky="ew")
# Gender Label
self.genderLabel = ctk.CTkLabel(self,
text="Gender")
self.genderLabel.grid(row=2, column=0,
padx=20, pady=20,
sticky="ew")
# Gender Radio Buttons (requires `import tkinter as tk` for StringVar)
self.genderVar = tk.StringVar(value="Prefer not to say")
self.maleRadioButton = ctk.CTkRadioButton(self,
text="Male",
variable=self.genderVar,
value="He is")
self.maleRadioButton.grid(row=2, column=1,
padx=20, pady=20,
sticky="ew")
self.femaleRadioButton = ctk.CTkRadioButton(self,
text="Female",
variable=self.genderVar,
value="She is")
self.femaleRadioButton.grid(row=2, column=2,
padx=20, pady=20,
sticky="ew")
self.noneRadioButton = ctk.CTkRadioButton(self,
text="Prefer not to say",
variable=self.genderVar,
value="They are")
self.noneRadioButton.grid(row=2, column=3, padx=20,
pady=20, sticky="ew")
# Choice Label
self.choiceLabel = ctk.CTkLabel(self,
text="Choice")
self.choiceLabel.grid(row=3, column=0,
padx=20, pady=20,
sticky="ew")
# Choice Check boxes
self.checkboxVar = tk.StringVar(value="Choice 1")
self.choice1 = ctk.CTkCheckBox(self,
text="choice 1",
variable=self.checkboxVar,
onvalue="choice1", offvalue="c1")
self.choice1.grid(row=3, column=1,
padx=20, pady=20,
sticky="ew")
self.choice2 = ctk.CTkCheckBox(self,
text="choice 2",
variable=self.checkboxVar,
onvalue="choice2",
offvalue="c2")
self.choice2.grid(row=3, column=2,
padx=20, pady=20,
sticky="ew")
# Occupation Label
self.occupationLabel = ctk.CTkLabel(self,
text="Occupation")
self.occupationLabel.grid(row=4, column=0,
padx=20, pady=20,
sticky="ew")
# Generate Button
self.generateResultsButton = ctk.CTkButton(self,
text="Generate Results",
command=self.generateResults)
self.generateResultsButton.grid(row=5, column=1,
columnspan=2, padx=20,
pady=20, sticky="ew")
# Text Box
self.displayBox = ctk.CTkTextbox(self,
width=200,
height=100)
self.displayBox.grid(row=6, column=0,
columnspan=4, padx=20,
pady=20, sticky="nsew")
# This function is used to insert the
# details entered by users into the textbox
def generateResults(self):
self.displayBox.delete("0.0", "200.0")
text = self.createText()
self.displayBox.insert("0.0", text)
return text
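    # NOTE: a minimal sketch of the createText() helper used by generateResults() above.
    # Its body is not shown in this excerpt, so this version is an assumption based on the
    # description (it sums up the selected entries) and only uses widgets defined above.
    def createText(self):
        gender = self.genderVar.get()
        choice = self.checkboxVar.get()
        return gender + " interested in " + choice + "."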
if __name__ == "__main__":
app = App()
# Used to run the application
app.mainloop()
Model Building, Identification and Evaluation in
Python
Predictive Model
As the name implies, predictive modeling is used to determine a certain output
using historical data. For example, you can build a model that calculates the
likelihood of developing a disease, such as diabetes, using clinical and personal
data such as:
Age
Gender
Weight
Average glucose level
Daily calories
Another use case for predictive models is forecasting sales. Using time series
analysis, you can collect and analyze a company’s performance to estimate
what kind of growth you can expect in the future.
Essentially, with predictive programming, you collect historical data, analyze it,
and train a model that detects specific patterns so that when it encounters
new data later on, it’s able to predict future results.
There are different predictive models that you can build using different
algorithms. Popular choices include regressions, neural networks, decision
trees, K-means clustering, Naïve Bayes, and others.
Predictive Modelling Applications
There are many ways to apply predictive models in the real world. Most
industries use predictive programming either to detect the cause of a problem
or to improve future results. Applications include but are not limited to:
Fraud detection
Sales forecasting
Natural disaster relief
Business performance growth
Speech recognition
News categorization
Vehicle maintenance
We’ll build a binary logistic regression model step by step to predict floods based on the
monthly rainfall index for each year in Kerala, India.
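Step 1 (importing the libraries) is not included in this excerpt; a minimal sketch of the imports assumed by the steps that follow:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics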
Step 2: Read the Dataset
We use pandas to display the first 5 rows in our dataset:
df= pd.read_csv('kerala.csv')
df.head(5)
info()
The info() function shows us the data type of each column, number of
columns, memory usage, and the number of records in the dataset:
df.info()
shape
The shape attribute returns the number of rows and columns in the dataset as a tuple:
df.shape
describe()
The describe() function shows summary statistics (count, mean, standard deviation, minimum, maximum, and quartiles) for each numeric column:
df.describe()
corr()
The corr() function displays the correlation between different variables in our
dataset:
df.corr()
Use the SelectKBest library to run a chi-squared statistical test and select the
top 3 features that are most related to floods.
Author’s note: In case you want to learn about the math behind feature
selection the 365 Linear Algebra and Feature Selection course is a perfect
start.
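The feature-selection code that produces the fit object used below is not shown in the excerpt; a minimal sketch, assuming the kerala.csv layout places the monthly rainfall readings in the feature columns and the flood label in the last column:

from sklearn.feature_selection import SelectKBest, chi2

X = df.iloc[:, 1:14]   # assumed layout: monthly rainfall columns as features
y = df.iloc[:, -1]     # assumed layout: flood label (0 = no flood, 1 = flood)
best_features = SelectKBest(score_func=chi2, k=3)
fit = best_features.fit(X, y)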
Now we create data frames for the features and the score of each feature:
df_scores= pd.DataFrame(fit.scores_)
df_columns= pd.DataFrame(X.columns)
Finally, we’ll combine all the features and their corresponding scores in one
data frame:
features_scores= pd.concat([df_columns, df_scores], axis=1)
features_scores.columns= ['Features', 'Score']
features_scores.sort_values(by = 'Score')
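The model-training step itself is not shown in the excerpt; a minimal sketch, assuming X holds the features and y the flood label (the split proportions are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=100)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)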
Finally, we predict the likelihood of a flood using the logistic regression model
we created:
y_pred=logreg.predict(X_test)
print (X_test) #test dataset
print (y_pred) #predicted values
Classification Report
ROC Curve
From the ROC curve, we can calculate the area under the curve (AUC) whose
value ranges from 0 to 1. You’ll remember that the closer to 1, the better it is
for our predictive modeling.
The AUC is 0.94, meaning that the model did a great job:
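The code that produced the classification report and ROC curve is not included in the excerpt; a minimal sketch using scikit-learn's metrics module, assuming y_test and y_pred come from the steps above:

from sklearn import metrics

print(metrics.classification_report(y_test, y_pred))

# ROC curve and area under the curve (AUC)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
print("AUC:", metrics.roc_auc_score(y_test, y_pred_proba))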
Boosting is a method of converting weak learners into strong learners by training many models
in a gradual, additive and sequential manner, minimizing a loss function (e.g. squared error
for regression problems) in the final model.
Gradient Boosting Regression (GBR) often achieves better accuracy than other regression models because of its
boosting technique, and it is one of the most widely used regression algorithms in competitions.
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
criterion='friedman_mse',init=None, learning_rate=0.1, loss='ls',
max_depth=6,max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None,
presort='deprecated',random_state=None, subsample=1.0,
tol=0.0001,validation_fraction=0.1, verbose=0, warm_start=False)
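The fitting code is not shown in the excerpt; a minimal sketch of training a GradientBoostingRegressor and computing the error metrics reported below, assuming X_train, X_test, y_train and y_test come from an earlier train/test split:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

gbr = GradientBoostingRegressor(learning_rate=0.1, max_depth=6, n_estimators=100)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R score:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))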
Mean Squared Error: 1388.8979420780786
R score: 0.9579626971080454
Mean Absolute Error: 23.81293483364058
The Gradient Boosting Regressor gives us the best R² value, 0.957. However, this model is
very difficult to interpret.
Ensemble models definitely fall into the category of “Black Box” models since they are
composed of many potentially complex individual models.
Each tree is trained sequentially on bagged data using a random selection of features, so
gaining a full understanding of the decision process by examining each individual tree is
infeasible.
Both the KMO and Bartlett’s test of sphericity are commonly used to verify the feasibility of the
data for Exploratory Factor Analysis (EFA).
The Kaiser-Meyer-Olkin (KMO) test measures sampling adequacy as the proportion of variance in
the items that may be common variance. Values between .80 and 1.00 indicate sampling
adequacy (Cerny & Kaiser, 1977).
Bartlett’s test of sphericity examines whether a correlation matrix is significantly
different from the identity matrix, in which diagonal elements are unities and all
off-diagonal elements are zeros (Bartlett, 1950). Significant results indicate that the
variables in the correlation matrix are suitable for factor analysis.
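Both checks can be run in Python with the factor_analyzer package (a sketch, assuming a DataFrame df of item responses; install with pip install factor_analyzer):

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's test of sphericity: a significant p-value (< .05) supports factorability
chi_square_value, p_value = calculate_bartlett_sphericity(df)
print("Bartlett chi-square:", chi_square_value, "p-value:", p_value)

# KMO: the overall measure should ideally be .80 or above
kmo_per_item, kmo_total = calculate_kmo(df)
print("Overall KMO:", kmo_total)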
The logic behind absolute fit indices is essentially to test how well the model specified
by the researcher reproduces the observed data. Commonly used absolute fit statistics
include the χ², which measures the discrepancy between the observed and the implied
covariance matrices.
The χ² fit statistic is very popular and frequently reported in both CFA and SEM studies.
However, it is notoriously sensitive to large sample sizes and increased model complexity
(i.e. models with a large number of indicators and degrees of freedom). Therefore, the
current practice is to report it mostly for historical reasons, and it is rarely used to make
decisions about the adequacy of model fit.
The RMSEA
The SRMR
The Standardized Root Mean Square Residual (SRMR) is the square root of the difference
between the residuals of the sample covariance matrix and the hypothesized covariance
model.
As the SRMR is standardized, its values range between 0 and 1. Commonly, models with values
below the .05 threshold are considered to indicate good fit (Byrne, 1998), and values up to .08
are still generally regarded as acceptable.
For the CFI and TLI, values above .95 are indicative of good fit (Hu & Bentler, 1999); in
practice, CFI and TLI values from .90 to .95 are often treated as acceptable.
Note that the TLI is non-normed, so its values can go above 1.00.
Note:
Further to the aforementioned information, Hoyle (2012) provides an excellent succinct
summary of numerous fit indices. This table includes, for example, information on the indices'
theoretical range, sensitivity to varying sample size and model complexity. Note that, in contrast
to the indices introduced above, a great number of other indices exist, as illustrated in Hoyle's
table. Yet, the frequency of their use is decreasing for various reasons; for example, the RMR is
non-normed and thus hard to interpret. These indices are mentioned here simply for general
awareness, i.e. the fact that they exist, who developed them and what their statistical
properties are.
MS. SQL Server Database Development and Connectivity in Python
Python is a powerful programming language that is widely used in various industries,
including data science, web development, and automation. One of the key strengths of
Python is its ability to connect to various databases, including SQL Server. Here, we will
explore how to connect to SQL Server in Python.
Before we dive into the code, let’s briefly discuss what SQL Server is and why it is
important. SQL Server is a relational database management system (RDBMS)
developed by Microsoft that stores and retrieves data for various applications. It is
widely used in enterprise environments due to its scalability, security features, and
robustness.
Python provides several libraries for connecting to SQL Server, including pyodbc and
pymssql. These libraries allow Python developers to interact with SQL Server
databases by sending queries and receiving results.
To connect to SQL Server using Python, we need to use a module called pyodbc. This
module provides an interface between Python and Microsoft SQL Server, allowing us to
execute SQL statements and retrieve data from the database.
To import the pyodbc module, we first need to install it. We can do this using pip, which
is a package manager for Python. Open your command prompt or terminal and run the
following command:
pip install pyodbc
Once you have installed pyodbc, you can import it into your Python script using the
following code:
import pyodbc
This will make all of the functions and classes provided by the pyodbc module available
in your script.
Now that we have installed the necessary libraries and have the server credentials, we
can establish a connection to our SQL Server using Python. We will be using the
pyodbc library to connect to our SQL Server.
import pyodbc
server = 'your_server_name'
database = 'your_database_name'
username = 'your_username'
password = 'your_password'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};\
SERVER='+server+';\
DATABASE='+database+';\
UID='+username+';\
PWD='+ password)
cursor = cnxn.cursor()
In this example, we imported the `pyodbc` library and defined our server name,
database name, username, and password. Then, we used the `connect()` method from
`pyodbc` to establish a connection to the SQL Server by passing in the necessary
parameters.
Note that the specific driver you use may differ depending on your system configuration.
You can find out which driver you need by checking your ODBC Data Source
Administrator.
After establishing a connection to the SQL Server in Python, the next step is to create a
cursor object. A cursor object allows you to execute SQL queries against the database
and retrieve data.
To create a cursor object, you can use the `cursor()` method of the connection object
(cnxn in our example). Here’s an example:
cursor = cnxn.cursor()
Once you have a cursor object, you can use it to execute SQL queries by calling its `execute()`
method. The `execute()` method takes an SQL query as an argument and executes it against
the database. Here’s an example:
query = "SELECT * FROM employees"
cursor.execute(query)
In this example, we are executing a simple SELECT query that retrieves all rows from
the `employees` table. Note that we are passing the query as a string to the `execute()`
method.
After executing a query, you can retrieve the results using one of the fetch methods of
the cursor object. The most common fetch methods are `fetchone()`, which retrieves
one row at a time, and `fetchall()`, which retrieves all rows at once. Here’s an example:
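The example itself is missing from the excerpt; a minimal sketch matching the description below, assuming the cursor has just executed the SELECT query above:

# Fetch and print one row at a time until the result set is exhausted
row = cursor.fetchone()
while row:
    print(row)
    row = cursor.fetchone()

# Re-run the query, then fetch all rows at once
cursor.execute(query)
rows = cursor.fetchall()
for row in rows:
    print(row)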
In this example, we first use a while loop and the `fetchone()` method to retrieve one
row at a time and print it. We keep looping until there are no more rows to fetch. Then,
we use the `fetchall()` method to retrieve all rows at once and print them using a for
loop.
It’s important to note that after executing a query, you should always close the cursor
object using its `close()` method:
cursor.close()
Closing the cursor releases any resources that it was holding, such as locks on the database. It
also frees up memory on the client side.
Now that we have successfully connected to the SQL Server database, we can retrieve
data from it using Python. There are several ways to retrieve data from SQL Server in
Python, but we will be using the `pandas` library in this demonstration.
`pandas` is a popular data manipulation library that provides data structures for
efficiently storing and analyzing large datasets. It also has built-in functions for reading
and writing data to various file formats, including SQL databases.
To retrieve data from SQL Server using `pandas`, we first need to write a SQL query
that specifies which data we want to extract. We can then use the `read_sql_query()`
function from `pandas` to execute the query and store the results in a `DataFrame`.
Here’s an example of how to retrieve all records from a table called `employees` in our
SQL Server database:
import pandas as pd
import pyodbc
# Set up connection
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;DATABASE=mydatabase;UID=username;PWD=password')
# Define the SQL query described below
query = "SELECT * FROM employees"
# Execute the query and load the results into a DataFrame
df = pd.read_sql_query(query, cnxn)
# Print the first few rows to verify the data
print(df.head())
In this example, we first import the necessary libraries (`pandas` and `pyodbc`) and set
up the database connection using the same parameters as in Step 4.
Next, we define our SQL query as a string variable called `query`. This query simply
selects all columns (`*`) from the `employees` table.
We then use the `pd.read_sql_query()` function to execute the query and store the
results in a DataFrame called `df`. This function takes two arguments: the SQL query
and the database connection object (`cnxn`).
Finally, we print out the first few rows of the DataFrame using the `head()` function to
verify that we have successfully retrieved the data.
Of course, this is just a simple example. You can modify the SQL query to retrieve
specific columns or filter the data using conditions. Once you have the data in a
DataFrame, you can use all the powerful data manipulation and analysis functions
provided by `pandas`.
After we have executed our queries and retrieved the necessary data from the SQL
Server, it is important to close the connection to the server. This is done using the
`close()` method of the connection object.
Here’s an example:
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};'
'Server=server_name;'
'Database=database_name;'
'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute("SELECT * FROM employees")
rows = cursor.fetchall()
conn.close()
In the above example, we have established a connection to the SQL Server, created a
cursor object, executed a SQL query, retrieved the data using `fetchall()` method, and
finally closed the connection using the `close()` method.
It is important to close the connection after we are done with our work as it releases any
resources that were being used by our program. Leaving connections open can also
cause issues with other applications trying to access the same server.
Closing the connection to SQL Server should always be done after executing queries
and retrieving data. This ensures that our program runs efficiently and does not cause
any issues for other applications accessing the same server.
Conclusion
Here, we have learned how to use Python to connect to SQL Server. We started by
installing the necessary packages and libraries such as pyodbc and pandas. We then
created a connection string with the required credentials to establish a connection
between our Python code and SQL Server.
After establishing the connection, we executed SQL queries using the execute() method
of the cursor object in pyodbc. We also saw how to retrieve data from the database
using fetchall() and fetchone() methods.
We also explored how to use pandas library to read data from SQL Server into a
pandas DataFrame, which can be further manipulated and analyzed using pandas
functions.
Overall, Python provides a powerful and flexible way for data professionals to connect
to SQL Server databases and perform various data manipulation tasks. Here, you
should now be able to connect to SQL Server databases in Python and start exploring
your data using the power of Python.
Testing Linear Regression Assumptions in Python
Checking model assumptions is like commenting code. Everybody should be doing it often, but
it sometimes ends up being overlooked in reality. A failure to do either can result in a lot of time
spent confused, going down rabbit holes, and can have pretty serious consequences from the
model not being interpreted correctly.
Linear regression is a fundamental tool that has distinct advantages over other regression
algorithms. Due to its simplicity, it’s an exceptionally quick algorithm to train, thus typically
makes it a good baseline algorithm for common regression scenarios. More importantly, models
trained with linear regression are the most interpretable kind of regression models available -
meaning it’s easier to take action from the results of a linear regression model. However, if the
assumptions are not satisfied, the interpretation of the results will not always be valid. This can
be very dangerous depending on the application.
This post contains code for tests on the assumptions of linear regression and examples with both
a real-world dataset and a toy dataset.
The Data
For our real-world dataset, we’ll use the Boston house prices dataset from the late 1970’s. The
toy dataset will be created using scikit-learn’s make_regression function which creates a dataset
that should perfectly satisfy all of our assumptions.
One thing to note is that I’m assuming outliers have been removed in this blog post. This is an
important part of any exploratory data analysis (which isn’t being performed in this post in order
to keep it short) that should happen in real world scenarios, and outliers in particular will cause
significant issues with linear regression. See Anscombe’s Quartet for examples of outliers
causing issues with fitting linear regression models.
Here are the variable descriptions for the Boston housing dataset straight from the
documentation:
ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
RM: Average number of rooms per dwelling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
%matplotlib inline

"""
Boston house prices dataset.
Additional Documentation:
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
Attributes:
data: Features/predictors
target: House prices
"""
boston = datasets.load_boston()
"""
Artificial linear data using the same number of features and observations as
the Boston housing dataset.
"""
linear_X, linear_y = datasets.make_regression(n_samples=boston.data.shape[0],
                                              n_features=boston.data.shape[1],
                                              noise=75, random_state=46)

# Setting feature names to x1, x2, x3, etc. since they are not defined
linear_feature_names = ['X'+str(feature+1) for feature in range(linear_X.shape[1])]

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['HousePrice'] = boston.target
df.head()
Initial Setup
Before we test the assumptions, we’ll need to fit our linear regression models. I have a master
function for performing all of the assumption testing at the bottom of this post that does this
automatically, but to abstract the assumption tests out to view them independently we’ll have to
re-write the individual tests to take the trained model as a parameter.
from sklearn.linear_model import LinearRegression

# Fitting the model on the Boston data
boston_model = LinearRegression()
boston_model.fit(boston.data, boston.target)
print('R^2:', boston_model.score(boston.data, boston.target))
# Output -> R^2: 0.7406077428649428

# Fitting the model on the artificial linear data
linear_model = LinearRegression()
linear_model.fit(linear_X, linear_y)
The Assumptions
I) Linearity Assumption
This assumes that there is a linear relationship between the predictors (e.g. independent variables
or features) and the response variable (e.g. dependent variable or label). This also assumes that
the predictors are additive.
Why it can happen: There may not just be a linear relationship among the data. Modeling is
about trying to estimate a function that explains a process, and linear regression would not be a
fitting estimator (pun intended) if there is no linear relationship.
What it will affect: The predictions will be extremely inaccurate because our model is
underfitting. This is a serious violation that should not be ignored.
How to detect it: If there is only one predictor, this is pretty easy to test with a scatter plot. Most
cases aren’t so simple, so we’ll have to modify this by using a scatter plot to see our predicted
values versus the actual values (in other words, view the residuals). Ideally, the points should lie
on or around a diagonal line on the scatter plot.
How to fix it: Either add polynomial terms to some of the predictors or apply nonlinear
transformations. If those do not work, try adding additional variables to help capture the
relationship between the predictors and the label.
"""
Linearity: Assumes that there is a linear relationship between the predictors and
the response variable. If not, either a quadratic term or another algorithm should be used.
"""
print('Assumption 1: Linear Relationship between the Target and the Feature', '\n')
We can see a relatively even spread around the diagonal line.
We can see in this case that there is not a perfect linear relationship. Our predictions are biased
towards lower values in both the lower end (around 5-10) and especially at the higher values
(above 40).
II) Normality of the Error Terms
More specifically, this assumes that the error terms of the model are normally distributed. Linear
regressions other than Ordinary Least Squares (OLS) may also assume normality of the
predictors or the label, but that is not the case here.
Why it can happen: This can actually happen if either the predictors or the label are
significantly non-normal. Other potential reasons could include the linearity assumption being
violated or outliers affecting our model.
What it will affect: A violation of this assumption could cause issues with either shrinking or
inflating our confidence intervals.
How to detect it: There are a variety of ways to do so, but we’ll look at both a histogram and the
p-value from the Anderson-Darling test for normality.
How to fix it: It depends on the root cause, but there are a few options. Nonlinear
transformations of the variables, excluding specific variables (such as long-tailed variables), or
removing outliers may solve this problem.
"""
Normality: Assumes that the error terms are normally distributed.
A violation primarily causes issues with the confidence intervals.
"""
from statsmodels.stats.diagnostic import normal_ad

p_value_thresh = 0.05
print('Assumption 2: The error terms are normally distributed', '\n')

# Anderson-Darling test on the residuals; normal_ad returns (statistic, p-value).
# df_results is assumed to hold the model's 'Residuals' (actual minus predicted values).
p_value = normal_ad(df_results['Residuals'])[1]
print('p-value from the test - below 0.05 generally means non-normal:', p_value)
print()
if p_value > p_value_thresh:
    print('Assumption satisfied')
else:
    print('Assumption not satisfied')
    print()
    print('Confidence intervals will likely be affected')
    print('Try performing nonlinear transformations on variables')
As with our previous assumption, we’ll start with the linear dataset:
Using the Anderson-Darling test for normal distribution
p-value from the test - below 0.05 generally means non-normal: 0.335066045847
Residuals are normally distributed
Assumption satisfied
Assumption not satisfied
This isn’t ideal, and we can see that our model is biased towards under-estimating.
III) No Multicollinearity Among the Predictors
This assumes that the predictors used in the regression are not correlated with each other. A
violation won’t render our model unusable, but it will cause issues with the interpretability of
the model.
Why it can happen: A lot of data is just naturally correlated. For example, if trying to predict a
house price with square footage, the number of bedrooms, and the number of bathrooms, we can
expect to see correlation between those three variables because bedrooms and bathrooms make
up a portion of square footage.
What it will affect: Multicollinearity causes issues with the interpretation of the coefficients.
Specifically, you can interpret a coefficient as “an increase of 1 in this predictor results in a
change of (coefficient) in the response variable, holding all other predictors constant.” This
becomes problematic when multicollinearity is present because we can’t hold correlated
predictors constant. Additionally, it increases the standard error of the coefficients, which results
in them potentially showing as statistically insignificant when they might actually be significant.
How to detect it: There are a few ways, but we will use a heatmap of the correlation as a visual
aid and examine the variance inflation factor (VIF).
How to fix it: This can be fixed by either removing predictors with a high variance inflation
factor (VIF) or performing dimensionality reduction.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate a VIF for each feature; features is the predictor matrix and
# feature_names the matching column names used for this model
VIF = [variance_inflation_factor(features, i) for i in range(features.shape[1])]
for idx, vif in enumerate(VIF):
    print(feature_names[idx], ':', vif)

# Count how many features show possible (>10) or definite (>100) multicollinearity
possible_multicollinearity = sum(1 for vif in VIF if vif > 10)
definite_multicollinearity = sum(1 for vif in VIF if vif > 100)

if definite_multicollinearity == 0:
    if possible_multicollinearity == 0:
        print('Assumption satisfied')
    else:
        print('Assumption possibly satisfied')
        print()
        print('Coefficient interpretability may be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')
else:
    print('Assumption not satisfied')
    print()
    print('Coefficient interpretability will be problematic')
    print('Consider removing variables with a high Variance Inflation Factor (VIF)')
-------------------------------------
X1: 1.030931170297102
X2: 1.0457176802992108
X3: 1.0418076962011933
X4: 1.0269600632251443
X5: 1.0199882018822783
X6: 1.0404194675991594
X7: 1.0670847781889177
X8: 1.0229686036798158
X9: 1.0292923730360835
X10: 1.0289003332516535
X11: 1.052043220821624
X12: 1.0336719449364813
X13: 1.0140788728975834
Assumption satisfied
Variance Inflation Factors (VIF)
> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables
-------------------------------------
CRIM: 2.0746257632525675
ZN: 2.8438903527570782
INDUS: 14.484283435031545
CHAS: 1.1528909172683364
NOX: 73.90221170812129
RM: 77.93496867181426
AGE: 21.38677358304778
DIS: 14.699368125642422
RAD: 15.154741587164747
TAX: 61.226929320337554
PTRATIO: 85.0273135204276
B: 20.066007061121244
LSTAT: 11.088865100659874
Consider removing variables with a high Variance Inflation Factor (VIF)
This isn’t quite as egregious as our normality assumption violation, but there is possible
multicollinearity for most of the variables in this dataset.
IV) No Autocorrelation of the Error Terms
This assumes no autocorrelation of the error terms. Autocorrelation being present typically
indicates that we are missing some information that should be captured by the model.
Why it can happen: In a time series scenario, there could be information about the past that we
aren’t capturing. In a non-time series scenario, our model could be systematically biased by
either under or over predicting in certain conditions. Lastly, this could be a result of a violation
of the linearity assumption.
How to detect it: We will perform a Durbin-Watson test to determine if either positive or
negative correlation is present. Alternatively, you could create plots of residual autocorrelations.
How to fix it: A simple fix of adding lag variables can fix this problem. Alternatively,
interaction terms, additional variables, or additional transformations may fix this.
from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic on the residuals: values between 1.5 and 2.5 suggest no autocorrelation
durbinWatson = durbin_watson(df_results['Residuals'])
print('Durbin-Watson:', durbinWatson)
if durbinWatson < 1.5:
    print('Signs of positive autocorrelation', '\n')
    print('Assumption not satisfied')
elif durbinWatson > 2.5:
    print('Signs of negative autocorrelation', '\n')
    print('Assumption not satisfied')
else:
    print('Little to no autocorrelation', '\n')
    print('Assumption satisfied')
Assumption satisfied
We’re having signs of positive autocorrelation here, but we should expect this since we know our
model is consistently under-predicting and our linearity assumption is being violated. Since this
isn’t a time series dataset, lag variables aren’t possible. Instead, we should look into either
interaction terms or additional transformations.
V) Homoscedasticity/Heteroscedasticity
This assumes homoscedasticity, which is the same variance within our error terms.
Heteroscedasticity, the violation of homoscedasticity, occurs when we don’t have an even
variance across the error terms.
Why it can happen: Our model may be giving too much weight to a subset of the data,
particularly where the error variance was the largest.
What it will affect: Significance tests for coefficients due to the standard errors being biased.
Additionally, the confidence intervals will be either too wide or too narrow.
How to detect it: Plot the residuals and see if the variance appears to be uniform.
How to fix it: Heteroscedasticity (can you tell I like the scedasticity words?) can be solved either
by using weighted least squares regression instead of the standard OLS or transforming either the
dependent or highly skewed variables. Performing a log transformation on the dependent
variable is not a bad place to start.
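The plotting code for this check is not included in the excerpt; a minimal sketch of a residuals-versus-predictions scatter plot, assuming df_results holds the 'Predicted' and 'Residuals' columns used in the earlier checks:

import matplotlib.pyplot as plt

# Residuals should scatter evenly around zero across the range of predictions
plt.scatter(df_results['Predicted'], df_results['Residuals'], alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Homoscedasticity check')
plt.show()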
There don’t appear to be any obvious problems with that.
We can’t see a fully uniform variance across our residuals, so this is potentially problematic.
However, we know from our other tests that our model has several issues and is under predicting
in many cases.
Conclusion
We can clearly see that a linear regression model on the Boston dataset violates a number of
assumptions which cause significant problems with the interpretation of the model itself. It’s not
uncommon for assumptions to be violated on real-world data, but it’s important to check them so
we can either fix them and/or be aware of the flaws in the model for the presentation of the
results or the decision making process.
It is dangerous to make decisions on a model that has violated assumptions because those
decisions are effectively being formulated on made-up numbers. Not only that, but it also
provides a false sense of security due to trying to be empirical in the decision making process.
Empiricism requires due diligence, which is why these assumptions exist and are stated up front.
Hopefully this code can help ease the due diligence process and make it less painful.
"""
Tests a linear regression on the model to see if assumptions are being met
"""
from sklearn.linear_model import LinearRegression

# Setting feature names to x1, x2, x3, etc. if they are not defined
if feature_names is None:
    feature_names = ['X'+str(feature+1) for feature in range(features.shape[1])]

model = LinearRegression()
model.fit(features, label)
def linear_assumption():
    """
    Linearity: Assumes there is a linear relationship between the predictors and
    the response variable. If not, either a polynomial term or another algorithm
    should be used.
    """
    print('\n=======================================================================================')
    print('Assumption 1: Linear Relationship between the Target and the Features')
    sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False, size=7)
def normal_errors_assumption(p_value_thresh=0.05):
    """
    Normality: Assumes that the error terms are normally distributed.
    If they are not, nonlinear transformations of variables may solve this.
    """
    from statsmodels.stats.diagnostic import normal_ad
    print('Assumption 2: The error terms are normally distributed', '\n')
    # Anderson-Darling test on the residuals
    p_value = normal_ad(df_results['Residuals'])[1]
    print('p-value from the test - below 0.05 generally means non-normal:', p_value)
    print()
    if p_value > p_value_thresh:
        print('Assumption satisfied')
    else:
        print('Assumption not satisfied')
        print()
        print('Confidence intervals will likely be affected')
        print('Try performing nonlinear transformations on variables')
def multicollinearity_assumption():
    """
    Multicollinearity: Assumes that predictors are not correlated with each other.
    If there is correlation among the predictors, then either remove predictors with
    high Variance Inflation Factor (VIF) values or perform dimensionality reduction.
    """
    if definite_multicollinearity == 0:
        if possible_multicollinearity == 0:
            print('Assumption satisfied')
        else:
            print('Assumption possibly satisfied')
            print()
            print('Coefficient interpretability may be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')
    else:
        print('Assumption not satisfied')
        print()
        print('Coefficient interpretability will be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')
def autocorrelation_assumption():
    """
    Autocorrelation: Assumes that there is no autocorrelation in the residuals.
    If there is autocorrelation, then there is a pattern that is not explained, due to
    the current value being dependent on the previous value.
    This may be resolved by adding a lag variable of either the dependent variable
    or some of the predictors.
    """
    from statsmodels.stats.stattools import durbin_watson
    print('\n=======================================================================================')
    print('Assumption 4: No Autocorrelation')
    print('\nPerforming Durbin-Watson Test')
    print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
    print('0 to 2 indicates positive autocorrelation, 2 to 4 indicates negative autocorrelation')
    print('-------------------------------------')
    durbinWatson = durbin_watson(df_results['Residuals'])
    print('Durbin-Watson:', durbinWatson)
    if durbinWatson < 1.5:
        print('Signs of positive autocorrelation', '\n')
        print('Assumption not satisfied', '\n')
        print('Consider adding lag variables')
    elif durbinWatson > 2.5:
        print('Signs of negative autocorrelation', '\n')
        print('Assumption not satisfied', '\n')
        print('Consider adding lag variables')
    else:
        print('Little to no autocorrelation', '\n')
        print('Assumption satisfied')
def homoscedasticity_assumption():
    """
    Homoscedasticity: Assumes that the errors exhibit constant variance
    """
    print('\n=======================================================================================')
    print('Assumption 5: Homoscedasticity of Error Terms')
    print('Residuals should have relative constant variance')
linear_assumption()
normal_errors_assumption()
multicollinearity_assumption()
autocorrelation_assumption()
homoscedasticity_assumption()
2. Homoscedasticity – Constant Error Variance, i.e. the variance of the error term is the same across
all values of the independent variable. It can be easily checked by making a scatter plot between
the Residuals and the Fitted Values. If there is no trend, then the variance of the error term is constant.
A close observation of the above plot shows that the variance of the residual term is relatively
higher for larger fitted values. Note: in many real-life scenarios, it is practically difficult to ensure
that all assumptions of linear regression hold 100%.
3. Normal Error – The error term should be normally distributed. QQ plot is a good way of
checking normality. If the plot forms a line that is roughly straight then we can assume there is
normality.
import statsmodels.api as sm
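A minimal sketch of the QQ-plot check, assuming resid holds the model's residuals (for example taken from a fitted statsmodels OLS results object):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Points lying close to the 45-degree line indicate normally distributed residuals
sm.qqplot(resid, line='45', fit=True)
plt.show()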
4. No Autocorrelation of residuals – This is typically applicable to time series data.
Autocorrelation means the current value Yt is dependent on a historic value Yt-n, with n as the
lag period. The Durbin-Watson test is a quick way to find out whether there is any autocorrelation.
7. Sample Size – In linear regression, it is desirable that the number of records should be at least
10 or more times the number of independent variables to avoid the curse of dimensionality.
Skewness & Kurtosis
What is Skewness and how do we detect it?
If you will ask Mother Nature — What is her favorite probability distribution?
The answer will be — ‘Normal’ and the reason behind it is the existence of chance/random
causes that influence every known variable on earth. What if a process is under the influence of
assignable/significant causes as well? This is surely going to modify the shape of the distribution
(distort) and that’s when we need a measure like skewness to capture it. Below is a normal
distribution visual, also known as a bell curve. It is a symmetrical graph with all measures of
central tendency in the middle.
Notice how these central tendency measures tend to spread when the normal distribution is
distorted. For the nomenclature just follow the direction of the tail — For the left graph since the
tail is to the left, it is left-skewed (negatively skewed) and the right graph has the tail to the right,
so it is right-skewed (positively skewed).
How about deriving a measure that captures the horizontal distance between the Mode and the
Mean of the distribution? It’s intuitive to think that the higher the skewness, the more apart these
measures will be. So let’s jump to the formula for skewness now:
Division by the standard deviation enables relative comparison among distributions on the same
standard scale. Since calculating the mode as a central tendency for small data sets is not
recommended, to arrive at a more robust formula for skewness we replace the mode with a value
derived from the median and the mean.
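The formulas referred to above appear as images in the original; in standard notation they correspond to Pearson's mode and median skewness coefficients:

Sk_1 = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}, \qquad
Sk_2 = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}}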
Think of punching or pulling the normal distribution curve from the top, what impact will it have
on the shape of the distribution? Let’s visualize:
So there are two things to notice — The peak of the curve and the tails of the curve, Kurtosis
measure is responsible for capturing this phenomenon. The formula for kurtosis calculation is
complex (it is the fourth moment in the moment-based calculation), so we will stick to the concept and its
visual interpretation. A normal distribution has a kurtosis of 3 and is called mesokurtic; distributions with
kurtosis greater than 3 are called leptokurtic, and those with kurtosis less than 3 are called platykurtic.
The greater the value, the greater the peakedness. Kurtosis ranges from 1 to infinity. Since the kurtosis
of a normal distribution is 3, we can calculate excess kurtosis by taking the normal distribution as the
zero reference; excess kurtosis then varies from -2 to infinity.
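Both measures are easy to compute in Python; a minimal sketch using scipy.stats (the sample data is illustrative):

import numpy as np
from scipy.stats import skew, kurtosis

np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=10_000)

# skew() is ~0 for symmetric data; kurtosis() returns excess kurtosis by default (fisher=True)
print("Skewness:", skew(data))
print("Excess kurtosis:", kurtosis(data))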
The topic of kurtosis has been controversial for decades. Its interpretation has long been linked with
peakedness, but the ultimate verdict is that outliers (fatter tails) govern the kurtosis effect far more
than the values near the mean (the peak).
So we can conclude from the above discussions that the horizontal push or pull distortion of a
normal distribution curve gets captured by the Skewness measure and the vertical push or pull
distortion gets captured by the Kurtosis measure. Also, it is the impact of outliers that dominate
the kurtosis effect which has its roots of proof sitting in the fourth-order moment-based formula.
I hope this blog helped you clarify the idea of Skewness & Kurtosis in a simplified manner,
watch out for more similar blogs in the future.
The Violin Plot is used to indicate the probability density of data at different values and it is
quite similar to the Matplotlib Box Plot.
Here is a figure showing common components of the Box Plot and Violin Plot:
Creation of the Violin Plot
The violinplot() method is used for the creation of the violin plot.
Parameters
dataset
This parameter denotes the array or sequence of vectors. It is the input data.
positions
This parameter is used to set the positions of the violins. In this, the ticks and limits are set
automatically in order to match the positions. It is an array-like structured data with the default
as = [1, 2, …, n].
vert
This parameter contains the boolean value. If the value of this parameter is set to true then it
will create a vertical plot, otherwise, it will create a horizontal plot.
showmeans
This parameter contains a boolean value with false as its default value. If the value of this
parameter is True, then it will toggle the rendering of the means.
showextrema
This parameter contains a boolean value with True as its default value. If it is set to False, the
rendering of the extrema (the minimum/maximum lines) is turned off.
showmedians
This parameter contains a boolean value with False as its default value. If the value of this
parameter is True, then it will toggle the rendering of the medians.
quantiles
This is an array-like data structure with None as its default value. If the value of this parameter is
not None, it sets a list of floats in the interval [0, 1] for each violin, which stands for the
quantiles that will be rendered for that violin.
points
It is scalar in nature and is used to define the number of points to evaluate each of the Gaussian
kernel density estimations.
bw_method
This method is used to calculate the estimator bandwidth, for which there are many different
ways of calculation. The default rule used is Scott's Rule, but you can choose ‘silverman’, a scalar
constant, or a callable.
Now its time to dive into some examples in order to clear the concepts:
Below we have a simple example where we will create violin plots for a different collection of
data.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
collectn_1 = np.random.normal(120, 10, 200)
collectn_2 = np.random.normal(150, 30, 200)
collectn_3 = np.random.normal(50, 20, 200)
collectn_4 = np.random.normal(100, 25, 200)
data_to_plot = [collectn_1, collectn_2, collectn_3, collectn_4]

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
bp = ax.violinplot(data_to_plot)
plt.show()
AI vs ML vs Deep Learning vs Data Science
The field of Artificial Intelligence was founded as an academic discipline in 1956. Yes, this is not a new
technology; it is a very old concept. But look at the history of how people arrived at this concept, and who
those people were. A handful of scientists from different fields of study thought about creating an artificial
brain. It may surprise you to hear that the group of scientists who founded the field of artificial
intelligence as an academic discipline in 1956 came from diverse fields such as mathematics, psychology, and
engineering. Sounds interesting, right? I have always believed that discoveries in technology have their
origins in ideas from many different fields.
About (AI)
The core analogy of this field is to mimic the human brain. In AI we write programs that simulate human
intelligence in a machine and make it think and behave like humans. The field involves developing algorithms
that analyze data and perform human-like actions, for example understanding natural language or recognizing
the objects in an image.
If I asked you, from an image, how many objects are in it and what those objects are, you could easily
identify them with your bare eyes, because you have seen these objects before and know what cats and dogs
look like.
But to ask the same question of a machine, you need to feed some intelligence into the machine so that it
can identify these objects. Similarly, making a machine talk like a human, and the whole process involved
in doing so, falls under the artificial intelligence domain.
There are other examples, like question-answering applications such as IBM Watson, or decision-making
systems that can take marketing budget decisions (where to spend and where not to spend), and the list goes
on and on. I will cover this in another blog dedicated to AI.
Machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as "the field
of study that gives computers the ability to learn without being explicitly programmed."
The 80s and 90s were the phases when machine learning came into the mainstream, and people started
recognizing it as a separate field. In the early days machine learning focused on solving AI problems, but
after 1990 the focus shifted toward statistical models, fuzzy logic, and probability theory.
The difference between AI and ML is frequently misunderstood. People had the mindset that machine learning
learns and predicts based on passive learning, or you can say learning from the past history of data, while
AI (Artificial Intelligence) uses an agent that interacts with the environment to learn and takes actions to
maximize its chance of success. We know this technique as Reinforcement Learning; I have just introduced
some jargon that we will look at in another blog.
Machine learning gives us methods through which we can train algorithms to do AI work. In layman's terms,
ML provides the methods (algorithms) through which we can put human-like intelligence into machines, and it
does this by learning from data.
History
So, a very quick history: this all started in 1962 when Frank Rosenblatt published "Principles of
Neurodynamics". Then the natural progression happened: people started exploring new ideas around this, and
they started working on other deep learning architectures to solve computer vision problems.
According to Wikipedia, "The term Deep Learning was introduced to the machine learning community by Rina
Dechter in 1986", and then in 1989, Yann LeCun et al. applied the standard backpropagation algorithm.
Things kept progressing and new architectures started emerging. The ANN (artificial neural network),
introduced as a concept back in 1943, did not see wide use until around 2000, while the SVM (Support Vector
Machine) received much of the attention in that period.
The advancement of hardware in 2009 drove renewed interest in deep learning; deep learning models can now
be trained on Nvidia GPUs, and in 2023 we can see a lot of advancement in this field.
About (DL)
To understand Deep Learning you have to understand neural networks, because DL is essentially a multi-layer
neural network. The term neural network comes from the neurons in the brain; I will cover this in detail in
another blog. The overall concept of a neural network is to create a network of neurons, where a neuron is
nothing but a cell that receives an input signal in the form of data (for humans, the eye sees things, and
once a human sees something they can easily identify the object) and passes the processed information on to
another neuron.
But if I define this in AI terms then, DL is a subfield of Machine learning through which you can
train an AI system.
Deep learning is designed to learn from large amounts of data and extract information from raw data
automatically. Deep learning is at its best when you are working on image data, speech data, and natural
language processing.
Data Science (DS)
This is one of the buzz terms on the internet, and all companies want to exploit the area. So
let's define the term in an easy way that everyone can understand.
It is an interdisciplinary field that involves various methods, tools, and techniques to analyze and
extract knowledge and insight from data.
Data science combines various fields, for example statistics, computer science, domain-specific
knowledge, visualization, data engineering, data quality, and data profiling techniques, to analyze
and interpret complex data sets.
Data Science involves tasks such as data cleaning, data preprocessing, data analysis, data
visualization, and data interpretation.
Basically, as a data scientist, you should have all the core knowledge of analyzing and interpreting
data. And create an AI system with the help of Machine learning techniques.
Analysis
Now, coming back to our original discussion: why do people often get confused by these terms?
The simple answer is that they do not make a connection between them. Okay, let me help with this.
Say you want to create an AI system, so AI is our application layer. First, figure out what type
of method will serve the purpose: can we use a machine learning algorithm, or do we have to use
a deep learning technique? Once we finalize that, we need to analyze and interpret the data with
data science techniques.
In summary, AI is the broader field of developing intelligent machines, ML is a subset of AI that
involves training algorithms to learn from data, DL is a subset of ML that uses ANNs to model
complex patterns in data, and DS is an interdisciplinary field that involves extracting insights
from data.
As we can see, deep learning is a subfield of machine learning, and machine learning is a subfield
of AI. Data science, on the other hand, is an interdisciplinary field in its own right that draws on
all of these techniques to analyze and extract information from data.