Mid Prep Data

The document provides an overview of Python programming concepts, including modules, expressions, statements, and indentation. It covers data types, such as integers, strings, lists, dictionaries, and tuples, along with their properties and methods. Additionally, it discusses handling missing data in Pandas, data loading, and combining datasets, emphasizing the importance of index alignment and data integrity.

Uploaded by

zaxra0909
Copyright
© All Rights Reserved

A module is a single file that contains a set of Python statements.

Expression: a combination of values and operations that creates a new value that
we call a return value - i.e. the value returned by the operation(s).

Statement: doesn't return a value, but does perform some task. Some statements
may control the flow of the program, and others might ask for resources.

Indentation
Indentation is used by programmers in most languages to make code more readable.
Programmers often indent code to indicate that the indented code is grouped
together, meaning those statements have some common purpose. However,
indentation is treated uniquely in Python: Python requires it for grouping.

Naming Objects

1. Every name must begin with a letter or the underscore character _:
o A numeral is not allowed as the first character.
o Multiple-word names can be linked together using the underscore
character _, e.g. this_is_a_variable_name_with_underscore.
o A name starting with an underscore is often used by Python and Python
programmers to denote a variable with special characteristics. You will
see these as we go along, but for now it is best not to start variable
names with an underscore until you understand what that implies.
2. After the first letter, the name may contain any combination of
letters, numbers, and underscores:
o The name cannot be a keyword.
o You cannot have any delimiters, punctuation, or operators in a name.
3. A name can be of any length.
4. UPPERCASE is different from lowercase:
o my_name is different from my_Name, My_Name, or My_name.

Python types:
• Integer (int) - whole numbers
• Float - floating point values or real numbers
• Boolean - True or False
• String - 'any text'
• List - [4, 3.26, 'Baku']
• Dictionary - {'Namiq' : '15-March-1989', 'Rasim' : '14-June-2008'}
• Tuple - (1, 2)
• Set - {'ag', 'qara', 'qirmizi'}

For two integers, the result of division / is always a float, while +, -, and * return an int.
// is floor (integer) division; for two integers it returns an int.
% returns the remainder; for two integers it returns an int.
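
A small sketch of these rules, using the arbitrary operands 7 and 2 (chosen only for illustration):

```python
# Result types of arithmetic on two integers.
a, b = 7, 2

true_div = a / b      # true division always produces a float
floor_div = a // b    # floor division of two ints produces an int
remainder = a % b     # remainder of two ints is an int

print(true_div, type(true_div).__name__)
print(floor_div, type(floor_div).__name__)
print(remainder, type(remainder).__name__)
print(type(a + b).__name__, type(a - b).__name__, type(a * b).__name__)
```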

Strings:
Any sequence of printable characters is referred to as a string.

ord('f') --> returns the Unicode code point of the character (an integer)

chr(102) --> returns the one-character string for that code point

• concatenate +: The operator + requires two string objects and creates a new string
object. The new string object is formed by concatenating copies of the two string
objects together: the first string joined at its end to the beginning of the second
string.
• repeat *: The * takes a string object and an integer and creates a new string
object. The new string object has as many copies of the string as is indicated by
the integer.
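
The following short sketch illustrates ord, chr, and the two string operators; the sample strings are arbitrary:

```python
# ord and chr convert between a character and its Unicode code point.
code_point = ord('f')      # integer code point of 'f'
char = chr(code_point)     # back to the one-character string

# + and * always build new string objects; the originals are unchanged.
first, second = 'ab', 'cd'
joined = first + second
repeated = first * 3

print(code_point, char, joined, repeated)
```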

Comparing Strings with More than One Character


String comparison (in fact, any sequence comparison) works as follows. The
basic idea is to examine, in parallel, the characters of both strings at some index
and then walk through both strings until a difference in characters is found.
1. Start at index 0, the beginning of both strings.
2. Compare the two single characters at the present index of each string.
o If the two characters are equal, increase the present index of both
strings by 1 and go back to the beginning of step 2.
o If the two characters are not equal, return the result of comparing
those two characters as the result of the string comparison.
3. If both strings are equal up to some point but one is shorter
than the other, then the longer string is always greater. For
example, `'ab' < 'abc'` returns `True`.
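
The steps above can be sketched as a small function. The name less_than is hypothetical and exists only to mirror the description; in practice the built-in < operator already does this.

```python
def less_than(s1, s2):
    """Return True if s1 < s2, following the character-by-character steps."""
    index = 0  # step 1: start at the beginning of both strings
    while index < len(s1) and index < len(s2):
        if s1[index] != s2[index]:
            # step 2: the first differing characters decide the result
            return s1[index] < s2[index]
        index += 1
    # step 3: one string is a prefix of the other; the shorter one is smaller
    return len(s1) < len(s2)


print(less_than('ab', 'abc'))
print(less_than('abd', 'abc'))
```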

String Collections are immutable


Given that a string is a collection (a sequence, in fact), it is tempting to try the
following kind of operation: create a string and then try to change a particular
character in that string to a new character. In Python, this does not work: strings
are immutable, so assigning to an index of a string raises a TypeError.
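
A minimal sketch of such an attempt; the string 'spam' is an arbitrary example. The usual workaround is to build a new string from pieces of the old one.

```python
word = 'spam'

try:
    word[0] = 'S'            # item assignment is not supported for strings
    error_name = None
except TypeError as err:
    error_name = type(err).__name__

# Build a new string instead of changing the old one.
new_word = 'S' + word[1:]

print(error_name, word, new_word)
```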

String Functions and Methods


Functions
Think of a function as a small program that performs a specific task. That program
is packaged up, or encapsulated, and made available for us to use.
Methods
A method is a variation on a function. It looks very similar: it has a name and it has
a list of arguments in parentheses. It differs, however, in the way it is invoked.
Every method is called in conjunction with a particular object. The kinds of
methods that can be used in conjunction with an object depend on the object's
type. String objects have a set of methods suited for strings, just as integers have
integer methods, and floats have float methods. The invocation is done using what
is called the dot notation.
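
A brief sketch of the difference in invocation, using an arbitrary sample string:

```python
text = 'Hello'

length = len(text)          # function: the object is passed as an argument
upper = text.upper()        # method: called on the object with dot notation
position = text.find('l')   # method with an argument; index of the first 'l'

print(length, upper, position)
```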

Lists:

• A list can contain elements other than characters. In fact, a list can contain a
sequence of elements of any type, even different typed elements mixed together
in the same list.
• A list is a mutable type. This means that, unlike a string object, a list object can be
changed after it is initially created.

Lists are written inside square brackets [], with elements separated by commas.

Note that strings and integers inside a list cannot be compared with each other,
so operations such as min, max, and sort fail on a list that mixes them.

List functions: len(list_name), min, max, sum.

A method is a function that works with a particular type of Python object.

Two types of list methods:

1. Methods that do not modify the list:
o index(x): returns the index of the first occurrence of x; raises a ValueError if x is not in the list
o count(x): returns the number of occurrences of x; 0 if x is not in the list

2. Methods that modify the list in place (apart from pop, they return None):
o append(x): adds x to the end
o pop(i): removes the element at index i and returns it (the last element if no index is given)
o extend(s): adds the contents of another collection s to the end
o insert(i, x): inserts x before index i
o remove(x): removes the first occurrence of x
o sort(): sorts the list
o reverse(): reverses the order of the list
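
A short sketch exercising the list functions and methods listed above, on an arbitrary list of numbers; the comments track the list contents after each step.

```python
numbers = [3, 1, 2]

# Functions that take the list as an argument.
size, smallest, largest, total = len(numbers), min(numbers), max(numbers), sum(numbers)

# Methods that do not modify the list.
first_pos = numbers.index(1)
missing_count = numbers.count(99)
try:
    numbers.index(99)
    index_error = None
except ValueError:
    index_error = 'ValueError'

# Methods that modify the list in place.
append_result = numbers.append(4)   # [3, 1, 2, 4]; append returns None
numbers.extend([5, 6])              # [3, 1, 2, 4, 5, 6]
numbers.insert(0, 0)                # [0, 3, 1, 2, 4, 5, 6]
numbers.remove(3)                   # [0, 1, 2, 4, 5, 6]
popped = numbers.pop()              # removes and returns the last element
numbers.sort()                      # [0, 1, 2, 4, 5]
numbers.reverse()                   # [5, 4, 2, 1, 0]

print(numbers, popped, append_result)
```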

split: a string method that breaks a string into a list of substrings

join: a string method that combines a list of strings into a single string
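
A minimal sketch of the round trip between strings and lists; the sample sentence is arbitrary. Note that join is called on the separator string, not on the list.

```python
sentence = 'red green blue'

words = sentence.split()        # split on whitespace -> list of strings
csv_line = ','.join(words)      # join the list using ',' as the separator
fields = csv_line.split(',')    # split on an explicit separator

print(words, csv_line, fields)
```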

Tuples: immutable lists, so no function or method can change them. A tuple is written
inside () and its elements are separated with commas. Tuples are efficient in time
and space, which is one reason they are used.

The operations familiar from other sequences (lists and strings) are available,
except, of course, those operators that violate immutability.
• Operators such as + (concatenate) and * (repeat) work as before.
• Slicing also works as before.
• Membership in and for iteration also work on tuples.
• len, min, max, greater than (>), less than (<), sum and others work the same
way. In particular, any comparison operation has the same restrictions for
mixed types.
None of the operations that change lists are available for tuples. For
example, append, extend, insert, remove, pop, reverse, and sort do not work on
tuples.
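
The tuple operators can be sketched as follows, using arbitrary sample values:

```python
point = (3, 1, 2)

combined = point + (4,)     # concatenation creates a new tuple
doubled = (1, 2) * 2        # repetition
first_two = point[:2]       # slicing
has_one = 1 in point        # membership
stats = (len(point), min(point), max(point), sum(point))

# List-style mutation is not available on tuples.
can_append = hasattr(point, 'append')
try:
    point[0] = 10
    assign_error = None
except TypeError:
    assign_error = 'TypeError'

print(combined, doubled, first_two, has_one, stats, can_append, assign_error)
```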

Data structure examples:

A queue is a sequence of data elements that "stand in a line," in which the data
can only be removed from the front of the line and new elements can only be
added to the back of the line.
A dictionary data structure is one in which we can look up a key and find the value
associated with it.
A set is an unordered collection of unique elements; individual elements can be
added and removed. Sets are mutable.
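
As one possible illustration of the queue behavior, here is a sketch using collections.deque from the standard library. The choice of deque and the customer names are assumptions made for the example; the notes do not prescribe an implementation.

```python
from collections import deque

queue = deque()

# New elements join at the back of the line.
queue.append('first customer')
queue.append('second customer')
queue.append('third customer')

# Elements leave from the front of the line.
served = queue.popleft()

print(served, list(queue))
```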

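Dictionaries and Sets:

A dictionary is a collection, but it is not a sequence.

Keys must be immutable (more precisely, hashable).
Dictionaries are mutable.

Each item has two parts, a key and a value, separated by a colon (:).

Intersection of sets is done using the & operator or the intersection method.

Union is done with the | operator or the union method.

Superset: the set that contains every element of the other set (usually the larger one).

Subset: the set whose elements are all contained in the other set (usually the smaller one).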
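
A short sketch of these dictionary and set properties. The dictionary reuses the example from the types list; the tuple key, the list key, and the set contents are hypothetical values chosen for illustration.

```python
birthdays = {'Namiq': '15-March-1989', 'Rasim': '14-June-2008'}
looked_up = birthdays['Namiq']        # look up a value by its key

# Dictionaries are mutable, and immutable objects such as tuples can be keys.
birthdays[('example', 'key')] = 'tuples are allowed as keys'
try:
    birthdays[['a', 'list']] = 'not allowed'   # lists are mutable, so unhashable
    key_error = None
except TypeError:
    key_error = 'TypeError'

big = {1, 2, 3, 4}
small = {2, 3}
other = {3, 4, 5}

common = big & other                  # same as big.intersection(other)
either = big | other                  # same as big.union(other)
is_superset = big.issuperset(small)
is_subset = small.issubset(big)

print(looked_up, key_error, common, either, is_superset, is_subset)
```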

Because of Python's dynamic typing, we can even create heterogeneous lists:
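
For instance, a minimal sketch with arbitrary values of four different types:

```python
# One list holding a bool, a str, a float, and an int.
mixed = [True, "2", 3.0, 4]
types = [type(item).__name__ for item in mixed]

print(types)
```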

But this flexibility comes at a cost: to allow these flexible types, each item in the
list must contain its own type info, reference count, and other information–that is,
each item is a complete Python object. In the special case that all variables are of
the same type, much of this information is redundant: it can be much more
efficient to store data in a fixed-type array.
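
A brief sketch of fixed-type storage, assuming NumPy's ndarray as the array type (the surrounding notes go on to discuss NumPy). Every element of an array shares one dtype, and mixed numeric input is upcast to a common type.

```python
import numpy as np

int_array = np.array([1, 2, 3, 4])                # all integers -> integer dtype
upcast = np.array([3.14, 4, 2, 3])                # mixed input is upcast to float
floats = np.array([1, 2, 3], dtype='float32')     # dtype can be set explicitly

print(int_array.dtype, upcast.dtype, floats.dtype)
```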

Finally, unlike Python lists, NumPy arrays can explicitly be multi-dimensional;
here's one way of initializing a multidimensional array using a list of lists:
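
A minimal sketch: each inner sequence becomes one row of the resulting two-dimensional array. The starting values 2, 4, 6 are arbitrary.

```python
import numpy as np

# Inner sequences are matched up as rows of a 3 x 3 array.
grid = np.array([range(i, i + 3) for i in [2, 4, 6]])

print(grid)
print(grid.shape)
```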

For working with arrays of mixed dimensions, it can be clearer to use
the np.vstack (vertical stack) and np.hstack (horizontal stack) functions.

The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of
indices giving the split points:
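
A short sketch of stacking and splitting with small sample arrays (values chosen only for illustration). N split points produce N + 1 subarrays.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
column = np.array([[99],
                   [99]])

stacked_v = np.vstack([x, grid])        # add x as a new first row -> shape (3, 3)
stacked_h = np.hstack([grid, column])   # add a new last column -> shape (2, 4)

values = np.array([1, 2, 3, 99, 99, 3, 2, 1])
x1, x2, x3 = np.split(values, [3, 5])   # split before index 3 and before index 5

square = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(square, [2])   # split rows
left, right = np.hsplit(square, [2])    # split columns

print(stacked_v.shape, stacked_h.shape, x1, x2, x3)
```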

Ufuncs exist in two flavors: unary ufuncs, which operate on a single input,
and binary ufuncs, which operate on two inputs. We'll see examples of both these
types of functions here.
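
A minimal sketch of both flavors on a small sample array:

```python
import numpy as np

x = np.arange(4)                              # [0, 1, 2, 3]

# Unary ufuncs take a single input.
negated = np.negative(x)                      # same as -x
absolute = np.abs(np.array([-2, -1, 0, 1]))

# Binary ufuncs take two inputs.
added = np.add(x, 2)                          # same as x + 2
powers = np.power(2, x)                       # 2 raised to each element of x

print(negated, absolute, added, powers)
```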

Operating on Null Values


As we have seen, Pandas treats None and NaN as essentially interchangeable for
indicating missing or null values. To facilitate this convention, there are several
useful methods for detecting, removing, and replacing null values in Pandas data
structures. They are:
• isnull(): Generate a boolean mask indicating missing values
• notnull(): Opposite of isnull()
• dropna(): Return a filtered version of the data
• fillna(): Return a copy of the data with missing values filled or imputed
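
The four methods can be sketched on a small Series; the values are arbitrary, and both NaN and None are included to show that they are treated alike.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, None])   # None is converted to NaN here

mask = data.isnull()             # True where a value is missing
present = data[data.notnull()]   # boolean masks can be used as an index
dropped = data.dropna()          # filtered version without the missing values
filled = data.fillna(0)          # copy with missing values replaced by 0

print(mask.tolist(), dropped.tolist(), filled.tolist())
```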

NumPy

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension),
and size (the total number of elements in the array).

Another useful attribute is the dtype, the data type of the array.

Other attributes include itemsize, which lists the size (in bytes) of each array element, and nbytes,
which lists the total size (in bytes) of the array.
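
A small sketch of these attributes on a three-dimensional array; the shape and the explicit int64 dtype are arbitrary choices for the example.

```python
import numpy as np

x3 = np.zeros((3, 4, 5), dtype=np.int64)

print("ndim:    ", x3.ndim)       # number of dimensions
print("shape:   ", x3.shape)      # size of each dimension
print("size:    ", x3.size)       # total number of elements
print("dtype:   ", x3.dtype)      # data type of the elements
print("itemsize:", x3.itemsize)   # bytes per element
print("nbytes:  ", x3.nbytes)     # itemsize * size
```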

One important and extremely useful thing to know about array slices is that they
return views rather than copies of the array data.
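
A minimal sketch of the difference between a view and an explicit copy, using an arbitrary 3 x 4 array:

```python
import numpy as np

x2 = np.arange(12).reshape((3, 4))

sub = x2[:2, :2]            # a slice is a view onto the same data
sub[0, 0] = 99              # modifying the view changes the original array
after_view = x2[0, 0]

sub_copy = x2[:2, :2].copy()   # copy() creates independent data
sub_copy[0, 0] = 42            # the original array is not affected
after_copy = x2[0, 0]

print(after_view, after_copy)
```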

Pandas:
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array
as follows:
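
A minimal sketch with arbitrary values; the second Series adds an explicit string index for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
values = data.values      # the underlying NumPy array
index = data.index        # default integer index 0..3

labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
item = labeled['b']       # access a value by its index label

print(values, list(index), item)
```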

Data Loading, Storage, and File Formats

• Reading Tabular Data with Pandas:
  o pandas offers functions like read_csv, read_excel, read_json, etc., to load data into
    DataFrame objects.
  o These functions can handle various file formats including CSV, Excel, JSON, HTML,
    etc.
• Optional Arguments and Type Inference:
  o Data loading functions like read_csv have many optional arguments to handle
    different file formats and data variations.
  o Type inference is used to automatically infer column data types in some cases.
• Handling Special Cases:
  o Options like header, names, index_col, sep, etc., help handle special cases like
    missing headers, custom column names, index columns, and variable delimiters.
• Data Export:
  o Data can be exported using the to_csv method to write to CSV format.
• JSON Data:
  o JSON is a common data format used for exchanging data between applications.
  o pandas can read JSON with the read_json function and write it with the to_json method.
• Binary Data Formats:
  o pandas supports binary data formats like HDF5, ORC, Parquet, and Pickle.
  o Pickle is Python's built-in serialization format, while others require additional
    libraries.
• Reading Excel Files:
  o pandas supports reading Excel files using the ExcelFile class or the read_excel function.
  o Additional packages like xlrd and openpyxl may be needed for reading old and new
    Excel formats.
• Handling Data in DataFrames:
  o Once data is loaded into a DataFrame, various operations like filtering, calculating
    percentages, and value counts can be performed easily.
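
A sketch of a few of these loading and export options. To keep it self-contained it reads from in-memory text via io.StringIO instead of real files; the column names and values are hypothetical sample data.

```python
import io
import pandas as pd

# sep handles a non-default delimiter; column types are inferred.
csv_text = "name;score\nana;90\nben;75\n"
df = pd.read_csv(io.StringIO(csv_text), sep=';')

# header=None with names supplies column names for a file without a header row;
# index_col turns one column into the index.
raw_text = "ana,90\nben,75\n"
no_header = pd.read_csv(io.StringIO(raw_text), header=None,
                        names=['name', 'score'], index_col='name')

# Export: to_csv and to_json return strings when no path is given.
csv_out = df.to_csv(index=False)
json_out = df.to_json(orient='records')
round_trip = pd.read_json(io.StringIO(json_out), orient='records')

# Simple operations once the data is in a DataFrame.
high = df[df['score'] > 80]              # filtering
counts = df['score'].value_counts()      # value counts

print(df.dtypes)
print(csv_out)
print(json_out)
```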

Ufuncs: Index Preservation

• NumPy Compatibility:
  o Pandas inherits functionality from NumPy, including element-wise operations (ufuncs).
  o Unary operations preserve index and column labels in the output; binary operations
    align indices.
• Example:
  o Operations like exponentiation (np.exp) on Series or DataFrame objects preserve
    indices.
  o More involved unary calculations, such as trigonometric functions (np.sin), also
    preserve index and column labels.
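
A minimal sketch of index preservation with unary ufuncs; the labels and values are arbitrary.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2], index=['a', 'b', 'c'])
df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]}, index=['x', 'y'])

exp_ser = np.exp(ser)               # result is a Series with the same index
sin_df = np.sin(df * np.pi / 4)     # result is a DataFrame with the same labels

print(exp_ser)
print(sin_df)
```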
UFuncs: Index Alignment
• Binary Operations:
  o Pandas aligns indices when performing binary operations on Series or DataFrame
    objects.
  o Example: Dividing one Series by another gives a result indexed by the union of the
    two indices; labels missing from either input are marked as NaN.
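
A small sketch of index alignment with two partially overlapping Series (arbitrary values). The fill_value argument of the method form is one way to avoid the NaN entries.

```python
import pandas as pd

a = pd.Series([2, 4, 6], index=[0, 1, 2])
b = pd.Series([1, 3, 5], index=[1, 2, 3])

quotient = a / b                    # index is the union of both indices
total = a.add(b, fill_value=0)      # missing entries are treated as 0

print(quotient)
print(total)
```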
Operations Between DataFrame and Series
• Alignment and Broadcasting:
  o Operations between DataFrame and Series objects maintain index and column
    alignment.
  o Broadcasting rules similar to NumPy's apply.
• Example:
  o Subtracting a Series from a DataFrame operates row-wise by default.
  o Column-wise operations can be performed using the object methods (such as subtract)
    with the axis keyword.
• Preservation of Context:
  o Indices and columns are aligned automatically, ensuring data context is maintained.
  o This prevents errors common with heterogeneous or misaligned data in raw NumPy
    arrays.
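
A sketch of row-wise and column-wise operations between a DataFrame and a Series; the column labels Q, R, S and the values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6]], columns=list('QRS'))

row_wise = df - df.iloc[0]                 # the first row is subtracted from every row
col_wise = df.subtract(df['R'], axis=0)    # column R is subtracted from every column

# A Series with only some of the columns is aligned by label first.
half_row = df.iloc[0, ::2]                 # contains only Q and S
aligned = df - half_row                    # column R has no partner, so it becomes NaN

print(row_wise)
print(col_wise)
print(aligned)
```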

1. Introduction to Missing Data: Real-world datasets often contain missing values,
represented as null, NaN, or NA values. Different data sources may indicate missing data
differently.
2. Trade-Offs in Missing Data Conventions: There are two main strategies for handling
missing data: using a mask to globally indicate missing values or choosing a sentinel value
for missing entries. Each approach has trade-offs in terms of storage, computation, and
compatibility with different data types.
3. Missing Data in Pandas: Pandas relies on NumPy, which lacks a built-in notion of NA
values for non-floating-point data types. Pandas uses sentinels for missing data, primarily
None (Python object) and NaN (floating-point value).
4. None: Pythonic Missing Data: None is a Python singleton object used for missing data. It
can only be used in arrays with data type 'object', leading to slower operations compared
to native types.
5. NaN: Missing Numerical Data: NaN is a special floating-point value recognized by all
systems using the standard IEEE floating-point representation. It infects any other object
it touches and is specifically for numerical data.
6. NaN and None in Pandas: Pandas treats NaN and None nearly interchangeably,
converting between them where appropriate. It automatically type-casts when NA values
are present.
7. Operating on Null Values: Pandas provides methods like isnull(), notnull(), dropna(), and
fillna() for handling null values in data structures. These methods allow detecting,
removing, and replacing null values efficiently.
8. Detecting Null Values: isnull() and notnull() methods generate Boolean masks indicating
missing values.
9. Dropping Null Values: dropna() method removes rows or columns containing null values,
with options to specify how many null values to allow.
10. Filling Null Values: fillna() method replaces null values with specified values, including
single values, forward-fill, or back-fill methods. It can be applied along different axes in
DataFrames.
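
Several of the points above can be sketched together. All values are arbitrary; ffill() and bfill() are used here as the method forms of forward-fill and back-fill.

```python
import numpy as np
import pandas as pd

# None forces the generic 'object' dtype; NaN keeps a native float dtype.
obj_arr = np.array([1, None, 3])
float_arr = np.array([1, np.nan, 3])

nan_sum = float_arr.sum()           # NaN propagates through arithmetic
safe_sum = np.nansum(float_arr)     # NaN-aware aggregation ignores it

# Pandas converts None to NaN and uses a float dtype for the Series.
ser = pd.Series([1, np.nan, 2, None])

# dropna on a DataFrame removes whole rows (default) or whole columns.
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6],
                   [np.nan, np.nan, 7]])
rows_kept = df.dropna()                  # rows without any null value
cols_kept = df.dropna(axis='columns')    # columns without any null value
at_least_two = df.dropna(thresh=2)       # rows with at least 2 non-null values

# Filling with a single value, forward-fill, or back-fill.
gaps = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
zero_filled = gaps.fillna(0)
forward = gaps.ffill()
backward = gaps.bfill()

print(obj_arr.dtype, float_arr.dtype, nan_sum, safe_sum, ser.dtype)
```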

11. Introduction to Combining Datasets: Data analysis often involves combining different
data sources, from simple concatenation to more complex joins and merges. Pandas
provides functions and methods to facilitate this process.
12. Simple Concatenation with pd.concat(): The pd.concat() function is similar to NumPy's
np.concatenate() but designed for Series and DataFrames. It allows concatenation along
specified axes and handles missing data efficiently.
13. Concatenating Series and DataFrames: Examples show how to concatenate Series and
DataFrames using pd.concat(). Data can be concatenated row-wise (default) or column-
wise by specifying the axis parameter.
14. Handling Duplicate Indices: pd.concat() preserves indices by default, even if they are
duplicates. Options like verify_integrity and ignore_index allow handling of duplicate
indices.
15. Adding MultiIndex Keys: The keys parameter in pd.concat() adds hierarchical indexing to
the resulting DataFrame, useful for identifying the source of data.
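16. Concatenation with Joins: pd.concat() offers options for handling inputs whose column
names differ: join='outer' (the default) keeps the union of the columns, and join='inner'
keeps the intersection. Older pandas versions also accepted join_axes to specify the
resulting columns; that argument has been removed from current releases.
17. The append() Method: In older pandas versions, Series and DataFrames have an append()
method that concatenates objects. It does not modify the original object but creates a
new object each time. The method was removed in pandas 2.0 in favor of pd.concat().
18. Efficiency Considerations: While append() is convenient, it's less efficient than pd.concat()
when dealing with multiple concatenations, as each call involves creating new indices and
data buffers. It is better to build a list of objects and pass them to pd.concat() at once.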
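
The pd.concat() options described in this list can be sketched as follows. The small DataFrames and their labels are hypothetical sample data.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

rows = pd.concat([df1, df2])             # row-wise (default); indices are preserved
cols = pd.concat([df1, df2], axis=1)     # column-wise

# Duplicate indices: detect them, ignore them, or label the sources.
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicate_error = None
except ValueError:
    duplicate_error = 'ValueError'
fresh = pd.concat([df1, df2], ignore_index=True)   # new integer index
keyed = pd.concat([df1, df2], keys=['x', 'y'])     # hierarchical index by source

# Inputs with different column names.
df3 = pd.DataFrame({'B': ['B4'], 'C': ['C4']})
outer = pd.concat([df1, df3])                 # union of columns, NaN where missing
inner = pd.concat([df1, df3], join='inner')   # intersection of columns

print(rows)
print(keyed)
print(outer)
print(inner)
```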
