
Complete Basic Stats

These notes present information about basic statistics.

Uploaded by

Vinut P Maradur
Copyright
© All Rights Reserved
Sourav Kapil
Data Scientist
4st Jan, 2023

Statistics Notes for Data Science

Basic Stats

Statistics

Statistics is the science of collecting, organising and analysing data. We do this to support better decision making. The data is represented using pie charts, bar graphs or tables, and then analysed and interpreted.

* Data - Facts or pieces of information that can be measured. For example, the ages of the students in a class - Data: {25, 21, 22, 20, 23}

Types of Statistics

Statistics is divided into two types:
* Descriptive Statistics
* Inferential Statistics

1. Descriptive Statistics - Descriptive statistics is a means of describing features of a dataset by generating summaries of the data samples. There are 4 major types of descriptive stats -
> Measure of Frequency (Count, Percent, Frequency)
> Measure of Central Tendency (Mean, Median, Mode)
> Measure of Dispersion and Variability (Range, Variance, Standard Deviation)
> Measure of Position (Percentile/Quartile Ranks)

* The distribution concerns the frequency of each value.
* The central tendency concerns the averages of the values.
* The variability or dispersion concerns how spread out the values are.

[Figure: central tendency (mean, median, mode) and variability (range, standard deviation, variance, interquartile range). Source: Scribbr]

2. Inferential Statistics - Inferential statistics is a way of making inferences about a population based on samples. It is a technique where we use the data that we have measured to form conclusions. There are 3 major types of inferential stats -
> Confidence Interval
> Hypothesis Testing
> Regression Analysis

Population Vs Sample

A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.
* Samples are used to make inferences about populations.
* Samples are easier to collect data from because they are practical, cost-effective, convenient, and manageable.
* In research, a population doesn't always refer to people.

[Figure: population and sample]

Population Mean And Sample Mean

* The sample mean is the mean calculated from a group of observations drawn at random from the population. Compared to the population, the sample is small. The sample size is represented by 'n'. The sample mean is calculated as:

$\bar{a} = \frac{\sum a_i}{n}$

where n = size of the sample, $\sum$ = add up, $a_i$ = all the observations.

* The population mean represents the actual mean of the whole population. The population is large, and its size is denoted by 'N'. The population mean is calculated as:

$\mu = \frac{\sum a_i}{N}$

where N = size of the population, $\sum$ = add up, $a_i$ = all the observations.

Sampling

Sampling means selecting a group of data as a sample from the entire data for your research. It allows you to test hypotheses about the characteristics of a population. Reasons for sampling -
> Necessity: Sometimes it's simply not possible to study the whole population due to its size or inaccessibility.
> Practicality: It's easier and more efficient to collect data from a sample.
> Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs involved.
> Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.

Sampling Techniques

When you conduct research about a group of people, it's rarely possible to collect data from every person in that group. Instead, you select a sample. The sample is the group of individuals who will actually participate in the research. To draw valid conclusions from your results, you have to carefully decide how you will select a sample that is representative of the group as a whole. This is called a sampling method.
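To make the sample mean and population mean formulas above concrete, here is a minimal Python sketch. The population of 1000 ages is synthetic (an assumption for illustration, not data from these notes), and the names `population`, `sample`, `N` and `n` are chosen here only to mirror the symbols in the formulas.

```python
import random

# Hypothetical population: ages of N = 1000 people (synthetic data, not from the notes).
random.seed(42)  # fixed seed so the sketch is reproducible
population = [random.randint(18, 60) for _ in range(1000)]

# Population mean: add up all observations and divide by N.
N = len(population)
population_mean = sum(population) / N

# Draw a random sample of n = 100 without replacement,
# then apply the same formula using n instead of N.
n = 100
sample = random.sample(population, n)
sample_mean = sum(sample) / n

print(f"Population mean (N={N}): {population_mean:.2f}")
print(f"Sample mean     (n={n}): {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why the sample mean is treated as an estimate of the population mean.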
There are two primary types of sampling methods that you can use in your research:
* Probability Sampling involves random selection, allowing you to make strong statistical inferences about the whole group.
* Non-probability Sampling involves non-random selection based on convenience or other criteria, allowing you to easily collect data.

Notes :- Sampling Frame

The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it should include the entire target population (and nobody who is not part of that population).

Example: Sampling Frame
You are doing research on working conditions at a social media marketing company. Your population is all 1000 employees of the company. Your sampling frame is the company's HR database, which lists the names and contact details of every employee.

Probability Sampling

There are four main types of probability samples -
> Simple Random Sampling
> Systematic Sampling
> Stratified Sampling
> Cluster Sampling

[Figure: simple random sample, systematic sample, stratified sample. Source: Scribbr]

1. Simple Random Sampling

* In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population.
* To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.

Example:
You want to select a simple random sample of 100 employees of a social media marketing company. You assign a number to every employee in the company database from 1 to 1000, and use a random number generator to select 100 numbers.

[Figure: simple random sampling]

2. Systematic Random Sampling

Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.
Example:
All employees of the company are listed in alphabetical order. From the first 10 numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.

[Figure: systematic sampling]

3. Stratified Sampling

* Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
* To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristics (e.g., gender identity, age range, income bracket, job role).
* Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.

Example:
The company has 800 female employees and 200 male employees. You want to ensure that the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100 people.

[Figure: stratified sampling]

4. Cluster Sampling

* Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the whole sample. Instead of sampling individuals from each subgroup, you randomly select entire subgroups. If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can also sample individuals from within each cluster using one of the techniques above.
This is called multistage sampling.
* This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be substantial differences between clusters. It's difficult to guarantee that the sampled clusters are really representative of the whole population.

Example:
The company has offices in 10 cities across the country (all with roughly the same number of employees in similar roles). You don't have the capacity to travel to every office to collect your data, so you use random sampling to select 3 offices. These are your clusters.

[Figure: cluster sampling]

Non-Probability Sampling

There are also 4 types of non-probability sampling -
> Convenience Sampling
> Purposive Sampling
> Snowball Sampling
> Quota Sampling

[Figure: convenience sample, purposive sample, snowball sample, quota sample. Source: Scribbr]

1. Convenience Sampling

* A convenience sample simply includes the individuals who happen to be most accessible to the researcher. This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample is representative of the population, so it can't produce generalizable results. Convenience samples are at risk for both sampling bias and selection bias.

Example:
You are researching opinions about student support services in your university, so after each of your classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university.

[Figure: convenience sampling]

Variable

* In statistical research, a variable is defined as an attribute of an object of study.
* A variable is a characteristic that can be measured and that can assume different values. Height, age, income, grades obtained at school and type of housing are all examples of variables.
Variables may be classified into two main categories -
> Quantitative/Numerical represents amounts
> Qualitative/Categorical represents groupings

[Figure: types of variables: quantitative (continuous, discrete) and categorical (ordinal, nominal, binary)]

1. Quantitative Variable

A variable that contains quantitative data is a quantitative variable. When you collect quantitative data, the numbers you record represent real amounts that can be added, subtracted, divided, etc. There are two types of quantitative variables:
> Discrete Variables
> Continuous Variables

Discrete vs Continuous

Type of variable | What does the data represent? | Examples
Discrete variables (Integer variables) | Counts of individual items or values. | Number of students in a class; number of different tree species in a forest
Continuous variables (Ratio variables) | Measurements of continuous or non-finite values. | Distance; volume; age

2. Qualitative Variable

Qualitative or categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things. There are three types of categorical variables -
> Binary Variables
> Nominal Variables
> Ordinal Variables

Binary vs nominal vs ordinal variables

Type of variable | What does the data represent? | Examples
Binary variables | Yes or no outcomes. | Heads/tails in a coin flip; win/lose in a football game
Nominal variables | Groups with no rank or order between them. | Species names; colours; brands
Ordinal variables | Groups that are ranked in a specific order. | Finishing place in a race; employee designation

Frequency Distribution And Cumulative Frequency

In a frequency distribution, the sum of all the frequencies is equal to the total number of observations. In the cumulative frequency distribution, the last cumulative frequency is the same as the total number of observations.
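The relationship between frequencies and cumulative frequencies can be checked with a short Python sketch. It uses the school-grade frequencies from the table that accompanies this section; the variable names are illustrative choices, not part of the notes.

```python
from itertools import accumulate

# Number of students per school grade (grades 1 to 6), from the table in this section.
frequencies = [23, 20, 15, 12, 10, 8]

# Running totals: each entry adds the current frequency to the previous cumulative value.
cumulative = list(accumulate(frequencies))

# The total number of observations is the sum of all frequencies.
total_observations = sum(frequencies)

print(cumulative)           # [23, 43, 58, 70, 80, 88]
print(total_observations)   # 88
```

The last cumulative frequency (88) matches the sum of the frequencies, as the paragraph above states.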
School Grade | Frequency of students | Cumulative Frequency
1 | 23 | 23
2 | 20 | 23 + 20 = 43
3 | 15 | 43 + 15 = 58
4 | 12 | 58 + 12 = 70
5 | 10 | 70 + 10 = 80
6 | 8 | 80 + 8 = 88

What is a Histogram?

A histogram is a graphical representation of data points organized into user-specified ranges. Similar in appearance to a bar graph, the histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.

[Figure: histogram of marks obtained (horizontal axis, 5 to 30) against number of students (vertical axis)]

Descriptive Statistics

1. Measure of Frequency Distribution

A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or percentages. This is called a frequency distribution.

For the variable of gender, you list all possible answers in the left-hand column. You count the number or percentage of responses for each answer and display it in the right-hand column.

Gender | Number
Male | 182
Female | 235
Other | [illegible]

From this table, you can see that more women than men or people with another gender identity took part in the study.

2. Measure of Central Tendency

Measures of central tendency estimate the centre or average of a data set. The mean, median and mode are 3 ways of finding the average.

[Figure: (a) negatively skewed, (b) normal (no skew, perfectly symmetrical) and (c) positively skewed distributions, showing the relative positions of the mean, median and mode]

* The Mean, or M, is the most commonly used method for finding the average. To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.
Mean number of library visits
Data set | 15, 3, 12, 0, 24, 3
Sum of all values | 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses | N = 6
Mean | Divide the sum of values by N to find M: 57/6 = 9.5

* The Median is the value that's exactly in the middle of a data set. To find the median, order each response value from the smallest to the biggest. Then, the median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits
Ordered data set | 0, 3, 3, 12, 15, 24
Middle numbers | 3, 12
Median | Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

* The Mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode. To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

Mode number of library visits
Ordered data set | 0, 3, 3, 12, 15, 24
Mode | Find the most frequently occurring response: 3

3. Measure of Variability or Dispersion - Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

* Range
The range gives you an idea of how far apart the most extreme response scores are. To find the range, simply subtract the lowest value from the highest value.

Range of visits to the library in the past year -
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 - 0 = 24

* Standard deviation
The standard deviation (s or SD) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is. There are six steps for finding the standard deviation:
1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N - 1.
6. Find the square root of the number you found.

Standard deviation of visits to the library in the past year -

Raw data | Deviation from mean | Squared deviation
15 | 15 - 9.5 = 5.5 | 30.25
3 | 3 - 9.5 = -6.5 | 42.25
12 | 12 - 9.5 = 2.5 | 6.25
0 | 0 - 9.5 = -9.5 | 90.25
24 | 24 - 9.5 = 14.5 | 210.25
3 | 3 - 9.5 = -6.5 | 42.25
M = 9.5 | Sum = 0 | Sum of squares = 421.5

Step 5: 421.5/5 = 84.3
Step 6: √84.3 = 9.18

From learning that s = 9.18, you can say that on average, each score deviates from the mean by 9.18 points.

* Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean. To find the variance, simply square the standard deviation. The symbol for variance is s².

Variance of visits to the library in the past year -

Population: variance $\sigma^2 = \frac{\sum (a_i - \mu)^2}{N}$, standard deviation $\sigma = \sqrt{\sigma^2}$
Sample: variance $s^2 = \frac{\sum (a_i - \bar{a})^2}{n - 1}$, standard deviation $s = \sqrt{s^2}$

where $a_i$ are the observations, $\mu$ and $\bar{a}$ are the population and sample means, and N and n are the population and sample sizes.

Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3

* Z-Score

Difference between univariate, bivariate and multivariate descriptive statistics -
> Univariate statistics summarize only one variable at a time.
> Bivariate statistics compare two variables.
> Multivariate statistics compare more than two variables.
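The worked library-visits figures in these notes (mean, median, mode, range, standard deviation and variance) can be reproduced with Python's standard `statistics` module. This is a minimal sketch rather than the notes' own method; the variable names are illustrative. Note that `statistics.variance` and `statistics.stdev` divide by n - 1 (sample formulas), while `statistics.pvariance` divides by N (population formula).

```python
import statistics

# Library visits in the past year, the data set used in the worked examples.
visits = [15, 3, 12, 0, 24, 3]

mean = statistics.mean(visits)                   # 57 / 6 = 9.5
median = statistics.median(visits)               # mean of the two middle values: 7.5
mode = statistics.mode(visits)                   # most frequent value: 3
value_range = max(visits) - min(visits)          # 24 - 0 = 24

sample_variance = statistics.variance(visits)    # 421.5 / 5 = 84.3 (divides by n - 1)
sample_std_dev = statistics.stdev(visits)        # square root of 84.3, about 9.18
population_variance = statistics.pvariance(visits)  # 421.5 / 6 = 70.25 (divides by N)

print(mean, median, mode, value_range)
print(round(sample_variance, 2), round(sample_std_dev, 2), population_variance)
```

The sample variance (84.3) matches Step 5 of the worked example, which divides by 5 rather than 6. Treating the same six values as a full population gives the smaller figure 70.25.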
