
Unit of Analysis
In predictive analytics, the unit of analysis refers to the specific entity
or object that you are trying to make predictions about. It is a
fundamental concept because it defines what you are studying and
what you want to make predictions for. The choice of the unit of
analysis is crucial as it impacts the data you collect, the modeling
techniques you use, and the interpretation of your results. Here are
some key concepts related to the unit of analysis in predictive
analytics:
Individuals or Entities: The unit of analysis can be individuals, such as customers, patients, students, or employees. It can also be entities like companies, products, or households. When you are making predictions, you are often interested in understanding or forecasting the behavior or characteristics of these individuals or entities.
Temporal Scope: The unit of analysis may have a temporal aspect. For example, in time series analysis, you might analyze data for each time point (e.g., daily, monthly) to make predictions about future time points. This is common in financial forecasting, weather prediction, and demand forecasting.
Spatial Scope: In some cases, the unit of analysis may have a spatial component. For instance, if you're predicting real estate prices, you might analyze data for specific geographic areas like neighborhoods.
Aggregation Levels: You can choose to aggregate your data at different levels. For instance, you might analyze data at an individual customer level or aggregate it at a higher level, such as the overall sales for a region. The level of aggregation depends on your research question and the insights you seek (see the sketch after this list).
Hierarchical Structure: Some datasets have a hierarchical structure with multiple levels of units of analysis. For example, in education, you may have students within classrooms within schools. Understanding the appropriate level of analysis is critical for accurate predictions.
Panel Data: Panel data involves tracking the same units of analysis over time. This can be valuable for understanding changes and making predictions; for instance, tracking the performance of the same group of employees over several years.
Cross-Sectional Data: Cross-sectional data is collected at a single point in time and does not involve tracking the same units over time. It is often used for making predictions or inferences about a population at a specific moment.
Segmentation: You can segment your data into different units of analysis based on specific characteristics or criteria. For example, segmenting customers into high-value and low-value groups for predictive marketing (see the sketch after this list).
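
To make the aggregation-level and segmentation ideas concrete, here is a minimal pandas sketch in Python. The transactions table and its columns (customer_id, region, amount) are hypothetical illustrations, not part of the original material.

    import pandas as pd

    # Hypothetical transaction-level data: several records per customer.
    tx = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "region":      ["N", "N", "S", "S", "S"],
        "amount":      [10.0, 25.0, 5.0, 40.0, 60.0],
    })

    # Customer-level unit of analysis: one record per customer.
    per_customer = tx.groupby("customer_id", as_index=False)["amount"].sum()

    # Higher aggregation level: overall sales for each region.
    per_region = tx.groupby("region", as_index=False)["amount"].sum()

    # Segmentation: split customers into low-value and high-value halves.
    per_customer["segment"] = pd.qcut(per_customer["amount"], 2,
                                      labels=["low-value", "high-value"])
    print(per_customer)

The same underlying records support two different units of analysis (customer versus region), which is exactly why the choice matters before modeling begins.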
In predictive analytics, the choice of the unit of analysis
is driven by your research objectives and the availability
of data. It is important to select the most appropriate unit
of analysis to ensure that your predictive models provide
valuable insights and accurate predictions for the specific
entities or phenomena you are interested in.
You can create a dataset with one record per customer in three ways. The first option is to keep one of the records in each group; in IBM SPSS Modeler this operation is called Distinct.

In this example the group of records is defined by ID, and the first record of each ID is retained. Because the data are sorted in ascending order by Product, the record you retain is the most recent one.
Another method is to summarize the information over the records in the group. This option is called Aggregate.

The last method is useful for transforming a nominal field into a series of flag fields, so that the categories make up the columns of the dataset instead of the rows. This operation is called SetToFlag. In this example ID defines a group of records, and the nominal field PRODUCT, with categories A, B, C, and D, is transformed into a new dataset with one record per ID and fields A, B, C, and D flagging whether the customer has purchased the particular product. A rough equivalent is sketched below.
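
Here is a rough pandas sketch in Python of the three operations just described. Only the ID and PRODUCT field names come from the example; the data values and the AMOUNT column are made up.

    import pandas as pd

    df = pd.DataFrame({
        "ID":      [1, 1, 2, 2, 2],
        "PRODUCT": ["A", "B", "A", "C", "D"],
        "AMOUNT":  [10, 20, 5, 15, 30],
    })

    # Distinct: keep the first record of each ID after sorting.
    distinct = df.sort_values(["ID", "PRODUCT"]).drop_duplicates("ID", keep="first")

    # Aggregate: summarize the records in each group.
    aggregate = df.groupby("ID", as_index=False)["AMOUNT"].sum()

    # SetToFlag: one record per ID, one flag column per PRODUCT category.
    flags = (pd.crosstab(df["ID"], df["PRODUCT"]) > 0).astype(int).reset_index()
    print(flags)  # columns: ID, A, B, C, D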
Integrate Data

Integrating data in predictive analytics is a critical step in the process of building accurate and robust predictive models. Data integration involves bringing together data from multiple sources, often in different formats and structures, to create a unified dataset that can be used for analysis and modeling. Here are the key steps and considerations for integrating data in predictive analytics:
Data Collection and Sourcing:
1. Identify and collect relevant data from various sources, which may include databases, spreadsheets, APIs, external data providers, and more.
2. Ensure that the data you collect align with your research objectives and the unit of analysis you have defined.
Data Cleaning and Preprocessing:
1. Clean and preprocess each dataset separately to handle missing values, outliers, and inconsistencies.
2. Standardize data formats, units, and scales to ensure that data from different sources can be compared and combined.
Data Transformation:
1. Transform and reshape the data as needed. This may involve aggregating or joining datasets to create a unified dataset.
2. Perform feature engineering to create new variables that may enhance predictive power.
Data Integration:
1. Merge or join the cleaned and transformed datasets into a single, integrated dataset. You may use common keys or identifiers to link the data.
2. Pay attention to the type of join (e.g., inner join, left join, right join) to ensure you don't lose any valuable information.
Data Imputation: Address missing data by using appropriate imputation techniques, such as mean imputation, regression imputation, or domain-specific knowledge.
Data Scaling and Normalization: Normalize or scale variables if needed to ensure that they have similar scales, especially when using algorithms like neural networks or support vector machines.
Data Splitting: Split the integrated dataset into training, validation, and test sets for model development and evaluation. Cross-validation may also be used. (The main steps above are illustrated in the pipeline sketch after this list.)
Data Security and Privacy: Ensure that data integration practices comply with relevant data security and privacy regulations. Anonymize or pseudonymize sensitive information as needed.
Documentation: Maintain thorough documentation of the data integration process, including data sources, transformations, and any decisions made during integration.
Data Monitoring: Implement monitoring procedures to detect data drift or changes in data quality over time, as this can impact the performance of predictive models.
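The following is a minimal end-to-end sketch in Python (pandas and scikit-learn) of the joining, imputation, scaling, and splitting steps above. All table names, column names, and values (customers, usage, churn, and so on) are hypothetical illustrations, not part of the original material.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Two hypothetical sources that share a common key, ID.
    customers = pd.DataFrame({"ID": [1, 2, 3, 4],
                              "age": [34, None, 51, 42],
                              "churn": [0, 1, 0, 1]})
    usage = pd.DataFrame({"ID": [1, 2, 3, 4],
                          "minutes": [120.0, 300.0, None, 80.0]})

    # Data integration: a left join on the common key keeps every customer.
    data = customers.merge(usage, on="ID", how="left")

    features = ["age", "minutes"]

    # Data imputation: fill missing values with the column mean.
    data[features] = SimpleImputer(strategy="mean").fit_transform(data[features])

    # Data scaling: standardize features to zero mean and unit variance.
    # (In practice, fit the imputer and scaler on the training split only,
    # to avoid leaking information from the test set.)
    data[features] = StandardScaler().fit_transform(data[features])

    # Data splitting: hold out a test set for model evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], data["churn"], test_size=0.25, random_state=42)
    print(X_train.shape, X_test.shape)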

CLEM Expression
SPSS CLEM is the Control Language for Expression Manipulation, which is used to build expressions within SPSS Modeler streams. CLEM is actually used in a number of SPSS Modeler nodes (among these are the Select and Derive nodes).

CLEM is used within SPSS Modeler to:
•Compare and evaluate conditions on record fields
•Derive values for new fields
•Derive new values for existing fields
•Reason about the sequence of records
•Insert data from records into reports
CLEM expressions are essential for data preparation in SPSS Modeler and can be used in a wide range of nodes, from record and field operations (Select, Balance, Filler) to plots and output (Analysis, Report, Table). For example, you can use CLEM in a Derive node to create a new field based on a formula such as a ratio of two fields.

CLEM expressions can also be used for global search and replace operations.
For example, the expression @NULL(@FIELD) can be used in a Filler node
to replace system-missing values with the integer value 0. (To replace user-
missing values, also called blanks, use the @BLANK function.)
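
As a rough analogue, here is what those two CLEM uses (a Derive-node ratio and a Filler-node null replacement) might look like in Python with pandas; the column names and values are hypothetical.

    import pandas as pd

    df = pd.DataFrame({"revenue": [100.0, 250.0, None],
                       "cost":    [40.0, 125.0, 60.0]})

    # Derive-node analogue: a new field computed from a formula such as a ratio.
    df["ratio"] = df["revenue"] / df["cost"]

    # Filler-node analogue of @NULL(@FIELD) -> 0: replace null values with 0.
    df = df.fillna(0)
    print(df)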
CLEM datatypes
This section covers CLEM datatypes. CLEM datatypes can be made up of any of the following:
•Integers
•Reals
•Characters
•Strings
•Lists
•Fields
•Date/Time
Rules for quoting

Although SPSS Modeler is flexible when you're determining the fields, values, parameters, and strings
used in a CLEM expression, the following general rules provide a list of best practices to use in creating
expressions:

•Strings: Always use double quotes when writing strings, such as "Type 2". Single quotes can be used instead but at the risk of confusion with quoted fields.

•Fields: Use single quotes only where necessary to enclose spaces or other special characters, such as 'Order Number'. Fields that are quoted but undefined in the data set will be misread as strings.

•Parameters: Always use single quotes when using parameters, such as '$P-threshold'.

•Characters: Always use single backquotes (`), such as stripchar(`d`, "drugA").


Integers

Integers are represented as a sequence of decimal digits. Optionally, you can place a minus sign (−) before the integer to denote a negative number (for example, 1234, 999, −77).
Reals

Real refers to a floating-point number. Reals are represented by one or more digits followed by a decimal point followed by one or more digits. CLEM reals are held in double precision. Optionally, you can place a minus sign (−) before the real to denote a negative number (for example, 1.234, 0.999, −77.001).
Strings

Generally, you should enclose strings in double quotation marks. Examples of strings are "c35product2" and "referrerID".
Lists

A list is an ordered sequence of elements, which may be of mixed type. Lists are enclosed in square brackets ([ ]). Examples of lists are [1 2 4 16] and ["abc" "def"] and [A1, A2, A3].
Fields

Names in CLEM expressions that aren't names of functions are assumed to be field names. For example: 'Power Increase', '2nd answer', '#101', '$P-NextField'.
Date

Table 1. CLEM language date formats

Format        Examples
DDMMYY        150163
MMDDYY        011563
YYMMDD        630115
YYYYMMDD      19630115
YYYYDDD       Four-digit year followed by a three-digit number representing the day of the year; for example, 2000032 represents the 32nd day of 2000, or 1 February 2000.
DAY           Day of the week in the current locale; for example, Monday, Tuesday, ..., in English.
MONTH         Month in the current locale; for example, January, February, ....
DD/MM/YY      15/01/63
DD/MM/YYYY    15/01/1963
MM/DD/YY      01/15/63
MM/DD/YYYY    01/15/1963
DD-MM-YY      15-01-63
DD-MM-YYYY    15-01-1963
MM-DD-YY      01-15-63
MM-DD-YYYY    01-15-1963
DD.MM.YY      15.01.63
DD.MM.YYYY    15.01.1963
MM.DD.YY      01.15.63
MM.DD.YYYY    01.15.1963
DD-MON-YY     15-JAN-63, 15-jan-63, 15-Jan-63
DD/MON/YY     15/JAN/63, 15/jan/63, 15/Jan/63
DD.MON.YY     15.JAN.63, 15.jan.63, 15.Jan.63
DD-MON-YYYY   15-JAN-1963, 15-jan-1963, 15-Jan-1963
DD/MON/YYYY   15/JAN/1963, 15/jan/1963, 15/Jan/1963
DD.MON.YYYY   15.JAN.1963, 15.jan.1963, 15.Jan.1963
MON YYYY      Jan 2004
q Q YYYY      Digit (1–4) representing the quarter, followed by the letter Q and a four-digit year; for example, 25 December 2004 would be represented as 4 Q 2004.
ww WK YYYY    Two-digit number representing the week of the year, followed by the letters WK and then a four-digit year. The week of the year is calculated assuming that the first day of the week is Monday and there is at least one day in the first week.
Time

The CLEM language supports the time formats listed in this section.

Table 2. CLEM language time formats

Format          Examples
HHMMSS          120112, 010101, 221212
HHMM            1223, 0745, 2207
MMSS            5558, 0100
HH:MM:SS        12:01:12, 01:01:01, 22:12:12
HH:MM           12:23, 07:45, 22:07
MM:SS           55:58, 01:00
(H)H:(M)M:(S)S  12:1:12, 1:1:1, 22:12:12
(H)H:(M)M       12:23, 7:45, 22:7
(M)M:(S)S       55:58, 1:0
HH.MM.SS        12.01.12, 01.01.01, 22.12.12
HH.MM           12.23, 07.45, 22.07
MM.SS           55.58, 01.00
(H)H.(M)M.(S)S  12.1.12, 1.1.1, 22.12.12
(H)H.(M)M       12.23, 7.45, 22.7
(M)M.(S)S       55.58, 1.0


CLEM operators
Identifying the modeling objective.
Identifying the modeling objective is a crucial step in any modeling
or data analysis process. It involves defining the specific goal or
purpose of your modeling effort. Without a well-defined objective,
your analysis may lack direction, and it can be challenging to make
meaningful decisions or draw valuable insights from your data. Here
are some key steps to help you identify and clarify your modeling
objective:
1. Understand the Problem:
Start by gaining a deep understanding of the problem you
want to address or the question you want to answer. What is
the context, and why is this problem important? Who are the
stakeholders, and what are their needs or expectations?
2. Define the Scope:
Determine the scope of your modeling project. What are the
boundaries and constraints? Consider the available data,
resources, and time. Understanding these limitations can help
you set realistic modeling objectives.
3. Specify the Research Question:
Formulate a clear and specific research question that your
modeling will address. This question should be focused and clearly stated. For example, "Can we predict customer churn based on historical data?" is a well-defined research question.
4. Determine the Type of Model:
Decide what type of model you intend to build. Is it a predictive
model, a descriptive model, a classification model, or a
regression model? The type of model will depend on the nature
of the problem and the information you want to extract.
5. Select Performance Metrics:
Choose appropriate performance metrics or evaluation criteria that align with your modeling objective. For example, if you're building a predictive model, you might use metrics like accuracy, precision, recall, or F1 score (see the sketch after this list).
6. Consider End Goals:
Think about the end goals of your modeling project. What
actions or decisions will be based on the model's results?
Understanding the practical applications of your model can
help refine your objective.
7. Data Requirements:
Identify the data needed to achieve your modeling objective.
Consider data sources, data quality, and data availability. If
necessary, outline a data collection or data preprocessing plan.
8. Hypothesize Outcomes:
Make initial hypotheses about the expected outcomes of your
modeling. What do you believe the model will reveal, and how
will it address the research question?
9. Stakeholder Involvement:
Involve relevant stakeholders in the process of defining the
modeling objective. Their insights and expectations are
valuable for ensuring that the objective aligns with the needs of
the business or organization.
10. Document the Objective:
Finally, document your modeling objective in a clear and
concise manner. This documentation will serve as a reference
point throughout the modeling process and help communicate
the objective to others involved in the project.
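
To accompany step 5, here is a small Python sketch using scikit-learn's metric functions; the label vectors are made up purely for illustration.

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # actual labels (hypothetical)
    y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # labels predicted by some model

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1       :", f1_score(y_true, y_pred))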
