
EUC1502 Module1 Machine Learning

This document provides an overview of classical econometrics, including multiple regression analysis and time series models. It defines econometrics as the application of statistics and mathematics to identify relationships between predicted and predictor variables. Multiple regression analysis uses the ordinary least squares method to estimate coefficients and predict variable values while minimizing residuals. Time series models analyze variables measured sequentially over time, such as GDP, employment, births and population. The document also discusses issues like collinearity among predictor variables.

Module 1.

Classical vs machine learning econometrics

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Eurostat
▪ Overview of classical econometrics:

➢ What is econometrics

➢ Multiple regression

➢ Time series

▪ Econometric methods in Official statistics

▪ Models of inference
1. What is econometrics

Overview of classical econometrics
What is econometrics

Econometrics is an application of statistics and mathematics aimed at identifying and quantifying the relationship between two sets of variables:

▪ The predicted variables


▪ The predictor variables

Y = β0 + β1 X1 + … + βk Xk + ε

Aspects:
▪ Uncertainty regarding an outcome
▪ Relationships suggested by (economic) theory
▪ Assumptions and hypotheses to be specified
▪ Sampling process including functional form
▪ Obtaining data for the analysis
▪ Estimation rule with good statistical properties
▪ Fit and test model using software package
▪ Analyse and evaluate implications of the results
▪ Problems suggest approaches for further research


Examples of econometric models:


▪ Demand and supply Models

▪ Production Functions

▪ Cost Functions

▪ Etc.


Demand model

$\ln y_t^{d} = \beta_1 + \beta_2 \ln x_t + \varepsilon_t$

where $y_t^{d}$ is the quantity demanded and $x_t$ is the price.

Supply model

$\ln y_t^{s} = \beta_1 + \beta_2 \ln x_t + \varepsilon_t$

where $y_t^{s}$ is the quantity supplied and $x_t$ is the price.

Production function (Cobb-Douglas production function)

$\ln y_t = \beta_1 + \beta_2 \ln x_t + \varepsilon_t$

where $y_t$ is the output and $x_t$ is the input.

Cost function

$y_t = \beta_1 + \beta_2 x_t^{2} + \varepsilon_t$

where $y_t$ is the total cost and $x_t$ is the output.

There are also non-linear models:


$y = \beta_1 \alpha^{\beta_2 x} e^{u}$

And models that can be linearised:

$y = \beta_1 x^{\beta_2} e^{u}$

$\ln y = \ln \beta_1 + \beta_2 \ln x + u = \alpha + \beta_2 \ln x + u$, where $\alpha = \ln \beta_1$
2. The multiple linear regression model

Y = β0 + β1 X1 + … + βk Xk + ε

▪ Y: predicted variable (dependent variable)
▪ β0: intercept
▪ β1, …, βk: slopes
▪ X1, …, Xk: predictor variables (independent variables)
▪ ε: error term


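Ordinary Least Squares method

Minimise $\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik} \right)^2$

The minimising values $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are the OLS estimators of β0, β1, …, βk.

Each slope estimator $\hat{\beta}_j$ gives the variation of the predicted y for a one-unit variation of $x_j$, holding the other variables constant:

$\Delta \hat{y} = \hat{\beta}_j \, \Delta x_j$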

Predicted value of $y_i$: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_k x_{ik}$

Residual (the sample estimate of the error term): $\hat{\varepsilon}_i = y_i - \hat{y}_i$
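
The OLS steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, the helper names `solve` and `ols` are hypothetical, and the normal equations are solved by plain Gaussian elimination rather than by any particular statistical package.

```python
# Minimal OLS sketch: solve the normal equations (X'X) b = X'y in pure Python.
# The data and helper names are invented for illustration only.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(x_rows, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in x_rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Five observations of two predictors (x1, x2) and the predicted variable y
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9.1, 7.9, 19.2, 17.8, 26.0]

beta_hat = ols(x_rows, y)
y_hat = [beta_hat[0] + sum(b * x for b, x in zip(beta_hat[1:], row)) for row in x_rows]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]

print("coefficients:", [round(b, 3) for b in beta_hat])
print("residuals:", [round(e, 3) for e in residuals])
```

Because the model includes an intercept, the residuals sum to zero and are uncorrelated with each predictor in the sample, which is a quick sanity check on any OLS fit.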


Assumptions:
▪ $E(\varepsilon_i \mid X_i) = 0$: $\varepsilon_i$ has conditional mean zero

▪ $(X_i, Y_i)$, $i = 1, \dots, n$, are i.i.d.

▪ $X_i$ and $\varepsilon_i$ have nonzero finite fourth moments

▪ There is no perfect multicollinearity (see later)

▪ $\mathrm{var}(\varepsilon_i \mid X_i) = \sigma_\varepsilon^2$: homoskedasticity

▪ The conditional distribution of $\varepsilon_i$ given $X_i$ is normal



Goodness of Fit

Total sum of squares (total variation of y): TSS = $\sum_{i=1}^{n} (y_i - \bar{y})^2$

Or

TSS = ESS + RSS

▪ Explained Sum of Squares: variation explained by the model, i.e. variation of Y explained by X: ESS = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
▪ Residual Sum of Squares: residual variation, i.e. variation explained by the residuals: RSS = $\sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $u_i$ denotes the residual of observation i

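$R^2 = \dfrac{ESS}{TSS}$, with $0 \le R^2 \le 1$

It can also be written as $R^2 = 1 - \dfrac{RSS}{TSS}$

Adjusted $R^2 = 1 - (1 - R^2)\,\dfrac{n-1}{n-k-1}$, where k is the number of predictor variables (the intercept is not counted)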
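
The goodness-of-fit quantities can be checked numerically. The sketch below uses invented data and a single predictor, and it takes k as the number of predictor variables, so that the fitted model has k + 1 coefficients including the intercept.

```python
# Goodness-of-fit sketch for a simple regression y = b0 + b1 * x (invented data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)
k = 1  # number of predictor variables (assumption: the intercept is not counted)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS slope and intercept for one predictor
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation of y
ess = sum((fi - y_bar) ** 2 for fi in y_hat)           # variation explained by the model
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual variation

r2 = ess / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("TSS:", round(tss, 4), "ESS + RSS:", round(ess + rss, 4))
print("R2:", round(r2, 4), "adjusted R2:", round(adj_r2, 4))
```

The adjusted R2 is never larger than R2, because the factor (n - 1)/(n - k - 1) penalises each additional predictor.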
Collinearity

▪ The term “independent variable” means an explanatory variable is independent of the error term, but not necessarily independent of other explanatory variables.

▪ Since economists typically have no control over the implicit “experimental design”, explanatory variables tend to move together, which often makes sorting out their separate influences rather problematic.


Evidence of high collinearity includes:

▪ a high pairwise correlation between two explanatory variables
▪ a high R-squared when regressing each explanatory variable, one at a time, on the remaining explanatory variables
▪ a statistically significant F-value when the t-values are statistically insignificant
▪ an R2 that doesn’t fall by much when dropping any of the explanatory variables
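
The first two diagnostics in the list can be illustrated with a short sketch. The data are invented, and the variance inflation factor (VIF) computed at the end is a common companion diagnostic that is not mentioned in the slides; it is included here as an assumption about what a reader might compute next.

```python
from math import sqrt


def pearson(a, b):
    """Pairwise (Pearson) correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


# Invented explanatory variables: x2 moves almost exactly with x1, x3 does not
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)

# With a single regressor, the R-squared of the auxiliary regression of x2 on x1
# equals the squared pairwise correlation.
aux_r2 = r12 ** 2
vif = 1 / (1 - aux_r2)  # variance inflation factor (not in the slides)

print("corr(x1, x2):", round(r12, 4), "corr(x1, x3):", round(r13, 4))
print("auxiliary R2:", round(aux_r2, 4), "VIF:", round(vif, 1))
```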

▪ Collinearity doesn’t mean the model is misspecified

▪ It is an especially common problem in time series regressions

▪ It stems from a lack of adequate information in the sample


Some solutions:

➢ collect more data with better information
➢ impose economic restrictions as appropriate
➢ impose statistical restrictions when justified
➢ if all else fails, at least point out that the poor model performance might be due to the collinearity problem (or it might not).
3. Time series models


A collection of observations made sequentially in time (stochastic process)

Examples:
- Unemployment rate over time
- Inflation rate
- Production indices
- Number of deaths/births
- Etc.


[Figure: Spanish quarterly GDP from 1995 to 2011]
[Figure: Total employees in Spain from 1980 to 2004, quarterly variation]
[Figure: Number of births in Spain from 1975 to 2013, monthly data]
[Figure: Total population of Spain from 1971 to 2016]

Univariate time series describe the behaviour of a variable in terms of its own past values:

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$

where $\beta_0$ is the intercept, $\beta_1$ is the coefficient and $\varepsilon_t$ is the random error (white noise).

Multivariate time series describe the behaviour of a variable in terms of its own past values and the past values of other variables:

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \delta_1 X_{t-1} + \varepsilon_t$

First order autoregression, AR(1)

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$

Second order autoregression, AR(2)

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \varepsilon_t$

pth order autoregression, AR(p)

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t$

We use OLS to estimate the coefficients

Forecast: $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 Y_{t-1}$

Forecast error: $Y_t - \hat{Y}_t$

▪ The forecast error is not a residual

▪ The forecast and the forecast error pertain to “out-of-sample” observations (in contrast to “in-sample” observations, to which predicted values and residuals pertain)
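
The difference between residuals and a forecast error can be made concrete with a simulated AR(1) series. This is a sketch under assumptions: the "true" parameters, the seed and the series length are invented, and the forecast error is taken as the actual value minus the forecast.

```python
import random

rng = random.Random(0)
beta0, beta1 = 1.0, 0.6  # illustrative "true" AR(1) parameters
n = 2000

# Simulate Y_t = beta0 + beta1 * Y_{t-1} + eps_t, starting at the unconditional mean
y = [beta0 / (1 - beta1)]
for _ in range(n):
    y.append(beta0 + beta1 * y[-1] + rng.gauss(0.0, 1.0))

# Hold out the last observation: it is "out-of-sample"
in_sample, new_obs = y[:-1], y[-1]

# OLS of Y_t on Y_{t-1} using the in-sample observations only
lagged = in_sample[:-1]
current = in_sample[1:]
mean_lag = sum(lagged) / len(lagged)
mean_cur = sum(current) / len(current)
b1_hat = (sum((l - mean_lag) * (c - mean_cur) for l, c in zip(lagged, current))
          / sum((l - mean_lag) ** 2 for l in lagged))
b0_hat = mean_cur - b1_hat * mean_lag

# Residuals belong to observations used in the estimation
residuals = [c - (b0_hat + b1_hat * l) for l, c in zip(lagged, current)]

# The forecast and its error belong to the held-out observation
forecast = b0_hat + b1_hat * in_sample[-1]
forecast_error = new_obs - forecast

print("estimates:", round(b0_hat, 3), round(b1_hat, 3))
print("forecast:", round(forecast, 3), "forecast error:", round(forecast_error, 3))
```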


Lag length selection (choosing the order p):


▪ F-statistics approach
▪ BIC (Bayes Information Criterion)
▪ AIC (Akaike Information Criterion)
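
A sketch of lag selection with the two information criteria follows. It assumes the textbook forms BIC(p) = ln(SSR(p)/T) + (p + 1) ln(T)/T and AIC(p) = ln(SSR(p)/T) + (p + 1) 2/T, fits every candidate AR(p) on the same observations, and uses a simulated series; the helper names `solve` and `ar_ssr` are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ar_ssr(y, p, p_max):
    """Fit AR(p) by OLS on t = p_max, ..., len(y) - 1 and return (SSR, T)."""
    rows = [[1.0] + [y[t - j] for j in range(1, p + 1)] for t in range(p_max, len(y))]
    target = y[p_max:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    ssr = sum((yt - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yt in zip(rows, target))
    return ssr, len(target)


# Simulated AR(1) series (invented parameters)
rng = random.Random(1)
y = [0.0]
for _ in range(500):
    y.append(0.5 + 0.7 * y[-1] + rng.gauss(0.0, 1.0))

p_max = 4
criteria = {}
for p in range(1, p_max + 1):
    ssr, t_used = ar_ssr(y, p, p_max)
    aic = math.log(ssr / t_used) + (p + 1) * 2 / t_used
    bic = math.log(ssr / t_used) + (p + 1) * math.log(t_used) / t_used
    criteria[p] = (ssr, aic, bic)

best_aic = min(criteria, key=lambda p: criteria[p][1])
best_bic = min(criteria, key=lambda p: criteria[p][2])
print("lag chosen by AIC:", best_aic, "lag chosen by BIC:", best_bic)
```

Adding a lag can never increase the SSR, so both criteria trade the fit against a penalty per coefficient; the BIC penalty is heavier, which is why BIC never selects a longer lag than AIC.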


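Moving average process

$Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}$    MA(1)
...
$Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$    MA(q)

MA processes depend not on the level of the last time point, but rather on the level of the last time point’s error (ε)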
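
A simulated MA(1) series shows this short memory. In the sketch below the parameters and seed are invented; the theoretical lag-1 autocorrelation of an MA(1) process is θ/(1 + θ²), and the autocorrelation at lag 2 and beyond is zero.

```python
import random


def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var


rng = random.Random(0)
mu, theta = 2.0, 0.5  # illustrative MA(1) parameters
n = 5000

eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
# Y_t = mu + eps_t + theta * eps_{t-1}
y = [mu + eps[t] + theta * eps[t - 1] for t in range(1, n + 1)]

rho1 = autocorr(y, 1)
rho2 = autocorr(y, 2)
theory_rho1 = theta / (1 + theta ** 2)

print("lag-1 autocorrelation:", round(rho1, 3), "theory:", round(theory_rho1, 3))
print("lag-2 autocorrelation:", round(rho2, 3), "theory: 0")
```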


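ARMA process

$Y_t = \underbrace{\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p}}_{\text{AR}(p)} + \underbrace{\varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}}_{\text{MA}(q)}$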


Nonstationarity
▪ Most economic variables (GDP, consumption, price level, etc.) are non-stationary (upward or downward trend over time)

▪ A time series is non-stationary when the probability distribution of $Y_t$ changes over time

▪ Many non-stationary time series can be made stationary by differencing them one or more times (integrated processes)

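Nonstationarity can be:
▪ Deterministic
▪ Stochastic

Random walk: $Y_t = Y_{t-1} + \varepsilon_t$

Random walk with drift: $Y_t = \beta_0 + Y_{t-1} + \varepsilon_t$

The random walk is a specific case of AR(1) with $\beta_1 = 1$ (and $\beta_0 = 0$ when there is no drift)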


If $\beta_1 = 1$: nonstationary time series

If $|\beta_1| < 1$: stationary time series

$\beta_1 = 1$ is called a unit root


If a time series has a stochastic trend (i.e. a unit root), the first difference of the series does not have a trend:

$Y_t - Y_{t-1} = \beta_0 + \varepsilon_t$

$\Delta Y_t$ is stationary

$Y_t$ is said to be integrated of order one, I(1)
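
The effect of first differencing can be verified on a simulated random walk with drift. The drift, seed and length below are invented for illustration.

```python
import random

rng = random.Random(0)
beta0 = 0.2  # illustrative drift
n = 200

eps = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Random walk with drift: Y_t = beta0 + Y_{t-1} + eps_t
y = [0.0]
for e in eps:
    y.append(beta0 + y[-1] + e)

# First difference: Delta Y_t = Y_t - Y_{t-1}
diff = [y[t] - y[t - 1] for t in range(1, len(y))]

# Each difference is just the drift plus the shock, so the trend has gone
largest_gap = max(abs(d - (beta0 + e)) for d, e in zip(diff, eps))
print("largest gap between Delta Y_t and beta0 + eps_t:", largest_gap)
print("mean of the differenced series:", round(sum(diff) / len(diff), 3))
```

The level series wanders without returning to a fixed mean, while the differenced series fluctuates around the drift with constant variance.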


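▪ $Y_t$ is said to be integrated of order d, I(d), if it becomes stationary after being differenced d times
▪ The resulting model is the ARIMA model

$\Delta^d Y_t = \beta_0 + \beta_1 \Delta^d Y_{t-1} + \beta_2 \Delta^d Y_{t-2} + \dots + \beta_p \Delta^d Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$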

The Box-Jenkins approach:
▪ Identification
Inspect the data for stationarity, take first differences where needed, and identify p and q
▪ Estimation
Apply the least squares method (linear or non-linear)

▪ Validation
Check that the estimated model fits well and that its residuals show no autocorrelation

4. Econometric methods in Official statistics

Regression methods - Hedonic prices

Price comparisons over time and across countries are strongly affected by the statistical treatment of changes in product quality over time and differences in product quality across countries.

The matching method is not adequate to deal with substantial changes or differences in quality, which bias the price index:
• The inside-the-sample bias: prices of non-identical products are matched
• The outside-the-sample bias: price changes of matched items are not representative of price changes of unmatched items


A Hedonic Price Index uses regression analysis to estimate the effect of individual characteristics, the determinants of quality, on a product’s price.

$p_i = h(z_i) + \varepsilon_i$

where $h(z_i)$ is a function of the quality characteristics and $\varepsilon_i$ is the error term.


Hedonic modelling (multiple linear regression)

Fully linear model

$p_n^t = \beta_0^t + \sum_{k=1}^{K} \beta_k^t Z_{nk}^t + \varepsilon_n^t$

Logarithmic-linear model

$\ln p_n^t = \beta_0^t + \sum_{k=1}^{K} \beta_k^t Z_{nk}^t + \varepsilon_n^t$
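
A logarithmic-linear hedonic regression with a single characteristic can be sketched as follows. Everything in the example is an assumption made for illustration: the characteristic (floor area), the prices (constructed without noise from known coefficients so that the fit can be checked) and the 90 m² reference dwelling.

```python
import math

# Invented data: one quality characteristic (floor area in m2) and a price
# constructed, without noise, from ln(p) = 11 + 0.01 * area
area = [50.0, 60.0, 80.0, 100.0, 120.0]
price = [math.exp(11.0 + 0.01 * a) for a in area]

log_price = [math.log(p) for p in price]
n = len(area)
a_bar = sum(area) / n
lp_bar = sum(log_price) / n

# OLS for the logarithmic-linear model ln(p) = b0 + b1 * area
b1 = (sum((a - a_bar) * (lp - lp_bar) for a, lp in zip(area, log_price))
      / sum((a - a_bar) ** 2 for a in area))
b0 = lp_bar - b1 * a_bar

# In a log-linear model the coefficient is a semi-elasticity:
# one extra m2 changes the price by about 100 * (exp(b1) - 1) percent
pct_per_m2 = 100 * (math.exp(b1) - 1)

# Quality-adjusted price predicted for a dwelling of 90 m2
predicted_price_90 = math.exp(b0 + b1 * 90.0)

print("b0:", round(b0, 4), "b1:", round(b1, 4))
print("price change per extra m2 (%):", round(pct_per_m2, 3))
print("predicted price for 90 m2:", round(predicted_price_90))
```

Comparing predicted prices for a fixed set of characteristics across periods is what lets a hedonic index separate pure price change from quality change.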


Applications
• Housing prices
• ICT product prices
• Producer prices


Advantages

✓ Offers a solution for the quality problem in price indices and international comparisons, provided sufficient information on characteristics can be obtained

✓ It is used to estimate the willingness to pay for, or the marginal cost of producing, the characteristics, or the underlying demand or supply functions of these characteristics and the corresponding consumer or producer surplus


Difficulties

✓ Characteristics should represent user value and user cost
✓ Needs large datasets (big data)
✓ Excluded variables
✓ Other price-determining variables: price mark-ups
✓ New features
✓ Multicollinearity
✓ Small quantities

Hedonic prices and machine learning

Data sources: GIS land data (big data)

Hedonic function: used to estimate the value associated with land characteristics, accessibility, externalities and expectations of future land developments

$\ln \text{Price} = \alpha + \text{LandCharacteristics}'\beta_1 + \text{Accessibility}'\beta_2 + \text{Externalities}'\beta_3 + \text{Expectations}'\beta_4 + \text{Zoning}'\beta_5 + \text{Controls}'\beta_6 + \varepsilon$


Figure 1. Effect of parcel characteristics on land prices.

The map shows log price differentials generated by different values associated with land characteristics.

That is, for each observation, the price predicted by the model from all its land characteristics is compared with the price predicted for an ad-hoc observation that takes the mean value of each explanatory variable in this group.

▪ Observations that reduce prices are mainly seen towards the west, where price differentials reach -148%
▪ The majority of observations in the city have predicted prices 20% to 57% higher than the observation with average land characteristics in the sample

Figure 2. Accessibility values.

The map shows log price differentials generated by different accessibility values.

That is, for each observation, the price predicted by the model from all its accessibility variables is compared with the price predicted for an ad-hoc observation that takes the mean value of each explanatory variable in this group.

Proximity to the city center adds value to the land

Deseasonalisation

Classical vs machine learning econometrics
Aspect | Traditional econometrics | Machine learning econometrics
Data features | Small to medium size; monthly or quarterly data (in Official Statistics); “reasonable” number of variables | Large size; high frequency; high-dimensional datasets
Model definition (usual) | Model-based relationships between variables, grounded on (economic) theory, e.g. multiple linear regression, time series (ARIMA) | Algorithm-based
Model selection and estimation | Expert’s knowledge; distribution of asymptotic significance methods | Artificial intelligence (but it does not avoid knowing what type of technique to apply!); automatic optimization (“regularization”) of models
Assumptions | Rigid distributional and independence assumptions | No assumptions

5. Models of inference

The context of Official Statistics

• Budget restrictions to carry out traditional surveys


• Increasing concern for response burden
• Increasing non-response
• New sources of available data:
➢ Administrative data
➢ Big Data sources: traffic sensors, M2M transactions, social
media, satellite images…
• Development of mathematical-statistical methods and IT
tools that allow for other forms of data treatment

The objectives of statistical inference

▪ The purpose of statistical inference is to obtain information about a population (finite or infinite) from a sample of this population

▪ Stochastic assumptions about the individual observations and/or the population are made

▪ Statistical information of interest includes totals, means, proportions, ratios, quantiles, etc. or the probability distribution of a random variable

Overview of different modes of inference (paradigms)

▪ Design-based

▪ Model-assisted

▪ Model-based (predictive)

▪ Algorithm-based (predictive)

Design-based inference

▪ Traditionally used by National Statistical Institutes (NSIs)
➢ Use of surveys to collect data
➢ NSIs prefer not to rely on model assumptions,
particularly if they are not verifiable
➢ Statistical (mathematical) models may be difficult to
understand, communicate or even calculate in a
production environment
➢ The concepts of random sample, sampling error,
weighting observations, etc. are familiar to (educated)
users of Official Statistics
Design-based inference: estimation

▪ Estimators (of a mean, a total, a proportion) are obtained by expanding or weighting the observations in the sample with survey weights
➢ Survey weights are derived from the sample design and available auxiliary information

▪ The statistical properties of estimators are based on the probability distributions from the sampling design
➢ Design-based estimators have «good» statistical properties such as asymptotic unbiasedness

Design-based inference: theoretical example

▪ Horvitz-Thompson estimator of a total

$\hat{Y}_{HT} = \sum_{i \in S} \dfrac{1}{\pi_i} y_i$

where $\pi_i$ is the probability of selection of unit i, and $1/\pi_i$ is the weight of unit i calculated on the basis of the design:
➢ Stratification (auxiliary variables that define the strata)
➢ Sample size
➢ Corrections for non-response, calibration, etc.
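
The Horvitz-Thompson estimator can be illustrated with a small stratified sample. The strata, population sizes and observed values are invented, and the sketch assumes simple random sampling within each stratum, so that the selection probability is the stratum sample size divided by the stratum population size; non-response corrections and calibration are left out.

```python
# Invented stratified sample: each unit is (stratum, observed y value)
population_size = {"A": 100, "B": 50}
sample = [("A", 10.0), ("A", 12.0),
          ("B", 3.0), ("B", 4.0), ("B", 5.0), ("B", 4.0), ("B", 4.0)]

# Count the sampled units per stratum
sample_size = {}
for stratum, _ in sample:
    sample_size[stratum] = sample_size.get(stratum, 0) + 1

# Assumption: simple random sampling within strata, so pi_i = n_h / N_h
selection_prob = {h: sample_size[h] / population_size[h] for h in population_size}

# Design weight of each sampled unit: 1 / pi_i
weights = [1 / selection_prob[h] for h, _ in sample]

# Horvitz-Thompson estimate of the population total of y
y_ht = sum(w * y for w, (_, y) in zip(weights, sample))

print("design weights:", weights)
print("HT estimate of the total:", y_ht)
```

Each unit in stratum A stands for 50 population units and each unit in stratum B for 10, so the weights add up to the population size of 150.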

Design-based inference: limitations

▪ Design-based inference may not be suitable when:

➢ samples are small
➢ non-sampling errors are present
➢ there are discontinuities in the survey design (e.g. change in data collection mode, new classifications, methodological change of concepts): design-based estimators do not take these changes into account and cannot separate the «real» change from the methodological change

Model-assisted inference

▪ Design-based estimators of the parameter of a variable can be improved by using auxiliary information and modelling the relationship between the variable and the auxiliary information (=model-assisted)

Model-assisted inference: estimation

▪ The HT estimator is adjusted using a linear regression model that relates the variable of interest to auxiliary information
▪ $(x_k, y_k)$ are observed for a sample S (e.g. administrative and survey data); x is observed for the whole universe U
▪ $\hat{X}_{HT} = \sum_{i \in S} \dfrac{1}{\pi_i} x_i$ is the grossed-up total of observed auxiliary x values
▪ $X = \sum_{i \in U} x_i$ is the known total of auxiliary x values
▪ $\hat{Y}_{HT} = \sum_{i \in S} \dfrac{1}{\pi_i} y_i$ is the Horvitz-Thompson estimate
▪ $\hat{Y}_{R} = \hat{Y}_{HT} + b \cdot (X - \hat{X}_{HT})$ is the regression (=model-assisted) estimate, based on the regression model $y = a + b \cdot x$ estimated from the sample of observed $(x_k, y_k)$
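
A worked example of the regression estimator, under simplifying assumptions: the sample is a simple random sample of 4 units from an invented population of 100, so all selection probabilities are equal and the slope b can be estimated by unweighted OLS; the known auxiliary total X = 300 is also invented.

```python
# Invented setting: simple random sample of 4 units from a population of 100
population_n = 100
known_total_x = 300.0  # X: total of the auxiliary variable, known for the universe U
sample = [(1.0, 2.5), (2.0, 4.5), (3.0, 6.5), (4.0, 8.5)]  # observed (x_k, y_k)

pi = len(sample) / population_n  # equal selection probability for every unit

# Horvitz-Thompson (grossed-up) totals of x and y
x_ht = sum(x / pi for x, _ in sample)
y_ht = sum(y / pi for _, y in sample)

# Slope b of the regression y = a + b * x estimated from the sample
n = len(sample)
x_bar = sum(x for x, _ in sample) / n
y_bar = sum(y for _, y in sample) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in sample)
     / sum((x - x_bar) ** 2 for x, _ in sample))

# Regression estimate: correct the HT estimate for the gap between the known
# auxiliary total and its grossed-up sample estimate
y_reg = y_ht + b * (known_total_x - x_ht)

print("HT total of x:", x_ht, "known total of x:", known_total_x)
print("HT total of y:", y_ht, "regression estimate:", y_reg)
```

The sample under-represents x (a grossed-up total of 250 against a known total of 300), so the regression estimate raises the HT estimate of y by b times the shortfall.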

Design-based and model-assisted: examples of
application in official statistics
▪ Generalised regression estimator (GREG) widely used by
NSIs for calibration
➢ Adjusts totals for sub-populations (consistency across tables)
➢ Adjusts to known totals
▪ Small Area Estimation (estimation borrowing strength
over space)
▪ Surveys based on panels (estimation borrowing strength
from the past)
▪ Modelling survey discontinuities
▪ Integration of sources in National Accounts
▪ Hedonic Price Indices
▪ Seasonal adjustment of statistical series
Algorithm-based inference

• In the algorithmic approach, the equivalent of fitting a model is tuning an algorithm, so that it predicts well

• It is generally impossible to express algorithmic methods analytically in terms of a mathematical expression

• In the algorithmic approach, the data for which both x and y are known are split into two parts:
• TRAINING SET: the part used to tune the algorithm
• TEST SET: the part used to evaluate (or test) the predictive capabilities of the trained algorithm
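
The training/test split can be sketched as follows. The data, the 75/25 split and the choice of algorithm are assumptions for illustration: a 1-nearest-neighbour rule stands in for the tuned algorithm, and the mean absolute error plays the role of the evaluation measure.

```python
import random

# Invented data for which both x and y are known
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

# Shuffle, then split into a training set (75%) and a test set (25%)
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
training_set, test_set = shuffled[:split], shuffled[split:]


def predict(x_new, training):
    """1-nearest-neighbour rule: return the y of the training unit with the closest x."""
    nearest = min(training, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]


# Evaluate the predictive capability on units the algorithm has not seen
errors = [abs(y - predict(x, training_set)) for x, y in test_set]
mean_abs_error = sum(errors) / len(errors)

print("training units:", len(training_set), "test units:", len(test_set))
print("mean absolute error on the test set:", mean_abs_error)
```

Evaluating on the training set instead would report an error of zero here, because each training unit is its own nearest neighbour; the test set gives the honest measure.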

Algorithm-based inference: types of data

▪ collected from units through a targeted survey (e.g. Structural Business Survey, Labour Force survey)

▪ collected from units in support of some administrative process (e.g. tax records, unemployment benefits)

▪ other types, registering events (e.g. a transaction, an e-mail, a Tweet) generated as by-products of processes unrelated to statistics or administration


Feature | Survey data | Admin data | Other data
Records are units of a target population | Yes | Yes | No
Target variables are directly available | Yes | Yes | No
Auxiliary variables are directly available | Yes | Often | No
Data preparation/conversion is needed | No | No | Yes
Data covers the complete target population | No | Often | Rarely
Data are (almost) representative | Usually | Usually | No
Susceptibility to measurement error | High | Medium | Low

Source: Buelens et al. (2012)

Algorithm-based inference: theoretical examples

• Similar to the model-based estimator, the algorithmic estimator is

$\hat{Y}_{Alg} = \sum_{k \in S} y_k + \sum_{k \in R} F(x_k)$

• for some function F() which maps the observed x to the corresponding y within S (the training set of units for which y is known); the set R contains the population units with unknown y.

• Uncertainty of this estimator arises from the imperfect predictive power of the algorithm, and is assessed on the test set using some cost function.
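
A toy instance of the algorithmic estimator, under assumptions: the units are invented and F is a 1-nearest-neighbour rule chosen only because it is easy to follow; any trained predictor could take its place.

```python
# Units in S: both x and y are known (invented values)
sample_s = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# Units in R: only x is known, y has to be predicted
rest_r = [1.2, 2.9, 2.1]


def f(x_new):
    """Stand-in for the trained algorithm F: 1-nearest-neighbour prediction from S."""
    nearest = min(sample_s, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]


observed_total = sum(y for _, y in sample_s)  # sum of y_k over S
predicted_total = sum(f(x) for x in rest_r)   # sum of F(x_k) over R
y_alg = observed_total + predicted_total

print("observed part:", observed_total)
print("predicted part:", predicted_total)
print("algorithmic estimate of the total:", y_alg)
```

Observed values are used as they are and predictions are used only where y is unknown, so the quality of the estimate depends entirely on how well F predicts for the units in R.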
Algorithm-based inference: examples of application in
official statistics

▪ Central Statistics Office of Ireland: automatic coding system for Classification of Individual Consumption by Purpose (COICOP) assignment for their Household Budget Survey, using previously coded records as training data
▪ Statistics New Zealand: Support Vector Machines (SVM)
to improve coding of variables Occupation and Post-school
Qualification, using two disjoint sets of observations, each
of size 10,000, from Census 2013 data for training and
testing (50% correctness).
▪ Statistics Portugal: classification trees (a type of decision
trees whose response variables are categorical) for error
detection in foreign trade transaction data.

▪ US Department of Agriculture: hierarchical clustering to reduce the number of Quarterly Agriculture Survey (QAS) questionnaire versions (states x crops).
▪ Italian National Institute of Statistics: substituting
(fully or partially) ICT in Enterprises surveys by collecting
data via website scraping and extracting information using
machine learning methods.
▪ Statistics Canada: use of satellite imaging data to assist
with estimation of crop yields. Field surveyors were sent to
corresponding actual locations to ascertain crop types and
yields; these were used as response variables. Probabilistic
image processing algorithms were used to learn and predict
the field observations based on the satellite data.
References
▪ J.H. Stock and M.W. Watson (2003). Introduction to Econometrics. Addison Wesley
▪ W.H. Greene (2003). Econometric Analysis. Prentice Hall
▪ J. van den Brakel and J. Bethlehem (2008). Model-based estimation for official statistics. Statistics Netherlands, discussion paper (08002)
▪ K. Chu and Cl. Poirier, Statistics Canada (2015). Machine Learning Documentation Initiative. High-Level Group for the Modernisation of Statistical Production and Services, Modernisation Committee on Production and Methods
▪ B. Buelens, H.J. Boonstra, J. van den Brakel and P. Daas (2012). Shifting paradigms in official statistics. Statistics Netherlands, discussion paper (201218)
▪ CROS Portal on MEMOBUST:
➢ Generalised Regression Estimator (Method)
➢ Calibration (Method)

▪ OECD, Eurostat, ILO, IMF, The World Bank, UNECE (2013). Handbook on
Residential Property Price Indices (RPPIs)
▪ Peter Hein van Mulligen (2003). Quality aspects in price indices and
international comparisons: Applications of the hedonic method
▪ C. Goytia and G. Dorna (Universidad Torcuato Di Tella)(2016). Big data
and a Spatial Hedonic Approach: Addressing the land market information
gap and estimating land prices determinants in metropolitan regions from
developing countries
