Computational Journalism 2017 Week 1: Introduction
Computational Journalism
Columbia Journalism School
Week 1: Introduction, Clustering
September 8, 2017
Computational Journalism: Definitions
[Diagram: data reporting, the user, and computer science]
CS for presentation / interaction
[Diagram: CS sits between data reporting and the user]
Filter stories for user
[Diagram: reporting and data flow through a CS-based filter to the user]
Examples of filters
[Diagram: examples of CS-based filters between reporting/data and the user]
Journalism as a cycle
[Diagram: cycle of reporting, data, filtering, the user, and effects]
Journalism with algorithms
vs.
Journalism about algorithms
Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
Computer Science and Journalism
[Diagram: overlaps with statistics and epistemology]
Course Structure
Unit 1: Filters
Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering
system design.
Unit 3: Methods
Visualization, knowledge representation, social network analysis, privacy and
security, tracking flow and effects
Administration
Assignments
Some assignments require programming, but
your writing counts for more than your code!
Final project
Code, story, or research
Course blog
http://compjournalism.com
Grading
40% assignments
40% final project
20% class participation
This class
Introduction
Classification and clustering
Text analysis in journalism
The Document Vector Space model
Classification and Clustering
Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the
foundation not only for conceptualization, language,
and speech, but also for mathematics, statistics, and
data analysis in general.
d(x, x) = 0
- reflexivity: zero distance to self
d(x, y) = d(y, x)
- symmetry: x to y same as y to x
Distance matrix
\[
D_{ij} = D_{ji} = d(x_i, x_j), \qquad
D = \begin{pmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
d_{2,1} & d_{2,2} & & \vdots \\
\vdots & & \ddots & \\
d_{1,M} & \cdots & & d_{M,M}
\end{pmatrix}
\]
Different clustering algorithms
Partitioning
o keep adjusting clusters until convergence
o e.g. K-means
o Also LDA and many Bayesian models, from a certain perspective
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
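As a rough illustration of the first two families, a sketch using scikit-learn (the library choice and the toy points are assumptions, not part of the slides):

```python
# Sketch: partitioning (k-means) vs. agglomerative clustering on toy data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

points = np.array([[0, 0], [0, 1], [1, 0],     # one clump
                   [9, 9], [9, 10], [10, 9]])  # another clump

# Partitioning: adjust cluster assignments/centers until convergence.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)

# Agglomerative: start from single points, repeatedly merge the closest
# clusters; linkage="single" is the MIN approach, "complete" is MAX.
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(points)
print(agg.labels_)
```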
K-means demo
http://www.paused21.net/off/kmeans/bin/
UK House of Lords voting clusters
Algorithm instructed to separate members of the Lords into five clusters. Output:
1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
...
Voting clusters with parties
LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
Clustering Algorithm
Visualization
We have to go from high-dimensional points x ∈ ℝ^N to much lower-dimensional points y ∈ ℝ^K, K ≪ N.

Linear projection: y = Px, where P is a K × N matrix.

In general: y = f(x)
Like "flattening" a
stretchy structure into
2D, so that distances
between points are
preserved (as much as
possible")
House of Lords MDS plot
Robustness of results
Regarding these analyses of legislative voting, we could still
ask:
Are we modeling the right thing? (What about other
legislative work, e.g. in committee?)
Are our underlying assumptions correct? (do representatives
really have ideal points in a preference space?)
What are we trying to argue? What will be the effect of
pointing out this result?
Text Analysis in Journalism
USA Today/Twitter Political Issues Index
Politico analysis of GOP primary, 2012
CNN State of the Union Twitter analysis, 2010
The Post obtained draft versions of 12 audits by the inspector general's office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general's office and interviewed former and current employees.
E-mails and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
30 the
23 to
19 and
19 a
18 animal
17 cruelty
15 of
15 crimes
14 in
14 for
11 that
8 crime
7 we
Features = words works fine
Note that this won't work at all for Chinese. It will fail in
many ways even for English. How?
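A sketch of the naive approach, splitting on whitespace; the tokenizer and the sample text are assumptions, chosen to show one of the failure modes the question asks about:

```python
# Sketch: naive "features = words" counting by splitting on whitespace.
# Chinese has no spaces between words, so this produces nothing useful;
# even in English, punctuation and case split what should be one feature.
from collections import Counter

text = "animal cruelty crimes: we track animal cruelty, crime by crime"
counts = Counter(text.split())
print(counts.most_common(5))
# "cruelty" and "cruelty," count as different words -- one failure mode.
```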
Distance metric for text
Useful for:
clustering documents
finding docs similar to example
matching a search query
similarity(a, b) = a · b
If each word occurs exactly once in each document,
equivalent to counting overlapping words.
a 1 1 1 1 0 0 0 0 0 0 0 0
b 0 3 0 0 1 1 1 1 1 1 1 1
q 0 1 0 1 0 0 0 0 0 0 0 0
Problem: long documents always win
similarity(a, b) = (a · b) / (|a| |b|) = cos(θ)
a 1 1 1 1 0 0 0 0 0 0 0 0
b 0 3 0 0 1 1 1 1 1 1 1 1
q 0 1 0 1 0 0 0 0 0 0 0 0
similarity(a, q) = 2 / (√4 · √2) ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
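The same arithmetic in numpy as a check, using the count vectors a, b, q above (the library choice is an assumption):

```python
# Sketch: cosine similarity of the word-count vectors above.
import numpy as np

a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(a, q))   # 2 / (sqrt(4) * sqrt(2))  = 0.707...
print(cos_sim(b, q))   # 3 / (sqrt(17) * sqrt(2)) = 0.514...
```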
Cosine similarity
cos(θ) = similarity(a, b) = (a · b) / (|a| |b|)
Cosine distance (finally)
dist(a, b) = 1 - (a · b) / (|a| |b|)
Problem: common words
We want to look at words that discriminate among
documents.
[Diagram: document set, ● = contains "car", ○ = does not contain "car"]
Document Frequency
df(t, D) = |{d ∈ D : t ∈ d}| / |D|
- from Salton et al., "A Vector Space Model for Automatic Indexing", 1975
TF → TF-IDF
Some, but not much, theory explains why this works. (E.g. why
that particular IDF formula? Why doesn't indexing bigrams
improve performance?)
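As a closing sketch: TF-IDF weighting plus cosine similarity via scikit-learn. The toy documents are assumptions, and sklearn's IDF is a smoothed variant of the standard tf(t, d) · log(1 / df(t, D)) weighting, not exactly the df formula above:

```python
# Sketch: TF-IDF document vectors and pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the quick brown fox",
        "the lazy dog",
        "the quick dog"]

tfidf = TfidfVectorizer().fit_transform(docs)   # one row per document
print(cosine_similarity(tfidf))                 # 3 x 3 similarity matrix
```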