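The mutual information of two random variables $X, Y$ is $I(X;Y)=\sum_{x\in X, y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$. Mutual information measures how strongly two variables depend on each other. To measure the association between any two words $x, y$ in a dataset, we can take the single term of this sum that belongs to the pair, $I(x;y)=p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$, where
- $p(x), p(y)$ are the probabilities of $x$ and $y$ occurring on their own: count each word's frequency and divide by the total number of words
- $p(x,y)$ is the probability of $x$ and $y$ occurring together: count how many times the two appear in the same sentence and divide by the total number of unordered word pairs (a sentence of $n$ words contributes $n(n-1)/2$ pairs)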
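For reference, the full sum in the definition can be sketched in a few lines. This is only a minimal illustration: the function name `mutual_information`, the `{(x, y): p}` dictionary layout, and the two toy distributions are assumptions made for this example, not part of the original program. The logarithm is base 2, so the result is in bits.

```python
from math import log2


def mutual_information(joint):
    """Return I(X;Y) in bits for a joint distribution given as {(x, y): p(x, y)}."""
    # Marginals p(x) and p(y), obtained by summing the joint over the other variable.
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # Sum p(x,y) * log2(p(x,y) / (p(x) p(y))), skipping zero-probability cells.
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)


# Hypothetical joint distributions over two binary variables.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(independent))  # 0.0: independent variables share no information
print(mutual_information(identical))    # 1.0: one variable fully determines the other
```

The two extremes show the range of the measure for binary variables: zero when the variables are independent, one bit when they always agree.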
Suppose our dataset contains 6 sentences, as shown below:
dataSet = [['r', 'z'],
           ['x', 'y', 't', 's', 'z'],
           ['z'],
           ['s', 'x', 'r'],
           ['x', 'y', 'r', 'z', 't'],
           ['z', 'x', 'y', 's', 't']]
The following program computes the mutual information between every pair of words:
from collections import defaultdict
from math import log2


class I():
    # Use mutual information to measure the association between two words
    def __init__(self, dataSet):
        self.wordCount = defaultdict(int)  # number of occurrences of each word
        self.pairsCount = defaultdict(lambda: defaultdict(int))  # number of co-occurrences of each pair of words
        self.count = 0.0  # total number of unordered word pairs
        self.num = 0.0  # total number of words
        self.dataSet = dataSet
        self._update()

    def _update(self):
        """Traverse the whole dataset and update all counters."""
        for sample in self.dataSet:
            n = len(sample)
            self.count += n * (n - 1) / 2
            self.num += n
            for word in sample:
                self.wordCount[word] += 1
            for i in range(n):
                for j in range(n):
                    self.pairsCount[sample[i]][sample[j]] += 1

    def query(self, x, y):
        """Compute the mutual information of x and y: I(x;y) = p(x,y) * log2[p(x,y) / (p(x)p(y))]."""
        p_x = self.wordCount[x] / self.num
        p_y = self.wordCount[y] / self.num
        p_xy = self.pairsCount[x][y] / self.count
        if p_x == 0 or p_y == 0:  # x or y is absent from the dataset, so no relation can be derived
            return -1
        if p_xy == 0:  # x and y never co-occur; log2(0) is undefined, so use the convention 0 * log 0 = 0
            return 0.0
        return p_xy * log2(p_xy / p_x / p_y)


if __name__ == '__main__':
    dataSet = [['r', 'z'],
               ['x', 'y', 't', 's', 'z'],
               ['z'],
               ['s', 'x', 'r'],
               ['x', 'y', 'r', 'z', 't'],
               ['z', 'x', 'y', 's', 't']]
    test = I(dataSet)
    words = list(test.wordCount.keys())
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            print("I({}, {}) = {}".format(words[i], words[j], test.query(words[i], words[j])))
Running the program produces the following output:
I(r, z) = 0.04648714168815664
I(r, x) = 0.06542408844623677
I(r, y) = 0.015507264790143206
I(r, t) = 0.015507264790143206
I(r, s) = 0.015507264790143206
I(z, x) = 0.08472409501243901
I(z, y) = 0.12134505083116053
I(z, t) = 0.12134505083116053
I(z, s) = 0.04648714168815664
I(x, y) = 0.14975047096828073
I(x, t) = 0.14975047096828073
I(x, s) = 0.14975047096828073
I(y, t) = 0.18637142678700225
I(y, s) = 0.08983805899205112
I(t, s) = 0.08983805899205112
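As the output shows, $I(y,t)$ is the largest value, which indicates that $y$ and $t$ are the most strongly associated pair: they appear in exactly the same three sentences. $I(r,y)$, $I(r,t)$ and $I(r,s)$ are the smallest values, which indicates that $r$ is only weakly associated with $y$, $t$ and $s$: it shares just one sentence with each of them. This agrees with what we would conclude from inspecting the data directly.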
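As a sanity check, the largest value can be reproduced by hand from the counts the program accumulates. The dataset has 21 words and $1+10+0+3+10+10=34$ unordered pairs in total; $y$ and $t$ each occur 3 times and co-occur 3 times, so

$$
p(y)=p(t)=\frac{3}{21}=\frac{1}{7},\qquad p(y,t)=\frac{3}{34},\qquad
I(y;t)=\frac{3}{34}\log_2\frac{3/34}{\frac{1}{7}\cdot\frac{1}{7}}=\frac{3}{34}\log_2\frac{147}{34}\approx 0.1864,
$$

which matches the printed value of `I(y, t)`.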
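By the definition of mutual information, we can compute the association between two random variables $X, Y$, and here $X$ and $Y$ may each consist of several words. In other words, mutual information can also be used to measure the association between two sentences.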
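One possible reading of this idea is sketched below. The original text does not specify how the sentence-level score is computed, so everything here is an assumption made for illustration: the function name `sentence_mutual_information`, treating each sentence as a set of words, summing the per-pair terms over all $x$ in one sentence and $y$ in the other, and skipping identical words and pairs that never co-occur.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sentence_mutual_information(data_set, sent_a, sent_b):
    """Sum p(x,y) * log2(p(x,y) / (p(x) p(y))) over x in sent_a, y in sent_b.

    Assumption (not from the original text): sentences are treated as sets of
    words, and identical words or pairs that never co-occur contribute 0.
    """
    word_count = Counter(w for s in data_set for w in s)
    pair_count = Counter()
    total_pairs = 0
    for s in data_set:
        total_pairs += len(s) * (len(s) - 1) // 2
        for a, b in combinations(s, 2):
            pair_count[frozenset((a, b))] += 1  # unordered co-occurrence count
    total_words = sum(word_count.values())

    score = 0.0
    for x in set(sent_a):
        for y in set(sent_b):
            n_xy = pair_count[frozenset((x, y))]
            if x == y or n_xy == 0:
                continue  # no usable co-occurrence statistic for this pair
            p_xy = n_xy / total_pairs
            p_x = word_count[x] / total_words
            p_y = word_count[y] / total_words
            score += p_xy * log2(p_xy / (p_x * p_y))
    return score


data_set = [['r', 'z'],
            ['x', 'y', 't', 's', 'z'],
            ['z'],
            ['s', 'x', 'r'],
            ['x', 'y', 'r', 'z', 't'],
            ['z', 'x', 'y', 's', 't']]
print(sentence_mutual_information(data_set, ['r', 'z'], ['x', 'y']))
```

With single-word sentences the function reduces to the word-level score, so `sentence_mutual_information(data_set, ['y'], ['t'])` agrees with `I(y, t)` above up to floating-point rounding.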