寻找独立特征--非负矩阵因式分解（NMF）

时间：2011-02-05 来源：kewing

本文以下部分主要讨论特征提取之非负矩阵因式分解（NMF）。

构造数据集

和聚类算法一样，特征提取算法所需数据集是一个大型数字矩阵。其中，每一行代表一个数据项每一列代表数据项的一个属性。

本文的矩阵，行代表各类文章，列对应于文章中的单词，行列中的数字代表了某个词在该文章中出现的次数。

如（来自《Programming Collective Intelligence》）：

acticles = ['A', 'B', 'C', ......]

words = ['w1', 'w2', 'w3', ....]

data = [[3.0,1...],

[2,1,1,...],

.....]

（回忆：嗯，在聚类算法中的数据也是这样的。。。）

 1 import feedparser
 2  import re
 3  # 新闻来源 -- RSS
 4  feedlist=['http://today.reuters.com/rss/topNews',
 5           'http://today.reuters.com/rss/domesticNews',
 6           'http://today.reuters.com/rss/worldNews',
 7           'http://hosted.ap.org/lineups/TOPHEADS-rss_2.0.xml',
 8           'http://hosted.ap.org/lineups/USHEADS-rss_2.0.xml',
 9           'http://hosted.ap.org/lineups/WORLDHEADS-rss_2.0.xml',
10           'http://hosted.ap.org/lineups/POLITICSHEADS-rss_2.0.xml',
11           'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml',
12           'http://www.nytimes.com/services/xml/rss/nyt/International.xml',
13           'http://news.google.com/?output=rss',
14           'http://feeds.salon.com/salon/news',
15           'http://www.foxnews.com/xmlfeed/rss/0,4313,0,00.rss',
16           'http://www.foxnews.com/xmlfeed/rss/0,4313,80,00.rss',
17           'http://www.foxnews.com/xmlfeed/rss/0,4313,81,00.rss',
18           'http://rss.cnn.com/rss/edition.rss',
19           'http://rss.cnn.com/rss/edition_world.rss',
20           'http://rss.cnn.com/rss/edition_us.rss']
21  #----------------------------------------------------------------------
22 # 构造数据
23 #----------------------------------------------------------------------
24  def stripHTML(h):
25     """strip the HTML"""
26     p = ''
27     s = 0
28     for c in h:
29         if c == '<': s = 1
30         elif c == '>':
31             s = 0
32             p += ' '
33         elif s == 0: p += c
34     return p
35  #----------------------------------------------------------------------
36  def separatewords(text):
37     """separete the word using re"""
38     splitter = re.compile(r'\W*')
39     return [s.lower() for s in splitter.split(text) if len(s) > 3]
40  #----------------------------------------------------------------------
41 # 构造数据主例程
42  def getarticlewords():
43     """construre data"""
44     allwords = {}
45     articlewords = []
46     articletitles = []
47     ec = 0
48     
49     # 遍历订阅源
50      for feed in feedlist:
51         f = feedparser.parse(feed)
52      
53         # 遍历每篇文章
54          for e in f.entries:
55             # 跳过标题相同的文章
56              if e.title in articletitles: continue
57             
58             # 提取单词
59              txt = e.title.encode('utf-8') + stripHTML(e.description.encode('utf-8'))
60             words = separatewords(txt)
61             articlewords.append({})
62             articletitles.append(e.title)
63             
64             # 在allwords和articlewords中增加对当前单词的计数
65              for word in words:
66                 allwords.setdefault(word, 0)
67                 allwords[word] += 1
68                 articlewords[ec].setdefault(word, 0)
69                 articlewords[ec][word] += 1
70             ec += 1
71         
72     return allwords,articlewords,articletitles
73  #----------------------------------------------------------------------
74 # 构造符合题意的matrix
75  def makematrix(allw, articlew):
76     """make matrix"""
77     wordvec = []
78     
79     # 限制频度
80      for w,c in allw.items():
81         if c > 3 and c < len(articlew)*0.6:
82             wordvec.append(w)
83     
84     # 构造单词矩阵
85      print articlew[0]
86     li = [ [(word in f and f[word] or 0) for word in wordvec]
87            for f in articlew]
88     return li,wordvec
89  #========================================================================
90 # test: newsfeatures.py
91 #========================================================================
92  if __name__ == '__main__':
93     allwords,articlewords,articletitles = getarticlewords()
94     wordmatrix, wordvec = makematrix(allwords, articlewords)

有点神的是这句：

    li = [ [(word in f and f[word] or 0) for word in wordvec]
           for f in articlew]

这个列表推导用得，嗯，好。。。

这里是feedparser的文档。

其他和在聚类算法中构造数据如出一辙（甚至可以对聚类中的代码稍作修改而复用），就不再多说什么了。

我们测试一下：

1 #========================================================================
2 # test: newsfeatures.py
3 #========================================================================
4  if __name__ == '__main__':
5     allwords,articlewords,articletitles = getarticlewords()
6     wordmatrix, wordvec = makematrix(allwords, articlewords)
7     print wordvec[0:10]
8     print articletitles[1]
9     print wordmatrix[1][0:10]

测试结果：

['updates', 'school', 'street', 'reported', 'military', 'reagan', 'would', 'quot', 'after', 'president']
Week in Review: 2 Detained Reporters Saw Secret Police’s Methods Firsthand
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

我们打出了前10个单词，列出了第二篇文章的标题，后面是该文章中所出现的这10个词的频度。

非负矩阵因式分解（NMF）

（矩阵相关基础知识在这里）

以上的代码，我们构造了一个关于单词和文章的文章矩阵。特征提取，即是从该文章矩阵中提取出一特征矩阵和一权重矩阵，使得二者相乘得到原谅的文章矩阵。理想的情况下，我们能利用这二者完全还原文章矩阵。但鉴于，现实是不完美的，因此，我们的目标是：找到尽量完美的两个矩阵，尽量还原原始文章矩阵，这样所得的特征矩阵必然是质量最好的矩阵，从而完成了对文章的特征提取。

我们的特征矩阵是（特征X单词）：

权重矩阵是（文章X特征）：

此二者相乘，得到文章矩阵（文章X单词）：

算法实现：

因为我们要找到尽量完美的两矩阵，因此有必要先定义一个函数，该函数来判断所给出矩阵是否完美。。。

 1 from numpy import *
 2 
 3  #----------------------------------------------------------------------
 4 # 判断是否完美
 5  def difcost(a,b ):
 6     """cal the cost of current matrix"""
 7     dif = 0
 8     # 遍历每一行&每一列
 9      for i in range(shape(a)[0]):
10         for j in range(shape(a)[1]):
11             dif += pow((a[i,j] - b[i,j]), 2)
12     return dif

好吧，其实很简单。这里有关于numpy.（这里很显著的体现了Python的三方库确实挺多的，也难怪，Python的表达能力很强。。。）

既然有了关于矩阵是否完美的判断函数，那么我们要做的主要工作就是：不断的更新矩阵，使矩阵成本更低更趋于完美。虽然我们可以使用之前介绍过的优化算法（也不算很复杂，重要的是找到合适的成本函数），但这里要使用的是另一种方法：乘法更新法则（Multiplicative Update Rules）。

关于乘法更新原则理论/方法的讨论可参看相关文献。

这里直接贴出《Programming Collective Intelligence》一书中的源码：

 1 def factorize(v,pc=10,iter=50):
 2   ic=shape(v)[0]
 3   fc=shape(v)[1]
 4 
 5   # Initialize the weight and feature matrices with random values
 6    w=matrix([[random.random() for j in range(pc)] for i in range(ic)])
 7   h=matrix([[random.random() for i in range(fc)] for i in range(pc)])
 8 
 9   # Perform operation a maximum of iter times
10    for i in range(iter):
11     wh=w*h
12     
13     # Calculate the current difference
14      cost=difcost(v,wh)
15     
16     if i%10==0: print cost
17     
18     # Terminate if the matrix has been fully factorized
19      if cost==0: break
20     
21     # Update feature matrix
22      hn=(transpose(w)*v)
23     hd=(transpose(w)*w*h)
24   
25     h=matrix(array(h)*array(hn)/array(hd))
26 
27     # Update weights matrix
28      wn=(v*transpose(h))
29     wd=(w*h*transpose(h))
30 
31     w=matrix(array(w)*array(wn)/array(wd))  
32     
33   return w,h