Php文档 Php问答行业资讯 Php论坛 Php手册 Php博客

游戏榜单

软件榜单

关闭导航

热搜榜

热门下载

热门标签

关闭搜索

php爱好者> php文档>机器学习IN 搜索引擎（1）

机器学习IN 搜索引擎（1）

时间：2011-02-08 来源：kewing

首先简要说明建立搜索引擎的步骤：

基本上，很简略的，可以将建立一个搜索引擎（或，建立一个玩具级的搜索引擎）分为以下几步：

1.搜集文档，抓取网页etc。

2.建立索引（文档，及文档中不同单词的位置信息）

3.查询，返回经排序的稳定列表。（处理结果的排列方式：基于网页内容的度量，如：单词频度；基于网页外部信息的度量，如：PageRank）

4.对查询结果排名（如，神经网络）

本文主要涉及上述步骤的前两部。这其中涉及到的机器学习算法并不多（嗯，我觉得是没有……）。

好，我们先搭建搜索引擎的初步架构：

 1 import urllib2
 2 from BeautifulSoup import *
 3 from urlparse import urljoin 
 4 
 5 ignorewords = set( ['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', 'or'] )
 6 
 7  #-----------------------------------------------------------------
 8 # crawler: 检索网页&创建数据库
 9 #-----------------------------------------------------------------
10  class crawler:
11     # dbname: 数据库名称
12      def __init__(self, dbname):
13         pass
14     
15     def __del__(self):
16         pass 
17     
18     def dbcommit(self):
19         pass
20     
21     # 辅助函数--用于获取条目ID，若条目不存在则将之加入数据库
22      def getentryid(self, table, field, value, createnew = True):
23         return None 
24     
25     # 为网页建立索引
26      def addtoindex(self, url, soup):
27         print('Indexing %s' % url)
28     
29     # 从HTML中提取文字（无标签）
30      def gettextonly(self, soup):
31         return None 
32     
33     # 根据任何非空白字符进行分词处理
34      def separatewords(self, text):
35         return None 
36     
37     # 若url已建立索引则返回True
38      def isindexed(self, url):
39         return False
40     
41     # 添加一关联两个网页的链接
42      def addlinkref(self, urlFrom, urlTo, linkText):
43         pass 
44     
45     # 从一小组网页开始进行广度优先搜索，直至某一给定深度
46      # 期间为网页建立索引
47      def crawl(self, pages, depth = 2):
48         pass
49     
50     # 创建数据库表
51      def createindextables(self):
52         pass

以下要做的是完善该框架。

1.搜集文档，抓取网页etc--爬虫

我们使用urllib2和BeautifulSoup来构造一个Python版的爬虫程序

（由于以后会大量使用beautifulsoup，因此建议各位大神Google一下（如果你不愿意，可以点这里），由于不想太陷入细节，因此，以后不太对BeautifulSoup进行讨论。）

嗯，为了调试的需要，我在文件头增加了些代码：

1 # -*- coding: gb2312 -*-
2  import urllib2
3  from BeautifulSoup import *
4  from urlparse import urljoin
5  from sqlite3 import dbapi2 as sqlite
6 
7 DEBUG_ON = True
8  if DEBUG_ON: import traceback

首先，一定一定要在文件头上写上# -*- coding: gb2312 -*-之类的代码，否则。。。（Python3就要好得多，全是Unicode，关于Python中，类也好函数也好，这些个字符编码的问题还真相当麻烦，不过还好Python3中有了很大改善。。。）。那个DEBUG_ON是为了取得和C中的#ifdef相似的功能……

爬虫函数如下：

 1     # 从一小组网页开始进行广度优先搜索，直至某一给定深度
 2     # 期间为网页建立索引
 3     def crawl(self,pages,depth=2):
 4          for i in range(depth):
 5               newpages={}
 6               for page in pages:
 7                    try:
 8                         c=urllib2.urlopen(page)
 9                    except:
10                         print "Could not open %s" % page
11                         continue
12                    
13                    try:
14                         soup=BeautifulSoup(c.read())
15                         self.addtoindex(page,soup)
16                         
17                         links=soup('a')
18                         for link in links:
19                              if ('href' in dict(link.attrs)):
20                                   url=urljoin(page,link['href'])
21                                   if url.find("'")!=-1: continue
22                                   url=url.split('#')[0]
23                                   if url[0:4]=='http' and not self.isindexed(url):
24                                        newpages[url]=1
25                                        linkText=self.gettextonly(link)
26                                        self.addlinkref(page,url,linkText)
27                         self.dbcommit()
28                    except:
29                         print "Could not parse page %s" % page
30                         if DEBUG_ON: print traceback.format_exc()
31                    pages = newpages

DEBUG_ON一定要定义，否则这里就要注意处理NameError。。。貌似Python的debug功能不是很强大。。。

这其中涉及到了相当多的关于web，url之类的处理。由于本文关注的主要是机器学习，因此不打算对这些知识逐一介绍，那是几本书才能完成的事情。。。

代码中，我们指定了depth。因为我们要对该页面上的所有链接（我们将他们加入newpages中）进行遍历，所以有必要指定下遍历要进行的深度。我们传入一个list，其中是我们的爬虫程序开始的页面。我们也会对该网页做索引，我们用isindexed检查该url是否已经被做过索引，如果没有我们就将之加入我们的索引表。

我们测一下：

1 #================================================================================
2 # Test: searchengine.py
3 #================================================================================
4  if __name__ == '__main__':
5     pagelist = [r'http://www.baidu.com']
6     crawler = crawler('')
7     crawler.crawl(pagelist)

结果如下：

代码

2.建立索引

首先，数据库处理

嗯，我们来为全文索引建立数据库。索引对应于一个列表，其中包含了所有不同单词 & 这些单词所在的文档 & 单词在文档中出现的位置。

我们将使用简单的嵌入式数据库SQLite。关于该数据库的使用，在《文档分类2》中有所涉及。

我们先在crawler中修改部分代码，以进行SQLite操作。

1     # dbname: 数据库名称
2      def __init__(self, dbname):
3         self.con = sqlite.connect(dbname)
4     
5     def __del__(self):
6         self.con.close()
7     
8     def dbcommit(self):
9         self.con.commit()

笔者使用的是Python2.6（确切的说，在写这个东东时，为了不抽风的使用BeautifulSoup而不得不使用Python2.6。。。），所以相关的imort语句是（嗯，实际上Python3中也是这样。。。）：

from sqlite3 import dbapi2 as sqlite

我们先来建立如下几张数据库表。

他们分别是：

urllist：索引过的url列表

wordlist：单词列表

wordlocation：单词在文档中所处位置的列表

link：保存了两个url，指明了从一张表到另一张表的链接关系。

linkwords：利用wordid和linkid，记录了哪些单词与链接相关

（嗯，也许上面那幅图把你搞得晕晕乎乎的，那么，由于“所有SQLite中的表默认都有个rowid”（就是这里的第一句话所说的那样），因此，我们不需要显示的为这些表指定这个rowid，也因此，在看这幅图时，直接把图中的rowid忽视掉，然后，豁然开朗。。。）

我们所要做的就是建立上面那些个表，如下代码所示：

 1     # 创建数据库表
 2      def createindextables(self):
 3         self.con.execute('create table urllist(url)')
 4         self.con.execute('create table wordlist(word)')
 5         self.con.execute('create table wordlocation(urlid,wordid,location)')
 6         self.con.execute('create table link(fromid integer,toid integer)')
 7         self.con.execute('create table linkwords(wordid,linkid)')
 8         self.con.execute('create index wordidx on wordlist(word)')
 9         self.con.execute('create index urlidx on urllist(url)')
10         self.con.execute('create index wordurlidx on wordlocation(wordid)')
11         self.con.execute('create index urltoidx on link(toid)')
12         self.con.execute('create index urlfromidx on link(fromid)')
13         self.dbcommit()

可以看到，我们并不是仅仅建立了5张表，实际上，为了效率，我们另建了几张表。

不妨一测：

1 #===============================================================================
2 # Test: searchengine.py
3 #===============================================================================
4  if __name__ == '__main__':
5     pagelist = [r'http://www.baidu.com']
6     crawler = crawler('searchindex.db')
7     crawler.createindextables()
8     crawler.crawl(pagelist)

我们在当前工作目录建立个数据库文件：searchindex.db，嗯。

其次，为了在网页中查找单词，我们首先：

 1     # 从HTML中提取文字（无标签）
 2      def gettextonly(self, soup):
 3         v = soup.string
 4         if v == None:
 5             c = soup.contents
 6             resultext = ""
 7             for t in c:
 8                 subtext = self.gettextonly(t)
 9                 resultext += subtext + '\n'
10             return resultext
11         else:
12             return  v.strip()

然后：

1     # 根据任何非空白字符进行分词处理
2      def separatewords(self, text):
3         splitter = re.compile(r'\W*')
4         return (s.lower() for s in splitter.split(text) if s!='')

想说下的是，separatewords足够简短和谐，但可能工作的不如，嗯，比如Google，那么好。有大量的研究旨在改善该函数的功能。（Google一下）（关于中文分词也是相当巨大的一块，有兴趣的可以稍微看看笔者很久以前写的一个，玩具级分词程序。。。）

(也正因为该函数，所以有了本文开头关于本程序所处理文本的假设。所以，实质上从后文来讲，我们测试中使用http://www.baidu.com是不行的。。)

（还可以对分词进行一些处理，比如，针对英语文本的词干词缀处理，可以让我们输入play，而得到play & playing 的相关页面。。。这些均属于此处分词的任务。。。）

好吧，建立索引

在这里，我们的目标是：把某个url网页中出现的所有单词都与该url相联系起来。

我们把这些信息保存在wordlocation中。

 1      # 为网页建立索引
 2       def addtoindex(self, url, soup):
 3           if self.isindexed(url): return
 4           print('Indexing %s' % url)
 5           
 6           # 获取单词
 7            text = self.gettextonly(soup)
 8           words = self.separatewords(text)
 9           
10           # 得到URL的id
11            urlid = self.getentryid('urllist', 'url', url)
12           
13           # 将每个单词与该url关联
14            for i in range(len(words)):
15                word = words[i]
16                if word in ignorewords: continue
17                worid = self.getentryid('wordlist', 'word', word)
18                self.con.execute("insert into wordlocation(urlid, wordid, location) \
19                 values (%d, %d, %d)" % (urlid, worid, i) )

从这里我们可以看出到目前为止整个算法的一个大体的流程。我们利用BeautifulSoup对某个url soup一下，该soup保存了该url的html形式。我们利用该soup得到所有的text。

在这里我们还需要一个辅助函数：getentryid，来得到相应的id。如你所想，肯定是与数据库相关。

 1     # 辅助函数--用于获取条目ID，若条目不存在则将之加入数据库
 2      def getentryid(self, table, field, value, createnew = True):
 3         cur=self.con.execute(
 4             "select rowid from %s where %s='%s'" % (table,field,value))
 5         res=cur.fetchone()
 6         if res==None:
 7             cur=self.con.execute(
 8                 "insert into %s (%s) values ('%s')" % (table,field,value))
 9             return cur.lastrowid
10         else:
11             return res[0]

嗯，很简单。如果存在，我们将返回相应的id，如果不存在，我们就插入之。可能你需要参考下我们建立数据库的函数createindextables。

isindexed也很简单。

 1     # 若url已建立索引则返回True
 2       def isindexed(self, url):
 3           u = self.con.execute \
 4             ("select rowid from urllist where url='%s'" % url).fetchone()
 5           if u != None:
 6               # 是否已被检索?
 7                 v = self.con.execute \
 8                 ("select * from wordlocation where urlid=%d" % u[0]).fetchone()
 9                if v != None: return True
10           return False

嗯，可以一测了：

1 #===============================================================================
2 # Test: searchengine.py
3 #===============================================================================
4  if __name__ == '__main__':
5      pagelist = [r'http://www.baidu.com']
6      crawler = crawler('searchindex.db')
7      crawler.createindextables()
8      crawler.crawl(pagelist)
9      print [row for row in crawler.con.execute("select rowid from wordlocation where wordid=1")]

对baidu crawler了一下，结果如下：

代码

得到的searchindex.db有1.15Mb，嗯，还可以。

至此，我们的代码完整代码如下：

代码

(Wait for continue...)