Apache Nutch 1.2 Released
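Date: 2010-09-25 Source: 红薯
Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Apache Nutch 1.2 includes a number of improvements and bug fixes; see the CHANGES file for details.
You can download the latest version of Apache Nutch from the following address: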
http://www.apache.org/dyn/closer.cgi/nutch/
CHANGES:
* NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via mattmann)
* NUTCH-908 Infinite Loop and Null Pointer Bugs in Searching (kubes via mattmann)
* NUTCH-906 Nutch OpenSearch sometimes raises DOMExceptions (Asheesh Laroia via ab)
* NUTCH-862 HttpClient null pointer exception (Sebastian Nagel via ab)
* NUTCH-905 Configurable file protocol parent directory crawling (Thorsten Scherler, mattmann, ab)
* NUTCH-877 Allow setting of slop values for non-quote phrase queries on query-basic plugin (kubes via jnioche)
* NUTCH-716 Make subcollection index field multivalued (Dmitry Lihachev via jnioche)
* NUTCH-878 ScoringFilters should not override the injected score
* NUTCH-870 Injector should add the metadata before calling injectedScore (jnioche via mattmann)
* NUTCH-858 No longer able to set per-field boosts on lucene documents (ab)
* NUTCH-869 Add parse-html back (jnioche)
* NUTCH-871 MoreIndexingFilter missing date format (Max Lynch via mattmann)
* NUTCH-696 Timeout for Parser (ab, jnioche)
* NUTCH-857 DistributedBeans should not close their RPC counterparts (kubes)
* NUTCH-855 ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing (Scott Gonyea via mattmann)
* NUTCH-677 Segment merge filtering based on segment content (Marcin Okraszewski via mattmann)
* NUTCH-774 Retry interval in crawl date is set to 0 (Reinhard Schwab via mattmann)
* NUTCH-697 Generate log output for solr indexer and dedup (Dmitry Lihachev, Jeroen van Vianen via mattmann)
* NUTCH-850 SolrDeleteDuplicates needs to clone the SolrRecord objects (jnioche)
* NUTCH-838 Add timing information to all Tool classes (Jeroen van Vianen, mattmann)
* NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
* NUTCH-831 Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized (Jeroen van Vianen via mattmann)
* NUTCH-278 Fetcher-status might need clarification: kbit/s instead of kb/s shown (Alex McLintock via mattmann)
* NUTCH-833 Website is still Lucene branded (mattmann, Alex McLintock)
* NUTCH-832 Website menu has lots of broken links - in particular the API docs (Alex McLintock via mattmann)