Search - heritrix

[Search Engine] 4pm

Description: This article builds a web search application with Lucene and Heritrix. Lucene is a Java-based full-text information retrieval package, currently an open-source project in the Apache Jakarta family. Lucene is powerful, but any search engine tool, however powerful, needs something behind it to supply content: a web crawler. A web crawler is also called a spider, a web robot, or a bot; the name matters little. What matters is that crawlers are what give a search engine its wealth of resources. Heritrix is an open-source web crawler written entirely in Java that users can use to fetch the resources they want from the web. It comes from www.archive.org. Heritrix's greatest strength is its extensibility: developers can extend its individual components to implement their own crawl logic.
Platform: | Size: 2989056 | Author: 曹志聪 | Hits:

[Internet-Network] heritrix

Description: A web crawler tool, with source code, that crawls web page data and saves it to a local database.
Platform: | Size: 11276288 | Author: li | Hits:

[JSP/Java] LucenePHeritrix

Description: Source code for crawling web pages with Heritrix + Lucene.
Platform: | Size: 25793536 | Author: tai | Hits:

[Other Web Code] testDWR

Description: An example web crawler, used together with Heritrix and Lucene.
Platform: | Size: 214016 | Author: tai | Hits:

[JSP/Java] heritrixProject

Description: A Heritrix crawler example that crawls mobile phone product information from PCONLINE and 163.
Platform: | Size: 11102208 | Author: hwq | Hits:

[Search Engine] heritrixDktj131_2012

Description: A topic-focused web crawler developed by extending the Heritrix development package.
Platform: | Size: 12328960 | Author: xcx0617 | Hits:

[JSP/Java] MD5

Description: The MD5 algorithm, a very handy hash function that can be used in a search engine built on the Lucene + Heritrix architecture.
Platform: | Size: 1024 | Author: zhaolinfang | Hits:

[JSP] Eclipse-Heritrix1.14.4

Description: Configuring Heritrix in Eclipse.
Platform: | Size: 16384 | Author: 肖剑锋 | Hits:

[Search Engine] search-eginee

Description: Source code from a book on developing your own search engine with Lucene 2.0 + Heritrix.
Platform: | Size: 17222656 | Author: wangyilin | Hits:

[Software Engineering] heritrixs

Description: Installation procedure for Heritrix as a distributed crawler, written up after a hands-on installation of the latest Heritrix version.
Platform: | Size: 4096 | Author: | Hits:

[Search Engine] heritrix_developer_manual

Description: Official Heritrix developer documentation (crawler.archive.org/articles), providing a basic introduction to developing with its classes.
Platform: | Size: 83968 | Author: Liu | Hits:

[Search Engine] TmallSearch20130507

Description: A search system for the Tmall (天猫) site, built with open-source tools such as Lucene and Heritrix.
Platform: | Size: 5970944 | Author: 王东升 | Hits:

[JSP/Java] sample.dw.paper.lucene

Description: Code for a simple search engine built with Lucene and Heritrix; the basic functions are all implemented.
Platform: | Size: 3278848 | Author: zhang | Hits:

[WEB Code] mysearch

Description: The original Heritrix source code plus some custom filtering tools.
Platform: | Size: 12267520 | Author: Anthony | Hits:

[Search Engine] WPCrawler

Description: A web crawler, also called a web spider; some projects call it a "walker". Wikipedia defines it as "a network program that systematically scans the Internet for the purpose of indexing". There are many open-source web crawler projects, the better known of which include Heritrix and Apache Nutch. Sometimes you need to collect information from the web, and when that information can be fetched in a uniform way but is slow and laborious to gather by hand, a crawler is the right tool. Examples include counting how many articles a site publishes each month and which tags it uses, collecting a corpus for a natural language processing project, or collecting images for a pattern recognition project. A web crawler is also one of the essential components of a search engine.
Platform: | Size: 1863680 | Author: Francis | Hits:
