Abstract: Web crawler design is one of the core technologies of any search engine, whether general-purpose or vertical. This paper presents a detailed study of the web crawler for a lifestyle-themed vertical search engine, built on the HTMLParser information extraction method. Based on an in-depth analysis of the tree-shaped URL structure of lifestyle websites, a simulated searcher was developed to collect the URLs of seed pages. Using HTMLParser-based information extraction, the target URLs relevant to the lifestyle theme are then extracted from those seed pages. Experimental tests show that the crawler achieves a precision of 93.552% and a recall of 96.720%, indicating that it is effective and meets the requirements of medium-scale, enterprise-level vertical search applications.
Key words: web crawler; vertical search engine; HTMLParser
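The extraction step described in the abstract (pulling topic-relevant target URLs out of a seed page) and the two reported metrics can be illustrated with a minimal sketch. It makes several assumptions: it uses the `html.parser.HTMLParser` class from Python's standard library as a stand-in, which is not necessarily the HTMLParser library the paper refers to; the seed page, the URLs, and the `/shop/` topic marker are hypothetical; and the substring filter is only a placeholder for whatever relevance rule the paper applies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the seed page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_target_urls(seed_html, seed_url, topic_markers):
    """Return links from a seed page whose URL contains any topic marker.

    The substring test is an assumed placeholder for a real relevance rule.
    """
    parser = LinkExtractor(seed_url)
    parser.feed(seed_html)
    parser.close()
    return [u for u in parser.links if any(m in u for m in topic_markers)]


def precision_recall(crawled, relevant):
    """Precision = hits / crawled URLs; recall = hits / relevant URLs."""
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical seed page and URLs, for illustration only.
SEED_URL = "http://life.example.com/food/"
SEED_HTML = """
<ul>
  <li><a href="/food/shop/1001.html">Restaurant A</a></li>
  <li><a href="shop/1002.html">Restaurant B</a></li>
  <li><a href="/about.html">About us</a></li>
</ul>
"""

targets = extract_target_urls(SEED_HTML, SEED_URL, ["/shop/"])
```

Here `targets` holds the two shop URLs in absolute form, and the "About us" link is filtered out. Passing `targets` and a hand-labelled list of relevant URLs to `precision_recall` gives the two ratios in the same sense as the precision and recall percentages reported in the abstract.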