Web Page Vectorization: A Method to Optimize the Quality of Web Crawlers

Abstract: The performance of a search engine depends in part on the ability of its web crawler to acquire web content. Inspired by vectorized representations in deep learning and by convolutional neural networks, this paper focuses on how a computer understands information (natural language and images) and how pieces of information relate to one another. It proposes a web page vector representation algorithm (Page2Vec) and builds a crawler-filter algorithm (Crawler-Filter) on top of Page2Vec. Experiments show that, during crawling, the Crawler-Filter algorithm covers a reasonable range of content while bypassing low-quality or irrelevant content.
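The abstract does not spell out how Page2Vec or Crawler-Filter work internally, so the following is only a minimal sketch of the general idea of filtering a crawl by page vectors. Everything in it is an assumption made for illustration: the normalized bag-of-words `embed` function is a simple stand-in and is not Page2Vec, and the names `cosine`, `crawl_with_filter`, `threshold`, and the in-memory `pages` and `links` dictionaries are hypothetical. A page whose vector scores below the threshold against a topic vector is neither kept nor expanded, so links reachable only through it are bypassed.

```python
import math
from collections import Counter, deque


def embed(text):
    """Stand-in page vector (NOT Page2Vec): L2-normalized bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two sparse, already-normalized vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def crawl_with_filter(pages, links, seed, topic, threshold=0.3):
    """Breadth-first crawl over an in-memory graph.

    Pages scoring below `threshold` against the topic vector are
    skipped and their outgoing links are not followed.
    """
    topic_vec = embed(topic)
    seen, kept = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if cosine(embed(pages[url]), topic_vec) < threshold:
            continue  # filtered: neither kept nor expanded
        kept.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen and nxt in pages:
                seen.add(nxt)
                queue.append(nxt)
    return kept


# Hypothetical toy data: page text keyed by an invented page id.
pages = {
    "a": "web crawler search engine",
    "b": "crawler search index",
    "c": "cheap pills buy now",
    "d": "search engine crawler quality",
}
links = {"a": ["b", "c"], "c": ["d"]}

kept = crawl_with_filter(pages, links, seed="a", topic="web crawler search engine")
print(kept)
```

In this toy graph, page `c` shares no words with the topic and is filtered, so page `d`, which is linked only from `c`, is never visited. That illustrates the trade-off a filter of this kind makes between coverage and avoiding irrelevant regions; the threshold value here is arbitrary.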

     
