WEB PAGE VECTORIZATION: A METHOD TO OPTIMIZE THE QUALITY OF WEB CRAWLERS
-
Abstract
The performance of a search engine depends in part on the ability of its web crawler to acquire content from the network. Inspired by vectorized representations in deep learning and by convolutional neural networks, this paper focuses on how a computer can understand information (natural language and images) and judge the relevance between pieces of information. It proposes a web page vector representation algorithm (Page2Vec) and, based on Page2Vec, builds a crawler-filter algorithm (Crawler-Filter). Experiments show that the Crawler-Filter algorithm can cover reasonable content while bypassing low-quality or irrelevant content during web crawling.
-
-