1. Search engines identify noise based on visual information, so if we want our pages to be denoised as quickly as possible, we should follow general conventions when building the site: place the main text in the middle area of the page and keep to the layout rules most websites follow. Highly personalized pages make it harder for the search engine to recognize noise.
3. Based on templates.
How to reduce noise manually
To reduce noise ourselves, we should start from the search engine's denoising principles and carry out manual noise reduction according to each of them:
1. Based on visual information.
2. Based on structure. The page is split into partitions according to its HTML tags, such as header, navigation, main text and advertising blocks, and only the main text and other important parts are kept.
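The structure-based idea can be illustrated with a minimal sketch that uses only Python's standard `html.parser`. The tag set `NOISE_TAGS` and the names `MainTextExtractor` and `extract_main_text` are assumptions made for this example; the rules a real search engine applies are not public and are certainly more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical list of tags treated as noise blocks (an assumption for this sketch).
NOISE_TAGS = {"header", "nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise blocks we are currently inside
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of the page with the noise blocks left out."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><header>Site name</header><nav>Home | About</nav>"
    "<div><p>Article body text.</p></div>"
    "<footer>Copyright 2020</footer></body></html>"
)
print(extract_main_text(page))
```

The sketch relies on semantic tags; a page that marks up its navigation or footer with plain `div` elements would need class or id based rules instead.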
2. Search engines also identify noise based on page structure. Because that identification relies on the page's HTML tags, the page has to be fetched before its blocks can be told apart, so content unrelated to the main text that is never fetched in the first place cannot become noise. Many low-value blocks, such as headers, advertising and copyright notices, can therefore be loaded through JS calls, because these blocks are repeated across the whole site, especially advertising, copyright notices and similar statements. Once they are indexed, the search engine has a lot to denoise, and they may even cause duplicate content. One thing we must pay attention to: only put a block into JS if you do not want it to be captured.
In general, a search engine judges noise within a single site: it will not decide that part of one site is noise because the corresponding part of another site is noise. Within one site, the denoising principles search engines currently use can be divided into three categories:
Denoising principles for SEOers
Denoising is a basic step in search engine preprocessing. After the search engine has fetched a page, it preprocesses it by extracting the text, segmenting words and removing stop words, and denoising is part of this stage. It means identifying content on the page that has no significance for the ranking calculation, such as the navigation bar, copyright text and advertising blocks. A search engine has to process a very large number of web pages, and the meaningless share of their content is large. To save computing resources and compute faster, the search engine identifies this content during preprocessing and removes it. This process is called denoising, and the content that is removed is the noise.
Search engine denoising principles
The template principle. A shared template is extracted from a set of web pages, and that template is then used to extract the useful information from each page.
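A minimal sketch of the template idea follows, under a strong simplifying assumption: each page is already split into a list of text blocks, and the "template" is simply the set of blocks that recur on most pages of the same site. The function names `learn_template` and `strip_template` and the `min_share` threshold are hypothetical, chosen for illustration only.

```python
from collections import Counter


def learn_template(pages, min_share=0.8):
    """Return the text blocks that appear on at least min_share of the pages."""
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def strip_template(blocks, template):
    """Keep only the blocks of one page that are not part of the template."""
    return [block for block in blocks if block not in template]


# Three pages from the same (made-up) site, already split into text blocks.
site_pages = [
    ["Home | About", "Post one", "Copyright"],
    ["Home | About", "Post two", "Copyright"],
    ["Home | About", "Post three", "Copyright"],
]
site_template = learn_template(site_pages)
print(strip_template(site_pages[0], site_template))
```

Because the template is learned from pages of one site only, the sketch is consistent with the point that noise is judged per site rather than across sites.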
What is denoising
The visual information principle. It uses the layout information of the elements on the page: the page is divided into regions according to that layout, the middle area of the page is retained, and the other areas are treated as noise.
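The layout-based idea can be sketched as follows, assuming that the rendered position and size of each block are already known (in practice they would come from a rendering step that is not shown here). The `Block` class, the `central_blocks` function and the `margin` parameter are hypothetical names for this illustration, not a description of any search engine's actual method.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float


def central_blocks(blocks, page_width, page_height, margin=0.2):
    """Keep blocks whose center lies inside the middle region of the page.

    margin is the share of the page cut away on every side (assumed value).
    """
    left, right = page_width * margin, page_width * (1 - margin)
    top, bottom = page_height * margin, page_height * (1 - margin)
    kept = []
    for block in blocks:
        center_x = block.x + block.width / 2
        center_y = block.y + block.height / 2
        if left <= center_x <= right and top <= center_y <= bottom:
            kept.append(block)
    return kept


layout = [
    Block("Header", 0, 0, 1000, 100),
    Block("Sidebar", 0, 100, 150, 1800),
    Block("Article", 200, 150, 600, 1600),
    Block("Footer", 0, 1900, 1000, 100),
]
kept_texts = [b.text for b in central_blocks(layout, 1000, 2000)]
print(kept_texts)
```

This also shows why unconventional layouts are risky: a main text placed in a narrow column at the edge of the page would fall outside the retained region and be treated as noise.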