Search engine spamming refers to the practice of creating Web pages, or sets of Web pages, designed to get a high relevance rank for some queries, even though the sites are not actually popular.
For example, a travel site may want to be ranked high for queries with the keyword "travel". It can get high TF-IDF scores by repeating the word "travel" many times in its page.
(Repeated words in a Web page may confuse users; spammers can tackle this problem by delivering different pages to search engines than to other users for the same URL, or by making the repeated words invisible, for example by formatting the words in a small white font on a white background.)
Even a site unrelated to travel, such as a pornographic site, could do the same thing, and would get highly ranked for a query on the word "travel". In fact, this sort of spamming of TF-IDF was common in the early days of Web search, and there was a constant battle between such sites and search engines that tried to detect spamming and deny them a high ranking.
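To make the keyword-stuffing effect concrete, here is a minimal sketch in Python. It assumes one simple TF-IDF formulation (term frequency as a fraction of the page's words, times the log of inverse document frequency); real search engines use more elaborate variants, and the page texts and the `tf_idf` helper are made up for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Simple TF-IDF: (share of words in doc equal to term) * log(N / df)."""
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    # Number of documents in the corpus that contain the term at all.
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical pages: a genuine travel page, a stuffed page, and an unrelated one.
honest_page = "cheap travel deals and travel tips for europe"
stuffed_page = "travel " * 50 + "unrelated adult content"
other_page = "recipes for quick weeknight dinners"
corpus = [honest_page, stuffed_page, other_page]

honest_score = tf_idf("travel", honest_page, corpus)
stuffed_score = tf_idf("travel", stuffed_page, corpus)
print("honest:", honest_score, "stuffed:", stuffed_score)
```

Under these assumptions the stuffed page outscores the genuine travel page for the query "travel", simply because almost all of its words are the keyword.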
Popularity ranking schemes such as PageRank make the job of search engine spamming more difficult, since just repeating words to get a high TF-IDF score is no longer sufficient. However, even these techniques can be spammed, by creating a collection of Web pages that point to each other, increasing their popularity rank.
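The effect of such a collection of mutually linking pages (often called a link farm) can be sketched with a small PageRank computation. This is an illustrative power-iteration implementation, not the one any search engine uses; the damping factor of 0.85, the tiny graphs, and all page names are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical honest Web: three pages in a cycle; nobody links to "spam".
plain_web = {"a": ["b"], "b": ["c"], "c": ["a"], "spam": ["a"]}

# Same Web plus a link farm: five pages that link to "spam", which links back.
farm_pages = [f"farm{i}" for i in range(5)]
farmed_web = dict(plain_web)
farmed_web["spam"] = list(farm_pages)
for page in farm_pages:
    farmed_web[page] = ["spam"]

plain = pagerank(plain_web)
farmed = pagerank(farmed_web)
print("spam rank without farm:", plain["spam"])
print("spam rank with farm:   ", farmed["spam"])
print("rank of honest page a: ", farmed["a"])
```

In this toy setting the spam page's rank rises sharply once the farm is added, and it even overtakes the honest pages, although no page outside the farm links to it.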
Techniques such as using sites instead of pages as the unit of ranking (with appropriately normalized jump probabilities) have been proposed to avoid some spamming techniques, but are not fully effective against others. The war between search engine spammers and the search engines continues even today.
The hubs and authorities approach of the HITS algorithm is more susceptible to spamming, because a page's hub score depends only on the pages it links to, which the spammer controls. A spammer can create a Web page containing links to good authorities on a topic, and gain a high hub score as a result.
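The hub-score attack can be illustrated with a small HITS iteration. The sketch below assumes the usual update rule (authority score is the sum of hub scores of pages linking in, hub score is the sum of authority scores of pages linked to, each normalized per round); the graph and page names are hypothetical.

```python
import math

def hits(links, iterations=50):
    """Iterative HITS over a dict mapping page -> list of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to it.
        auth = {v: 0.0 for v in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {v: sum(auth[t] for t in links.get(v, [])) for v in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values()))
            if norm > 0:
                for v in scores:
                    scores[v] /= norm
    return hub, auth

# Two genuine hubs each link to some good authorities on a topic.
# The spam page simply links to all of them; nobody links to the spam page.
web = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth2", "auth3"],
    "spam": ["auth1", "auth2", "auth3"],
}

hub, auth = hits(web)
print("hub scores:", {v: round(hub[v], 3) for v in ("hub1", "hub2", "spam")})
print("spam authority score:", auth["spam"])
```

Here the spam page ends up with the highest hub score without receiving a single incoming link, which is the weakness described above; its authority score stays at zero.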
Monday, November 24