If they pretend to be Googlebot they should be pretty easy to block: Google lets you verify Googlebot's source IPs so nobody else can impersonate it (impersonation bypasses certain pages on some sites; there are plenty of phpBB forums out there that require an account to view threads, except when your user agent is Googlebot). It's just a matter of doing a DNS lookup, which can be cached and shouldn't take very long, even for larger sites. A similar method works for Bingbot as well.
Doing this verification will also kick out tons of other crawlers and bots that you probably don’t want anyway.
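For reference, here's a minimal sketch of that check, assuming Python's standard-library resolver and an in-process cache (Google documents the reverse-then-forward DNS approach; in production you'd probably want an async resolver and a cache with a TTL rather than an unbounded LRU, and the exact list of accepted suffixes is an assumption here):

```python
import socket
from functools import lru_cache

# Suffixes Googlebot PTR records are expected to fall under (assumption: adjust per Google's docs).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

@lru_cache(maxsize=4096)
def is_real_googlebot(ip: str) -> bool:
    """Verify a claimed-Googlebot IP via reverse DNS plus a forward-confirming lookup."""
    try:
        # Reverse lookup: the PTR record should point into Google's crawler domains.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must resolve back to the original IP,
        # otherwise anyone could publish a fake PTR record for their own IP range.
        _, _, addrs = socket.gethostbyname_ex(host)
        return ip in addrs
    except (socket.herror, socket.gaierror):
        # No PTR record, or the hostname didn't resolve: treat as not Googlebot.
        return False
```

Anything that fails the check but still claims a Googlebot user agent can be dropped or rate-limited; that's where the other impersonating crawlers get caught too.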
I don’t really see what optimisation of search engines has to do with censorship. Search engine users want answers, they’re not just an SEO API. Without some manual balancing, search engines would be as useless as the second or third page of Google.
They don't pretend to be Googlebot; they use their own crawler, they just don't share the name they use for it, so sites can't exclude it with robots.txt. They scrape the same sites Googlebot does, so if a site excludes Googlebot they skip it as well.