Are Bad Bots Eating Up Your Site's Resources? Block 'Em Out!
Bots come in many flavors.
Some, like the Google Bot and Bing Bot, are welcome visitors.
Most well-behaved bots will obey your robots.txt file. Then there are the snoops and bandwidth leeches that couldn't care less about the rules written in your robots file.
I have been dealing with some performance issues lately and discovered that the Chinese bot Baiduspider has been ignoring my robots.txt file, scraping my site, and sucking up my bandwidth and system resources. There are others that need to be shown the door as well, but Baidu is the worst offender I have seen so far!
Baidu is an image thief, and if you have a lot of photos on your site they will hit it from multiple IPs and slow it down to a crawl. And they steal more than images: videos, MP3s, HTML and text documents too. Anything on your server is fair game!
The best blocking technique is to use your .htaccess file, which works if your server runs Apache on Linux (the most common shared hosting setup). If your hosting account has a file manager, look in your public_html root folder for a file called .htaccess and make a copy of it in a temp folder! You can also usually download a copy of it to your local drive if your host provides cPanel or GoDaddy’s FTP file manager. Or use an FTP program to download it to your local drive. Make a copy and put it somewhere safe, just in case your server croaks after you modify the file.
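If you have shell or script access, the backup step can be automated. Here is a minimal Python sketch, not a definitive tool: the function name, the paths and the backup file naming are all my own assumptions, so adjust them to your hosting layout.

```python
import shutil
import time
from pathlib import Path


def backup_htaccess(path, backup_dir):
    """Copy the file at `path` into `backup_dir` under a timestamped name.

    Both arguments are hypothetical examples; point them at your own
    .htaccess file and at a folder outside the one you are editing.
    """
    source = Path(path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Example name: htaccess-20120201-101500.bak
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"htaccess-{stamp}.bak"

    # copy2 copies the contents and also keeps the modification time.
    shutil.copy2(source, target)
    return target
```

For example, backup_htaccess("public_html/.htaccess", "htaccess-backups") would leave a dated copy in a folder named htaccess-backups. If an edit breaks the site, copy that file back over .htaccess.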
Updated 02/01/2012. Here is my updated exclusion list. These bots are my personal preference. I suggest you Google the bot’s user agent to research what it does, then decide if you want to block it or not.
BrowserMatchNoCase OmniExplorer_Bot/6.11.1 bad_bot
BrowserMatchNoCase omniexplorer_bot bad_bot
BrowserMatchNoCase Baiduspider bad_bot
BrowserMatchNoCase Baiduspider/2.0 bad_bot
BrowserMatchNoCase yandex bad_bot
BrowserMatchNoCase yandeximages bad_bot
BrowserMatchNoCase Spinn3r bad_bot
BrowserMatchNoCase sogou bad_bot
BrowserMatchNoCase Sogouwebspider/3.0 bad_bot
BrowserMatchNoCase Sogouwebspider/4.0 bad_bot
BrowserMatchNoCase sosospider+ bad_bot
BrowserMatchNoCase jikespider bad_bot
BrowserMatchNoCase ia_archiver bad_bot
BrowserMatchNoCase PaperLiBot bad_bot
BrowserMatchNoCase ahrefsbot bad_bot
BrowserMatchNoCase ahrefsbot/1.0 bad_bot
BrowserMatchNoCase SiteBot/0.1 bad_bot
BrowserMatchNoCase DNS-Digger/1.0 bad_bot
BrowserMatchNoCase DNS-Digger-Explorer/1.0 bad_bot
BrowserMatchNoCase boardreader bad_bot
BrowserMatchNoCase radian6 bad_bot
BrowserMatchNoCase R6_FeedFetcher bad_bot
BrowserMatchNoCase R6_CommentReader bad_bot
BrowserMatchNoCase ScoutJet bad_bot
BrowserMatchNoCase ezooms bad_bot
BrowserMatchNoCase CC-rget/5.818 bad_bot
BrowserMatchNoCase libwww-perl/5.813 bad_bot
BrowserMatchNoCase "magpie-crawler 1.1" bad_bot
BrowserMatchNoCase jakarta bad_bot
BrowserMatchNoCase discobot/1.0 bad_bot
BrowserMatchNoCase MJ12bot bad_bot
BrowserMatchNoCase MJ12bot/v1.2.0 bad_bot
BrowserMatchNoCase MJ12bot/v1.2.5 bad_bot
BrowserMatchNoCase SemrushBot/0.9 bad_bot
BrowserMatchNoCase MLBot bad_bot
BrowserMatchNoCase butterfly bad_bot
BrowserMatchNoCase SeznamBot/3.0 bad_bot
BrowserMatchNoCase HuaweiSymantecSpider bad_bot
BrowserMatchNoCase Exabot/2.0 bad_bot
BrowserMatchNoCase netseer/0.1 bad_bot
BrowserMatchNoCase "NetSeer crawler/2.0" bad_bot
BrowserMatchNoCase NetSeer/Nutch-0.9 bad_bot
BrowserMatchNoCase psbot/0.1 bad_bot
BrowserMatchNoCase Moreoverbot/x.00 bad_bot
BrowserMatchNoCase moreoverbot/5.0 bad_bot
BrowserMatchNoCase "Jakarta Commons-HttpClient/3.0" bad_bot
BrowserMatchNoCase SocialSpider-Finder/0.2 bad_bot
BrowserMatchNoCase MaxPointCrawler/Nutch-1.1 bad_bot
BrowserMatchNoCase willow bad_bot
Order Deny,Allow
Deny from env=bad_bot
# End Bad Bot Blocking
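Each BrowserMatchNoCase line treats its first argument as a regular expression and tests it, ignoring case, against the visitor's User-Agent header. Any match sets the bad_bot variable, which the Deny line then acts on. The Python sketch below imitates that matching so you can try patterns before uploading them. It is only an approximation: Python's re module stands in for Apache's regex engine, the function name is mine, and the pattern list is a small sample of the list above.

```python
import re

# A small sample of patterns taken from the exclusion list.
BAD_BOT_PATTERNS = [
    "Baiduspider",
    "yandex",
    "MJ12bot",
    "ahrefsbot",
    "libwww-perl/5.813",
]


def is_bad_bot(user_agent, patterns=BAD_BOT_PATTERNS):
    """Return True if any pattern matches anywhere in the User-Agent.

    Mirrors BrowserMatchNoCase: an unanchored, case-insensitive regex search.
    """
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


baidu = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
google = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(is_bad_bot(baidu))               # True: "Baiduspider" matches, version and all
print(is_bad_bot(google))              # False: no pattern matches
print(is_bad_bot("libwww-perl/6.05"))  # False: the pattern is pinned to version 5.813
```

Two things follow from this. A bare name such as Baiduspider already catches Baiduspider/2.0, so the versioned duplicates are harmless but not needed. And a pattern pinned to one version number stops matching as soon as the bot updates, so the bare name is usually the safer choice.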
If your add-on domains do not have an .htaccess file in their directory, just create a new file and paste this block snippet into it. Be sure to add a dot (.) to the beginning of the file name: “.htaccess”.
order allow,deny
allow from all
deny from 24.45.228.190
# End of IP Block
If you are running WordPress, there is a great plugin called “Visitor Maps and Whos Online”. It can show you which bots are visiting your site. Then you can Google the bot's user agent string to see what it is up to.
Below is a screen grab I took before adding my block.
And another screen shot after giving them the bum's rush out the door!
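If you are not on WordPress, your raw access log holds the same information. Here is a rough Python sketch that tallies hits per user agent. It assumes the log uses Apache's "combined" format, where the User-Agent is the last quoted field on each line. The function name is my own, and your host may log in a different format.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, so the User-Agent is the
# last double-quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match:  # skip lines that do not end in a quoted field
            counts[match.group(1)] += 1
    return counts
```

To use it, open your log file (the name varies by host), pass the file object to count_user_agents, and loop over .most_common(10) on the result to see the ten busiest agents. The ones at the top that you do not recognize are the ones to research.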




Thanks. Baidu has been driving me nuts, using up all my CPU and I’ve tried several different ways to block it.
I’ve added your code to htaccess, so... fingers crossed. Hope it works.
Mark
That should do the trick.
Be sure to block via robots.txt too.
user-agent: *
disallow: /cgi
disallow: /images
user-agent: baiduspider
disallow: /
user-agent: yeti
disallow: /
user-agent: omniexplorer_bot
disallow: /
user-agent: yandeximages
disallow: /
user-agent: Spinn3r
disallow: /
user-agent: sogou
disallow: /
user-agent: jikespider
disallow: /
user-agent: ia_archiver
disallow: /
user-agent: PaperLiBot
disallow: /
user-agent: Amazon
disallow: /
user-agent: boardreader
disallow: /
user-agent: boardtracker
disallow: /
user-agent: R6_CommentReader
disallow: /
user-agent: R6_FeedFetcher
disallow: /
user-agent: Radian6
disallow: /
user-agent: ScoutJet
disallow: /
User-agent: spbot
Disallow: /
User-agent: Sogou web spider/3.0
Disallow: /
User-agent: Sogou web spider/4.0
Disallow: /
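A typo in robots.txt fails silently, so it is worth checking that the rules say what you think they say. This sketch uses Python's standard urllib.robotparser on a shortened copy of the rules above. The helper name and the test paths are made up for illustration. Keep in mind that it only shows how a well-behaved parser reads the file. A bot that ignores robots.txt will not care either way.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above, for illustration only.
RULES = """\
user-agent: *
disallow: /cgi
disallow: /images
user-agent: baiduspider
disallow: /
"""


def allowed(rules_text, user_agent, path):
    """Return True if `user_agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed(RULES, "Baiduspider", "/photos/cat.jpg"))  # False: blocked everywhere
print(allowed(RULES, "Googlebot", "/index.html"))        # True: only /cgi and /images are off limits
print(allowed(RULES, "Googlebot", "/images/cat.jpg"))    # False: caught by the * rules
```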
Be sure to make a backup of your .htaccess file!