You probably keep an eye on your daily visitors, pageviews, and other statistics about your website traffic. Have you ever wondered how many of those visitors are actually bots? Most of us probably don’t think about it. We see those nice stats in Google Analytics, but we don’t give much thought to the non-human (bot) visitors.
Since I like to test software and services that claim to make things safer, faster, or better, I decided to try out another service I had my eye on. Its main purpose is to make my website safer. It has a few other useful features, but I was mainly after the additional security it is supposed to provide.
It just so happens that it tells me the total number of visits and what percentage of them are humans versus bots. How cool is that!
I should mention that at the time of this post I have been using it for barely two days. In that time it tells me 27% of visits are humans and 73% are bots. Not what I would have guessed or expected. I was thinking more along the lines of 50%/50%, or possibly 60%/40%.
Some people get a little uneasy when words like bot/bots are mentioned, but they seem to be OK with a word like crawler. If I say Googlebot or Google crawler, no big deal, but if I mention other bots and crawlers, some people are ready to block them all.
Obviously there are good bots and some not so nice bots. Not all of them are good, but not all of them are bad either. One person may consider certain bots bad, while someone else may just be annoyed by them.
Since I am seeing a lot of bots, I thought I would look into where they are coming from. I don’t have time to go through tons of logs and examine every single hit. I do check them fairly often, looking for things that stand out as out of the ordinary or activity that shows up frequently.
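If you would rather not eyeball logs by hand, a rough count per user agent goes a long way. Here is a minimal sketch in Python. It assumes each log line ends with the user agent in double quotes (as in the common "combined" access log format), and the keyword list, the function names, and the sample lines are all made up for illustration. It is not how the service I am using works.

```python
import re
from collections import Counter

# Assumption: each line ends with the user agent in double quotes,
# as in the common "combined" access log format.
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')

# Assumption: a crude keyword list. Bots do not always identify themselves,
# so this will undercount.
BOT_KEYWORDS = ("bot", "crawler", "spider")


def count_user_agents(lines):
    """Count hits per user agent string."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def bot_share(counts):
    """Return the fraction of hits whose user agent looks like a bot."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    bot_hits = sum(
        hits for agent, hits in counts.items()
        if any(keyword in agent.lower() for keyword in BOT_KEYWORDS)
    )
    return bot_hits / total


# Made-up sample lines, for illustration only.
sample_lines = [
    '203.0.113.5 - - [date] "GET /a HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.5 - - [date] "GET /b HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.5 - - [date] "GET /c HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '198.51.100.7 - - [date] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (example browser)"',
]

counts = count_user_agents(sample_lines)
share = bot_share(counts)
```

Calling `counts.most_common(10)` on a real log would list the ten busiest user agents, which is usually enough to spot a visitor that keeps coming back.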
It didn’t take long to find a bot that showed up over and over again. It is identified as Magpie Crawler from the United Kingdom. In this screenshot you can see it hit 3 of my pages at nearly the same time.
Just 4 minutes before that it had visited another page.
The entries say 1 Access Control and 3 Access Control because those visits were blocked.
In this particular case it amounted to 4 page visits in a 4-minute period. It doesn’t seem to follow any particular pattern, and the frequency comes and goes. It will visit several pages within a few minutes, then wait 20 or 30 minutes and visit several more. Other times it crawls 3 pages at a time, waits a few minutes, and comes back again. This happens all day long.
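One way to make that stop-and-go behavior visible is to group a visitor's hits into bursts. The sketch below is only an illustration: the 5-minute gap threshold, the function name, and the hit times are all assumptions I made up, not values taken from my logs.

```python
from datetime import datetime, timedelta


def group_into_bursts(timestamps, max_gap=timedelta(minutes=5)):
    """Group hit times into bursts; a gap longer than max_gap starts a new burst."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


# Made-up hit times, for illustration only.
hits = [
    datetime(2000, 1, 1, 10, 0),
    datetime(2000, 1, 1, 10, 4),
    datetime(2000, 1, 1, 10, 4),
    datetime(2000, 1, 1, 10, 4),
    datetime(2000, 1, 1, 10, 30),
    datetime(2000, 1, 1, 10, 31),
]

bursts = group_into_bursts(hits)
burst_sizes = [len(burst) for burst in bursts]
```

With these sample times the first four hits fall into one burst and the last two into another, which is the kind of pattern described above.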
In this screenshot you can see I have had more visits from the Magpie Crawler, with 436 (7.49%), than from Google Chrome users, with 405 (6.96%).
You might disagree with me, but in my opinion there is no reason that this Magpie Crawler needs to visit my pages this often. It visits and re-visits over and over again pretty much non-stop. Googlebot is not as active or aggressive as this thing.
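As a quick sanity check, the two counts and percentages should point to the same overall total. A small Python sketch of that arithmetic, using only the figures above (the function name is just for illustration):

```python
def implied_total(hits, percent):
    """Total visits implied by a hit count and its percentage share."""
    return hits / (percent / 100)


# Figures from the stats above.
total_from_magpie = implied_total(436, 7.49)
total_from_chrome = implied_total(405, 6.96)

# How many more visits the crawler made than Chrome users.
extra_crawler_visits = 436 - 405
```

Both results come out at roughly 5,820 total visits, so the two percentages agree with each other, and the crawler made 31 more visits than all Chrome users combined.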
So naturally I wanted to see what the deal is with this Magpie Crawler. As you can see in the screen shot it has something to do with brandwatch.net. The actual details and information about their crawler can be found here:
I don’t have time to do extensive research on it, but I did run a few Google searches to see what other people were saying. It turns out there are quite a few topics about the Magpie Crawler.
If you visit the details and information page I mentioned above it says:
Who are Brandwatch?
We are a social media monitoring company, helping our customers find useful and relevant comments and discussion on the web. We crawl blogs, forums, news sites, and all kinds of social media content. The content is indexed, much like a search engine, allowing our users to find the pages that mention the words they are interested in.
How does our crawling benefit you?
In the same way that search engines help people find content, our crawlers make your pages visible to a wider audience. Our application links to your content, directing more people to read the pages on your site. Many large companies use our services to listen to what people think about their brands, products and services, so there’s a good chance your opinion will be heard.
What to do if you don’t want our crawler downloading pages from your site:
If you add the following rule to your robots.txt file, our crawler will not download any pages from your site:
That is great, if their crawler abides by your robots.txt rules. While doing some research I found a couple of sites that said it doesn’t follow robots.txt rules. I haven’t verified that; it is just what those sites said. If you don’t want this thing visiting your website, it might be safer to block it with .htaccess as well.
It says that their crawler can benefit you because large companies use their services and there’s a good chance your opinion will be heard. Honestly, I doubt many large companies, if any at all, will see or hear anything I post.
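Before relying on a robots.txt rule, it is worth checking that it says what you think it says. Here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent token `magpie-crawler` and the robots.txt content are my own assumptions for illustration, not the rule quoted on the crawler's information page, so check the token against their documentation first.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the crawler's user agent token is "magpie-crawler".
ASSUMED_TOKEN = "magpie-crawler"

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: magpie-crawler
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path="/"):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


magpie_allowed = is_allowed(ROBOTS_TXT, ASSUMED_TOKEN)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot")
```

This only tells you what the rule asks for. It cannot tell you whether a crawler actually obeys it, which is why a server-side block is the safer option if you really want a bot kept out.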
I also checked the brandwatch.com pricing page here:
They have a pro plan for £400/month, an enterprise plan for £1500/month, and a slightly hidden 7 day pitch plan for £100/month. What this seems to imply is that they charge their customers a pretty steep fee to watch pages that mention words that interest them. In the long run it costs me additional bandwidth and server resources just so their crawler can mine my website and they can sell what it finds to their customers for a monthly fee. It is not like they are a free public search engine like Google, nor do they make this info available to everyone.
So after careful consideration I have decided to block the Magpie Crawler completely. I probably wouldn’t have if their bot weren’t so aggressive and didn’t visit so frequently.
You might say: but you are potentially losing traffic by blocking it. Yes, this is true. However, since they charge a substantial monthly fee, the way I look at it is that they most likely cater to large businesses and corporations that will not be interested in my website on a regular basis anyway. If I lose any traffic as a result of blocking them, it will be minimal.
Now…one bot down…a few thousand more to go!