marco-urban wrote:
I didn't know that there is a difference between
www.domain.com and domain.com.
Normally they point to the same website, and in most cases one of them redirects to the other (your choice of either www or non-www).
marco-urban wrote:What I'm doing at the moment is checking strategies against AI bots. I'm chairman of FREELENS, an association of more than 2000 German professional photographers. I did block the bots with the Cloudflare free version, but I wanted to change the robots.txt just to see how it works.
robots.txt instructs so-called "good bots" not to scrape the website. However, not all "good bots" are necessarily good for you, even though they aren't "bad" in the sense of hacking your website. Many SEO bots and AI bots, for instance, give you nothing in return and only slow down your server and website. It's debatable whether you should allow or block them, because in a few cases a website may get an incoming reference link from an AI service.
Then there are anonymous bots and bad bots, meaning bots that don't care about your robots.txt. Most don't have bad intentions, although some are searching for vulnerabilities (e.g. WordPress logins and plugins). These bots are difficult to identify, because they often don't introduce themselves as bots. Cloudflare has "Bot Fight Mode", which blocks bots based on Cloudflare's own algorithms. Bots with generic names like "Scrapy" can be anything, because anyone can set them up.
It's really difficult to set up a perfect "block" for bots without also blocking bots that should normally be welcome (Google search, Facebook page previews etc). Personally, I use Cloudflare "Bot Fight Mode" and also have some rules in WAF > Custom rules to block certain bots by user agent. If you wanted to block ALL non-human requests, you could enforce a Cloudflare "Challenge" ... I do that for control panels and other stuff that is only meant for humans.
marco-urban wrote:Where can I see the bots scraping my website?
You can check which visitors are blocked by Cloudflare under Security > Events. However, you won't be given a neat list of the names of blocked bots.
As for bots that are NOT blocked by Cloudflare (there will be many of course, because Cloudflare will only block the ones that seem "dangerous", unless you instruct it otherwise), you can only review visitors from your Apache/Nginx "access log". This isn't easy to read either, unless you know what you are looking for.
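To make the access log easier to read, you can count requests per user agent. Below is a minimal Python sketch, assuming your server writes the common "combined" log format (the default for many Apache/Nginx setups, but check yours). The sample lines and the function name are made up for illustration, and the pattern is deliberately simple, so it may skip unusual lines.

```python
import re
from collections import Counter

# Assumes the "combined" log format, which ends with:
#   "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user-agent string; lines that don't match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /gallery HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Oct/2024:13:56:01 +0000] "GET / HTTP/1.1" 200 5123 "-" "Scrapy/2.11"',
]

counts = count_user_agents(sample)
for agent, hits in counts.most_common(10):
    print(hits, agent)
```

For a real log you would pass an open file instead of the sample list, e.g. `count_user_agents(open(path))`, where `path` is wherever your server keeps its access log. Keep in mind that the user agent is whatever the visitor claims to be, so anonymous bots will show up looking like ordinary browsers.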
There is no simple recipe; it depends on what you are trying to achieve.
- You can use robots.txt, but if you start analyzing visitors, you will end up with a robots.txt that is too long, and many of the bots won't care about the limitations in your robots.txt. Only self-proclaimed "good bots" will respect instructions from your robots.txt.
- If I were going to use robots.txt, I would probably instead set up a rule that initially disallows ALL, and then allow only the few user agents I would like to let in (like common search engine crawlers and page sharing bots).
- So even if you use robots.txt, there are zillions of automated bots that simply don't care what you have in robots.txt, and will still crawl your pages. You can only block these by using Apache/Nginx rules or Cloudflare (I prefer to use Cloudflare, because it's better to block before they hit your server in the first place).
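The "disallow all, then allow a few" idea can be sketched like this. The Python below builds such a robots.txt and checks it with the standard library's robots.txt parser. The three user-agent tokens are only examples of crawlers you might want to allow, and example.com is a placeholder; pick your own list. It only shows what well-behaved bots are told, it does not stop bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example allowlist (an assumption, adjust to the crawlers you actually want).
ALLOWED_AGENTS = ["Googlebot", "Bingbot", "facebookexternalhit"]


def build_robots_txt(allowed_agents):
    """Allow the listed user agents everywhere and disallow everyone else."""
    blocks = [f"User-agent: {agent}\nAllow: /\n" for agent in allowed_agents]
    blocks.append("User-agent: *\nDisallow: /\n")
    return "\n".join(blocks)


robots_txt = build_robots_txt(ALLOWED_AGENTS)
print(robots_txt)

# Check how a compliant bot would interpret the file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ["Googlebot", "facebookexternalhit", "GPTBot"]:
    allowed = parser.can_fetch(agent, "https://example.com/gallery/")
    print(agent, "allowed" if allowed else "disallowed")
```

The advantage over a long blocklist is that you don't have to keep adding new bot names; anything not on the short list is asked to stay out by default.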
marco-urban wrote:It's also important for legal aspects in Europe.
I'm not quite sure about that. How are you breaking any law by allowing visitors to view your website? That's almost like getting arrested because someone breaks into your house and you failed to stop them.