X3 Photo Gallery Support Forums

marco-urban
Topic Author
Posts: 12
Joined: 07 Jul 2008, 15:42

Change robots.txt

01 Nov 2024, 15:16

Hello!
I need to change the robots.txt to keep out bots that want to scrape pictures for training AI. It will need to be updated often as new bots appear.
I put the robots.txt into the root directory and it was shown at www.marco-urban.de/robots.txt but not at marco-urban.de/robots.txt.
marco-urban.de/robots.txt always shows the robots.txt created by you.
It must be possible to create my own robots.txt for my domain. How can I do that?
Thanks
Marco
 
mjau-mjau
X3 Wizard
Posts: 14452
Joined: 30 Sep 2006, 03:37

Re: Change robots.txt

01 Nov 2024, 22:50

marco-urban wrote:it was shown at www.marco-urban.de/robots.txt but not at marco-urban.de/robots.txt
Ok, so let's start with first things first ...

www.marco-urban.de/robots.txt
marco-urban.de/robots.txt

They are both definitely showing the same text. If you can't see that in your browsers, then it means they are getting temporarily cached in Cloudflare (which I see you are using). I see that you have them set to cache in Cloudflare for 4 hours. Thus, once you read this post, I assume they are both showing correctly in your browser, with the same text. Just to be 100% clear here, it is the SAME file displaying in both domains, just that one was cached in your browser.

More important question: why don't you redirect www to non-www, or non-www to www? Your website should only be hosted on ONE choice of www or non-www. Having both is confusing, and could be bad for SEO. Just redirect one to the other, like we do for https://photo.gallery, which you will see redirects to www.photo.gallery ... Having both serves no purpose for anybody.
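To make the redirect idea concrete, here is a minimal Python sketch of the decision a server makes for each request. It is only an illustration, not how X3 or Cloudflare implement it: the host www.example.com and the function name redirect_target are assumptions, and in practice the redirect would normally be configured in the web server or in Cloudflare rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the site owner picked the www host as the canonical one.
CANONICAL_HOST = "www.example.com"


def redirect_target(url, canonical_host=CANONICAL_HOST):
    """Return the URL to redirect to (301), or None if no redirect is needed.

    Hypothetical helper for illustration only. Ports and credentials in the
    original URL are dropped for simplicity.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == canonical_host:
        return None  # already on the canonical host

    # Work out the bare (non-www) form of the canonical host.
    bare = canonical_host[4:] if canonical_host.startswith("www.") else canonical_host
    if host not in (bare, "www." + bare):
        return None  # unrelated host, leave it alone

    # Same scheme, path and query, but on the canonical host.
    return urlunsplit(
        (parts.scheme, canonical_host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(redirect_target("https://example.com/robots.txt"))
    print(redirect_target("https://www.example.com/robots.txt"))
```

With a redirect like this in place, only one hostname ever serves robots.txt, so there is only one copy to cache and compare.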
marco-urban wrote: I need to change the robots.txt to keep out Bots that want to scrape pictures for training AI. It will need to be updated with new bots often.
Why not block bots with Cloudflare WAF instead? Only "good" bots will respect your "robots.txt" file ... Many bots will ignore it and just go ahead and scrape your content whether you like it or not. For example, I see you have many generic bots like omgilibot, peer39_crawler, PerplexityBot, Scrapy and TurnitinBot, and I'm pretty sure not all of those read or care about your robots.txt.
 
marco-urban
Topic Author
Posts: 12
Joined: 07 Jul 2008, 15:42

Re: Change robots.txt

03 Nov 2024, 11:33

First things first: I love your service! Superfast! Thank you very much!

I didn't know that there is a difference between www.domain.com and domain.com. I'll check that and set up a redirect.
What I'm doing at the moment is checking strategies against AI bots. I'm chairman of FREELENS, an association of more than 2000 German professional photographers. I did block the bots with the free version of Cloudflare, but I wanted to change the robots.txt just to see how it works. It's also important for legal aspects in Europe.
Where can I see the bots that are scraping my website?
Best
Marco
P.S.: Thank you very much for your amazing service. I recommend X3 to everyone.
 
mjau-mjau
X3 Wizard
Posts: 14452
Joined: 30 Sep 2006, 03:37

Re: Change robots.txt

03 Nov 2024, 22:49

marco-urban wrote: I didn't know that there is a difference between www.domain.com and domain.com.
Normally, they will point to the same website, and in most cases they would also redirect to one (your choice) of either www- or non-www.
marco-urban wrote: What I'm doing at the moment is checking strategies against AI bots. I'm chairman of FREELENS, an association of more than 2000 German professional photographers. I did block the bots with the free version of Cloudflare, but I wanted to change the robots.txt just to see how it works.
robots.txt will instruct so-called "good bots" to not scrape the website. However, not all "good bots" are necessarily good, but they definitely aren't "bad" (they aren't hacking your website). There are many SEO bots and AI bots, for instance, that do nothing but slow down your server and website. It's debatable whether you should allow or block them, because in a few cases a website may get an incoming reference link from AI.

Then there are anonymous bots and bad bots, or bots that don't care about your robots.txt. Most don't have bad intentions, although some are searching for vulnerabilities (e.g. WordPress logins and plugins). These bots are difficult to identify, because they often don't introduce themselves as a bot. Cloudflare has "Bot Fight Mode", which will block bots based on algorithms from Cloudflare. Other bots with generic names like "Scrapy" can be anything, because any person can set them up.

It's really difficult to set up a perfect "block" for bots without also blocking bots that should normally be welcome (Google search, Facebook page previews etc). Personally, I use Cloudflare "Bot Fight Mode" and also have some rules in WAF > Custom rules to block certain bots (user agents). If you wanted to block ALL non-human requests, you could enforce a Cloudflare "Challenge" ... I do that for control panels and stuff that is only meant for humans.
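As a rough illustration of such a custom rule, the Python sketch below builds a Cloudflare rule expression from a list of user-agent substrings. The helper name build_ua_block_expression is hypothetical, and the bot names are simply the ones mentioned earlier in this thread. The output is meant to be pasted into the expression editor of a custom rule with the action set to Block; treat it as a starting point, since `contains` is assumed here to match case-sensitively.

```python
def build_ua_block_expression(bot_names):
    """Build a Cloudflare rule expression matching any of the given user-agent substrings.

    Hypothetical helper: it only produces text for the rule expression editor
    and does not talk to the Cloudflare API.
    """
    clauses = []
    for name in bot_names:
        # Escape backslashes and double quotes so the string literal stays valid.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'(http.user_agent contains "{escaped}")')
    return " or ".join(clauses)


if __name__ == "__main__":
    # Bot names taken from this thread; extend the list as new bots appear.
    bots = ["omgilibot", "peer39_crawler", "PerplexityBot", "Scrapy", "TurnitinBot"]
    print(build_ua_block_expression(bots))
```

Keeping the list in one place makes it easy to regenerate the expression whenever a new bot needs to be added.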
marco-urban wrote: Where can I see the bots that are scraping my website?
You can check what visitors are blocked by Cloudflare, from Security > Events. However, it's not like you will be given a neat list of names of bots that are blocked.

As for bots that are NOT blocked by Cloudflare (there will be many of course, because Cloudflare will only block the ones that seem "dangerous", unless you instruct it otherwise), you can only review visitors from your Apache/Nginx "access log". This isn't easy to read either, unless you know what you are looking for.
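For anyone who wants to try reading the access log, here is a minimal Python sketch that counts requests per user agent. It assumes the log uses the common "combined" format, where the user agent is the last quoted field on each line; the sample lines are made up for illustration, and the real log location depends on the server setup.

```python
import re
from collections import Counter

# Combined Log Format ends with: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user-agent string; lines that don't match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("ua")] += 1
    return counts


# Made-up sample lines (documentation IP addresses, illustrative user agents).
SAMPLE = [
    '203.0.113.5 - - [03/Nov/2024:22:49:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Scrapy/2.11 (+https://scrapy.org)"',
    '203.0.113.5 - - [03/Nov/2024:22:49:01 +0000] "GET /gallery/ HTTP/1.1" 200 5120 "-" "Scrapy/2.11 (+https://scrapy.org)"',
    '198.51.100.7 - - [03/Nov/2024:22:50:10 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for agent, hits in count_user_agents(SAMPLE).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file object to count_user_agents instead of SAMPLE. Keep in mind that a bot can send any user-agent string it likes, so this only shows the bots that identify themselves.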

Depending on what you are trying to achieve, there is no simple recipe.
  • You can use robots.txt, but if you start analyzing visitors, you will end up with a robots.txt that is too long, and many of the bots won't care about the limitations in your robots.txt. Only self-proclaimed "good bots" will respect instructions from your robots.txt.
  • If I were going to use robots.txt, I would probably instead set up a rule to initially disallow ALL, and then allow only the few user agents I want (like common search engine crawlers and page sharing bots).
  • So even if you use robots.txt, there are zillions of automated bots that simply don't care what you have in robots.txt, and will still crawl your pages. You can only block these by using Apache/Nginx rules or Cloudflare (I prefer to use Cloudflare, because it's better to block before they hit your server in the first place).
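The "disallow all, then allow a few" approach from the list above can be sketched and checked with Python's standard urllib.robotparser module. The user-agent tokens below (Googlebot, bingbot, facebookexternalhit) are examples chosen for illustration and are not a recommended allow-list; the helper name is_allowed is hypothetical. The check only shows what a well-behaved bot would conclude from the file, it does not stop a bot that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: a few named crawlers may fetch everything,
# every other user agent is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: bingbot
User-agent: facebookexternalhit
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, path="/", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path.

    Pass the bot's short token (e.g. "Googlebot"), not a full
    "Mozilla/5.0 ..." header, because the parser matches on the token.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ["Googlebot", "facebookexternalhit", "PerplexityBot", "Scrapy"]:
        print(agent, is_allowed(agent, "/gallery/"))
```

The advantage of this layout is that the file stays short and does not need an update every time a new scraper appears.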
marco-urban wrote: It's also important for legal aspects in Europe.
I'm not quite sure about that. How are you breaking any law by allowing visitors to view your website? That's almost like getting arrested because someone breaks into your house and you failed to stop them.
 
marco-urban
Topic Author
Posts: 12
Joined: 07 Jul 2008, 15:42

Re: Change robots.txt

04 Nov 2024, 02:27

"I'm not quite sure about that. How are you breaking any law by allowing visitors to view your website? That's almost like getting arrested because someone breaks into your house and you failed to stop them."

Thank you very much for the information, it will help me. Not every question can be answered by googling.

Regarding the last part: It is not that it is illegal to allow crawlers to visit my website, but that it is or can be illegal for the crawler or someone else to use the images for AI training if I have indicated on the website that I do not allow this, i.e. an opt-out. Both European legislation and the European AI Act provide for this in order to protect the rights of the copyright holder. However, it is not yet completely clear exactly how this is to be done. There is still a lack of case law from the courts. A court in Hamburg, for example, has ruled that a simple notice in the website's legal notice is sufficient. A reference in the robots.txt file will be better. It can therefore be useful in addition to the Cloudflare service. Which service have you booked with Cloudflare? Is the free service sufficient for your purposes?
 
mjau-mjau
X3 Wizard
Posts: 14452
Joined: 30 Sep 2006, 03:37

Re: Change robots.txt

04 Nov 2024, 04:22

marco-urban wrote:It is not that it is illegal to allow crawlers to visit my website, but that it is or can be illegal for the crawler or someone else to use the images for AI training if I have indicated on the website that I do not allow this
Yes, but then it's the crawler or the owner of the crawler that is doing something illegal, not you. You can do many things to help the police, but being the police is not your job. If you want to block bots to some degree, that would be to protect your work, content and copyrights, but not to prevent others from breaking the law. Just for some reference.
marco-urban wrote:It can therefore be useful in addition to the Cloudflare service.
Yes, of course you could use both robots.txt and Cloudflare, if you have a clear plan about what you want to achieve, and how you are going to achieve this using both.
marco-urban wrote:Which service have you booked with Cloudflare? Is the free service sufficient for your purposes?
I use the free Cloudflare plan, and yes, it is sufficient for basic blocking from the WAF (Web Application Firewall).