Google open-sources its robots.txt parser in a push to make the protocol an industry standard for search engine crawlers

via: cnBeta.COM     time: 2019/7/2 14:31:43

(screenshot via VentureBeat)

Under the Robots Exclusion Protocol (REP), better known as robots.txt, a crawler such as Googlebot reads a site's robots.txt file before indexing to determine which parts it should ignore. If no such file exists in the root directory, the search engine crawls and indexes the entire site by default.
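The check described above can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration: the rules, the user agent string, and the `example.com` URLs are made up, and the standard-library parser is not Google's parser, so its matching may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page.html"))
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/public/page.html"))
    # An empty rule set places no restrictions on the crawler.
    print(is_allowed("", "Googlebot", "https://example.com/anything"))
```

With these sample rules, only the URL under `/private/` is refused; the other two are allowed.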

It is worth mentioning that this file is used not only to give crawlers direct instructions; it can also be stuffed with keywords for "search engine optimization" (SEO). In addition, not all crawlers strictly follow the robots.txt file.

For example, a few years ago the Internet Archive chose to stop honoring robots.txt for its "Wayback Machine" archiving tool, and some malicious crawlers also deliberately ignore REP.

It should be noted, however, that even though REP has become the de facto standard, it has never actually become a true Internet standard as defined by the Internet Engineering Task Force (IETF), a non-profit open standards organization.

To push that change through, Google is now taking action. The search giant said that the current REP is open to interpretation and does not always cover edge cases.

In addition, Google wants to define the currently "undefined scenarios" more fully. For example, how should a crawler treat a robots.txt file whose content is already known from an earlier scan but which has become inaccessible because of a server failure? And how should it treat a rule that contains a typo?

Google wrote in a blog post: "This is a challenging problem for website owners because the ambiguous de facto standard made it difficult to write the rules correctly.

We wanted to help website owners and developers create amazing experiences on the Internet instead of worrying about how to control crawlers."

So Google worked with REP's original author Martijn Koster, webmasters, and other search engines to submit a proposal to the IETF on how to apply REP on the modern web.

The company has not yet published the draft in full, but it has offered some guidance. For example, robots.txt can be used with any URI-based transfer protocol: it is no longer limited to HTTP and also applies to FTP or CoAP.
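As a rough illustration of what protocol independence means in practice, the sketch below derives the robots.txt location from any URI that has a scheme and a host. The helper name `robots_txt_url` and the sample URLs are hypothetical, and the sketch assumes the file always sits at the top-level path `/robots.txt`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(resource_url: str) -> str:
    """Build the robots.txt URL for the host that serves resource_url.

    Assumption: the file lives at /robots.txt on the same scheme and host,
    whatever the scheme is (http, https, ftp, coap, ...).
    """
    parts = urlsplit(resource_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URI with a host: {resource_url!r}")
    # Keep scheme and authority; drop the original path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in ("https://example.com/a/b?x=1", "ftp://example.com/pub/file.txt"):
        print(robots_txt_url(url))
```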

According to Google, crawler developers must parse at least the first 500 KB of a robots.txt file. Defining a maximum file size ensures that connections are not held open for too long, which reduces unnecessary load on servers.
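A crawler could honor such a size limit roughly as follows. This is a minimal sketch, not Google's implementation: the function name is hypothetical, and "500 KB" is assumed here to mean 500 * 1024 bytes.

```python
# Assumption: "500 KB" is read as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def robots_body_to_parse(raw: bytes, limit: int = MAX_ROBOTS_BYTES) -> str:
    """Return the part of a fetched robots.txt body that will be parsed.

    Anything beyond `limit` bytes is ignored. Undecodable bytes (for example
    a multi-byte character cut in half by the truncation) are replaced rather
    than raising an error.
    """
    truncated = raw[:limit]
    return truncated.decode("utf-8", errors="replace")


if __name__ == "__main__":
    oversized = b"Disallow: /x\n" * 100_000  # about 1.3 MB of rules
    print(len(robots_body_to_parse(oversized)))
```

A real crawler would more likely stop reading from the connection once the limit is reached instead of downloading the whole body and slicing it afterwards.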

In addition, the new maximum cache time is set to 24 hours (or to the value of a cache directive, if one is available). This gives site owners the flexibility to update their robots.txt whenever they want, without crawlers overloading the site with robots.txt requests.

For example, in the case of HTTP, the Cache-Control header can be used to determine the cache time. In addition, when a previously accessible robots.txt file becomes unavailable due to a server failure, pages known to be disallowed are not crawled for a reasonably long period of time.
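The caching rule can be sketched as below: use the `max-age` value from the Cache-Control header when the server sends one, and otherwise fall back to 24 hours. The function name is hypothetical, and the handling of other directives is deliberately left out; the draft itself may specify more than this.

```python
from typing import Optional

DEFAULT_MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour default mentioned above


def robots_cache_lifetime(cache_control: Optional[str]) -> int:
    """Return how many seconds a fetched robots.txt may be cached.

    Assumption: only the max-age directive is considered; every other
    directive, and a missing or malformed header, yields the default.
    """
    if cache_control:
        for directive in cache_control.split(","):
            name, _, value = directive.strip().partition("=")
            value = value.strip()
            if name.strip().lower() == "max-age" and value.isdigit():
                return int(value)
    return DEFAULT_MAX_AGE_SECONDS


if __name__ == "__main__":
    print(robots_cache_lifetime("public, max-age=3600"))
    print(robots_cache_lifetime(None))
```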

The point to note here is that different crawlers can parse the instructions in a robots.txt file in different ways, which can confuse site owners.

To address this, Google has published on GitHub the C++ library that Googlebot uses to parse and match robots.txt rules, so that anyone can access it.

According to the release notes on GitHub, Google wants developers to build their own parsers that better reflect how Google parses and matches robots.txt files.
