Robots.txt

Robots.txt is a plain text file used by websites to guide web crawlers regarding which parts of a site may be crawled or indexed. It is part of the Robots Exclusion Protocol and is not a security mechanism; it relies on voluntary compliance by crawlers and should not be relied on to protect sensitive data.

The file is placed at the root of a domain, for example https://example.com/robots.txt, and most crawlers fetch it before indexing. A robots.txt file contains one or more records. Each record begins with one or more User-agent lines that identify the crawler or group of crawlers to which the directives apply, followed by directives such as Disallow and Allow that specify URL-path prefixes. A Disallow directive asks crawlers not to crawl the given path, while an empty Disallow value or the absence of a Disallow line indicates that crawling is allowed. The Allow directive overrides a broader Disallow for particular subpaths, and multiple records allow different rules for different crawlers.
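
As an illustration, the records below form a minimal, hypothetical robots.txt; the crawler name ExampleBot and the paths /private/ and /private/stats/ are placeholders rather than values from any real site:

    User-agent: ExampleBot
    Disallow: /private/
    Allow: /private/stats/

    User-agent: *
    Disallow:

In this sketch, ExampleBot is asked to stay out of /private/ except for the /private/stats/ subpath, while the empty Disallow value in the second record leaves the whole site open to every other crawler.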

Other commonly used directives include Crawl-delay, which requests a delay between requests for some bots, though support varies by crawler, and Sitemap, which points crawlers to a site's sitemap. The Host directive is supported by a few crawlers to indicate a preferred domain.
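
A short sketch of how these directives typically appear follows; the ten-second delay, the sitemap URL, and the host value are illustrative, and as noted above not every crawler honors them:

    User-agent: *
    Crawl-delay: 10

    Sitemap: https://example.com/sitemap.xml
    Host: example.com

Crawl-delay sits inside a record and applies to the user agents that record matches, while Sitemap and Host are usually read independently of any particular record.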

Limitations and best practices: robots.txt only expresses preferences and does not prevent access, so sensitive data should be protected via server-side access controls. Keep the file up to date, test rules with a robots.txt tester, and avoid blocking resources needed for proper rendering or indexing.
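
One way to test rules is with a parser such as Python's urllib.robotparser. The sketch below checks a couple of hypothetical URLs against the example.com file; the URLs and user-agent strings are illustrative assumptions, not part of any real configuration:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (assumes the file is reachable).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Ask whether a given user agent may fetch specific URLs.
    print(rp.can_fetch("ExampleBot", "https://example.com/private/page.html"))
    print(rp.can_fetch("*", "https://example.com/index.html"))

    # Report Crawl-delay and Sitemap values if the file declares them.
    print(rp.crawl_delay("*"))
    print(rp.site_maps())

Note that can_fetch only reports what the file requests; it does nothing to stop a client that ignores the rules, which is why server-side access controls remain necessary for anything sensitive.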