The GTM Provider Directory
Public Datasets

Common Crawl

Web crawl data with 300B+ web pages in WARC format on AWS.

Entry · The data

Raw web crawl data covering more than 300 billion pages, stored in WARC (Web ARChive) format and hosted on AWS.

Why GTM teams care

One of the largest freely available web crawl datasets, suited to large-scale analysis of web content.

Best use cases

  1. Analyze web content and structure
  2. Build web crawl datasets for research
Entry · Public dataset

Dataset details

Steward / publisher: Common Crawl Foundation
Jurisdiction: Global
License: Custom (free non-commercial)
Access method: Bulk download
Auth required: None
Record count: 300B+ pages
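
Because access is by bulk download with no authentication, a first step is usually to list the WARC files that make up one crawl. The sketch below is a minimal illustration, not an official client. It assumes the public HTTPS endpoint `https://data.commoncrawl.org/` and the per-crawl `warc.paths.gz` listing; the function names are hypothetical and the crawl ID shown in the comment is only an example value.

```python
import gzip
from urllib.request import urlopen

# Assumed public HTTPS endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id: str) -> str:
    """Return the URL of the gzipped list of WARC file paths for one crawl."""
    return f"{BASE_URL}crawl-data/{crawl_id}/warc.paths.gz"


def warc_urls(paths_gz: bytes, limit: int = 5) -> list[str]:
    """Convert a downloaded warc.paths.gz payload into full download URLs.

    Each non-empty line in the listing is a path relative to BASE_URL.
    """
    lines = gzip.decompress(paths_gz).decode("utf-8").splitlines()
    return [BASE_URL + line.strip() for line in lines if line.strip()][:limit]


def fetch_warc_urls(crawl_id: str, limit: int = 5) -> list[str]:
    """Download the listing for a crawl (network access, no auth) and expand it.

    Example crawl ID (an assumption; substitute a current one): "CC-MAIN-2024-10".
    """
    with urlopen(paths_listing_url(crawl_id)) as resp:
        return warc_urls(resp.read(), limit=limit)
```

Each returned URL points to one gzipped WARC file, typically around a gigabyte in size, so it is worth limiting the list while experimenting.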