The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Features
- deeply and thoroughly harvests website content
- works on any Java platform (Linux recommended)
- stores content to ARC or ISO WARC aggregate/transcript format
- web interface for operator control and monitoring of crawls
License
Apache License V2.0, GNU Library or Lesser General Public License version 2.0 (LGPLv2)
User Reviews
- Cool
- Cool.
- Useful project. Thanks
- Great software, thank you.
- The app works well in my PC. Serves its purpose too, so no regrets for me.