apache/tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Stars: 3,583
Forks: 912
Open issues: 63
Watchers: 3,583
Size: 420.2 MB
Language: Java
License: Apache License 2.0
Topics: content, extraction, java, metadata, tika
Created: May 21, 2009
Updated: Feb 27, 2026
Last push: Feb 27, 2026