opendatalab/MinerU-HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

Stars: 209
Forks: 23
Open issues: 2
Watchers: 209
Size: 0.1 MB
Language: HTML
License: Apache License 2.0
Topics: article-extractor, corpus-tools, nlp, rag, scraping, text-extraction, trafilatura, web-scraping, webagent
Created: Nov 26, 2025
Updated: Feb 23, 2026
Last push: Dec 25, 2025