OpenDataLab
@opendatalab · Organization
OpenDataLab provides access to numerous significant open-source datasets.
On the leaderboard
| Rank | Repository | Stars |
|---|---|---|
| 300 | opendatalab/MinerU | 58,095 |
Top repositories by stars
- opendatalab/MinerU (on leaderboard)
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
Python · 54,483 stars
- opendatalab/PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
Python · 9,371 stars
- opendatalab/DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Python · 2,003 stars
- opendatalab/OmniDocBench
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
Python · 1,498 stars
- opendatalab/labelU
Data annotation toolbox that supports image, audio, and video data.
Python · 1,496 stars
- opendatalab/LabelLLM
The Open-Source Data Annotation Platform
TypeScript · 1,181 stars
- opendatalab/WanJuan1.0
WanJuan 1.0 multimodal corpus
569 stars
- Python · 547 stars
- Python · 523 stars
- opendatalab/UniMERNet
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
Python · 455 stars
- opendatalab/MinerU-HTML
MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
HTML · 208 stars
- opendatalab/Meta-rater
[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"
Python · 189 stars
- opendatalab/LOKI
[ICLR 2025 Spotlight] The official implementation of the paper “LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models”
Python · 175 stars
- opendatalab/labelU-Kit
Data annotation component library, provided as NPM packages
TypeScript · 146 stars
- opendatalab/opendatalab-datasets
datasets resource
130 stars
- Python · 120 stars
- opendatalab/FakeVLM
[NeurIPS 2025 🔥] FakeVLM: Advancing Synthetic Image Detection through Explainable Multimodal Models and Fine-Grained Artifact Analysis
Python · 118 stars
- opendatalab/VHM
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis
Python · 110 stars
- opendatalab/mineru-vl-utils
A Python package for interacting with the MinerU Vision-Language Model.
Python · 103 stars
- opendatalab/HA-DPO
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Python · 100 stars
- opendatalab/VIGC
AAAI 2024: Visual Instruction Generation and Correction
Python · 96 stars
- opendatalab/Earth-Agent
[ICLR 2026] The official implementation of the paper “Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents”
Python · 95 stars
- opendatalab/OHR-Bench
(ICCV 2025) OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Python · 94 stars
- opendatalab/MLS-BRN
[CVPR 2024] 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
Python · 88 stars
- opendatalab/skydiffusion
[ICCV 2025] The official implementation of the paper “Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm”
Python · 82 stars
- opendatalab/Vis3
Data browser based on S3: a visualization tool for data stored on S3 (json / jsonl / parquet / html / md, etc.).
TypeScript · 79 stars
- opendatalab/LEGION
[ICCV25 Highlight] The official implementation of the paper "LEGION: Learning to Ground and Explain for Synthetic Image Detection"
Python · 74 stars
- opendatalab/CLIP-Parrot-Bias
[ECCV 2024] Parrot Captions Teach CLIP to Spot Text
Python · 66 stars
- opendatalab/opendatalab-python-sdk
SDK of OpenDataLab - https://opendatalab.org.cn
Python · 59 stars
- opendatalab/MLLM-DataEngine
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Python · 48 stars
- opendatalab/dsdl-docs
Data Set Description Language Specification (DSDL, a next-generation dataset description language for artificial intelligence)
HTML · 47 stars
- opendatalab/CHARM
[ACL 2024 Main Conference] Chinese commonsense benchmark for LLMs
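Python · 44 stars
- opendatalab/WanJuan3.0
WanJuan3.0 ("WanJuan Silk Road") is a comprehensive plain-text corpus that collects publicly available web content, literature, patents, and other materials from multiple countries and regions. Its total data size exceeds 1.2 TB and its total token count exceeds 300B, which the description characterizes as an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each exceeding 150 GB.
43 stars
- opendatalab/ProverGen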
[ICLR 2025] This is the official implementation for the paper: "Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation"
Python · 42 stars
- opendatalab/UrBench
[AAAI 2025] This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”
Python · 36 stars
- Python · 33 stars
- Python · 29 stars
- opendatalab/TRivia
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Python · 25 stars
- TypeScript · 25 stars
- opendatalab/Miner-PDF-Benchmark
MPB (Miner-PDF-Benchmark) is an end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios.
Python · 24 stars
- opendatalab/CrossViewDiff
The official implementation of the paper "CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis"
JavaScript · 16 stars
- Python · 14 stars
- opendatalab/WanJuan2.0-WanJuan-CC
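WanJuan-CC is a high-quality dataset derived from CommonCrawl through data extraction, rule-based cleaning, deduplication, safety filtering, and quality cleaning.
14 stars
- Jupyter Notebook · 13 stars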
- opendatalab/WebMainBench
WebMainBench is a specialized benchmark tool for end-to-end evaluation of web main content extraction quality.
Python · 12 stars
- Python · 10 stars
- opendatalab/labelU-frontend
LabelU front-end library
TypeScript · 9 stars
- opendatalab/allz
A universal command line tool for compression and decompression
Python · 6 stars
- opendatalab/awesome-mineru
🕶️ A curated list of awesome things related to MinerU
Python · 4 stars
- opendatalab/CRaFT
[AAAI25] Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
Python · 4 stars
- opendatalab/GRAIT
[NAACL25 findings] Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Python · 3 stars
- Python · 1 star
- opendatalab/rdkit
A forked repo of the official RDKit library
HTML · 0 stars