A dataset of 362 thousand anonymized Discord conversations from late spring to late summer 2025 for training and evaluating conversational AI models in a ChatML-friendly format.
Features
- Real users only (no bots); links, embeds, and commands removed
- Filtered for ToS violations and unsafe content
- Casual Discord tone
- Two-author chains only
- Merged self-replies from the same author into a single message
- Cleaned and deduplicated for relevance
- Primarily English, with some other languages present
Use
- Fine-tuning conversational models
- Training relevance/reward models
- Dialogue generation research
Dataset
- STX: 260,670 single-turn prompt/response pairs
- Chains: 101,480 multi-turn conversations (2 authors)
High-level totals
- Total tokens: 22.4 M
- Total characters: 107 M
- Total words: 15.0 M
- Assistant blocks: 480 k
Length Distribution (tokens)
31–38
39–46
47–54
55–62
63–70
71–78
79–86
87–94
95–102
103–110
License
Apache License 2.0
Related
All data collected following Discord's Terms of Service.