A large-scale dataset of anonymized Discord conversations from late spring to early fall 2025 for training and evaluating conversational AI models in a ChatML-friendly format.
This dataset contains 7,303,464 exchanges spread out over 16,881,010 turns, with more than 139,922,950 words.
View on Nomic
Atlas
Use case example: mookiezi/Discord-Micae-8B-Preview
All data was collected adhering to Discord's Terms of Service.
| Metric | Value |
|---|---|
| Samples (count) | 7,303,464 |
| Total turns | 16,881,010 |
| Total assistant turns | 9,016,287 |
| Min length (tokens) | 10 |
| Max length (tokens) | 2,542 |
| Mean length (tokens) | 32.79 |
| Median length (tokens) | 28 |
| Std dev (tokens) | 16.56 |
| Skew | 6.04 |
| Kurtosis | 326.54 |
| Total tokens | 239,458,213 |
| Total characters | 1,242,238,794 |
| Total words | 139,922,950 |
| Avg chars per sample | 170.09 |
| Avg words per sample | 19.16 |
| Avg chars per word | 8.88 |
| Tokens per char | 0.19 |
| 8–16 | 107,264 |
| 16–32 | 4,278,713 |
| 32–64 | 2,566,176 |
| 64–128 | 334,829 |
| 128–256 | 15,920 |
| 256–384 | 363 |
| 384–512 | 71 |
| 512–768 | 78 |
| 768–1024 | 30 |
| 1024–2048 | 17 |
| 2048–4096 | 3 |
| 2 | 5,795,019 |
| 3 | 1,038,500 |
| 4 | 304,442 |
| 5 | 96,758 |
| 6 | 38,620 |
| 7 | 15,714 |
| 8 | 7,108 |
| 9 | 3,391 |
| 10 | 1,709 |
| 11 | 909 |
| 12 | 526 |
| 13 | 291 |
| 14 | 163 |
| 15 | 113 |
| 16 | 58 |
| 17 | 57 |
| 18 | 28 |
| 19 | 20 |
| 20 | 7 |
| 21 | 10 |
| 22 | 10 |
| 23 | 2 |
| 24 | 1 |
| 25 | 2 |
| 27 | 2 |
| 29 | 1 |
| 32 | 1 |
| 33 | 2 |