Discord-Dialogues is a large-scale dataset of anonymized Discord conversations for training and evaluating conversational AI models in a ChatML-friendly format.
All data was collected adhering to Discord's Terms of Service.
Metric | Value |
---|---|
Samples (count) | 10,008,676 |
Min length (tokens) | 8 |
Max length (tokens) | 5,979 |
Mean length (tokens) | 35.99 |
Median length (tokens) | 30 |
Std dev (tokens) | 21.43 |
Total tokens | 360,279,716 |
Total characters | 1,847,454,279 |
Total words | 216,225,706 |
Total Assistant Blocks | 12,775,402 |
Range | Count |
---|---|
0–8 | 0 |
8–16 | 119,858 |
16–32 | 5,268,146 |
32–64 | 3,836,602 |
64–128 | 716,410 |
128–256 | 64,263 |
256–384 | 2,738 |
384–512 | 374 |
512–768 | 188 |
768–1024 | 49 |
1024–2048 | 37 |
2048–4096 | 7 |
Turns | Count |
---|---|
2 | 7,660,871 |
3 | 1,532,180 |
4 | 496,352 |
5 | 173,547 |
6 | 75,284 |
7 | 33,418 |
8 | 16,661 |
9 | 8,485 |
10 | 4,699 |
11 | 2,695 |
12 | 1,578 |
13 | 979 |
14 | 643 |
15 | 403 |
16 | 275 |
17 | 178 |
18 | 120 |
19 | 87 |
20 | 44 |
21 | 46 |
22 | 38 |
23 | 28 |
24 | 11 |
25 | 10 |
26 | 9 |
27 | 9 |
28 | 3 |
29 | 8 |
30 | 3 |
31 | 1 |
32 | 3 |
33 | 3 |
34 | 1 |
37 | 2 |
39 | 2 |
Although filtering the full data dump reduced it significantly, this dataset is still intended as a large-scale dump. For best training results, further curation to target high-signal data relevant to your goals is recommended.
Apache License 2.0
@misc{discord-dialogues-2025, title = {Discord-Dialogues}, author = {mookiezi}, year = {2025}, url={https://huggingface.co/datasets/mookiezi/Discord-Dialogues} }