Discord-Dialogues Main Cover

Discord-Dialogues

A large-scale dataset of anonymized Discord conversations from late spring to early fall 2025 for training and evaluating conversational AI models in a ChatML-friendly format.

This dataset contains 7,303,464 exchanges spread out over 16,881,010 turns, with more than 139,922,950 words.



Nomic Atlas Map Preview View on Nomic Atlas

Features

Use

Use case example: mookiezi/Discord-Micae-8B-Preview

Collection Policy

All data was collected adhering to Discord's Terms of Service.

Dataset Statistics
(Hermes-3-Llama-3.1-8B tokenizer)

MetricValue
Samples (count)7,303,464
Total turns16,881,010
Total assistant turns9,016,287
Min length (tokens)10
Max length (tokens)2,542
Mean length (tokens)32.79
Median length (tokens)28
Std dev (tokens)16.56
Skew6.04
Kurtosis326.54
Total tokens239,458,213
Total characters1,242,238,794
Total words139,922,950
Avg chars per sample170.09
Avg words per sample19.16
Avg chars per word8.88
Tokens per char0.19

Tokens

8–16107,264
16–324,278,713
32–642,566,176
64–128334,829
128–25615,920
256–384363
384–51271
512–76878
768–102430
1024–204817
2048–40963

Turns per Exchange

25,795,019
31,038,500
4304,442
596,758
638,620
715,714
87,108
93,391
101,709
11909
12526
13291
14163
15113
1658
1757
1828
1920
207
2110
2210
232
241
252
272
291
321
332

Related