Discord-Dialogues Main Cover

Discord-Dialogues

Discord-Dialogues is a large-scale dataset of anonymized Discord conversations for training and evaluating conversational AI models in a ChatML-friendly format.


Nomic Atlas Map Preview View on Nomic Atlas

Features

Use

Collection Policy

All data was collected adhering to Discord's Terms of Service.

Dataset Statistics (using the Hermes-3-8B tokenizer)

MetricValue
Samples (count)10,008,676
Min length (tokens)8
Max length (tokens)5,979
Mean length (tokens)35.99
Median length (tokens)30
Std dev (tokens)21.43
Total tokens360,279,716
Total characters1,847,454,279
Total words216,225,706
Total Assistant Blocks12,775,402

Tokens

RangeCount
0–80
8–16119,858
16–325,268,146
32–643,836,602
64–128716,410
128–25664,263
256–3842,738
384–512374
512–768188
768–102449
1024–204837
2048–40967

Turns per Exchange

TurnsCount
27,660,871
31,532,180
4496,352
5173,547
675,284
733,418
816,661
98,485
104,699
112,695
121,578
13979
14643
15403
16275
17178
18120
1987
2044
2146
2238
2328
2411
2510
269
279
283
298
303
311
323
333
341
372
392

Disclaimer

Although filtering the full data dump reduced it significantly, this dataset is still intended as a large-scale dump. For best training results, further curation to target high-signal data relevant to your goals is recommended.

License

Apache License 2.0

How to cite

@misc{discord-dialogues-2025,
  title = {Discord-Dialogues},
  author = {mookiezi},
  year = {2025},
  url={https://huggingface.co/datasets/mookiezi/Discord-Dialogues}
}

Related