Discord-OpenMicae

Name: Discord-OpenMicae
Creator: mookiezi
License: https://www.apache.org/licenses/LICENSE-2.0

A dataset of 362 thousand anonymized Discord conversations from late spring to late summer 2025 for training and evaluating conversational AI models in a ChatML-friendly format.

View on Nomic Atlas

Features

Real users only (no bots); links, embeds, and commands removed
Filtered for ToS violations and unsafe content
Casual Discord tone
Two-author chains only
Merged self-replies from the same author into a single message
Cleaned and deduplicated for relevance
Primarily English, with some other languages present

Use

Fine-tuning conversational models
Training relevance/reward models
Dialogue generation research

Dataset

STX: 260,670 single-turn prompt/response pairs
Chains: 101,480 multi-turn conversations (2 authors)

High-level totals

Total tokens: 22.4 M
Total characters: 107 M
Total words: 15.0 M
Assistant blocks: 480 k

Length Distribution (tokens)

31–38

39–46

47–54

55–62

63–70

71–78

79–86

87–94

95–102

103–110

License

Apache License 2.0

All data collected following Discord's Terms of Service.