Datasetting for AI Modules

Dataset Cleaning Guidelines
Preparation of a clean dataset is the most important factor in the creation of a proper AI Module. Simply scraping a wiki or throwing together some PDF conversions is vastly insufficient and will at best result in suboptimal outputs riddled with leaked symbols, odd spacing, and circular repetition.

Luckily, cleaning your dataset is not difficult, only time-consuming. Assuming you have the patience and follow these guidelines, you too can create high-quality AI Modules.

General Overview
Dataset files should be plain text in UTF-8 encoded format with no tags, Markdown, or HTML, focusing instead on standard-formatted English prose.

One paragraph per newline. That is, no paragraph should be split across multiple lines. To visualize this, it helps to enable Word Wrapping in whatever text editor you are using (in Notepad++ this is done by checking View > Word Wrap).

No empty newlines between paragraphs (i.e. double spacing). If desired, you could leave a single empty newline between chapters as a chapter break, but it is instead recommended that you leave only  on its own newline between chapters (replacing the chapter title).

Do not leave any leading/trailing spaces, tabs, or other whitespace. This includes checking for any spaces after the end of a sentence before a newline.

Use regular single and double quotation characters (' and ") not the fancy ones (‘ ’ and “ ”). The AI is trained using the former and thus will not use the latter properly.
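The formatting rules above (no leading/trailing whitespace, no empty lines, plain quotation characters) can be sketched in a few lines of Python. This is only a minimal illustration, not one of the official tools; the function name `normalize_text` is made up for this example. Note that it does not rejoin paragraphs that were split across several lines, and it drops every empty line, including any you left on purpose as a chapter break.

```python
# Map "fancy" typographic quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def normalize_text(raw: str) -> str:
    """Apply the basic formatting rules to the contents of a dataset file.

    Hypothetical helper for illustration only.
    """
    text = raw.translate(QUOTE_MAP)
    cleaned = []
    for line in text.splitlines():
        line = line.strip()      # drop leading/trailing spaces and tabs
        if line:                 # drop empty lines between paragraphs
            cleaned.append(line)
    return "\n".join(cleaned)
```

When reading and writing the file, pass `encoding="utf-8"` to `open()` so the result stays UTF-8 regardless of your system's default.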

Focus on one specific subject matter and make sure all included material serves what you expect from the module's output. This takes nuance, not stuffing as much as you can into your dataset. Keep in mind this can be tricky: for example, some Steampunk novels don't actually talk all that much about the type of content one would associate with the Steampunk genre, and thus in practice they are not very effective for training a "Steampunk" module.

A little data goes a long way. 1MB to 5MB still provides excellent results for an authorial or thematic style, assuming you provide it enough training steps. On this note, feel free to experiment with short data in general. There is nothing stopping you from turning a short prompt into a module, and this would also require far fewer training steps (perhaps within the range of 50-100 for a typical scenario prompt).

If you want to avoid the same characters appearing constantly or other forms of overfitting, try to keep the data balanced with a variety of names, phrases, and terms by including stories featuring different characters or locations.
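One rough way to check this balance is to count which capitalized words show up most often in the middle of sentences, since those are usually names and places. The sketch below is an assumption-laden heuristic, not a method used by the finetuning team; `top_capitalized_words` is a made-up name, and the pattern will miss names that start a sentence or follow a quotation mark.

```python
import re
from collections import Counter

# A capitalized word preceded by a lowercase letter or comma plus one space,
# i.e. a capitalized word that is not at the start of a sentence.
MID_SENTENCE_CAPITAL = re.compile(r"(?<=[a-z,] )[A-Z][a-z]+")

def top_capitalized_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent mid-sentence capitalized words.

    Hypothetical helper: a rough proxy for character and place names.
    """
    return Counter(MID_SENTENCE_CAPITAL.findall(text)).most_common(n)
```

If one or two names dominate the list by a wide margin, that is a hint the module may keep producing those characters.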

Do not expect a module to memorize relational/factual data. For instance, if you feed it a story with Pokémon descriptions, it will reference those same Pokémon, but it can and often will randomly mix up their types, moves, and other similar data.

The form or format in which the data exists plays a significant role in the form it will generate as AI output. For this reason, avoid wikis or other encyclopedic data unless you specifically want to generate encyclopedia entries (which could be useful for utility modules as content generators).

Cleaning Headings & Auxiliary Sections
Before anything else, it is important to note that discretion is key here. You typically want to be as minimally destructive to your data as possible, considering how easy it is to completely ruin your data with a single misguided Find and Replace. Unless you want to see them leaked in the AI's output, make sure to remove Fore/Afterwords, Acknowledgements, Author's Notes, About the Authors, and any other sections that have nothing to do with the story or data. This also includes any author commentary or excerpts unless it is diegetic to (takes place within the narrative of) the story.

Chapter titles can be replaced with  alone on a newline to signify breaks between chapters or   for breaks between individual short stories as these are conventions of the base training data and the AI is used to working with them.

In general it is a good idea to remove, replace, or trim down anything too repetitive (such as numbered chapters, titles that use the same prefixes, or stylistic phrases like  repeated twenty times in a row) as this will increase the chance of these leaking into your output. That said, if you find yourself removing too much from a work it might just be a better idea to exclude it from your dataset altogether.

Cleaning Prose
Keep an eye out for odd symbols, characters, or other unusual formatting such as odd card suits or other symbols used as chapter breaks or Japanese quotations (  and  ) which are often found in visual novels (and can be replaced with regular quotations). Sometimes even the replacement character can be found; this is usually a sign that something's gone badly wrong with the file conversion.
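A quick way to spot such characters is to list every non-ASCII character in a file along with how often it occurs. The following is a minimal sketch; `report_unusual_characters` is a made-up name, and the ASCII cutoff is an assumption chosen for simplicity. Legitimate characters (such as the é in Pokémon) will also appear in the report, so treat it as a list of things to look at, not things to delete.

```python
import unicodedata
from collections import Counter

def report_unusual_characters(text: str) -> list[tuple[str, str, int]]:
    """List non-ASCII characters as (character, Unicode name, count).

    Hypothetical helper; results are ordered from most to least frequent.
    """
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]
```

A high count of U+FFFD (REPLACEMENT CHARACTER) usually means the file should be converted again from the source instead of being patched by hand.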

On this note, when a book is scanned to a digital copy, certain formatting will often produce errors or fail to scan correctly. You can occasionally see examples of this with underscores, such as OCR software reading  as  , or with things like telepathic dialogue, which is typically italicized. Since raw text has no characters for italics, such passages will appear as __dialogue encapsulated by underscores like this__ and need to be replaced with quotations or angled brackets (< and >), as the AI knows to associate them with non-verbal dialogue.
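The underscore-to-bracket replacement can be sketched with a single regular expression. This is an illustrative example only; `underscores_to_brackets` is a made-up name, and the pattern assumes each underscore-wrapped span sits on one line. Review every change, since two unrelated underscores on the same line would be treated as a pair.

```python
import re

# One or more underscores, the wrapped text, then one or more underscores.
UNDERSCORE_SPAN = re.compile(r"_+([^_\n]+?)_+")

def underscores_to_brackets(line: str) -> str:
    """Rewrite _text_ or __text__ as <text> (hypothetical helper)."""
    return UNDERSCORE_SPAN.sub(r"<\1>", line)
```

To turn the spans into quoted dialogue instead, swap the replacement string for `r'"\1"'`.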

Another common scanning issue is having chapters start in all capital letters (e.g. ) or with the first letter of the first word separated from the rest of the word (e.g.  ).

Extra spacing is yet another common issue, such as with possessive indicators (e.g. ) or at the end of sentences before a newline. On that note, extra newlines aren't good either, as the AI tends to associate them with chapter breaks or a passage of time.

Also be on the lookout for vertical bars, which can be replaced with colons. Lastly, if there are any square brackets ([ and ]) in an unedited file, you might want to remove them unless they are encapsulating something such as an indication of time, location, point-of-view, or some other note you intend to use to nudge the AI in your story.
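Both checks are easy to script. The sketch below replaces vertical bars with colons and lists bracketed spans so you can decide which to keep; the function names are made up for this example, and collapsing the spaces around each bar is an assumption about how you want the result to read.

```python
import re

# A square-bracketed span that stays on one line.
BRACKETED = re.compile(r"\[[^\]\n]*\]")

def bars_to_colons(line: str) -> str:
    """Replace each vertical bar (and the spaces around it) with ': '."""
    return re.sub(r" *\| *", ": ", line).rstrip()

def find_bracketed(text: str) -> list[tuple[int, str]]:
    """Return (line number, bracketed span) pairs for manual review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BRACKETED.finditer(line):
            hits.append((lineno, match.group()))
    return hits
```

`find_bracketed` only reports; whether a span is a useful note or leftover clutter is a call you have to make yourself.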

Miscellaneous Tips

 * Don't worry about and don't include the well-known  token. It is not needed for training AI modules as the training process handles that already. You don't need a   token either as contrary to popular belief, this is not an actual token used by GPT language models.
 * Don't worry about your clean-up or tokens being perfect. AI Modules are not as powerful as the actual underlying fine-tuned model and thus don't have as strong an effect as the existing model's training. If your resulting module produces at least fun or interesting results, that's still very much a success.
 * Make use of good feature-rich text editors such as Notepad++, or at least something that supports regular expressions, which greatly cuts down on editing time.

Recommended Dataset Tools
Gnurro's ReFormatter: A powerful set of tools for data cleaning with an accessible graphical user interface. Requires Python to run. Detailed operation information can be found on its own Wiki page.

Belverk's Cleaning Python Script: The official script used by the finetuning team. A "set it and forget it" Python script that automatically cleans many common issues in data files. Note that while there is no graphical user interface or indication of progress while running, it is incredibly thorough while being minimally destructive. It is recommended that users manually check through each file afterwards, as it does not catch everything.

Zermelane's Dumb Reformatter: A stupidly simple but quite convenient tool for reformatting text for module training in your browser. No downloads, Python, or script running necessary. This tool has some functionality overlap with the above two, but is more destructive than the official script. Use with caution, and note that your data will still require manual tweaking (such as removing Epilogues and Author's Notes) afterwards.

ScrapeFandom: A Python script which scrapes English Fandom sites for updated Wiki dumps. Using Wiki data is not recommended unless heavy cleaning is done and/or the resulting module is intended for utility use such as with content generators.

Notepad++: A free and highly extensible text editor that features custom macros and (somewhat limited) Regex Find and Replace.

Regex101: A website focused on the quick creation, testing, and learning of Regular Expressions. If you make an account you can also save and share your regex with others.

Useful Regular Expressions
A regular expression, or regex, is a sequence of characters that can be used to quickly find and replace more complex patterns of text. The following are some useful regular expressions for cleaning datasets. Keep in mind these are potentially highly destructive, and you should take great care when using them. Batch replacement (i.e. "Replace All") with these is not recommended.

/ / Selects headers, titles, and anything else that does not end in punctuation before a new line.
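As an independent illustration of this idea (not the guide's own expression), the following Python sketch lists lines that do not end in sentence punctuation or a closing quote or bracket. The name `find_unpunctuated_lines` and the exact set of accepted ending characters are assumptions made for this example.

```python
import re

# Characters that count as a normal end of a prose paragraph.
TERMINAL = re.compile(r"""[.!?"')\]]$""")

def find_unpunctuated_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) for non-empty lines lacking end punctuation.

    Hypothetical helper: the hits are candidates for headers and titles.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped and not TERMINAL.search(stripped):
            hits.append((lineno, stripped))
    return hits
```

Lines that are deliberate chapter-break markers will show up here too, so this is a review list and not something to delete in bulk.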

/ / Selects sequences of text in all uppercase capital letters. You can optionally replace the selected text with normal lower case (preserving the first uppercase letter) using this in the Replace field / / though keep in mind character names and other proper nouns may lose their first uppercase letter this way.
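A comparable find-and-replace can be sketched in Python. This is an independent example under stated assumptions, not the guide's original expression: the pattern treats a run as two or more capital letters followed by further all-caps words, and `decapitalize_runs` is a made-up name. As warned above, proper nouns and acronyms inside a run lose their capitals.

```python
import re

# An all-caps word of 2+ letters, optionally followed by more all-caps words.
ALL_CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:[ ,'-]+[A-Z]+)*\b")

def decapitalize_runs(line: str) -> str:
    """Lowercase each all-caps run, keeping only its first letter uppercase."""
    return ALL_CAPS_RUN.sub(lambda m: m.group().capitalize(), line)
```

Because a run must begin with at least two capital letters, a line opening with a lone "I" or "A" is only partly converted and needs a manual touch-up.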

/ / Selects various problems involving quotations.

/ / Selects cases where the first letter of a word is separated by a space (common scanning error).
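Here is one way such a check might look in Python; it is a sketch written for this example, not the guide's expression. It only inspects the start of each line and deliberately skips "A" and "I", because those are real one-letter words, so cases like a split "After" will be missed. `find_split_initials` is a made-up name.

```python
import re

# A single capital (not A or I), one space, then lowercase letters,
# anchored at the start of a line.
SPLIT_INITIAL = re.compile(r"^([B-HJ-Z]) ([a-z]+)")

def find_split_initials(text: str) -> list[tuple[int, str]]:
    """Return (line number, suggested joined word) for likely split initials."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SPLIT_INITIAL.match(line)
        if match:
            hits.append((lineno, match.group(1) + match.group(2)))
    return hits
```

The function only suggests the joined word; apply each fix by hand after checking that the result is an actual word.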

/ / Catches many of the auxiliary sections of a work.
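A rough Python equivalent, written for this example and not taken from the guide, might look for lines that begin with the section names listed under Cleaning Headings & Auxiliary Sections. The list of titles and the name `find_auxiliary_headings` are assumptions; extend the pattern to suit your sources.

```python
import re

# Section titles that usually mark non-story material.
AUXILIARY = re.compile(
    r"^\s*(foreword|afterword|acknowledge?ments?|author's notes?|about the authors?)\b",
    re.IGNORECASE,
)

def find_auxiliary_headings(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) for lines that start an auxiliary section."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if AUXILIARY.match(line):
            hits.append((lineno, line.strip()))
    return hits
```

This finds where a section starts, not where it ends, so you still need to select and remove the section body yourself. Titles written with a curly apostrophe are only matched after the quotes have been normalized.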

Common Text Containers and Affixes

 * [text]: Metadata
 * <text>: Telepathy and other nonverbal communications. (Weak in v3)
 * ─ text: LitRPG data blocks (Not used in v3). This is not an Em-Dash, but the box-drawing character U+2500. This can include things like character or equipment stats, charged spells, item names, item rarity, etc.
 * > text: Occasionally used for computer output, etc. Currently overlaps with the text adventure mode and may be replaced by another character in the future.

Final Thoughts
Dataset cleaning may seem intimidating at first, but once you familiarize yourself with the tools and resources it's quite easy to get into the groove of things and prep a dataset in an afternoon. Take your time finding good source material, too, as no matter how well you clean something, if its inherent quality is already subpar, then it won't make much difference.

Lastly, if you have any follow-up questions, feel free to reach out to NAI's finetune team, including Finetuneanon, Belverk, Zaltys, Gnurro, Rinter, and Lion (who wrote this guide). Special thanks to all of them!
