Datasetting for AI Modules

Dataset Cleaning Guidelines
Preparing a clean dataset is the most important factor in creating a proper AI Module. Simply scraping a wiki or throwing together some PDF conversions is vastly insufficient and will at best result in suboptimal outputs riddled with leaked symbols, odd spacing, and circular repetition.

Luckily, cleaning your dataset is not difficult, only time-consuming. Assuming you have the patience and follow these guidelines, you too can create high-quality AI modules.

General Overview
Dataset files should be plain text in UTF-8 encoding with no tags, Markdown, or HTML, and should consist of standard formatted English prose.
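As a minimal sketch of the encoding requirement, the function below decodes raw bytes as UTF-8 and drops a byte order mark if one is present. The fallback to Windows-1252 is an assumption of this sketch (a common encoding for older text files), and `to_utf8_text` is a hypothetical helper name.

```python
def to_utf8_text(raw: bytes) -> str:
    """Decode raw file bytes into text, preferring UTF-8."""
    try:
        # "utf-8-sig" behaves like UTF-8 but also strips a leading BOM.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # Undecodable bytes become U+FFFD so they are easy to search for.
        return raw.decode("cp1252", errors="replace")
```

To convert a file, read it in binary mode, pass the bytes through `to_utf8_text`, and write the result back with `.encode("utf-8")`. Search the output for the replacement character (U+FFFD) afterwards, since it marks bytes that could not be decoded.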

One paragraph per line. That is, no paragraph should be split across multiple lines. To visualize this, it helps to turn on Word Wrapping in whatever text editor you are using (in Notepad++ this is done by checking View > Word Wrap).
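Split paragraphs can be hard to spot by eye in a long file. The sketch below only reports suspects and changes nothing. Its heuristic is an assumption: a line that does not end in closing punctuation and is followed by a line starting with a lowercase letter is probably the first half of a split paragraph.

```python
# Characters that normally close a paragraph (an assumed, non-exhaustive set).
PARAGRAPH_ENDINGS = (".", "!", "?", ":", ";", '"', "'", ")", "]", "*")

def find_split_paragraphs(lines):
    """Return 1-based numbers of lines that look cut off mid-paragraph."""
    suspects = []
    for index in range(len(lines) - 1):
        current = lines[index].rstrip()
        following = lines[index + 1].lstrip()
        if not current or not following:
            continue  # blank lines are a separate problem
        if not current.endswith(PARAGRAPH_ENDINGS) and following[0].islower():
            suspects.append(index + 1)
    return suspects
```

Review each reported line by hand before joining it with the next one, since headings and verse can trigger false positives.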

No empty newlines between paragraphs (i.e. double spacing). If desired, you could leave a single empty newline between chapters as a chapter break, but it is instead recommended that you leave only *** on its own newline between chapters (replacing the chapter title).

Do not leave any leading or trailing spaces, tabs, or other whitespace. This includes checking for any spaces after the end of a sentence before a newline.
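The whitespace rules above (no blank lines between paragraphs, no leading or trailing whitespace) can be sketched as a single pass over the text. `tidy_whitespace` is a hypothetical helper name. Note that it removes every empty line, including any you left deliberately as a chapter break.

```python
def tidy_whitespace(text: str) -> str:
    """Strip each line and drop lines that end up empty."""
    # str.strip() removes spaces, tabs, and other Unicode whitespace
    # from both ends of the line.
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)
```

For example, `tidy_whitespace("  First. \n\n\tSecond.\t\n")` yields two lines, `First.` and `Second.`, with nothing between or around them.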

Use regular single and double quotation characters (' and ") not the fancy ones (‘ ’ and “ ”). The AI is trained using the former and thus will not use the latter properly.
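A minimal sketch of the quotation fix, assuming the "fancy" characters are the four standard curly quotes (U+2018, U+2019, U+201C, U+201D). Other typographic quote variants would need their own entries in the table.

```python
# Map curly quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def straighten_quotes(text: str) -> str:
    """Replace curly quotation marks with straight ones."""
    return text.translate(QUOTE_MAP)
```

Because this is a one-to-one character mapping, it is one of the few replacements that is generally safe to run over a whole file at once.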

Focus on one specific subject and make sure all included material serves what you expect the module to output. This calls for nuance rather than stuffing as much as you can into your dataset. Keep in mind this can be tricky: some Steampunk novels, for example, contain little of the content one associates with the Steampunk genre, so in practice they are not very effective for training a "Steampunk" module.

A little data goes a long way. 1MB to 5MB still provides excellent results for an authorial or thematic style, assuming you give it enough training steps. On this note, feel free to experiment with short data in general. There is nothing stopping you from turning a short prompt into a module, which would also require far fewer training steps (perhaps in the range of 50-100 for a typical scenario prompt).

If you want to avoid the same characters appearing constantly or other forms of overfitting, try to keep the data balanced with a variety of names, phrases, and terms by including stories featuring different characters or locations.
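One rough way to check the balance described above is to count which names dominate the text. The sketch below is a heuristic, not a real name detector: it assumes a capitalized word appearing mid-sentence is a proper noun, so it will miss names that only ever start a sentence.

```python
import re
from collections import Counter

# A capitalized word preceded by a lowercase letter, comma, or semicolon
# plus a space is mid-sentence, so it is probably a proper noun.
MID_SENTENCE_CAPITAL = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")

def top_proper_nouns(text, limit=20):
    """Return the most frequent mid-sentence capitalized words with counts."""
    return Counter(MID_SENTENCE_CAPITAL.findall(text)).most_common(limit)
```

If one or two names tower over the rest of the list, the module may keep producing those characters, and adding stories with different casts is worth considering.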

Do not expect a module to memorize relational or factual data. For instance, if you feed it a story with Pokémon descriptions, it will reference those same Pokémon but may still get their types, moves, or other included details wrong.

The form or format in which the data exists plays a significant role in the form it will generate as AI output. For this reason, avoid wikis or other encyclopedic data unless you specifically want to generate encyclopedia entries (which could be useful for utility modules as content generators).

Cleaning Headings & Auxiliary Sections
Before anything else, it is important to note that discretion is key here: you typically want to be as non-destructive to your data as possible, considering how easy it is to completely ruin it with a single misguided Find and Replace. However, unless you want to see them leaked in the AI's output, make sure to remove Forewords, Afterwords, Acknowledgements, Author's Notes, About the Authors, and any other sections that have nothing to do with the story or data. This also includes any author commentary or excerpts *unless* they are diegetic to (take place within the narrative of) the story.

Chapter titles can be replaced with *** alone on a newline to signify breaks between chapters, or ⁂ for breaks between individual short stories, as these are conventions of the base training data and the AI is used to working with them.
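Replacing chapter titles can be sketched as follows. Both the heading pattern (lines starting with "Chapter" followed by a number or Roman numeral) and the `***` default for the marker are assumptions of this sketch; adjust them to match your source and the break marker you use.

```python
import re

# Hypothetical heading shape: "Chapter 12", "Chapter IV: The Storm", etc.
CHAPTER_HEADING = re.compile(
    r"^Chapter[ \t]+(?:\d+|[IVXLC]+)\b.*$", re.MULTILINE
)

def replace_chapter_headings(text: str, marker: str = "***") -> str:
    """Swap each matching chapter heading line for a break marker."""
    return CHAPTER_HEADING.sub(marker, text)
```

Prose lines that happen to begin with "Chapter 3 of the manual..." would also be replaced, so check the matches first (for instance with `CHAPTER_HEADING.findall(text)`) before running the substitution.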

In general it is a good idea to remove, replace, or trim down anything too repetitive (such as numbered chapters, titles that use the same prefixes, or stylistic phrases like  repeated twenty times in a row) as this will increase the chance of these leaking into your output. That said, if you find yourself removing too much from a work it might just be a better idea to exclude it from your dataset altogether.

Cleaning Prose
Keep an eye out for odd symbols, characters, or other unusual formatting, such as card suits or other symbols used as chapter breaks, or Japanese quotations (「 and 」), which are often found in visual novels (and can be replaced with regular quotations).
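To find odd symbols without reading the whole file, it can help to list every non-ASCII character along with how often it appears. This is a minimal sketch; `unusual_characters` is a hypothetical helper, and treating everything outside ASCII as "unusual" is a simplifying assumption.

```python
import unicodedata
from collections import Counter

def unusual_characters(text: str):
    """List (character, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), count)
        for ch, count in counts.most_common()
    ]
```

Legitimate characters such as the é in Pokémon will show up too, so treat the output as a checklist to review rather than a list of things to delete.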

On this note, when a book is scanned to a digital copy, certain formatting will often have errors or won't scan correctly. You can see examples of this occasionally with underscores, such as from OCR software reading  as  , or with things like telepathic dialogue, which is typically italicized. Since raw text has no characters for italics, it will appear as __dialogue encapsulated by underscores like this__, which needs to be replaced with quotations or angled brackets (< and >), as the AI knows to associate them with non-verbal dialogue.
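The underscore replacement can be sketched like this. It assumes the italic markers are one or more underscores hugging the text on both sides, and it deliberately skips underscores inside words. The function name and its defaults are illustrative.

```python
import re

# One or more underscores, then text with no underscores or newlines,
# then closing underscores. The lookarounds skip cases like snake_case.
UNDERSCORED = re.compile(r"(?<!\w)_+(?=\S)([^_\n]+?)(?<=\S)_+(?!\w)")

def replace_underscore_emphasis(text, opening="<", closing=">"):
    """Wrap underscore-delimited text in the given delimiters instead."""
    return UNDERSCORED.sub(
        lambda match: opening + match.group(1) + closing, text
    )
```

Pass `'"'` for both `opening` and `closing` to use regular quotations instead of angled brackets.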

Another common scanning issue is having chapters start in all uppercase letters (i.e. ) or with the first letter of the first word separated from the rest of the word (i.e.  ).
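Both scanning issues can be sketched per line as below. The inputs used in the comments ("THE NIGHT WAS cold", "T he door") are made-up illustrations of the two problems. The split-letter fix skips "A" and "I" because they are real words, which means it will miss cases like "A lthough".

```python
import re

# A leading run of all-uppercase words, e.g. "THE NIGHT WAS cold".
CAPS_RUN = re.compile(r"^(?:[A-Z]+[ ,]+)+")
# A lone capital (not A or I) split from its word, e.g. "T he door".
SPLIT_INITIAL = re.compile(r"^([B-HJ-Z]) (?=[a-z])")

def lower_caps_opening(line: str) -> str:
    """Lowercase a leading all-caps run, keeping the first letter."""
    def fix(match):
        run = match.group(0)
        return run[0] + run[1:].lower()
    return CAPS_RUN.sub(fix, line)

def join_split_initial(line: str) -> str:
    """Rejoin a separated first letter with the rest of its word."""
    return SPLIT_INITIAL.sub(r"\1", line)
```

Names and acronyms inside the all-caps run lose their capitals this way, so reread each changed line and restore proper nouns by hand.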

Extra spacing is yet another common issue, such as with possessive indicators (i.e. ) or at the end of sentences before a newline. On that note, extra newlines aren't good either, as the AI tends to associate them with chapter breaks or a passage of time.
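As a sketch of the possessive case, assuming the problem looks like `captain 's` (a space wedged before the apostrophe), the pattern below removes spaces that sit between a word and a following `'s`.

```python
import re

# Spaces between a word character and a possessive 's (straight or curly
# apostrophe). The \b keeps quoted words such as 'stop' from matching.
SPACED_POSSESSIVE = re.compile(r"(?<=\w) +(?=['’]s\b)")

def close_possessive_gap(text: str) -> str:
    """Remove stray spaces before a possessive 's."""
    return SPACED_POSSESSIVE.sub("", text)
```

A quoted single letter such as `the letter 's'` would be joined as well, so this is better applied match by match than across a whole file.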

Also be on the lookout for vertical bars, which can be replaced with colons. Lastly, if there are any square brackets ([ and ]) in an unedited file, you might want to remove them unless they encapsulate something such as an indication of time, location, point of view, or some other note you intend to use to nudge the AI in your story.
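Since some square brackets are worth keeping, a reasonable approach is to list them for review instead of deleting them outright. The sketch below does that and also shows the vertical bar replacement. Both helper names are hypothetical, and collapsing the spaces around each bar is a choice made here, not something the guideline specifies.

```python
import re

BRACKETED = re.compile(r"\[[^\[\]\n]*\]")
BAR = re.compile(r"[ \t]*\|[ \t]*")

def bracketed_spans(text: str):
    """Return (line number, bracketed text) pairs for manual review."""
    return [
        (number, match.group(0))
        for number, line in enumerate(text.splitlines(), start=1)
        for match in BRACKETED.finditer(line)
    ]

def bars_to_colons(text: str) -> str:
    """Replace each vertical bar (and spaces around it) with ': '."""
    return BAR.sub(": ", text)
```

A bar at the very end of a line would leave a trailing space behind, so run any whitespace cleanup after this step.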

Miscellaneous Tips

 * Don't worry about, and don't include, the well-known <|endoftext|> token. It is not needed for training AI modules as the training process handles that already.
 * Don't worry about your clean-up or tokens being perfect. AI Modules are not as powerful as the actual underlying fine-tuned model and thus don't have as strong an effect as the existing model's training. If your resulting module produces at least fun or interesting results, that's still very much a success.
 * Make use of a good, feature-rich text editor such as Notepad++, or at least something that supports regular expressions, which greatly cut down on editing time.

Recommended Dataset Tools
 * Gnurro's ReFormatter
 * Belverk's Cleaning Python Script
 * ScrapeFandom
 * Notepad++
 * Regex101

Useful Regular Expressions
A regular expression, or regex, is a sequence of characters that describes a pattern of text, which makes it possible to quickly find and replace more complex patterns. The following are some useful regular expressions for cleaning datasets. Keep in mind that these are potentially highly destructive and should be used with great care. Batch replacement (i.e. "Replace All") with these is not recommended.
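In the spirit of avoiding "Replace All", a regex can be run in a read-only way first. The sketch below lists each match with its line number and some surrounding text so you can decide case by case. `preview_matches` and its `context` parameter are hypothetical names.

```python
import re

def preview_matches(pattern, text, context=30):
    """Yield (line number, snippet) for every match, changing nothing."""
    for match in re.finditer(pattern, text, flags=re.MULTILINE):
        line_number = text.count("\n", 0, match.start()) + 1
        start = max(0, match.start() - context)
        snippet = text[start:match.end() + context].replace("\n", " ")
        yield line_number, snippet
```

Looping over `preview_matches(r"\bteh\b", text)` and printing each pair, for example, shows where every match sits before any replacement is made.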

/ / Selects headers, titles, and anything else that does not end in punctuation before a new line.

/ / Selects sequences of text in all uppercase capital letters. You can optionally replace the selected text with normal lower case (preserving the first uppercase letter) using this in the Replace field / / though keep in mind character names and other proper nouns may lose their first uppercase letter this way.

/ / Selects various problems involving quotations.

/ / Selects cases where the first letter of a word is separated by a space (common scanning error).

/ / Catches many of the auxiliary sections of a work.
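For readers working in Python, two of the ideas above can be sketched with the `re` module. The patterns below are illustrative stand-ins written for this sketch, not the expressions listed above, and the set of section names is an assumption based on the auxiliary sections named earlier in this guide.

```python
import re

# Lines that do not end in closing punctuation (likely headers or titles).
NO_END_PUNCTUATION = re.compile(r"^.*[^\s.!?\"'…:;)\]*]$", re.MULTILINE)

# Lines that open with a common auxiliary section name.
AUX_SECTION = re.compile(
    r"^(?:foreword|afterword|acknowledge?ments?|author'?s note"
    r"|about the authors?)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def find_headers(text: str):
    """Return lines that look like headers or titles."""
    return NO_END_PUNCTUATION.findall(text)

def find_aux_sections(text: str):
    """Return lines that look like auxiliary section headings."""
    return AUX_SECTION.findall(text)
```

Both functions only report matches. Deleting the reported lines, and the section bodies beneath them, is best done by hand.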

Final Thoughts
Dataset cleaning may seem intimidating at first, but once you familiarize yourself with the tools and resources, it's quite easy to get into the groove of things and prep a dataset in an afternoon. Take your time finding good source material too: no matter how well you clean something, it won't make much difference if its inherent quality is already subpar.

Lastly, if you have any follow-up questions, feel free to reach out to NAI's finetune team, including Finetuneanon, Belverk, Zaltys, Gnurro, Rinter, and Lion (who wrote this guide). Special thanks to all of them!
