
Finetuning - Walkthrough

Author: David Mair

Date: 08.10.2024

Finetuning Models for Coherent Visual Style

Finetuning models is and has been the key to a coherent visual style. Until now, its complexity made it a gatekeeper reserved for technical creatives: the machine power, the setup of the programs, and, last but not least, the dozens of parameters that influence training tremendously. We have taken all of this off your plate, so you can now finetune a model simply by collecting a dataset of images that you want to use as a reference.

The Benefit: Generating Branded Content

Unlike Midjourney, Stable Diffusion offers an open environment in which you can change the output of a source model. How? That’s what we will cover in this article.

Tutorial Overview

This tutorial will walk you through the following necessary steps:

  1. Interface Overview
  2. Dataset Preparation
  3. Captioning
  4. Training
  5. Last Words on Settings

1. Interface Overview

Before uploading images | After uploading images

We have reduced all the options to a minimum. When you enter the training interface, you see the four fields “Name,” “Trigger Words,” “Type,” and “Upload Your Images.” Here’s a short summary of what each field stands for:

  • Name: Choose a unique name that describes the model’s capabilities and content well; this makes it easier to tell your models apart once you have trained several.
  • Type: Here you choose what type of model you want to train. For now, there are four types: Face, Style, Person, and Concept.
    • Face: Your dataset should showcase a variety of portrait pictures and close-ups of one person, but also some full-body shots so that the model learns it can be used for those as well. Use 5 to 15 high-resolution images: SDXL does not require many images to get good results. Make sure your images are at least 1024x1024, or they will be scaled up, which can introduce artifacts.
    • Style: Branded content relies on certain stylistic choices such as grading, lighting, and camera settings, and the visual world around a brand is highly important. Training a model on this specific style can open up new worlds of creativity. For this, use only images in your dataset that match the style you want to achieve: paintings, a certain hue and color balance, or anything else you can think of. For styles, use 50 to 500 high-quality images, and again make sure they are at least 1024x1024 to avoid resizing.
    • Person: Unlike face training, this focuses on the whole person. Prefer images that show the person from head to toe, not only portraits.
    • Concept: This can stand for many things, such as people jumping on a trampoline, or a certain object like a car or a can of soda. What matters here, again, is that the dataset showcases the idea in different varieties so the model can be used flexibly.

  • Trigger Words: Just as important as the images themselves is the tagging (or captioning), which includes the trigger words. They are especially important when training on a face or a concept such as an object. If you train a model on a certain-looking person with red hair, "a person with red hair" is not a good trigger because Stable Diffusion already has a predefined understanding of this concept. Instead, choose a unique word that activates this specific likeness, such as "a ukj person" or "a r3d person" (see the tokenizer sketch after this list). It is also possible to add further trigger words for subconcepts: for example, if you train a certain bottle shape, you can use the trigger word "filled" or "empty" depending on the image. This helps you activate these concepts later when prompting with the model.
  • Upload Your Images: Just drag and drop your dataset of images here, and you will automatically land in the captioning section. The trigger words are added automatically; you can still delete the ones that don’t fit and add a more detailed description. Keep it simple and comma-separated. Example: epic mountains, sunset, illustration style. Take your time with this part: captioning is a very important part of training, as it determines which words will influence your results the most.
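A quick way to check whether a candidate trigger word is unique enough is to look at how the text encoder tokenizes it: a common word maps to a single well-known token, while a good trigger word breaks into rare sub-tokens the model has no strong prior for. Here is a minimal sketch in Python, assuming the standard CLIP tokenizer used by Stable Diffusion's text encoder (the hosted trainer may handle this differently):

```python
from transformers import CLIPTokenizer

# Tokenizer of Stable Diffusion's text encoder (assumption: the hosted
# trainer uses a standard CLIP tokenizer as well).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for candidate in ["person with red hair", "ukj person", "r3d person"]:
    tokens = tokenizer.tokenize(candidate)
    # Familiar words tokenize cleanly; a rare trigger like "ukj" splits
    # into unusual sub-tokens that carry little pretrained meaning.
    print(f"{candidate!r} -> {tokens}")
```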

2. Dataset Preparation

Example dataset images

Your dataset is a critical part of training. It provides the model with the reference material needed to learn and generalize based on your desired output. To ensure the best results, follow these guidelines:

  • Variety and Consistency: If you use the same image over and over, your generated outputs will lack diversity. On the other hand, if your dataset includes different lighting, poses, camera angles, and environments, the model will learn to generate more versatile results.

  • Quality of Images: Always aim for high-quality images (720p or higher). Low-resolution or blurry images can limit the model’s ability to capture fine details. If your generated images don’t look like the subject, there can be several reasons; try improving the quality of your training set before you optimize other parameters. Include only high-quality images that are well lit, in focus, and show the subject clearly. More images are generally better, but only if all of them are good: adding mediocre images will not help. Finally, crop your images to a square format. SDXL is trained on 1024x1024 squares, so depending on where your subject sits in a training image, it may end up being cropped. By cropping the images to squares yourself, you ensure that the model is trained on exactly what you want (see the cropping sketch at the end of this section).

Example:

  • For training a model on Arnold Schwarzenegger, don’t just use images of him in a bodybuilding pose. Include images where he’s in different settings, like wearing a suit, to help the model learn his full range of appearances.

A dataset is a collection of images that you want to use as a source for your training. These images guide what the model will output when you use it later. If they all show the same thing, the generations will show that as well; if there is variety in the images, in sunlight, camera angles, clothing, poses, and so on, the model gains a wider understanding of the idea.

Take 10 images of Arnold Schwarzenegger, for example, all of them on stage in his bodybuilding prime, photographed from a low angle. Trained on these, your custom model learns that this is what Arnold Schwarzenegger looks like, so you won’t get good results if you want to generate him in a suit. If, on the other hand, you use 10 different images, your model learns that Arnold can wear a suit as well.

This idea can be used for every dataset that you prepare. Make it specific, and it will generate specific results. Make it broad, and it will generate broad results.
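To put the cropping advice from this section into practice before uploading, here is a minimal sketch using Pillow. The folder names and the 1024x1024 target are assumptions based on the SDXL guidance above, not part of the upload interface:

```python
from pathlib import Path
from PIL import Image

SRC = Path("raw_images")  # assumed input folder
DST = Path("dataset")     # assumed output folder
SIZE = 1024               # SDXL's native training resolution

DST.mkdir(exist_ok=True)
for path in sorted(SRC.glob("*")):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    img = Image.open(path).convert("RGB")
    w, h = img.size
    if min(w, h) < SIZE:
        # Upscaling can introduce artifacts, so flag small images instead.
        print(f"skip {path.name}: {w}x{h} is below {SIZE}px")
        continue
    # Center-crop to a square, then resize down to 1024x1024. Adjust the
    # crop box by hand if your subject sits off-center in the frame.
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img.resize((SIZE, SIZE), Image.LANCZOS).save(DST / f"{path.stem}.png")
```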

3. Captioning

The captioning interface

Each picture needs a description. Don't worry, this is easy, although it may take a little time. Trigger words let you describe the core concept of all the images at once, because most of them will likely show something similar. The more often a word appears across the captions, the more the machine learns that "a high angle view of a ukj person" is what a high-angle view of this specific person looks like, or what this specific "product shot of a ukj bottle" looks like. So use this line to tag the words that describe most of your images.

Think of the caption as a prompt: if you are training a certain-looking person, you wouldn't describe them in detail; you would just point out to Stable Diffusion that this is a "ukj person". But you would describe the environment, the setting, or the wardrobe, because those are concepts you might want to change in the prompt. As a rule of thumb: describe in the caption everything you want to be able to change, but don't mention what is specific to the subject. For styles, captions are not strictly necessary, but good captions will help the training process.

After adding trigger words, you can activate them for every image they fit. If there is something specific in a picture, add it in the caption field. Accurate captioning is vital, as it guides the training direction. Use simple, comma-separated descriptors, for example: “Epic mountains, sunset, illustration style.” Take your time with this step to ensure precision.
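To illustrate this rule of thumb, here is a minimal sketch that assembles captions from a trigger word plus per-image descriptors. The one-text-file-per-image layout is a common convention of community LoRA trainers, an assumption here rather than how the hosted interface stores captions:

```python
from pathlib import Path

TRIGGER = "ukj person"  # the unique trigger word chosen earlier

# Describe what you may want to change later in prompts (setting,
# wardrobe, angle) -- not what is specific to the subject itself.
captions = {
    "img_001.png": "wearing a suit, office, high angle view",
    "img_002.png": "hiking outfit, mountains, sunset",
    "img_003.png": "close-up portrait, studio lighting",
}

dataset = Path("dataset")
for filename, descriptors in captions.items():
    caption = f"a {TRIGGER}, {descriptors}"
    # Common convention: the caption sits next to the image with the
    # same name and a .txt extension.
    (dataset / filename).with_suffix(".txt").write_text(caption)
```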

In the next example, we trained a model on a car. By describing the camera angle in every caption, we gained more control over the image generation. Remember: add what you want the machine to learn.

“High angle front view of a smart fortwo next to an active volcano” | “Low angle rear view of a smart fortwo next to an active volcano”

4. Training

Now that you have everything set, you are ready to train. Just click the orange button in the bottom right and wait. You can close the window or grab a coffee, because this will take a while. The finished model will show up in the “Train an AI Model” section of the interface. Activate it and start generating with your own personalized finetuned model.

Managing Expectations:

Training is not always deterministic. Even when keeping parameters the same, slight variations in outcomes are normal. If a model doesn’t perform as expected on the first try, you may need to adjust your dataset or captions, or rerun the training process.

Loading images | Selecting the trained LoRA in the interface

5. Last Words on Settings

We have optimized default settings to give you the best results with minimal effort. However, if you are familiar with model training and want more control, you can click the "Settings" button to adjust parameters such as:

  • Learning Rate: Controls how fast the model learns. Adjusting it can change the final quality of your results, but be cautious: small tweaks can lead to large changes in performance.
  • Batch Size: How many images are processed simultaneously. This directly impacts training time and resource usage, especially on limited hardware.
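For orientation, here is a hypothetical configuration in the spirit of common community LoRA trainers. The parameter names and values are illustrative assumptions, not the platform's actual defaults:

```python
# Hypothetical LoRA training configuration, for illustration only.
config = {
    "learning_rate": 1e-4,    # a typical LoRA starting point; small changes shift results a lot
    "batch_size": 2,          # higher means faster epochs but more VRAM
    "resolution": 1024,       # SDXL's native training resolution
    "max_train_steps": 1500,  # longer is not always better; watch for overfitting
}

for key, value in config.items():
    print(f"{key:>16}: {value}")
```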

Unless you're experienced, it's best to leave these settings at default to avoid unintended results or crashes during training.

Finetuning a model is a powerful way to generate custom branded content and achieve a consistent visual style. By carefully curating your dataset, writing precise captions, and managing your expectations during training, you can create models that fit your unique creative vision. Remember, practice and experimentation are key to getting the best results.

If you want to learn more about our features, read our tutorials about Image to Image and how to import an existing model. All tutorials can be found in our blog section.