Apollo's Articles

LoRA Training for Stable Diffusion 3.5

The full article can be found here: Stable Diffusion 3.5 Large Fine-tuning Tutorial

Images should be cropped into these aspect ratios:

(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216), (1344, 768), (768, 1344), (1472, 704)

If you need help automatically pre-cropping your images, this is a lightweight, barebones [script](https://github.com/kasukanra/autogen_local_LLM/blob/main/detect_utils.py) I wrote to do it. It will find the best crop depending on:

1. Is there a human face in the image? If so, we’ll do the cropping oriented around that region of the image.
2. If there is no human face detected, we’ll do the cropping using a saliency map, which will detect the most interesting region of the image. Then, a best crop will be extracted centered around that region.

Here are some examples of what my captions look like:

    k4s4, a close up portrait view of a young man with green eyes and short dark hair, looking at the viewer with a slight smile, visible ears, wearing a dark jacket, hair bangs, a green and orange background

    k4s4, a rear view of a woman wearing a red hood and faded skirt holding a staff in each hand and steering a small boat with small white wings and large white sail towards a city with tall structures, blue sky with white clouds, cropped

If you don't have your own fine-tuning dataset, feel free to use this dataset of paintings by John Singer Sargent (downloaded from WikiArt and auto-captioned) or a synthetic pixel art dataset.

I’ll be showing results from several fine-tuned LoRA models of varying dataset size to show that the settings I chose generalize well enough to be a good starting point for fine-tuning LoRA.

`repeats` duplicates your images (and optionally rotates, changes the hue/saturation, etc.) and captions as well to help generalize the style into the model and prevent overfitting.
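For the aspect-ratio list given earlier in this section, choosing a target size for an image can be sketched as below. This is only an illustration, not the linked cropping script: the function name `closest_bucket` is hypothetical, and reading each pair as (width, height) is an assumption.

```python
import math

# Aspect-ratio buckets from the article, assumed to be (width, height).
BUCKETS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832),
    (832, 1216), (1344, 768), (768, 1344), (1472, 704),
]


def closest_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the bucket whose aspect ratio is closest to width / height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    # Compare on a log scale so that 2:1 and 1:2 are equally far from 1:1.
    return min(
        BUCKETS,
        key=lambda b: abs(math.log(b[0] / b[1]) - math.log(ratio)),
    )


landscape = closest_bucket(1920, 1080)
portrait = closest_bucket(1080, 1920)
square = closest_bucket(2000, 2000)
```

Once a bucket is chosen, the face or saliency region decides where the crop window of that shape is placed.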
While SimpleTuner supports caption dropout (randomly dropping captions a specified percentage of the time), it doesn’t support shuffling tokens (tokens are kind of like words in the caption) as of this moment. However, you can simulate the behavior of kohya’s sd-scripts, where you can shuffle tokens while keeping the first n tokens in their starting positions. Doing so helps the model not get too fixated on extraneous tokens.

### Steps calculation

Max training steps can be calculated based on a simple mathematical equation (for a single concept):

Max training steps = (number of samples × number of repeats / batch size) × epochs

There are four variables here:

- Batch size: The number of samples processed in one iteration.
- Number of samples: Total number of samples in your dataset.
- Number of repeats: How many times you repeat the dataset within one epoch.
- Epochs: The number of times the entire dataset is processed.

There are 476 images in the fantasy art dataset. Add on top of that the 5 repeats from `multidatabackend.json`. I chose a `train_batch_size` of 6 for two reasons:

- This value would let me see the progress bar update every second or two.
- It’s large enough that it can take 6 samples in one iteration, making sure that there is more generalization during the training process.

If I wanted 30 or so epochs, then the final calculation would be this:

(476 × 5 / 6) × 30 = 11,900

476 × 5 / 6 represents the number of steps per epoch, which is 396 (rounded down). As such, I rounded this value up to 400 for `CHECKPOINTING_STEPS`.

⚠️ Although I calculated 11,900 for `MAX_NUM_STEPS`, I set it to 24,000 in the end. I wanted to see more samples from the LoRA training. Thus, anything after the original 11,900 would give me a good gauge on whether I was overtraining or not. So, I just doubled the total steps (11,900 × 2 = 23,800), then rounded up.

`CHECKPOINTING_STEPS` represents how often you want to save a model checkpoint. Setting it to 400 is pretty close to one epoch for me, so that seemed fine.

`CHECKPOINTING_LIMIT` is how many checkpoints you want to save before overwriting the earlier ones.
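The arithmetic in this section can be checked with a short sketch. It only mirrors the numbers quoted in the article (476 images, 5 repeats, batch size 6, 30 epochs, 24,000 final steps, a checkpoint every 400 steps); the function names are hypothetical and nothing here calls SimpleTuner itself.

```python
def steps_per_epoch(num_samples: int, repeats: int, batch_size: int) -> int:
    """Steps in one epoch, rounded down as in the article (396)."""
    return num_samples * repeats // batch_size


def max_training_steps(num_samples: int, repeats: int,
                       batch_size: int, epochs: int) -> int:
    """(samples * repeats / batch size) * epochs, for a single concept."""
    return round(num_samples * repeats * epochs / batch_size)


per_epoch = steps_per_epoch(476, 5, 6)         # 396
total = max_training_steps(476, 5, 6, 30)      # 11900

# With the final settings, a checkpoint every 400 steps over 24,000 steps
# produces this many checkpoints in total.
checkpoints_saved = 24_000 // 400              # 60
```
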
In my case, I wanted to keep all of the checkpoints, so I set the limit to a high number like 60.

### Multiple concepts

The above example is trained on a single concept with one unifying trigger word at the beginning: k4s4. However, if your dataset has multiple concepts/trigger words, then your step calculation could be something like this:

2 concepts [a, b]

Lastly, for learning rate, I set it to 1.5e-3, as any higher would cause the gradient to explode.

The other relevant settings are related to LoRA.

    {
      "--lora_rank": 768,
      "--lora_alpha": 768,
      "--lora_type": "standard"
    }

Personally, I received very satisfactory results using a higher LoRA rank and alpha. You can watch the more recent videos on my YouTube channel for a more precise heuristic breakdown of how image fidelity increases the higher you raise the LoRA rank (in my opinion).

Anyway, if you don’t have the VRAM, storage capacity, or time to go so high, you can choose to go with a lower value such as 256 or 128.

As for `lora_type`, I’m just going with the tried and true `standard`. There is another option for the lycoris type of LoRA, but it’s still very experimental and not well explored.
I have done a deep-dive into lycoris myself, but I haven’t found the correct settings that produce acceptable results.

### Custom config.json miscellaneous

There are some extra settings that you can change for quality of life.

    {
      "--validation_prompt": "k4s4, a waist up view of a beautiful blonde woman, green eyes",
      "--validation_guidance": 7.5,
      "--validation_steps": 200,
      "--validation_num_inference_steps": 30,
      "--validation_negative_prompt": "blurry, cropped, ugly",
      "--validation_seed": 42,
      "--lr_scheduler": "cosine",
      "--lr_warmup_steps": 2400
    }

These are pretty self-explanatory:

- "--validation_prompt": The prompt that you want to use to generate validation images. This is your positive prompt.
- "--validation_negative_prompt": Negative prompt.
- "--validation_guidance": Classifier free guidance (CFG) scale.
- "--validation_num_inference_steps": The number of sampling steps to use.
- "--validation_seed": Seed value when generating validation images.
- "--lr_warmup_steps": SimpleTuner has set the default warm up to 10% of the total training steps behind the scenes if you don’t set it, and that’s a value I use often. So, I hard-coded it in (24,000 * 0.1 = 2,400). Feel free to change this.
- "--validation_steps": The frequency at which you want to generate validation images. I set mine to 200, which is half of 400 (the number of steps in an epoch for my fantasy art example dataset). This means that I generate a validation image every half epoch. I suggest generating validation images at least every half epoch as a sanity check. If you don’t, you might not be able to catch errors as quickly as you otherwise could.

Lastly is "--lr_scheduler" and "--lr_warmup_steps". I went with a cosine scheduler.
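As a rough illustration of what a cosine schedule with warmup does to the learning rate, here is a minimal sketch. It assumes a textbook shape (linear warmup from zero, then cosine decay to zero) with the article's values of 1.5e-3, 2,400 warmup steps, and 24,000 total steps; the function `lr_at_step` is hypothetical, and SimpleTuner's actual scheduler may differ in its details.

```python
import math


def lr_at_step(step: int, base_lr: float = 1.5e-3,
               warmup_steps: int = 2400, total_steps: int = 24000) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half a cosine wave: starts at base_lr and ends at 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at each checkpoint interval (every 400 steps).
schedule = [(s, lr_at_step(s)) for s in range(0, 24001, 400)]
```

Under these assumptions the rate peaks at step 2,400 and then falls smoothly, so the later checkpoints are trained with progressively smaller updates.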
This is what it will look like:

### Memory usage

If you aren’t training the text encoders (we aren’t), `SimpleTuner` saves us about `10.4 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/316002db-297b-45a9-b919-cec6b311c773/image.png)

With the settings of `batch size` of `6` and a `lora rank/alpha` of `768`, the training consumes about `32 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/c2aac70a-8c65-4f6f-b602-487f24de4bd2/image.png)

Understandably, this is out of the range of consumer `24 GB` VRAM GPUs. As such, I tried to decrease the memory costs by using a `batch size` of `1` and `lora rank/alpha` of `128`.

Tentatively, I was able to bring the VRAM cost down to around `19.65 GB` of VRAM.

However, when running inference for the validation prompts, it spikes up to around `23.37 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/0c5240d6-6f71-404e-bea7-b18cc35ee5ad/image.png)

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/026be306-8331-45a2-9c02-541005f2cdfd/image.png)

To be safe, you might have to decrease the `lora rank/alpha` even further to `64`. If so, you’ll consume around `18.83 GB` of VRAM during training.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/5edcaaf9-bf0d-4db0-a183-cfab44963b8e/image.png)

During validation inference, it will go up to around `21.50 GB` of VRAM.
This seems safe enough.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/bd41ce4e-a0db-443b-b3d2-63eac136779d/image.png)

If you do decide to go with the higher spec training of `batch size` of `6` and `lora rank/alpha` of `768`, you can use the `DeepSpeed` config I provided [above](https://www.notion.so/Stable-Diffusion-3-5-Large-Fine-tuning-Tutorial-11a61cdcd1968027a15bdbd7c40be8c6?pvs=21) if your GPU VRAM is insufficient and you have enough CPU RAM.

Apollo

Prompt Engineering for PONY Diffusion Model and the Usage of Danbooru Tags

The PONY Diffusion model is an intriguing AI tool used for generating high-quality images based on tags for prompts. It leverages Danbooru tags, a structured and detailed tagging system, to enhance the accuracy and specificity of the generated images. Understanding how to effectively craft prompts using these tags can significantly improve the results you achieve with the PONY Diffusion model.

### Understanding Danbooru Tags

Danbooru tags are keywords that describe various elements within an image. They are categorized into several groups, such as character traits, clothing, pose, background, and more. By using these tags, you can guide the PONY Diffusion model to create images that closely match your vision.

### Basic Prompt Structure

When creating prompts for the PONY Diffusion model, it’s essential to follow a structured approach. Here’s a basic structure you can use:

1. Scene Quality: Include any Embedding tags, Score, rating and source tags.
2. Camera Specification: Define the distance, angle, distortion level, etc.
3. Character Description: Provide an essential description of the main character.
4. General Details: List clothing or objects defining the character or image.
5. Scene Description: Include key factors for modifying the general set.
6. Background Description: Describe the ambiance, architecture, weather, etc.
7. Artist Style: Modify the entire scene in accordance with cultural phenomena.

### Crafting Effective Prompts

1. Start with a Clear Concept: Begin with a clear idea of what you want to generate. For example, an office romance scene with two coworkers.
2. Use Specific Tags: Select relevant Danbooru tags to describe the characters, setting, and actions.
3. Combine Tags Thoughtfully: Ensure the tags you combine make sense together and provide a coherent description.
4. Avoid Redundant Tags: Don’t repeat tags within the same prompt.
It adds unnecessary complexity.### Example PromptsHere are a few examples using some of your most used tags:Example 1: Romance Kim Possible and Bonnie Rockwallerscore_9, score_8_up, score_7_up, score_6_up, realistic, wide shot, 2girl, aged up, Kim possible,red lipstick, sexy, beautiful, smooth face, seductive smile, huge breasts, slim waist, thick thighs,green lingerie, thigh high stockings, Bonnie rockwaller in white lingerieExample 2: Inspired by Jewelz Bluscore_9, score_8_up, score_7_up, score_6_up, 1girl, solo, full body, candid shot, sitting, reclining, on bed, ankles crossed, looking at camera, naughty smile, relaxed pose, (blue hair:1.4), (hime cut:1), blue eyes, thin eyebrows, blue eyebrows, slight arch, round face shape, well proportioned nose, (full lips:1), (blue lipstick), high cheekbones, flawless skin, large breasts, narrow waist, wide hips, voluptuous thighs, large hoop earrings,long fingernails, blue nails, blue platform high heels, blue subtle makeup, blue taut pencil dress, inside, stage background, studio lighting, soft shadows, soft lightingExample 3: Princess Zelda Inspired Portraitscore_9, score_8_up, score_7_up, score_6_up, 1girl, aged up, Zelda ,red lipstick, sexy, beautiful, smooth face, seductive smile, huge breasts, slim waist, thick thighs,[purple lingerie|gold lingerie|white lingerie], thigh high stockings### Practical Tips-Danbooru tags: These tags help in guiding your prompt creation. But that does not mean the tags will always work.-Existing Character: Try using a character built into the model and augment what your looking for. You can find a list of character and tags here. If you are looking for realistic try not including the source tag.-Defining age can be difficult: Try using In the prompt, "age XX" where XX is the bottom age in years for my desired range (10, 20, 30, etc.) 
augmented with the following terms:

  - "infant" for <2 yrs
  - "child" for <10 yrs
  - "teen" to reinforce "age 10"
  - "college age" for upper "age 10" range into low "age 20" range
  - "young adult" reinforces "age 30" range into middle "age 40" range
  - "middle age" for upper "age 40" range into lower "age 60" range
  - "grandmother/grandfather" for "age 60" on up

  I'll use similar terms in the negative prompt to refine to a tighter age range.

- Experiment with Variations: Don’t hesitate to tweak your prompts and experiment with different combinations of tags to achieve the best results.
- Tag Weights: Some tags will automatically outweigh others, so use () to increase weight, or (tag:1.x) where x is a digit from 1 to 9. To decrease weight, use [] or (tag:0.x) where x is a digit from 9 down to 1.
- Negative Prompts: Use negative prompts to exclude unwanted elements. For example, `negative_prompt: lowres, source_cartoon, source_anime, sketch, drawing`.

### Useful Links

- Tag groups - This is a list of wikis that are themselves lists of tags. Oftentimes, a tag is listed in multiple pages, or in multiple sections of the same page, for better navigation.
- Fashion Styles - Tags representing clothing and fashion styles.
- Attire - Tags representing different types of clothing.
- Dresses - Tags for appearance and models of dresses as well as actions.
- Sexual Attire - Tags representing clothing used in intimate contexts.
- Shoes - Different shoe styles.
- Body parts - Different body parts and augmentations.
- howto:rate - With ratings, replace the : with an _ to get the desired effect.

### Conclusion

Prompt engineering for the PONY Diffusion model using Danbooru tags is an art in itself. By understanding and effectively utilizing these tags, you can significantly enhance the quality and relevance of your generated images. Remember to be specific, avoid redundancy, and always experiment with different tag combinations to find what works best for you.

Happy creating!
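The basic prompt structure, the advice to avoid redundant tags, and the (tag:1.x) weight syntax can be combined in a small helper. This is only a sketch: `weighted` and `build_prompt` are hypothetical names, the example tags are placeholders, and how a weight is interpreted depends on the front end you use.

```python
def weighted(tag: str, weight: float) -> str:
    """Format a tag with an explicit weight, e.g. (blue hair:1.4)."""
    return f"({tag}:{weight:g})"


def build_prompt(*sections: list[str]) -> str:
    """Join tag sections in order, dropping tags that were already used."""
    seen = set()
    ordered = []
    for section in sections:
        for tag in section:
            key = tag.strip().lower()
            # Skip empty entries and case-insensitive duplicates.
            if key and key not in seen:
                seen.add(key)
                ordered.append(tag.strip())
    return ", ".join(ordered)


prompt = build_prompt(
    ["score_9", "score_8_up", "score_7_up"],        # 1. scene quality
    ["wide shot"],                                   # 2. camera specification
    ["1girl", "solo"],                               # 3. character description
    [weighted("blue hair", 1.4), "business suit"],   # 4. general details
    ["sitting", "typing"],                           # 5. scene description
    ["office", "window", "soft lighting", "solo"],   # 6. background ("solo" repeats)
)
```

Keeping each section as its own list makes it easy to swap one part, such as the background, while leaving the rest of the prompt unchanged.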
