Diffusion Models: A Practical Guide
Diffusion models have the power to generate any image you can imagine. This guide will help you use them to your advantage, whether you are a creative artist, software developer, or business executive.
Introduction
With the release of Dall-E 2, Google's Imagen, Stable Diffusion, and Midjourney, diffusion models have taken the world by storm, inspiring creativity and pushing the boundaries of machine learning.
Given a text prompt, these models can produce almost any kind of image, from the fanciful to the futuristic to the photorealistic and, of course, the cute.
![Data Labeling for Machine Learning](/_next/image?url=%2Fassets%2Fguides%2Fdiff1.webp&w=2048&q=75)
With these superpowers, we can create nearly any image we can think of, redefining what it means for humans to engage with silicon. Despite their sophisticated capabilities, diffusion models do have some limitations, which we will discuss later in the guide. But as these models advance further, or are replaced by a new generative paradigm, people will be able to use their imaginations to create immersive images, films, and other media.
In this guide, we examine diffusion models: how they work, their real-world uses, and where they may be headed next.
What are Diffusion Models?
Diffusion models belong to a class of machine learning models known as generative models, which can produce new data that resembles their training data. Other generative models include flow-based models, Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs). While all of these can produce high-quality images, each has limitations that make it less effective than diffusion models.
Diffusion models work, in essence, by first adding noise to destroy training data and then learning to recover that data by reversing the noising process. Put differently, diffusion models can produce coherent images from pure noise.
During training, noise is progressively added to images so the model can learn how to remove it. At generation time, the model applies this learned denoising process to a random seed to produce a realistic image.
![adding noise to images](/_next/image?url=%2Fassets%2Fguides%2Fdiff2.webp&w=2048&q=75)
![adding noise to images](/_next/image?url=%2Fassets%2Fguides%2Fdiff3.webp&w=2048&q=75)
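To make the noising step concrete, below is a minimal sketch of the forward (noising) process in PyTorch. The linear beta schedule, step count, and image shape are illustrative assumptions rather than the settings of any particular model discussed in this guide.

```python
# A minimal sketch of the forward (noising) process used to corrupt training images.
# Assumes PyTorch; the linear beta schedule and image shape below are illustrative choices.
import torch

def add_noise(x0, t, betas):
    """Sample a noised image x_t from a clean image x0 at diffusion step t."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]  # cumulative signal retained after t steps
    noise = torch.randn_like(x0)                 # Gaussian noise epsilon ~ N(0, I)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return x_t, noise                            # the denoiser is trained to predict `noise`

betas = torch.linspace(1e-4, 0.02, 1000)  # noise schedule over 1000 steps
x0 = torch.rand(3, 64, 64)                # stand-in for a training image
x_t, eps = add_noise(x0, t=500, betas=betas)
```

A denoising network is then trained to predict the added noise from the noised image and the step index; generating a new image amounts to running that denoiser repeatedly in reverse, starting from pure noise.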
Combined with text-to-image guidance, these models can be conditioned on text to create a near-infinite variety of images from words alone. Text embeddings such as CLIP's steer the denoising process, providing powerful text-to-image capabilities.
Diffusion models can complete various tasks, including image generation, image denoising, inpainting, outpainting, and bit diffusion.
- Dall-E 2: When Dall-E 2 was unveiled in April 2022, it produced images with even more realism and higher resolution than the original Dall-E. As of September 28, 2022, Dall-E 2 is accessible to the general public on the OpenAI website. A limited number of generations are free, and more can be purchased.
- Imagen: Google's text-to-image diffusion model, announced in May 2022; it is not available to the public.
- Stable Diffusion: In August 2022, Stability AI released Stable Diffusion, an open-source diffusion model similar to Dall-E 2 and Imagen. Stability AI released both the source code and model weights, opening the model up to the entire AI community. Stable Diffusion was trained on an open dataset: the 2-billion-image English-labeled subset of LAION-5B, a CLIP-filtered collection of image-text pairs crawled from the internet by the German non-profit LAION.
- Midjourney: another diffusion model, released in July 2022 and accessible through its Discord bot.
Simply put, diffusion models are generative tools that enable users to create almost any image they can imagine.
Diffusion Models: Why Are They Important?
Diffusion models represent the current pinnacle of generative capability. They did not appear out of thin air: they were made possible by more than a decade of progress in machine learning methods, the widespread availability of enormous amounts of visual data, and better hardware.
For some context, below is a brief outline of significant machine learning developments.
- The groundbreaking ImageNet paper and dataset, containing over 14 million manually annotated images, were presented at CVPR in 2009. This enormous dataset remains useful for researchers and companies developing models today.
- Ian Goodfellow introduced GANs in 2014, bringing powerful generative capabilities to machine learning models.
- LLMs made their debut with the publication of the original GPT in 2018, followed shortly by GPT-2 and the current GPT-3, which brought powerful text generation.
- In 2020, NeRFs made it possible to reconstruct 3D scenes from a series of images with known camera poses.
- Diffusion models have carried this progress forward over the last few years, providing even more powerful generative capabilities.
What makes diffusion models so strikingly different from their predecessors? The most apparent answer is their ability to generate highly realistic imagery and to match the distribution of real images better than GANs. Diffusion models are also more stable to train than GANs, which are prone to mode collapse: after training, a GAN may capture only a small number of modes of the real data distribution. In the worst case, mode collapse means only a single image would be returned for any query, though in practice the issue is rarely that severe. Because the diffusion process smooths out the distribution, diffusion models sidestep this issue and produce a wider variety of images than GANs.
A wide range of inputs can also be used to condition diffusion models, including text for text-to-image creation, bounding boxes for layout-to-image generation, masked images for inpainting, and lower-resolution images for super-resolution.
Diffusion models have a wide range of applications, and their actual applications are continually developing. These models will have a significant impact on marketing, social media, AR/VR, retail and eCommerce, and more.
Getting Started with Diffusion Models
Web applications such as OpenAI's Dall-E 2 and Stability AI's DreamStudio make diffusion models readily accessible. With these tools, beginners can quickly and easily start working with diffusion models: you can generate images from prompts and carry out inpainting and outpainting. DreamStudio offers greater control over the output parameters, while Dall-E 2 has a more straightforward interface with fewer features. New users on each platform receive free credits; after those credits are used up, a usage fee applies.
- DreamStudio: With DreamStudio from Stability AI, users can quickly work with Stable Diffusion without having to worry about the underlying infrastructure. Tools are available for image generation as well as inpainting and outpainting. Notably, the interface lets the user set a random seed, meaning that you can navigate the latent space with a fixed prompt (more on this later). New users receive 200 free credits.
- Dall-E 2: According to OpenAI, Dall-E 2 has exited its restricted beta and is now generally available to all users. It offers a straightforward, uncomplicated interface for image generation, inpainting, and outpainting.
- Local Installation:
- Stability AI made headlines when it announced that it was open-sourcing both the model weights and source code for its diffusion model, Stable Diffusion.
- You can download and install Stable Diffusion on your local computer and integrate its capabilities into applications and workflows (a minimal example is sketched after this list).
- Other models, such as Dall-E 2, are currently only available via API or web app as their models are not open-source like Stable Diffusion.
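As a concrete starting point, here is a minimal sketch of generating an image with a locally installed Stable Diffusion model via Hugging Face's diffusers library. It assumes diffusers, transformers, and PyTorch are installed and a CUDA GPU is available; the checkpoint name and prompt are illustrative.

```python
# A minimal sketch of running Stable Diffusion locally with Hugging Face's diffusers library.
# Assumes diffusers, transformers, and PyTorch are installed and a CUDA GPU is available;
# the checkpoint name and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "an oil painting of a person in a grocery store"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("grocery_store.png")
```

Without a GPU you can drop the float16 dtype and run on the CPU, though generation will be much slower.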
Additionally, aggregation sites such as Lexica.art offer a vast collection of curated images and their prompts for search, making it even simpler to get started, gain inspiration from the community's work, and learn how to craft better prompts.
Diffusion Model Prompt Engineering
With diffusion models, you shape the output through prompts. Diffusion models take two main inputs, a text prompt and a seed integer, which together are translated into a fixed point in the model's latent space. The user enters the text prompt, while the seed integer is usually generated automatically. Achieving the ideal result requires constant experimentation, which is where prompt engineering comes in. To help you shape the images you wish to generate, we examined Dall-E 2 and Stable Diffusion. Based on our findings, we have compiled our best advice on how to make the most of your prompts, including prompt length, creative style, and essential phrases.
How to prompt
In general, a prompt has three main components: the frame, the subject, and the style. A few additional factors, such as the seed and prompt length, are covered below as well.
1. Frame - The frame is the kind of image that will be created. Combined with the style specified later in the prompt, it determines the overall look and feel of the image. Photographs, digital illustrations, oil paintings, pencil drawings, one-line drawings, and matte paintings are a few examples of frames.
- The following examples are modified versions of the base prompt "Painting of a person in a Grocery Store," in the frame of an oil painting, a digital illustration, a realistic photo, and a 3D cartoon.
- Diffusion models typically default to a “picture” frame if not specified, though this is dependent on the subject matter. By specifying a frame of an image, you control the output directly.
- By modifying the frame to “Polaroid” you can mimic the output of a polaroid camera, complete with large white borders.
- Pencil Drawings can be produced as well.
- And as already covered, different painting techniques can be applied.
- Frames offer a broad indication of the kind of output the diffusion model ought to produce. However, your prompts should also include a compelling subject and a well-chosen style in order to produce images that stand out. After discussing subjects, we will go over some helpful hints and techniques for coordinating frames, subjects, and styles to enhance your images.
2. Subject - The main subject for generated images can be anything you can dream up.
- Diffusion models are built largely from publicly available internet data and are able to produce highly accurate images of objects that exist in the real world.
- However, diffusion models often struggle with compositionality, so it is best to limit your prompts to one or two subjects.
- Sticking to one or two subjects produces generally good results, for example "Chef Chopping Carrots on a cutting board."
- Even though there is some confusion here with a knife chopping another knife, there are chopped carrots in the scene, which is generally close to the original prompt.
- However, expanding to more than two subjects can produce unreliable and sometimes humorous results:
- Diffusion models tend to fuse two subjects into a single subject if the subjects are less common. For example, the prompt “a giraffe and an elephant” yields a giraffe-elephant hybrid rather than a single giraffe and a single elephant. Interestingly, there are often two animals in the scene, but each is typically a hybrid.
- Some attempts to prevent this, including adding in a preposition like “beside,” have mixed results but are closer to the original intent of the prompt.
- This issue appears subject-dependent, as a more popular pair of animals, such as “a dog and a cat,” generates distinct animals without a problem.
3. Style - The style of an image has several facets, key ones being the lighting, the theme, the art influence, or the period.
- Details such as “Beautifully lit”, “Modern Cinema”, or “Surrealist”, will all influence the final output of the image.
- Referring back to the prompt of "chefs chopping carrots," we can influence this simple image by applying new styles. Here we see a “modern film look” applied to the frames of “Oil Painting” and “Picture.”
- The tone of the images can be shaped by a style, here we see “spooky lighting.”
- You can fine-tune the look of the resulting images by slightly modifying the style. We start with a blank slate of “a house in a suburban neighborhood.”
- By adding “beautifully lit surrealist art” we get much more dynamic and intense images.
- Tweaking this we can get a spooky theme to the images by replacing “beautifully lit” with the phrase “spooky scary.”
- Apply this to a different frame to get the desired output, here we see the same prompt with the frame of an oil painting.
- We can then alter the tone to “happy light” and see the dramatic difference in the output.
- You can change the art style to further refine the images, in this case switching from “surrealist art” to “art nouveau.”
- As another demonstration of how the frame influences the output, here we switch to “watercolor” with the same style.
- Different seasons can be applied to images to influence the setting and tone of the image.
- There is a near-infinite variety of combinations of frames and styles and we only scratch the surface here.
- Artists can be used to fine-tune your prompts as well. The following are versions of the same prompt, "person shopping at a grocery store," styled to look like works of art from famous historic painters.
- By applying different styles and frames along with an artist, you can create novel artwork.
- Start with a base prompt of “painting of a human cyborg in a city [artist] 8K highly detailed.”
- While the subject is a bit unorthodox for this group, each painting fits the expected style profile of each artist.
- We can alter the style by modifying the tone, in this case, to “muted tones”:
- You can further alter the output by modifying both the frame and the tone to get unique results, in this case, a frame of a “3D model painting” with neon tones.
- Adding the qualifier, “the most beautiful image you've ever seen” yields eye-catching results.
- And depictions such as “3D model paintings” yield unique, novel works of art.
- By modifying the frame and style of the image, you can yield some amazing and novel results. Try different combinations of style modifiers, including “dramatic lighting”, or “washed colors” in addition to the examples that we provided to fine-tune your concepts further.
- We hardly scratched the surface in this guide, and look forward to amazing new creations from the community.
4. Seed
- A combination of the same seed, same prompt, and same version of Stable Diffusion will always result in the same image.
- If you are getting different images for the same prompt, it is likely caused by using a random seed instead of a fixed seed. For example, "Bright orange tennis shoes, realistic lighting e-commerce website" can be varied by modifying the value of the random seed.
- Changing any of these values will result in a different image. You can hold the prompt or seed in place and traverse the latent space by changing the other variable. This method provides a deterministic way to find similar images and vary the images slightly.
- Varying the prompt to "bright blue suede dress shoes, realistic lighting e-commerce website" and holding the seed in place at 3732591490 produces results with similar compositions but matching the desired prompt. And again, holding that prompt in place and traversing the latent space by changing the seed produces different variations:
- To summarize, a good way to structure your prompts is to include the elements “[frame] [main subject] [style type] [modifiers]” or “A [frame type] of a [main subject], [style example]”, plus an optional seed. The order of these phrases can alter your outcome, so if you are looking for a particular result, it is best to experiment with all of these values until you are satisfied. A code sketch of this fixed-seed workflow follows below.
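As a rough illustration of the workflow described above, the snippet below holds either the seed or the prompt fixed using diffusers' generator argument. It assumes the same local Stable Diffusion setup sketched earlier; the second seed value is arbitrary.

```python
# A rough sketch of holding the seed or the prompt fixed with diffusers.
# Assumes diffusers and PyTorch with a CUDA GPU; checkpoint and second seed are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "bright orange tennis shoes, realistic lighting e-commerce website"

# Same model version + same prompt + same seed -> same image.
gen = torch.Generator(device="cuda").manual_seed(3732591490)
shoes_a = pipe(prompt, generator=gen).images[0]

# Hold the seed fixed and vary the prompt to keep a similar composition.
gen = torch.Generator(device="cuda").manual_seed(3732591490)
shoes_b = pipe("bright blue suede dress shoes, realistic lighting e-commerce website",
               generator=gen).images[0]

# Hold the prompt fixed and vary the seed to traverse the latent space.
gen = torch.Generator(device="cuda").manual_seed(12345)
shoes_c = pipe(prompt, generator=gen).images[0]
```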
5. Prompt Length
Generally, prompts should be just as verbose as you need them to be to get the desired result. It is best to start with a simple prompt to experiment with the results returned and then refine your prompts, extending the length as needed.
However, many fine-tuned prompts already exist and can be reused or modified.
Modifiers such as "ultra-realistic," "octane render," and "unreal engine" tend to help refine the quality of images, as you can see in some of the examples below.
- “A female daytrader with glasses in a clean home office at her computer working looking out the window, ultra realistic, concept art, intricate details, serious, highly detailed, photorealistic, octane render, 8 k, unreal engine”
- “portrait photo of a man staring serious eyes with green, purple and pink facepaint, 50mm portrait photography, hard rim lighting photography-beta -ar 2:3 -beta -upbeta”
- “Extremely detailed wide angle photograph, atmospheric, night, reflections, award winning contemporary modern interior design apartment living room, cozy and calm, fabrics and textiles, geometric wood carvings, colorful accents, reflective brass and copper decorations, reading nook, many light sources, lamps, oiled hardwood floors, color sorted book shelves, couch, tv, desk, plants”
- “Hyperrealistic and heavy detailed fashion week runway show in the year 2050, leica sl2 50mm, vivid color, high quality, high textured, real life”
- “Full-body cyberpunk style sculpture of a young handsome colombian prince half android with a chest opening exposing circuitry and electric sparks, glowing pink eyes, crown of blue flowers, flowing salmon-colored silk, fabric, raptors. baroque elements. full-length view. baroque element. intricate artwork by caravaggio. many many birds birds on background. trending on artstation, octane render, cinematic lighting from the right, hyper realism, octane render, 8k, depth of field, 3d”
- “Architectural illustration of an awesome sunny day environment concept art on a cliff, architecture by kengo kuma with village, residential area, mixed development, high - rise made up staircases, balconies, full of clear glass facades, cgsociety, fantastic realism, artstation hq, cinematic, volumetric lighting, vray”
6. Additional Tips
A few additional items are worth mentioning.
Placing the primary subject of the image closer to the beginning of the prompt tends to ensure that the subject is included in the image. For instance, compare the two prompts:
- "A city street with a black velvet couch" at times will miss the intent of the prompt entirely and the resulting image will not include a couch.
- By rearranging the prompt to have the keyword "couch" closer to the beginning of the prompt, the resulting images will almost always contain a couch.
Certain subject and place combinations usually don't work well. For example, "A black velvet couch on the surface of the moon" produces erratic results, with couches sometimes absent entirely and inconsistent backdrops. Nonetheless, a related prompt like "A black velvet couch in a desert" tends to better capture the intent of the prompt, accurately depicting the black hue, the velvet material, and the scene's features. The model is presumably better at producing coherent landscapes for deserts than for the moon because there are likely many more desert photographs in the training set.
![coherent scenes](/_next/image?url=%2Fassets%2Fguides%2Fdiff43.webp&w=2048&q=75)
![coherent scenes 2](/_next/image?url=%2Fassets%2Fguides%2Fdiff44.webp&w=2048&q=75)
The field of prompt engineering is constantly growing, with new techniques being discovered on a daily basis. A new kind of job called "Prompt Engineer" is probably going to arise as more companies realize how effective diffusion models can be in solving their problems.
Diffusion Model Limitations
Despite their great power, diffusion models have a number of drawbacks, some of which we discuss below. Disclaimer: given the rapid pace of development, these limitations are accurate as of October 2022.
- Face Distortion: Diffusion models handle a few faces well, but once a prompt calls for more than about three people, faces become significantly warped. For example, the prompt "a family of six in a conversation at a cafe looking at each other and holding coffee cups, a park in the background across the street leica sl2 50mm, vivid color, high quality, high textured, real life" produces images that are otherwise high quality and highly textured, but in which the faces are significantly deformed.
- Limited Prompt Understanding: The models do not always interpret a prompt the way you intend, and for some images you will need to iterate on the prompt substantially to achieve the desired result. This can make them less effective as productivity tools, although they still increase productivity overall.
Diffusion Models: Additional Capabilities and Tooling
Diffusion models' flexibility gives them more capabilities than just pure image generation.
- Inpainting is an image editing technique that lets users alter specific areas of an image and replace them with content created by the diffusion model. To make sure the generated content fits the context of the original image, the model references nearby pixels. Many programs let you alter images (generated or real) by "erasing," or masking, a certain area of the image and then instructing the model to fill it in with new content. You begin with a generated or real-world image when using inpainting. Here, the image shows a model leaning against a wall in a modern city.
- You then apply a "mask" to the areas of the image you would like to replace, similar to erasing areas of the image. First, we will replace the jacket that the model is wearing.
- Then, we will generate new clothing for the model to wear, in this case a beautiful fancy gown shiny with pink highlights.
- Or a bright purple and pink wool jacket with orange highlights and highly detailed.
- The images above maintain the original pose, but these models can also suggest new poses, in this case, moving the arm down to the side, as seen in this example, "A glowing neon jacket purple and pink"
- Diversity of materials is demonstrated as well with "A leather jacket with gold studs"
- And also with "A shiny translucent coat neon"
- Inpainting can also be used to create new backgrounds that weren't in the original image, or to replace background elements by removing them and filling in the backdrop. First, start by masking the background poles.
- To generate a new background, apply a mask to the background and generate a new scene, in this case "a futuristic cityscape with flying cars."
As you can see, inpainting is a very effective way to quickly modify images and generate new scenes dynamically.
Anticipate even more powerful features from these tools in the future, freeing the user from editing masks. All you will need to do is specify the alteration you want, such as "Replace the background with a futuristic cityscape," and the image will be modified without any mouse clicks or keystrokes.
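For readers running Stable Diffusion locally, here is a minimal sketch of the same mask-and-replace workflow using diffusers' inpainting pipeline. The checkpoint, file names, and prompt are illustrative assumptions; white regions of the mask are regenerated while black regions are preserved.

```python
# A minimal sketch of masked inpainting with diffusers' StableDiffusionInpaintPipeline.
# Assumes diffusers and PyTorch with a CUDA GPU; checkpoint, file names, and prompt are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("model_in_city.png").convert("RGB")  # original (real or generated) image
mask_image = Image.open("jacket_mask.png").convert("RGB")    # white where the jacket should be replaced

result = pipe(
    prompt="a bright purple and pink wool jacket with orange highlights, highly detailed",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("model_new_jacket.png")
```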
- You then apply a "mask" to the areas of the image you would like to replace, similar to erasing areas of the image. First, we will replace the jacket that the model is wearing.
- Outpainting
By adding visual elements in the same style or reimagining a story using natural language description, outpainting allows users to express their creativity beyond the confines of the original image.
You can expand an image, whether it is produced or real-world, beyond its initial bounds to create a more expansive and well-composed scene.
- Real-world image outpainting
- Start by uploading an image and selecting a region on the outside border where you would like to extend the image.
- Similar to inpainting, you then generate prompts that generate coherent scenes that extend the original image.
- Outpainting a generated image works in much the same way. Start by generating a scene that will serve as the seed. Outdoor scenes lend themselves to more expansive outpainting, so we will start with “central park looking at the skyline with people on the grass in impressionist style."
- Now add in a man playing frisbee with his dog.
- It should be noted that the prompt makes no mention of maintaining the original image's skyline consistency. You can concentrate on adding the elements you wish to see, since the model takes care of extending and maintaining the background design automatically.
- Finally, we will add in a family having a picnic and we have our finished image.
Outpainting requires an extra layer of prompt refinement in order to generate coherent scenes, but enables you to quickly create large images that would take significantly longer to create with traditional methods.
Outpainting enables impressive levels of imagination and the ability to construct expansive scenes that stay coherent in theme and style. As with inpainting, this capability can be further enhanced by making it even easier to generate the desired scene: give one prompt and obtain the precise image you want.
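There is no single standard outpainting API, but one common approach is to pad the image onto a larger canvas and reuse an inpainting pipeline to fill in the new border. The sketch below assumes the same diffusers inpainting pipeline as before; the padding size and file names are illustrative.

```python
# One common way to outpaint: pad the image onto a larger canvas and reuse an inpainting
# pipeline to fill the new border. Assumes diffusers and PyTorch with a CUDA GPU;
# checkpoint, padding size, and file names are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def pad_for_outpainting(img, pad=128, fill=(127, 127, 127)):
    """Place the image on a larger canvas and mask the new border for generation."""
    w, h = img.size
    canvas = Image.new("RGB", (w + 2 * pad, h + 2 * pad), fill)
    canvas.paste(img, (pad, pad))
    mask = Image.new("RGB", canvas.size, (255, 255, 255))        # white = generate here
    mask.paste(Image.new("RGB", (w, h), (0, 0, 0)), (pad, pad))  # black = keep original pixels
    return canvas, mask

base = Image.open("central_park_skyline.png").convert("RGB")     # e.g. a 512x512 generated scene
canvas, mask = pad_for_outpainting(base)

extended = pipe(
    prompt="central park looking at the skyline with people on the grass in impressionist style",
    image=canvas,
    mask_image=mask,
    height=canvas.height,  # keep the padded dimensions (should be multiples of 8)
    width=canvas.width,
).images[0]
extended.save("central_park_outpainted.png")
```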
- Diffusion for Video Generation
As we've seen, creating static images is fun and has many real-world uses. A number of recently released models extend diffusion models' capabilities to video generation. These capabilities are not yet broadly available, but they will be very soon.
- Meta's Make-A-Video is a new AI system that lets people turn text prompts into brief, high-quality video clips.
- Google's Imagen Video generates approximately 5 second videos in a similar fashion.
- Diffusion Model Image Curation Sites
Curation websites such as Lexica.art host highly curated collections of generated images along with the prompts used to create them. With millions of images indexed, it is likely that the image you thought you would need to generate already exists and is only a quick search away. Low-latency search returns results nearly instantaneously, so you don't have to wait a minute or two for a diffusion model to generate images. This is excellent for trying new things, looking for specific kinds of pictures, or exploring ideas. Lexica is also an excellent resource for learning how to prompt to obtain the results you want.
Diffusion Models: Practical Applications for Today and Tomorrow
The most apparent use of diffusion models is in design tools, where they enable artists to be even more productive and imaginative. The first of these tools have already been announced, among them Microsoft Designer, which incorporates Dall-E 2 into its toolkit. With generative product designs, fully generated catalogs, alternate-viewpoint generation, and much more, there are many opportunities in the retail and eCommerce market.
Product designers will have access to potent new tools that will boost their imagination and enable them to visualize items in various settings, such as homes and offices. 3D diffusion has advanced to the point that products may now be fully rendered in 3D with the touch of a button. Taking this to the extreme, these 3D renders can then be printed as a 3D model and come to life in the real world.
Advertising will become more effective thanks to the ability to test multiple creatives and generate ad creative dynamically, resulting in significant efficiency gains.
The entertainment industry will incorporate diffusion models into special-effects tooling, allowing for quicker and more affordable productions. This will lead to more imaginative and outrageous entertainment concepts that are currently constrained by expensive production costs. Similarly, the models' near-real-time content generation will enhance augmented- and virtual-reality experiences: users will be able to change their surroundings whenever they choose, using only their voice.
A new generation of tooling is being developed around these models, which will unlock a wide range of capabilities.
Conclusion
Despite their amazing range of capabilities, we still don't fully understand the limits of diffusion models.
The capabilities of foundation models will inevitably grow over time, and progress is happening very quickly. The way humans engage with machines will shift significantly as these models advance. As Roon wrote in his blog post "Text is the Universal Interface," "soon, prompting may not look like 'engineering' at all but a simple dialogue with the machine."
There is much potential to advance business, art, and society; to reap these benefits, however, the technology must be adopted swiftly. Companies that don't take advantage of this new capability risk falling far behind. We envision a time when human creativity and production will be unrestricted, and people will need only a prompt to create anything they can think of. Now is the ideal time to begin this journey, and we hope this guide provides a solid starting point.