How to create an AI image generator application using Stable Diffusion (part 2)

Written by Michael Yang | Nov 7, 2023 5:00:00 AM

Use case overview

In Part 1 of our series, we explored Stable Diffusion, an AI model that transforms text into highly realistic images. We also demonstrated how to optimize its performance using AWS tools, making text-to-image predictions faster and more efficient.

This post expands on the text-to-image feature by adding image-to-image generation and the ability to reuse seeds for consistent results. We also introduce an image search engine powered by a vector database and combine these features into a user-friendly application built with Streamlit. This application will enable users to generate images based on either text descriptions or initial images, provide a gallery of previously generated images, and even perform image searches using textual queries.

Architectural Diagram

The AI image generator has three main features, each color coded in the diagram:

Image generation via text prompt or with initial image (Blue)
Image search / prompt recommendation (Red)
Image history search by user session (Yellow)

The user provides a session ID, which creates a new folder in the project S3 bucket where all generated images are stored.
The user enters a prompt to generate one or more images and can optionally supply an initial image or a seed value. The request invokes a SageMaker real-time endpoint that hosts Stable Diffusion 2.1 base. Each generated image is stored in the S3 bucket. The model also performs text2vector, storing the prompt as an embedding in the vector database (Pinecone) with the image's S3 location as metadata.
The generated image and its seed value are returned to the user.
The user searches for images by providing a prompt.
A similarity search is performed between the search prompt and the image prompts in the vector database.
The matching images and their prompts are returned to the user, sorted by similarity score in descending order.
The user can also view the images generated in their own session.
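The session folder and history steps above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the key layout (`<session_id>/<image_id>.png`) and the helper names `image_key` and `session_history` are hypothetical and are not taken from the application code. A real implementation would list objects in S3 by prefix rather than filter an in-memory list.

```python
import uuid
from typing import List, Optional


def image_key(session_id: str, image_id: Optional[str] = None) -> str:
    """Build an object key that places an image under its session's folder.

    The key layout is an assumption made for this sketch.
    """
    image_id = image_id or uuid.uuid4().hex
    return f"{session_id}/{image_id}.png"


def session_history(all_keys: List[str], session_id: str) -> List[str]:
    """Return only the keys stored under the given session's prefix."""
    prefix = f"{session_id}/"  # trailing slash avoids matching "session-ab"
    return [key for key in all_keys if key.startswith(prefix)]


keys = [
    image_key("session-a", "img1"),
    image_key("session-a", "img2"),
    image_key("session-b", "img3"),
]
print(session_history(keys, "session-a"))
# ['session-a/img1.png', 'session-a/img2.png']
```

Keeping the session ID as the key prefix means the history view only needs a prefix lookup and no separate index.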

Image2Image text-guided generation

The StableDiffusionImg2ImgPipeline lets you pass a text prompt and an initial image to condition the generation of new images.

Initial image (Left)
“A fantasy landscape, trending on artstation” (Right, new image)

Below is a code sample to implement image2image:

import requests
import torch
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load the pipeline in half precision and move it to the GPU
device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to(device)

# Download an initial image and resize it
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

# strength controls how far the result may depart from the initial image;
# guidance_scale controls how closely it follows the prompt
prompt = "A fantasy landscape, trending on artstation"
images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images
images[0].save("fantasy_landscape.png")

Source (https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
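To give a feel for the strength parameter used above, here is a small sketch of how it relates to the number of denoising steps. As we understand the diffusers implementation, the img2img pipeline noises the initial image and then runs about int(num_inference_steps * strength) of the scheduled steps, so a lower strength stays closer to the initial image. The helper name `effective_steps` is ours, and the step count of 50 is the pipeline default as far as we know.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate how many denoising steps img2img runs for a given strength.

    Mirrors our reading of the diffusers logic; treat it as an illustration,
    not a specification.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.5, 0.75, 1.0):
    print(strength, effective_steps(50, strength))
# 0.5 25
# 0.75 37
# 1.0 50
```

With strength=1.0 the initial image is almost entirely replaced by noise, so the output depends mostly on the prompt.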

Reusing seeds and latents

To reproduce a result you like, or to refine it by adjusting the prompt, you need control over the latents the pipeline starts from. If we want to reuse seeds for reproducibility, we have to generate the latents ourselves. Otherwise, the pipeline generates them internally and we have no way to replicate them.

Latents are the initial random noise tensors that the model denoises into images. Because each set of latents is drawn from a random generator with a known seed, we can save the seed and reuse it to consistently recreate a specific image.
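The idea that a seed fully determines the starting noise can be shown without a GPU. The sketch below is a dependency-free analogue that uses Python's standard random module instead of torch.Generator. The function name `make_noise` is ours and is not part of the pipeline.

```python
import random
from typing import List


def make_noise(seed: int, size: int) -> List[float]:
    """Draw `size` Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


seed = 1234
first = make_noise(seed, 4)
second = make_noise(seed, 4)      # same seed: identical noise
other = make_noise(seed + 1, 4)   # different seed: different noise

print(first == second)  # True
print(first == other)
```

Storing the seed is therefore enough to regenerate the same starting noise later, as long as the generator, shape, and device stay the same.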

Prompt:

“Puppy in grass,flowers,poppies” (Left, original seed)
Negative prompt: “labrador” (Right, generated from seed)

The sample code below generates seeds and uses them for inference:

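import torch

# Assumes `pipe` is a Stable Diffusion text-to-image pipeline already loaded on `device`.
# The values below are chosen to match the latent shape noted further down.
device = "cuda"
num_images = 4
height, width = 512, 512

generator = torch.Generator(device=device)

latents = None
seeds = []
for _ in range(num_images):
    # Get a new random seed, store it and use it as the generator state
    seed = generator.seed()
    seeds.append(seed)
    generator = generator.manual_seed(seed)

    # One latent per image, at 1/8 of the output resolution
    # (recent diffusers versions expose in_channels through unet.config)
    image_latents = torch.randn(
        (1, pipe.unet.config.in_channels, height // 8, width // 8),
        generator=generator,
        device=device,
    )
    latents = image_latents if latents is None else torch.cat((latents, image_latents))

# latents should have shape (4, 4, 64, 64) in this case
latents.shape

prompt = "Labrador in the style of Vermeer"

with torch.autocast("cuda"):
    images = pipe(
        [prompt] * num_images,
        guidance_scale=7.5,
        latents=latents,
    ).images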

Source (https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb)
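The latent shape noted in the sample follows from the image size. Stable Diffusion works on latents with 4 channels at one eighth of the output resolution, which is why the sample divides height and width by 8. A small helper makes the arithmetic explicit. The name `latent_shape` is ours.

```python
from typing import Tuple


def latent_shape(
    num_images: int,
    height: int,
    width: int,
    channels: int = 4,
    downscale: int = 8,
) -> Tuple[int, int, int, int]:
    """Return the (batch, channels, height, width) shape of the latents."""
    if height % downscale or width % downscale:
        raise ValueError("height and width must be multiples of the downscale factor")
    return (num_images, channels, height // downscale, width // downscale)


print(latent_shape(4, 512, 512))  # (4, 4, 64, 64)
print(latent_shape(1, 512, 768))  # (1, 4, 64, 96)
```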

To enable our model to perform both image2image and reuse of seeds, we modified our inference code in our model artifact. We have shared the modified inference code here.
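As a rough idea of what such a change involves, the sketch below shows one way a request could carry an optional seed and initial image, and how a handler could choose between the two modes. It is not the shared inference code: the field names (`prompt`, `negative_prompt`, `seed`, `num_images`, `init_image`) and the helpers `build_payload` and `select_mode` are hypothetical.

```python
import base64
import json
from typing import Optional


def build_payload(
    prompt: str,
    negative_prompt: str = "",
    seed: Optional[int] = None,
    num_images: int = 1,
    init_image_bytes: Optional[bytes] = None,
) -> str:
    """Serialize a generation request to JSON (hypothetical field names)."""
    payload = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_images": num_images,
    }
    if seed is not None:
        payload["seed"] = seed  # reuse a stored seed for reproducibility
    if init_image_bytes is not None:
        # Binary image data must be text-encoded to travel inside JSON
        payload["init_image"] = base64.b64encode(init_image_bytes).decode("ascii")
    return json.dumps(payload)


def select_mode(payload_json: str) -> str:
    """Pick the generation mode from the presence of an initial image."""
    payload = json.loads(payload_json)
    return "image2image" if "init_image" in payload else "text2image"


print(select_mode(build_payload("Puppy in grass", seed=42)))
# text2image
print(select_mode(build_payload("A fantasy landscape", init_image_bytes=b"fake-image-bytes")))
# image2image
```

Routing on the presence of the image field keeps a single endpoint for both modes, so the client does not need a separate mode flag.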

Vector database

A vector database stores and searches for information using numerical data patterns called vectors. For example, when a user enters a prompt like 'Sheep grazing,' the database finds similar images by comparing patterns in stored data.

Source (www.pinecone.io)

Let's consider an example to better understand the process. Suppose our initial prompt is "Sheep grazing." We construct a prompt vector based on this prompt and proceed to search for similar items within the vector database. As a result of this search, we discover multiple images that already align with the given description.

To store and retrieve generated images efficiently, we keep them in Amazon S3. When inserting a prompt vector into the vector database, we attach the image location (S3 URI) and the plaintext prompt as metadata. The prompt metadata enables a second feature: prompt recommendations.

Prompt recommendations help users who are struggling to come up with prompt ideas. The prompts behind similar existing images can serve as inspiration or as a starting point for their own.
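The search and metadata lookup can be illustrated with a toy in-memory index. This sketch does not use Pinecone. The three-dimensional vectors, the bucket name, and the helper names are made up for illustration, and real prompt embeddings have hundreds of dimensions. Cosine similarity is one common choice of similarity measure and is an assumption here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each record pairs a prompt vector with metadata: the prompt text and image location
index = [
    {"vector": [0.9, 0.1, 0.0],
     "metadata": {"prompt": "Sheep grazing in a meadow",
                  "s3_uri": "s3://example-bucket/session-a/img1.png"}},
    {"vector": [0.7, 0.7, 0.0],
     "metadata": {"prompt": "Cows and sheep on a farm",
                  "s3_uri": "s3://example-bucket/session-a/img2.png"}},
    {"vector": [0.0, 0.1, 0.9],
     "metadata": {"prompt": "City skyline at night",
                  "s3_uri": "s3://example-bucket/session-b/img3.png"}},
]


def search(query: List[float], records: List[Dict], top_k: int = 2) -> List[Tuple[float, Dict]]:
    """Return the top_k (score, metadata) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, r["vector"]), r["metadata"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedding of "Sheep grazing"
results = search(query_vector, index)
for score, meta in results:
    print(f"{score:.2f}  {meta['prompt']}  {meta['s3_uri']}")
```

The same result list serves both features: the S3 URIs drive the image search, and the stored prompts can be shown as prompt recommendations.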

Prompt/image recommendations based on prompt

Stable Diffusion Playground

We built a user-friendly interface for Stable Diffusion 2.1 using Streamlit, with three key tabs:

Search: Find relevant images by entering a prompt.
Generate: Create new images using text prompts or initial images, with customizable options like seed values.
History: View and revisit all images generated during your session.

These features make it easy for users to explore and manage their creations. We have shared the code to the entire application here.

In the "Search" tab, users search for images by entering a prompt, and the application returns the images whose prompts are most similar to it.

“Search” tab Stable Diffusion Playground

In the "Generate" tab, users provide a prompt plus optional parameters such as a negative prompt, a seed value, and the number of output images to perform text-to-image generation. Alternatively, they can supply an initial image together with a prompt for image-to-image generation.

“Generate” tab Stable Diffusion Playground

In the final tab, "History," users can search and retrieve the images generated earlier in their session.

“History” tab Stable Diffusion Playground

Conclusion

In this second part of the blog series, we expanded the capabilities of the text2image Stable Diffusion 2.1 base model from the first part. In addition to text-to-image generation, we introduced image2image functionality, which lets users start from an existing image.

We also implemented seed reuse so that users can reproduce specific image outputs consistently, and we integrated a vector database that powers the image and prompt search and the recommendation features. Users can search for images or prompts that match their needs and get recommendations to inspire new prompts.

Check out our open source Git repository for more open source materials. Also, contact us to learn how to economically productionize your generative AI model at scale.
