Text-to-image generation is a type of generative AI that creates an image from a text description. The goal is to produce an image that accurately reflects the details and nuances of the text. This is challenging because the model must understand both the meaning and the structure of the input while generating visually realistic images. Applications span AI photography, concept art, architectural design, fashion, video games, graphic design, and many other creative fields.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. It is an open-source model that can be hosted through AWS. For smooth real-time interactions with the model, it's important to use accelerated hardware like GPUs or AWS Inferentia2 (Amazon's machine learning inference accelerator).
In this two-part blog series, we discuss how to develop an AI image generator application using Stable Diffusion on AWS.
In this first part, we demonstrate how to use AWS Deep Learning Containers (DLCs) and Inferentia to optimize serving of Stable Diffusion 2.1 base for text2image predictions. We also run a benchmark test to compare the text2image latency and cost of three deployment configurations: the default SageMaker DLC with diffusers, DJL Serving with DeepSpeed, and Inferentia 2.
The Hugging Face diffusers library provides pipelines, a simple way to run state-of-the-art diffusion models in inference. More information on diffusers can be found here.
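As an illustration, text-to-image inference with diffusers might look like the sketch below. It assumes the `diffusers` and `torch` packages, a CUDA GPU, and the `stabilityai/stable-diffusion-2-1-base` checkpoint on the Hugging Face Hub. The helper names `load_pipeline` and `text_to_image` are our own, chosen for illustration, and are not taken from the deployment notebook.

```python
def load_pipeline(model_id="stabilityai/stable-diffusion-2-1-base", device="cuda"):
    """Load a Stable Diffusion pipeline in half precision (assumes a GPU is available)."""
    # Imported lazily so the module can be imported on machines without these packages.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    return pipe.to(device)


def text_to_image(pipe, prompt, steps=50, guidance_scale=7.5):
    """Run one text2image prediction and return the first generated image."""
    # steps and guidance_scale are illustrative values, not tuned settings.
    result = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
    return result.images[0]
```

For example, `text_to_image(load_pipeline(), "An astronaut llama in space").save("llama.png")` would write the generated image to disk, since the pipeline returns PIL images by default.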
SageMaker maintains deep learning containers (DLCs) with popular open source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. More information on libraries supported by SageMaker’s DLCs can be found here.
Below are the steps to deploy Stable Diffusion 2.1 base using diffusers and by extending SageMaker’s DLC:
Prompt:
An astronaut llama in space, space, interplanetary, so real, unreal, amazing lighting, cinematic, intense, detailed (left image)
Detailed instructions for creating a custom ECR image are in the "extending-image-notebook" located here.
The notebook to deploy the Stable Diffusion model can be found here.
SageMaker’s DLCs support libraries that enable model parallelism and inference optimizations, such as DJL-Serving and DeepSpeed Inference.
DJL-Serving is an open-source, high-performance model server powered by DJL. It takes multiple deep learning models or workflows, and makes them available through an HTTP endpoint. Versions 0.19 and above are supported by SageMaker and work with Amazon EC2 instances with multiple GPUs to facilitate large model inference (LMI) with model parallelism.
DeepSpeed Inference is an open-source inference optimization library. It includes model partitioning schemes for model parallelism with supported models, including many transformer models. It also has optimized kernels for popular models such as OPT, GPT, and BLOOM that can significantly improve inference latency. The version of DeepSpeed in the LMI DLCs is optimized and tested to work on SageMaker. It includes several enhancements, including support for BF16 precision models.
Below are the steps to deploy Stable Diffusion 2.1 base using DJL and DeepSpeed:
| model = DJLModel( |
| predictor = model.deploy( |
Code to deploy this solution can be found here.
AWS Inferentia 2 is the latest chip in the Inferentia series, succeeding Inferentia 1, which was introduced in 2019. Amazon EC2 Inf1 instances, powered by Inferentia 1, delivered a 25% increase in throughput and a 70% reduction in cost compared to equivalent G5 instances using the NVIDIA A10G GPU.
The Inferentia 2 chip offers a 4x increase in throughput and a 10x reduction in latency compared to Inferentia 1. Correspondingly, the newly launched Amazon EC2 Inf2 instances provide up to 2.6x better throughput, an 8.1x decrease in latency, and a 50% increase in performance per watt when compared to similar G5 instances. High throughput keeps the cost of inference down, while low inference latency keeps your applications responsive.
Inf2 instances are available in several sizes, each with between 1 and 12 Inferentia 2 chips. When multiple chips are present, a direct high-speed Inferentia 2 to Inferentia 2 interconnect enables distributed inference on large models. For instance, the largest Inf2 instance size, inf2.48xlarge, has 12 chips and enough memory to hold a 175-billion parameter model such as GPT-3 or BLOOM. In this post, we use inf2.xlarge with Stable Diffusion 2.1 base.
Deployment steps:
To deploy Stable Diffusion 2.1 base using Inferentia 2, we need to perform two key steps. First, we compile the model to run on Inf2, using an AWS Trainium (trn1) instance for the compilation. Then, we use a custom inference script written for Inferentia 2 to run the compiled model.
Below are the detailed steps to accomplish this:
Code to deploy this solution can be found here.
Lastly, we conducted a benchmark test by processing 100 prompts to compare the latency and cost for text2image of the three configurations below.
| Configuration | Average latency (sec/img) | Instance type | Instance cost per hour |
|---|---|---|---|
| Stable Diffusion 2.1 base (Default) | 3.91 | ml.g5.xlarge | $1.4084 |
| Stable Diffusion 2.1 base (DJL Serving to host model using DeepSpeed) | 2.55 | ml.g5.xlarge | $1.4084 |
| Stable Diffusion 2.1 base (Inferentia 2) | 2.36 | ml.inf2.xlarge | $0.99 |
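One way to obtain an average latency like those reported above is to time each request and take the mean. The following is a minimal sketch under that assumption rather than the exact benchmark script; `invoke` is a hypothetical callable that sends one prompt to an endpoint and blocks until the image is returned.

```python
import statistics
import time
from typing import Callable, Iterable


def average_latency(invoke: Callable[[str], object], prompts: Iterable[str]) -> float:
    """Return the mean wall-clock seconds per call of `invoke` over `prompts`."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        invoke(prompt)  # hypothetical: one blocking text2image request
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```

Calling `average_latency(invoke, prompts)` with a list of 100 prompts would yield a seconds-per-image figure comparable to the latency column.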
The Inferentia 2 configuration achieved an average latency of 2.36 seconds per image. That is about 40% lower than the default configuration (3.91 seconds per image) and about 7% lower than the DJL+DeepSpeed configuration (2.55 seconds per image).
In terms of cost-efficiency, the Inferentia 2 configuration cost $649.00 per 1 million images processed. That is 57.54% less than the default configuration ($1,529.68 per 1 million images) and 34.91% less than the DJL+DeepSpeed configuration ($997.62 per 1 million images).
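The per-million-image costs can be reproduced from the average latency and the hourly instance price. The sketch below assumes that images are generated one at a time on a fully utilized instance, so that cost per image is latency multiplied by the price per second. This assumption is ours, but it matches the three dollar figures quoted above.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_images(latency_sec: float, hourly_price_usd: float) -> float:
    """Cost in USD to generate one million images sequentially on one instance."""
    cost_per_image = latency_sec * hourly_price_usd / SECONDS_PER_HOUR
    return round(cost_per_image * 1_000_000, 2)


# (average latency in sec/img, instance cost per hour in USD) from the benchmark table
configs = {
    "Default": (3.91, 1.4084),
    "DJL+DeepSpeed": (2.55, 1.4084),
    "Inferentia 2": (2.36, 0.99),
}

for name, (latency, price) in configs.items():
    print(f"{name}: ${cost_per_million_images(latency, price):,.2f} per 1M images")
```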
| Configurations | | Default | DJL+DeepSpeed | Inferentia 2 |
|---|---|---|---|---|
| | Latency | 3.91s | 2.55s | 2.36s |
| Default | 3.91s | 0.00% | -34.78% | -39.64% |
| DJL+DeepSpeed | 2.55s | 53.33% | 0.00% | -7.45% |
| Inferentia 2 | 2.36s | 65.68% | 8.05% | 0.00% |
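Each cell in the latency comparison is the relative change of the column configuration's latency with respect to the row configuration's latency. The following short sketch, with the function name `relative_change` chosen here for illustration, recomputes those percentages from the three measured latencies.

```python
# Average latency in seconds per image, from the benchmark table
latency = {"Default": 3.91, "DJL+DeepSpeed": 2.55, "Inferentia 2": 2.36}


def relative_change(baseline: float, other: float) -> float:
    """Percentage change of `other` relative to `baseline` (negative means lower)."""
    return (other - baseline) / baseline * 100


for row_name, baseline in latency.items():
    cells = [f"{relative_change(baseline, other):.2f}%" for other in latency.values()]
    print(row_name, *cells, sep=" | ")
```

A negative value means the column configuration is faster than the row configuration. For example, the Inferentia 2 column of the Default row comes out at about -39.64%.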
| Configurations | | Default | DJL+DeepSpeed | Inferentia 2 |
|---|---|---|---|---|
| | Cost | $1,529.68 | $997.62 | $649.00 |
| Default | $1,529.68 | 0.00% | -34.79% | -57.54% |
| DJL+DeepSpeed | $997.62 | 53.27% | 0.00% | -34.91% |
| Inferentia 2 | $649.00 | 135.54% | 53.71% | 0.00% |
In this post, we discussed how you can deploy Stable Diffusion 2.1 base using SageMaker DLCs, DJL with DeepSpeed, and Inferentia 2. Our benchmark also showed that the Inferentia 2 configuration delivered both the lowest latency and the lowest cost per image of the three, making it an economical choice for serving Stable Diffusion.
In the next blog, How to create an AI image generator application using Stable Diffusion - Part 2/2, you will learn how to expand our Stable Diffusion model to perform image2image. In addition, we will learn how to use vector databases to enable image/prompt recommendations. Finally, we will create and deploy our AI image generator application via Streamlit.
Check out our open source Git repository for more materials. Also, contact us to learn how to deploy and optimize your favorite generative AI model.