{{Short description|Image-generating machine learning model}}
{{Use mdy dates|date=October 2023}}
{{Infobox software
| name = Stable Diffusion
| logo =
| logo caption =
| screenshot = Astronaut Riding a Horse (SD3.5).webp
| screenshot size = 250px
| caption = An image generated with Stable Diffusion 3.5 based on the text prompt <code>a photograph of an astronaut riding a horse</code>
| author = Runway, CompVis, and Stability AI
| developer = ]
| released = August 22, 2022
| latest release version = SD 3.5 (model)<ref name="release-version">{{cite web|url=https://stability.ai/news/introducing-stable-diffusion-3-5|title=Stable Diffusion 3.5|website=]|access-date=October 23, 2024|archive-date=October 23, 2024|archive-url=https://archive.today/20241023040750/https://stability.ai/news/introducing-stable-diffusion-3-5|url-status=live}}</ref>
| latest release date = October 22, 2024
| repo = {{url|https://github.com/Stability-AI/stablediffusion}}
| programming language = ]<ref>{{cite web | author1 = Ryan O'Connor | title = How to Run Stable Diffusion Locally to Generate Images | url = https://www.assemblyai.com/blog/how-to-run-stable-diffusion-locally-to-generate-images/ | access-date = May 4, 2023 | date = August 23, 2022 | archive-date = October 13, 2023 | archive-url = https://web.archive.org/web/20231013123717/https://www.assemblyai.com/blog/how-to-run-stable-diffusion-locally-to-generate-images/ | url-status = live }}</ref>
| operating system =
| genre = ]
| license = Stability AI Community License
| website = {{url|https://stability.ai/stable-image}}
}}
'''Stable Diffusion''' is a ], ] released in 2022 based on ] techniques. The ] technology is the premier product of ] and is considered to be a part of the ongoing ]. | |||
It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as ], outpainting, and generating image-to-image translations guided by a ].<ref name=":0">{{Cite web|title=Diffuse The Rest - a Hugging Face Space by huggingface|url=https://huggingface.co/spaces/huggingface/diffuse-the-rest|access-date=2022-09-05|website=huggingface.co|archive-date=2022-09-05|archive-url=https://web.archive.org/web/20220905141431/https://huggingface.co/spaces/huggingface/diffuse-the-rest|url-status=live }}</ref> Its development involved researchers from the CompVis Group at ] and ] with a computational donation from Stability and training data from non-profit organizations.<ref name="sifted_financialtimes">{{cite web|title=Leaked deck raises questions over Stability AI's Series A pitch to investors|url=https://sifted.eu/articles/stability-ai-fundraise-leak|access-date=2023-06-20|website=sifted.eu|archive-date=June 29, 2023|archive-url=https://web.archive.org/web/20230629201917/https://sifted.eu/articles/stability-ai-fundraise-leak|url-status=live}}</ref><ref name="lmu_lauch">{{cite web|title=Revolutionizing image generation by AI: Turning text into images|url=https://www.lmu.de/en/newsroom/news-overview/news/revolutionizing-image-generation-by-ai-turning-text-into-images.html|access-date=2023-06-21|website=www.lmu.de|archive-date=September 17, 2022|archive-url=https://web.archive.org/web/20220917200820/https://www.lmu.de/en/newsroom/news-overview/news/revolutionizing-image-generation-by-ai-turning-text-into-images.html|url-status=live}}</ref><ref>{{Cite web|last=Mostaque|first=Emad|date=November 2, 2022|title=Stable Diffusion came from the Machine Vision & Learning research group (CompVis) @LMU_Muenchen|url=https://twitter.com/EMostaque/status/1587844074064822274?lang=en|access-date=2023-06-22|website=Twitter|language=en|archive-date=July 20, 2023|archive-url=https://web.archive.org/web/20230720002303/https://twitter.com/EMostaque/status/1587844074064822274?lang=en|url-status=live}}</ref><ref name="stable-diffusion-launch" /> | |||
Stable Diffusion is a ], a kind of deep generative artificial ]. Its code and model weights have been released ],<ref name="stable-diffusion-github"/> and it can run on most consumer hardware equipped with a modest ] with at least 4 GB ]. This marked a departure from previous proprietary text-to-image models such as ] and ] which were accessible only via ]s.<ref name="pcworld">{{cite web|title=The new killer app: Creating AI art will absolutely crush your PC|url=https://www.pcworld.com/article/916785/creating-ai-art-local-pc-stable-diffusion.html|access-date=2022-08-31|website=PCWorld|archive-date=2022-08-31|archive-url=https://web.archive.org/web/20220831065139/https://www.pcworld.com/article/916785/creating-ai-art-local-pc-stable-diffusion.html|url-status=live }}</ref><ref name="verge"/> | |||
==Development== | |||
Stable Diffusion originated from a project called '''Latent Diffusion''',<ref name=":9">{{cite web | url=https://github.com/CompVis/latent-diffusion | title=CompVis/Latent-diffusion | website=] }}</ref> developed in Germany by researchers at ] in ] and ]. Four of the original 5 authors (Robin Rombach, Andreas Blattmann, Patrick Esser and Dominik Lorenz) later joined Stability AI and released subsequent versions of Stable Diffusion.<ref>{{cite web | url=https://stability.ai/news/stable-diffusion-3-research-paper | title=Stable Diffusion 3: Research Paper }}</ref> | |||
The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich.<ref name=verge/> Development was led by Patrick Esser of ] and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion.<ref name="stable-diffusion-launch"/> Stability AI also credited ] and ] (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project.<ref name="stable-diffusion-launch"/> | |||
==Technology==
]
] process used by Stable Diffusion. The model generates images by iteratively denoising ] until a configured number of steps have been reached, guided by the CLIP text encoder pretrained on ] along with the attention mechanism, resulting in the desired image depicting a representation of the trained concept.]]
===Architecture===
{{Main|Latent diffusion model}}
Models in Stable Diffusion series before SD 3 all used a kind of ] (DM), called a ], developed by the CompVis (Computer Vision & Learning)<ref>{{Cite web |title=Home |url=https://ommer-lab.com/ |access-date=2024-09-05 |website=Computer Vision & Learning Group |language=en-US}}</ref> group at ].<ref name="paper" /><ref name="stable-diffusion-github">{{cite web|title=Stable Diffusion Repository on GitHub|url=https://github.com/CompVis/stable-diffusion|publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich|access-date=17 September 2022|date=17 September 2022|archive-date=January 18, 2023|archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion|url-status=live}}</ref> Introduced in 2015, diffusion models are trained with the objective of removing successive applications of ] on training images, which can be thought of as a sequence of ]s. Stable Diffusion consists of 3 parts: the ] (VAE), ], and an optional text encoder.<ref name=":02">{{Cite web|last=Alammar|first=Jay|title=The Illustrated Stable Diffusion|url=https://jalammar.github.io/illustrated-stable-diffusion/|access-date=2022-10-31|website=jalammar.github.io|archive-date=November 1, 2022|archive-url=https://web.archive.org/web/20221101104342/https://jalammar.github.io/illustrated-stable-diffusion/|url-status=live}}</ref> The VAE encoder compresses the image from pixel space to a smaller dimensional ], capturing a more fundamental semantic meaning of the image.<ref name=paper/> Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion.<ref name=":02" /> The U-Net block, composed of a ] backbone, ] the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.<ref name=":02" /> | |||
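In the standard diffusion-model formulation summarized above (shown here as the generic published equations rather than anything specific to Stable Diffusion's code), the forward process corrupts a latent sample <math>x_0</math> with Gaussian noise according to a variance schedule <math>\beta_t</math>:
:<math>q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\right), \qquad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s).</math>
The denoising network <math>\epsilon_\theta</math> is trained to predict the noise that was added, typically by minimizing <math>\mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\right]</math>; in a latent diffusion model this takes place in the VAE's latent space rather than in pixel space.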
The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to denoising U-Nets via a ].<ref name=":02" /> For conditioning on text, the fixed, pretrained ] ViT-L/14 text encoder is used to transform text prompts to an embedding space.<ref name="stable-diffusion-github" /> Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.<ref name="stable-diffusion-launch"/><ref name=paper/> | |||
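The three components described above can be inspected directly when a checkpoint is loaded with the third-party Hugging Face ''diffusers'' library; the following Python sketch is illustrative only and assumes that library and the example model identifier are available:

<syntaxhighlight lang="python">
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint; "runwayml/stable-diffusion-v1-5" is used
# here purely as an example model identifier.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# The pipeline bundles the parts described above.
print(type(pipe.vae).__name__)           # variational autoencoder (encoder + decoder)
print(type(pipe.unet).__name__)          # U-Net denoiser operating in latent space
print(type(pipe.text_encoder).__name__)  # CLIP text encoder used for conditioning

# Rough parameter counts (about 860M for the U-Net and 123M for the text encoder in SD 1.x).
print(sum(p.numel() for p in pipe.unet.parameters()))
print(sum(p.numel() for p in pipe.text_encoder.parameters()))
</syntaxhighlight>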
The name ''diffusion'' takes inspiration from the ] ] and an important link was made between this purely physical field and deep learning in 2015.<ref>{{cite book |last1=David |first1=Foster |title=Generative Deep Learning |publisher=O'Reilly |edition=2 |chapter=8. Diffusion Models}}</ref><ref>{{cite arXiv |author1=Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli |title=Deep Unsupervised Learning using Nonequilibrium Thermodynamics |date=12 March 2015 |class=cs.LG |eprint=1503.03585 }}</ref> | |||
With 860{{spaces}}million parameters in the U-Net and 123{{spaces}}million in the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards, and unlike other diffusion models, it can run on ] GPUs,<ref>{{Cite web|url=https://huggingface.co/docs/diffusers/v0.5.1/en/api/pipelines/stable_diffusion|title=Stable diffusion pipelines|website=huggingface.co|access-date=June 22, 2023|archive-date=June 25, 2023|archive-url=https://web.archive.org/web/20230625030241/https://huggingface.co/docs/diffusers/v0.5.1/en/api/pipelines/stable_diffusion|url-status=live}}</ref> and even ]-only if using the ] version of Stable Diffusion.<ref>{{cite web |url=https://docs.openvino.ai/2023.3/notebooks/225-stable-diffusion-text-to-image-with-output.html |title=Text-to-Image Generation with Stable Diffusion and OpenVINO™ |author=<!--Not stated--> |website=openvino.ai |publisher=] |access-date=February 10, 2024}}</ref> | |||
==== SD XL ==== | |||
The XL version uses the same LDM architecture as previous versions,<ref name=":4" /> but at a larger scale: a larger UNet backbone, a larger cross-attention context, two text encoders instead of one, and training on multiple aspect ratios (not just the square aspect ratio used by previous versions).
The SD XL Refiner, released at the same time, has the same architecture as SD XL, but it was trained for adding fine details to preexisting images via text-conditional img2img. | |||
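A minimal sketch of the two-stage base-plus-refiner workflow, assuming the third-party Hugging Face ''diffusers'' library and the publicly hosted SDXL checkpoints (the identifiers shown are examples, and the exact hand-off options may differ between library versions):

<syntaxhighlight lang="python">
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"

# The base model produces latents, which the refiner then polishes via
# text-conditional img2img, adding fine detail.
latents = base(prompt=prompt, output_type="latent").images
image = refiner(prompt=prompt, image=latents).images[0]
image.save("astronaut_sdxl.png")
</syntaxhighlight>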
==== SD 3.0 ==== | |||
{{Main|Diffusion model#Rectified flow}} | |||
The 3.0 version<ref name=":6">{{Citation |last1=Esser |first1=Patrick |title=Scaling Rectified Flow Transformers for High-Resolution Image Synthesis |date=2024-03-05 |arxiv=2403.03206 |last2=Kulal |first2=Sumith |last3=Blattmann |first3=Andreas |last4=Entezari |first4=Rahim |last5=Müller |first5=Jonas |last6=Saini |first6=Harry |last7=Levi |first7=Yam |last8=Lorenz |first8=Dominik |last9=Sauer |first9=Axel}}</ref> completely changes the backbone: instead of a UNet, it uses a ''Rectified Flow Transformer'', which implements the rectified flow method<ref name=":7">{{Citation |last1=Liu |first1=Xingchao |title=Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow |date=2022-09-07 |arxiv=2209.03003 |last2=Gong |first2=Chengyue |last3=Liu |first3=Qiang}}</ref><ref name=":8">{{Cite web |title=Rectified Flow — Rectified Flow |url=https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html |access-date=2024-03-06 |website=www.cs.utexas.edu}}</ref> with a ].
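In the generic rectified flow formulation cited above (a summary of the published method rather than Stable Diffusion 3's exact training code), a noise sample <math>x_0</math> and a data sample <math>x_1</math> are joined by straight-line interpolation, and the network <math>v_\theta</math> regresses the constant velocity along that line:
:<math>x_t = (1-t)\,x_0 + t\,x_1, \qquad \min_\theta\ \mathbb{E}_{t,\,x_0,\,x_1}\left[\big\lVert (x_1 - x_0) - v_\theta(x_t, t) \big\rVert^2\right].</math>
Sampling then integrates the learned ordinary differential equation <math>\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t</math> from noise at <math>t=0</math> to data at <math>t=1</math>.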
The Transformer architecture used for SD 3.0 has three "tracks", for original text encoding, transformed text encoding, and image encoding (in latent space). The transformed text encoding and image encoding are mixed during each transformer block. | |||
The architecture is named "multimodal diffusion transformer" (MMDiT), where "multimodal" means that it mixes text and image encodings inside its operations. This differs from previous versions of DiT, where the text encoding affects the image encoding, but not vice versa.
=== Training data ===
Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from ] data scraped from the web, where 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, a predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality).<ref name="Waxy">{{Cite web|last=Baio|first=Andy|date=2022-08-30|title=Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator|url=https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/|access-date=2022-11-02|website=Waxy.org|language=en-US|archive-date=January 20, 2023|archive-url=https://web.archive.org/web/20230120124332/https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/|url-status=live}}</ref> The dataset was created by ], a German non-profit which receives funding from Stability AI.<ref name="Waxy" /><ref>{{Cite web|title=This artist is dominating AI-generated art. And he's not happy about it.|url=https://www.technologyreview.com/2022/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/|access-date=2022-11-02|website=MIT Technology Review|language=en|archive-date=January 14, 2023|archive-url=https://web.archive.org/web/20230114125952/https://www.technologyreview.com/2022/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/|url-status=live}}</ref> The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+.<ref name="Waxy" /> A third-party analysis of the model's training data identified that out of a smaller subset of 12 million images taken from the original wider dataset used, approximately 47% of the sample size of images came from 100 different domains, with ] taking up 8.5% of the subset, followed by websites such as ], ], ], ] and ].{{citation needed|date=October 2023}} An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data.<ref name=":2">{{Cite web |last1=Brunner |first1=Katharina |last2=Harlan |first2=Elisa |date=2023-07-07 |title=We Are All Raw Material for AI |url=https://interaktiv.br.de/ki-trainingsdaten/en/index.html |archive-url=https://web.archive.org/web/20230912092308/https://interaktiv.br.de/ki-trainingsdaten/en/index.html |publisher=Bayerischer Rundfunk (BR) |access-date=September 12, 2023 |archive-date=September 12, 2023 |url-status=live }}</ref> | ||
=== Training procedures ===
The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them.<ref>{{Citation|last=Schuhmann|first=Christoph|title=CLIP+MLP Aesthetic Score Predictor|date=2022-11-02|url=https://github.com/christophschuhmann/improved-aesthetic-predictor|access-date=2022-11-02|archive-date=June 8, 2023|archive-url=https://web.archive.org/web/20230608005334/http://github.com/christophschuhmann/improved-aesthetic-predictor/|url-status=live}}</ref><ref name="Waxy" /><ref name="LAION-Aesthetics">{{Cite web|title=LAION-Aesthetics {{!}} LAION|url=https://laion.ai/blog/laion-aesthetics|access-date=2022-09-02|website=laion.ai|language=en|archive-date=2022-08-26|archive-url=https://web.archive.org/web/20220826121216/https://laion.ai/blog/laion-aesthetics/|url-status=live }}</ref> The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a ] with greater than 80% probability.<ref name="Waxy" /> Final rounds of training additionally dropped 10% of text conditioning to improve Classifier-Free Diffusion Guidance.<ref name=":5">{{cite arXiv|last1=Ho|first1=Jonathan|last2=Salimans|first2=Tim|date=2022-07-25|title=Classifier-Free Diffusion Guidance|class=cs.LG|eprint=2207.12598 }}</ref> | ||
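Dropping the text conditioning for a fraction of training steps is what enables classifier-free guidance at sampling time: the same network produces both a conditional and an unconditional noise estimate, which are combined as in the cited Ho and Salimans formulation:
:<math>\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right),</math>
where <math>c</math> is the text conditioning, <math>\varnothing</math> denotes the empty (dropped) conditioning, and <math>s</math> is the guidance scale exposed to users; <math>s = 1</math> recovers the purely conditional estimate, while larger values push outputs closer to the prompt.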
The model was trained using 256 ] GPUs on ] for a total of 150,000 GPU-hours, at a cost of $600,000.<ref>{{Cite web|last=Mostaque|first=Emad|date=August 28, 2022|title=Cost of construction|url=https://twitter.com/emostaque/status/1563870674111832066|access-date=2022-09-06|website=Twitter|language=en|archive-date=2022-09-06|archive-url=https://web.archive.org/web/20220906155426/https://twitter.com/EMostaque/status/1563870674111832066|url-status=live }}</ref><ref name="stable-diffusion-model-card-1-4"/><ref>{{Cite web|last=Wiggers|first=Kyle|date=2022-08-12|title=A startup wants to democratize the tech behind DALL-E 2, consequences be damned|url=https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/|access-date=2022-11-02|website=TechCrunch|language=en-US|archive-date=January 19, 2023|archive-url=https://web.archive.org/web/20230119005503/https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/|url-status=live}}</ref> | |||
SD3 was trained at a cost of around $10 million.<ref>{{Cite web |last=emad_9608 |date=2024-04-19 |title=10m is about right |url=http://www.reddit.com/r/StableDiffusion/comments/1c870a5/any_estimate_on_how_much_money_they_spent_to/l0dc2ni/ |access-date=2024-04-25 |website=r/StableDiffusion}}</ref> | |||
=== Limitations ===
Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset that consists of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution;<ref name="diffusers">{{Cite web|title=Stable Diffusion with 🧨 Diffusers|url=https://huggingface.co/blog/stable_diffusion|access-date=2022-10-31|website=huggingface.co|archive-date=January 17, 2023|archive-url=https://web.archive.org/web/20230117222142/https://huggingface.co/blog/stable_diffusion|url-status=live}}</ref> the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution.<ref name="release2.0">{{cite web|url=https://stability.ai/blog/stable-diffusion-v2-release|title=Stable Diffusion 2.0 Release|website=stability.ai|archive-date=December 10, 2022|archive-url=https://web.archive.org/web/20221210062729/https://stability.ai/blog/stable-diffusion-v2-release|url-status=live}}</ref> Another challenge is in generating human limbs due to poor data quality of limbs in the LAION database.<ref>{{Cite web|title=LAION|url=https://laion.ai/|access-date=2022-10-31|website=laion.ai|language=en|archive-date=October 16, 2023|archive-url=https://web.archive.org/web/20231016082902/https://laion.ai/|url-status=live}}</ref> The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of such type can confound the model.<ref>{{Cite web|date=2022-08-24|title=Generating images with Stable Diffusion|url=https://blog.paperspace.com/generating-images-with-stable-diffusion/|access-date=2022-10-31|website=Paperspace Blog|language=en|archive-date=October 31, 2022|archive-url=https://web.archive.org/web/20221031231727/https://blog.paperspace.com/generating-images-with-stable-diffusion/|url-status=live}}</ref> Stable Diffusion XL (SDXL) version 1.0, released in July 2023, introduced native 1024x1024 resolution and improved generation for limbs and text.<ref>{{Cite web |title=Announcing SDXL 1.0 |url=https://stability.ai/blog/stable-diffusion-sdxl-1-announcement |access-date=2023-08-21 |website=Stability AI |language=en-GB |archive-date=July 26, 2023 |archive-url=https://web.archive.org/web/20230726215239/https://stability.ai/blog/stable-diffusion-sdxl-1-announcement |url-status=live }}</ref><ref>{{Cite web |last=Edwards |first=Benj |date=2023-07-27 |title=Stability AI releases Stable Diffusion XL, its next-gen image synthesis model |url=https://arstechnica.com/information-technology/2023/07/stable-diffusion-xl-puts-ai-generated-visual-worlds-at-your-gpus-command/ |access-date=2023-08-21 |website=Ars Technica |language=en-us |archive-date=August 21, 2023 |archive-url=https://web.archive.org/web/20230821011216/https://arstechnica.com/information-technology/2023/07/stable-diffusion-xl-puts-ai-generated-visual-worlds-at-your-gpus-command/ |url-status=live }}</ref> | ||
Accessibility for individual developers can also be a problem. In order to customize the model for new use cases that are not included in the dataset, such as generating ] characters ("waifu diffusion"),<ref>{{Cite web|title=hakurei/waifu-diffusion · Hugging Face|url=https://huggingface.co/hakurei/waifu-diffusion|access-date=2022-10-31|website=huggingface.co|archive-date=October 8, 2023|archive-url=https://web.archive.org/web/20231008120655/https://huggingface.co/hakurei/waifu-diffusion|url-status=live}}</ref> new data and further training are required. ] adaptations of Stable Diffusion created through additional retraining have been used for a variety of different use-cases, from medical imaging<ref>{{cite arXiv|first1=Pierre|last1=Chambon|first2=Christian|last2=Bluethgen|first3=Curtis P.|last3=Langlotz|first4=Akshay|last4=Chaudhari|date=2022-10-09|title=Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains|class=cs.CV|eprint=2210.04133}}</ref> to ].<ref>{{cite web|author=Seth Forsgren|author2=Hayk Martiros|url=https://www.riffusion.com/about|title=Riffusion - Stable diffusion for real-time music generation|website=Riffusion|archive-url=https://web.archive.org/web/20221216092717/https://www.riffusion.com/about|archive-date=December 16, 2022|url-status=live}}</ref> However, this fine-tuning process is sensitive to the quality of new data; low resolution images or different resolutions from the original data can not only fail to learn the new task but degrade the overall performance of the model. Even when the model is additionally trained on high quality images, it is difficult for individuals to run models in consumer electronics. For example, the training process for waifu-diffusion requires a minimum 30 GB of ],<ref>{{Citation|last=Mercurio|first=Anthony|title=Waifu Diffusion|date=2022-10-31|url=https://github.com/harubaru/waifu-diffusion/blob/6bf942eb6368ebf6bcbbb24b6ba8197bda6582a0/docs/en/training/README.md|access-date=2022-10-31|archive-date=October 31, 2022|archive-url=https://web.archive.org/web/20221031234225/https://github.com/harubaru/waifu-diffusion/blob/6bf942eb6368ebf6bcbbb24b6ba8197bda6582a0/docs/en/training/README.md|url-status=live}}</ref> which exceeds the usual resource provided in such consumer GPUs as ]'s ], which has only about 12 GB.<ref>{{Cite web|last=Smith|first=Ryan|title=NVIDIA Quietly Launches GeForce RTX 3080 12GB: More VRAM, More Power, More Money|url=https://www.anandtech.com/show/17204/nvidia-quietly-launches-geforce-rtx-3080-12gb-more-vram-more-power-more-money|access-date=2022-10-31|website=www.anandtech.com|archive-date=August 27, 2023|archive-url=https://web.archive.org/web/20230827092451/https://www.anandtech.com/show/17204/nvidia-quietly-launches-geforce-rtx-3080-12gb-more-vram-more-power-more-money|url-status=live}}</ref> | ||
The creators of Stable Diffusion acknowledge the potential for ], as the model was primarily trained on images with English descriptions.<ref name="stable-diffusion-model-card-1-4">{{Cite web|title=CompVis/stable-diffusion-v1-4 · Hugging Face|url=https://huggingface.co/CompVis/stable-diffusion-v1-4|access-date=2022-11-02|website=huggingface.co|archive-date=January 11, 2023|archive-url=https://web.archive.org/web/20230111161920/https://huggingface.co/CompVis/stable-diffusion-v1-4|url-status=live}}</ref> As a result, generated images reinforce social biases and are from a western perspective, as the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts that are written in English in comparison to those written in other languages, with western or white cultures often being the default representation.<ref name="stable-diffusion-model-card-1-4" />
=== End-user fine-tuning ===
To address the limitations of the model's initial training, end-users may opt to implement additional training to ] generation outputs to match more specific use-cases, a process also referred to as ]. There are three methods by which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint (an illustrative loading sketch follows the list):
*An "embedding" can be trained from a collection of user-provided images, and allows the model to generate visually similar images whenever the name of the embedding is used within a generation prompt.<ref>{{cite web|author=Dave James|date=October 28, 2022|url=https://www.pcgamer.com/nvidia-rtx-4090-stable-diffusion-training-aharon-kahana/|title=I thrashed the RTX 4090 for 8 hours straight training Stable Diffusion to paint like my uncle Hermann|website=]|archive-url=https://web.archive.org/web/20221109154310/https://www.pcgamer.com/nvidia-rtx-4090-stable-diffusion-training-aharon-kahana/|archive-date=November 9, 2022|url-status=live}}</ref> Embeddings are based on the "textual inversion" concept developed by researchers from ] in 2022 with support from ], where vector representations for specific tokens used by the model's text encoder are linked to new pseudo-words. Embeddings can be used to reduce biases within the original model, or mimic visual styles.<ref>{{cite arXiv|first1=Rinon|last1=Gal|first2=Yuval|last2=Alaluf|first3=Yuval|last3=Atzmon|first4=Or|last4=Patashnik|first5=Amit H.|last5=Bermano|first6=Gal|last6=Chechik|first7=Daniel|last7=Cohen-Or|date=2022-08-02|title=An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion|class=cs.CV|eprint=2208.01618}}</ref> | ||
*A "hypernetwork" is a small pretrained neural network that is applied to various points within a larger neural network, and refers to the technique created by ] developer Kurumuz in 2021, originally intended for text-generation ]. Hypernetworks steer results towards a particular direction, allowing Stable Diffusion-based models to imitate the art style of specific artists, even if the artist is not recognised by the original model; they process the image by finding key areas of importance such as hair and eyes, and then patch these areas in secondary latent space.<ref>{{cite web|date=October 11, 2022|url=https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac|title=NovelAI Improvements on Stable Diffusion|website=NovelAI|archive-url=https://archive.today/20221027041603/https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac|archive-date=October 27, 2022|url-status=live}}</ref> | ||
*] is a deep learning generation model developed by researchers from ] and ] in 2022 which can fine-tune the model to generate precise, personalised outputs that depict a specific subject, following training via a set of images which depict the subject.<ref>{{cite web|author=Yuki Yamashita|date=September 1, 2022|url=https://www.itmedia.co.jp/news/articles/2209/01/news041.html|title=愛犬の合成画像を生成できるAI 文章で指示するだけでコスプレ 米Googleが開発|website=ITmedia Inc.|language=ja|archive-url=https://web.archive.org/web/20220831232021/https://www.itmedia.co.jp/news/articles/2209/01/news041.html|archive-date=August 31, 2022|url-status=live}}</ref>
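The following Python sketch illustrates the first of these methods, loading a community-trained textual-inversion embedding into a pipeline with the third-party Hugging Face ''diffusers'' library; the repository name and the pseudo-word token are examples, not part of Stable Diffusion itself:

<syntaxhighlight lang="python">
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a textual-inversion embedding; "sd-concepts-library/cat-toy" is an
# example repository that defines the pseudo-word token "<cat-toy>".
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The new pseudo-word can now be used like any other word in a prompt.
image = pipe("a <cat-toy> sitting on a beach at sunset").images[0]
image.save("cat_toy.png")
</syntaxhighlight>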
== Capabilities ==
The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output.<ref name="stable-diffusion-github"/> Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis"<ref>{{cite arXiv|date=August 2, 2021|first1=Chenlin|last1=Meng|first2=Yutong|last2=He|first3=Yang|last3=Song|first4=Jiaming|last4=Song|first5=Jiajun|last5=Wu|first6=Jun-Yan|last6=Zhu|first7=Stefano|last7=Ermon|title=SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations|class=cs.CV|eprint=2108.01073}}</ref>) through its diffusion-denoising mechanism.<ref name="stable-diffusion-github"/> In addition, the model also allows the use of prompts to partially alter existing images via ] and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist.<ref name="webui_showcase">{{cite web|url=https://github.com/AUTOMATIC1111/stable-diffusion-webui-feature-showcase|title=Stable Diffusion web UI|website=GitHub|date=10 November 2022|access-date=September 27, 2022|archive-date=January 20, 2023|archive-url=https://web.archive.org/web/20230120032734/https://github.com/AUTOMATIC1111/stable-diffusion-webui-feature-showcase|url-status=live}}</ref>
Stable Diffusion is recommended to be run with 10 GB or more VRAM; however, users with less VRAM may opt to load the weights in ] precision instead of the default ] to trade off model performance for lower VRAM usage.<ref name="diffusers" />
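As a concrete illustration of the half-precision option, the following sketch uses the third-party Hugging Face ''diffusers'' library; the memory-saving call shown is that library's own helper rather than a feature of the model itself, and the checkpoint identifier is an example:

<syntaxhighlight lang="python">
import torch
from diffusers import StableDiffusionPipeline

# Loading the weights in float16 roughly halves VRAM use compared with float32.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Optional further reduction: compute attention in slices at some speed cost.
pipe.enable_attention_slicing()

image = pipe("a watercolor painting of a lighthouse").images[0]
image.save("lighthouse.png")
</syntaxhighlight>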
=== Text to image generation ===
{{multiple image
| direction = vertical
| align = right
| total_width = 200
| image1 = Algorithmically-generated landscape artwork of forest with Shinto shrine.png
| image2 = Algorithmically-generated landscape artwork of forest with Shinto shrine using negative prompt for green trees.png
| image3 = Algorithmically-generated landscape artwork of forest with Shinto shrine using negative prompt for round stones.png
| footer = Demonstration of the effect of negative prompts on image generation
*'''Top''': no negative prompt
*'''Centre''': "green trees"
*'''Bottom''': "round stones, round rocks"
}}
The text to image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt.<ref name="stable-diffusion-github" /> Generated images are tagged with an invisible ] to allow users to identify an image as generated by Stable Diffusion,<ref name="stable-diffusion-github" /> although this watermark loses its efficacy if the image is resized or rotated.<ref>{{Citation|title=invisible-watermark|date=2022-11-02|url=https://github.com/ShieldMnt/invisible-watermark/blob/9802ce3e0c3a5ec43b41d503f156717f0c739584/README.md|publisher=Shield Mountain|access-date=2022-11-02|archive-date=October 18, 2022|archive-url=https://web.archive.org/web/20221018062806/https://github.com/ShieldMnt/invisible-watermark/blob/9802ce3e0c3a5ec43b41d503f156717f0c739584/README.md|url-status=live}}</ref>
Each txt2img generation will involve a specific ] which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use the same seed to obtain the same image output as a previously generated image.<ref name="diffusers" /> Users are also able to adjust the number of inference steps for the sampler; a higher value takes a longer duration of time, however a smaller value may result in visual defects.<ref name="diffusers" /> Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt.<ref name=":5" /> More experimentative use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.<ref name="diffusers" />
Additional text2img features are provided by ] implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis to keywords by enclosing them with brackets.<ref>{{Cite web|title=stable-diffusion-tools/emphasis at master · JohannesGaessler/stable-diffusion-tools|url=https://github.com/JohannesGaessler/stable-diffusion-tools|access-date=2022-11-02|website=GitHub|language=en|archive-date=October 2, 2022|archive-url=https://web.archive.org/web/20221002081041/https://github.com/JohannesGaessler/stable-diffusion-tools|url-status=live}}</ref> An alternative method of adjusting the weight given to parts of the prompt is the use of "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example.<ref name="webui_showcase" /><ref name="release2.1">{{cite web|url=https://stability.ai/blog/stablediffusion2-1-release7-dec-2022|title=Stable Diffusion v2.1 and DreamStudio Updates 7-Dec 22|website=stability.ai|archive-date=December 10, 2022|archive-url=https://web.archive.org/web/20221210062732/https://stability.ai/blog/stablediffusion2-1-release7-dec-2022|url-status=live}}</ref>
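The seed, step count, guidance scale, and negative prompt described above map directly onto generation parameters in programmatic front-ends; the following sketch uses the third-party Hugging Face ''diffusers'' library, with arbitrary example values and an example checkpoint identifier:

<syntaxhighlight lang="python">
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Fixing the generator seed makes the output reproducible.
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="a photograph of an astronaut riding a horse",
    negative_prompt="blurry, mangled hands, watermark",  # features to steer away from
    num_inference_steps=50,   # more steps: slower, usually fewer artifacts
    guidance_scale=7.5,       # classifier-free guidance scale
    generator=generator,
).images[0]
image.save("astronaut.png")
</syntaxhighlight>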
=== Image modification ===
{{Multiple image
| direction = horizontal
| align = right
| total_width = 400
| image1 = NightCitySphere (SD1.5).jpg
| image2 = NightCitySphere (SDXL).jpg
| footer = Demonstration of img2img modification
*'''Left''': Original image created with Stable Diffusion 1.5
*'''Right''': Modified image created with Stable Diffusion XL 1.0
}}
Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, path to an existing image, and strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the output image. A higher strength value produces more variation within the image but may produce an image that is not semantically consistent with the prompt provided.<ref name="stable-diffusion-github" />
There are different methods for performing img2img. The main method is SDEdit,<ref name=":10" /> which first adds noise to an image, then denoises it as usual in text2img. | |||
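A sketch of the img2img workflow through the third-party Hugging Face ''diffusers'' library, where the <code>strength</code> argument plays the role of the strength value described above; file names and the checkpoint identifier are placeholders:

<syntaxhighlight lang="python">
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")

# strength close to 0.0 stays near the input; close to 1.0 behaves like txt2img.
result = pipe(
    prompt="a detailed oil painting of a night city under a glass sphere",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
result.save("night_city.png")
</syntaxhighlight>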
The ability of img2img to add noise to the original image makes it potentially useful for ] and ], in which the visual features of image data are changed and anonymized.<ref name=":1">{{cite arXiv|last1=Luzi|first1=Lorenzo|last2=Siahkoohi|first2=Ali|last3=Mayer|first3=Paul M.|last4=Casco-Rodriguez|first4=Josue|last5=Baraniuk|first5=Richard|date=2022-10-21|title=Boomerang: Local sampling on image manifolds using diffusion models|class=cs.CV|eprint=2210.12100 }}</ref> The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image.<ref name=":1" /> Additionally, Stable Diffusion has been experimented with as a tool for image compression. Compared to ] and ], the recent methods used for image compression in Stable Diffusion face limitations in preserving small text and faces.<ref>{{Cite web|last=Bühlmann|first=Matthias|date=2022-09-28|title=Stable Diffusion Based Image Compression|url=https://pub.towardsai.net/stable-diffusion-based-image-compresssion-6f1f0a399202|access-date=2022-11-02|website=Medium|language=en|archive-date=November 2, 2022|archive-url=https://web.archive.org/web/20221102231642/https://pub.towardsai.net/stable-diffusion-based-image-compresssion-6f1f0a399202|url-status=live}}</ref> | |||
Additional use-cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided ], which fills the masked space with newly generated content based on the provided prompt.<ref name="webui_showcase" /> A dedicated model specifically fine-tuned for inpainting use-cases was created by Stability AI alongside the release of Stable Diffusion 2.0.<ref name="release2.0"/> Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.<ref name="webui_showcase" />
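A sketch of mask-based inpainting via the third-party Hugging Face ''diffusers'' library, using an inpainting-tuned checkpoint of the kind mentioned above; the identifier and file names are examples, and the white areas of the mask are the regions that get regenerated:

<syntaxhighlight lang="python">
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("room.png").convert("RGB")
mask = Image.open("room_mask.png").convert("RGB")  # white = area to repaint

result = pipe(
    prompt="a vase of sunflowers on the table",
    image=image,
    mask_image=mask,
).images[0]
result.save("room_inpainted.png")
</syntaxhighlight>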
A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the ] of the provided input image, and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output.<ref name="release2.0"/>
=== ControlNet ===
ControlNet<ref name="controlnet-paper">{{Cite arXiv|title=Adding Conditional Control to Text-to-Image Diffusion Models|last=Zhang|first=Lvmin|date=10 February 2023|class=cs.CV |eprint=2302.05543 }}</ref> is a neural network architecture designed to manage diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models. The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, preventing any distortion caused by ControlNet. No layer is trained from scratch; the process is still fine-tuning, keeping the original model secure. This method enables training on small-scale or even personal devices. | |||
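A sketch of conditioning generation on a Canny edge map with a pretrained ControlNet, through the third-party Hugging Face ''diffusers'' library; the model identifiers are examples of publicly released checkpoints and the edge map is assumed to have been prepared beforehand:

<syntaxhighlight lang="python">
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edges = Image.open("edges.png")  # precomputed Canny edge map of the desired layout

# The edge map constrains composition while the prompt controls appearance.
image = pipe(
    prompt="a cozy cottage in a snowy forest, photorealistic",
    image=edges,
).images[0]
image.save("cottage.png")
</syntaxhighlight>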
===User Interfaces=== | |||
Stability provides an online image generation service called ''DreamStudio''.<ref>{{cite web |last1=Edwards |first1=Benj |title=Stable Diffusion in your pocket? "Draw Things" brings AI images to iPhone |url=https://arstechnica.com/information-technology/2022/11/stable-diffusion-in-your-pocket-draw-things-brings-ai-images-to-iphone/ |website=Ars Technica |access-date=10 July 2024 |language=en-us |date=10 November 2022}}</ref><ref>{{cite web |last1=Wendling |first1=Mike |title=AI can be easily used to make fake election photos - report |url=https://www.bbc.com/news/world-us-canada-68471253 |website=bbc.com |access-date=10 July 2024 |date=6 March 2024 |quote=The CCDH, a campaign group, tested four of the largest public-facing AI platforms: Midjourney, OpenAI's ChatGPT Plus, Stability.ai's DreamStudio and Microsoft's Image Creator.}}</ref> The company also released an open source version of ''DreamStudio'' called ''StableStudio''.<ref>{{cite web |last1=Wiggers |first1=Kyle |title=Stability AI open sources its AI-powered design studio |url=https://techcrunch.com/2023/05/18/stability-ai-open-sources-its-ai-powered-design-studio/ |website=TechCrunch |access-date=10 July 2024 |date=18 May 2023}}</ref><ref>{{cite web |last1=Weatherbed |first1=Jess |title=Stability AI is open-sourcing its DreamStudio web app |url=https://www.theverge.com/2023/5/17/23726751/stability-ai-stablestudio-dreamstudio-stable-diffusion-artificial-intelligence |website=The Verge |language=en |date=17 May 2023}}</ref> In addition to Stability's interfaces, many third party open source interfaces exist, such as ], which is the most popular and offers extra features,<ref>{{cite web |last1=Mann |first1=Tobias |title=A friendly guide to local AI image gen with Stable Diffusion and Automatic1111 |url=https://www.theregister.com/2024/06/29/image_gen_guide/ |website=] |language=en |date=29 Jun 2024}}</ref> '']'', which aims to decrease the amount of prompting needed by the user,<ref>{{cite web |last1=Hachman |first1=Mak |title=Fooocus is the easiest way to create AI art on your PC |url=https://www.pcworld.com/article/2253285/fooocus-is-the-easiest-way-to-run-ai-art-on-your-pc.html |website=PCWorld |language=en}}</ref> and '']'', which has a ] user interface, essentially a ] akin to many ] applications.<ref>{{cite web |url=https://learn.thinkdiffusion.com/comfyui-workflows-and-what-you-need-to-know/ |title=ComfyUI Workflows and what you need to know |author=<!--Not stated--> |website=thinkdiffusion.com |date=December 2023 |access-date=2024-07-10}}</ref><ref>{{cite web |url=https://github.com/comfyanonymous/ComfyUI |title=ComfyUI |author=<!--Not stated--> |website=github.com |access-date=2024-07-10}}</ref><ref>{{cite thesis |last=Huang |first=Yenkai |date=2024-05-10 |title=Latent Auto-recursive Composition Engine |url=https://digitalcommons.dartmouth.edu/cgi/viewcontent.cgi?article=1188&context=masters_theses |degree=M.S. Computer Science |publisher=] |access-date=2024-07-10}}</ref> | |||
== Releases == | |||
{| class="wikitable" | |||
|+ | |||
!Version number | |||
!Release date | |||
!Parameters | |||
!Notes | |||
|- | |||
|1.1, 1.2, 1.3, 1.4<ref>{{Cite web |title=CompVis/stable-diffusion-v1-4 · Hugging Face |url=https://huggingface.co/CompVis/stable-diffusion-v1-4 |url-status=live |archive-url=https://web.archive.org/web/20230111161920/https://huggingface.co/CompVis/stable-diffusion-v1-4 |archive-date=January 11, 2023 |access-date=2023-08-17 |website=huggingface.co}}</ref> | |||
|August 2022 | |||
| | |||
|All released by CompVis. There is no "version 1.0". 1.1 gave rise to 1.2, and 1.2 gave rise to both 1.3 and 1.4.<ref>{{Cite web |date=2023-08-23 |title=CompVis (CompVis) |url=https://huggingface.co/CompVis |access-date=2024-03-06 |website=huggingface.co}}</ref> | |||
|- | |||
|1.5<ref>{{Cite web |title=runwayml/stable-diffusion-v1-5 · Hugging Face |url=https://huggingface.co/runwayml/stable-diffusion-v1-5 |url-status=live |archive-url=https://web.archive.org/web/20230921025150/https://huggingface.co/runwayml/stable-diffusion-v1-5 |archive-date=September 21, 2023 |access-date=2023-08-17 |website=huggingface.co}}</ref> | |||
|October 2022 | |||
|983M | |||
|Initialized with the weights of 1.2, not 1.4. Released by RunwayML. | |||
|- | |||
|2.0<ref name=":3">{{Cite web |title=stabilityai/stable-diffusion-2 · Hugging Face |url=https://huggingface.co/stabilityai/stable-diffusion-2 |url-status=live |archive-url=https://web.archive.org/web/20230921135247/https://huggingface.co/stabilityai/stable-diffusion-2 |archive-date=September 21, 2023 |access-date=2023-08-17 |website=huggingface.co}}</ref> | |||
|November 2022 | |||
| | |||
|Retrained from scratch on a filtered dataset.<ref>{{Cite web |title=stabilityai/stable-diffusion-2-base · Hugging Face |url=https://huggingface.co/stabilityai/stable-diffusion-2-base |access-date=2024-01-01 |website=huggingface.co}}</ref> | |||
|- | |||
|2.1<ref>{{Cite web |title=stabilityai/stable-diffusion-2-1 · Hugging Face |url=https://huggingface.co/stabilityai/stable-diffusion-2-1 |url-status=live |archive-url=https://web.archive.org/web/20230921025146/https://huggingface.co/stabilityai/stable-diffusion-2-1 |archive-date=September 21, 2023 |access-date=2023-08-17 |website=huggingface.co}}</ref> | |||
|December 2022 | |||
| | |||
|Initialized with the weights of 2.0. | |||
|- | |||
|XL 1.0<ref>{{Cite web |title=stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face |url=https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0 |url-status=live |archive-url=https://web.archive.org/web/20231008071719/https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0 |archive-date=October 8, 2023 |access-date=2023-08-17 |website=huggingface.co}}</ref><ref name=":4">{{cite arXiv|last1=Podell |first1=Dustin |title=SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis |date=2023-07-04 |eprint=2307.01952 |last2=English |first2=Zion |last3=Lacey |first3=Kyle |last4=Blattmann |first4=Andreas |last5=Dockhorn |first5=Tim |last6=Müller |first6=Jonas |last7=Penna |first7=Joe |last8=Rombach |first8=Robin|class=cs.CV }}</ref> | |||
|July 2023 | |||
|3.5B | |||
|The XL 1.0 base model has 3.5 billion parameters, making it around 3.5x larger than previous versions.<ref>{{Cite web |title=Announcing SDXL 1.0 |url=https://stability.ai/news/stable-diffusion-sdxl-1-announcement |access-date=2024-01-01 |website=Stability AI |language=en-GB}}</ref> | |||
|- | |||
|XL Turbo<ref>{{Cite web |title=stabilityai/sdxl-turbo · Hugging Face |url=https://huggingface.co/stabilityai/sdxl-turbo |access-date=2024-01-01 |website=huggingface.co}}</ref> | |||
|November 2023 | |||
| | |||
|Distilled from XL 1.0 to run in fewer diffusion steps.<ref>{{Cite web |title=Adversarial Diffusion Distillation |url=https://stability.ai/research/adversarial-diffusion-distillation |access-date=2024-01-01 |website=Stability AI |language=en-GB}}</ref> | |||
|- | |||
|3.0<ref>{{Cite web |title=Stable Diffusion 3 |url=https://stability.ai/news/stable-diffusion-3 |access-date=2024-03-05 |website=Stability AI |language=en-GB}}</ref><ref name=":6" /> | |||
|February 2024 (early preview) | |||
|800M to 8B | |||
|A family of models. | |||
|- | |||
|3.5<ref name="release-sd3.5">{{cite web|url=https://stability.ai/news/introducing-stable-diffusion-3-5|title=Stable Diffusion 3.5|website=]|access-date=October 23, 2024|archive-date=October 23, 2024|archive-url=https://archive.today/20241023040750/https://stability.ai/news/introducing-stable-diffusion-3-5|url-status=live}}</ref> | |||
|October 2024 | |||
|2.5B to 8B | |||
|A family of models with Large (8 billion parameters), Large Turbo (distilled from SD 3.5 Large), and Medium (2.5 billion parameters). | |||
|} | |||
Key papers | |||
* ''Learning Transferable Visual Models From Natural Language Supervision'' (2021).<ref>{{cite arXiv|last1=Radford |first1=Alec |title=Learning Transferable Visual Models From Natural Language Supervision |date=2021-02-26 |eprint=2103.00020 |last2=Kim |first2=Jong Wook |last3=Hallacy |first3=Chris |last4=Ramesh |first4=Aditya |last5=Goh |first5=Gabriel |last6=Agarwal |first6=Sandhini |last7=Sastry |first7=Girish |last8=Askell |first8=Amanda |last9=Mishkin |first9=Pamela|class=cs.CV }}</ref> This paper describes the CLIP method for training text encoders, which convert text into floating point vectors. Such text encodings are used by the diffusion model to create images. | |||
* ''SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations'' (2021).<ref name=":10">{{cite arXiv |last1=Meng |first1=Chenlin |title=SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations |date=2022-01-04 |eprint=2108.01073 |last2=He |first2=Yutong |last3=Song |first3=Yang |last4=Song |first4=Jiaming |last5=Wu |first5=Jiajun |last6=Zhu |first6=Jun-Yan |last7=Ermon |first7=Stefano|class=cs.CV }}</ref> This paper describes SDEdit, aka "img2img". | |||
* ''High-Resolution Image Synthesis with Latent Diffusion Models'' (2021, updated in 2022).<ref>{{Cite book |last1=Rombach |first1=Robin |last2=Blattmann |first2=Andreas |last3=Lorenz |first3=Dominik |last4=Esser |first4=Patrick |last5=Ommer |first5=Björn |date=2022 |chapter=High-Resolution Image Synthesis With Latent Diffusion Models |chapter-url=https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.html |title= Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |language=en |pages=10684–10695|arxiv=2112.10752 }}</ref> This paper describes the latent diffusion model (LDM). This is the backbone of the Stable Diffusion architecture. | |||
* ''Classifier-Free Diffusion Guidance'' (2022).<ref name=":5" /> This paper describes CFG, which allows the text encoding vector to steer the diffusion model towards creating the image described by the text. | |||
* ''SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis'' (2023).<ref name=":4" /> Describes SDXL. | |||
* ''Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow'' (2022).<ref name=":7" /><ref name=":8" /> Describes rectified flow, which is used for the backbone architecture of SD 3.0. | |||
* ''Scaling Rectified Flow Transformers for High-resolution Image Synthesis'' (2024).<ref name=":6" /> Describes SD 3.0. | |||
Training cost | |||
* SD 2.0: 0.2 million hours on A100 (40GB).<ref name=":3" /> | |||
Stable Diffusion 3.5 Large was made available for enterprise use on Amazon Bedrock, a service of ].<ref>{{Cite web |last=Kerner |first=Sean Michael |date=2024-12-19 |title=Stable Diffusion 3.5 hits Amazon Bedrock: What it means for enterprise AI workflows |url=https://venturebeat.com/ai/stable-diffusion-3-5-hits-amazon-bedrock-what-it-means-for-enterprise-ai-workflows/ |access-date=2024-12-25 |website=VentureBeat |language=en-US}}</ref>
==Usage and controversy== | |||
Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals.<ref>{{Cite web |date=2023-07-26 |title=LICENSE.md · stabilityai/stable-diffusion-xl-base-1.0 at main |url=https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md |access-date=2024-01-01 |website=huggingface.co}}</ref> | |||
The images Stable Diffusion was trained on have been filtered without human input, leading to some harmful images and large amounts of private and sensitive information appearing in the training data.<ref name=":2" /> | |||
More traditional visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors.<ref name="MIT-LAION" /> | |||
Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, in comparison to other commercial products based on generative AI.<ref name="bijapan">{{cite web|author=Ryo Shimizu|date=August 26, 2022|url=https://www.businessinsider.jp/post-258369|title=Midjourneyを超えた? 無料の作画AI「 #StableDiffusion 」が「AIを民主化した」と断言できる理由|website=Business Insider Japan|language=ja|access-date=October 4, 2022|archive-date=December 10, 2022|archive-url=https://web.archive.org/web/20221210192453/https://www.businessinsider.jp/post-258369|url-status=live}}</ref> Addressing the concerns that the model may be used for abusive purposes, the CEO of Stability AI, ], argues that it is "peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology",<ref name="verge" /> and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences.<ref name="verge" /> In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end the control and dominance over such technologies by corporations that have previously only developed closed AI systems for image synthesis.<ref name="verge" /><ref name="bijapan" /> This is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code.<ref name=":13">{{Cite web|last=Cai|first=Kenrick|title=Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion|url=https://www.forbes.com/sites/kenrickcai/2022/09/07/stability-ai-funding-round-1-billion-valuation-stable-diffusion-text-to-image/|access-date=2022-10-31|website=Forbes|language=en|archive-date=September 30, 2023|archive-url=https://web.archive.org/web/20230930125226/https://www.forbes.com/sites/kenrickcai/2022/09/07/stability-ai-funding-round-1-billion-valuation-stable-diffusion-text-to-image/|url-status=live}}</ref>
Controversy has arisen over photorealistic ] generated with Stable Diffusion, as such images have been shared on websites such as ].<ref>{{Cite web |date=2023-06-27 |title=Illegal trade in AI child sex abuse images exposed |url=https://www.bbc.com/news/uk-65932372 |access-date=2023-09-26 |website=BBC News |language=en-GB |archive-date=September 21, 2023 |archive-url=https://web.archive.org/web/20230921100213/https://www.bbc.com/news/uk-65932372 |url-status=live }}</ref>
In June 2024, a ], a user interface for Stable Diffusion, took place, with the hackers claiming they targeted users who had committed "one of our sins", which included AI-art generation, art theft, and promoting cryptocurrency.<ref>{{Cite web |last=Maiberg |first=Emanuel |date=2024-06-11 |title=Hackers Target AI Users With Malicious Stable Diffusion Tool on GitHub to Protest 'Art Theft' |url=https://www.404media.co/hackers-target-ai-users-with-malicious-stable-diffusion-tool-on-github/ |url-access=subscription |access-date=2024-06-14 |website=404 Media |language=en}}</ref>
==Litigation== | |||
===Andersen, McKernan, and Ortiz v. Stability AI, Midjourney, and DeviantArt=== | |||
In January 2023, three artists, ], ], and Karla Ortiz, filed a ] lawsuit against Stability AI, ], and ], claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.<ref>{{Cite web|url=https://www.theverge.com/2023/1/16/23557098/generative-ai-art-copyright-legal-lawsuit-stable-diffusion-midjourney-deviantart|title=AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit|first=James|last=Vincent|date=January 16, 2023|website=The Verge|access-date=January 16, 2023|archive-date=March 9, 2023|archive-url=https://web.archive.org/web/20230309010528/https://www.theverge.com/2023/1/16/23557098/generative-ai-art-copyright-legal-lawsuit-stable-diffusion-midjourney-deviantart|url-status=live}}</ref> | |||
In July 2023, U.S. District Judge ] indicated that he was inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file a new complaint, providing them an opportunity to reframe their arguments.<ref name=Reuters-SDLawsuit>{{Cite news |last=Brittain |first=Blake |date=2023-07-19 |title=US judge finds flaws in artists' lawsuit against AI companies |language=en |work=Reuters |url=https://www.reuters.com/legal/litigation/us-judge-finds-flaws-artists-lawsuit-against-ai-companies-2023-07-19/ |access-date=2023-08-06 |archive-date=September 6, 2023 |archive-url=https://web.archive.org/web/20230906193839/https://www.reuters.com/legal/litigation/us-judge-finds-flaws-artists-lawsuit-against-ai-companies-2023-07-19/ |url-status=live }}</ref>
===Getty Images v. Stability AI=== | |||
In January 2023, ] initiated legal proceedings against Stability AI in the English High Court, alleging significant infringement of its intellectual property rights. Getty Images claims that Stability AI "scraped" millions of images from Getty’s websites without consent and used these images to train and develop its deep-learning Stable Diffusion model.<ref>{{Cite news |last=Goosens|first=Sophia|date=2024-02-28 | title=Getty Images v Stability AI: the implications for UK copyright law and licensing| url=https://www.pinsentmasons.com/out-law/analysis/getty-images-v-stability-ai-implications-copyright-law-licensing}}</ref><ref>{{Cite news | last= Gill| first=Dennis | date=2023-12-11 | title=Getty Images v Stability AI: copyright claims can proceed to trial| url=https://www.pinsentmasons.com/out-law/news/getty-images-v-stability-ai}}</ref> | |||
Key points of the lawsuit include: | |||
* Getty Images asserting that the training and development of Stable Diffusion involved the unauthorized use of its images, which were downloaded on servers and computers that were potentially in the UK. However, Stability AI argues that all training and development took place outside the UK, specifically in U.S. data centers operated by Amazon Web Services.<ref>{{Cite news |last=Goosens|first=Sophia|date=2024-02-28 | title=Getty v. Stability AI case goes to trial in the UK – what we learned| url=https://www.reedsmith.com/en/perspectives/2024/02/getty-v-stability-ai-case-goes-to-trial-in-the-uk-what-we-learned}}</ref> | |||
* Stability AI applied for reverse summary judgment and/or strike out of two claims: the training and development claim, and the secondary infringement of copyright claim. The High Court, however, refused to strike out these claims, allowing them to proceed to trial. The court is to determine whether the training and development of Stable Diffusion occurred in the UK, which is crucial for establishing jurisdiction under the UK's Copyright, Designs and Patents Act 1988 (CDPA).<ref name="pinsentmasons2024GettyvsStabilityAI">{{Cite news |last=Hill |first=Charlotte | date=2024-02-16 | title=Generative AI in the courts: Getty Images v Stability AI | |||
| url=https://www.penningtonslaw.com/news-publications/latest-news/2024/generative-ai-in-the-courts-getty-images-v-stability-ai}}</ref> | |||
* The secondary infringement claim revolves around whether the pre-trained Stable Diffusion software, made available in the UK through platforms like GitHub, HuggingFace, and DreamStudio, constitutes an "article" under sections 22 and 23 of the CDPA. The court will decide whether the term "article" can encompass intangible items such as software.<ref name="pinsentmasons2024GettyvsStabilityAI" /> | |||
The trial is expected to take place in summer 2025; its outcome could have significant implications for UK copyright law and the licensing of AI-generated content.
== License ==
Unlike models like ], Stable Diffusion makes its ],<ref name="stability">{{cite web|title=Stable Diffusion Public Release|url=https://stability.ai/blog/stable-diffusion-public-release|url-status=live|archive-url=https://web.archive.org/web/20220830210535/https://stability.ai/blog/stable-diffusion-public-release|archive-date=2022-08-30|access-date=2022-08-31|website=Stability.Ai}}</ref><ref name="stable-diffusion-github" /> along with the model (pretrained weights). Prior to Stable Diffusion 3, it applied the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (M).<ref>{{Cite web |title=From RAIL to Open RAIL: Topologies of RAIL Licenses |url=https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses |access-date=2023-02-20 |website=Responsible AI Licenses (RAIL) |date=18 August 2022 |language=en-US |archive-date=July 27, 2023 |archive-url=https://web.archive.org/web/20230727145215/https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses |url-status=live }}</ref> The license prohibits certain use cases, including crime, ], ], ], "]", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... ]".<ref name="washingtonpost">{{cite news|date=2022-08-30|title=Ready or not, mass video deepfakes are coming|newspaper=The Washington Post|url=https://www.washingtonpost.com/technology/2022/08/30/deep-fake-video-on-agt/|url-status=live|access-date=2022-08-31|archive-url=https://web.archive.org/web/20220831115010/https://www.washingtonpost.com/technology/2022/08/30/deep-fake-video-on-agt/|archive-date=2022-08-31}}</ref><ref>{{Cite web|title=License - a Hugging Face Space by CompVis|url=https://huggingface.co/spaces/CompVis/stable-diffusion-license|url-status=live|archive-url=https://web.archive.org/web/20220904215616/https://huggingface.co/spaces/CompVis/stable-diffusion-license|archive-date=2022-09-04|access-date=2022-09-05|website=huggingface.co}}</ref> The user owns the rights to their generated output images, and is free to use them commercially.<ref>{{cite web|author=Katsuo Ishida|date=August 26, 2022|title=言葉で指示した画像を凄いAIが描き出す「Stable Diffusion」 ~画像は商用利用も可能|url=https://forest.watch.impress.co.jp/docs/review/1434893.html|website=Impress Corporation|language=ja|access-date=October 4, 2022|archive-date=November 14, 2022|archive-url=https://web.archive.org/web/20221114020520/https://forest.watch.impress.co.jp/docs/review/1434893.html|url-status=live}}</ref> | ||
Stable Diffusion 3.5 applies the permissive Stability AI Community License, while commercial enterprises with revenue exceeding US$1 million require the Stability AI Enterprise License.<ref>{{Cite web |date=2024-07-05 |title=Community License |url=https://stability.ai/news/license-update |access-date=2024-10-23 |website=] |language=en-GB}}</ref> As with the OpenRAIL-M license, the user retains the rights to their generated output images and is free to use them commercially.<ref name="release-sd3.5" />
==See also==
* ]
* ]
* ]
* ]
* ]
* ]
* ]
==References==
{{Reflist|refs=
<ref name="MIT-LAION">{{cite web
|work=MIT Technology Review
|last=Heikkilä
|first=Melissa
|date=16 September 2022
|title=This artist is dominating AI-generated art. And he's not happy about it.
|url=https://www.technologyreview.com/2022/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/
|access-date=September 26, 2022
|archive-date=January 14, 2023
|archive-url=https://web.archive.org/web/20230114125952/https://www.technologyreview.com/2022/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/
|url-status=live
}}</ref>
<ref name="paper">{{cite conference|last1=Rombach|last2=Blattmann|last3=Lorenz|last4=Esser|last5=Ommer|title=High-Resolution Image Synthesis with Latent Diffusion Models|conference=International Conference on Computer Vision and Pattern Recognition (CVPR)|pages=10684–10695|date=June 2022|location=New Orleans, LA|url=https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf|arxiv=2112.10752|access-date=September 17, 2022|archive-date=January 20, 2023|archive-url=https://web.archive.org/web/20230120163151/https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf|url-status=live}}</ref>
<ref name="stable-diffusion-launch">{{cite web|url=https://stability.ai/blog/stable-diffusion-announcement|title=Stable Diffusion Launch Announcement|website=Stability.Ai|access-date=2022-09-06|archive-date=2022-09-05|archive-url=https://web.archive.org/web/20220905105009/https://stability.ai/blog/stable-diffusion-announcement|url-status=live}}</ref> | <ref name="stable-diffusion-launch">{{cite web|url=https://stability.ai/blog/stable-diffusion-announcement|title=Stable Diffusion Launch Announcement|website=Stability.Ai|access-date=2022-09-06|archive-date=2022-09-05|archive-url=https://web.archive.org/web/20220905105009/https://stability.ai/blog/stable-diffusion-announcement|url-status=live}}</ref> | ||
<ref name="verge">{{cite web
|work=The Verge
|last=Vincent
|first=James
|date=15 September 2022
|title=Anyone can use this AI art generator — that's the risk
|url=https://www.theverge.com/2022/9/15/23340673/ai-image-generation-stable-diffusion-explained-ethics-copyright-data
|access-date=September 30, 2022
|archive-date=January 21, 2023
|archive-url=https://web.archive.org/web/20230121153021/https://www.theverge.com/2022/9/15/23340673/ai-image-generation-stable-diffusion-explained-ethics-copyright-data
|url-status=live
}}</ref>
}}
{{Commons category}}
*
*{{Cite web |title=Step by Step visual introduction to Diffusion Models. - Blog by Kemal Erdem |url=https://erdem.pl/2023/11/step-by-step-visual-introduction-to-diffusion-models/ |access-date=2024-08-31|language=en}} | |||
*{{Cite web |title=U-Net for Stable Diffusion |url=https://nn.labml.ai/diffusion/stable_diffusion/model/unet.html |access-date=2024-08-31 |website=U-Net for Stable Diffusion |language=en}} | |||
* | |||
*: Investigation on sensitive and private data in Stable Diffusion's training data
*"" | |||
*"" | |||
{{Generative AI}} | |||
{{Artificial intelligence navbox}} | |||
] | |||
] | ] | ||
] | ] | ||
] | ] | ||
] | |||
] | |||
] |
Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.
It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. Its development involved researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway with a computational donation from Stability and training data from non-profit organizations.
Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney which were accessible only via cloud services.
Development
Stable Diffusion originated from a project called Latent Diffusion, developed in Germany by researchers at Ludwig Maximilian University in Munich and Heidelberg University. Four of the original 5 authors (Robin Rombach, Andreas Blattmann, Patrick Esser and Dominik Lorenz) later joined Stability AI and released subsequent versions of Stable Diffusion.
The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich. Development was led by Patrick Esser of Runway and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion. Stability AI also credited EleutherAI and LAION (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project.
Technology
Architecture
Main article: Latent diffusion model
Models in the Stable Diffusion series before SD 3 all used a kind of diffusion model (DM), called a latent diffusion model (LDM), developed by the CompVis (Computer Vision & Learning) group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images, which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: the variational autoencoder (VAE), the U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.
The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.
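As an illustration of these components, the following minimal sketch uses the third-party Hugging Face diffusers library to load a Stable Diffusion checkpoint and inspect its parts; the library and model identifier are assumptions, not part of the original CompVis code.

<syntaxhighlight lang="python">
# Sketch: loading a Stable Diffusion 1.x checkpoint with the third-party
# "diffusers" library and inspecting the components described above.
# The model identifier is illustrative.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.vae).__name__)           # variational autoencoder (latent space)
print(type(pipe.unet).__name__)          # U-Net denoiser with cross-attention
print(type(pipe.text_encoder).__name__)  # CLIP text encoder for the prompt
print(type(pipe.scheduler).__name__)     # noise schedule used during denoising
</syntaxhighlight>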
The name "diffusion" takes inspiration from thermodynamic diffusion; an important link between this purely physical process and deep learning was established in 2015.
With 860 million parameters in the U-Net and 123 million in the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards, and unlike other diffusion models, it can run on consumer GPUs, and even CPU-only if using the OpenVINO version of Stable Diffusion.
SD XL
The XL version uses the same LDM architecture as previous versions, but scaled up: a larger UNet backbone, a larger cross-attention context, two text encoders instead of one, and training on multiple aspect ratios (not just the square aspect ratio of previous versions).
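A minimal sketch of loading SDXL and its two text encoders, again assuming the third-party diffusers library and an illustrative model identifier.

<syntaxhighlight lang="python">
# Sketch: SDXL exposes two text encoders, matching the description above.
# Library and model identifier are assumptions, not part of the article.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

print(pipe.text_encoder.__class__.__name__)    # first (smaller) text encoder
print(pipe.text_encoder_2.__class__.__name__)  # second, larger text encoder

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut_xl.png")
</syntaxhighlight>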
The SD XL Refiner, released at the same time, has the same architecture as SD XL, but it was trained for adding fine details to preexisting images via text-conditional img2img.
SD 3.0
Main article: Diffusion model § Rectified flow
The 3.0 version completely changes the backbone: instead of a UNet, it uses a Rectified Flow Transformer, which implements the rectified flow method with a Transformer.
The Transformer architecture used for SD 3.0 has three "tracks", for original text encoding, transformed text encoding, and image encoding (in latent space). The transformed text encoding and image encoding are mixed during each transformer block.
The architecture is named "multimodal diffusion transformer" (MMDiT), where "multimodal" means that it mixes text and image encodings inside its operations. This differs from previous versions of DiT, where the text encoding affects the image encoding, but not vice versa.
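The rectified flow objective can be illustrated with a short, self-contained sketch; this is a simplified illustration of the general method, not Stability AI's training code, and the model signature used here is an invented placeholder.

<syntaxhighlight lang="python">
# Illustrative sketch of the rectified flow objective: the network is trained
# to predict the constant velocity that carries a noise sample to a data
# sample along a straight line. Not Stability AI's actual training code.
import torch

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: batch of (latent) images, shape (B, C, H, W); model(x_t, t) predicts velocity."""
    noise = torch.randn_like(x0)              # x1 ~ N(0, I)
    t = torch.rand(x0.shape[0], 1, 1, 1)      # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * noise          # straight-line interpolation
    target_velocity = noise - x0              # d x_t / d t is constant along the line
    return torch.mean((model(x_t, t) - target_velocity) ** 2)
</syntaxhighlight>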
Training data
Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, where 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, a predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality). The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of the model's training data identified that out of a smaller subset of 12 million images taken from the original wider dataset used, approximately 47% of the sample size of images came from 100 different domains, with Pinterest taking up 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons. An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data.
Training procedures
The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. Final rounds of training additionally dropped 10% of text conditioning to improve Classifier-Free Diffusion Guidance.
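The text-conditioning dropout mentioned above can be illustrated with a small sketch; this is an invented illustration of the general idea, not the actual training code.

<syntaxhighlight lang="python">
# Illustrative sketch (not the actual training code): randomly dropping the
# text condition for roughly 10% of training examples, which is what enables
# classifier-free guidance at sampling time.
import random

def maybe_drop_caption(caption: str, drop_prob: float = 0.1) -> str:
    """Replace the caption with an empty string with probability drop_prob."""
    return "" if random.random() < drop_prob else caption

# Example: the model thereby also learns an unconditional ("empty prompt") mode.
print(maybe_drop_caption("a photograph of an astronaut riding a horse"))
</syntaxhighlight>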
The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.
SD3 was trained at a cost of around $10 million.
Limitations
Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset that consists of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution; the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution. Another challenge is in generating human limbs due to poor data quality of limbs in the LAION database. The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of such type can confound the model. Stable Diffusion XL (SDXL) version 1.0, released in July 2023, introduced native 1024x1024 resolution and improved generation for limbs and text.
Accessibility for individual developers can also be a problem. In order to customize the model for new use cases that are not included in the dataset, such as generating anime characters ("waifu diffusion"), new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of different use-cases, from medical imaging to algorithmically generated music. However, this fine-tuning process is sensitive to the quality of the new data; low-resolution images or resolutions that differ from the original training data can not only fail to teach the model the new task but also degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run models on consumer hardware. For example, the training process for waifu-diffusion requires a minimum of 30 GB of VRAM, which exceeds the usual resources provided in consumer GPUs such as Nvidia's GeForce 30 series, which has only about 12 GB.
The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. As a result, generated images reinforce social biases and are from a western perspective, as the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts that are written in English in comparison to those written in other languages, with western or white cultures often being the default representation.
End-user fine-tuning
To address the limitations of the model's initial training, end-users may opt to implement additional training to fine-tune generation outputs to match more specific use-cases, a process also referred to as personalization. There are three methods by which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint (a code sketch follows the list below):
- An "embedding" can be trained from a collection of user-provided images, and allows the model to generate visually similar images whenever the name of the embedding is used within a generation prompt. Embeddings are based on the "textual inversion" concept developed by researchers from Tel Aviv University in 2022 with support from Nvidia, where vector representations for specific tokens used by the model's text encoder are linked to new pseudo-words. Embeddings can be used to reduce biases within the original model, or mimic visual styles.
- A "hypernetwork" is a small pretrained neural network that is applied to various points within a larger neural network, and refers to the technique created by NovelAI developer Kurumuz in 2021, originally intended for text-generation transformer models. Hypernetworks steer results towards a particular direction, allowing Stable Diffusion-based models to imitate the art style of specific artists, even if the artist is not recognised by the original model; they process the image by finding key areas of importance such as hair and eyes, and then patch these areas in secondary latent space.
- DreamBooth is a deep learning generation model developed by researchers from Google Research and Boston University in 2022 which can fine-tune the model to generate precise, personalised outputs that depict a specific subject, following training via a set of images which depict the subject.
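As promised above, loading a learned textual-inversion embedding at inference time can look like the following; it assumes the third-party diffusers library, and the embedding repository and pseudo-word token shown are hypothetical examples.

<syntaxhighlight lang="python">
# Sketch: applying a learned textual-inversion "embedding" at inference time.
# The library, embedding repository, and placeholder token are hypothetical.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # adds a new pseudo-word token

# The pseudo-word can now be used inside prompts like any other word.
image = pipe("a drawing of a <cat-toy> riding a bicycle").images[0]
image.save("cat_toy.png")
</syntaxhighlight>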
Capabilities
The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis") through its diffusion-denoising mechanism. In addition, the model also allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist.
Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32 to trade off model quality for lower VRAM usage.
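A minimal sketch of the half-precision loading described above, assuming the third-party diffusers library and an illustrative model identifier.

<syntaxhighlight lang="python">
# Sketch: float16 weights roughly halve VRAM use compared with the default
# float32. Library and model identifier are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("a watercolor painting of a lighthouse").images[0]
</syntaxhighlight>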
Text to image generation
Demonstration of the effect of negative prompts on image generation:
- Top: no negative prompt
- Centre: "green trees"
- Bottom: "round stones, round rocks"
The text to image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt. Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, although this watermark loses its efficacy if the image is resized or rotated.
Each txt2img generation will involve a specific seed value which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use the same seed to obtain the same image output as a previously generated image. Users are also able to adjust the number of inference steps for the sampler; a higher value takes a longer duration of time, however a smaller value may result in visual defects. Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt. More experimentative use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.
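The effect of the classifier-free guidance scale can be summarised in one line: at each sampling step the unconditional and text-conditional noise predictions are combined, and the scale controls how far the result is pushed towards the conditional prediction. A simplified, implementation-agnostic sketch follows.

<syntaxhighlight lang="python">
# Simplified illustration of classifier-free guidance (not the exact code of
# any particular implementation).
def apply_guidance(noise_uncond, noise_cond, guidance_scale: float):
    # guidance_scale = 1.0 reproduces the purely conditional prediction;
    # larger values push the result further towards the text prompt.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
</syntaxhighlight>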
Additional text2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis to keywords by enclosing them with brackets. An alternative method of adjusting weight to parts of the prompt are "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example.
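For illustration, the options discussed above (seed, sampling steps, guidance scale, and a negative prompt) can be sketched with the third-party diffusers library; the model identifier and prompts are illustrative assumptions.

<syntaxhighlight lang="python">
# Sketch of txt2img options: fixed seed, number of sampling steps, guidance
# scale, and a negative prompt. Identifiers and prompts are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)    # fixed seed => reproducible output
image = pipe(
    prompt="a photograph of an astronaut riding a horse",
    negative_prompt="mangled hands, blurry",            # features to steer away from
    num_inference_steps=30,                             # more steps: slower, often cleaner
    guidance_scale=7.5,                                 # how closely to follow the prompt
    generator=generator,
).images[0]
image.save("astronaut.png")
</syntaxhighlight>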
Image modification
Demonstration of img2img modification:
- Left: original image created with Stable Diffusion 1.5
- Right: modified image created with Stable Diffusion XL 1.0
Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, path to an existing image, and strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the output image. A higher strength value produces more variation within the image but may produce an image that is not semantically consistent with the prompt provided.
There are different methods for performing img2img. The main method is SDEdit, which first adds noise to an image, then denoises it as usual in text2img.
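A minimal SDEdit-style img2img sketch with the strength parameter, again assuming the third-party diffusers library and illustrative file names.

<syntaxhighlight lang="python">
# Sketch of SDEdit-style img2img with the "strength" parameter described
# above. Library, model identifier, and file names are illustrative.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))
# strength controls how much noise is added: 0.0 returns the input unchanged,
# 1.0 effectively ignores it and behaves like text-to-image generation.
image = pipe(prompt="a fantasy castle at sunset", image=init_image, strength=0.6).images[0]
image.save("castle.png")
</syntaxhighlight>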
The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized. The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image. Additionally, Stable Diffusion has been experimented with as a tool for image compression. Compared to JPEG and WebP, the recent methods used for image compression in Stable Diffusion face limitations in preserving small text and faces.
Additional use-cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask, which fills the masked space with newly generated content based on the provided prompt. A dedicated model specifically fine-tuned for inpainting use-cases was created by Stability AI alongside the release of Stable Diffusion 2.0. Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.
A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image, and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output.
ControlNet
ControlNet is a neural network architecture designed to manage diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models. The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, preventing any distortion caused by ControlNet. No layer is trained from scratch; the process is still fine-tuning, keeping the original model secure. This method enables training on small-scale or even personal devices.
User Interfaces
Stability provides an online image generation service called DreamStudio. The company also released an open source version of DreamStudio called StableStudio. In addition to Stability's interfaces, many third party open source interfaces exist, such as AUTOMATIC1111 Stable Diffusion Web UI, which is the most popular and offers extra features, Fooocus, which aims to decrease the amount of prompting needed by the user, and ComfyUI, which has a node-based user interface, essentially a visual programming language akin to many 3D modeling applications.
Releases
Version number | Release date | Parameters | Notes |
---|---|---|---|
1.1, 1.2, 1.3, 1.4 | August 2022 | All released by CompVis. There is no "version 1.0". 1.1 gave rise to 1.2, and 1.2 gave rise to both 1.3 and 1.4. | |
1.5 | October 2022 | 983M | Initialized with the weights of 1.2, not 1.4. Released by RunwayML. |
2.0 | November 2022 | Retrained from scratch on a filtered dataset. | |
2.1 | December 2022 | Initialized with the weights of 2.0. | |
XL 1.0 | July 2023 | 3.5B | The XL 1.0 base model has 3.5 billion parameters, making it around 3.5x larger than previous versions. |
XL Turbo | November 2023 | Distilled from XL 1.0 to run in fewer diffusion steps. | |
3.0 | February 2024 (early preview) | 800M to 8B | A family of models. |
3.5 | October 2024 | 2.5B to 8B | A family of models with Large (8 billion parameters), Large Turbo (distilled from SD 3.5 Large), and Medium (2.5 billion parameters). |
Key papers
- Learning Transferable Visual Models From Natural Language Supervision (2021). This paper describes the CLIP method for training text encoders, which convert text into floating point vectors. Such text encodings are used by the diffusion model to create images.
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations (2021). This paper describes SDEdit, aka "img2img".
- High-Resolution Image Synthesis with Latent Diffusion Models (2021, updated in 2022). This paper describes the latent diffusion model (LDM). This is the backbone of the Stable Diffusion architecture.
- Classifier-Free Diffusion Guidance (2022). This paper describes CFG, which allows the text encoding vector to steer the diffusion model towards creating the image described by the text.
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2023). Describes SDXL.
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (2022). Describes rectified flow, which is used for the backbone architecture of SD 3.0.
- Scaling Rectified Flow Transformers for High-resolution Image Synthesis (2024). Describes SD 3.0.
Training cost
- SD 2.0: 0.2 million hours on A100 (40GB).
Stable Diffusion 3.5 Large was made available for enterprise usage on Amazon Bedrock of Amazon Web Services.
Usage and controversy
Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals.
The images Stable Diffusion was trained on have been filtered without human input, leading to some harmful images and large amounts of private and sensitive information appearing in the training data.
More traditional visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors.
Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, in comparison to other commercial products based on generative AI. Addressing the concerns that the model may be used for abusive purposes, CEO of Stability AI, Emad Mostaque, argues that " peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology", and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences. In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end corporate control and dominance over such technologies, who have previously only developed closed AI systems for image synthesis. This is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code.
Controversy around photorealistic sexualized depictions of underage characters have been brought up, due to such images generated by Stable Diffusion being shared on websites such as Pixiv.
In June of 2024, a hack on an extension of ComfyUI, a user interface for Stable Diffusion, took place, with the hackers claiming they targeted users who committed "one of our sins", which included AI-art generation, art theft, promoting cryptocurrency.
Litigation
Andersen, McKernan, and Ortiz v. Stability AI, Midjourney, and DeviantArt
In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.
In July 2023, U.S. District Judge William Orrick inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file a new complaint, providing them an opportunity to reframe their arguments.
Getty Images v. Stability AI
In January 2023, Getty Images initiated legal proceedings against Stability AI in the English High Court, alleging significant infringement of its intellectual property rights. Getty Images claims that Stability AI "scraped" millions of images from Getty’s websites without consent and used these images to train and develop its deep-learning Stable Diffusion model.
Key points of the lawsuit include:
- Getty Images asserting that the training and development of Stable Diffusion involved the unauthorized use of its images, which were downloaded on servers and computers that were potentially in the UK. However, Stability AI argues that all training and development took place outside the UK, specifically in U.S. data centers operated by Amazon Web Services.
- Stability AI applied for reverse summary judgment and/or strike out of two claims: the training and development claim, and the secondary infringement of copyright claim. The High Court, however, refused to strike out these claims, allowing them to proceed to trial. The court is to determine whether the training and development of Stable Diffusion occurred in the UK, which is crucial for establishing jurisdiction under the UK's Copyright, Designs and Patents Act 1988 (CDPA).
- The secondary infringement claim revolves around whether the pre-trained Stable Diffusion software, made available in the UK through platforms like GitHub, HuggingFace, and DreamStudio, constitutes an "article" under sections 22 and 23 of the CDPA. The court will decide whether the term "article" can encompass intangible items such as software.
The trial is expected to take place in summer 2025 and is expected to have significant implications for UK copyright law and the licensing of AI-generated content.
License
Unlike models such as DALL-E, Stable Diffusion makes its source code available, along with the pretrained model weights. Prior to Stable Diffusion 3, it applied the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (the "M" in OpenRAIL-M). The license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... legally protected characteristics or categories". The user owns the rights to their generated output images and is free to use them commercially.
Stable Diffusion 3.5 applies the permissive Stability AI Community License, while commercial enterprises with revenue exceeding US$1 million require the Stability AI Enterprise License instead. As with the OpenRAIL-M license, the user retains the rights to their generated output images and is free to use them commercially.
See also
References
- "Stable Diffusion 3.5". Stability AI. Archived from the original on October 23, 2024. Retrieved October 23, 2024.
- Ryan O'Connor (August 23, 2022). "How to Run Stable Diffusion Locally to Generate Images". Archived from the original on October 13, 2023. Retrieved May 4, 2023.
- "Diffuse The Rest - a Hugging Face Space by huggingface". huggingface.co. Archived from the original on September 5, 2022. Retrieved September 5, 2022.
- "Leaked deck raises questions over Stability AI's Series A pitch to investors". sifted.eu. Archived from the original on June 29, 2023. Retrieved June 20, 2023.
- "Revolutionizing image generation by AI: Turning text into images". www.lmu.de. Archived from the original on September 17, 2022. Retrieved June 21, 2023.
- Mostaque, Emad (November 2, 2022). "Stable Diffusion came from the Machine Vision & Learning research group (CompVis) @LMU_Muenchen". Twitter. Archived from the original on July 20, 2023. Retrieved June 22, 2023.
- ^ "Stable Diffusion Launch Announcement". Stability.Ai. Archived from the original on September 5, 2022. Retrieved September 6, 2022.
- ^ "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. September 17, 2022. Archived from the original on January 18, 2023. Retrieved September 17, 2022.
- "The new killer app: Creating AI art will absolutely crush your PC". PCWorld. Archived from the original on August 31, 2022. Retrieved August 31, 2022.
- ^ Vincent, James (September 15, 2022). "Anyone can use this AI art generator — that's the risk". The Verge. Archived from the original on January 21, 2023. Retrieved September 30, 2022.
- "CompVis/Latent-diffusion". GitHub.
- "Stable Diffusion 3: Research Paper".
- "Home". Computer Vision & Learning Group. Retrieved September 5, 2024.
- ^ Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv:2112.10752. Archived (PDF) from the original on January 20, 2023. Retrieved September 17, 2022.
- ^ Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Archived from the original on November 1, 2022. Retrieved October 31, 2022.
- Foster, David. "8. Diffusion Models". Generative Deep Learning (2nd ed.). O'Reilly.
- Sohl-Dickstein, Jascha; Weiss, Eric A.; Maheswaranathan, Niru; Ganguli, Surya (March 12, 2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". arXiv:1503.03585.
- "Stable diffusion pipelines". huggingface.co. Archived from the original on June 25, 2023. Retrieved June 22, 2023.
- "Text-to-Image Generation with Stable Diffusion and OpenVINO™". openvino.ai. Intel. Retrieved February 10, 2024.
- ^ Podell, Dustin; English, Zion; Lacey, Kyle; Blattmann, Andreas; Dockhorn, Tim; Müller, Jonas; Penna, Joe; Rombach, Robin (July 4, 2023). "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis". arXiv:2307.01952 .
- ^ Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (March 5, 2024), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv:2403.03206
- ^ Liu, Xingchao; Gong, Chengyue; Liu, Qiang (September 7, 2022), Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, arXiv:2209.03003
- ^ "Rectified Flow — Rectified Flow". www.cs.utexas.edu. Retrieved March 6, 2024.
- ^ Baio, Andy (August 30, 2022). "Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator". Waxy.org. Archived from the original on January 20, 2023. Retrieved November 2, 2022.
- "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review. Archived from the original on January 14, 2023. Retrieved November 2, 2022.
- ^ Brunner, Katharina; Harlan, Elisa (July 7, 2023). "We Are All Raw Material for AI". Bayerischer Rundfunk (BR). Archived from the original on September 12, 2023. Retrieved September 12, 2023.
- Schuhmann, Christoph (November 2, 2022), CLIP+MLP Aesthetic Score Predictor, archived from the original on June 8, 2023, retrieved November 2, 2022
- "LAION-Aesthetics | LAION". laion.ai. Archived from the original on August 26, 2022. Retrieved September 2, 2022.
- ^ Ho, Jonathan; Salimans, Tim (July 25, 2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 .
- Mostaque, Emad (August 28, 2022). "Cost of construction". Twitter. Archived from the original on September 6, 2022. Retrieved September 6, 2022.
- ^ "CompVis/stable-diffusion-v1-4 · Hugging Face". huggingface.co. Archived from the original on January 11, 2023. Retrieved November 2, 2022.
- Wiggers, Kyle (August 12, 2022). "A startup wants to democratize the tech behind DALL-E 2, consequences be damned". TechCrunch. Archived from the original on January 19, 2023. Retrieved November 2, 2022.
- emad_9608 (April 19, 2024). "10m is about right". r/StableDiffusion. Retrieved April 25, 2024.
- ^ "Stable Diffusion with 🧨 Diffusers". huggingface.co. Archived from the original on January 17, 2023. Retrieved October 31, 2022.
- ^ "Stable Diffusion 2.0 Release". stability.ai. Archived from the original on December 10, 2022.
- "LAION". laion.ai. Archived from the original on October 16, 2023. Retrieved October 31, 2022.
- "Generating images with Stable Diffusion". Paperspace Blog. August 24, 2022. Archived from the original on October 31, 2022. Retrieved October 31, 2022.
- "Announcing SDXL 1.0". Stability AI. Archived from the original on July 26, 2023. Retrieved August 21, 2023.
- Edwards, Benj (July 27, 2023). "Stability AI releases Stable Diffusion XL, its next-gen image synthesis model". Ars Technica. Archived from the original on August 21, 2023. Retrieved August 21, 2023.
- "hakurei/waifu-diffusion · Hugging Face". huggingface.co. Archived from the original on October 8, 2023. Retrieved October 31, 2022.
- Chambon, Pierre; Bluethgen, Christian; Langlotz, Curtis P.; Chaudhari, Akshay (October 9, 2022). "Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains". arXiv:2210.04133 .
- Seth Forsgren; Hayk Martiros. "Riffusion - Stable diffusion for real-time music generation". Riffusion. Archived from the original on December 16, 2022.
- Mercurio, Anthony (October 31, 2022), Waifu Diffusion, archived from the original on October 31, 2022, retrieved October 31, 2022
- Smith, Ryan. "NVIDIA Quietly Launches GeForce RTX 3080 12GB: More VRAM, More Power, More Money". www.anandtech.com. Archived from the original on August 27, 2023. Retrieved October 31, 2022.
- Dave James (October 28, 2022). "I thrashed the RTX 4090 for 8 hours straight training Stable Diffusion to paint like my uncle Hermann". PC Gamer. Archived from the original on November 9, 2022.
- Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (August 2, 2022). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion". arXiv:2208.01618 .
- "NovelAI Improvements on Stable Diffusion". NovelAI. October 11, 2022. Archived from the original on October 27, 2022.
- Yuki Yamashita (September 1, 2022). "愛犬の合成画像を生成できるAI 文章で指示するだけでコスプレ 米Googleが開発". ITmedia Inc. (in Japanese). Archived from the original on August 31, 2022.
- Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (August 2, 2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv:2108.01073 .
- ^ "Stable Diffusion web UI". GitHub. November 10, 2022. Archived from the original on January 20, 2023. Retrieved September 27, 2022.
- invisible-watermark, Shield Mountain, November 2, 2022, archived from the original on October 18, 2022, retrieved November 2, 2022
- "stable-diffusion-tools/emphasis at master · JohannesGaessler/stable-diffusion-tools". GitHub. Archived from the original on October 2, 2022. Retrieved November 2, 2022.
- "Stable Diffusion v2.1 and DreamStudio Updates 7-Dec 22". stability.ai. Archived from the original on December 10, 2022.
- ^ Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (January 4, 2022). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv:2108.01073 .
- ^ Luzi, Lorenzo; Siahkoohi, Ali; Mayer, Paul M.; Casco-Rodriguez, Josue; Baraniuk, Richard (October 21, 2022). "Boomerang: Local sampling on image manifolds using diffusion models". arXiv:2210.12100 .
- Bühlmann, Matthias (September 28, 2022). "Stable Diffusion Based Image Compression". Medium. Archived from the original on November 2, 2022. Retrieved November 2, 2022.
- Zhang, Lvmin (February 10, 2023). "Adding Conditional Control to Text-to-Image Diffusion Models". arXiv:2302.05543 .
- Edwards, Benj (November 10, 2022). "Stable Diffusion in your pocket? "Draw Things" brings AI images to iPhone". Ars Technica. Retrieved July 10, 2024.
- Wendling, Mike (March 6, 2024). "AI can be easily used to make fake election photos - report". bbc.com. Retrieved July 10, 2024. "The CCDH, a campaign group, tested four of the largest public-facing AI platforms: Midjourney, OpenAI's ChatGPT Plus, Stability.ai's DreamStudio and Microsoft's Image Creator."
- Wiggers, Kyle (May 18, 2023). "Stability AI open sources its AI-powered design studio". TechCrunch. Retrieved July 10, 2024.
- Weatherbed, Jess (May 17, 2023). "Stability AI is open-sourcing its DreamStudio web app". The Verge.
- Mann, Tobias (June 29, 2024). "A friendly guide to local AI image gen with Stable Diffusion and Automatic1111". The Register.
- Hachman, Mark. "Fooocus is the easiest way to create AI art on your PC". PCWorld.
- "ComfyUI Workflows and what you need to know". thinkdiffusion.com. December 2023. Retrieved July 10, 2024.
- "ComfyUI". github.com. Retrieved July 10, 2024.
- Huang, Yenkai (May 10, 2024). Latent Auto-recursive Composition Engine (M.S. Computer Science thesis). Dartmouth College. Retrieved July 10, 2024.
- "CompVis/stable-diffusion-v1-4 · Hugging Face". huggingface.co. Archived from the original on January 11, 2023. Retrieved August 17, 2023.
- "CompVis (CompVis)". huggingface.co. August 23, 2023. Retrieved March 6, 2024.
- "runwayml/stable-diffusion-v1-5 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
- ^ "stabilityai/stable-diffusion-2 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
- "stabilityai/stable-diffusion-2-base · Hugging Face". huggingface.co. Retrieved January 1, 2024.
- "stabilityai/stable-diffusion-2-1 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
- "stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face". huggingface.co. Archived from the original on October 8, 2023. Retrieved August 17, 2023.
- "Announcing SDXL 1.0". Stability AI. Retrieved January 1, 2024.
- "stabilityai/sdxl-turbo · Hugging Face". huggingface.co. Retrieved January 1, 2024.
- "Adversarial Diffusion Distillation". Stability AI. Retrieved January 1, 2024.
- "Stable Diffusion 3". Stability AI. Retrieved March 5, 2024.
- ^ "Stable Diffusion 3.5". Stability AI. Archived from the original on October 23, 2024. Retrieved October 23, 2024.
- Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela (February 26, 2021). "Learning Transferable Visual Models From Natural Language Supervision". arXiv:2103.00020 .
- Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2022). "High-Resolution Image Synthesis With Latent Diffusion Models". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695. arXiv:2112.10752.
- Kerner, Sean Michael (December 19, 2024). "Stable Diffusion 3.5 hits Amazon Bedrock: What it means for enterprise AI workflows". VentureBeat. Retrieved December 25, 2024.
- "LICENSE.md · stabilityai/stable-diffusion-xl-base-1.0 at main". huggingface.co. July 26, 2023. Retrieved January 1, 2024.
- Heikkilä, Melissa (September 16, 2022). "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review. Archived from the original on January 14, 2023. Retrieved September 26, 2022.
- ^ Ryo Shimizu (August 26, 2022). "Midjourneyを超えた? 無料の作画AI「 #StableDiffusion 」が「AIを民主化した」と断言できる理由". Business Insider Japan (in Japanese). Archived from the original on December 10, 2022. Retrieved October 4, 2022.
- Cai, Kenrick. "Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion". Forbes. Archived from the original on September 30, 2023. Retrieved October 31, 2022.
- "Illegal trade in AI child sex abuse images exposed". BBC News. June 27, 2023. Archived from the original on September 21, 2023. Retrieved September 26, 2023.
- Maiberg, Emanuel (June 11, 2024). "Hackers Target AI Users With Malicious Stable Diffusion Tool on GitHub to Protest 'Art Theft'". 404 Media. Retrieved June 14, 2024.
- Vincent, James (January 16, 2023). "AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit". The Verge. Archived from the original on March 9, 2023. Retrieved January 16, 2023.
- Brittain, Blake (July 19, 2023). "US judge finds flaws in artists' lawsuit against AI companies". Reuters. Archived from the original on September 6, 2023. Retrieved August 6, 2023.
- Goosens, Sophia (February 28, 2024). "Getty Images v Stability AI: the implications for UK copyright law and licensing".
- Gill, Dennis (December 11, 2023). "Getty Images v Stability AI: copyright claims can proceed to trial".
- Goosens, Sophia (February 28, 2024). "Getty v. Stability AI case goes to trial in the UK – what we learned".
- ^ Hill, Charlotte (February 16, 2024). "Generative AI in the courts: Getty Images v Stability AI".
- "Stable Diffusion Public Release". Stability.Ai. Archived from the original on August 30, 2022. Retrieved August 31, 2022.
- "From RAIL to Open RAIL: Topologies of RAIL Licenses". Responsible AI Licenses (RAIL). August 18, 2022. Archived from the original on July 27, 2023. Retrieved February 20, 2023.
- "Ready or not, mass video deepfakes are coming". The Washington Post. August 30, 2022. Archived from the original on August 31, 2022. Retrieved August 31, 2022.
- "License - a Hugging Face Space by CompVis". huggingface.co. Archived from the original on September 4, 2022. Retrieved September 5, 2022.
- Katsuo Ishida (August 26, 2022). "言葉で指示した画像を凄いAIが描き出す「Stable Diffusion」 ~画像は商用利用も可能". Impress Corporation (in Japanese). Archived from the original on November 14, 2022. Retrieved October 4, 2022.
- "Community License". Stability AI. July 5, 2024. Retrieved October 23, 2024.
External links
- Stable Diffusion Demo
- "Step by Step visual introduction to Diffusion Models. - Blog by Kemal Erdem". Retrieved August 31, 2024.
- "U-Net for Stable Diffusion". U-Net for Stable Diffusion. Retrieved August 31, 2024.
- Interactive Explanation of Stable Diffusion
- "We Are All Raw Material for AI": Investigation on sensitive and private data in Stable Diffusions training data
- "Negative Prompts in Stable Diffusion"
- "Negative Prompts in Stable Diffusion"