Elevenlabs in Taiwanese Parliament, FLUX Updates, Ted Chiang on AI Art | This Week in AI Art 💬
Cut through the noise, stay informed — new stories every Sunday.
pov: you spent the entire day collating all the community resources into a single document
In this issue:
- FLUX updates: new models, fine-tuned encoders, and LoRA training guides
- Ted Chiang on whether AI-generated art is really art
- ElevenLabs voice cloning makes its debut in the Taiwanese Parliament
FLUX UPDATES
General Updates
FluxMusic, a new text-to-music generation model, converts audio into a compressed representation using a VAE and mel-spectrograms. It employs multiple pre-trained text encoders to understand prompts, then uses a two-stage process: first applying attention to the text and music streams jointly, then refining the audio with music-only layers. This approach aims to create music that aligns with text descriptions while maintaining quality. The model has about 4 billion parameters and can run locally. Some users found the demo results passable but not as impressive as proprietary models, while others saw it as significant progress for open-source music generation. The GitHub repository is still in development, with plans for easier setup and potential integration with tools like ComfyUI.
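To make the two-stage design concrete, here is a toy PyTorch sketch of the flow described above (purely illustrative, not the actual FluxMusic code): joint attention over the concatenated text and music sequence, followed by music-only refinement blocks operating on VAE-compressed latents.

```python
import torch
import torch.nn as nn

class TwoStageMusicDenoiser(nn.Module):
    """Illustrative sketch: joint text+music blocks, then music-only blocks."""
    def __init__(self, dim=512, joint_blocks=4, music_blocks=4, heads=8):
        super().__init__()
        self.joint = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(joint_blocks)
        )
        self.music_only = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(music_blocks)
        )

    def forward(self, music_latents, text_tokens):
        # Stage 1: attend over the concatenated text and music sequence.
        x = torch.cat([text_tokens, music_latents], dim=1)
        for block in self.joint:
            x = block(x)
        # Stage 2: drop the text tokens and refine the music stream alone.
        x = x[:, text_tokens.shape[1]:]
        for block in self.music_only:
            x = block(x)
        return x

# Example shapes: batch of 2, 77 text tokens, 256 latent frames, width 512.
out = TwoStageMusicDenoiser()(torch.randn(2, 256, 512), torch.randn(2, 77, 512))
print(out.shape)  # torch.Size([2, 256, 512])
```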
u/zer0int1 announced a new fine-tuned CLIP-L text encoder for use with Flux.1, aimed at improving text and detail adherence in image generation. The model, available on Hugging Face, demonstrates better text rendering and detail accuracy than the original OpenAI CLIP-L, and it maintains compatibility with other frameworks, including Stable Diffusion 1.5, SDXL, and SD3. The author also shared insights into the training process, which used the T2I-COCO-SPRIGHT dataset and the CLIP fine-tuning code (which can be found here). The original thread can be found here.
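For those running Flux through diffusers rather than ComfyUI, swapping in a fine-tuned CLIP-L looks roughly like this; the encoder repo path below is a placeholder, so substitute the actual Hugging Face repo from the thread.

```python
import torch
from transformers import CLIPTextModel
from diffusers import FluxPipeline

# Placeholder repo path for the fine-tuned CLIP-L; use the real one.
text_encoder = CLIPTextModel.from_pretrained(
    "your-org/fine-tuned-clip-l", torch_dtype=torch.bfloat16
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,  # replaces the stock OpenAI CLIP-L
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps on smaller GPUs

# Text rendering is where the fine-tune is claimed to help most.
image = pipe("a sign that reads 'OPEN 24 HOURS'", num_inference_steps=28).images[0]
image.save("sign.png")
```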
u/terminusresearchorg announced the release of SimpleTuner v1.0, a major update to their AI model training tool. This release includes a significant refactoring of the trainer, improved attention masking for text inputs, and the introduction of a Trainer class for developers. Key features include better multi-GPU step tracking, faster processing due to optimized mask placement, and enhanced fine detail and text generation in trained models. The update also deprecates config.env files in favor of config.json or config.toml, and integrates previously hidden default settings from train.sh into train.py. A Jupyter Notebook is now available for interactive exploration of the Trainer class and can be found here. The full GitHub release notes are here, and the quickstart guide can be found here.
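As a rough illustration of the new format, a minimal config.json might look like the fragment below. The key names here are illustrative assumptions, not the authoritative schema; consult the quickstart guide linked above for the exact options.

```json
{
  "model_type": "lora",
  "model_family": "flux",
  "output_dir": "output/my-flux-lora",
  "learning_rate": 1e-4,
  "train_batch_size": 1,
  "max_train_steps": 2000,
  "lora_rank": 16
}
```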
LoRA Training Techniques
A post by u/tom83_be provides a detailed tutorial on training Flux.1 Dev LoRAs using "ComfyUI Flux Trainer" on GPUs with as little as 12 GB of VRAM. The guide outlines steps for setting up ComfyUI, installing necessary components like ComfyUI Manager and the Flux Trainer, and configuring the training process. The author recommends enabling split_mode for low VRAM usage (12 GB or less), setting network_dim and network_alpha to 64, and using Adafactor as the optimizer. The tutorial also covers model file placement, workflow configuration, and potential issues with validation sampling. OP reports achieving training speeds of about 9.5 seconds per iteration on a 3060 GPU with 12 GB VRAM, with the ability to train at 512x512 resolution. The full Reddit thread can be found here.
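Since "ComfyUI Flux Trainer" wraps kohya's sd-scripts, the guide's settings correspond roughly to a config fragment like this. In ComfyUI these values are set via node widgets rather than a file, and exact key names may vary by trainer version.

```toml
# The guide's values, expressed as a kohya-style TOML fragment.
network_dim = 64          # LoRA dimension
network_alpha = 64        # matched to network_dim per the guide
optimizer_type = "adafactor"
split_mode = true         # splits the model to fit roughly 12 GB of VRAM
resolution = "512,512"    # the guide trains at 512x512
```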
u/Nuckyduck shared a workflow for training local Flux LoRA models on 16GB GPUs using ComfyUI. The post describes a method for creating personalized AI models with limited hardware resources, achieving results in about 2 hours and 45 minutes on a 4070 Ti Super GPU. The workflow, available on GitHub, includes tools for preprocessing data and incorporates WD14 tagging. Key technical details include using a network dimension and alpha of 64, and employing a split mode for improved results. The Reddit thread can be found here.
u/Yacben presents a technique for training smaller and more efficient LoRA (Low-Rank Adaptation) models for the FLUX image generation model. The post suggests that good FLUX LoRAs can be as small as 4.5MB (128 dimensions) by training only one or two specific layers, such as single_transformer_blocks.7.proj_out and single_transformer_blocks.20.proj_out. This approach reportedly achieves 99% face likeness with great flexibility in as little as 580KB (single layer, 16 dimensions). The method offers significant benefits, including 30% faster training times, 40% VRAM savings, and reduced file sizes. The full reddit guide can be found here.
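A minimal sketch of the idea using peft (not u/Yacben's exact script): restrict the LoRA's target modules to the two named layers, here at rank 16 as in the smallest variant described above. The module paths follow the diffusers Flux transformer naming.

```python
from diffusers import FluxTransformer2DModel
from peft import LoraConfig, get_peft_model

# Load only the Flux transformer, which is where these layers live.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer"
)

lora_config = LoraConfig(
    r=16,            # the 16-dimension variant mentioned above
    lora_alpha=16,
    target_modules=[
        "single_transformer_blocks.7.proj_out",
        "single_transformer_blocks.20.proj_out",
    ],
)
model = get_peft_model(transformer, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the full model
```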
u/cocktail_peanut introduced Fluxgym, an open-source web UI for training Flux LoRAs with low VRAM requirements. Built on Kohya-ss/sd-scripts for training and utilizing AI-toolkit's Gradio UI with Florence-2 for auto-captioning, Fluxgym aims to simplify the LoRA training process. The tool supports training on GPUs with 12 GB, 16 GB, or 20 GB+ of VRAM, with reported training times of 58 minutes on an A4500 and 20 minutes on a 4090 for a single LoRA. Fluxgym includes features like automatic AI captioning and customizable training parameters. The project is available on GitHub and can be installed locally. The original thread can be found here.
Realism Update
u/KudzuEye provides an update on improved training approaches and inference techniques for creating realistic "boring" images with Flux. The post details a new LoRA (Low-Rank Adaptation) model trained on a small, balanced dataset of 30 images with simple captions. The author uses an older commit of AI-Toolkit with a latent shift bug fix, overtrains at 5000 steps with a 0.0005 learning rate, and experiments with LoRA strength during inference. He also recommends using Dynamic Thresholding with high negative guidance, the Heun/beta sampler combination, and prompts like "Flickr 2000s photo" to enhance realism. Links to the new LoRA version are provided on CivitAI and Hugging Face. The full Reddit comment outlining the process can be found here.
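An approximate diffusers translation of the inference side of that recipe is sketched below. The post itself targets ComfyUI, and Dynamic Thresholding is a ComfyUI extension with no one-line diffusers equivalent, so only the Heun sampler and LoRA-strength pieces are shown; the LoRA repo path is a placeholder for the release linked above.

```python
import torch
from diffusers import FluxPipeline, FlowMatchHeunDiscreteScheduler

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Heun sampler stands in for the Heun/beta combination mentioned above.
pipe.scheduler = FlowMatchHeunDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# Placeholder path; use the CivitAI or Hugging Face release from the post.
pipe.load_lora_weights("kudzueye/boring-reality-flux")
pipe.set_adapters(["default_0"], adapter_weights=[0.8])  # experiment with strength

image = pipe(
    "Flickr 2000s photo of a man mowing his lawn",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("boring.png")
```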
IS AI ART... ART?
In his New Yorker essay "Why A.I. Isn't Going to Make Art," science fiction author and former computer scientist Ted Chiang critically examines the relationship between AI and artistic creation. As AI technologies advance, Chiang challenges the notion that machines can truly create art, exploring themes of creativity, intention, and intelligence. His analysis raises important questions about the future of human expression in an AI-driven world.
Let's start with the core of the essay. What's the author's main argument about AI and art creation?
The author's main argument is that AI fundamentally cannot create true art. He contends that art emerges from countless intentional decisions made at every level of the creative process, from broad concepts to minute details. While AI can generate content by aggregating or mimicking existing works, it lacks the deep intentionality and lived experience that infuse human-created art with meaning and originality. This argument challenges the notion that AI can truly replicate the artistic process, suggesting instead that it's merely producing sophisticated imitations.
That's interesting. How does he define art, and why does he think this definition matters?
The essay defines art as the result of making numerous choices. This definition is crucial because it emphasizes the process of creation rather than just the end product. It suggests that artistry lies in the interplay between high-level conceptual decisions and granular execution choices. By framing art this way, the author highlights why AI-generated content falls short: it can't make the kind of meaningful, intentional choices that human artists do. This definition asserts that the value of art is intrinsically tied to the human experience and decision-making process behind its creation.
So how does the author view the role of human decision-making and intention in creating art? And why does he believe AI systems fundamentally lack these qualities?
Chiang views human intention and decision-making as fundamental to art creation. He argues it's not just about making choices, but about making meaningful choices rooted in lived experience, emotion, and a desire to communicate. In contrast, he contends that AI lacks true intentionality or understanding. While AI can generate outputs based on statistical patterns in its training data, it doesn't have genuine thoughts or feelings to express. This absence of true intention means that AI-generated content, no matter how superficially impressive, lacks the deeper resonance and meaning of human-created art. The essay suggests that this fundamental difference makes AI incapable of creating true art in the deepest sense.
Let's zoom out a bit. What distinctions does the author make between human intelligence and AI capabilities?
The author makes a critical distinction between skill and intelligence, drawing on François Chollet's definition. He argues that while AI systems can achieve high levels of skill in specific domains, they lack true intelligence: the ability to efficiently learn new skills and adapt to novel situations. To illustrate this, he contrasts the enormous amount of training data required for AI systems to master tasks with the relatively quick learning capabilities of humans and animals. For instance, he mentions how rats can learn to drive tiny cars in just 24 training sessions, while AI systems need millions of iterations to master new skills. This distinction challenges the notion of AI as truly "intelligent" and highlights his perspective on the unique adaptability of human cognition.
Finally, given all of this, what concerns does the author raise about the widespread use of generative AI for writing tasks?
Regarding the widespread use of generative AI for writing tasks, the author raises several profound concerns. He worries that by removing genuine intention and lived experience from writing, AI risks stripping communication of its deeper meaning and emotional resonance. There's a fear that using AI for writing tasks could undermine the critical thinking and cognitive development that writing practices are meant to foster, especially in educational settings. The ease of generating AI content might lead to an overwhelming abundance of low-quality, meaningless text, potentially diminishing our standards for both what we read and what we write. The author is concerned about the potential degradation of discourse and the loss of authenticity in communication. Even if AI-generated text is coherent or unremarkable, he argues it lacks the genuine human touch that makes communication meaningful. These concerns go beyond just the quality of AI-generated content, touching on broader implications for human creativity, communication, and the value we place on authentic expression. The essay challenges us to consider what might be lost if we increasingly rely on AI to mediate our thoughts and creative expressions.
Where do you stand on AI-generated art?
AI AUDIO ASSISTS LEGISLATOR'S PARLIAMENTARY QUESTIONING
Dr. Chen Ching-Hui, a Taiwanese legislator, lost her voice before a crucial questioning session with the Premier. Her team used ElevenLabs' voice cloning technology to create an AI version of her voice.
A Taiwan Parliament first!
AI Audio assisting a legislator during her questioning with the premier.
More on the story here: elevenlabs.io/blog/taiwan-pa…
— ElevenLabs (@elevenlabsio)
3:42 PM • Sep 3, 2024
They obtained approval from parliament officials to use the AI voice, as parliament rules require statements to be spoken aloud for official records. This marked Taiwan's first use of AI voice cloning in parliamentary proceedings.
The event has sparked discussions about further AI applications in parliament, such as automating the reading of long bills. Dr. Ju Chun Ko, who assisted in this process, views it as an example of human-AI collaboration enhancing democratic processes.
Dr. Ko plans to teach AI tools to his Youth League, believing it's important for young leaders to understand and use these technologies in politics.
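For the curious, a voice clone like the one used here can be produced with a few SDK calls. This is a rough sketch against the ElevenLabs Python client (v1-style method names, which may differ across SDK versions), not the parliamentary team's actual pipeline.

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Clone a voice from clean recordings of the speaker.
voice = client.clone(
    name="Legislator voice",
    files=["speech_sample_1.mp3", "speech_sample_2.mp3"],
)

# Generate speech in the cloned voice; the SDK streams audio chunks.
audio = client.generate(
    text="Mr. Premier, I would like to ask about...",
    voice=voice,
)
with open("question.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```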