FLUX Optimizations, Darth Vader's AI Voice, 3D AI Tools | This Week in AI Art 🧊

Cut through the noise, stay informed — new stories every Sunday.

FLUX UPDATES

Performance Improvements
  • u/Serasul reported a 53.88% speedup for the Flux.1-Dev model using torch.compile() [LINK_TO_GITHUB_REPO]. The optimization leverages FP8 precision and is currently limited to Linux systems with NVIDIA GPUs of Ada architecture or newer (4000 series and up); the speedup was demonstrated on high-end GPUs like the A100 and H100. While promising, some users raised concerns about trade-offs between speed and output quality, as well as compatibility issues with certain workflows, particularly those involving LoRA models (a minimal torch.compile() sketch follows this list). [ORIGINAL_REDDIT_THREAD]

  • u/Christianman88 demonstrated running the Flux image generation model on a low-end NVIDIA GTX 1060 6GB using the Forge web UI, achieving 8-13 minute generation times for 896x1152 images with the flux1-dev-bnb-nf4-v2.safetensors checkpoint. The post shows that older NVIDIA GPUs remain viable for AI workflows, albeit with longer processing times. Key optimizations include quantized models like NF4 to reduce VRAM usage (see the NF4 sketch after this list) and optionally adding "Hyper" LoRA models to cut the number of required inference steps. Forge is noted as more user-friendly and faster at loading the Flux model than alternatives like ComfyUI, especially for users with limited VRAM. [ORIGINAL_REDDIT_THREAD]

  • u/Iory1998 provided a comprehensive comparison of quantization levels for the Flux.1 image generation model, weighing model size, VRAM usage, and output quality across FP16, Q8_0, Q6_KM, Q5_1, Q5_0, Q4_0, and NF4. The author recommends Q8 for 24GB VRAM systems, Q6_KM for 16GB, Q5_1 for 12GB, and Q4_0 or Q4_1 for systems with less than 10GB, and emphasizes loading the text encoders into RAM to optimize VRAM usage and generation speed (see the offload sketch after this list). [ORIGINAL_REDDIT_THREAD]

  • u/mrfofr shared a technique for fine-tuning only specific layers in Flux to speed up training and inference while maintaining quality. The method targets layers 7, 12, 16, and 20 via the regex "transformer.single_transformer_blocks.(7|12|16|20).proj_out" in the Replicate Flux trainer and can yield 15-20% faster inference. The post also provides a full list of targetable layers and suggests using custom captions for experiments; OP recommends the Replicate CLI for queueing multiple experiments with similar parameters. The technique aims to balance training speed, likeness quality, and inference performance, though the optimal configuration may vary with the subject or style being trained (see the layer-freezing sketch after this list). [ORIGINAL_REDDIT_THREAD]

  • u/protector111 shared test results for FLUX's --fast mode on an RTX 4090, focusing on speed, quality, and LoRA likeness degradation. The tests ran in ComfyUI on Windows 10 with PyTorch 2.4.1+cu121 and xformers 0.0.28.dev895, comparing data types (fp16, fp8_E4m3fn, and fp8_e5m2) on rendering time and output quality across several image generation tasks. The --fast mode, which uses fp8 operations, showed significant speed gains, particularly with fp8_E4m3fn, while maintaining quality comparable to the default fp16 when paired with higher step counts and upscaling. The findings suggest fp8_E4m3fn with --fast mode offers a good balance of speed and quality for Flux workflows (see the fp8 snippet after this list). [ORIGINAL_REDDIT_THREAD]
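
A minimal sketch of u/Serasul's torch.compile() idea using the diffusers FluxPipeline; the model ID, compile flags, and prompt here are illustrative, not the linked repo's exact setup:

```python
import torch
from diffusers import FluxPipeline

# The FP8 paths described in the post also require Linux and an
# Ada-or-newer NVIDIA GPU; this sketch shows only the compile step.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the transformer (the main denoising network) into fused kernels.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# The first call pays a one-time compilation cost; later calls run faster.
image = pipe("a glass ice cube on dark slate", num_inference_steps=28).images[0]
image.save("flux_compiled.png")
```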
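u/Christianman88's run relies on Forge's prequantized checkpoint; a rough diffusers equivalent of the same NF4 idea (my own translation, not the post's workflow) would look like this:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# NF4 stores weights in 4 bits, roughly quartering the transformer's VRAM
# footprint versus FP16 (needs a recent diffusers plus bitsandbytes).
nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
```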
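For u/Iory1998's "text encoders in RAM" advice, diffusers' model offloading is one way to approximate it; a sketch:

```python
import torch
from diffusers import FluxPipeline

# Requires accelerate. Each submodel (CLIP, T5, transformer, VAE) lives in
# system RAM and is moved to the GPU only while it runs, so the large T5
# encoder never competes with the transformer for VRAM.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = pipe("a glass ice cube on dark slate", num_inference_steps=28).images[0]
```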
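u/mrfofr's layer targeting is configured via regex in the Replicate trainer; outside that trainer, the same selection can be done by hand in PyTorch by freezing everything except the matched proj_out tensors (a sketch, not the trainer's actual code):

```python
import re
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Note: named_parameters() on the transformer module itself omits the
# "transformer." prefix carried by the trainer's regex.
pattern = re.compile(r"single_transformer_blocks\.(7|12|16|20)\.proj_out")
for name, param in transformer.named_parameters():
    param.requires_grad = bool(pattern.search(name))

trainable = [n for n, p in transformer.named_parameters() if p.requires_grad]
print(f"training {len(trainable)} tensors")  # weight + bias for each of the 4 blocks
```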
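And to make u/protector111's fp8 comparison concrete: e4m3 spends its 8 bits on mantissa (precision, max ~448), e5m2 on exponent (range, max ~57344), which a simple round-trip cast shows directly:

```python
import torch

x = torch.randn(6)
print(x)
print(x.to(torch.float8_e4m3fn).to(torch.float32))  # closer to the originals
print(x.to(torch.float8_e5m2).to(torch.float32))    # visibly coarser steps
```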

Practical Application
  • u/dal_mac shared their experience transitioning a remote photography service to FLUX, detailing a workflow for highly accurate AI-generated portraits via LoRA training on 12-24 client photos. Key technical points include using the "Acorn Is Spinning" base model for improved skin textures, applying 75-150 total training steps per image (a quick back-of-envelope on that budget follows below), and using multi-resolution training in some cases. The poster emphasizes careful dataset curation: removing repetitive images and balancing representation of facial details. They also use ComfyUI for generation, Ultimate SD Upscale for post-processing, and film grain for a more realistic look. [ORIGINAL_REDDIT_THREAD]
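
A quick back-of-envelope on that training budget, assuming (my reading of the post, not a stated formula) that the per-image step count multiplies by dataset size:

```python
# Hypothetical numbers chosen inside the ranges quoted in the post.
num_images = 18        # midpoint of the 12-24 client photos
steps_per_image = 100  # within the quoted 75-150 per-image budget
print(num_images * steps_per_image)  # -> 1800 total training steps
```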

Technical Insight
  • u/tabula_rasa22 provided an overview of how Flux processes text prompts. Flux uses both CLIP and T5 to interpret prompts, with T5 acting as an intermediary that guides CLIP throughout image generation: CLIP tokenizes the prompt and anchors it to visual concepts, while T5 applies natural language processing for a more nuanced interpretation. This dual-model approach lets Flux handle more complex prompts and adhere more closely to user intent. OP notes that Flux's text processing is more sophisticated than that of earlier diffusion models, potentially improving results for casual users while introducing new complexities for fine-tuning and customization (see the sketch below). [ORIGINAL_REDDIT_THREAD]
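
The dual-encoder split u/tabula_rasa22 describes is exposed directly in diffusers' FluxPipeline: `prompt` feeds CLIP while `prompt_2` feeds T5, so each encoder can receive differently phrased text (the prompts below are illustrative):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    # `prompt` goes to CLIP: short, keyword-style text plays to its strengths.
    prompt="ice cube, studio photo, dramatic lighting",
    # `prompt_2` goes to T5: a full natural-language description.
    prompt_2="a single glass-clear ice cube on dark slate, softbox lighting, "
             "shallow depth of field",
    num_inference_steps=28,
).images[0]
```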

JAMES EARL JONES PAVES WAY FOR AI VOICES

James Earl Jones, the iconic voice of Darth Vader, died on September 9, 2024, at the age of 93. In 2022, Jones signed over the rights to his voice to Lucasfilm, allowing the studio to recreate Darth Vader's voice using artificial intelligence, a decision made as he was considering retiring from the role.

The AI technology used to recreate Jones' voice was developed by Respeecher, a Ukrainian tech start-up. Matthew Wood, the supervising sound editor at Lucasfilm, presented Jones with Respeecher's work, which was subsequently used in the Disney+ series "Obi-Wan Kenobi" in 2022.

Jones had an illustrious career spanning over 70 years, with notable roles in films like "The Great White Hope," "Field of Dreams," and as the voice of Mufasa in "The Lion King." However, he is perhaps best known for his portrayal of Darth Vader in the Star Wars franchise, a role he began in 1977 with "Star Wars: Episode IV – A New Hope."

May he rest in peace.

PS5 PRO AI UPSCALING

Sony has officially announced the PS5 Pro, a more powerful version of their PlayStation 5 console, scheduled for release on November 7th, 2024, with a price tag of $699.99. The PS5 Pro introduces several key improvements, including a larger GPU with 67% more compute units, advanced ray tracing capabilities, and a new AI-driven upscaling technology called PlayStation Spectral Super Resolution (PSSR).

PSSR is the headline addition to the PS5 Pro. Developed by Sony, this AI-powered upscaling technique is conceptually similar to Nvidia's DLSS and AMD's FSR: it aims to improve both frame rates and image quality in PlayStation games, and it is designed to replace the temporal anti-aliasing and upsampling methods games currently use.

Mark Cerny, the console's lead architect, says PSSR "analyzes the game images pixel by pixel," with the machine learning model then automatically adding "an extraordinary amount" of detail to the on-screen image. In practice, a game running at 1080p and 60fps could see PSSR upscale the output closer to 1440p without noticeable degradation to the framerate.

By integrating this technology with the hardware improvements, Sony intends to allow players to enjoy games with high-fidelity graphics while maintaining smooth performance. The ultimate goal is to eliminate the need for players to choose between performance and visual quality modes in games.
