
Flux Updates, Nvidia 'Cosmos' Project, and AI Video Game Strike | This Week in AI Art ✨

Cut through the noise, stay informed — new stories every Sunday.


FLUX UPDATES

Holy Flux updates.

  • SimpleTuner v0.9.8 is a versatile tool for training the Flux model across a range of GPUs with varying VRAM capacities. It supports quantized Flux training, enabling use on GPUs with as little as 13.9GB of VRAM and scaling up to 40GB. While the tool supports both LoRA and full tuning, the latter is generally not recommended. Multiple GPUs can also be used, but the approach favors quantization and LoRAs over model splitting, and quantized training on multi-GPU setups is still under development. For those pursuing full tuning, a substantial 80GB of VRAM per GPU is required.

  • A workflow shared by u/Total-Resort-3120 uses the "sd-dynamic-thresholding" extension to enable negative prompts with Flux. It involves setting the Classifier-Free Guidance (CFG) scale higher than 1, diverging from Flux's typical usage, so the model processes both a positive and a negative prompt while dynamic thresholding reins in the inflated latent values that higher CFG produces. The primary trade-off is speed: generation is roughly 50% slower with CFG > 1, since each denoising step now evaluates the model twice. u/Total-Resort-3120 recommends CFG = 3 for optimal results, balancing improved output quality against the computational cost (a rough sketch of the idea follows this list).

  • XLabs AI released a ControlNet (Canny) model for the FLUX.1-dev text-to-image system, enabling more controlled image generation guided by Canny edge maps. Users are strongly advised to download only .safetensors files to avoid potential security risks from malicious code that can be embedded in other file formats. Given the sudden appearance of these advanced tools from a relatively unknown source, caution is recommended when downloading them.

Canny model - Flux.1 dev (via u/BoostPixels)

  • XLabs AI has also released a collection of six new LoRAs (Low-Rank Adaptation models) for the Flux image generation system, covering Art, Anime, Disney, Landscape, Retro, and Realistic styles. The LoRAs fine-tune the Flux model toward specific artistic styles or content types, and a ComfyUI wrapper is available.

Prompt: “Phone photo: A woman stands in front of a mirror, capturing a selfie. The image quality is grainy, with a slight blur softening the details. The lighting is dim, casting shadows that obscure her features. The room is cluttered, with clothes strewn across the bed and an unmade blanket. Her expression is casual, full of concentration, while the old iPhone struggles to focus, giving the photo an authentic, unpolished feel. The mirror shows smudges and fingerprints, adding to the raw, everyday atmosphere of the scene” (via u/Sea_Law_7725)
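
For intuition, here is a minimal sketch of the two pieces that negative-prompt workflow combines: standard classifier-free guidance mixing the positive and negative predictions, followed by an Imagen-style dynamic-thresholding step that compresses the inflated values a higher CFG scale produces. The function and variable names are illustrative assumptions; the actual sd-dynamic-thresholding extension runs inside the sampler and offers several more sophisticated scaling modes.

    import torch

    def guided_prediction(pred_pos, pred_neg, cfg_scale=3.0, percentile=0.95):
        # Classifier-free guidance: push the prediction away from the
        # negative prompt and toward the positive one.
        guided = pred_neg + cfg_scale * (pred_pos - pred_neg)

        # Dynamic thresholding (Imagen-style, used here as an illustration):
        # per sample, find the chosen percentile of absolute values and
        # compress anything beyond it, so CFG > 1 does not blow out the latents.
        flat = guided.reshape(guided.shape[0], -1).abs()
        s = torch.quantile(flat, percentile, dim=1).clamp(min=1.0)
        s = s.view(-1, *([1] * (guided.dim() - 1)))
        return torch.maximum(torch.minimum(guided, s), -s) / s

At CFG = 1 the guidance term cancels out and you are back to plain Flux behavior, which is why the negative prompt only starts costing speed (and needing thresholding) once the scale goes above 1.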

AI Sparks SAG-AFTRA Video Game Industry Strike

On July 26, 2024, SAG-AFTRA (Screen Actors Guild-American Federation of Television and Radio Artists) initiated a strike against the video game industry. This action followed failed negotiations for a new Interactive Media Agreement (IMA), primarily due to disagreements over AI-related worker protections. The strike impacts over 160,000 SAG-AFTRA members, affecting both new and ongoing video game projects across major publishers like Activision, Take-Two, Insomniac Games, and WB Games. This strike echoes a previous 11-month SAG-AFTRA video game strike in 2016, which resulted in improved conditions for performers.

So what's the deal with these AI issues that derailed the SAG-AFTRA talks?

AI-related issues were the primary cause of the breakdown in negotiations. The main disagreement was over worker protections regarding AI technology. SAG-AFTRA wanted protections for both voice and movement performers concerning digital replicas and the use of generative AI to create performances without initial input. The video game companies initially offered protections only to voice performers, later extending them to motion performers but with conditions that SAG-AFTRA found unacceptable.

Hold up a sec. How do SAG-AFTRA's AI protection demands stack up against the companies' offers?

SAG-AFTRA wanted comprehensive protections for both voice and movement performers against the use of AI to create digital replicas or generate new performances. The video game companies initially offered protections only for voice performers. They later extended the offer to include motion performers, but only if "the performer is identifiable in the output of the AI digital replica." SAG-AFTRA rejected this proposal, arguing it would exclude most movement performances and leave stunt performers unprotected.

What's this "side letter six" thing, and how does it affect the strike's impact?

The "side letter six" clause is a specific provision in the Interactive Media Agreement that exempts certain games from being considered struck work during a strike. Specifically, it applies to games that were in production before August 2023. This clause, along with other exemptions, could potentially limit the strike's impact. For example, although Take-Two is a struck company, Grand Theft Auto VI is not considered struck work due to the side letter six clause. Additionally, work done under SAG-AFTRA's Tiered-Budget Independent Interactive Media Agreement or an Interim Interactive Media Agreement is exempt from the strike. The clause states that it "permits but does not require performers to render services during a strike."

Leaked: Nvidia's 'Cosmos' AI Project Mines Vast Video Data

404 Media recently published an investigative report exposing Nvidia's secret "Cosmos" project. Based on leaked internal documents, the report reveals Nvidia is processing vast amounts of video data to train a state-of-the-art video foundation model. This AI will reportedly power various Nvidia products, including their Omniverse 3D world generator, self-driving car systems, and "digital human" offerings. The revelation highlights the ongoing ethical and legal debates surrounding data collection practices in AI development.

So, what's the deal with Nvidia's "Cosmos" project, and how are they getting all this data?

Nvidia is working on a yet-to-be-released video foundation model called "Cosmos" for various applications, including their Omniverse 3D world generator, self-driving car systems, and "digital human" products. Internal documents show Nvidia has been scraping large amounts of video content from sources like YouTube, Netflix, and other platforms to train the model, reportedly aiming to process "a human lifetime visual experience worth of training data per day."

Spill the beans - what kind of content and methods is Nvidia using for Cosmos?

Employees used tools like yt-dlp and virtual machines with rotating IP addresses to download videos at scale, attempting to avoid being blocked by platforms. The project aims to compile diverse video content, including cinematic footage, drone footage, egocentric views, travel, nature, and gaming content from sources like GeForce Now.

This sounds dicey. What's the word on ethics and legality, and how's Nvidia spinning it?

There are internal discussions about the legal and ethical implications of using copyrighted content and academic datasets for commercial AI training purposes. When employees raised questions about potential legal issues, managers often dismissed these concerns, citing "executive decisions" and "umbrella approval" for data use. Nvidia officially defended its practices, stating they are "in full compliance with the letter and spirit of copyright law," citing fair use for transformative purposes like model training.

Nvidia's Cosmos project spotlights a trend in the AI industry: the widespread scraping of copyrighted content for training without explicit permission. This practice has thrust tech giants into a legal gray area, igniting a debate over whether such use of copyrighted material constitutes fair use. As the field rapidly evolves, resolving these legal and ethical challenges will be critical in shaping the future of AI technology and its societal impact.

Put This On Your Radar

Deep-Live-Cam: Real-Time Webcam Face Swapping Tool

A new open-source project on GitHub allows for real-time face swapping in webcam feeds using a single image.

  • Supports various GPU acceleration options (CUDA, CoreML, DirectML, OpenVINO)

  • Built-in checks to prevent misuse with inappropriate content

  • Simple GUI for easy operation

  • CLI mode available for advanced users

Comfy Org August Updates

Comfy Org has announced several updates in its August 2024 blog post:

  • Support for Flux and Hunyuan DiT AI models

  • Weekly release versions introduced (currently at v0.0.4)

  • New TypeScript frontend coming August 15th

  • Major core engine update (PR 2666) planned for next week


LLM Saga: AI D&D Game Engine

u/Valuevow has created an online game engine called LLM Saga that runs D&D 5e-style campaigns using AI.

  • AI Lore Master manages the game world and NPCs

  • Character creation with unique races and backgrounds

  • Turn-based combat following 5e rules

  • Environmental interactions and ability checks

  • Visual assets generated by AI

The creator plans to hold a first play test on November 4, 2024. Interested players can find more information at llmsaga.com.

Apple's ml_mdm: Open-Source Image Synthesis

Apple has released ml_mdm, an open-source framework for training high-quality text-to-image diffusion models efficiently.

  • Trains single pixel-space models up to 1024x1024 resolution

  • Demonstrates strong zero-shot generalization on small datasets

  • Pretrained models available (64x64, 256x256, 1024x1024)

  • Compatible with video generation

  • Includes web demo for image generation

For developers interested in efficient, high-resolution image synthesis, this tool offers a promising new approach.

CogVideoX-2B: Text-to-Video Model

THUDM has released CogVideoX-2B, an open-source text-to-video generation model.

Prompt: “A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.”

  • Generates 6-second videos at 8 fps, 720x480 resolution

  • Supports English prompts up to 226 tokens

  • Inference possible on a single GPU (18-24GB VRAM)

  • Available via Hugging Face Diffusers library

  • Includes tools for prompt optimization and fine-tuning

Try it out on Hugging Face Spaces, or check the GitHub repo for detailed usage instructions and model information; a minimal Diffusers sketch follows below.
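
As a rough idea of what running it locally looks like, here is a minimal sketch based on the model's Diffusers integration. Argument names and defaults can shift between diffusers releases, so treat the model card and repo as the source of truth.

    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Load the 2B checkpoint in half precision to stay within a single GPU's VRAM.
    pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()  # trades some speed for a lower peak memory footprint

    prompt = "A detailed wooden toy ship glides over a plush blue carpet that mimics ocean waves."
    video = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=6.0).frames[0]
    export_to_video(video, "toy_ship.mp4", fps=8)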

ReSyncer: AI Lip-Sync System

Researchers have developed ReSyncer, an AI system that synchronizes facial movements with audio for creating virtual presenters or performers.

  • Generates high-quality lip-synced videos from audio input

  • Supports fast personalization, video-driven lip-syncing, and speaking style transfer

  • Potential applications in virtual presenters, language dubbing, and more

If you're still here, consider sharing this newsletter.

And I'm curious: where do you stand on AI development?
