Insights from NeurIPS 2024

I attended NeurIPS 2024 in Vancouver, one of the premier machine learning conferences, and came away with several insights that suggest where the field is heading. Here are the trends and ideas that stood out most.

The Big Picture

Several overarching themes emerged across the conference:

Inference-time compute is taking center stage. Rather than just making models bigger during training, there's growing focus on combining models and using multiple passes at inference time to improve results. This shift means we're thinking less about raw model scale and more about how models work together and iterate on problems.

Text and video are becoming universal representations. Models trained on text and video are increasingly able to generalize to other types of data. This suggests these modalities capture something fundamental about how information can be represented and transformed.

Physical world understanding has massive room for growth. While we've made tremendous progress on text and images, there's enormous untapped potential in models that understand the physical world, from robotics to time series data.

Hardware efficiency breakthroughs are on their way. Startups like DiffLogic demonstrate clear pathways from general models to FPGA and ASIC implementations, with concrete implementations of smaller models already demonstrating efficiency gains of 2-20 million times.

Creative professionals are pushing back. There's active negotiation and contract discussions around the use of generative AI for writing, art, and other creative work. The relationship between AI and human creativity is being actively contested and redefined.

Flow Matching for Generative Modeling

One of the tutorials covered flow matching, a powerful framework for generative modeling. A few observations from this session:
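For readers who haven't seen the framework, here is a minimal sketch of the standard conditional flow-matching objective with linear paths (my own illustration, not code from the tutorial): sample a point on the straight line between noise and data, and regress a network onto the path's constant velocity. `velocity_net` is a placeholder for any model that takes a sample and a time.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with linear (optimal-transport) paths.

    x1 is a batch of data samples; x0 is Gaussian noise. Along the straight
    path x_t = (1 - t) * x0 + t * x1 the target velocity is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))   # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                             # point on the path
    target_v = x1 - x0                                     # constant path velocity
    pred_v = velocity_net(xt, t)                           # model's velocity prediction
    return ((pred_v - target_v) ** 2).mean()
```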

Context matters for transcription. An amusing aside: conference subtitles would benefit from using slide materials as context. "Romanian" is unlikely when discussing Riemannian manifolds, and "Campbell Italian" doesn't make sense when the slide shows "Campbell et al."

Fine-tuning is universal in production. All major production models for image and video generation are fine-tuned. This can be done with surprisingly few examples and can incorporate reward models that favor visual quality or human interest. LoRA (Low-Rank Adaptation) has become a standard approach, freezing the original model while adding tunable parameters.
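To make the LoRA idea concrete, here is a minimal sketch of the core trick: freeze a pretrained linear layer and learn a low-rank update beside it. This is illustrative only; production implementations (for example the peft library) add scaling options, dropout, and weight merging.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank                    # standard LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the low-rank correction (x @ A @ B).
        return self.base(x) + self.scale * (x @ self.lora_a @ self.lora_b)
```

Because A is initialized small and B at zero, the adapter starts as a no-op and only gradually bends the frozen model toward the fine-tuning objective.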

This whole area seems highly relevant for building applications on top of public shared models. Understanding how to effectively fine-tune and compose models is becoming a core skill.

Beyond Decoding: Meta-Generation for LLMs

The tutorial on meta-generation algorithms highlighted something I hadn't fully appreciated: most language model APIs now include JSON schema options that force the model to follow a specified structure. This is a small but powerful feature that makes LLMs much more practical for structured data tasks.
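As one concrete illustration, this is roughly what structured output looks like with the OpenAI Python client at the time of writing; other providers expose similar options, and the exact parameter shapes and model name here should be treated as assumptions rather than a stable reference.

```python
from openai import OpenAI

# A JSON schema the model's output must conform to.
talk_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topic": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "topic", "year"],
    "additionalProperties": False,
}

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this talk as structured data: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "talk_summary", "schema": talk_schema, "strict": True},
    },
)
print(response.choices[0].message.content)  # constrained to parse against talk_schema
```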

xLSTM: A Challenger to Transformers

Sepp Hochreiter's talk on industrial AI introduced xLSTM, which addresses LSTM limitations and competes with transformer-based LLMs while better maintaining inference speed as it scales.

His provocative claim: "The transformer is too slow." Given that current methods increasingly rely on inference-time compute, xLSTM's speed advantages could become crucial. Interestingly, xLSTM and Mamba2 are converging from different directions toward similar architectures. Mamba2 is essentially xLSTM without an input gate.
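The speed argument reduces to per-token decoding cost: a recurrent model like xLSTM carries a fixed-size state forward, while a transformer attends over a key-value cache that grows with every generated token. A schematic comparison with placeholder functions, not either architecture's real code:

```python
import torch

def recurrent_step(state, x_t, cell):
    """Recurrent decoding: cost per token is constant regardless of history length."""
    state, y_t = cell(state, x_t)          # fixed-size state in, fixed-size state out
    return state, y_t

def attention_step(kv_cache, q_t, k_t, v_t):
    """Transformer decoding: each new token attends over the whole growing cache."""
    kv_cache["k"].append(k_t)
    kv_cache["v"].append(v_t)
    keys = torch.stack(kv_cache["k"])      # length grows with every generated token
    values = torch.stack(kv_cache["v"])
    attn = torch.softmax(q_t @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return kv_cache, attn @ values
```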

This raises interesting questions about hybrid approaches: will we see single models combining multiple techniques, or applications that orchestrate different types of models?

Unconditional Generation Makes a Comeback

One of the most exciting talks was by Tianhong Li on Representation-Conditioned Generation (RCG). The problem: unconditional generation (modeling data distribution without human-annotated labels) has historically produced much worse results than conditional generation.

RCG's solution is elegant: generate semantic representations in the space produced by a self-supervised encoder, then use those representations to condition the image generator. The results are remarkable: RCG achieved FID of 2.15 on ImageNet 256×256, a 64% improvement over the previous best of 5.91. These unconditional results rival leading class-conditional methods.

What makes this particularly exciting:

  • It enables training on all unlabeled images, not just curated labeled datasets
  • Linear interpolation in the representation space allows steering generation and exploring novel combinations
  • It might be possible to backfit this into existing supervised models by creating a normalizing flow from a general distribution to distributions given useful prompts

Code is available at github.com/LTH14/rcg.
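In pseudocode terms, the RCG pipeline looks roughly like the sketch below; the component names are placeholders and the real implementation is in the repository above. The interpolation helper illustrates the representation-space steering mentioned in the list.

```python
import torch

def rcg_sample(rep_diffusion, image_generator, n: int = 4):
    """Sketch of Representation-Conditioned Generation (component names are placeholders).

    Stage 1: sample a semantic representation from a diffusion model trained on
             the embedding space of a self-supervised encoder.
    Stage 2: generate pixels conditioned on that representation instead of a label.
    """
    reps = rep_diffusion.sample(n)                    # latents in the SSL embedding space
    images = image_generator.sample(condition=reps)   # pixels conditioned on semantics
    return images

def interpolate_reps(rep_a: torch.Tensor, rep_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linear interpolation in representation space, used to steer generation."""
    return (1 - alpha) * rep_a + alpha * rep_b
```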

Practical Tools and Techniques

Stylus for adapter selection. With over 100K fine-tuned adapters available in open-source communities (mostly LoRAs for diffusion models), finding and composing the right adapters for a prompt is challenging. Stylus automatically selects and composes task-specific adapters based on keywords, achieving better quality and efficiency than base models alone.
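As a toy illustration of the retrieval problem (not Stylus's actual algorithm), consider scoring adapters by keyword overlap with the prompt; the adapter names and keyword sets below are hypothetical.

```python
def select_adapters(prompt: str, adapter_index: dict[str, set[str]], top_k: int = 3):
    """Toy keyword-overlap retrieval over an index of {adapter_name: keyword set}.

    This is not how Stylus works internally -- it only illustrates the task:
    map a prompt to a small set of task-specific adapters to compose.
    """
    prompt_terms = set(prompt.lower().split())
    scored = sorted(
        adapter_index.items(),
        key=lambda item: len(prompt_terms & item[1]),
        reverse=True,
    )
    return [name for name, keywords in scored[:top_k] if prompt_terms & keywords]

# Example with a tiny index of hypothetical LoRA adapters for a diffusion model.
index = {
    "watercolor_style": {"watercolor", "painting", "soft"},
    "cyberpunk_city": {"cyberpunk", "neon", "city"},
    "portrait_detail": {"portrait", "face", "detail"},
}
print(select_adapters("neon cyberpunk city at night", index))
```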

Diffusion guidance techniques. There's interesting work on guiding diffusion models with "bad versions" of themselves to tighten generated examples closer to the original dataset distribution.
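The mechanism has the same algebra as classifier-free guidance, but with a degraded model standing in for the unconditional one: extrapolate the strong model's prediction away from the weak model's. A minimal sketch, with placeholder model callables:

```python
import torch

def guided_denoise(good_model, bad_model, x_t: torch.Tensor, t, weight: float = 2.0) -> torch.Tensor:
    """Guide a diffusion model with a weaker "bad version" of itself.

    With weight > 1 this extrapolates away from the weak model's prediction,
    pulling samples toward regions the strong model handles well. The model
    callables and their signatures are placeholders for this sketch.
    """
    d_good = good_model(x_t, t)   # e.g., predicted noise or denoised sample
    d_bad = bad_model(x_t, t)
    return d_bad + weight * (d_good - d_bad)
```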

Hardware Efficiency: A Paradigm Shift

The work on difflogic (Convolutional Differentiable Logic Gate Networks) demonstrates staggering efficiency gains. For image classification and similar tasks, FPGA implementations offer huge speed improvements, with ASIC implementations achieving 2-22 million times better efficiency versus GPUs.
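The core trick, as I understand it, is to relax logic gates into real-valued functions so a network of them can be trained by gradient descent, then harden each node to a single gate for hardware. A minimal sketch (the real method uses all sixteen two-input gates and a particular network structure):

```python
import torch

def soft_gates(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Real-valued relaxations of a few two-input logic gates (inputs in [0, 1])."""
    return torch.stack([
        a * b,                # AND
        a + b - a * b,        # OR
        a + b - 2 * a * b,    # XOR
        1 - a * b,            # NAND
    ])

def soft_logic_node(a: torch.Tensor, b: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """One trainable node: a softmax mixture over candidate gates.

    During training the mixture weights (logits) are learned by gradient descent;
    for deployment each node is hardened to its argmax gate, which maps directly
    to a single FPGA/ASIC logic element.
    """
    weights = torch.softmax(logits, dim=0)             # probability per candidate gate
    return (weights[:, None] * soft_gates(a, b)).sum(dim=0)
```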

This represents a complete rethinking of model deployment. The flexibility and portability to specialized hardware opens entirely new applications.

Time Series and Foundation Models

The time series meetup introduced me to Temporal Point Processes for discrete events and the idea of "level 0" foundation models that simply dump logs into an LLM.
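For a concrete anchor on temporal point processes: discrete events are modeled through a conditional intensity function, the instantaneous event rate given the history so far. The classic self-exciting (Hawkes) form below is my own illustration, not something presented at the meetup.

```python
import math

def hawkes_intensity(t: float, history: list[float], mu: float = 0.2,
                     alpha: float = 0.8, beta: float = 1.0) -> float:
    """Conditional intensity of a self-exciting (Hawkes) point process.

    mu is the baseline event rate; each past event at t_i adds a contribution
    alpha * exp(-beta * (t - t_i)) that decays over time, so bursts of events
    raise the short-term probability of further events.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)

# Example: intensity just after a burst of three events vs. long after it.
events = [1.0, 1.2, 1.5]
print(hawkes_intensity(1.6, events), hawkes_intensity(10.0, events))
```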

I met Shawn Jain, who left OpenAI to co-found Synthefy to build time series foundation models. After reading the TimeWeaver paper, I'm convinced these capabilities would be transformational for systems engineering and operations.

Much of complex systems engineering involves exploring enormous state spaces generated from combinations of configurations, contextual conditions, and input sequences. Traditional engineering approaches fall short of capturing nuanced relationships in complex systems. Context and cohorts are crucial but underexplored.

The TimeWeaver approach appears powerful for unlocking insights across engineering integration and operations. Similar opportunities likely exist in financial markets and other domains involving complex temporal patterns.

AI and Creativity: A Contentious Dialogue

The Workshop on Creativity & Generative AI highlighted tensions between ML researchers and practitioners on one side and creative professionals on the other.

Collective bargaining and prompting. Negotiation around AI use in creative work focuses heavily on prompting: who writes the prompts, who owns the outputs. This may be specific to text generation and might not apply well to generation and curation workflows that don't center on prompting. Prompting itself might be temporary.

Data intermediaries for creative work. The presentation on choral AI datasets suggested creative professionals need representatives analogous to artists' agents. One example: Serpentine LLC serving as a trusted data intermediary for choral singing datasets.

The Ted Chiang Keynote

Ted Chiang's keynote was divisive, and I found some of his arguments problematic. One line of reasoning dismissed attention mechanisms as mere "kernel smoothing" and discounted apparent cognitive capabilities as anthropomorphism. There also appeared to be an inconsistent elitism around photography: photography and Photoshop involve "so many decisions" and are therefore clearly art, yet he refused to consider that creation with generative models, dataset curation, LoRA tuning, and tools like ComfyUI could involve equally many decisions. Challenges on this point from the generative AI art contingent in the room went unanswered.

The assumption that profit motive is incompatible with creative or artistic merit seemed particularly limiting.

Reframing the Discussion

More productive framings emerged in the subsequent discussions:

Curation as art. Orchestra conductors, DJs, museum curators all create art through selection and arrangement. Mixup, remix, and sampling (for example DJ Shadow) are established art forms. Generative AI tools enable new forms of curation.

The challenge of abundance. Generative AI creates a firehose of output. The artistic challenge becomes curating and steering toward transcendent artifacts that capture something distilled but unexpected. Perhaps we can create quantitative heuristics that help human interaction steer toward more differentiated outputs.

Art is in the perception. Art isn't just what the artist creates—it's what the perceiver experiences. We find transcendence in random artifacts from nature, broken pottery, found objects. The model itself can be an artistic artifact (think products displayed in MoMA), distilling human creativity and delivering it back in purer form, interpolating among combinations, identifying relationships, and applying concepts in novel ways.

Definitions of Creativity

Computational artist Ben Bogart offered fascinating perspectives:

  • Perception itself is a creative act and need not have intention
  • The process of art includes defining what art is—it's recursive
  • Agential realism: creativity involves defining boundaries and resolving ambiguities

Main definitions that emerged:

  • Novel combinations: Relationships between novelty, surprise, and value
  • Exploratory creativity: Finding new possibilities within existing patterns
  • Resonance: Matching with familiar patterns
  • Transformational creativity: Fundamentally changing the space of possibilities

Other elements: distillation, perspective, agency, intentionality, connection with inner experience, recognition and work.

My synthesis: Creativity is exposure plus resonance in perception. Process and product are the same, viewed from first-person versus shared perspectives. Creativity need not produce new artifacts. It can be recognition, as with found objects.

Looking Forward

NeurIPS 2024 reinforced that we're in a transition period. The frontier isn't just about making models bigger. It's about making them work together more effectively, deploying them more efficiently, and understanding their relationship to human creativity and expertise.

The technical advances are enabling entirely new applications, from unconditional generation that opens up unlabeled data, to hardware implementations that make edge deployment practical, to time series models that could transform how we understand complex systems.

At the same time, the social and creative dimensions are actively being negotiated. How we integrate these tools into creative practice, how we credit and compensate creators, and how we think about the nature of creativity itself are all in flux.

The conference left me optimistic about the technical trajectory and thoughtful about the broader implications. The most interesting work ahead isn't just in making better models... it's in understanding how to use them well.