Show HN: Lemon Slice Live – Have a video call with a transformer model

192 points by lcolucci 2 days ago

Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We’ve trained a custom diffusion transformer (DiT) model that achieves video streaming at 25fps and wrapped it into a demo that allows anyone to turn a photo into a real-time, talking avatar. Here’s an example conversation from co-founder Andrew: https://www.youtube.com/watch?v=CeYp5xQMFZY. Try it for yourself at: https://lemonslice.com/live.

(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)

Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.

To achieve this demo, we had to do the following (among other things! but these were the hardest):

1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality, and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation. The distilled model achieves 25fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4k resolution.

2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a video by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models experience quality degradation after multiple extensions due to the accumulation of generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. Our technique significantly reduces artifact accumulation and allows us to generate indefinitely long videos.

3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar Zoom call requires several building blocks, including voice transcription, LLM inference, and text-to-speech generation in addition to video generation. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to help build a parallel processing pipeline that orchestrates everything via continuously streaming chunks. Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response. Our target is <2 second latency.
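
A minimal sketch of the chunk-streaming idea in item 3, assuming four stages connected by queues so that each stage starts working as soon as the previous one emits a chunk. The stage names and the `stage` and `run_pipeline` helpers are hypothetical placeholders for illustration; they are not the Deepgram, Pipecat, Daily, or Modal APIs, and the real pipeline is certainly more involved.

```python
import asyncio


async def stage(name, inbox, outbox):
    """Forward chunks from inbox to outbox as soon as they arrive.

    Hypothetical stand-in for a real service (speech-to-text, LLM,
    text-to-speech, video model); it only tags the chunk with its name.
    """
    while True:
        chunk = await inbox.get()
        if chunk is None:  # end-of-stream marker, pass it downstream
            await outbox.put(None)
            return
        await asyncio.sleep(0)  # placeholder for real work
        await outbox.put(f"{name}({chunk})")


async def run_pipeline(audio_chunks):
    """Push audio chunks through stt -> llm -> tts -> video concurrently."""
    names = ["stt", "llm", "tts", "video"]
    queues = [asyncio.Queue() for _ in range(len(names) + 1)]
    tasks = [
        asyncio.create_task(stage(name, queues[i], queues[i + 1]))
        for i, name in enumerate(names)
    ]
    for chunk in audio_chunks:
        await queues[0].put(chunk)
    await queues[0].put(None)

    frames = []
    while (out := await queues[-1].get()) is not None:
        frames.append(out)
    await asyncio.gather(*tasks)
    return frames


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["a0", "a1"])))
```

Because every stage runs as its own task, a later chunk can be in speech-to-text while an earlier one is already in the video stage, which is the property that keeps latency down in a streaming design like this.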

More technical details here: https://lemonslice.com/live/technical-report.

Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what they see to create a more natural and engaging conversation.

We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!

We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.

djaychela a day ago

Just talked with Max Headroom and Michael Scott - my wife is an office fan so knows the references, and I know enough Max to ask the right things.

Overall, a fun experience. I think that MH was better than Scott. Max was missing the glitches and moving background but I'd imagine both of those are technically challenging to achieve.

Michael Scott's mouth seemed a bit wrong - I was thinking Michael J Fox but my wife then corrected that with Jason Bateman - which is much more like it. He knew Office references alright, but wasn't quite Steve Carell enough.

The default while it was listening could do with some work, I think - that was the least convincing bit; for Max he would have just glitched or even been completely still I would think. Michael Scott seemed too synthetic at this point.

Don't get me wrong, this was pretty clever and I enjoyed it, just trying to say what I found lacking without trying to sound like I could do better (which I couldn't!).

andrew-w 18 hours ago

Thanks for the feedback. This is definitely a demo where every piece matters for maximizing the enjoyment factor. We spent the most effort on optimizing video quality and latency, but not a lot on tweaking the character prompts that go into the LLM. Turns out that matters a lot too.

anishsikka an hour ago

this was overall fun. better than expected. i'm an office fan so tried dwight and michael scott. i hope you folks get better at this. excited to see where you get in the next 12 months or so. Godspeed!

zebomon a day ago

This is impressive. The video chat works well. It is just a hair away from a very comfortable conversation. I'm excited to see where you have it a year from now, if it turns out to be financially viable. Good luck!

lcolucci a day ago

Thank you! Very much agree that we need to improve speed to make the conversation more comfortable. Our target is <2sec latency (as measured by time to first byte). The other building blocks of the stack (like interruption handling, etc) will get better in the coming months as well. In 1 year things should feel like the equivalent of a zoom conversation with another human.
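
For anyone who wants to measure this kind of latency on a streamed response, here is a rough sketch of timing the first chunk. The `time_to_first_chunk` helper, the fake stream, and the 50 ms delay are all made up for illustration; only the 2-second target comes from the comment above.

```python
import time

TARGET_SECONDS = 2.0  # target mentioned in the comment above


def time_to_first_chunk(stream, clock=time.perf_counter):
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = clock()
    first = next(iter(stream))
    return clock() - start, first


def fake_avatar_response():
    # Hypothetical stand-in for the streamed avatar reply.
    time.sleep(0.05)  # pretend the pipeline needs 50 ms before the first bytes
    yield b"frame-0"
    yield b"frame-1"


if __name__ == "__main__":
    latency, chunk = time_to_first_chunk(fake_avatar_response())
    print(f"first chunk after {latency:.3f}s, under target: {latency < TARGET_SECONDS}")
```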

dang 2 days ago

https://lemonslice.com/api/videos/video-XzDwIcW6QCvSIj1vX1Hu...

lcolucci 2 days ago

haha this is amazing! Just made him a featured character. Folks can chat with him by searching for "Devil"

mentalgear 19 hours ago

So basically the old open-source live-portrait hooked up with audio output. Was very glitchy and low res on my side. btw: Wondering if it's legal to use characters you don't have rights to. (how do you justify possible IP infringement)

andrew-w 19 hours ago

One way this differs is in the model architecture. Our approach relies on a single pass of a diffusion transformer (DiT), whereas Live Portrait relies on intermediate representations and multiple distinct modules. Getting a DiT to be real-time was a big part of our work. Quoting the Live Portrait paper: "Diffusion-based portrait animation methods [...] are usually [too] computationally expensive." As you hinted at, we had to compromise on resolution to get there (this demo is 256x256), but we think that will improve over time.
- andrew-w 19 hours ago
  
  Not relying on facial keypoints means we can animate a wide range of non-humanoid characters. My favorite is talking to the Doge meme.
pjc50 19 hours ago

> Wondering if it's legal to use characters you don't have rights to. (how do you justify possible IP infringement)
IP law tends to be "richer party wins". There's going to be a bunch of huge fights over this, as both individual artists and content megacorps are furious about this copyright infringement, but OpenAI and friends will get the "we're a hundred-billion-dollar company, we can buy our own legislation" treatment.
e.g. https://www.theguardian.com/technology/2024/dec/17/uk-propos... a straightforward nationalisation of all UK IP so that it can be instantly given away for free to US megacorps.

srameshc 2 days ago

I am very much fascinated by this virtual avatar talking thing. I tried video-retalking https://github.com/OpenTalker/video-retalking just to see how far I can make it work to make a talking avatar but it is tremendously difficult. But this holds tremendous possibilities and I hope it can be eventually cheaper to run such models. I know this is far superior and probably a lot different but I hope to find open source solutions like Lemon Slice someday that I can experiment with.

sid-the-kid 2 days ago

Nice! Thanks for sharing. I hadn't seen that paper before. Looks like they take in a real-world video and then re-generate the mouth to get to lip synch. In our solution, we take in an image and then generate the entire video.
I am sure they will have open source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.

lostmsu 2 days ago

This is very impressive. Any details about model architecture and size? Input and output representation?

How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?

sid-the-kid 2 days ago

For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly.
The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
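
A toy sketch of the recursive generation loop described above, assuming each chunk is conditioned on the last few frames of the previous chunk. `generate_chunk` is a placeholder and not the real model: a "frame" here is just a float, and the stabilizing tricks mentioned above are not represented. The chunk length and context length are arbitrary.

```python
import itertools
from collections import deque


def generate_chunk(context, audio, chunk_len):
    """Placeholder for the video model (NOT the real DiT).

    Starts from the last context frame and nudges it by the audio value.
    """
    last = context[-1]
    return [last + 0.01 * audio * (i + 1) for i in range(chunk_len)]


def stream_video(first_frame, audio_chunks, chunk_len=8, context_len=2):
    """Yield frames one by one; audio_chunks may be an endless iterator."""
    # Seed the context with copies of the single input image.
    context = deque([first_frame] * context_len, maxlen=context_len)
    for audio in audio_chunks:
        chunk = generate_chunk(list(context), audio, chunk_len)
        for frame in chunk:
            yield frame
        context.extend(chunk)  # deque keeps only the last context_len frames


if __name__ == "__main__":
    endless_audio = itertools.repeat(1.0)
    frames = list(itertools.islice(stream_video(0.0, endless_audio), 100))
    print(len(frames))  # 100
```

Since the generator only pulls one audio chunk at a time, the output can run for as long as audio keeps arriving, which is what makes streaming possible once generation is faster than playback.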
- tough 2 days ago
  
  I'm not at that level but reminded me of https://news.ycombinator.com/item?id=43736193
  - sid-the-kid 2 days ago
    
    Nice find! I hadn't seen this before (and will take a deeper look later). It looks like this is an approach to better utilize the GPU memory. And, we would probably benefit from this to get more of a speed-up, which would also help us get better video quality.
    I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." At 25fps, even the faster 1.5 seconds/frame figure means it would take them 37.5s to generate 1 second of video, which is fast for video but way slower than real time.
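
The arithmetic behind that estimate, as a quick sketch. It assumes the 25fps frame rate from the original post applies to the quoted per-frame timings; the other project may target a different frame rate, so treat the numbers as illustrative.

```python
def seconds_per_video_second(seconds_per_frame, fps):
    """Compute time needed to produce one second of video."""
    return seconds_per_frame * fps


def is_real_time(seconds_per_frame, fps):
    """Real time means one second of video takes at most one second to make."""
    return seconds_per_video_second(seconds_per_frame, fps) <= 1.0


FPS = 25  # frame rate from the original post, assumed here

print(seconds_per_video_second(1.5, FPS))  # 37.5 (teacache figure)
print(seconds_per_video_second(2.5, FPS))  # 62.5 (unoptimized figure)
print(1.0 / FPS)  # 0.04, the per-frame budget for real time at 25fps
```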
    
    tough 2 days ago
    
    Yep, this is way slower but considered SOTA on video-gen open source.
    I mostly meant that the insight of using the previous frames to generate new frames reminded me of it, but I lack knowledge of the specifics of the work
    glad if it's useful for your work/research to check out the paper
    edit: the real-time-ness of it also has to take into account what HW you are running your model on; it's obviously easier on an H100 than a 3090. But these memory optimizations really help make these models usable at all for local stuff, which I think is a great win for overall adoption and for further stuff being built upon them, a bit like how sd-webui from automatic1111, alongside the Stable Diffusion weights being open sourced, was a boom for image gen a couple years back
- tony_cannistra a day ago
  
  Nice. This is also how recent advances in ML weather forecasting work. Weather forecasting really is just "video generation" but in higher dimensions.
- dheera a day ago
  
  Nice! What infra do you use for inference? I'm wondering what the cost-effective platforms are for projects like this. GPUs on AWS and Azure are incredibly expensive for personal use.
  - sid-the-kid a day ago
    
    We use modal (https://modal.com/). They give us GPUs on-demand, which is critical for us so we are only paying for what we are using. Pricing is about $2/hr per GPU (as a baseline of the costs). Long story short, things get VERY expensive quickly.
lcolucci 2 days ago

thank you! We have an architecture diagram and some more details in the tech report here: https://lemonslice.com/live/technical-report
And yes, exactly. In between each character interaction we need to do speech-to-text, LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.

bsenftner a day ago

This is fantastic. I was the founder of the 3D Avatar Store, a company that was doing similar things 15 years ago with 3D reconstructions of people. Your platform is what I was trying to build back then, but at the time nobody thought such tech was possible, or they seriously wanted to make porn, and we refused. I'll try reaching out through channels to connect with your team. I come from feature film VFX, with Academy Award quality work, so it would be interesting to discuss. Plus, I've not been idle since the 3D Avatar Store, not at all...

andrew-w 18 hours ago

We've been very inspired by interactive character experiences powered by traditional VFX + puppetry (turtle talk with crush is a favorite). I think that sort of interactive entertainment will become more commonplace as tech like ours continues to improve. Looking forward to connecting!

gitroom 2 days ago

honestly this feels kinda huge - stuff like this is moving so fast, it's insane seeing it go real-time

sid-the-kid 2 days ago

IMO, most video models will be fully real time within 2 years. You will be able to pick a model, imagine any world and then be fully immersed in it. Walk around any city interacting with people, first person shooter games on any map with crazy monsters, or just let the model auto-pilot an adventure for you.
- genewitch a day ago
  
  Probably not, but even if so, how much will that cost? There's AI that will take a prompt like "what's the best weed whacker in 2025" and build a whole web page to publish the review. It's great, awesome. $10 in tokens to do that.
  And that's probably still a subsidized cost!
  Btw "what is the best weed whacker" is John C. Dvorak's "AI test"
lcolucci 2 days ago

thanks so much for the kind words! we agree that the leap to real-time feels huge. so excited to share this with you all

NoScopeNinja a day ago

Hey, this looks really cool! I'm wondering - what happens if you feed it something totally different like a Van Gogh painting or anime character? Have you tested any non-photo inputs?

andrew-w 17 hours ago

It works with any style of character! Check out the embedded videos in our tech report. Peachy and the toilet are my favorite. https://lemonslice.com/live/technical-report

elternal_love 2 days ago

Hmm, plug this together with an app which collects photos and chats with a deceased loved one and you have a working Malachim. Might be worth a shot.

Impressive technology - impressive demo! Sadly, the conversation seems to be a little bit overplayed. Might be worth plugging ChatGPT or some better LLM in the logic section.

andrew-w 2 days ago

Thanks for the feedback. Optimizing for speed meant we had fewer LLMs to choose from. OpenAI had surprisingly high variance in latency, which made it unusable for this demo. I think we could probably do a better job with prompting for some of the characters.
- genewitch a day ago
  
  You know the trick of having Gemini or mistral re-jigger the prompt?
  Also, you do realize that this will be used to defraud people of money and/or property, right?
  All about coulda, not shoulda.
  - netdevphoenix a day ago
    
    > Also, you do realize that this will be used to defraud people of money and/or property, right?
    Sadly, no one cares. LLM driven fraud is already happening and it is about to become more profitable.

ashishact a day ago

This is just brilliant. Hope you succeed, so that eventually I get an API to play with.

andrew-w 17 hours ago

Thanks! What kind of use case are you thinking about?

wouterjanl a day ago

Really cool stuff. It felt strangely real. Impressive!

andrew-w 17 hours ago

Thanks! We think we can cut down the latency to <2s which should make it feel even more natural.

o_____________o 19 hours ago

Are you going to offer a web embeddable version of the Live offering?

andrew-w 18 hours ago

It's something we are considering. What use cases do you have in mind?
- o_____________o 16 hours ago
  
  Mostly personal/art projects for now, but willing to pay.
  - andrew-w 15 hours ago
    
    Just added a signup at the bottom of the technical report: https://lemonslice.com/live/technical-report

inhumantsar 21 hours ago

love the demo video with Andrew. showing the potential as well as the delays and awkwardness of AI is refreshing compared to the heavily edited hype reels that are so common

andrew-w 19 hours ago

I spent about 2 hours recording videos with different characters. Of course, the one I made as a joke for myself and never intended to share was the most enjoyable to watch :)

movedx01 16 hours ago

watching baron harkonnen verbally create me code for todo list in React was rather amusing, thanks

andrew-w 16 hours ago

glad to bring a little joy into the world :)

benob 2 days ago

Very nice. Are you planning a paper?

lcolucci 2 days ago

thank you! No concrete paper plan yet as we're focused on shipping product features. anything specific you'd want to read about?

sid-the-kid 2 days ago

The system just crashed. Sorry! Working on getting things live again as fast as we can!

sid-the-kid 2 days ago

We are live again folks! Sorry about that. We ran out of storage space.
PUSH_AX 2 days ago

Ah the ole HN soak test.
- sid-the-kid 2 days ago
  
  Ya. You always think you cross your Ts. But, the law always holds.
- lcolucci 2 days ago
  
  haha one of the reasons launching on HN is great!

aorloff 2 days ago

Max Headroom lives !

andrew-w a day ago

Just added as a public character :)
- consumer451 a day ago
  
  Really wish his trademark glitching head nod was there, but I can imagine how that might not be possible.
  Super cool product in any case.
- aorloff 9 hours ago
  
  Really well done
sid-the-kid a day ago

Does he? I can't find him.
- aorloff 9 hours ago
  
  Toggle the tab to public
- sid-the-kid a day ago
  
  Looked it up. Cool reference.

bigyabai 2 days ago

> reducing delays and improving resolution (purpose-built ASICs will help)

How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.

lcolucci 2 days ago

We wouldn't build it ourselves, but there are several companies like Etched, Groq, and Cerebras working on purpose-built hardware for transformer models. Here's more: https://www.etched.com/announcing-etched

andrewstuart a day ago

A really compelling experience.

It seems clumsy to use copyrighted characters in your demos.

Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.

Obviously games could use this.

tetris11 2 days ago

If you could lower the email signup for a few hours, that'd be nice. I'm not going to sign up for yet another service I'm unsure about.

sid-the-kid 2 days ago

We just removed email signup. You can try it out now without logging in. It was easier than expected to do technically, so we just shipped a quick update.
- tetris11 2 days ago
  
  Thanks! This is amazing
  - sid-the-kid 2 days ago
    
    Glad you like it! IMO, biggest things to improve on are 1) time to video response and 2) being able to generate more complicated videos (2 people talking to each other, a person walking + talking, scene cuts while talking).

doublerabbit 2 days ago

"Try it now live" and then request me to enter my email.

I'll pass thanks.

sid-the-kid 2 days ago

That's fair. We just removed the sign-in for HN. Should be live shortly.
Each person gets a dedicated GPU, so we were worried about costs before. But, let's just go for it.
- sgrove 2 days ago
  
  I think it's not going well? I keep getting to the start a new call page, it fails, and takes me back to the live page. I assume your servers are on fire, but implementing some messaging would help ("come back later") or even better, a queueing system ("you're N in line") would help a lot.
  Really looking forward to trying this out!
  - andrew-w 2 days ago
    
    We're back online! One of our cache systems ran out of memory. Oops. Agree on improved messaging.
- yahoozoo 2 days ago
  
  Do you use a cloud-based GPU provider?
  - sid-the-kid 2 days ago
    
    Yes. We use Modal (https://modal.com/), and are big fans of them. They are very ergonomic for development, and allow us to request GPU instances on demand. Currently, we are running our real-time model on A100s.
    
    lostmsu 2 days ago
    
    I see you are paying $2/h. Shoot me an email at victor ta borg.games if your model would fit on RTX 3090 24G to get it down to $0.2/h (fellow startup).
    
    tough 2 days ago
    
    maybe demos could be a downsampled bitrate/size running on commercial GPU's
- ivape 2 days ago
  
  How much would this demo cost you from the HN traffic if you don't mind me asking?
  - sid-the-kid 2 days ago
    
    Good question. I guess it depends on how many users we get. Each user gets their own dedicated GPU. Most video generation systems (and AI systems in general) can share GPUs during generation. Since we are real time, we don't do that. So, each user minute is a GPU minute. This is the biggest driver of the cost.
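
A back-of-the-envelope sketch of that cost model, assuming the roughly $2/hr per GPU figure mentioned elsewhere in the thread and one dedicated GPU per user. The traffic numbers in the example (1,000 users at 5 minutes each) are invented purely for illustration, and `users_per_gpu` is a hypothetical knob showing what GPU sharing would change.

```python
def gpu_cost_usd(user_minutes, usd_per_gpu_hour=2.0, users_per_gpu=1):
    """Estimate GPU spend when each user minute maps to GPU time.

    usd_per_gpu_hour: baseline price quoted in the thread (about $2/hr).
    users_per_gpu: 1 for a dedicated real-time GPU; higher if GPUs were shared.
    """
    gpu_hours = user_minutes / 60 / users_per_gpu
    return gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical traffic: 1,000 users chatting for 5 minutes each.
    total_minutes = 1000 * 5
    print(f"${gpu_cost_usd(total_minutes):.2f}")  # $166.67
```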
    
    tough 2 days ago
    
    feels like the next logical step for you to bring economies of scale is to allow users generating the video to automatically stream it to n platforms, so each gpu can be generating 1 png for many humans to watch simultaneously, with maybe 1 human driving the seat on what to generate, or more ai, idk
    
    sid-the-kid a day ago
    
    that's a good idea! Would be especially cool if the human is charismatic and does a good job driving the convo. Maybe we can try it out with a streamer.
    
    tough a day ago
    
    Vtuber comes to mind
    
    pjc50 19 hours ago
    
    Neuro/vedal already there, although not with the model as well.