Show HN: Lemon Slice Live – Have a video call with a transformer model

93 points | by lcolucci 5 hours ago

48 comments

srameshc 4 hours ago
I am very much fascinated by this virtual avatar talking thing. I tried video-retalking https://github.com/OpenTalker/video-retalking just to see how far I can make it work to make a talking avatar but it is tremendously difficult. But this holds tremendous possibilities and I hope it can be eventually cheaper to run such models. I know this is far superior and probably a lot different but I hope to find open source solutions like Lemon Slice someday that I can experiment with.
[-]
- sid-the-kid 4 hours ago
  Nice! Thanks for sharing. I hadn't seen that paper before. Looks like they take in a real-world video and then re-generate the mouth to get to lip synch. In our solution, we take in an image and then generate the entire video.
  I am sure they will have open source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.
lostmsu 4 hours ago
This is very impressive. Any details about model architecture and size? Input and output representation?
How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
[-]
- sid-the-kid 4 hours ago
  For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly.
  The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
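A minimal sketch of the recursive scheme described above, assuming a chunked generator: each new chunk is conditioned on the last few frames of the previous output. The names `stream_video`, `toy_generate_chunk`, and `context_len` are hypothetical, frames are plain floats instead of RGB tensors, and the stability tricks mentioned in the comment are not modeled.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

Frame = float  # toy stand-in for one RGB frame


def stream_video(
    first_frame: Frame,
    audio_chunks: Iterable[float],
    generate_chunk: Callable[[List[Frame], float], List[Frame]],
    context_len: int = 2,
) -> Iterator[Frame]:
    """Yield frames chunk by chunk, conditioning each chunk on recent frames."""
    # The context starts as the single input image and then holds only the
    # last `context_len` generated frames.
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    for audio in audio_chunks:
        new_frames = generate_chunk(list(context), audio)
        yield from new_frames        # stream frames out as soon as they exist
        context.extend(new_frames)   # the deque drops the oldest frames itself


def toy_generate_chunk(context: List[Frame], audio: float) -> List[Frame]:
    """Hypothetical model stand-in: continue from the newest context frame."""
    frames = []
    last = context[-1]
    for _ in range(3):  # three frames per chunk, an arbitrary choice
        last = last + audio
        frames.append(last)
    return frames


frames = list(stream_video(0.0, [1.0, 10.0], toy_generate_chunk))
print(frames)  # [1.0, 2.0, 3.0, 13.0, 23.0, 33.0]
```

Streaming only works if each call to `generate_chunk` finishes faster than the chunk takes to play back, which is the "faster than real-time" requirement in the comment.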
  [-]
  - tony_cannistra 17 minutes ago
    Nice. This is also how recent advances in ML weather forecasting work. Weather forecasting really is just "video generation" but in higher dimensions.
  - tough 3 hours ago
    I'm not at that level but reminded me of https://news.ycombinator.com/item?id=43736193
    [-]
    - sid-the-kid 3 hours ago
      Nice find! I hadn't seen this before (and will take a deeper look later). It looks like this is an approach to better utilize the GPU memory. We would probably benefit from this to get more of a speed-up, which would also help us get better video quality.
      I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." That means it would take them 37.5s to generate 1 second of video, which is fast for video but way slower than real time.
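The arithmetic behind that estimate can be checked with a short sketch. The frame rate is not stated in the comment, so it is an assumption here: 37.5 s matches 2.5 s/frame at 15 fps, and also 1.5 s/frame at 25 fps. The function name is hypothetical.

```python
def compute_seconds_per_video_second(seconds_per_frame: float, fps: float) -> float:
    """Wall-clock seconds needed to generate one second of video."""
    return seconds_per_frame * fps


# Assumed frame rates; only the seconds-per-frame figures come from the quote.
unoptimized = compute_seconds_per_video_second(2.5, fps=15)
teacache = compute_seconds_per_video_second(1.5, fps=25)

# Real time means at most 1 second of compute per second of video.
print(unoptimized, unoptimized <= 1.0)
print(teacache, teacache <= 1.0)
```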
      [-]
      - tough 3 hours ago
        Yep, this is way slower but considered SOTA on video-gen open source.
        I mostly meant that the insight of using the previous frames to generate new frames reminded me of it, but I lack knowledge of the specifics of the work
        glad if it's useful for your work/research to check out the paper
        edit: how real-time it is also depends on what HW you are running your model on; it is obviously easier on an H100 than on a 3090. But these memory optimizations really help make these models usable at all for local stuff, which i think is a great win for overall adoption and for further stuff being built upon them, a bit like how sd-webui from automatic1111, alongside the Stable Diffusion weights being open sourced, was a boom for image gen a couple years back
  - dheera 42 minutes ago
    Nice! What infra do you use for inference? I'm wondering what the cost-effective platforms are for projects like this. GPUs on AWS and Azure are incredibly expensive for personal use.
    [-]
    - sid-the-kid 29 minutes ago
      We use modal (https://modal.com/). They give us GPUs on-demand, which is critical for us so we are only paying for what we are using. Pricing is about $2/hr per GPU (as a baseline of the costs). Long story short, things get VERY expensive quickly.
- lcolucci 4 hours ago
  thank you! We have an architecture diagram and some more details in the tech report here: https://lemonslice.com/live/technical-report
  And yes, exactly. In between each character interaction we need to do speech-to-text, LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
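A rough sketch of what a continuously streaming pipeline of this shape can look like, using chained Python generators. Every stage here is a hypothetical string-tagging stand-in, not the actual Deepgram, LLM, TTS, or video model calls; the point is only that the first output is produced before the input is exhausted.

```python
from typing import Iterable, Iterator, List


def speech_to_text(audio_chunks: Iterable[str]) -> Iterator[str]:
    for chunk in audio_chunks:
        yield f"text({chunk})"


def llm_reply(texts: Iterable[str]) -> Iterator[str]:
    for text in texts:
        yield f"reply({text})"


def text_to_speech(replies: Iterable[str]) -> Iterator[str]:
    for reply in replies:
        yield f"audio({reply})"


def video_model(speech: Iterable[str]) -> Iterator[str]:
    for clip in speech:
        yield f"frames({clip})"


def pipeline(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Chain the four stages; nothing runs until the output is consumed."""
    return video_model(text_to_speech(llm_reply(speech_to_text(audio_chunks))))


pulled: List[str] = []


def microphone() -> Iterator[str]:
    """Hypothetical input source that records which chunks were consumed."""
    for chunk in ["a", "b"]:
        pulled.append(chunk)
        yield chunk


stream = pipeline(microphone())
first = next(stream)
print(first, pulled)  # only chunk "a" has been read at this point
```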
dang 5 hours ago
https://lemonslice.com/api/videos/video-XzDwIcW6QCvSIj1vX1Hu...
[-]
- lcolucci 4 hours ago
  haha this is amazing! Just made him a featured character. Folks can chat with him by searching for "Devil"
gitroom 3 hours ago
honestly this feels kinda huge - stuff like this is moving so fast, it's insane seeing it go real-time
[-]
- sid-the-kid 3 hours ago
  IMO, most video models will be fully real time within 2 years. You will be able to pick a model, imagine any world and then be fully immersed in it. Walk around any city interacting with people, play first person shooter games on any map with crazy monsters, or just let the model auto-pilot an adventure for you.
- lcolucci 3 hours ago
  thanks so much for the kind words! we agree that the leap to real-time feels huge. so excited to share this with you all
elternal_love an hour ago
Hmm, plug this together with an app which collects photos and chats with a deceased loved one and you have a working Malachim. Might be worth a shot.
Impressive technology - impressive demo! Sadly, the conversation seems to be a little bit overplayed. Might be worth plugging ChatGPT or some better LLM in the logic section.
[-]
- andrew-w an hour ago
  Thanks for the feedback. Optimizing for speed meant we had fewer LLMs to choose from. OpenAI had surprisingly high variance in latency, which made it unusable for this demo. I think we could probably do a better job with prompting for some of the characters.
benob 3 hours ago
Very nice. Are you planning a paper?
[-]
- lcolucci 3 hours ago
  thank you! No concrete paper plan yet as we're focused on shipping product features. anything specific you'd want to read about?
aorloff an hour ago
Max Headroom lives !
[-]
- andrew-w 43 minutes ago
  Just added as a public character :)
- sid-the-kid an hour ago
  Does he? I can't find him.
  [-]
  - sid-the-kid an hour ago
    Looked it up. Cool reference.
sid-the-kid 4 hours ago
The system just crashed. Sorry! Working on getting things live again as fast as we can!
[-]
- sid-the-kid 4 hours ago
  We are live again folks! Sorry about that. We ran out of storage space.
- PUSH_AX 4 hours ago
  Ah the ole HN soak test.
  [-]
  - sid-the-kid 4 hours ago
    Ya. You always think you cross your Ts. But, the law always holds.
  - lcolucci 3 hours ago
    haha one of the reasons launching on HN is great!
bigyabai 5 hours ago
> reducing delays and improving resolution (purpose-built ASICs will help)
How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.
[-]
- lcolucci 5 hours ago
  We wouldn't build it ourselves, but there are several companies like Etched, Groq, and Cerebras working on purpose-built hardware for transformer models. Here's more: https://www.etched.com/announcing-etched
tetris11 5 hours ago
If you could lower the email signup for a few hours, that'd be nice. I'm not going to sign up for yet another service I'm unsure about.
[-]
- sid-the-kid 5 hours ago
  We just removed email signup. You can try it out now without logging in. It was easier than expected to do technically, so we just shipped a quick update.
  [-]
  - tetris11 4 hours ago
    Thanks! This is amazing
    [-]
    - sid-the-kid 4 hours ago
      Glad you like it! IMO, biggest things to improve on are 1) time to video response and 2) being able to generate more complicated videos (2 people talking to each other, a person walking + talking, scene cuts while talking).
doublerabbit 5 hours ago
"Try it now live" and then request me to enter my email.
I'll pass thanks.
[-]
- sid-the-kid 5 hours ago
  That's fair. We just removed the sign-in for HN. Should be live shortly.
  Each person gets a dedicated GPU, so we were worried about costs before. But, let's just go for it.
  [-]
  - sgrove 4 hours ago
    I think it's not going well? I keep getting to the start a new call page, it fails, and takes me back to the live page. I assume your servers are on fire, but implementing some messaging would help ("come back later") or even better, a queueing system ("you're N in line") would help a lot.
    Really looking forward to trying this out!
    [-]
    - andrew-w 4 hours ago
      We're back online! One of our cache systems ran out of memory. Oops. Agree on improved messaging.
  - ivape 4 hours ago
    How much would this demo cost you from the HN traffic if you don't mind me asking?
    [-]
    - sid-the-kid 4 hours ago
      Good question. I guess it depends on how many users we get. Each user gets their own dedicated GPU. Most video generation systems (and AI systems in general) can share GPUs during generation. Since we are real time, we don't do that. So, each user minute is a GPU minute. This is the biggest driver of the cost.
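As a back-of-the-envelope sketch of "each user minute is a GPU minute": the $2/hr rate is the baseline figure quoted elsewhere in this thread, while the user counts, session lengths, and the `users_per_gpu` sharing factor below are purely hypothetical inputs, not real traffic numbers.

```python
def gpu_cost_usd(
    users: int,
    minutes_per_user: float,
    usd_per_gpu_hour: float = 2.0,  # baseline rate quoted in the thread
    users_per_gpu: int = 1,         # 1 = dedicated GPU per user
) -> float:
    """Estimate GPU spend when user minutes map directly to GPU minutes."""
    gpu_minutes = users * minutes_per_user / users_per_gpu
    return gpu_minutes / 60 * usd_per_gpu_hour


# Hypothetical traffic: 120 users chatting for 5 minutes each.
dedicated = gpu_cost_usd(users=120, minutes_per_user=5)
shared = gpu_cost_usd(users=120, minutes_per_user=5, users_per_gpu=4)
print(dedicated, shared)  # 20.0 5.0
```

With a dedicated GPU per user the cost scales linearly with total user minutes, which is why sharing one generated stream across several viewers would change the economics.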
      [-]
      - tough 3 hours ago
        feels like the next logical step for you to bring economies of scale is to allow users generating the video to automatically stream it to n platforms, so each gpu can be generating 1 png for many humans to watch simultaneously, with maybe 1 human driving the seat on what to generate, or more ai, idk
        [-]
        - sid-the-kid an hour ago
          that's a good idea! Would be especially cool if the human is charismatic and does a good job driving the convo. Maybe we can try it out with a streamer.
          [-]
          - tough 24 minutes ago
            Vtuber comes to mind
  - yahoozoo 5 hours ago
    Do you use a cloud-based GPU provider?
    [-]
    - sid-the-kid 5 hours ago
      Yes. We use Modal (https://modal.com/), and are big fans of them. They are very ergonomic for development, and allow us to request GPU instances on demand. Currently, we are running our real-time model on A100s.
      [-]
      - lostmsu 4 hours ago
        I see you are paying $2/h. Shoot me an email at victor ta borg.games if your model would fit on RTX 3090 24G to get it down to $0.2/h (fellow startup).
        [-]
        - tough 3 hours ago
          maybe demos could be a downsampled bitrate/size running on commercial GPUs