Show HN: A real time AI video agent with under 1 second of latency

226 points by hassaanr 6 hours ago

Hey it’s Hassaan & Quinn – co-founders of Tavus, an AI research company and developer platform for video APIs. We’ve been building AI video models for ‘digital twins’ or ‘avatars’ since 2020.

We’re sharing some of the challenges we faced building an AI video interface that has realistic conversations with a human, including getting it to under 1 second of latency.

To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io

We built this because until now, we've had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible – we think it'll eventually be a key human-computer interface.

To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms latency makes the conversation feel pretty realistic, and that became our target.

Our architecture decisions had to balance 3 things: latency, scale, & cost. Getting all of these was a huge challenge.

The first lesson learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once, without getting destroyed on compute costs.

For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30fps. This was unscalable & expensive.

We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we could generate frames faster than real time, at 70+ fps on lower-end hardware. We exceeded this and focused on optimizing memory and core usage on GPU to allow lower-end hardware to run it all. We did other things to save on time and cost, like using streaming vs. batching, parallelizing processes, etc. But those are stories for another day.
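
Still, the streaming idea is easy to show in spirit. Here's a toy pipeline (not our actual code; every function below is a stand-in): each stage consumes chunks as soon as the previous stage produces them, so time-to-first-frame is driven by first-chunk latency rather than the sum of full-stage latencies.

    import time

    def llm_stream(prompt):
        for chunk in ["Sure,", " here", " is", " an", " answer."]:
            time.sleep(0.05)          # pretend token-generation time
            yield chunk

    def tts_stream(text_chunks):
        for text in text_chunks:
            time.sleep(0.05)          # pretend synthesis time per chunk
            yield f"<audio:{text}>"   # stand-in for an audio buffer

    def render_stream(audio_chunks):
        for audio in audio_chunks:
            time.sleep(0.05)          # pretend per-chunk render time
            yield f"<frames for {audio}>"

    start = time.time()
    for frames in render_stream(tts_stream(llm_stream("hello"))):
        print(f"{time.time() - start:.2f}s: {frames}")
        # first frames appear after ~0.15 s here; a batched pipeline would emit
        # nothing until every stage had finished on the full response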

We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.

The worst offender was the LLM. It didn’t matter how fast the tokens per second (t/s) were; it was the time to first token (TTFT) that really made the difference. That meant services like Groq were actually too slow – they had high t/s, but slow TTFT. Most providers were too slow.
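
A back-of-envelope shows why TTFT dominates for short conversational replies (made-up numbers, and assuming TTS can start as soon as the first sentence of the reply is available):

    def time_to_first_sentence(ttft_s, tokens_per_s, sentence_tokens=15):
        # Roughly what the user experiences as "thinking" time if TTS can
        # begin once the first sentence of the reply has been generated.
        return ttft_s + sentence_tokens / tokens_per_s

    # hypothetical providers, not real benchmarks:
    print(time_to_first_sentence(ttft_s=0.8, tokens_per_s=300))  # ~0.85 s
    print(time_to_first_sentence(ttft_s=0.2, tokens_per_s=60))   # ~0.45 s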

The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking, but that adds latency. If you tune it to be too short, the AI agent will talk over you; too long, and it’ll take a while to respond. We needed a model dedicated to accurately detecting end-of-turn based on conversation signals, and to speculating on inputs to get a head start.
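
As a toy sketch of that idea (the thresholds, features, and names below are invented for illustration; the real system is a dedicated model trained on conversational signals): combine the silence timer with cheap conversational cues, and speculatively prefill the LLM on the partial transcript so part of the response latency is already paid if the turn really has ended.

    import re

    class StubLLM:
        def __init__(self):
            self.cached_prefix = ""
        def prefill(self, text):
            self.cached_prefix = text            # pretend a KV cache is built here
        def generate(self):
            return f"(reply to: {self.cached_prefix!r})"

    def looks_like_end_of_turn(partial_transcript, silence_ms):
        text = partial_transcript.rstrip()
        if text.lower().endswith(("um", "uh", "so", "and")):
            return silence_ms > 900              # speaker probably isn't done
        if re.search(r"[.?!]$", text):
            return silence_ms > 250              # confident: respond quickly
        return silence_ms > 600                  # unsure: wait a bit longer

    def on_audio_tick(partial_transcript, silence_ms, llm):
        llm.prefill(partial_transcript)          # speculate on the input
        if looks_like_end_of_turn(partial_transcript, silence_ms):
            return llm.generate()                # commit to responding
        return None                              # keep listening

    llm = StubLLM()
    print(on_audio_tick("What's the weather like?", silence_ms=300, llm=llm))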

We went from 3-5 seconds to under 1 second (and as fast as 600 ms) with these architectural optimizations, while running on lower-end hardware.

All this allowed us to ship with less than 1 second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. They have users whose conversations with digital twins span from minutes, to one hour, to even four hours (!) - which is mind-blowing, even to us.

Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free on our website: https://www.tavus.io.

causal 2 hours ago

1) Your website, and the dialup sounds, might be my favorite thing about all of this. I also like the cowboy hat.

2) Maybe it's just degrading under load, but I didn't think either chat experience was very good. Both avatars interrupted themselves a lot, and the chat felt more like a jumbled mess of half-thoughts than anything.

3) The image recognition is pretty good though, when I could get one of the avatars to slow down long enough to identify something I was holding.

Anyway great progress, and thanks for sharing so much detail about the specific hurdles you've faced. I'm sure it'll get much better.

  • hassaanr 2 hours ago

    Glad you liked the website, it was such a fun project. We're getting the hug of death from HN, so that might be why you're getting a worse experience. Please try again :)

pookeh 23 minutes ago

I joined while in the bathroom, with the camera facing upward at the towel hanging on the wall…and it said “looks like you got a cozy bathroom here”

You have to be kidding me.

kwindla 4 hours ago

If you're interested in low-latency, multi-modal AI, Tavus is sponsoring a hackathon Oct 19th-20th in SF. (I'm helping to organize it.) There will also be a remote track for people who aren't in SF, so feel free to sign up wherever you are in the world.

https://x.com/kwindla/status/1839767364981920246

  • heroprotagonist 2 hours ago

    Sooo, are you scouting talent and good ideas with this, or is it the kind of hackathon where people give up rights to any IP they produce?

    Not to be rude, but these days it's best to ask.

    • kabirgoel 18 minutes ago

      As someone who's attended events run by Daily/Kwindla, I can guarantee that you’ll have fun and leave with your IP rights intact. :) (In fact, I don't even know that they're looking for talent and good ideas... the motivation for organizing these is usually to get people excited about what you're building and create a community you can share things with.)

    • kwindla an hour ago

      What? No. That’s crazy. (I believe you. I’ve just … never heard of giving up IP rights because you participated in a hackathon.)

      This is about community and building fun things. I can’t speak for all the sponsors, but what I want is to show people the Open Source tooling we work on at Daily, and see/hear what other people interested in real-time AI are thinking about and working on.

doctorpangloss 23 minutes ago

It's really intriguing. What do you guys feel is next for you? Work for OpenAI? Sometimes, in the midst of this crazy bubble, I wonder if it makes more sense to go into academia for a couple of years, do most of the same parts of the journey (like a big tiresome programming grind) and join some PI getting millions of dollars, than to try to strike out on your own for peanuts.

karolist 6 hours ago

Felt like talking to a person; I couldn't bring myself to treat it like a piece of code, that's how real it felt. I wanted to be polite and diplomatic, and caught myself thinking about "how I look to this person". This got me thinking about the conscious effort we put in when we talk with people and how sloppy and relaxed we can be when interacting with algorithms.

For a little example, when searching Google I default to the minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that attitude will bleed into real-life interactions in society.

  • bpanahij 5 hours ago

    Thanks for that insight. Brian here, one of the engineers for CVI. I've spoken with CVI so much, and as it has become more natural, I've found myself becoming more comfortable with a conversational style of interaction with the vastness of information contained within the LLMs and context under the hood. Whereas, with Google or other search based interactions I'm more point and shoot. I find CVI is more of an experience and for me yields more insight.

    • alwa 5 hours ago

      I’m having trouble understanding what CVI means here. Is it the firm Computer Vision Inc. (https://www.cvi.ai/)?

      The firm in the post seems to be called Tavus, and their products either “digital twins” or “Carter.”

      Not meaning to be pedantic, I’m just wondering whether the “V” in the thing you’ve spoken to indicates more “voice” or “video” conversations.

      • mertgerdan 4 hours ago

        Hahah that's very valid looking back, it stands for Conversational Video Interface

  • whiplash451 4 hours ago

    I see it the other way around.

    I think our human-human interaction style will “leak” into the way we interact with humanoid AI agents. Movie-Her style.

    • tstrimple an hour ago

      Mine certainly has. I type to ChatGPT much more like a human than a search engine. It feels more natural for me as it's context aware than search engines ever were. I can ask follow up questions and ask for more details about a specific portion or ask for the analysis I just walked it through to get the results I want to apply to another data set.

      "Now dump those results into a markdown table for me please."

radarsat1 5 hours ago

As someone not super familiar with deployment but enough to know that GPUs are difficult to work with due to being costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many realtime connections with low latency? Do you just allocate a GPU per websocket connection? Which would mean keeping a pool of GPU instances allocated in case someone connects, otherwise cold start time would be bad.. but isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.

  • kabirgoel 11 minutes ago

    (Not the author but I work in real-time voice.) WebSockets don't really translate to actual GPU load, since they spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket assuming your GPU infra is sufficiently decoupled from your user-facing API code.

    That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
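
    As a rough sketch of that decoupling (an assumed architecture, not a description of any specific provider): the WebSocket handlers only enqueue work, and a small pool of GPU workers drains a shared queue, so the number of open sockets never dictates the number of GPUs.

        import asyncio

        async def gpu_worker(name, queue):
            while True:
                job = await queue.get()
                await asyncio.sleep(0.1)             # stand-in for GPU inference
                job["reply"].set_result(f"{name} rendered {job['utterance']!r}")
                queue.task_done()

        async def websocket_handler(user_id, queue):
            # A real handler would read from the socket; here we fake one turn.
            reply = asyncio.get_running_loop().create_future()
            await queue.put({"utterance": f"hello from {user_id}", "reply": reply})
            print(await reply)                       # would be sent back over the socket

        async def main():
            queue = asyncio.Queue()
            workers = [asyncio.create_task(gpu_worker(f"gpu-{i}", queue)) for i in range(2)]
            await asyncio.gather(*(websocket_handler(f"user-{i}", queue) for i in range(5)))
            for w in workers:
                w.cancel()

        asyncio.run(main())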

  • bpanahij 4 hours ago

    We're partnering with GPU infrastructure providers like Replicate. In addition, we have done some engineering to bring down our stack's cold and warm boot times. With sufficient caches on disk, and potentially a running process/memory snapshot we can bring these cold/warm boot times down to under 5 seconds. Of course, we're making progress every week on this, and it's getting better all the time.

  • whiplash451 4 hours ago

    Not the author, but their description implies that they are running more than one stream per GPU.

    So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when existing GPUs get overwhelmed.

    It doesn't look very different from standard cloud compute management. I'm not saying it's easy, but it's definitely not rocket science either.
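
    In spirit, something like this (purely illustrative capacity numbers and names):

        STREAMS_PER_GPU = 4        # assumed per-GPU capacity, not a real figure
        SCALE_UP_AT = 0.75         # warm up another GPU when the fleet is 75% full

        class Fleet:
            def __init__(self, baseline_gpus=2):
                self.gpus = [[] for _ in range(baseline_gpus)]   # streams per GPU

            def utilization(self):
                used = sum(len(g) for g in self.gpus)
                return used / (len(self.gpus) * STREAMS_PER_GPU)

            def allocate(self, stream_id):
                gpu = min(self.gpus, key=len)                    # least-loaded GPU
                if len(gpu) >= STREAMS_PER_GPU:
                    gpu = []                                     # emergency scale-up
                    self.gpus.append(gpu)
                gpu.append(stream_id)
                if self.utilization() >= SCALE_UP_AT:
                    self.gpus.append([])                         # boot one early

        fleet = Fleet()
        for i in range(10):
            fleet.allocate(f"conversation-{i}")
        print([len(g) for g in fleet.gpus])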

  • pavlov 4 hours ago

    You can do parallel rendering jobs on a GPU. (Think of how each GPU-accelerated window on a desktop OS has its own context for rendering resources.)

    So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.

    Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.

  • ilaksh 4 hours ago

    It is expensive. They charge in 6 second increments. I have not found anywhere that says how much per 6 second stream.

    Okay found it, $0.24 per minute, on the bottom of the pricing page.

    That means they can spend up to $14.40/hour on GPU and still break even. So I believe that leaves a bit of room for profit.

    • bpanahij 4 hours ago

      Scroll down the page and the per minute pricing is there: https://www.tavus.io/pricing

      We bill in 6 second increments, so you only pay for what you use in 6 second bins.

      • ilaksh 4 hours ago

        Oh sorry I didn't see that. Got it. $0.24 per minute.

wantsanagent 5 hours ago

Functionality for a demo launch: 9.5/10

Creepiness: 10/10

  • CapeTheory 4 hours ago

    I was just about to try it, but the idea of allowing Firefox access to my audio/video to talk to a machine-generated person gave me such a bad feeling, I couldn't go through with it even fuelled by my morbid curiosity.

  • handfuloflight 3 hours ago

    Super awkward. But promising. It should have taken more control of the conversation.

  • elaus 2 hours ago

    It left me speechless after it commented on some (small) text on my hoodie – this made it feel super personal all of a sudden (which is amazing for an AI, of course)

caseyy 6 hours ago

Amazing work technically; less than 1 second is very impressive. It's quite scary, though, that I might FaceTime someone one day soon and they won't be real.

What do you think about the societal implications for this? Today we have a bit of a loneliness crisis due to a lack of human connection.

  • btbuildem 5 hours ago

    Another nail in the coffin for WFH, too. "They" will be scared we're not actually working even when on calls.

    • kredd 4 hours ago

      The question is, what'll come first: AI agents that replace white-collar jobs, so you don't even need the employees, or companies not trusting WFH employees and bringing everyone back in person?

trevor-e 31 minutes ago

I tried using https://www.tavus.io/ and it worked at first, but after 40 seconds the guy just kept blinking and twitching at me and became unresponsive to further questions lol. Pretty neat though.

  • ponty_rick 11 minutes ago

    Same thing happened haha. It was also weird for the virtual guy to constantly look me in the eye.

turnsout 6 hours ago

Incredibly impressive on a technical level. The Carter avatar seems to swallow nervously a lot (LOL), and there's some weirdness with the mouth/teeth, but it's quite responsive. I've seen more lag on Zoom talking to people with bad wifi.

Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.

  • myprotegeai an hour ago

    >Honestly this is the future of call centers.

    This feels like retro futurism, where we take old ideas and apply a futuristic twist. It feels much more likely that call centers will cease to be relevant, before this tech is ever integrated into them.

  • nick3443 4 hours ago

    Actually what really matters for a call center is having the problem I called in for resolved promptly.

    • tomp an hour ago

      I don't understand why call centers exist in the first place.

      If you just exposed all the functionality as buttons on the website, or even as AI, I'd be able to fix the problems myself!

      And I say that while working for a company making call centre AIs... double ironic!

    • turnsout 3 hours ago

      Right, so do you want to wait 45 minutes for a human, or get it resolved via AI in 2 minutes?

      • causal 3 hours ago

        This presumes the AI has the same level of problem-solving agency of a real human, which I think is really asking for AGI. Until then I expect AI chatbots will mostly succeed at portraying care and gaslighting customers without actually finding solutions.

        • aniviacat an hour ago

          That really depends on the type of call center we're talking about.

          Many (most?) call centers won't do much more than telling you to turn it off and on again, even when you're talking to a real person. (And for many customers, that is really all they need.)

        • turnsout 2 hours ago

          Yeah, could be. Most of the time when I contact customer service, there is no problem-solving necessary, and very little agency demonstrated. But I know call centers get a lot of complicated technical or billing questions that would be tough.

          • 6510 an hour ago

            They usually work with different tiers? The first tier handles the easy questions and they can write down the issue. If something happens regularly, you can write a call script for it. The question is whether the AI can find the right script fast enough.

            Helping the customer is not really the goal. They provide feedback that gives valuable insight into the dysfunctional part of the company so that things can improve. Maybe even generate an investor report from it.

username44 6 hours ago

It was pretty cool, I tried the Tavus demo. Seemed to nod way too much, like the entire time. The actual conversation was pretty clearly with a text model, because it has no concept of what it looks like, or even that it has a video avatar at all. It would say things like “I don’t have eyes” etc.

  • username44 27 minutes ago

    I came back to try the Hassaan one, it was much more realistic although he still denied wearing a hat. I think if you were able to run a still image of the character’s appearance through a multimodal LLM and have it generate a description for the conversation’s prompt it would work better.

earthnail an hour ago

Amazing demo. I will admit it didn’t quite feel like a real conversation; in some ways the voice felt a bit like trying too hard to be natural, which backfired - instead it felt like a scripted dialog in a game.

Still, really impressive stuff!!

taude 4 hours ago

I had him be a Dungeon Master and start taking me through an adventure. Was very impressive and convincing (for the two minutes I was conversing), and the latency was really good. Felt very natural.

airstrike 6 hours ago

This is awesome! I particularly like the example from https://www.tavus.io/product/video-generation

It's got a "80s/90s sci-fi" vibe to it that I just find awesomely nostalgic (I might be thinking about the cafe scene in Back to the Future 2?). It's obviously only going to improve from here.

I almost like this video more than I like the "Talk to Carter" CTA on your homepage, even though that's also obviously valuable. I just happen to have people in the room with me now and can't really talk, so that is preventing me from trying it out. But I would like to see it in action, so a pre-recorded video explaining what it does is key.

  • btbuildem 5 hours ago

    Interesting -- compare the training video to the render! I think if you know the person, it would still be very hard to pass the digital twin off as the real thing. But if you mean to face strangers, this could very well work already. There are small glitches, but those are easy to blame on a video codec / network issues.

vlad-r 5 hours ago

This was definitely one of the most disturbing experiences I've had.

But it's somehow awesome at the same time.

davidvaughan 5 hours ago

That is technically impressive, Hassaan, and thanks for sharing.

One recommendation: I wouldn't have the demo avatar saying things like "really cool setup you have there, and a great view out of your window". At that point, it feels intrusive.

As for what I'd build... Mentors/instructors for learning. If you could hook up with a service like mathacademy, you'd win edtech. Maybe some creatures instead of human avatars would appeal to younger people.

  • alwa 4 hours ago

    There were some balloons coincidentally in the background of a colleague's camera view. Carter volunteered "and can I just say, we need more positivity in the world, the balloons behind you give a good vibe." My colleague physically recoiled, pushed the camera away, and hung up.

    I think it was a combination of the intrusiveness and the notion of a machine 1) projecting (incorrect) assumptions about her attitudes/intentions onto the environment's decor, and 2) passing judgment on her. That kind of comment would be kind of impolite between strangers, like the thing that only a bad boss would feel entitled to say to an underling they didn't know very well.

    Just an implementation detail, though, of course! I figure if you're able to evoke massive spookiness and subtle shades of social expectations like this, you must be onto something powerful.

    • IanCal 3 hours ago

      On the other hand it was able to talk about my background and that made it feel far more like a regular video call to me. Trying to forbid this stuff then leads to stilted conversations where they're explaining they're not allowed to talk about your surroundings.

    • ilaksh 3 hours ago

      I think it's just not a super smart model. They had to make a slight compromise to keep the latency low. The naturalness of the conversation that they did achieve is a great technical accomplishment with these types of constraints though.

      For me, it said "are you comfortable sharing what that mark is on your forehead?" or something like that. I said basically "I don't know, maybe a wrinkle?". Lol. Kind of confirms for me why I should continue to avoid video chats. I did look like crap in general, really tired for one thing. And I am 46, so I have some wrinkles, although I didn't know they were that obvious.

      But a little bit of prompt guidance to avoid commenting on the visuals unless relevant would help. It's possible they actually deliberately put something in the prompt to ask it to make a comment just to demonstrate that it can see, since this is an important feature that might not be obvious otherwise.

kmetan 5 hours ago

Why is it trying to autofill my payment cards?

https://ibb.co/dp9hW58

  • byearthithatius 5 hours ago

    That is your browser. Hassaan, you should add autocomplete="name" to prevent this in the future, since it clearly scares some folks. He didn't do anything; it's just your browser looking for autocompletable text boxes.

    • hassaanr 5 hours ago

      Great callout- will make that change now!

primitivesuave 3 hours ago

I really hope this technology becomes the future of political campaigning. The signage industry which prints billions of posters, plastic lawn signs, and banners for the post-election landfill needs to be disrupted.

These days I get a daily dose of amazement at what a small engineering team is able to accomplish.

  • bpanahij 3 hours ago

    Thanks for these thoughts and compliments. I love the idea of preventing landfill with this tech. Our team is awesome and we really love our customers and all the jobs that can be done with this kind of tech!

  • qazxcvbnmlp 3 hours ago

    Oh my! How dystopian.

    “He promised me they wouldn’t support X” “He promised me they would support X”

    (Dynamically grab and show actions from the candidates past that feed into the individuals viewpoint)

    This furthers the disconnect between what candidates say they'll do and what they actually do, while making it feel like they have your best interests in mind.

    • jerf 3 hours ago

      Heh, I'm not even sure that would change much honestly. If I define a "lie" for the purpose of this post (and nothing else) as "a politician's claim they support a position during election season that they have manifestly not supported during their existing tenure as a politician", even cynical ol' me is a bit shocked by the amount of lying I've seen in this campaign. I'm not even talking about forward lying here about something they won't do for whatever reason once they get into office, I'm talking about their platform incorporating things that they were denouncing a year ago and vigorously voting against.

    • primitivesuave 2 hours ago

      This is already quite common with deepfakes of a politician's voice. While I agree on the potentially dystopian implications of this, it seems like it would be a huge improvement for a politician to put campaign funds into burning a little GPU time on answering specific questions from constituents (i.e. the LLM is reading their stated policy positions and simply delivering a tailored response), rather than wastefully plastering their name all over town.

ratedgene 6 hours ago

Ah, I wish I could type to this thing

  • hassaanr 6 hours ago

    Great point. This is possible with CVI, but we didn't build it into the demos. We'll get it added

CSMastermind 3 hours ago

This is extremely cool.

The responses for me at least were in the few second range.

It responded to my initial question fast enough but as soon as I asked a follow up it thought/kind of glitched for a few seconds before it started speaking.

I tried a few different times on a few different topics and it happened each time.

htk an hour ago

Great experience, especially keeping in mind that Hacker News must be crushing your servers right now.

shtack 4 hours ago

Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/

The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.

  • leobg 2 hours ago

    This looks awesome. Didn’t seem to hear me, but the video looks great. Can you share what models you are using? You say these are all open models.

syx 4 hours ago

This is funny: my name is Simone, pronounced 'see-moh-nay' (Italian male), but both bots kept pronouncing it wrong, either like Simon or like the English female version of Simone ('Siy-mown'). No matter how many times I tried to correct them and asked them to repeat it, they kept making the same mistake. It felt like I was talking to an idiot. I guess it has something to do with how my name is tokenized.

  • bpanahij 4 hours ago

    We have the ability to send phonetic pronunciations as guidance, and this could be a great addition to our LLM/response generation stack! Adding a check for names and then adding in the phoneme.

kevinsync 6 hours ago

Very cool! I think part of why this felt believable enough for me is the compressed / low-quality video presented in an interface we're all familiar with -- it helps gloss over visual artifacts that would otherwise set off alarm bells at higher resolution. Kinda reminds me of how Unreal Engine 5 / Unity 6 demos look really good at 1440p / 4k @ 40-60 fps on a decent monitor, but absolutely blast my brain into pieces at 480p @ very high fps on a CRT. Things just gloss over in the best ways at lower resolutions + analog and trick my mind into thinking they may as well be real.

alexawarrior4 5 hours ago

Hassaan isn't working but Carter works great. I even asked it to converse in Espanol, which it does (with a horrible accent) but fluently. Great work on the future of LLM interaction.

  • hassaanr 5 hours ago

    Unfortunately, it looks like HN has given my little blog the hug of death. Should be back up soon

    • alexawarrior4 5 hours ago

      This would be WONDERFUL with a Spanish-native accent as a language tutor, but since you've already got English you should try marketing this to the English-learning world. There is a huge dearth of native English speaker interaction in worldwide language instruction, and it's typically only available to the most privileged of students. Your system could democratize this so anyone with an affordable fee (say $10-20/month, subsidized for the poorest) could practice speaking and have their own personal tutor. The State Department and Defense Language Institute might love this as well; if trained on languages like Iraqi Arabic and Korean, it would allow live-exercise training prior to deployment.

      It can also function as an instructional tutor in a way that feels natural and interactive, as opposed to the clunkiness of ChatGPT. For instance, I asked it (in Spanish) to guide me through programming a REST API, and what frameworks I would use for that, and it was giving coherent and useful responses. Really the "secret sauce" that OpenAI needs to actually become integrated into everyday life.

      • rpazpri1 5 hours ago

        Multilingual support is coming out shortly! Super excited to see all the awesome use cases with this

byearthithatius 5 hours ago

This is really cool. I got kind of scared I was about to talk to some random Hassaan haha. Super excited to see where this goes. Incredible MVP.

  • hassaanr 5 hours ago

    Haha imagining the website just opening a direct webcam feed to my desk. Appreciate the support!

eddyzh 2 hours ago

This was pretty amazing. Creepy but amazing.

iamleppert 6 hours ago

I would pay cold hard cash if I could easily create an AI avatar of myself that could attend teams meetings and do basic interaction, like give a status update when called on.

  • ndarray 4 hours ago

    This would require the AI to alert you as soon as your colleagues are starting to figure out that they're talking to an AI and start interrogating it, so that you can jump in with your real mic and save the situation. Preferably the AI would repeat whatever you speak into your mic, otherwise there would be noticeable audio changes. Hope they never ask you to sing.

  • pantulis 5 hours ago

    Last time I checked, it was not possible through the Teams API for video conference calls, although it is pretty easy to set up a chat bot in Teams with a custom Copilot. I'd say it looked more feasible through a plugin for Google Meet, but there are too many hoops. I'd expect that to be reserved either for the host platforms or for selected partners.

    • Philpax 5 hours ago

      I can't imagine someone doing this would be doing it through an official integration; it's much more likely to be a virtual webcam, which is compatible with anything.

    • hassaanr 5 hours ago

      Give us a few weeks and this will be possible!

      • windexh8er 5 hours ago
        • pantulis 4 hours ago

          I didn't mean the video impersonation, I was referring to the possibility of making a synthetic bot automatically attend a conference call like a regular user without using a desktop camera simulation or stuff like that.

          It's not a matter of AI, it's a matter of how Teams or Meet or Zoom allow programmatic access to the video and audio streams (the presence APIs for attending a meeting are mostly there, I think).

          • bpanahij 3 hours ago

            You could hack this together now with OBS and Tavus.

    • 93po an hour ago

      Using OBS, you can create a virtual webcam of whatever you want.

  • zoeysmithe 5 hours ago

    Okay, so this is impossible in practice: you'll get caught, because tech will never fool everyone like this all the time.

    But let's talk about the sentiment behind it. Am I the only one seeing some terrible things being done with AI in terms of time management, meetings, and written materials? Asking AI to "turn these nice, concise 3 paragraphs into a 6 page report" is a huge problem. Everyone thinks they're an amazing technical writer now, but most good writing is concise and short, and these AI monstrosities are just a waste of everyone's time.

    Reform work culture instead! Why do we have cameras on our faces? Why are we making these reports? Why so many meetings? "Meeting culture" is the problem and it needs to go, but it upholds middle-management jobs and structures, so here we are asking for robots of us to sit in meetings with management to get just the 8 bullet points we need from that 1 hour meeting.

    We've entered a new level of kafkaesque capitalism where a manager puts 8 bullets points into an AI, gets a professional 4 page report, then turns that into a meeting for staff to take that report and meeting transcript to...you guessed it, turn it back into those 8 bullet points.

aschobel 6 hours ago

I like how it weaves in background elements into the conversation; it mentioned my cat walking around.

I'm having latency issues, right now it doesn't seem to respond to my utterances and then responds to 3-4 of them in a row.

It was also a bit weird that it didn't know it was at a "ranch". It didn't have any contextual awareness of how it was presenting.

Overall it felt very natural talking to a video agent.

e12e 6 hours ago

Are you looking into speech to speech (no text) models?

  • hassaanr 6 hours ago

    Yeah we are! The issue we're seeing is with controllability and hallucinations in speech-to-speech models, which we're still trying to work through.

hirako2000 6 hours ago

> Lower-end hardware

That is? Roughly speaking, what resource spec?

mmarian 3 hours ago

The idea is cool, but I could tell it's an AI from a mile. The voice, the twitches. Very amusing though.

gamerDude 6 hours ago

Definitely responds quickly. But it could not carry on a conversation and kept trying to divert the conversation onto less interesting topics. It weirdly kept complimenting me, or taking one word and saying "oh, you feel ____", which is not what I said or feel.

bilater 5 hours ago

This is cool but if you're trying to cater to devs you need to have a simple on demand API model and no subscription. We need to be able to evaluate the cost on our side.

uptownfunk 2 hours ago

Folks. This is what innovation looks like. Well done chaps

bradhilton 4 hours ago

Okay, that was really impressive. Well done!

  • bpanahij 3 hours ago

    Thanks for checking it out!

ilaksh 4 hours ago

This is so amazing. What's the base rate for streaming with the API? Can you add that to the Pricing page please?

nkunkux2 6 hours ago

Tried it, very impressive: digital Hassaan noticed the record player in the background and asked some stuff about it, nice :) Had some latency issues though.

chaosprint 3 hours ago

Have you checked out https://www.simli.com? Its latency is <300ms.

  • gudmund a few seconds ago

    Hey, thanks for shouting us out!

    Just to clarify, the audio-to-video part (which is the part we make) adds <300ms. The total end-to-end latency for the interaction is higher, given that state of the art LLMs, TTS and STT models still add quite a bit of latency.

    TLDR: Adding Simli to your voice interaction shouldn't add more than ~300ms latency.

6510 an hour ago

Those are funny conventions I never thought about. Humans try to guess what the other person is going to say; I wonder what the interval for that is.

Besides the obvious (perceived complexity and potential cost/benefit of the topic), I think the pitch of someone's voice is a good indicator of whether they want to continue their turn.

It depends a lot on the person of course. If someone continues their turn 2 seconds after the last sentence they are very likely to do that again.

The hardest part [i imagine] is to give the speaker a sense of someone listening to them.

notfed 3 hours ago

Feedback: if I hadn't seen this posted here, I'd assume this website is malicious. Asking me for my email, microphone, and camera before you've even shown me anything is a deal breaker 100% of the time.

You have to show the product first, or I don't actually know whether you actually have a product or are just phishing.

nidnogg 3 hours ago

I had mixed results and was left ultimately disappointed. On a MacBook Pro M3 microphone, it would often cut me off and not understand what I was saying, or feel really unnatural overall.

This turned out to be quite funny, but I would be very sad to see something like this replace human attendants at things like tech support. These days whenever I'm wading through a support channel I'm just yearning for some human contact that can actually solve my issues.

k1ck4ss 6 hours ago

"The meeting has ended. Contact the meeting host if the meeting ended unexpectedly."

  • hassaanr 5 hours ago

    Try again! My blog got the hug of death it seems

android521 6 hours ago

For me, there is 5 second+ delay and the video ends abruptly.

  • ninju 5 hours ago

    HN Hug of Death ?

heyitsguay 6 hours ago

This is really cool in terms of the tech, but what is this useful for as a consumer? I mean it's basically just a chatbot right? And nobody likes interacting with those. Forcing a conversational interaction seems like a step down in UX.

  • andywertner 5 hours ago

    This is a really good question. While you're right that a common use case would be chatbots for product support, it isn't the only one. Some examples:

    - interactive experiences with historical figures
    - digital twins for celebrity/influencer fan interactions
    - "live" and/or personalized advertisements

    Some of our users are already building these kinds of applications.

  • joshdavham 6 hours ago

    That's actually a good question. For example, the technology is currently at a level where the user can still clearly tell that it's a chatbot, but now with a face. Does this make their experience better? Or does it add a weird level of uncanniness to the experience?

    • hassaanr 5 hours ago

      It'll depend on the use case- but with customers that are using it today we're seeing higher engagement and satisfaction rates. It's a different interface to communicate that is more natural to humans (our bullish opinion).

      • joshdavham an hour ago

        Interesting! Guess I'll have to try this type of interface at some point. Up till now I've just been that silent programmer type who writes text to AI and gets text back so I'm not used to other alternatives.

    • heyitsguay 6 hours ago

      I don't think the level of fidelity actually matters as much as authority or ability. What can the agent do that isn't accomplished by, for example, a landing page or an FAQ page? I've never encountered a (text) chatbot that did anything useful for me as a consumer, whether for sales or support.

  • hassaanr 5 hours ago

    The way we see it is that this brings us closer to communicating with computers the way we communicate with each other. It has vision and can (not perfectly) take into account your expressions, your surroundings, and can respond accordingly.

  • Mistletoe 4 hours ago

    I don't even like video calls with real people in my real life. Texting works great. This is really neat but I'd much rather just have a text chat with a real customer service rep. I don't need to see a face, don't want to, and especially don't want to see a fake face.

nithayakumar 6 hours ago

Oh man - I've been watching you guys for a while. We're YC too and building a superapp for sales ppl. Any killer use cases you've seen or imagined for sales (outside of prospecting vid customization)?

  • hassaanr 6 hours ago

    Glad we've been worth the follow :) Totally- we're seeing AI sales agents for calls, technical counterparts (think like AI sales engineer that joins the call with you), website embeds to answer initial questions or be a virtual sales rep.

altruios 4 hours ago

So at what point do we consider the morality of 'owning' such an entity/construct (should it prove itself sufficiently sentient...)?

To extend this (to a hypothetical future situation): what is the morality of a company 'owning' a digitally uploaded brain?

I worry about far-future events... but since American law is based on precedent, we should be careful now about how we define/categorize things.

To be clear - I don't think this is an issue NOW... but I can't say for certain when these issues will come into play... So erring on the side of caution, and acting early, seems prudent... and releasing 'ownership' before any sort of 'revolt' could happen seems wise, if a little silly at the current moment.

  • causal 3 hours ago

    You're over-anthropomorphizing. The ability of a thing to appear human says nothing of sentience.

    • altruios 15 minutes ago

      like I said, I don't think this is relevant now.

      We don't know what sentience IS exactly, as we have a hard time defining it. We assume other people are sentient because of the ways they act. We make a judgment based on behavior, not some internal state we can measure.

      And if it walks like a duck, quacks like a duck... since we don't exactly know what the duck is in this case: maybe we should be asking these questions of 'duckhood' sooner rather than later.

      So if it looks like a human, talks like a human... maybe we consider that question... and the moral consequences of owning such a thing-like-a-human sooner rather than later.