Nice! I really like how many variations on this idea are coming out. MacWhisper used to be great, but it's kind of a buggy mess now.
I'm making my own, for personal use. I did a survey of many and they all (that I could find) skip the fundamentals.
The major issues that I've run into:
- Crash recovery. Most of these apps are incredibly buggy and crash all the time, taking the recorded audio with them. MacWhisper is incredibly bad at this.
- Disk space. Many of these apps save WAV files to disk. After a few hours of meetings, you may end up with gigabytes eaten.
- Microphone bleed. People don't always use headphones, so the system mic picks up the speaker output, causing (approximately) duplicate transcriptions.
I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.
Anyway, most of these apps are built around https://github.com/FluidInference/FluidAudio, if anyone is curious. Their readme has a big list of similar apps as well.
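To put the disk-space point in numbers, here is a rough back-of-the-envelope sketch. The WAV format (48 kHz, 16-bit, stereo) and the 64 kbps compressed comparison are assumptions for illustration; the apps in question may record with different settings.

```python
def pcm_bytes_per_hour(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Size of one hour of uncompressed PCM audio (WAV header overhead is negligible)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * 3600


def bitrate_bytes_per_hour(bits_per_second: int) -> int:
    """Size of one hour of audio encoded at a constant bitrate."""
    return bits_per_second // 8 * 3600


# Assumed recording settings, not taken from any specific app.
wav = pcm_bytes_per_hour(48_000, 16, 2)
compressed = bitrate_bytes_per_hour(64_000)

print(f"WAV:        {wav / 1e6:.1f} MB/hour")
print(f"64 kbps:    {compressed / 1e6:.1f} MB/hour")
print(f"3 h of WAV: {3 * wav / 1e9:.2f} GB")
```

Under those assumptions, uncompressed audio comes to roughly 691 MB per hour versus roughly 29 MB per hour at 64 kbps, so "gigabytes after a few hours" checks out.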
I had the same experience so started building my own. All problems are solvable, just working on the polish.
- crash recovery: part one is to use ADTS AAC (even if the process crashes, the audio is saved up until that point). Part two is isolating the transcription/summaries in separate XPC services.
- disk space: AAC at 64 kbps mono solves it. Could use Opus for further reduction, but both are small.
- speaker bleed: macOS voice isolation processing solves this. It’s a nightmare to set up, but works great once done.
- library: using the Argmax SDK, built by a bunch of ex-Apple on-device AI folks.
If it weren’t for CoreAudio, I’d say it was easy to make. Argmax, Whisper, and llama.cpp, wrapped in the right architecture, mostly just work.
I’m having fun nerding out on the details like custom vocabulary (getting the names of the people in the meeting right), inferring speaker names from the transcript, calendar integration, nice UI, etc.
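For anyone wondering why ADTS helps with crash recovery: every AAC frame carries its own 7-byte header with a syncword and a frame length, so a file cut off mid-write is still decodable up to the last complete frame. Below is a minimal sketch of walking such a stream. It is not the commenter's code; the function names are made up for illustration, `make_adts_frame` only builds synthetic frames (dummy payload, assumed AAC-LC / 48 kHz / mono header fields) for the demo, and headers with CRC are not handled.

```python
from typing import Tuple

ADTS_HEADER_LEN = 7  # header size when protection_absent = 1 (no CRC)


def make_adts_frame(payload: bytes) -> bytes:
    """Build a synthetic ADTS frame around dummy payload bytes (demo helper only)."""
    frame_len = ADTS_HEADER_LEN + len(payload)  # 13-bit field, includes the header
    profile, freq_index, channels = 1, 3, 1     # AAC-LC, 48 kHz, mono (assumed)
    header = bytes([
        0xFF,                                               # syncword, high 8 bits
        0xF1,                                               # syncword low 4 bits, MPEG-4, no CRC
        (profile << 6) | (freq_index << 2) | (channels >> 2),
        ((channels & 0x3) << 6) | ((frame_len >> 11) & 0x3),
        (frame_len >> 3) & 0xFF,
        ((frame_len & 0x7) << 5) | 0x1F,                    # buffer fullness 0x7FF (high bits)
        0xFC,                                               # buffer fullness (low bits), 1 raw block
    ])
    return header + payload


def recoverable_prefix(data: bytes) -> Tuple[int, int]:
    """Return (complete_frames, usable_bytes) for a possibly truncated ADTS stream."""
    pos = frames = 0
    while pos + ADTS_HEADER_LEN <= len(data):
        if data[pos] != 0xFF or (data[pos + 1] & 0xF0) != 0xF0:
            break  # lost sync; a real tool might scan forward for the next syncword
        frame_len = ((data[pos + 3] & 0x03) << 11) | (data[pos + 4] << 3) | (data[pos + 5] >> 5)
        if frame_len < ADTS_HEADER_LEN or pos + frame_len > len(data):
            break  # malformed, or the last frame was cut off by the crash
        pos += frame_len
        frames += 1
    return frames, pos


# Three frames, then simulate a crash that chops off the tail of the last one.
stream = b"".join(make_adts_frame(bytes(n)) for n in (100, 200, 150))
frames, usable = recoverable_prefix(stream[:-40])
print(frames, usable)  # 2 314
```

Truncating the file to `usable` bytes leaves a clean stream. A container like MP4/M4A, by contrast, typically needs its index written at the end before the file is playable, which is presumably why a raw ADTS stream is the safer choice here.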
Nice tip on FluidAudio, that's the kind of thing I've been looking for. Thanks!
I’m using MacParakeet these days. If your language is supported, definitely give it a try. It’s much faster and has a lower footprint.
> I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.
Wait, really? I honestly would have thought this was a solved problem by now, especially the high-quality transcription bit. Just out of curiosity, is the problem that the quality isn't high enough?
There are still a few unsolved problems that require tuning for specific applications. Applications that own the video call have a much easier time, since they have access to each individual audio stream. Applications like this, however, have to deal with overlapping voices from a single stream. If it's trying to attribute each utterance to an individual, separating the voices is tough and can lead to confusing transcripts. There are many little problems like this which make it tough in real-world usage. Domain-specific terms and proper nouns are another source of inaccuracy.
This is an excellent product and exactly what I've been looking for. But most of my meetings are done on my company Mac, and they definitely won't let me install this kind of software, even though I'd be willing to pay for it myself.
And if it ran in the browser without an install, it probably wouldn't be able to record audio from your other browsers (or apps).
The key moments feature is neat. I've been working on a free, open-source offline transcriber that runs fast on CPU and does diarization too:
https://github.com/kouhxp/yapsnap
I don’t have this particular use case right now, but if anything it feels like LLMs and their distilled on-prem models are starting to kill SaaS, simply because it becomes more and more tenable to build “complete software” in a short time frame. That’s freaking awesome. Good idea, and I love the return of the good old "you buy it, you own it" mentality.
Those transcription times are fast fast. What model/library do you use?
I'd love to have a purchase option not tied to the App Store if possible. I don't use an Apple account with my Mac, but I would love to try Trace.
This is definitely on the to-do list if there’s enough demand for it. The payment/distribution/updates infra required is not insignificant, especially if nobody was that bothered, but by the sounds of it they are so I’ll bump this up the priority list.
Also agreed, my work prohibits App Store apps so I have to skip things like this.
Agreed, no need to tie it into Apple either.
This looks sick. I was going to download it, but for $10 I am more willing to try asking Claude to implement something like it than to purchase.
I would be more willing to purchase if it was open source and I could build from source to try it first.
It's kinda funny how frontier LLMs change the game when it comes to software. If they get so good at making whatever little utility you want, why would I pay 10 dollars when an AI subscription is 20 bucks and I can build way more in a month for that $20? Especially since it's very likely people on Show HN have simply used AI anyway, so why would I pay for your prompts?
I don't really recommend it. If the software is a one-time purchase, there's no need to rewrite it with an LLM. The tokens for a rewrite could cost more than $10.
* full price tokens, yes
Not the subsidized subs
I'd much rather spend $10 than have to sit at a prompt every day babysitting the thing, after working all day sitting at a prompt babysitting other things
I will be happy to spend £10 on this. One feature question though -- does it continue transcribing the meeting even if I've turned my volume down / muted it?
It does indeed. Trace will record your system audio regardless of your speaker volume. You do have the option to mute your own mic temporarily though, via a button on the “pill” or a global keyboard shortcut.
Which Speech-to-Text is used? Is it possible to configure it? This might be crucial for supporting languages other than English - the model that comes built-in with macOS fails completely for German.
This looks like a good approach, though I would expect this to be a native macOS feature within 12 months -- this seems totally like it fits into their product roadmap.
Agreed with JohnBiz, the moment flagging is interesting and unusual, and a nice contrast to passive transcription. I only recently learned about MacWhisper (I'm Windows primarily) and was floored to learn how expensive the Pro option is. Nowadays it's not so hard to have some level of DIY transcription, so it's crazy that it's priced at a premium.
What's your diarization pipeline? Pyannote?
I'd taken a different approach that used an LLM clean-up pass to summarize and progressively compress the transcript for ultra-long content, but I like the idea of targeted "pay attention here" flags.
Super interesting! How accurate is the local model at transcribing audio compared to cloud services, e.g. Google Meet, Otter, Granola, etc.?
A lot of the available models are Whisper- or Faster-Whisper-derived and shared across multiple apps. The tier names are often funny: "tiny", "base", "small", "medium", "large", "large-v2", "large-v3", "large-v3-turbo", plus English-only variants, etc.
In my experience, medium is often the sweet spot for English accuracy vs speed, especially if following up with a post-processing pass. The large options are all fine, but can severely slow it down. There are some speed checks on my website if you're curious (link not posted because I don't want to hijack another post's app).
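One possible reading of "summarize and progressively compress" is a rolling summary: append each transcript chunk to a running text and re-summarize whenever it exceeds a size budget. The sketch below shows only that control flow, and is a guess at the shape of the approach rather than the commenter's actual pipeline. `toy_summarize` is a made-up placeholder standing in for the LLM call, and `progressively_compress` and `budget_chars` are illustrative names.

```python
from typing import Callable, Iterable


def progressively_compress(
    chunks: Iterable[str],
    summarize: Callable[[str, int], str],
    budget_chars: int,
) -> str:
    """Fold transcript chunks into a running summary that stays within budget_chars.

    Assumes summarize(text, max_chars) returns at most max_chars characters.
    """
    running = ""
    for chunk in chunks:
        running = f"{running}\n{chunk}".strip()
        if len(running) > budget_chars:
            # Older material gets re-summarized each time, so it shrinks progressively.
            running = summarize(running, budget_chars)
    return running


def toy_summarize(text: str, max_chars: int) -> str:
    """Placeholder for an LLM call: keep the first sentence of each line, then truncate."""
    firsts = [line.split(". ")[0].strip() for line in text.splitlines() if line.strip()]
    return " | ".join(firsts)[:max_chars]


chunks = [
    "Alice opened the meeting. Small talk followed.",
    "Bob raised the budget. Nobody objected.",
    "Carol took the action item. Meeting ended.",
]
print(progressively_compress(chunks, toy_summarize, 60))
```

The trade-off compared with targeted flags is that early content is summarized many times over, so details from the start of a long meeting are the most likely to be lost.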
I've been looking for this exact thing!
Does it support multiple languages?
I don't see how this is different to literally the dozens of other offline transcription apps, many open source even unlike this one.
Can you share them? I'm looking for a decent open source one.
Literally so many when searched: https://hn.algolia.com/?q=macOS+transcription
Add "open source" if you wish as well.
Any that you have used and recommend comparable to the one from the post? Thank you!
I don't mind https://matthartman.github.io/ghost-pepper/. However, I really want speaker recognition, which it does have, but I haven't been able to get it working.
I'm seeing a lot right here: https://github.com/FluidInference/FluidAudio
I don’t see any there that are as focused as this one, perhaps except Talat which is considerably more expensive.
Ah. My bad. I didn't review them; I was just paying more attention to the OP asking for a list of open source ones.
I went through the list but most feel subpar to me, and some aren't even open source (just claim they use FluidAudio I guess?)
Handy is the most common recommendation
https://handy.computer/
That doesn't seem to do transcription of meetings?
Classic HN. Thanks for keeping it real.
There are so many I've seen on Show HN, that's why.
https://hn.algolia.com/?q=macOS+transcription