FFmpeg Assembly Language Lessons

421 points | by flykespice 3 days ago

162 comments

cr125rider 3 days ago
I can’t imagine the scale that FFMPEG operates at. A small improvement has to be thousands and thousands of hours of compute saved. Insanely useful project.
[-]
- prisenco 3 days ago
  Their commitment to performance is a beautiful thing.
  Imagine all projects were similarly committed.
  [-]
  - godelski 3 days ago
    There's tons of backlash here as if people think better performance requires writing in assembly.
    But to anyone complaining, I want to know, when was the last you pulled out a profiler? When was the last time you saw anyone use a profiler?
    People asking for performance aren't pissed you didn't write Microsoft Word in assembly we're pissed it takes 10 seconds to open a fucking text editor.
    I literally timed it on my M2 Air. 8s to open and another 1s to get a blank document. Meanwhile it took (neo)vim 0.1s and it's so fast I can't click my stopwatch fast enough to properly time it. And I'm not going to bother checking because the race isn't even close.
    I'm (we're) not pissed that the code isn't optional, I'm pissed because it's slower than dialup. So take that Knuth quote you love about optimization and do what he actually suggested. Grab a fucking profiler, it is more important than your Big O
    [-]
    - nwallin 3 days ago
      Another datapoint that supports your argument is the Grand Theft Auto Online (GTAO) thing a few months ago.[0] GTAO took 5-15 minutes to start up. Like you click the icon and 5-15 minutes later you're in the main menu. Everyone was complaining about it for years. Years. Eventually some enterprising hacker disassembled the binary and profiled it. 95% of the runtime was in `strlen()` calls. Not only was that where all the time was spent, but it was all spent `strlen()`ing the exact same ~10MB resource string. They knew exactly how large the string was because they allocated memory for it, and then read the file off the disk into that memory. Then they were tokenizing it in a loop. But their tokenization routine didn't track how big the string was, or where the end of it was, so for each token it popped off the beginning, it had to `strlen()` the entire resource file.
      The enterprising hacker then wrote a simple binary patch that reduced the startup time from 5-10 minutes to like 15 seconds or something.
      To me that's profound. It implies that not only was management not concerned about the start up time, but none of the developers of the project ever used a profiler. You could just glance at a flamegraph of it, see that it was a single enormous plateau of a function that should honestly be pretty fast, and anyone with an ounce of curiousity would be like, ".........wait a minute, that's weird." And then the bug would be fixed in less time than it would take to convince management that it was worth prioritizing.
      It disturbs me to think that this is the kind of world we live in. Where people lack such basic curiosity. The problem wasn't that optimization was hard, (optimization can be extremely hard) it was just because nobody gave a shit and nobody was even remotely curious about bad performance. They just accepted bad performance as if that's just the way the world is.
      [0] Oh god it was 4 years ago: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
      [-]
      - godelski 3 days ago
        I just started getting back into gaming and I'm seeing shit like this all the time. It's amazing that stuff like this is so common while the Quake fast inverse square root algo is so well known.
        How is it that these companies spend millions of dollars to develop games and yet modders are making patches in a few hours fixing bugs that never get merged. Not some indie game, but AAA rated games!
        I think you're right, it's on both management and the programmers. Management only knows how to rush but not what to rush. The programmers fall for the trap (afraid to push back) and never pull up a profiler. Maybe over worked and over stressed but those problems never get solved if no one speaks up and everyone is quiet and buys into the rush for rushing's sake mentality.
        It's amazing how many problems could be avoided by pulling up a profiler or analysis tool (like Valgrind).
        It's amazing how many millions of dollars are lost because no one ever used a profiler or analysis tool.
        I'll never understand how their love for money makes them waste so much of it.
        [-]
        bigstrat2003 3 days ago
        AAA games are, largely, quite bad in quality these days. Unfortunately, the desire to make a quality product (from the people who actually make the games) is overruled by the desire to maximize profit (from the people who pay their salaries). Indie games are still great, but I barely even bother to glance at AAA stuff any more.
        [-]
        godelski 3 days ago
        > by the desire to
        An appropriate choice of words.
        I'm just wondering if/when anyone will realize that often desire gets in the way of achieving. ̶T̶h̶e̶y̶ ̶m̶a̶y̶ ̶b̶e̶ ̶p̶e̶n̶n̶y̶ ̶w̶i̶s̶e̶ ̶b̶u̶t̶ ̶t̶h̶e̶y̶'̶r̶e̶ ̶p̶o̶u̶n̶d̶ ̶f̶o̶o̶l̶i̶s̶h̶.̶ Chasing pennies with dollars
        pjmlp 3 days ago
        That has been like that since there have been publishers in the games industry.
        Back then, the indies stuff was only if you happened to live nearby someone you knew doing bedroom coding, distributing tapes on school, or they got lucky land their game on one of those shareware tapes collection.
        Trying to actually get a publisher deal was really painful, and if you did, they really wanted their money back in sales.
        [-]
        versteegen 2 days ago
        Shareware tapes collection? Was there really such a thing? If so I would imagine it would be one or two demos per tape?
        [-]
        pjmlp 2 days ago
        Yes there was such a thing, for those of us that leaved throught the 1980's.
        There are tons of games that you can fit into 60m, 90m, or 180m tapes, when 48 KB/128 KB is all you got.
        More like 20 or something.
        Magazines like Your Sinclair and Crash would have such cassete tapes,
        https://archive.org/details/YourSinclair37Jan89/YourSinclair...
        https://www.crashonline.org.uk/
        They would be glued into the magazine with adhesive tape, and later on to avoid them being stolen, the whole magazine plus tape would be in a plastic.
      - bakugo 3 days ago
        > To me that's profound. It implies that not only was management not concerned about the start up time, but none of the developers of the project ever used a profiler.
        Odds are that someone did notice it during profiling and filed a ticket with the relevant team to have it fixed, which was then set to low priority because implementing the latest batch of microtransactions was more important.
        I feel like this is just a natural consequence of the metrics-driven development that is so prevalent in large businesses nowadays. Management has the numbers showing them how much money they make every time they add a new microtransaction, but they don't have numbers showing them how much money they're losing due to people getting tired of waiting 15 minutes for the game to load, so the latter is simply not acknowledged as a problem.
      - skeaker 3 days ago
        iirc this bug existed from release but didn't impact the game until years later after a sizable number of DLCs were added to the online mode, since the function only got slower with each one added. Not that it's fine that the bug stayed in that long, but you can see how it would be missed given that when they had actual programmers running profilers at development time it wouldn't have raised any red flags after completing in ten seconds or whatever.
        [-]
        tinyhitman 2 days ago
        I don't know. As a developer there would be even more reason to be curious as to why the release binary is an order of magnitude slower then what is seen in development.
        [-]
        p_l 2 days ago
        At release it was "working fine, same as in dev".
        It slowed down gradually as the JSON manifest of optional content grew.
      - mschuster91 2 days ago
        > It disturbs me to think that this is the kind of world we live in. Where people lack such basic curiosity. The problem wasn't that optimization was hard, (optimization can be extremely hard) it was just because nobody gave a shit and nobody was even remotely curious about bad performance. They just accepted bad performance as if that's just the way the world is.
        The problem is, you don't get rewarded for curiosity, for digging down into problem heaps, or for straying out of line. To the contrary, you'll often enough get punished for not fulfilling your quota.
      - LarMachinarum 3 days ago
        > and anyone with an ounce of curiousity would be like, ".........wait a minute"
        I see what you did there ;)
      - Ygg2 2 days ago
        > Another datapoint that supports your argument is the Grand Theft Auto Online (GTAO) thing a few months ago.[0] GTAO took 5-15 minutes to start up. Like you click the icon and 5-15 minutes later you're in the main menu. Everyone was complaining about it for years.
        I see this is a datapoint, but not for your argument. This thing sat in the code base didn't cause problems and didn't affect sales of the game pre or post GTAO launch.
        This sounds a lot like selection bias. You want to enhance airplanes that flew and returned. Rather than those that didn't come back.
        Let's say they did the opposite and focused on improving this over a feature or a level from GTA. What level or what feature that you liked could you remove to make way for investigating and fixing this issue? Because at the end of the day - time is zero-sum. Everything you do comes at the expense of everything you didn't.
        [-]
        pjc50 2 days ago
        This is the sort of thing that, if fixed early enough in the development cycle, actually net brings forwards development. Because every time someone needs to test the game they hit the delay.
        (which makes it all the more strange that it wasn't fixed)
        [-]
        Ygg2 2 days ago
        > This is the sort of thing that, if fixed early enough in the development cycle
        Is it? It didn't become noticable until GTA got a bunch of DLCs.
        Sure someone might have spotted it. But it would take more time to spot it early, and that time is time not spent fixing bugs.
        godelski 2 days ago
        I think you have the logic backwards. You are saying it didn't cause problems, right? Well that's the selection bias. You're basing your assumption on what is more easily measurable. It's "not a problem" because it got sales, right? Those are the planes that returned.
        But what's much harder to measure is the number of sales you missed. Or where the downed planes were hit. You don't have the downed planes, you can't see where they were hit! You just can't have that measurement, you can only infer the data through the survivors.
        > Because at the end of the day - time is zero-sum
        Time is a weird thing. It definitely isn't zero sum. There's an old saying from tradesmen "why is there always time to do things twice but never time to do things right?" Time is made. Sometimes spending less time gives you more time. And all sorts of other weird things. But don't make the classic mistake of rushing needlessly.
        Time is only one part of the equation and just like the body the mind has stamina. Any physical trainer would tell you you're going to get hurt if you just keep working one group of muscles and keep lifting just below your limit. It's silly that the idea is that we'd go from sprint to sprint. The game industry is well known to be abusive of its developers, and that's already considering the baseline developer isn't a great place to start from, even if normalized.
        [-]
        Ygg2 2 days ago
        > But what's much harder to measure is the number of sales you missed. Or where the downed planes were hit. You don't have the downed planes, you can't see where they were hit! You just can't have that measurement, you can only infer the data through the survivors.
        Not really. There are about 300 million gamers [1] if you exclude Androids and iPhones. How many sales units did GTA V make? 215 million[2]. It's a meteoric hit. They missed a sliver (35%) of their target audience.
        You could argue that they missed the mobile market. But the biggest market - Android is a pain to develop for; the minimum spec for GTA V to have parity on phones would exclude a large part of the market (most likely), and the game itself isn't really mobile-friendly.
        Ok, but we have a counter example (pun intended). Counter-Strike. Similarly, multiplayer, targets PCs mostly, developed by Valve, similarly excellent and popular to boot. However, it's way faster and way better optimized. So how much it "sold" according to [3]? 70 million. 86 if you consider Half-Life 1 and 2 as its single player campaign.
        I'm not sure what the deciding factor for people is, but I can say it's not performance.
        > Time is a weird thing. It definitely isn't zero sum.
        If you are doing thing X, you can't do another thing Y, unless you are multitasking (if you are a time traveler, beware of paradoxes). But then you are doing two things poorly, and even then, if you do X and Y, adding other tasks becomes next to impossible.
        It definitely is. Tim Cain had a video[4] about how they spent man months trying to find a cause for a weird foot sliding bug, that's barely noticeable, which they managed so solve. And at that time Diablo came out and it was a massive success with foot sliding up the wazoo. So, just because it bugs you doesn't mean others will notice.
        > "why is there always time to do things twice but never time to do things right?"
        Because you're always operating with some false assumption. You can't do it right, because the right isn't fixed and isn't always known, nor is it specified right for whom?
        [1]https://www.pocketgamer.biz/92-of-gamers-are-exclusively-usi...
        [2]https://web.archive.org/web/20250516021052/https://venturebe...
        [3]https://vgsales.fandom.com/wiki/Counter-Strike
        [4]https://youtu.be/gKEIE47vN9Y?t=651
        [-]
        godelski 2 days ago
        > They missed a sliver (35%) of their target audience.
        Next time you're at a party go take a third of the cake and then tell everyone you just took "a sliver". See what happens...
        Honestly, there's no point in trying to argue with you. Either you're trolling, you're greatly disconnected from reality, or you think I'm brain dead. No good can come from a conversation with someone that is so incorrigible.
        [-]
        Ygg2 a day ago
        > Next time you're at a party go take a third of the cake and then tell everyone you just took "a sliver". See what happens...
        Fine, I'll concede it's the wrong word used. But:
        > Honestly, there's no point in trying to argue with you. Either you're trolling, you're greatly disconnected from reality
        Wait. I'm disconnected? Selling millions of unit (Half life) is amazing success and tens of millions is stellar success by any measure (Baldur's Gate, Call of Duty, Skyrim). But selling hundreds of millions (Minecraft, GTAV)? That's top 10 most popular game of all time.
        So according to you, one of the top 5 best-selling game in history is somehow missing a huge part of the market? You can argue a plethora of things, but you can't speculate that GTA V could have done much better by saying "you're trolling"/"no point arguing".
        And saying that optimizing the DLC JSON loader could have given them a bigger slice of the pie is incredulous at best.
        You're extrapolating your preferences to 6 billion people. It's like watching a designer assume everyone will notice they used soft kerning, with dark grey font color on a fishbone paper background for their website. And that they cleverly aligned the watermark with the menu elements.
      - Sophira 3 days ago
        Honestly the GTA5 downloader/updater itself has pretty bad configuration. I wrote a post about it on Reddit years ago along with how to fix it.
        I don't know if it's still applicable or not because I haven't played it for ages, but just in case it is, here's the post: https://www.reddit.com/r/GTAV/comments/3ysv1d/pc_slow_rsc_au...
    - jmmv 2 days ago
      > Grab a fucking profiler, it is more important than your Big O
      This is exactly why I wrote https://jmmv.dev/2023/09/performance-is-not-big-o.html a few years back. The focus on Big O during interviews is, I think, harmful.
      [-]
      - godelski 2 days ago
        I think you're right. Early on I did HPC and scientific computing. No one talked about Big O. Maybe that's because a lot of people were scientists, but still, there was a lot of focus on performance. Really how people were optimized is with a profiler. You talk about the type of data being processed and how and looked for the right way to do things based on this and people didn't do the reductions and simplifications in Big O.
        Those simplifications are harmful when you start thinking about parallel processing. There's things you might want to do that would look silly in serial process. O(2n) can be better than O(n) because you care about the actual functions. Let's say you have a loop and you do y[i] = f(x[i]) + g(x[i]). If f and g are heavy then you may want to split this out into two loops y[i] += f(x[i]) and y[i] += g(x[i]) since these are associative (so non-blocking).
        Most of the work was really about I/O. Those were almost always the bottlenecks. Your Big O won't help there. You gotta write things with awareness about where in memory it is and what kind of memory is being used. All about how you break things apart, operate in parallel, and what you can run asynchronously.
        Honestly, I think a big problem these days is that we still operate as if a computer has 1 CPU and only one location for memory.
      - mjevans 2 days ago
        What more harmful is probably not having a set of guilds / unions (that both work together, share the same collective bargains, but also compete for members) to cut a lot of the annoying for ALL sides involved interview process out.
        Why do they ask about Big O? Because it works as a filter. That's how bad some of the candidates are.
        What would I rather they do? Have a non-trivial but obviously business unrelated puzzle that happens to include design flaws and the interviewee is given enough time and latitude to 'fulfill the interface, but make this the best it can be'.
    - AdieuToLogic 3 days ago
      > People asking for performance aren't pissed you didn't write Microsoft Word in assembly we're pissed it takes 10 seconds to open a fucking text editor.
      It could be worse I suppose...
      Some versions of Microsoft Excel had a flight simulator embedded in them[0]!
      :-D
      0 - https://web.archive.org/web/20210326220319/https://eeggs.com...
    - genewitch 2 days ago
      is there a windows profiler that i can use on microsoft binaries to see what the hold-up is? I think 15 years ago i used valgrind and i cannot remember for what. Either way, there's a ton of stuff i want to report, either to microsoft (they won't care), but the internet might.
      i've managed to track down powershell in windows terminal taking forever to fully launch down to "100% gpu usage in the powershell process", but i'd really like to know what it thinks it's doing.
      also: 4 seconds to blank document in Word. the splash screen is most of that. notepad++ ~2 seconds. notepad.exe slightly over 1 second before the text content loads. Edge: 2 seconds to page render from startup. Oh and powershell loads instantly if i switch to the "old" terminal, but the old terminal doesn't have tabs, so that's a non-starter. "forever" above means 45-60 seconds.
      [-]
      - p_l 2 days ago
        The splash screens usually "hide" various loading steps. In Excel it's often loading and initialization of various extensions, for example.
    - saagarjha 2 days ago
      Your computer is broken. My M1 Pro launches it to user interactive in less than two seconds. And, to be clear, I launched it in a profiler. I suggest you do the same on your machine and find out why it's taking that long.
      [-]
      - tom_ 2 days ago
        Maybe it's phoning home to verify the app, or whatever it is it does? Launch times for MS Word on my 11 year old Macbook Pro, approx time to the opening dialog:
        First run since last reboot: 19 seconds
        Second run: 2.5 seconds
        Third run after sudo purge: 7 seconds
        Maybe it's an artefact of where I live, but the verify step always takes ages. First run of anything after a reboot takes an outlandish amount of time. GUI stuff is bad; Unix-type subprocess-heavy stuff is even worse. Judging by watching the Xcode debugger try to attach to my own programs, when they are in this state, this is not obviously something the program itself is doing.
        [-]
        godelski 2 days ago
        I think you're right. I rarely use word and so it was definitely running "cold"
        I went ahead and did another run and it was much faster. About 2 seconds. So things are definitely being cached. I did a trace on it (Instruments) and there's a lot of network activity. Double the time after sudo purge. There's 2 second of network time where the previous run only spent 1 second. Ran a tad faster when I turned the network off, though ended up using more CPU power.
        FWIW, looks to be only using 4 of my 8 cores, all of which are performance cores. Also looks like it is fairly serialized as there's not high activation on any 2 cores at the same time. Like I'll see one core spike, drop, and then another core spike. If I'm reading the profiler right then those are belonging to the same subprocesses and just handing over to a different thread.
        For comparison, I also ran on ghostty and then opened vim. Ghostty uses the performance cores but very low demand. vim calls the efficiency cores and I don't see anything hit above 25% and anytime there's a "spike" there's 2, appearing across 2 cores. Not to mention that ghostty is 53MB and nvim is more than a magnitude less. Compared to Word's 2.5GB...
        I stand by my original statement. It's a fucking text editor and it should be nearly instantaneous. As in <1s cold start.
        [-]
        saagarjha a day ago
        I think even 1s is generous, of course. I'm just saying it doesn't actually take 10.
      - tengwar2 a day ago
        Are we talking about Word, or a text editor? They seem to be saying the latter, particularly given the comparison with vi. I consistently get about half a second to open TextEdit on an M1, and that seems to be due to the opening animation.
    - tekknik 2 days ago
      you’re asking for people to care about something they do a few times a day and further asking people to devote time to this. it’s ok if you feel this is important but as a developer for over 15 years i don’t care if my text editor takes 10 seconds to start as i have other things starting at the same time that takes longer.
      or put another way, if you care about text editor performance, or are hyper focused on performance in all cases, you miss the point of software development
    - 1vuio0pswjnm7 3 days ago
      "I literally timed it on my M2 Air."
      I bet it opens faster on a Surface Pro
      [-]
      - justsid 3 days ago
        It does not. In fact, it crashes roughly every 4th or so startup.
        [-]
        1vuio0pswjnm7 3 days ago
        Yikes. I'm glad I do not use Windows anymore
      - 3 days ago
        [deleted]
      - godelski 3 days ago
        I mean we're talking about a fucking text editor here. A second to load is a long time even if it was on an intel i3 from 10 years ago. Because... it is a text editor... Plugins and all the fancy stuff is nice, but those can be loaded asynchronously and do not need to prevent you from jumping into a blank document.
        But the god damn program is over 2GB in size... like what the fuck... There's no reason for an app I open a few times a year and have zero plugins and ONLY does text editing should even be a gig.
        Seriously, get some context before you act high and mighty.
        I don't know how anyone can look at Word and think it is anything but the accumulation of bloat and tech debt piling up. With decades of "it's good enough" compounding and shifting the bar lower as time goes on.
        [-]
        versteegen 2 days ago
        As a long time emacs user, all of that criticism hits uncomfortably close to home, much as I would like to diss Word...
        [-]
        sintax 2 days ago
        Try doom emacs, that loads super fast.
        fkyoureadthedoc 2 days ago
        Nobody gives a shit because apparently MS is the only company that can make a "fucking text editor" that people actually want to use.
        [-]
        godelski 2 days ago
        > people actually want to use
        I think this is an incorrect assumption.
        Just because people use it doesn't mean they want to use it. We're in a bubble here and most people are pretty tech illiterate. Most people don't even know there are other options.
        Besides, it also misses a lot. Like how there's a lot of people that use Google Docs. Probably the only alternative an average person is aware of. But in the scientific/academic community nearly everyone uses LaTeX. They might bemoan and complain but there's a reason they're using it over Word and TeX isn't 2.5GB...
        1vuio0pswjnm7 2 days ago
        In earlier times, before Google or OS X even existed, long before "automatic updates", it used to be own experience that Microsoft's pre-installed Windows programs would run generally faster and with fewer issues (read: none) than third party software that a Windows user would install. This was also the case with software downloaded from Microsoft that a Windows user might install. Hence I thought perhaps MS Word today might run smoother on a Microsoft laptop than an Apple laptop. I'm not a Windows user anymore so I have no idea.
        For reading, editing, creating Word documents, the TextMaker Android app seems to work. Size of recent version I tried was 111MB, startup is quick. Paid features are non-essential IMHO.
        https://www.softmaker.net/down/tm2024manual_en.pdf
        A personal favourite program for me is spitbol, a SNOBOL interpreter written in a portable assembly language^3 called MINIMAL. I'm using a 779k static binary. SNOBOL is referenced here:
        1. https://borretti.me/article/you-can-choose-tools-that-make-y...
        The Almquist shell is another favourite of mine. It's both the "UI" and the language I use everyday to get stuff done on the computer. Like Microsoft Word makes some HN commenters unhappy, it seems that the UNIX shell makes some "developers" unhappy.^2
        But the shell is everywhere and it is not going away.
        IME, old software from an era before so-called "tech" companies funded by VC and "ad services", still works and I like it. For example, sed is still the favourite editor for me despite its limitations, and it is close to the shell in terms of ubiquity. Following current trends requires devoting more and more storage, memory and CPU in order to wait for today's programs to compile, start, or "update". As a hobbyist who generally ignores the trends I am experiencing no such resource drains and delays.
        For every rare software user complaining about bloat, there is at the same time a "developer", e.g., maybe one writing a Javascript engine for a massive web browser controlled by an advertising company, who is complaining about the UNIX shell.
        Developers like to frame software as a popularity contest. The most popular software is declared to have "won". (Not sure what that means about all the other software. Maybe not all software authors are interested in this contest.) To me, ubiquity is more important than "popularity":
        2. https://borretti.me//article/shells-are-two-things
        "There are 5,635 shell scripts on my humble Ubuntu box."
        This makes me happy.
        On Linux, I use vim 4.6 from 1997, a 541k static-pie binary. I use ired, bvi and toybox hexedit as hex editors, 62k, 324k and 779k static-pie binaries, respectively. If I dislike something I can change it in the source code. If I find other software I like better I can switch. No closed source and proprietary file formats like MS Word. The most entertaining aspect of the cult of Microsoft is that the company is so protective of software that it tells the public is "obsolete" or otherwise not worth using anymore, old versions of software or minimal versions that have few "features".
        https://ftp.nluug.nl/pub/vim/unix/vim-4.6.tar.gz
        https://codeload.github.com/radare/ired/zip/refs/heads/maste...
        https://codeload.github.com/johnsonjh/bvi-lf/zip/refs/heads/...
        https://www.landley.net/toybox/downloads/toybox-0.8.9.tar.gz
        https://www.landley.net/toybox/downloads/binaries/latest/toy...
        3. The topic of this thread is assembly language. Makes me happy
        The appeal of smaller, faster software to me is not that this stuff is so _good_. It is that the alternatives, software like MS Word, is so _bad_.
        [-]
        1vuio0pswjnm7 2 days ago
        The character editor TECO-C is another program I like that was originally written in assembly. This is a 153k static-pie binary ("non-video")
        https://codeload.github.com/blakemcbride/tecoc/zip/refs/head...
        Also forgot to mention that TextMaker on Android contains networking code and will try to connect to the internet. This is can be blocked with Netguard app, GrapheneOS, netfilter/pf/etc. on a gateway controlled by the user, or whatever other solution for blocking connections that the user prefers.
  - sfn42 3 days ago
    That would be an enormous waste of time. 99.9% of software doesn't have to be anywhere near optimal. It just has to not be wasteful.
    Sadly lots of software is blatantly wasteful. But it doesn't take fancy assembly micro optimization to fix it, the problem is typically much higher level than that. It's more like serialized network requests, unnecessarily high time complexities, just lots of unnecessary work and unnecessary waiting.
    Once you have that stuff solved you can start looking at lower level optimization, but by that point most apps are already nice and snappy so there's no reason to optimize further.
    [-]
    - harikb 3 days ago
      Sorry, I would word it differently. 99.9% software should be decently performant. Yes, don't need 'fancy assembly micro optimization'. That said, today some large portion of software is written by folks who absolutely doesn't care about performance - just duct-taping some sh*t to somehow make it work and call it a day.
      [-]
      - sfn42 3 days ago
        Seems to me like we're in agreement.
    - pjmlp 3 days ago
      People not paying attention on data structures and algorithms classes, or never bothering to learn them in first place.
  - therealmarv 3 days ago
    like Slack or Jira... lol.
  - Almondsetat 3 days ago
    Yeah no, I'd like non-performance critical programs to focus on other things than performance thank you
    [-]
    - Sesse__ 3 days ago
      Hard disagree. I'd like word processors to not need ten seconds just to start up. I'd like chat clients not to use _seconds_ to echo my message back to me. I'd like news pages that don't empty my mobile data cap just by existing. All of these are “non-performance critical”, but I'd _love_ for them to focus on performance.
      [-]
      - brookst 3 days ago
        So you’re a PM for a word processor. You have a giant backlog.
        Users want to load and edit PDFs. Finnish has been rendering right to left for months, but the easy fix will break Hebrew. The engineers say a new rendering engine is critical or these things will just get worse. Sales team says they’re blocked on a significant contract because the version tracking system allows unaudited “clear history” operations. Reddit is going berserk because the icon you used (and paid for!) for the new “illuminated text mode” turns out to be stolen from a Lithuanian sports team.
        Knowing that most of your users only start the app when their OS forces a reboot… just how much priority does startup time get?
        [-]
        jntun 3 days ago
        This is an incredibly convoluted hypothetical trying to negate the idea that users notice and/or appreciate how quickly their applications start. Usually as a PM you are managing multiple engineers, one of which I would assume is capable of debugging and eventually implementing a fix for faster start times. Even if they can't fix it immediately due to whatever contrived reason you've supposed, at least they will know where and how to fix it when the time does come. In fact, I would argue pretending there is no issue because of your mountain of other problems is the worst possible scenario to be in.
        hugo1789 3 days ago
        I don't think that fits MS Office. The situation is more that you have a working, usable word processor which has all the festures your user needs. Since many years ago. But your UI designer thinks it can be a little more beautiful but much slower. Of course you give that way too much priority.
        On my Laptop where I am forced by my company to run windows, I run word 2010 and it runs far better(speed and stability) that the newest word I have to use ob my office pc.
        windward 2 days ago
        Many of the important decisions are made at design and review time. When that team adds PDF support, they should act unlike the Explorer team and avoid unnecessary O(n^2) algorithms.
        Part of getting this to happen is setting the right culture and incentives. PM is such a nebulous term that I can't say this definitively, but I don't think the responsibility for this lies with them. Some poor performance is simply tech debt and should be tackled in the same way.
        $WORD_PROCESSOR employees should be capable of this: we've all seen how they interview.
      - ronsor 3 days ago
        > I'd like news pages that don't empty my mobile data cap just by existing.
        To be fair, this is because they mostly care about serving ads. Without the ads, the pages are often fine.
        [-]
        godelski 3 days ago
        Many things are slow because few programmers (or managers) care. Because they'll argue about "value" but all those notions of value are made up anyways.
        People argue "sure, it's not optimal, but it's good enough". But that compounds. A little slower each time. A little slower each application. You test on your VM only running your program.
        But all of this forgets what makes software so powerful AND profitable: scale. Since we always need to talk monetary value, let's do that. Shaving off a second isn't much if it's one person or one time but even with a thousand users that's over 15 minutes, per usage. I mean we're talking about a world where American Airlines talks about saving $40k/yr by removing an olive and we don't want to provide that same, or more(!), value to our customers? Let's say your employee costs $100k/yr and they use that program once a day. That's 260 seconds or just under 5 minutes. Nothing, right? A measly $4. But say you have a million users. Now that's $4 million!
        Now, play a fun game with me. Just go about your day as normal but pay attention to all those little speedbumps. Count them as $1m/s and let me know what you got. We're being pretty conservative here as your employee costs a lot more than their salary (2-3x) and we're ignoring slowdown being disruptive and breaking flow. But I'm willing to bet in a typical day you'll get on the order of hundreds of millions ($100m is <2 minutes).
        We solve big problems by breaking them into a bunch of smaller problems, so don't forget that those small problems add up. It's true even if you don't know what big problem you're solving.
        Sesse__ 3 days ago
        I have uBO, they're still obscenely large.
        phalanx104 3 days ago
        untrue. what bloats the modern web is the widespread AND suboptimal use of web frameworks. otherwise, making adblockers would dramatically speed up the loading of every website that uses ads, while it is true to some extent, is not the entire picture. anyways, i'm not saying that these libraries are always slow, but the users aren't aware of the performance characteristics and perf habits they should use while making use of such libraries. do you have any idea how many tens of layers of abstractions a "website" takes to reach your screen?
        [-]
        fkyoureadthedoc 3 days ago
        untrue. what makes bloats the modern web is competing incentives and businesses choosing what they think is going to make them the most money.
      - godelski 3 days ago
        When I was in school I had a laundry app (forced to use) that took 8 seconds to load, mostly while it scanned the network for the machines. It also had the rooms out of order in the room listing and no caching so every time you wanted to check the status (assuming it even worked) it took no less than a minute. It usually took less time to physically check, which also had a 100% accuracy.
        Fuck this "we don't need to optimize" bullshit. Fuck this "minimum viable product" bullshit. It's just a race to the bottom. No one paper cut is the cause of death, but all of them are when you have a thousand.
        [-]
        andrekandre 2 days ago
        > No one paper cut is the cause of death, but all of them are when you have a thousand.
        this is a common failure mode of projects unfortunately, and its precisely because paper-cuts have low signal-to-noise ratio that they are hardly worked on...
        ironically, sometimes its the reactions to big issues that cause paper-cuts to flourish (aka red-tape, incident mitigation, rushing to deadlines, "temporary" bug fixes piling up etc etc)
        [-]
        godelski 2 days ago
        I don't think it was papercuts. The company also had their API leaked and it made the news. 6 months later and it wasn't fixed and people were doing their laundry for free...
      - Capricorn2481 3 days ago
        > None of these are “non-performance critical”, but I'd _love_ for them to focus on performance
        Then you agree with the poster. Performance critical software should focus on performance.
    - lo_zamoyski 3 days ago
      Indeed. All else remaining the same, a faster program is generally more desirable than a slower program, but we don't live in generalities where all else remains the same and we simply need to choose fast over slow. Fast often costs more to produce.
      Programming is a small piece of a larger context. What makes a program "good" is not a property of the program itself, but measured by external ends and constraints. This is true of all technology. Some of these constraints are resources, and one of these resources is time. In fact, the very same limitation on time that motivates the prioritization of development effort toward some features other than performance is the very same limitation that motivates the desire for performance in the first place.
      Performance must be understood globally. Let's say we need a result in three days, and it takes two days to write a program that takes one day to get the result, but a week to write a program that takes a second to produce a result, then obviously, it is better to write the program the first way. In a week's time, your fast program will no longer be needed! The value of the result will have expired.
      This is effectively a matter of opportunity cost.
      [-]
      - godelski 3 days ago
        There's nothing more permanent than a temporary fix that works.
    - not_your_vase 3 days ago
      This mentality brings you a loading screen when you start the calculator on windows.
      [-]
      - adolph 3 days ago
        echo ${calculation} into bc works as fast as your fingers
      - fkyoureadthedoc 3 days ago
        What? Calculator starts up faster than I can figure out on where and on which screen it decided to open
        [-]
        dwringer 3 days ago
        On this machine it took me about 8 seconds to get the start menu open, about 5 seconds to get it to recognize that I'd typed "calc", another 5 seconds for it to let me actually select it to launch, and then about 20 seconds from the calculator window appearing - in its empty loading state - for it to actually come up. I admit this computer is several years old - but ... it's... a calculator.
        kiwijamo 3 days ago
        On Windows 11 I can see a startup screen briefly before it loads the calculator buttons -- takes maybe 2 seconds all up -- seems to be 1 seconds to start up screen then another second to populate the buttons. But can understand why people feel it's a regression though as I reall the win95/98/me calc.exe would pretty much appear near instantly even on the CPU/RAM/etc of the day.
        [-]
        fkyoureadthedoc a day ago
        It's probably very hardware dependent now, just like it was back then. My calculator opens to interactive in much less than 1 second, but I've got a 9800x3d and fast memory and nvme drive. The other guy saying his start menu takes 8 seconds to open probably has a pretty shit computer.
        I definitely got the stupid hourglass in win 95 when trying to open anything, but my understanding of computers at the time was that black ones were faster than beige ones, so my computer was probably shit.
        I tried to look up calculator win 95 vids on YouTube, there are a couple. One gets an hourglass - but less than a second, one is instant, one shows the calculator crashing lol.
        During this I also found out that Microsoft Calculator is open source: https://github.com/microsoft/calculator
        robinsonb5 2 days ago
        I'm currently on a Windows 10 machine with Core i5 that's more than a decade old. The calculator takes a couple of seconds to start up - provided it's a "good" day (i.e. one when Windows isn't downloading updates or doing search indexing or malware scanning in the background.)
        But I also have a Core 2 Duo-based WinXP machine in easy reach (just to keep a legacy software environment alive) and its keyboard has a dedicated calculator button. The calculator is just there the moment I press that button - it's appeared long before I can even release the button.
    - EliRivers 3 days ago
      Surely all programs are performance critical. Any program we think isn't is just a program where the performance met the criteria already.
      [-]
      - 6SixTy 3 days ago
        Safety critical systems say hello.
        [-]
        pjc50 2 days ago
        Safety critical is of course also performance critical to an even greater extent than games. You can usually get away with a dropped frame but you can't miss, say, valve timings.
        oguz-ismail 3 days ago
        > Safety critical systems
        Any concrete examples where we can see the code?
        [-]
        6SixTy 3 days ago
        sqlite is probably our best example. The project touts use within Airbus A350 and DO-178B certification.
    - 3 days ago
      [deleted]
    - const_cast 2 days ago
      The fallacy here is that if we focus on performance we could, instead, be using that time to make the application better.
      The reality is that non-performant apps aren't non-performant because they're doing so many cool things. No, that compute is wasted. The digital equivalent of pushing a box up a hill then back down 1000 times.
      I mean, the types of performance issues I've seen is like: grab 100,000 records from the database, throw away 99,900, return 100 to the front end.
      Optimizing that saves orders of magnitude of time but the thing does the same thing. Like we're just being wasteful and stupid. Those records we throw away aren't being used for some super cool AI powered feature. Its just waste.
  - byteknight 3 days ago
    Seems so easy! You only need the entire world even tangentially related to video to rely solely on your project for a task and you too can have all the developers you need to work on performance!
    [-]
    - astrange 3 days ago
      ffmpeg has competition. For the longest time it wasn't the best audio encoder for any codec[0], and it wasn't the fastest H.264 decoder when everyone wanted that because a closed-source codec named CoreAVC was better[1].
      ffmpeg was however, always the best open-source project, basically because it had all the smart developers who were capable of collaborating on anything. Its competition either wasn't smart enough and got lost in useless architecture-astronauting[2], or were too contrarian and refused to believe their encoder quality could get better because they designed it based on artificial PSNR benchmarks instead of actually watching the output.
      [0] For complicated reasons I don't fully understand myself, audio encoders don't get quality improvements by sharing code or developers the way decoders do. Basically because they use something called "psychoacoustic models" which are always designed for the specific codec instead of generalized. It might just be that noone's invented a way to do it yet.
      [1] I eventually fixed this by writing a new multithreading system, but it took me ~2 years of working off summer of code grants, because this was before there was much commercial interest in it.
      [2] This seems to happen whenever I see anyone try to write anything in C++. They just spend all day figuring out how to connect things to other things and never write the part that does anything?
      [-]
      - godelski 3 days ago
        > They just spend all day figuring out how to connect things to other things and never write the part that does anything?
        I see a lot of people write software like this regardless of language. Like their job is to glue pieces of code together from stack overflow. Spending more time looking for the right code that kinda sorta works than it would take to write the code which will just work.
        [-]
        astrange 3 days ago
        At least they get there.
        I was thinking about two types of people; one gets distracted and starts writing their own UI framework and standard library and never gets back to the program. The other starts writing a super-flexible plugin system for everything because they're overly concerned with developing a community to the point they don't want to actually implement anything themselves.
        (In this space the first was a few different mplayer forks and the second was gstreamer.)
        [-]
        godelski 3 days ago
        Sometimes they get there but a lot of times not too.
        I'm pretty sure there are a lot more types and the two you wrote aren't the copy-pasters either. Me, I try to follow the Unix philosophy[0] though I think there's plenty of exceptions to be made. Basically just write a bunch of functions and make your functions simple. Function overhead calls are usually cheap so this allows things to be very flexible. Because the biggest lesson I've learned is that the software is going to change so it is best to write with this in mind. The best laid plans of mice and men and all I guess. So write for today but don't forget about tomorrow.
        Then of course there are those that love abstractions, those that optimize needlessly, and many others. But I do feel the copy-pasters are the most common type these days.
        [0] https://en.wikipedia.org/wiki/Unix_philosophy
      - skeaker 3 days ago
        That's a fun term for [2]. Our team always called it bikeshedding.
    - ackfoobar 3 days ago
      I seem to recall that they lamented on twitter the low amount of (monetary or code) contribution they got, despite how heavily they are used.
      [-]
      - godelski 3 days ago
        They have some fire tweets, especially when people say they write things from scratch or boast about how much money they make with ffmpeg wrappers
        https://x.com/FFmpeg/status/1775178803129602500
        https://x.com/FFmpeg/status/1856078171017281691
        https://x.com/FFmpeg/status/1950227075576823817
        Oh, and here's one making fun of HN comments. Hi ffmpeg :) https://x.com/FFmpeg/status/1947076489880486131
      - hdgvhicv 3 days ago
        Wasn’t that a trillion dollar company demanding support for their little problem?
      - lo_zamoyski 3 days ago
        No one is forcing them to produce code for free. There is something toxic about giving things away for free with the ulterior motive of getting money for it.
        [-]
        imchillyb 3 days ago
        It’s market manipulation, with the understanding that free beats every other metric.
        Once the competition fails, the value extraction process can begin. This is where the toxicity of our city begins to manifest. Once there is no competition remaining we can begin eating seeds as a pastime activity.
        The toxicity of our city; our city. How do you own the world? Disorder.
        Disorder…
    - hluska 3 days ago
      You know friend, if open source actually worked like that I wouldn’t be so allergic to releasing projects. But it doesn’t - a large swath of the economy depends on unpaid labour being treated poorly by people who won’t or can’t contribute.
  - motorest 3 days ago
    > Imagine all projects were similarly committed.
    How many projects would have anything to benefit from this focus on optimization, though?
    There is a reason why the first rule of optimization is "don't do it", and the second (experts only) is "don't do it yet".
  - lmm 2 days ago
    As an industry we are too bad at correctness to even begin to worry about performance. Looking at FFmpeg (who are a pretty good project that I don't want to pick on too much) I see their most recent patch release fixes 3 CVEs from 2023 plus one from this year, and that's just security vulnerabilities, never mind regular bugs. Imagine if half the effort that people put into making it fast went into making it right.
- dlcarrier 2 days ago
  Web browsers and Electron far more than make up for it, in wasted resources.
  [-]
  - harry8 2 days ago
    They run on machines that are designed to have unused computer capacity. Desktops, laptops, tablets, phones. Responsiveness is the key.
    Ffmpeg runs on a load of server farms for all your favourite streaming services and the bajillion more you’ve never heard of. Saving compute there saves machines and power.
    Your point is well taken but there is a distinction that matters some.
    [-]
    - dlcarrier 2 days ago
      Does it?
      I've had some instability using PyTorch on my desktop computer, that only appeared if I was using the computer. I just a few minutes ago discovered what the problem was, because while running a calculation I opened FreeTube, a YouTube interface that runs in an Electron instance, and my computer immediately restarted. Apparently 8 cores, 96 MB of cache, 32 GB of system RAM, and 16 GB of VRAM doesn't leave enough unused capacity for one more web browser instance.
      I've helped multiple people figure out that their games were getting latency spikes because of the resources used by chat applications running on Electron or similar, which no amount of RAM or processing power seems to fix.
      Besides the opertunity cost of the resources that extremely inneficient applications use making the computer much less useful for other tasks, power consumption itself can matter a lot. Sure, using more power on a desktop computer may only cost a fraction of a cent per day, but on laptops, tablets, and phones that don't have user replacable batteries, which is pretty much every one in production, every percent power using increase not only wares the battery the same percentage faster, which proportionally reduces the useful life of the device, but it reduces how long the device is usable between charges, requring topping off and reducing usability and reliability.
      Also, running ICQ on a 90's computer with a mechanical hard drive was far more responsive than running Slack or Discord on the fastest computer you can buy today today. I can guaruntee you that switching from an Electron/HTML/CSS/JavaScript/WhateverJavaScriptFramework stack to a C/GTK or similar stack will not only reduce resource consumption by an order of magnitude or two, increase security, and make the codebase simpler and easier to mantain, it will also be much, much more responsive.
    - kasabali 2 days ago
      Saving compute in consumer facing devices saves machines and power, too.
      [-]
      - harry8 2 days ago
        Yeah power. But less so. My phone or laptop battery lasts shorter between charges. I’m not buying more phones or laptops to take up the slack. Powering the screen dominates battery usage between charges. It isn’t nothing, sure.
        I’m in no way defending electron. It’s just not taking back the power and machines saved by ffmpeg. Which is a happy accident that’s nice. Restart your electron hate all you want.
- zahlman 3 days ago
  It'd be nice, though, to have a proper API (in the traditional sense, not SaaS) instead of having to figure out these command lines in what's practically its own programming language....
  [-]
  - codys 3 days ago
    FFMpeg does have an API. It ships a few libraries (libavcodec, libavformat, and others) which expose a C api that is used in the ffmpeg command line tool.
    They publish doxygen generated documentation for the APIs, available here: https://ffmpeg.org/doxygen/trunk/
    [-]
    - zahlman 3 days ago
      Don't know how I overlooked that, thanks. Maybe because the one Python wrapper I know about is generating command lines and making subprocess calls.
      [-]
      - Wowfunhappy 3 days ago
        They're relatively low level APIs. Great if you're a C developer, but for most things you'd do in python just calling the command line probably does make more sense.
        [-]
        kaladin-jasnah 3 days ago
        As someone that used these APIs in C, they were not very well-documented nor intuitive, and oftentimes segfaulted when you messed up, instead of returning errors—I suppose the validation checks sacrifice performance for correctness, which is undesirable. Either way, dealing with this is not fun. Such is the life of a C developer, I suppose....
        f1shy 3 days ago
        It could even make sense in C. In some circumstances, I wouldn’t feel bad for cutting that corner.
        [-]
        1718627440 3 days ago
        Yes, that's what I did some time ago. I already want concurrency and isolation, so why not let the OS do that. Also I don't need to manage resources, when ffmpeg already does that.
      - ansk 3 days ago
        For future reference, if you want proper python bindings for ffmpeg* you should use pyav.
        * To be more precise, these are bindings for the libav* libraries that underlie ffmpeg
      - javier2 3 days ago
        If you are processing user data, the subprocess approach makes it easier to handle bogus or corrupt data. If something is off, you can just kill the subprocess. If something is wrong with the linked C api, it can be harder to handle predictably.
        [-]
        astrange 3 days ago
        Also because you can apply stricter sandboxing/jail/containerization to the process.
  - xxpor 3 days ago
    I get why the CLI is so complicated, but I will say AI has been great at figuring out what I need to run given an English language input. It's been one of the highest value uses of AI for me.
    [-]
    - alfg 2 days ago
      Pretty much the reason I created https://github.com/alfg/ffmpeg-commander, prior to AI uses.
    - gooob 3 days ago
      hell yeah, same here. i made a little python GUI app to edit videos
Validark 2 days ago
I would be interested in more examples where "assembly is faster than intrinsics". I.e., when the compiler screws up. I generally write Zig code with the expectation of a specific sequence of instructions being emitted, and I usually get it via the high level wrappers in std.simd + a few llvm intrinsics. If those fail I'll use inline assembly to force a particular instruction. On extremely rare occasions I'll rely on auto-vectorization, if it's good and I want it to fall back on scalar on less sophisticated CPU targets (although sometimes it's the compiler that lacks sophistication). Aside from the glaring holes in the VPTERNLOG finder, I feel that instruction selection is generally good enough that I can get whatever I want.
The bigger issue is instruction ordering and register allocation. On code where the compiler effectively has to lower serially-dependent small snippets independently, I think the compiler does a great job. However, when it comes to massive amounts of open code I'm shocked at how silly the decisions are that the compiler makes. I see super trivial optimizations available at a glance. Things like spilling x and y to memory, just so it can read them both in to do an AND, and spill it again. Constant re-use is unfortunately super easy to break: Often just changing the type in the IR makes it look different to the compiler. It also seems unable to merge partially poisoned (undefined) constants with other constants that are the same in all the defined portions. Even when you write the code in such a way where you use the same constant twice to get around the issue, it will give you two separate constants instead.
I hope we can fix these sorts of things in compilers. This is just my experience. Let me know if I left anything out.
KwanEsq 3 days ago
Prior discussion 2025-02-22, 222 comments: https://news.ycombinator.com/item?id=43140614
ykonstant 2 days ago
>There are two flavours of x86 assembly syntax that you’ll see online: AT&T and Intel. AT&T Syntax is older and harder to read compared to Intel syntax. So we will use Intel syntax.
God bless you, ffmpeg.
WhitneyLand 3 days ago
I was expecting to read pearls of wisdom gleaned from all the hard work done on the project, but I’m not really getting how this relates to ffmpeg.
The few chapters I saw seemed to be pretty generic intro to assembly language type stuff.
[-]
- saagarjha 2 days ago
  It's intended to get people up to speed on assembly so they can contribute to FFmpeg.
commandlinefan 3 days ago
Shame this doesn't start with a quick introduction to running the examples with an actual assembler like NASM.
NullCascade 3 days ago
What is the actual process of identifying hotspots caused suboptimal compiler generated assembly?
Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
[-]
- astrange 3 days ago
  So the main issues here are not what people think they are. They generally aren't "suboptimal assembly", at least not what you can reasonably expect out of a C compiler.
  The factors are something like:
  - specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them.
  - predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it.
  - optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything. Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing.
  - taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`.
  [-]
  - derf_ 3 days ago
    One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.
    [0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.
    [-]
    - saagarjha 2 days ago
      Unfortunately modern processors do not work how most people think they do. Optimizing for less work for a nebulous idea of what "work" is generally loses to bad memory access patterns or just using better instructions that seem most expensive if you look at them superficially.
    - magicalhippo 3 days ago
      It can be sobering to consider how many instructions a modern CPU can execute in case of a cache miss.
      In the timespan of a L1 miss, the CPU could execute several dozen instructions assuming a L2 hit, hundreds if it needs to go to L3.
      No wonder optimizing memory access can work wonders.
  - ack_complete 3 days ago
    > And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything.
    Wouldn't Intel be the one defining the intrinsics? They're referenced from the ISA manuals, and the Intel Intrinsics Guide regularly references intrinsics like _allow_cpu_features() that are only supported by the Intel compiler and aren't implemented in MSVC.
    [-]
    - astrange 3 days ago
      The _emm _epi8 stuff is Hungarian notation, which is from Microsoft.
      [-]
      - ack_complete 3 days ago
        Uh, no, that's standard practice for disambiguating the intrinsic operations for different data types without overloading support. ARM does the same thing with their vector intrinsics, such as vaddq_u8(), vaddq_s16(), etc.
- duped 3 days ago
  Normally you spin up a tool like vtune or uprof to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM.
  > Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
  IME, not really. I've done a fair bit of hand-written assembly and it exclusively comes up when dealing with architecture-specific problems - for everything else you can just write C (unless you hit one of the edge cases where C semantics don't allow you to express something in C, but those are rare).
  For example: C and C++ compilers are really, really good at writing optimized code in general. Where they tend to be worse are things like vectorized code which requires you to redesign algorithms such that they can use fast vector instructions, and even then, you'll have to resort to compiler intrinsics to use the instructions at all, and even then, compiler intrinsics can lead to some bad codegen. So your code winds up being non-portable, looks like assembly, and has some overhead just because of what the compiler emits (and can't optimize). So you wind up just writing it in asm anyway, and get smarter about things the compiler worries about like register allocation and out-of-order instructions.
  But the real problem once you get into this domain is that you simply cannot tell at a glance whether hand written assembly is "better" (insert your metric for "better here) than what the compiler emits. You must measure and benchmark, and those benchmarks have to be meaningful.
  [-]
  - Sesse__ 3 days ago
    > Normally you spin up a tool like vtune or uprof to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM.
    perf is included with the Linux kernel, and works with a fair amount of architectures (including Arm).
    [-]
    - godelski 3 days ago
      You may still need to install linux-tools to get the perf command.
      [-]
      - Sesse__ 3 days ago
        It's included with the kernel as distributed by upstream. Your distribution may choose to split out parts of it into other binary packages.
        [-]
        godelski 3 days ago
        I'm not disagreeing, I just wanted to add so others might know why they can't just run the command.
    - duped 3 days ago
      perf doesn't give you instruction level profiling, does it? I thought the traces were mostly at the symbol level
      [-]
      - Sesse__ 3 days ago
        Hit enter on the symbol, and you get instruction-level profiles. Or use perf annotate explicitly. (The profiles are inherently instruction-level, but the default perf report view aggregates them into function-level for ease of viewing.)
- jcranmer 3 days ago
  > Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
  Not really. There are a couple of reasons to reach for handwritten assembly, and in every case, IR is just not the right choice:
  If your goal is to ensure vector code, your first choice is to try slapping explicit vectorize-me pragmas onto the loop. If that fails, your next effort is either to use generic or arch-specific vector intrinsics (or jump to something like ISPC, a language for writing SIMT-like vector code). You don't really gain anything in this use case from jumping to IR, since the intrinsics will satisfy your code.
  If your goal is to work around compiler suboptimality in register allocation or instruction selection... well, trying to write it in IR gives the compiler a very high likelihood of simply recanonicalizing the exact sequence you wrote to the same sequence the original code would have produced for no actual difference in code. Compiler IR doesn't add anything to the code; it just creates an extra layer that uses an unstable and harder-to-use interface for writing code. To produce the best handwritten version of assembly in these cases, you have to go straight to writing the assembly you wanted anyways.
  [-]
  - astrange 3 days ago
    Loop vectorization doesn't work for ffmpeg's needs because the kernels are too small and specialized. It works better for scientific/numeric computing.
    You could invent a DSL for writing the kernels in… but they did, it's x86inc.asm. I agree ispc is close to something that could work.
- 3 days ago
  [deleted]
SilentM68 3 days ago
Why not include the required or targeted math lessons needed for the FFmpeg Assembly Lessons in the GitHub repository? It'd be easier for people to get started if everything was in one place :)
[-]
- snickerbockers 3 days ago
  NTA but if the assumption is that the reader has only a basic understanding of C programming and wants to contribute to a video codec there is a lot of ground that needs to be covered just to get to how the cooley/tukey algorithm works and even that's just the basic fundamentals.
  [-]
  - byryan 3 days ago
    I read the repo more as "go through this if you want to have a greater understanding of how things work on a lower level inside your computer". In other words, presumably it's not only intended for people who want to contribute to a video codec/other parts of ffmpeg. But I'm also NTA, so could be wrong.
sylware 3 days ago
There is serious abuse of nasm macro-preprocessor. Going to be tough to move away to another assembler.
[-]
- loeg 3 days ago
  Why move away?
  [-]
  - saagarjha 2 days ago
    Not everyone wants to use nasm. Sometimes all you have is the clang integrated assembler :(
    [-]
    - sylware 2 days ago
      You could use GAS, FASM2, or write an ffmpeg specific one: writting a real-life assembler is orders of magnitude simpler than writting a real-life compiler... Usually that implementation complexity is very dependent on the macro pre-precessor one.
      This is a matter of exit cost: for instance look at linux bitkeeper->git exit.
    - loeg 2 days ago
      > Not everyone wants to use nasm.
      Yeah, but I'd really only be concerned with what the ffmpeg developers want here.
    - sylware 2 days ago
      Yep, look at nasm SDK, horrible, with tons of perl generators, etc, etc. From the worst.
- oguz-ismail 3 days ago
  Where? There's very little code in those lessons
  [-]
  - pveierland 3 days ago
    The lessons reference `cglobal` in `x86inc.asm`:
    https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/x...
ngcc_hk 3 days ago
More interesting than I thought it could be. A domain specific tutorial is so much better.
Alifatisk 3 days ago
How do they make these assembly instructions portable across different cpus?
[-]
- CannotCarrot 3 days ago
  I think there's a generic C fallback, which can also serve as a baseline. But for the big (targeted) architectures, there one handwritten assembly version per arch.
  [-]
  - faluzure 3 days ago
    Yup.
    On startup, it runs cpuid and assigns each operation the most optimal function pointer for that architecture.
    In addition to things like ‘supports avx’ or ‘supports sse4’ some operations even have more explicit checks like ‘is a fifth generation celeron’. The level of optimization in that case was optimizing around the cache architecture on the cpu iirc.
    Source: I did some dirty things with chromes native client and ffmpeg 10 years ago.
- 3 days ago
  [deleted]
- KeplerBoy 3 days ago
  They don't. It's just x86-64.
  [-]
  - ahartmetz 3 days ago
    The lessons yes, but the repo contains assembly for the 5-6 architectures in wide use in consumer hardware today. Separate files of course. https://github.com/FFmpeg/FFmpeg/tree/master/libavcodec
    [-]
    - KeplerBoy 3 days ago
      Yeah, sure. I was specifically referring to the tutorials. Ffmpeg needs to run everywhere, although I believe they are more concerned about data center hardware than consumer hardware. So probably also stuff like power pc.
      [-]
      - duskwuff 3 days ago
        To a first approximation, the only architectures where people really care about ffmpeg performance (anymore) are x86_64 and arm64. Everything else is of minimal importance - the few assembly routines for other architectures were probably written more for fun than for practical reasons.
abhisek 3 days ago
Love it. Thanks for taking the time to write this. Hope it will encourage more folks to contribute.
nisten 3 days ago
[flagged]
rob44 2 days ago
[flagged]