Part of the issue is the number of instructions Python has to go through to do useful work. Most of that is unwrapping values and making sure they're the right type to do the thing you want.
For example, if you compile x + y in C, you'll get a few clean instructions that add x and y according to their data types. But if you compile the same thing with some sort of Python compiler, it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that unwraps values, determines which "add" to call, and so forth.
If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.
That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."
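To make the `x + y` point concrete, here is a rough sketch (not CPython's actual code, and much simplified from the full operator protocol) of the dispatch that compiled Python would still have to carry out at runtime:

```python
def dynamic_add(x, y):
    """Simplified model of the runtime work hidden inside `x + y`
    when the operand types are only known at run time."""
    if type(x) is int and type(y) is int:           # fast path a compiler could specialize
        return int.__add__(x, y)
    result = NotImplemented
    if hasattr(type(x), "__add__"):
        result = type(x).__add__(x, y)              # general lookup on the left operand
    if result is NotImplemented and hasattr(type(y), "__radd__"):
        result = type(y).__radd__(y, x)             # reflected-operand fallback
    if result is NotImplemented:
        raise TypeError(f"unsupported operand types: {type(x).__name__}, {type(y).__name__}")
    return result

print(dynamic_add(2, 3), dynamic_add("a", "b"))     # 5 ab
```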
It would be nice to have some peripheral drivers implemented (UART, eMMC, etc.).
Having those, the next tempting step is to make the `print` function work, then a filesystem wrapper, etc.
Btw, what I'm missing is clear information about the limitations. It's definitely not true that I can take any Python snippet and run it using PyXL (threads, for example, I suppose?)
Peripheral drivers (like UART, SPI, etc.) are definitely on the roadmap - They'd obviously be implemented in HW.
You're absolutely right — once you have basic IO, you can make things like print() and filesystem access feel natural.
Regarding limitations: you're right again.
PyXL currently focuses on running a subset of real Python — just enough to show it's real Python and to prove the core concept, while keeping the system small and efficient for hardware execution.
I'm intentionally holding off on implementing higher-level features until there's a real use case, because embedded needs can vary a lot, and I want to keep the system tight and purpose-driven.
Also, some features (like threads, heavy runtime reflection, etc.) will likely never be supported — at least not in the traditional way — because PyXL is fundamentally aimed at embedded and real-time applications, where simplicity and determinism matter most.
Do I get this right? This is an ASIC running a Python-specific microcontroller which has Python-tailored microcode? And together with that, a Python bytecode → microcode compiler plus support infrastructure to get the compiled bytecode to the ASIC?
You're close:
It's currently running on an FPGA (Zynq-7000) — not ASIC yet — but yeah, could be transferable to ASIC (not cheap though :))
It's a custom stack-based hardware processor tailored for executing Python programs directly. Instead of traditional microcode, it uses a Python-specific instruction set (PySM) that hardware executes.
As someone who did a CPython Bytecode → Java bytecode translator (https://timefold.ai/blog/java-vs-python-speed), I strongly recommend against the CPython Bytecode → PySM Assembly step:
- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.
- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.
- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.
I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
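For illustration, the AST route can be prototyped in a few lines. A minimal sketch — the PYSM_* mnemonics here are invented for the example, since the real PySM ISA isn't public:

```python
import ast

BIN_OPS = {ast.Add: "PYSM_ADD", ast.Mult: "PYSM_MUL"}   # hypothetical mnemonics

def lower_expr(node, out):
    """Lower a tiny subset of the CPython AST to a made-up stack ISA."""
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        lower_expr(node.left, out)
        lower_expr(node.right, out)
        out.append(BIN_OPS[type(node.op)])
    elif isinstance(node, ast.Name):
        out.append(f"PYSM_LOAD {node.id}")
    elif isinstance(node, ast.Constant):
        out.append(f"PYSM_PUSH {node.value!r}")
    else:
        raise NotImplementedError(type(node).__name__)

ops = []
lower_expr(ast.parse("a * b + 1", mode="eval").body, ops)
print(ops)   # ['PYSM_LOAD a', 'PYSM_LOAD b', 'PYSM_MUL', 'PYSM_PUSH 1', 'PYSM_ADD']
```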
You're absolutely right that CPython bytecode changes over time and isn’t perfectly documented — I’ve also had to read the CPython source directly at times because of unclear docs.
That said, I intentionally chose to target bytecode instead of AST at this stage.
Adhering to the AST would actually make me more vulnerable to changes in the Python language itself (new syntax, new constructs), whereas bytecode changes are usually contained to VM-level behavior.
It also made it much easier early on, because the PyXL compiler behaves more like a simple transpiler — taking known bytecode and mapping it directly to PySM instructions — which made validation and iteration faster.
Either way, some adaptation will always be needed when Python evolves — but my goal is to eventually get to a point where only the compiler (the software part of PyXL) needs updates, while keeping the hardware stable.
CPython bytecode changes behaviour for no reason and very suddenly, so you will be vulnerable to changes in Python language versions. A few from the top of my head:
- In Python 3.10, jumps changed from absolute indices to relative indices
- In Python 3.11, cell variables index is calculated differently for cell variables corresponding to parameters and cell variables corresponding to local variables
- In Python 3.11, MAKE_FUNCTION has the code object at the TOS instead of the qualified name of the function
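You can see the drift with nothing but the `dis` module: the same function disassembles differently depending on which CPython runs it.

```python
import dis

def f(x, y):
    return x + y

# The opcodes printed depend on the interpreter running this snippet:
# CPython 3.10 emits BINARY_ADD, while 3.11+ emits the generic BINARY_OP
# (and adds a RESUME prologue) -- same source, different bytecode.
dis.dis(f)
```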
Have you considered joining the next tiny tapeout run? This is exactly the type of project I'm sure they would sponsor or try to get to asic.
In case you weren't aware, they give you a 200 x 150 um tile on a shared chip. There is then some helper logic to mux between the various projects on the chip.
Makes me think of LabVIEW FPGA, where you could run LabVIEW code directly on an FPGA (really, generate VHDL or Verilog from LabVIEW) and do very high loop-rate deterministic control systems. Very cool. Except with that you were locked into the National Instruments ecosystem, and no one really used it.
Thanks — it’s definitely been incredibly satisfying to see it run on real hardware!
Right now, PyXL is tied fairly closely to a specific CPython version's bytecode format (I'm targeting CPython 3.11 at the moment).
That said, the toolchain handles translation from Python source → CPython bytecode → PyXL Assembly → hardware binary, so in principle adapting to a new Python version would mainly involve adjusting the frontend — not reworking the hardware itself.
Longer term, the goal is to stabilize a consistent subset of Python behavior, so version drift becomes less painful.
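In outline, the front half of that pipeline leans entirely on standard CPython; only the last step is PyXL-specific. A sketch, with a placeholder for the unreleased translator:

```python
import dis

def frontend(source: str):
    """Python source -> CPython bytecode -> (placeholder) PySM assembly."""
    code = compile(source, "<pyxl>", "exec")           # CPython handles source -> bytecode
    instrs = list(dis.get_instructions(code))          # decoded opcode stream
    return translate_to_pysm(instrs)

def translate_to_pysm(instrs):
    # Stand-in only: the real bytecode -> PySM mapping is version-specific and not public.
    return [(i.opname, i.argval) for i in instrs]

print(frontend("led = 1"))
```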
> PyXL is a custom hardware processor that executes Python directly — no interpreter, no JIT, and no tricks. It takes regular Python code and runs it in silicon.
So, no using C libraries. That takes out a huge chunk of pip packages...
You're absolutely right — today, PyXL only supports pure Python execution, so C extensions aren’t directly usable.
That said, in future designs, PyXL could work in tandem with a traditional CPU core (like ARM or RISC-V), where C libraries execute on the CPU side and interact with PyXL for control flow and Python-level logic.
There’s also a longer-term possibility of compiling C directly to PyXL’s instruction set by building an LLVM backend — allowing even tighter integration without a second CPU.
Right now the focus is on making native Python execution viable and efficient for real-time and embedded systems, but I definitely see broader hybrid models ahead.
Would this be able to handle an exec() or eval() call? Is there a Python bytecode compiler available as Python bytecode, to include on this processor?
When I first started PyXL, this kind of vision was exactly on my mind.
Maybe not AWS Lambda specifically, but definitely server-side acceleration — especially for machine learning feature generation, backend control logic, and anywhere pure Python becomes a bottleneck.
It could definitely get there — but it would require building a full-scale deployment model and much broader library and dynamic feature support.
That said, the underlying potential is absolutely there.
PyXL today is aimed more at embedded and real-time systems.
For server-class use, I'd need to mature heap management, add basic concurrency, a simple network stack, and gather real-world benchmarks (like requests/sec).
That said, I wouldn’t try to fully replicate CPython for servers — that's a very competitive space with a huge surface area.
I'd rather focus on specific use cases where deterministic, low-latency Python execution could offer a real advantage — like real-time data preprocessing or lightweight event-driven backends.
When I originally started this project, I was actually thinking about machine learning feature generation workloads — pure Python code (branches, loops, dynamic types) without heavy SIMD needs. PyXL is very well suited for that kind of structured, control-flow-heavy workload.
If I wanted to pitch PyXL to VCs, I wouldn’t aim for general-purpose servers right away.
I'd first find a specific, focused use case where PyXL's strengths matter, and iterate on that to prove value before expanding more broadly.
This is still an early-stage project — it's not completed yet, and fabricating a custom chip would involve huge costs.
I'm a solo developer who worked on this in my spare time, so FPGA was the most practical way to prove the core concepts and validate the architecture.
Longer term, I definitely see ASIC fabrication as the way to unlock PyXL’s full potential — but only once the use case is clear and the design is a little more mature.
Oh, my comment wasn't meant as a criticism, just curiosity; I would have been extremely surprised to see such a project get fabricated.
I find the idea of a processor designed for a specific very high-level language quite interesting. What made you choose Python, and do you think it's the "correct" language for such a project? It sure seems convenient as a language, but I wouldn't have thought it best suited for the task, due to its very dynamic nature. Perhaps something like Nim, which is similar but a little less dynamic, would be a better choice?
I see what you did there! There's a LISP Machine with its guts on display at the MIT Museum. I recall we had one in the graduate student comp sci lab at University of Delaware (I was a tolerated undergrad). By then LISP was faster on a Sun workstation, but someone had taught it to play Tetris.
GC is still a WIP, but the key idea is the system won't stall — garbage collection happens asynchronously, in the background, without interrupting PyXL execution.
This is so cool, I have dreamt about doing this but wouldn't know where to start. Do you have a plan for releasing it? What is your background? Was there anything that was way more difficult than you thought it would be? Or anything that was easier than you expected?
Right now, the plan is to present it at PyCon first (next month) and then publish more about the internals afterward.
Long-term, I'm keeping an open mind, not sure yet.
My background is in high-frequency trading (HFT), high-performance computing (HPC), systems programming, and networking.
I didn't come from a HW background — or at least, I hadn't when I started — but coming from the software side gave me a different perspective on how dynamic languages could be made much more efficient at the hardware level.
Difficult - adapting the Python execution model to my needs in a way that keeps it self-coherent, if that makes sense. This is still fluid and not finalized...
Easy - not sure I'd categorize it as easy, but more surprising: the current implementation is rather simple and elegant (at least I think so :-) ), so still no special advanced CPU design stuff (branch prediction, super-scalar, etc.).
So even now, I'm getting a huge improvement over the CPython or MicroPython VMs in the known Python bottlenecks (branching, function calls, etc.).
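To put a rough number on the function-call point, a quick CPython micro-measurement (a sketch; results vary a lot by machine and interpreter version):

```python
import timeit

def noop():
    pass

n = 1_000_000
call = timeit.timeit(noop, number=n)        # a million no-op Python function calls
inline = timeit.timeit("pass", number=n)    # a million empty statements, as a baseline
print(f"~{(call - inline) / n * 1e9:.0f} ns of pure call overhead per call on this CPython")
```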
> Difficult - adapting the Python execution model to my needs in a way that keeps it self-coherent, if that makes sense. This is still fluid and not finalized...
Alright well those dots are begging me to ask what they mean, or at least one specific story for the nerds :-)
> Long-term, I'm keeping an open mind, not sure yet.
Well, please consider open source, even if you charge for access to your open-source code. And even if you don't go open source, at least make it cheap enough that a solo developer could afford to build on it without thinking.
Would be curious to see how this benchmarks on a faster FPGA, since I imagine clock frequency is the latency dictator - while memory and tile can determine how many instances can run in parallel.
Not yet — I'm currently testing on a Zynq-7000 platform (embedded-class FPGA), mainly because it has an ARM CPU tightly integrated (and it's rather cheap).
I use the ARM side to handle IO and orchestration, which let me focus the FPGA fabric purely on the Python execution core, without having to build all the peripherals from scratch at this stage.
To run PyXL on a server-class FPGA (like Azure instances), some adaptations would be needed — the system would need to repurpose the host CPU to act as the orchestrator, handling memory, IO, etc.
The question is: what's the actual use case of running on a server? Besides testing max frequency -- for which I could just run Vivado on a different target (would need license for it though)
For now, I'm focusing on validating the core architecture, not just chasing raw clock speeds.
How big a deal would it be to include the bytecode→PySM translation in the ISA? It seems like it would be even cooler if the CPU actually ran Python bytecode itself.
That's a great question!
I actually thought a lot about that early on.
In theory, you could build a CPU that directly interprets Python bytecode — but Python bytecode is quite high-level and irregular compared to typical CPU instructions.
It would add a lot of complexity and make pipelining much harder, which would hurt performance, especially for real-time or embedded use.
By compiling the Python bytecode ahead of time into a simpler, stack-based ISA (what I call PySM), the CPU can stay clean, highly pipelined, and efficient.
It also opens the door in the future to potentially supporting other languages that could target the same ISA!
Oh boy, I definitely considered that — turning PyXL into a RISC-V extension was an early idea I thought of.
It could probably be adapted into one.
But I ultimately decided to build it as its own clean design because I wanted the flexibility to rethink the entire execution model for Python — not just adapt an existing register-based architecture.
FPGA is for prototyping, although this could probably be used as a soft core.
But looking forward, ASIC is definitely the way to go.
Thank you very much.
I learned of Jazelle only after I started working on this, and that's a good thing: Jazelle never became very popular AFAIK, so knowing about it might just have made me quit. Glad I didn't though :)
The significant difference between Jazelle and your project is that Jazelle sits on top of a CPU that can already run a Java interpreter without the instruction set extensions, the extension didn't implement all of Java (it still required a runtime, in ARM code, to implement the missing opcodes), and Java runtimes quickly became better optimized than doing the same thing with the instruction set.
I think building a CPU that can only do this is a really novel idea and am really interested in seeing when you eventually disclose more implementation details. My only complaint is that it isn't Lua :P
Python’s execution model is already very stack-oriented — CPython bytecode operates by pushing and popping values almost constantly.
Building PyXL as a stack machine made it much more natural to map Python semantics directly onto hardware, without forcing an unnatural register-based structure on it.
It also avoids a lot of register allocation overhead (renaming and such).
Nice, next step could be rolling out that bytecode compiler in Python, so it’s self-contained. And a port to some LLM-on-silicon, so we could have it executing Python as the inference goes :-P
PyPy is a JIT compiler — it runs on a standard CPU and accelerates "hot" parts of a program after runtime analysis.
This is a great approach for many applications, but it doesn’t fit all use cases.
PyXL is a hardware solution — a custom processor designed specifically to run Python programs directly.
It's currently focused on embedded and real-time environments where JIT compilation isn't a viable option due to memory constraints, strict timing requirements, and the need for deterministic behavior.
That's an interesting project! I have some follow-up questions:
> No VM, No C, No JIT. Just PyXL.
Is the main goal to achieve C-like performance with the ease of writing Python? Do you have a performance comparison against C? Is the main challenge the memory management?
> PyXL runs on a Zynq-7000 FPGA (Arty-Z7-20 dev board). The PyXL core runs at 100MHz. The ARM CPU on the board handles setup and memory, but the Python code itself is executed entirely in hardware. The toolchain is written in Python and runs on a standard development machine using unmodified CPython.
> PyXL skips all of that. The Python bytecode is executed directly in hardware, and GPIO access is physically wired to the processor — no interpreter, no function call, just native hardware execution.
Did you write some sort of emulation to enable testing it without the physical Arty board?
Goal:
Yes — the main goal is to bring C-like or close-to-C performance to Python code, without sacrificing the ease of writing Python.
However, due to the nature of Python itself, I'm not sure how close I can get to native C performance, especially when competing with systems (both SW and HW) that have been revised and refined for decades.
Performance comparison against C:
I don't have a formal benchmark directly against C yet.
The early GPIO benchmark (480ns toggle) is competitive with hand-written C on ARM microcontrollers — even when running at a lower clock speed.
But a full systematic comparison (across different workloads) would definitely be interesting for the future.
Main challenge:
Yes — memory management is one of the biggest challenges.
Dynamic memory allocation and garbage collection are tricky to manage efficiently without breaking real-time guarantees.
I have a roadmap for it, but would like to stick to a real use case before moving forward.
Software emulation:
I am using Icarus (could use Verilator) for RTL simulation if that's what you meant.
But hardware behavior (like GPIO timing) still needs to be tested on the real FPGA to capture true performance characteristics.
There are a lot of dimensions to what you could call performance. The FPGA here is only clocked at 100 MHz and there's no way you're going to get the same throughput with it as you would on a conventional processor, especially if you add a JIT to optimize things. What you do get here is very low latency.
This type of project is why I love HN. This work is brilliant!
Almost every question I had, you already answered in the comments. The only one remaining at the moment: How long exactly have you been working on PyXL?
I'm a software engineer by background, mostly in high-frequency trading (HFT), HPC, systems programming, and networking — so a lot of focus on efficiency and low-level behavior.
I had played a bit with FPGAs before, but nothing close to this scale — most of the hardware and Python internals work I had to figure out along the way.
Amazing! I'm sure many programmers would join in to contribute to your great project. It could become as big as a Python-based operating system, which, due to the simplicity of the code, would advance very quickly.
Thank you!
Right now I'm focusing on keeping the core simple, efficient, and purpose-driven — mainly to run Python well on hardware for embedded and real-time use cases.
As for the future, I’m keeping an open mind.
It would be exciting if it grew into something bigger, but my main focus for now is making sure the foundation is as solid and clean as possible.
Well, one could argue that modern CPUs are designed as C machines, even more so now that everyone is adding hardware memory tagging as a means to fix C memory-corruption issues.
Only if you don't understand the history of C. B was a lowest-common-denominator grouping of assembler macros for a typical register machine; C just added a type system and a couple of extra bits of syntax. C isn't novel in the slightest: you're structuring and thinking about your code much like a certain style of assembly programming on a register machine. And yes, that type of register machine is still the most popular way to design an architecture, because it has qualities that end up being fertile middle ground between electrical engineers and programmers.
Also, there are no languages that reflect what modern CPUs are like, because modern CPUs obfuscate and hide much of how they work. Not even assembly is that close to the metal anymore, and it even has undefined behavior these days. There was an attempt to make a more explicit version of the hardware with Itanium, and it was, explicitly, a failure for much the same reason the iAPX 432 was a failure. So we kept the simpler scalar register machine around, because both compilers and programmers are mostly too stupid to work with that much complexity. C didn't do shit; human mental capacity just failed to evolve fast enough to keep up with our technology. Things like Rust are more the descendant of C than the modern design of a CPU is.
What do you think a language based on a modern CPU architecture would look like? The big deal is representing the OoO and speculative execution, right?
Text files seem a bit too sequential in structure, maybe we can figure out a way to represent the dependency graphs directly.
I envision an inflected grammar. That sounds crazy, I know, but x64 is an inflected language already. The pointer arithmetic you can attach to a register isn't an expression or a distinct group of words, it's a suffix. Part of the word, indistinguishable from it. Someone once did a great job of explaining to me how that mapped to microcode in a shockingly static way, and it blew my mind. I see affixes for controlling the branch predictor. Operations should also be inflected in a contextual way, making their relationship to other operations explicit, giving you control over how things are pipelined. Maybe take some inspiration from Afro-Asiatic languages and use a kind of consonantal root system.
The end result would look nothing like any other programming language and would die in obscurity, to be honest. But holy shit it would be really fucking cool.
Historically their performance is underwhelming. Sometimes competitive on the first iteration, sometimes just mid. But generally they can't iterate quickly (insufficient resources, insufficient product demand) so they are quickly eclipsed by pure software implementations atop COTS hardware.
This particular Valley of Disappointment is so routine as to make "let's implement this in hardware!" an evergreen tarpit idea. There are a few stunning exceptions like GPU offload—but they are unicorns.
They were a tar pit in the 1980s and 1990s, when Moore's law meant a 16x increase in processor speed every 6 years.
Right now the only reason why we don't have new generations of these eating the lunch of general purpose CPUs is that you'd need to organize a few billion transistors into something useful. That's something a bit beyond what just about everyone (including Intel now apparently) can manage.
Sure. The need to organize millions (now 10s to 100s of billions) of transistors to do something useful, the economics and will to bring those to market, the need to coordinate functions baked into hardware with the faster moving and vastly more-plastic software world—oh, and Amdahl's Law.
They are the tar pit. Transistor counts skyrocket, but the principles and obstacles have not changed one iota in over 50 years.
I love this kind of project, this is wonderful work. I guess the challenge is to now make it work for general purpose Python. In any case it looks very much like a marketable product already. I would seek financing to see how far this can go.
This looks incredible.
Do you have any open source code available for this yet?
Are you planning to release this as open source? If not, do you have a rough idea for how you plan to commercial license this tech?
Back when C# came out, I thought for sure someone would make a processor that would natively execute .Net bytecode. Glad to see it finally happened for some language.
I want to say there was a product that did this circa 2006-2008 but all I’m finding is the .NET Micro Framework and its modern successor the .NET nano Framework.
I’ve been using .NET since 2001 so maybe I have it confused with something else, but at the same time a lot of the web from that era is just gone, so it’s possible something like this did exist but didn’t gain any traction and is now lost to the ether.
Maybe you’re thinking of Singularity OS?
For Java, this was around for a bit https://en.wikipedia.org/wiki/Jazelle.
Even better was a complete system, rather than a mode for ARM cores that ran the common JVM opcodes.
https://en.wikipedia.org/wiki/PicoJava
Didn't some phones have hardware Java execution or does my memory fail me?
In university, for my undergrad thesis, I wanted to do this for a Befunge variant (choosing the character set to simplify instruction decoding). My supervisor insisted on something more practical, though. :(
Does anyone remember the JavaOne ring giveaway?
https://news.ycombinator.com/item?id=8598037
Java got that with smart cards, for example. Cute oddities of the past.
JavaCard was implemented as just a regular interpreter last time I checked.
I'd be surprised if azure app services didn't do this already.
I’d be willing to bet my net worth that they don’t
Azure runs on Linux if I'm not mistaken.
Wouldn't that be a real scoop?
Are there any limitations on what code can run? (discounting e.g. memory limitations and OS interaction)
I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.
I'd be very interested to know why you chose to do this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).
Thanks — really appreciate the interest!
There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.
Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.
As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.
> I'd prefer to move forward based on clear use cases
Taking the concrete example of the `struct` module as a use-case, I'm curious if you have a plan for it and similar modules. The tricky part of course is that it is implemented in C.
Would you have to rewrite those stdlib modules in pure python?
In terms of a feature-set to target, would it make sense to be going after RPython instead of "real" Python? Doing that would let you leverage all the work that PyPy has done on separating what are the essential primitives required to make a Python vs what are the sugar and abstractions that make it familiar:
https://doc.pypy.org/en/latest/faq.html#what-is-pypy
JVM I think I can understand, but do you happen to know more about LISP machines and whether they use an ISA specifically optimized for the language, or if the compilers for x86 end up just doing the same thing?
In general I think the practical result is that x86 is like democracy. It’s not always efficient but there are other factors that make it the best choice.
This is very, very cool. Impressive work.
I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.
The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...
So first of all, this is awesome and props to you for some great work.
I have what may be a dumb question, but I've heard that Lua can be used in embedded contexts, and that it can be used without dynamic memory allocation and other such things you don't want in real time systems. How does this project compare to that? And like I said it's likely a dumb question because I haven't actually used Lua in an embedded context but I imagine if there's something there you've probably looked at it?
With embedded scripting languages (including Lua and MicroPython), the CPU is running a compiled interpreter (usually written in C, compiled to the CPU's native architecture) and the interpreter is running the script. On PyXL, the CPU's native architecture is Python bytecode, so there's no compiled interpreter.
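In cartoon form, this is the software layer a MicroPython or Lua build runs on a stock MCU: a fetch/decode loop (written in C on real systems, sketched here in Python with made-up opcode names), which PyXL instead moves into the processor itself.

```python
# Toy dispatch loop standing in for a bytecode interpreter. On a conventional MCU
# this loop is compiled C running on the CPU; PyXL's claim is that the fetch/decode
# step becomes the hardware's job instead.
def interpret(program, env):
    stack, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD":                 # push a variable's value
            stack.append(env[arg])
        elif op == "PUSH":               # push a constant
            stack.append(arg)
        elif op == "ADD":                # pop two operands, push the sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "RET":
            return stack.pop()
        pc += 1

print(interpret([("LOAD", "x"), ("PUSH", 1), ("ADD", None), ("RET", None)], {"x": 41}))  # 42
```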
This is cool for sure. I think you’ll ultimately find that this can’t really be faster than modern OoO cores because python instructions are so complex. To execute them OoO or even at a reasonable frequency (e.g. to reduce combinatorial latency), you’ll need to emit type-specialized microcode on the fly, but you can’t do that until the types are known — which is only the case once all the inputs are known for python.
Thanks — appreciate it!
You're right that dynamic typing makes high-frequency execution tricky, and modern OoO cores are incredibly good at hiding latencies. But PyXL isn't trying to replace general-purpose CPUs — it's designed for efficient, predictable execution in embedded and real-time systems, where simplicity and determinism matter more than absolute throughput. Most embedded cores (like ARM Cortex-M and simple RISC-V) are in-order too — and deliver huge value by focusing on predictability and power efficiency. That said, there’s room for smart optimizations even in a simple core — like limited lookahead on types, hazard detection, and other techniques to smooth execution paths. I think embedded and real-time represent the purest core of the architecture — and once that's solid, there's a lot of room to iterate upward for higher-end acceleration later.
Very cool! Nobody who really wants simplicity and determinism is going to be using Python on a microcontroller though.
Hm, why not though. People managed to do it with tiny JVMs before, so why not a Python variant.
Sure, but for embedded use cases (which this is targeting), the goal isn't raw speed so much as being fast enough for specific use cases while minimizing power usage / die area / cost.
Great work! :D I had a question about that though. Instead of compiling to PySM, why not compile directly to a real assembly like ARM? Is the PySM assembly very special, to accommodate Python features in a way that can't be done efficiently in existing architectures like ARM?
Thanks — appreciate it!
Good question. In theory, you can compile anything Turing-complete to anything else — ARM and Python are both Turing-complete. But practically, Python's model (dynamic typing, deep use of the stack) doesn't map cleanly onto ARM's register-based, statically-typed instruction set. PySM is designed to match Python’s structure much more naturally — it keeps the system efficient, simpler to pipeline, and avoids needing lots of extra translation layers.
I built a hardware processor that runs Python programs directly, without a traditional VM or interpreter. Early benchmark: GPIO round-trip in 480ns — 30x faster than MicroPython on a Pyboard (at a lower clock). Demo: https://runpyxl.com/gpio
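For a sense of what's being timed, the MicroPython side of such a GPIO round-trip looks roughly like the sketch below (pin names are placeholders and exact APIs vary a little between boards); PyXL runs the equivalent logic with GPIO wired straight into the processor.

```python
# Rough sketch only: a GPIO round-trip on MicroPython, with an output pin assumed
# to be physically wired back to an input pin. The measured latency is from the
# write to observing the change on the input.
from machine import Pin

out_pin = Pin("X1", Pin.OUT)
in_pin = Pin("X2", Pin.IN)

def round_trip():
    out_pin.value(1)
    while in_pin.value() == 0:
        pass
    out_pin.value(0)
```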
* What HDL did you use to design the processor?
* Could you share the assembly language of the processor?
* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?
Thanks for the question.
HDL: Verilog
Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.
Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).
This sounds like your 'arch' (sorry, I don't 100% know the correct term here) could potentially also run Ruby/JS if the toolchain can compile them into your assembly language?
Wow, this is fascinating stuff. Just a side question (and please understand I am not a low-level hardware expert, so pardon me if this is a stupid question): does this arch support any sort of speculative execution, and if so do you have any sort of concerns and/or protections in place against the sort of vulnerabilities that seem to come inherent with that?
Thanks — and no worries, that’s a great question!
Right now, PyXL runs fully in-order with no speculative execution. This is intentional for a couple of reasons: First, determinism is really important for real-time and embedded systems — avoiding speculative behavior makes timing predictable and eliminates a whole class of side-channel vulnerabilities. Second, PyXL is still at an early stage — the focus right now is on building a clean, efficient architecture that makes sense structurally, without adding complex optimizations like speculation just for the sake of performance.
In the future, if there's a clear real-world need, limited forms of prediction could be considered — but always very carefully to avoid breaking predictability or simplicity.
Why is it not routine to "compile" Python? I understand that the interpreter is great for rapid iteration, cross compatibility, etc. But why is it accepted practice in the Python world to eschew all of the benefits of compilation by just dumping the "source" file in production?
The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library). Without type annotations, there is very little for a non-JIT compiler to optimize, since:
- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.
- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).
- There are multiple ways every single operation in Python might be resolved; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which is used unless you know the type (and if that type is monkeypatched, then the method that gets called might change).
A JIT compiler might be able to optimize some of these cases (by observing the actual types used), but a JIT compiler can work from the source file and/or be included in the CPython interpreter.
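A concrete illustration of the `a.b` point, and of why monkeypatching defeats ahead-of-time assumptions:

```python
class A:
    @property
    def b(self):                      # `a.b` resolves through the descriptor protocol
        return 1

a = A()
a.c = 2                               # `a.c` is a plain instance __dict__ lookup
print(a.b, a.c)                       # 1 2

A.b = property(lambda self: 99)       # monkeypatching the type changes what a.b means...
print(a.b)                            # 99 -- ...even for objects created earlier
```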
You make a great point — type information is definitely a huge part of the challenge.
I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.
So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.
That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.
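A tiny example of what that late binding looks like in practice:

```python
def greet() -> str:
    return "hello"

def salute() -> str:
    return "bonjour"

print(greet())     # hello
greet = salute     # the *name* is rebound at runtime...
print(greet())     # bonjour -- so a static compiler can't hard-wire the call target
```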
Solved by Smalltalk, Self, and Lisp JITs, which are at the genesis of JIT technology; some of that work landed in HotSpot and V8.
Python starting with 3.13 also has a JIT available.
Kind of: you have to compile it yourself, and it's rather basic; still early days.
PyPy and GraalPy are where the fun is; however, they are largely ignored outside their language research communities.
"Addressed" or "mitigated" perhaps. Not "solved." Just "made less painful" or "enough less painful that we don't need to run screaming from the room."
Versus what most folks do with CPython, it is indeed solved.
We are very far from having a full single user graphics workstation in CPython, even if those JITs aren't perfect.
Yes, there are a couple of ongoing attempts, while most in the community would rather write C extensions.
Is "single user graphics workstation" even still a goal? Great target in the Early to Mid Ethernetian when Xerox Dorados and Dandelions, Symbolics, and Liliths roamed the Earth. Doesn't feel like a modern goal or standard of comparison.
I used those workstations back in the day—then rinsed and repeated with JITs and GCs for Self, Java, and on to finally Python in PyPy. They're fantastic! Love having them on-board. Many blessings to Deutsch, Ungar, et al. But for 40 years JIT's value has always been to optimize away the worst gaps, getting "close enough" to native to preserve "it's OK to use the highest level abstractions" for an interesting set of workloads. A solid success, but side by side with AOT compilation of closer-to-the-machine code? AOT regularly wins, then and now.
"Solved" should imply performance isn't a reason to utterly switch languages and abstractions. Yet witness the enthusiasm around Julia and Rust e.g. specifically to get more native-like performance. YMMV, but from this vantage, seeing so much intentional down-shift in abstraction level and ecosystem maturity "for performance" feels like JIT reduced but hardly eliminated the gap.
It is solved to the point that users in those communities are not writing extensions in C all the time to compensate for the interpreter implementation.
AOT winning over JITs on micro-benchmarks hardly matters in a meaningful way for most business applications, especially when JIT caches and PGO data sharing across runs are part of the picture.
Sure, there are always going to be use cases that require AOT, and in most of them it's due to deployment constraints more than anything else.
Most mainstream devs don't even know how to use PGO tooling correctly from their AOT toolchains.
Heck, how many Electron apps do you have running right now?
> We are very far from having a full single user graphics workstation in CPython, even if those JITs aren't perfect.
Some years ago there was an attempt to create a linux distribution including a Python userspace, called Snakeware. But the project went inactive since then. See https://github.com/joshiemoore/snakeware
I fail to find anything related to having good enough performance for a desktop system written in Python.
> The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library).
When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...
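For example, a fully annotated module like the one below is the kind of thing mypyc can compile to a C extension (typically by running `mypyc` on the file and then importing the compiled module; check the mypyc docs for the exact workflow):

```python
# fib.py -- fully type-annotated, so mypyc can compile it rather than interpret it
def fib(n: int) -> int:
    a: int = 0
    b: int = 1
    for _ in range(n):
        a, b = b, a + b
    return a
```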
Python doesn’t eschew all the benefits of compilation. It is compiled, but to an intermediate bytecode rather than to native code, (somewhat) similar to the way Java and C# compile to bytecode.
Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.
The reason Python doesn’t do that is a mix of a lack of engineering resources, a desire to keep the implementation fairly simple, and the requirement of backwards compatibility for C code calling into Python to manipulate Python objects.
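To see the "compiled, but only to bytecode" part for yourself, the standard library can produce the .pyc cache explicitly (a small sketch; "mymodule.py" stands in for any module on disk):

    import py_compile

    # Writes __pycache__/mymodule.cpython-<version>.pyc: the same bytecode the
    # interpreter would cache on first import. No native code is produced.
    py_compile.compile("mymodule.py", doraise=True)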
If you define "compiling Python" as basically "taking what the interpreter would do but hard-coding the resulting CPU instructions executed instead of interpreting them", the answer is, you don't get very much performance improvement. Python's slowness is not in the interpreter loop. It's in all the things it is doing per Python opcode, most of which are already compiled C code.
If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.
A giant part of the cost of dynamic languages is memory access. It's not possible, in general, to know the type, size, layout, and semantics of values ahead of time. You also can't put "Python objects" or their components in registers like you can with C, C++, Rust, or Julia "objects." Gradual typing helps, and systems like Cython, RPython, PyPy etc. are able to narrow down and specialize segments of code for low-level optimization. But the highly flexible and dynamic nature of Python means that a lot of the work has to be done at runtime, reading from `dict` and similar dynamic in-memory structures. So you have large segments of code that are accessing RAM (often not even from caches, but genuine main memory, and often many times per operation). The associated IO-to-memory delays are HUGE compared to register access and computation more common to lower-level languages. That's irreducible if you want Python semantics (i.e. its flexibility and generality).
Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.
For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.
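As a rough sketch of that "push the hot loop into an optimized library" idiom (exact speedups depend on array size and hardware), the win comes from replacing the per-element bytecode loop with one call into packed, compiled code:

    import numpy as np

    def py_sum_squares(xs):
        total = 0.0
        for x in xs:            # each iteration runs through the bytecode loop,
            total += x * x      # boxing floats and dispatching on type
        return total

    def np_sum_squares(xs):
        a = np.asarray(xs, dtype=np.float64)
        return float(np.dot(a, a))   # one call into tightly packed native code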
There's no benefit that I know of, besides maybe a tiny cold start boost (since the interpreter doesn't need to generate the bytecode first).
I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.
> Why is it not routine to "compile" Python?
Where’s the AOT compiler that handles the whole Python language?
It’s not routine because it's not even an option, and people who are concerned either use the tools that let them compile a subset of Python within a larger, otherwise-interpreted program, or use a different language.
Check Nuitka: https://nuitka.net/
There have been efforts (like Cython, Nuitka, PyPy’s JIT) to accelerate Python by compiling subsets or tracing execution — but none fully replace the standard dynamic model at least as far as I know.
It's called Nim.
Part of the issue is the number of instructions Python has to go through to do useful work. Most of that is unwrapping values and making sure they're the right type to do the thing you want.
For example, if you compile x + y in C, you'll get a few clean instructions that add the values of x and y based on their known types. But if you compile the same thing in some sort of Python compiler, it would essentially have to include the entire Python interpreter: because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that unwraps values, determines which "add" to call, and so forth.
If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile and unannotated code that must remain interpreted at runtime, which will kill your overall performance anyway.
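Going back to the x + y example, here is a simplified sketch of the runtime dispatch a naive "compiled" x + y would still have to carry along (the name `binary_add` is illustrative, not a CPython internal, and the real protocol also has a subclass-priority rule):

    def binary_add(x, y):
        # The "compiled" code still has to discover the operand types and
        # dispatch at runtime, because they are unknown until this moment.
        add = getattr(type(x), "__add__", None)
        if add is not None:
            result = add(x, y)
            if result is not NotImplemented:
                return result
        radd = getattr(type(y), "__radd__", None)
        if radd is not None:
            result = radd(y, x)
            if result is not NotImplemented:
                return result
        raise TypeError(f"unsupported operand type(s) for +: "
                        f"{type(x).__name__!r} and {type(y).__name__!r}")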
That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."
Amazing work! This is a great project!
Every time I see a project with a great FPGA implementation, I lament the fact that Tabula, a truly innovative and fast FPGA, didn't make it.
<https://en.m.wikipedia.org/wiki/Tabula,_Inc.>
It would be nice to have some peripheral drivers implemented (UART, eMMC, etc.).
Having those, the next tempting step is to make the `print` function work, then a filesystem wrapper, etc.
BTW, what I'm missing is clear information about the limitations. It's definitely not true that I can take any Python snippet and run it with PyXL (threads, for example, I suppose?).
Great points!
Peripheral drivers (like UART, SPI, etc.) are definitely on the roadmap - They'd obviously be implemented in HW. You're absolutely right — once you have basic IO, you can make things like print() and filesystem access feel natural.
Regarding limitations: you're right again. PyXL currently focuses on running a subset of real Python — just enough to show it's real Python and to prove the core concept, while keeping the system small and efficient for hardware execution. I'm intentionally holding off on implementing higher-level features until there's a real use case, because embedded needs can vary a lot, and I want to keep the system tight and purpose-driven.
Also, some features (like threads, heavy runtime reflection, etc.) will likely never be supported — at least not in the traditional way — because PyXL is fundamentally aimed at embedded and real-time applications, where simplicity and determinism matter most.
Do I get this right? This is an ASIC running a Python-specific microcontroller which has Python-tailored microcode, together with a Python bytecode -> microcode compiler plus support infrastructure to get the compiled bytecode to the ASIC?
fun :-)
but did I get it right?
You're close: It's currently running on an FPGA (Zynq-7000) — not ASIC yet — but yeah, could be transferable to ASIC (not cheap though :))
It's a custom stack-based hardware processor tailored for executing Python programs directly. Instead of traditional microcode, it uses a Python-specific instruction set (PySM) that hardware executes.
The toolchain compiles Python → CPython Bytecode → PySM Assembly → hardware binary.
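The PySM side isn't public, but the first stage of a pipeline like that is plain CPython machinery; a transpiler frontend would walk something like the following (illustrative sketch; `toggle_led` is a placeholder and is never called, since the snippet only compiles):

    import dis
    import types

    source = """
    def blink(n):
        for i in range(n):
            toggle_led()
    """

    module_code = compile(source, "<demo>", "exec")
    func_code = next(c for c in module_code.co_consts
                     if isinstance(c, types.CodeType))   # blink()'s code object

    # Each Instruction here is what a backend would map onto its own ISA.
    for ins in dis.get_instructions(func_code):
        print(ins.offset, ins.opname, ins.argrepr)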
As someone who did a CPython Bytecode → Java bytecode translator (https://timefold.ai/blog/java-vs-python-speed), I strongly recommend against the CPython Bytecode → PySM Assembly step:
- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.
- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.
- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.
I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
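The third point is easy to see in a disassembly (a small sketch; opcode names are roughly CPython 3.11 and vary by version):

    import dis

    def total(n):
        acc = 0
        for i in range(n):
            acc += i
        return acc

    dis.dis(total)
    # In the loop, roughly:
    #   GET_ITER          # the loop's iterator is pushed onto the value stack...
    #   FOR_ITER   ...    # ...and stays there between iterations, not in a local
    #   STORE_FAST i
    # so unwinding an exception inside the loop also has to clean up that stack slot.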
Thanks — really appreciate your insights.
You're absolutely right that CPython bytecode changes over time and isn’t perfectly documented — I’ve also had to read the CPython source directly at times because of unclear docs.
That said, I intentionally chose to target bytecode instead of the AST at this stage. Targeting the AST would actually make me more vulnerable to changes in the Python language itself (new syntax, new constructs), whereas bytecode changes are usually contained to VM-level behavior. It also made things much easier early on, because the PyXL compiler behaves more like a simple transpiler — taking known bytecode and mapping it directly to PySM instructions — which made validation and iteration faster.
Either way, some adaptation will always be needed when Python evolves — but my goal is to eventually get to a point where only the compiler (the software part of PyXL) needs updates, while keeping the hardware stable.
CPython bytecode changes behaviour for no reason and very suddenly, so you will be vulnerable to changes in Python language versions. A few off the top of my head:
- In Python 3.10, jumps changed from absolute indices to relative indices
- In Python 3.11, cell variables index is calculated differently for cell variables corresponding to parameters and cell variables corresponding to local variables
- In Python 3.11, MAKE_FUNCTION has the code object at the TOS instead of the qualified name of the function
For what it's worth, I created a detailed behaviour of each opcode (along with example Python sources) here: https://github.com/TimefoldAI/timefold-solver/blob/main/pyth... (for up to Python 3.11).
This was my first thought as well. They will be stuck at a certain Python version.
Have you considered joining the next tiny tapeout run? This is exactly the type of project I'm sure they would sponsor or try to get to asic.
In case you weren't aware, they give you a 200 x 150 um tile on a shared chip. There is then some helper logic to mux between the various projects on the chip.
https://tinytapeout.com/
Not an ASIC, it’s running on an FPGA. There is an ARM CPU that bootstraps the FPGA. The rest of what you said is about right.
Makes me think of LabVIEW FPGA, where you could run LabVIEW code directly on an FPGA (more precisely, generate VHDL or Verilog from LabVIEW) and do very high loop-rate deterministic control systems. Very cool. Except with that you were locked into the National Instruments ecosystem, and no one really used it.
Wow, these FPGAs are not cheap. Don't they also have a couple of ARM cores attached on the SOC?
Fantastic work! :D Must be super-satisfying to get it up and running! :D
Is it tied to a particular version of python?
Thanks — it’s definitely been incredibly satisfying to see it run on real hardware!
Right now, PyXL is tied fairly closely to a specific CPython version's bytecode format (I'm targeting CPython 3.11 at the moment).
That said, the toolchain handles translation from Python source → CPython bytecode → PyXL Assembly → hardware binary, so in principle adapting to a new Python version would mainly involve adjusting the frontend — not reworking the hardware itself.
Longer term, the goal is to stabilize a consistent subset of Python behavior, so version drift becomes less painful.
>PyXL is a custom hardware processor that executes Python directly — no interpreter, no JIT, and no tricks. It takes regular Python code and runs it in silicon.
So, no using C libraries. That takes out a huge chunk of pip packages...
You're absolutely right — today, PyXL only supports pure Python execution, so C extensions aren’t directly usable.
That said, in future designs, PyXL could work in tandem with a traditional CPU core (like ARM or RISC-V), where C libraries execute on the CPU side and interact with PyXL for control flow and Python-level logic.
There’s also a longer-term possibility of compiling C directly to PyXL’s instruction set by building an LLVM backend — allowing even tighter integration without a second CPU.
Right now the focus is on making native Python execution viable and efficient for real-time and embedded systems, but I definitely see broader hybrid models ahead.
Would this be able to handle an exec() or eval() call? Is there a Python bytecode compiler available as Python bytecode to include on this processor?
Yeah this is surely a subset of Python.
It would be interesting to see something like this that runs WASM as a universal bytecode.
I'm sure it's been done. I doubt it really is any better though because you can do a lot of optimisations in software that you can't do in hardware.
I can totally see a future where you can select “accelerated python” as an option for your AWS lambda code.
When I first started PyXL, this kind of vision was exactly on my mind.
Maybe not AWS Lambda specifically, but definitely server-side acceleration — especially for machine learning feature generation, backend control logic, and anywhere pure Python becomes a bottleneck.
It could definitely get there — but it would require building a full-scale deployment model and much broader library and dynamic feature support.
That said, the underlying potential is absolutely there.
This sounds brilliant.
What's missing for you to create a demo for VCs or the relevant companies, proving the potential of this as a competitive server-class core?
Good question!
PyXL today is aimed more at embedded and real-time systems.
For server-class use, I'd need to mature heap management, add basic concurrency, a simple network stack, and gather real-world benchmarks (like requests/sec).
That said, I wouldn’t try to fully replicate CPython for servers — that's a very competitive space with a huge surface area.
I'd rather focus on specific use cases where deterministic, low-latency Python execution could offer a real advantage — like real-time data preprocessing or lightweight event-driven backends.
When I originally started this project, I was actually thinking about machine learning feature generation workloads — pure Python code (branches, loops, dynamic types) without heavy SIMD needs. PyXL is very well suited for that kind of structured, control-flow-heavy workload.
If I wanted to pitch PyXL to VCs, I wouldn’t aim for general-purpose servers right away. I'd first find a specific, focused use case where PyXL's strengths matter, and iterate on that to prove value before expanding more broadly.
I need to bit bang the RHS2116 at 25MHz: https://intantech.com/files/Intan_RHS2116_datasheet.pdf
Right now I'm doing this with a DSL, with an FPGA talking to a computer.
Does your python implementation let you run at speeds like that?
If yes, is there any overhead left for dsp - preferably fp based?
I am a pretty smart person. But once in a while I see something like this which reminds me there's always someone far smarter.
Absolutely incredible.
Is this running on an FPGA or were you able to fab a custom chip?
Just running on FPGA at the moment.
This is still an early-stage project — it's not completed yet, and fabricating a custom chip would involve huge costs.
I'm a solo developer who worked on this in my spare time, so an FPGA was the most practical way to prove the core concepts and validate the architecture.
Longer term, I definitely see ASIC fabrication as the way to unlock PyXL’s full potential — but only once the use case is clear and the design is a little more mature.
Oh, my comment wasn't meant as a criticism, just curiosity, because I would have been extremely surprised to see such a project being fabricated.
I find the idea of a processor designed for a specific very high level language quite interesting. What made you choose Python, and do you think it's the "correct" language for such a project? It sure seems convenient as a language, but I wouldn't have thought it best suited for the task due to its very dynamic nature. Perhaps something like Nim, which is similar but a little less dynamic, would be a better choice?
Could be a candidate for Tiny Tapeout in the future.
https://tinytapeout.com
There are several free ASIC shuttle runs available for hobbyists, IIRC.
This is kind of cool, basically a Python Machine. :)
I see what you did there! There's a LISP Machine with its guts on display at the MIT Museum. I recall we had one in the graduate student comp sci lab at University of Delaware (I was a tolerated undergrad). By then LISP was faster on a Sun workstation, but someone had taught it to play Tetris.
How are you simulating the designs for the FPGA? Are you paying for ModelSim?
How does garbage collection work here? Is it just a set of PySM code?
GC is still a WIP, but the key idea is that the system won't stall — garbage collection happens asynchronously, in the background, without interrupting PyXL execution.
Sounds similar to something one of my classmates worked on at uni https://www.bristol.ac.uk/research/groups/trustworthy-system...
Amazing work! Is the primary goal here to allow more production use of python in an embedded context, rather than just prototyping?
Thank you! And yes, exactly.
Congratulations!
This is so cool, I have dreamt about doing this but wouldn't know where to start. Do you have a plan for releasing it? What is your background? Was there anything that was way more difficult than you thought it would be? Or anything that was easier than you expected?
Thanks so much — really appreciate it!
Right now, the plan is to present it at PyCon first (next month) and then publish more about the internals afterward. Long-term, I'm keeping an open mind, not sure yet.
My background is in high-frequency trading (HFT), high-performance computing (HPC), systems programming, and networking. I don't come from a HW background — or at least, I didn't when I started — but coming from the software side gave me a different perspective on how dynamic languages could be made much more efficient at the hardware level.
Difficult - adapting the Python execution model to my needs in a way that keeps it self-coherent, if that makes sense. This is still fluid and not finalized...
Easy - not sure I'd categorize it as easy, but more surprising: the current implementation is rather simple and elegant (at least I think so :-) ), with no special advanced CPU design techniques yet (branch prediction, superscalar execution, etc.). Even so, I'm getting a huge improvement over the CPython and MicroPython VMs in the known Python bottlenecks (branching, function calls, etc.).
> Difficult - adapting the Python execution model to my needs in a way that keeps it self-coherent, if that makes sense. This is still fluid and not finalized...
Alright well those dots are begging me to ask what they mean, or at least one specific story for the nerds :-)
> Long-term, I'm keeping an open mind, not sure yet.
Well, please consider open source, even if you charge for access to your open source code. And even if you don't go open source, at least make it cheap enough that a solo developer could afford to build on it without thinking.
Have you tested it on any faster FPGAs? I think Azure has instances with xilinx/AMD accelerators paired.
> Standard_NP10s instance, 1x AMD Alveo U250 FPGA (64GB)
Would be curious to see how this benchmarks on a faster FPGA, since I imagine clock frequency is the latency dictator - while memory and tile count can determine how many instances can run in parallel.
Not yet — I'm currently testing on a Zynq-7000 platform (embedded-class FPGA), mainly because it has an ARM CPU tightly integrated (and it's rather cheap). I use the ARM side to handle IO and orchestration, which let me focus the FPGA fabric purely on the Python execution core, without having to build all the peripherals from scratch at this stage.
To run PyXL on a server-class FPGA (like Azure instances), some adaptations would be needed — the system would need to repurpose the host CPU to act as the orchestrator, handling memory, IO, etc.
The question is: what's the actual use case of running on a server? Besides testing max frequency -- for which I could just run Vivado on a different target (would need license for it though)
For now, I'm focusing on validating the core architecture, not just chasing raw clock speeds.
How big a deal would it be to include the bytecode->PySM translation into the ISA? It seems like it would be even cooler if the CPU actually ran python bytecode itself.
That's a great question! I actually thought a lot about that early on.
In theory, you could build a CPU that directly interprets Python bytecode — but Python bytecode is quite high-level and irregular compared to typical CPU instructions. It would add a lot of complexity and make pipelining much harder, which would hurt performance, especially for real-time or embedded use.
By compiling the Python bytecode ahead of time into a simpler, stack-based ISA (what I call PySM), the CPU can stay clean, highly pipelined, and efficient. It also opens the door in the future to potentially supporting other languages that could target the same ISA!
This is a one-person project? I'm impressed!
Thanks so much — really appreciate it! Yes, it's been a one-person project so far — just a lot of spare time, persistence, and iteration.
fantastic project. Do you envision this as living on FPGA's forever, or getting into silicon directly? Maybe an extension of RISC-V?
Oh boy, I definitely considered that — turning PyXL into a RISC-V extension was one of my earliest ideas.
It could probably be adapted into one.
But I ultimately decided to build it as its own clean design because I wanted the flexibility to rethink the entire execution model for Python — not just adapt an existing register-based architecture.
FPGA is for prototyping, although this could probably also be used as a soft core. But looking forward, ASIC is definitely the way to go.
So basically you took the idea of Jazelle extensions that can run Java bytecode natively, but for python?
This is amazing, great work!
Thank you very much. I learned of Jazelle after I started working on this, and that's a good thing: Jazelle didn't become too popular AFAIK, so knowing about it earlier might have made me quit. Glad I didn't though :)
The significant difference between Jazelle and your project is that Jazelle sits on top of a CPU that can already run a Java interpreter without the instruction-set extensions; that instruction set didn't implement all of Java (it still required a runtime to implement the missing opcodes on ARM), and Java runtimes quickly became better optimized than doing the same thing with the instruction set.
I think building a CPU that can only do this is a really novel idea and am really interested in seeing when you eventually disclose more implementation details. My only complaint is that it isn't Lua :P
What's the logic behind going for stack based?
Python’s execution model is already very stack-oriented — CPython bytecode operates by pushing and popping values almost constantly. Building PyXL as a stack machine made it much more natural to map Python semantics directly onto hardware, without forcing an unnatural register-based structure on it. It also avoids a lot of register allocation overhead (renaming and such).
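You can see that stack orientation directly in a disassembly (a small sketch; exact opcodes vary by CPython version, this is roughly 3.11):

    import dis

    def area(w, h):
        return w * h

    dis.dis(area)
    # Typical output, abridged:
    #   LOAD_FAST     w       # push w onto the value stack
    #   LOAD_FAST     h       # push h
    #   BINARY_OP     5 (*)   # pop both, push w * h
    #   RETURN_VALUE          # pop and return the top of the stack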
Name's a bit confusing when XLWings exists
> Name's a bit confusing when XLWings exists
How? XLWings is not a similar name to pyxl. However, even so, the name is... Heavily overloaded:
https://pyxl.com/ (some kind of strategy/CRM/AI thing)
https://pyxl.ai/ (AI website builder)
https://www.pyxl.pro/ (AI image generator)
https://github.com/dropbox/pyxl (Inline HTML extension for Python)
https://openpyxl.readthedocs.io/en/stable/ (A Python library to read/write Excel files)
https://www.pyxll.com/ (Excel Add-in to support add-ins written in Python)
> has XL
> has to do with Python
Indeed, the namespace is rather crowded.
Nice, next step could be rolling out that bytecode compiler in Python, so it’s self-contained. And a port to some LLM-on-silicon, so we could have it executing Python as the inference goes :-P
This is amazing! Is the “microcode” compiled to final native on the host or the coprocessor?
I’m guessing due to the lack of JIT, it’s executed on the host?
The microcode, or the ISA of the system, actually runs on the co-processor (the PyXL custom CPU).
If you're referring to the ARM part as the host (are you?), it's just orchestrating the whole thing; it doesn't run the actual Python program.
Looks impressive. How does this compare to PyPy?
PyPy is a JIT compiler — it runs on a standard CPU and accelerates "hot" parts of a program after runtime analysis.
This is a great approach for many applications, but it doesn’t fit all use cases.
PyXL is a hardware solution — a custom processor designed specifically to run Python programs directly.
It's currently focused on embedded and real-time environments where JIT compilation isn't a viable option due to memory constraints, strict timing requirements, and the need for deterministic behavior.
That's an interesting project! I have some follow-up questions:
> No VM, No C, No JIT. Just PyXL.
Is the main goal to achieve C-like performance with the ease of writing Python? Do you have a performance comparison against C? Is the main challenge memory management?
> PyXL runs on a Zynq-7000 FPGA (Arty-Z7-20 dev board). The PyXL core runs at 100MHz. The ARM CPU on the board handles setup and memory, but the Python code itself is executed entirely in hardware. The toolchain is written in Python and runs on a standard development machine using unmodified CPython.
> PyXL skips all of that. The Python bytecode is executed directly in hardware, and GPIO access is physically wired to the processor — no interpreter, no function call, just native hardware execution.
Did you write some sort of emulation to enable testing it without the physical Arty board?
Goal: Yes — the main goal is to bring C-like or close-to-C performance to Python code, without sacrificing the ease of writing Python. However, due to the nature of Python itself, I'm not sure how close I can get to native C performance, especially competing with systems (both SW and HW) that were revised and refined for decades.
Performance comparison against C: I don't have a formal benchmark directly against C yet. The early GPIO benchmark (480ns toggle) is competitive with hand-written C on ARM microcontrollers — even when running at a lower clock speed. But a full systematic comparison (across different workloads) would definitely be interesting for the future.
Main challenge: Yes — memory management is one of the biggest challenges. Dynamic memory allocation and garbage collection are tricky to manage efficiently without breaking real-time guarantees. I have a roadmap for it, but would like to stick to a real use case before moving forward.
Software emulation: I am using Icarus (could use Verilator) for RTL simulation if that's what you meant. But hardware behavior (like GPIO timing) still needs to be tested on the real FPGA to capture true performance characteristics.
There are a lot of dimensions to what you could call performance. The FPGA here is only clocked at 100 MHz and there's no way you're going to get the same throughput with it as you would on a conventional processor, especially if you add a JIT to optimize things. What you do get here is very low latency.
This project takes bytecode and maps it to FPGA instructions. PyPy can't do that.
Wow. Congratz
Thank you!
Is the source code available?
The source isn’t public at this stage. I'm still deciding the best path forward after PyCon.
This type of project is why I love HN. This work is brilliant!
Almost every question I had, you already answered in the comments. The only one remaining at the moment: How long exactly have you been working on PyXL?
> the program is compiled to a CPython Bytecode and then compiled again to PyXL assembly. It is then linked together and a binary is generated.
Why are we not doing this for standard Python? I think LLVM is for just that, no?
How long did you work on this?
This is awesome
What's your development background that prepared you to take on a project like this?
Clearly you know a lot about both low level Python internals and a fair amount about hardware design to pull this off.
I'm a software engineer by background, mostly in high-frequency trading (HFT), HPC, systems programming, and networking — so a lot of focus on efficiency and low-level behavior. I had played a bit with FPGAs before, but nothing close to this scale — most of the hardware and Python internals work I had to figure out along the way.
Amazing. I'm sure many programmers would join and contribute to your great project, which could become as big as a Python-based operating system that, thanks to the simplicity of the code, would advance very quickly.
Thank you! Right now I'm focusing on keeping the core simple, efficient, and purpose-driven — mainly to run Python well on hardware for embedded and real-time use cases.
As for the future, I’m keeping an open mind. It would be exciting if it grew into something bigger, but my main focus for now is making sure the foundation is as solid and clean as possible.
Not to be confused with openpyxl, a library for working with Excel files.
That then makes me wonder if someone could implement Excel in hardware! (Or something like it)
I just had to give it a name, and didn't really check whether the name was already taken. Maybe I need to rename :)
There is a long history of CPUs tailored to specific languages:
- Lisp/lispm
- Ada/iAPX
- C/ARM
- Java/Jazelle
Most don't really take off or go in different directions as the language goes out of fashion.
Well, one could argue that modern CPUs are designed as C machines, even more so now that everyone is adding hardware memory tagging as a means to fix C memory corruption issues.
Only if you don't understand the history of C. B was a lowest-common-denominator grouping of assembler macros for a typical register machine; C just added a type system and a couple of extra bits of syntax. C isn't novel in the slightest: you're structuring and thinking about your code pretty similarly to a certain style of assembly programming on a register machine. And yes, that type of register machine is still the most popular way to design an architecture, because it has qualities that end up being fertile middle ground between electrical engineers and programmers.
Also, there are no languages that reflect what modern CPUs are like, because modern CPUs obfuscate and hide much of how they work. Not even assembly is that close to the metal anymore, and it even has undefined behavior these days. There was an attempt to make a more explicit version of the hardware with Itanium, and it was explicitly a failure for much the same reason that the iAPX 432 was a failure. So we kept the simpler scalar register machine around, because both compilers and programmers are mostly too stupid to work with that much complexity. C didn't do shit; human mental capacity just failed to evolve fast enough to keep up with our technology. Things like Rust are more the descendant of C than the modern design of a CPU.
What do you think a language based on a modern CPU architecture would look like? The big deal is representing the OoO and speculative execution, right?
Text files seem a bit too sequential in structure, maybe we can figure out a way to represent the dependency graphs directly.
I envision an inflected grammar. That sounds crazy, I know, but x64 is an inflected language already. The pointer arithmetic you can attach to a register isn't an expression or a distinct group of words; it's a suffix. Part of the word, indistinguishable from it. Someone once did a great job of explaining to me how that mapped to microcode in a shockingly static way, and it blew my mind. I see affixes for controlling the branch predictor. Operations should also be inflected in a contextual way, making their relationship to other operations explicit, giving you control over how things are pipelined. Maybe take some inspiration from Afro-Asiatic languages and use a kind of consonantal root system.
The end result would look nothing like any other programming language and would die in obscurity, to be honest. But holy shit it would be really fucking cool.
I certainly understand the design of the language used to expose a PDP-11 in a portable way.
By the way, my introduction to C was via RatC, with the complete listing on A Book on C, from 1988, bought in 1990.
Intel failures tend to be more political than technical, as root cause.
Also a fairly interesting Haskell effort:
https://mn416.github.io/reduceron-project/
These range from a few instructions to accelerate certain operations, to marking memory for the garbage collector, to much deeper efforts.
Also: UCSD p-System, Symbolics Lisp-on-custom hardware, ...
Historically their performance is underwhelming. Sometimes competitive on the first iteration, sometimes just mid. But generally they can't iterate quickly (insufficient resources, insufficient product demand) so they are quickly eclipsed by pure software implementations atop COTS hardware.
This particular Valley of Disappointment is so routine as to make "let's implement this in hardware!" an evergreen tarpit idea. There are a few stunning exceptions like GPU offload—but they are unicorns.
They were a tar pit in the 1980s and 1990s, when Moore's law meant a 16x increase in processor speed every 6 years.
Right now the only reason why we don't have new generations of these eating the lunch of general purpose CPUs is that you'd need to organize a few billion transistors into something useful. That's something a bit beyond what just about everyone (including Intel now apparently) can manage.
Sure. The need to organize millions (now 10s to 100s of billions) of transistors to do something useful, the economics and will to bring those to market, the need to coordinate functions baked into hardware with the faster moving and vastly more-plastic software world—oh, and Amdahl's Law.
They are the tar pit. Transistor counts skyrocket, but the principles and obstacles have not changed one iota in over 50 years.
Amazing. I wonder if silicon can feel pain.
For a minute there I was imagining Python as the actual instruction set and my brain was segfaulting.
Very cool project still