<whitequark[cis]>
the edge detector function should be a part of cxxrtl proper, eventually
<vancz>
<3 <@whitequark[cis]> i think this is quite cool
<vancz>
luckily it didnt need very much glue; most of the work turned out to be understanding enough about how qemu's APIs work, and everything needed was already there
<vancz>
now its very dumb, no multithreading, etc
<vancz>
got some sleep..
<vancz>
despite my better judgement, I YOLOd and won some university grant money 9 months ago, and I wanted to do this anyway...now I need to try milking it into something presentable so i dont have to pay the grant back and the deadline is coming....at least i got _SOMETHING_ to show now :V
RobTaylor[m] has joined #prjunnamed
<RobTaylor[m]>
<vancz> "I got a really shitty PoC..." <- nice!
<vancz>
i barely know the subject area. it seems like xilinx (and others?) have tooling / methods for integrating peripherals into a simulation, but i didnt find anything where someone did this for the cpu side
<vancz>
im still scratching my head if anyone would even care about this but the point is kind of that qemu is a lot of existing tooling and hopefully you can eventually reuse peripherals or something
<vancz>
that I guess isnt really strictly related but it looks like they can get a lot done by externalizing communication (which i mean, makes sense)
<vancz>
so im not really sure how much of a point there is to what im doing but ive been going by vibe "it would be cool if you could verilog a cpu in qemu"
<vancz>
im partly trying to pump up some justification now
<vancz>
and i guess i really should try to boot up a SERV and Zephyr OS or something
<whitequark[cis]>
i would recommend not using SERV
<whitequark[cis]>
that would be much slower than a non-bit-serial CPU
<vancz>
ok the main rationale for anything right now is what is the easiest for me to implement
<vancz>
i am open to any suggestions regarding anything
* vancz
has a weird time trying to justify "research" that is obvious and really just needed some engineering
<vancz>
which is to say i should try to turn this into babbys first primary authored paper
galibert[m] has quit [Quit: Idle timeout reached: 172800s]
<vancz>
this is what the qemu side runner looks like. its simple and bad https://bpa.st/4TKA
<vancz>
basically run in a loop and call memory operations
<vancz>
whitequark[cis]: I imagine cxxrtl doesnt do parallelism?
<whitequark[cis]>
no, not at the moment
<whitequark[cis]>
parallelism of RTL simulations is pretty difficult to do in a way where it actually improves performance
<vancz>
im not entirely surprised
<vancz>
AFAIU verilator does?
<whitequark[cis]>
i believe so. it used to be not very good and i believe (i have never measured it) these days it's reasonably good, which is an impressive achievement
galibert[m] has joined #prjunnamed
<galibert[m]>
Isn’t verilator’s performance with parallelism lower than cxxrtl’s without?
<vancz>
cool cool
<vancz>
galibert[m]: well going by what whitequark[cis] said that wouldnt be entirely surprising either
<whitequark[cis]>
galibert: i've never seen data that would suggest that
<vancz>
hopefully galibert[m] has some :D
<galibert[m]>
Should measure it someday then
<whitequark[cis]>
cxxrtl's single threaded performance is about on par with verilator
<whitequark[cis]>
it used to be slightly more but i think verilator improved since
<whitequark[cis]>
(obviously, broad reaching statements like this omit a lot of nuance relating to individual netlists. this is just a general idea of what you should expect)
<whitequark[cis]>
provided that verilator's parallel performance beats verilator's single-threaded performance it would then beat cxxrtl
<whitequark[cis]>
personally, i would parallelize cxxrtl simulations by splitting your design along AXI bus lines (you're using AXI right?) and using message passing
<whitequark[cis]>
since an AXI bus is five unidirectional channels it's really easy to turn bus transactions into messages, and then your individual simulations can use FIFOs. you can choose a tradeoff between cycle accuracy and the level of parallelism (synchronizing on each cycle will make it a bit slower)
<galibert[m]>
I should try running a 68k in both someday
<whitequark[cis]>
yeah
<whitequark[cis]>
speed was not cxxrtl's development goal, it was the ease of getting visibility into every signal
<whitequark[cis]>
i just happened to be able to make it quite fast in the process
<vancz>
talking to a friend and it just hit me that bruh i wonder if the xilinx people basically just do the same thing in terms of implementation. just start another thread and talk to the memory. lmao shit. only difference is i bother yanking out the cpu
<vancz>
i made a boneless (whatever that actually means) qemu
<vancz>
doesnt matter if you stick a peripheral or a cpu in there
<vancz>
its just a communication bus
<whitequark[cis]>
> implying you eat chicken for the bones
<whitequark[cis]>
s/>/\>/
<vancz>
oh its an eating thing? (duh I guess)
<vancz>
somehow I always thought of the more brütal tearing the mechanically supporting skeleton out of the thing
<vancz>
ok this is looking like a quality series hm
<vancz>
Seems like serv doesnt support IRQs other than the timer interrupt? I guess that means if I want to say use a UART I'd do so in polling mode?
<whitequark[cis]>
serv is dramatically size optimized
<vancz>
sure, makes sense
<vancz>
i guess thats actually simpler than what i had in mind because it means i dont necessarily need IRQs
DemiMarieObenou4 has joined #prjunnamed
<DemiMarieObenou4>
<whitequark[cis]> "since an AXI bus is five..." <- I wonder if you could use deoptimization for this. Generate fast code under the assumption that nobody is looking at the intermediate signals, and then when someone *does* look at them, fall back to slower code that exposes everything.
<whitequark[cis]>
this is literally what cxxrtl does
<DemiMarieObenou4>
nice :)
<DemiMarieObenou4>
It’s a JIT?
<whitequark[cis]>
no. it generates two versions of the eval function. one computes next state and outputs. another computes every signal with a public name
<DemiMarieObenou4>
Ah, okay.
<whitequark[cis]>
this way you can get really fast state advancement, and then re-simulate from a record/replay trace when you need to debug
<whitequark[cis]>
you can have a full view with only about 10% runtime overhead
<DemiMarieObenou4>
Oh nice
<DemiMarieObenou4>
Have you heard of Truffle?
<whitequark[cis]>
yes! it's very cool
<DemiMarieObenou4>
I also wonder if there are use-cases for a non-optimizing P&R tool
<whitequark[cis]>
i'm not sure that's possible (to make a non-optimizing router)
<whitequark[cis]>
like, how would you find a path? randomly toggle bits until something connects?
<DemiMarieObenou4>
AKA “my employer wants me to have fast turnaround times so the money they pay me goes further, so they give me an FPGA that is 10x bigger than the one that the target device will have, so that P&R is easier and I get results sooner”
<whitequark[cis]>
ok, i think that doesn't work either. usually your limiting factor is Fmax. using a bigger device will typically not improve your Fmax
<whitequark[cis]>
if you're really routing congested, for example, this usually happens in a small part of the design
<DemiMarieObenou4>
whitequark[cis]: what if you don’t need it to run at full speed?
<whitequark[cis]>
that just doesn't happen a lot
<DemiMarieObenou4>
whitequark[cis]: I see. I thought the reason that P&R was so hard is that it was having to solve NP-hard problems, and I know that those can often get easier (in practice) when the number of constraints is sufficiently small. Therefore, I assumed that making the problem under-constrained (by using an oversize device) would make the job of P&R easier.
<whitequark[cis]>
using a bigger device tends to make P&R work harder (because it has bigger equations to consider)
widlarizerEmilJT has joined #prjunnamed
<widlarizerEmilJT>
My intuition is that optimizing less/worse just implies worse congestion in the process too. Physical design automation really does differ from compiler-like tradeoffs as I understand it
<whitequark[cis]>
and the constraints are local, not global, so it won't necessarily help with the result
<whitequark[cis]>
consider: when doing P&R, you are repeatedly increasing slack until it becomes positive for every path
<whitequark[cis]>
i.e. shortening every path until it is smaller than the period or datapath delay constraint
<whitequark[cis]>
having a bigger device doesn't really help you shorten the path, in most cases (if you are at very high utilization, >80%, it will, but below that it generally won't)
<whitequark[cis]>
I think in practice people achieve faster P&R runs using incremental methods (where the P&R tool uses the previous run as a template for the current one)
<DemiMarieObenou4>
whitequark[cis]: Is this because hard real-time requirements are usually the reason one is using an FPGA in the first place?
<whitequark[cis]>
i don't think it generally has anything to do with hard real-time
<whitequark[cis]>
i mean, doing something exactly once a second is hard real-time but hardly difficult
<whitequark[cis]>
you usually use an FPGA to process a lot of data. the most common use cases are telecom, emulation, and DSP
<whitequark[cis]>
for telecom, your data rate is usually fixed by the design, and your design absolutely must meet it => hard Fmax bound
<whitequark[cis]>
for emulation, you want to run your emulated SoC as fast as possible. you basically never achieve anything close to the production ASIC speed, you might run at 50 MHz instead of 500 or 1500 MHz. but for the same reason you really really want it to be as fast as possible
<whitequark[cis]>
for DSP, you usually go for an FPGA because a CPU+FPU cannot cope with the amount of data => hard Fmax bound again (but you can tweak the algorithms)
<DemiMarieObenou4>
whitequark[cis]: Is it reasonable to assume that emulating really really old ASICs (which had slow clocks due to old process nodes) is an exception?
<DemiMarieObenou4>
I was thinking of cases where one can run with a mock data source that produces (say) 1/10th the data
<DemiMarieObenou4>
It’s meant for “I need to do some iteration and the software simulator is too slow to run, but the full P&R takes too long to build”. Is this not a situation that happens in practice?
<DemiMarieObenou4>
Or is incremental P&R sufficient?
<whitequark[cis]>
emulating really old ASICs is not going to require a lot of P&R time in the first place
<whitequark[cis]>
like, people want faster P&R because their P&R run might take 8 to 24 hours
<DemiMarieObenou4>
Could P&R itself be hardware accelerated somehow?
<DemiMarieObenou4>
or at least be SIMD accelerated?
<whitequark[cis]>
nobody has produced a useful hardware accelerator for it. it's graph traversal
<whitequark[cis]>
the CPU is already about as fast at graph traversal as it gets
<whitequark[cis]>
nextpnr built for the x32 ABI is actually faster than nextpnr built for x86_64 ABI because it's cache and/or memory bandwidth limited and so the smaller pointers really help
<DemiMarieObenou4>
whitequark[cis]: Would something like Intel’s never-commercialized PIUMA graph traversal accelerator (https://arxiv.org/abs/2010.06277) help if it existed?
<DemiMarieObenou4>
It’s a bunch of CPUs optimized for workloads where the cache and branch predictors are of little help, and instead relies on fine-grained multi-threading to hide latency.
<DemiMarieObenou4>
s/cache/caches/
<DemiMarieObenou4>
whitequark[cis]: That is interesting!
<whitequark[cis]>
DemiMarieObenou4: no idea
<whitequark[cis]>
DemiMarieObenou4: yowasp-nextpnr is about as fast as native nextpnr despite worse codegen + bounds checking everywhere
<DemiMarieObenou4>
Is there a book with the answers to all of the silly (to you) questions I keep asking?
<whitequark[cis]>
I don't know. I don't mind answering though
<DemiMarieObenou4>
That paper stated that GPUs can beat CPUs at graph analytics so long as the data fits in VRAM, which I imagine it usually does.
<DemiMarieObenou4>
* usually does in P&R workloads.
<DemiMarieObenou4>
* usually does in P&R workloads if you have a server-class GPU.
<DemiMarieObenou4>
Can one optimize simulators by mapping e.g. an adder to the native CPU addition instruction?
<whitequark[cis]>
cxxrtl does that
<DemiMarieObenou4>
<whitequark[cis]> "no idea" <- nice