<whitequark[cis]>
the edge detector function should be a part of cxxrtl proper, eventually
<vancz>
<3 <@whitequark[cis]> i think this is quite cool
<vancz>
luckily it didnt need very much glue; most of the work turned out to be understanding enough about how qemu's APIs work, and everything needed was already there
<vancz>
now its very dumb, no multithreading, etc
<vancz>
got some sleep..
<vancz>
despite my better judgement, I YOLOd and won some university grant money 9 months ago, and I wanted to do this anyway...now I need to try milking it into something presentable so i dont have to pay the grant back and the deadline is coming....at least i got _SOMETHING_ to show now :V
RobTaylor[m] has joined #prjunnamed
<RobTaylor[m]>
<vancz> "I got a really shitty PoC..." <- nice!
<vancz>
i barely know the subject area. it seems like xilinx (and others?) have tooling / methods for integrating peripherals into a simulation, but i didnt find anything where someone did this for the cpu side
<vancz>
im still scratching my head if anyone would even care about this but the point is kind of that qemu is a lot of existing tooling and hopefully you can eventually reuse peripherals or something
<vancz>
that I guess isnt really strictly related but it looks like they can get a lot done by externalizing communication (which i mean, makes sense)
<vancz>
so im not really sure how much of a point there is to what im doing but ive been going by vibe "it would be cool if you could verilog a cpu in qemu"
<vancz>
im partly trying to pump up some justification now
<vancz>
and i guess i really should try to boot up a SERV and Zephyr OS or something
<whitequark[cis]>
i would recommend not using SERV
<whitequark[cis]>
that would be much slower than a non-bit-serial CPU
<vancz>
ok the main rationale for anything right now is what is the easiest for me to implement
<vancz>
i am open to any suggestions regarding anything
* vancz
has a weird time trying to justify "research" that is obvious and really just needed some engineering
<vancz>
which is to say i should try to turn this into babbys first primary authored paper
galibert[m] has quit [Quit: Idle timeout reached: 172800s]
<vancz>
this is what the qemu side runner looks like. its simple and bad https://bpa.st/4TKA
<vancz>
basically run in a loop and call memory operations
<vancz>
whitequark[cis]: I imagine cxxrtl doesnt do parallelism?
<whitequark[cis]>
no, not at the moment
<whitequark[cis]>
parallelism of RTL simulations is pretty difficult to do in a way where it actually improves performance
<vancz>
im not entirely surprised
<vancz>
AFAIU verilator does?
<whitequark[cis]>
i believe so. it used to be not very good and i believe (i have never measured it) these days it's reasonably good, which is an impressive achievement
galibert[m] has joined #prjunnamed
<galibert[m]>
Isn’t verilator’s performance with parallelism lower than cxxrtl’s without?
<vancz>
cool cool
<vancz>
galibert[m]: well going by what whitequark[cis] said that wouldnt be entirely surprising either
<whitequark[cis]>
galibert: i've never seen data that would suggest that
<vancz>
hopefully galibert[m] has some :D
<galibert[m]>
Should measure it someday then
<whitequark[cis]>
cxxrtl's single threaded performance is about on par with verilator
<whitequark[cis]>
it used to be slightly more but i think verilator improved since
<whitequark[cis]>
(obviously, broad reaching statements like this omit a lot of nuance relating to individual netlists. this is just a general idea of what you should expect)
<whitequark[cis]>
provided that verilator's parallel performance beats verilator's single-threaded performance it would then beat cxxrtl
<whitequark[cis]>
personally, i would parallelize cxxrtl simulations by splitting your design along AXI bus lines (you're using AXI right?) and using message passing
<whitequark[cis]>
since an AXI bus is five unidirectional channels it's really easy to turn bus transactions into messages, and then your individual simulations can use FIFOs. you can choose a tradeoff between cycle accuracy and the level of parallelism (synchronizing on each cycle will make it a bit slower)
<galibert[m]>
I should try running a 68k in both someday
<whitequark[cis]>
yeah
<whitequark[cis]>
speed was not cxxrtl's development goal, it was the ease of getting visibility into every signal
<whitequark[cis]>
i just happened to be able to make it quite fast in the process
<vancz>
talking to a friend and it just hit me that bruh i wonder if the xilinx people basically just do the same thing in terms of implementation. just start another thread and talk to the memory. lmao shit. only difference is i bother yanking out the cpu
<vancz>
i made a boneless (whatever that actually means) qemu
<vancz>
doesnt matter if you stick a peripheral or a cpu in there
<vancz>
its just a communication bus
<whitequark[cis]>
> implying you eat chicken for the bones
<whitequark[cis]>
s/>/\>/
<vancz>
oh its an eating thing? (duh I guess)
<vancz>
somehow I always thought of the more brütal tearing the mechanically supporting skeleton out of the thing
<vancz>
ok this is looking like a quality series hm
<vancz>
Seems like serv doesnt support IRQs other than the timer interrupt? I guess that means if I want to say use a UART I'd do so in polling mode?
<whitequark[cis]>
serv is dramatically size optimized
<vancz>
sure, makes sense
<vancz>
i guess thats actually simpler than what i had in mind because it means i dont necessarily need IRQs
DemiMarieObenou4 has joined #prjunnamed
<DemiMarieObenou4>
<whitequark[cis]> "since an AXI bus is five..." <- I wonder if you could use deoptimization for this. Generate fast code under the assumption that nobody is looking at the intermediate signals, and then when someone *does* look at them, fall back to slower code that exposes everything.
<whitequark[cis]>
this is literally what cxxrtl does
<DemiMarieObenou4>
nice :)
<DemiMarieObenou4>
It’s a JIT?
<whitequark[cis]>
no. it generates two versions of the eval function. one computes next state and outputs. another computes every signal with a public name
<DemiMarieObenou4>
Ah, okay.
<whitequark[cis]>
this way you can get really fast state advancement, and then re-simulate from a record/replay trace when you need to debug
<whitequark[cis]>
you can have a full view with only about 10% runtime overhead
<DemiMarieObenou4>
Oh nice
<DemiMarieObenou4>
Have you heard of Truffle?
<whitequark[cis]>
yes! it's very cool
<DemiMarieObenou4>
I also wonder if there are use-cases for a non-optimizing P&R tool
<whitequark[cis]>
i'm not sure that's possible (to make a non-optimizing router)
<whitequark[cis]>
like, how would you find a path? randomly toggle bits until something connects?
<DemiMarieObenou4>
AKA “my employer wants me to have fast turnaround times so the money they pay me goes further, so they give me an FPGA that is 10x bigger than the one that the target device will have, so that P&R is easier and I get results sooner”
<whitequark[cis]>
ok, i think that doesn't work either. usually your limiting factor is Fmax. using a bigger device will typically not improve your Fmax
<whitequark[cis]>
if you're really routing congested, for example, this usually happens in a small part of the design
<DemiMarieObenou4>
whitequark[cis]: what if you don’t need it to run at full speed?
<whitequark[cis]>
that just doesn't happen a lot
<DemiMarieObenou4>
whitequark[cis]: I see. I thought the reason that P&R was so hard is that it was having to solve NP-hard problems, and I know that those can often get easier (in practice) when the number of constraints is sufficiently small. Therefore, I assumed that making the problem under-constrained (by using an oversize device) would make the job of P&R easier.
<whitequark[cis]>
using a bigger device tends to make P&R work harder (because it has bigger equations to consider)
widlarizerEmilJT has joined #prjunnamed
<widlarizerEmilJT>
My intuition is that optimizing less/worse just implies worse congestion in the process too. Physical design automation really does differ from compiler-like tradeoffs as I understand it
<whitequark[cis]>
and the constraints are local, not global, so it won't necessarily help with the result
<whitequark[cis]>
consider: when doing P&R, you are repeatedly increasing slack until it becomes positive for every path
<whitequark[cis]>
i.e. shortening every path until it is smaller than the period or datapath delay constraint
<whitequark[cis]>
having a bigger device doesn't really help you shorten the path, in most cases (if you are at very high utilization, >80%, it will, but below that it generally won't)
<whitequark[cis]>
I think in practice people achieve faster P&R runs using incremental methods (where the P&R tool uses the previous run as a template for the current one)
<DemiMarieObenou4>
whitequark[cis]: Is this because hard real-time requirements are usually the reason one is using an FPGA in the first place?
<whitequark[cis]>
i don't think it generally has anything to do with hard real-time
<whitequark[cis]>
i mean, doing something exactly once a second is hard real-time but hardly difficult
<whitequark[cis]>
you usually use an FPGA to process a lot of data. the most common use cases are telecom, emulation, and DSP
<whitequark[cis]>
for telecom, your data rate is usually fixed by the design, and your design absolutely must meet it => hard Fmax bound
<whitequark[cis]>
for emulation, you want to run your emulated SoC as fast as possible. you basically never achieve anything close to the production ASIC speed, you might run at 50 MHz instead of 500 or 1500 MHz. but for the same reason you really really want it to be as fast as possible
<whitequark[cis]>
for DSP, you usually go for an FPGA because a CPU+FPU cannot cope with the amount of data => hard Fmax bound again (but you can tweak the algorithms)
<DemiMarieObenou4>
whitequark[cis]: Is it reasonable to assume that emulating really really old ASICs (which had slow clocks due to old process nodes) is an exception?
<DemiMarieObenou4>
I was thinking of cases where one can run with a mock data source that produces (say) 1/10th the data
<DemiMarieObenou4>
It’s meant for “I need to do some iteration and the software simulator is too slow to run, but the full P&R takes too long to build”. Is this not a situation that happens in practice?
<DemiMarieObenou4>
Or is incremental P&R sufficient?
<whitequark[cis]>
emulating really old ASICs is not going to require a lot of P&R time in the first place
<whitequark[cis]>
like, people want faster P&R because their P&R run might take 8 to 24 hours
<DemiMarieObenou4>
Could P&R itself be hardware accelerated somehow?
<DemiMarieObenou4>
or at least be SIMD accelerated?
<whitequark[cis]>
nobody has produced a useful hardware accelerator for it. it's graph traversal
<whitequark[cis]>
the CPU is already about as fast at graph traversal as it gets
<whitequark[cis]>
nextpnr built for the x32 ABI is actually faster than nextpnr built for x86_64 ABI because it's cache and/or memory bandwidth limited and so the smaller pointers really help
<DemiMarieObenou4>
whitequark[cis]: Would something like Intel’s never-commercialized PIUMA graph traversal accelerator (https://arxiv.org/abs/2010.06277) help if it existed?
<DemiMarieObenou4>
It’s a bunch of CPUs optimized for workloads where the cache and branch predictors are of little help, and instead relies on fine-grained multi-threading to hide latency.
<DemiMarieObenou4>
s/cache/caches/
<DemiMarieObenou4>
whitequark[cis]: That is interesting!
<whitequark[cis]>
DemiMarieObenou4: no idea
<whitequark[cis]>
DemiMarieObenou4: yowasp-nextpnr is about as fast as native nextpnr despite worse codegen + bounds checking everywhere
<DemiMarieObenou4>
Is there a book with the answers to all of the silly (to you) questions I keep asking?
<whitequark[cis]>
I don't know. I don't mind answering though
<DemiMarieObenou4>
That paper stated that GPUs can beat CPUs at graph analytics so long as the data fits in VRAM, which I imagine it usually does.
<DemiMarieObenou4>
* usually does in P&R workloads.
<DemiMarieObenou4>
* usually does in P&R workloads if you have a server-class GPU.
<DemiMarieObenou4>
Can one optimize simulators by mapping e.g. an adder to the native CPU addition instruction?
<whitequark[cis]>
cxxrtl does that
<DemiMarieObenou4>
<whitequark[cis]> "no idea" <- nice