<Ameisen>
heat: iirc, most chips prefer negative polarity - that is, they assume that a branch that decreases the current program counter is taken at first? (Except for chips that just choose randomly)
<heat>
not sure about backwards jumps, i guess it makes sense that they would, cuz of loops
<Ameisen>
the instruction count branch is only generated if it's needed, but it is often needed since iterative/pausing execution is one of the features.
<heat>
but for forwards branches they definitely assume you're not taking the branch
<Ameisen>
the main issue generally would be the myriad branches for potential exceptions
<Ameisen>
hard to eliminate the forward branches in regards to exceptions, at least
<Ameisen>
usually ends up getting written as: operation, check exception, goto no_exception, throw exception, no_exception: store result
<Ameisen>
at some point a forward jump is required
<heat>
well that's wrong
<nikolar>
i think they assume that backwards are taken and forwards are not taken
<nikolar>
at least when encountered for the first time
<heat>
unlikely stuff should go towards the end of the code
<heat>
dont jump over things in likely paths
<Ameisen>
then it'd need to be: operation, check exception, goto exception if exception, store result, goto no_exception, exception: throw exception, no_exception:
<Ameisen>
so a forward jump over unlikely code instead
<Ameisen>
I can certainly change it to generate that sort of sequence instead
<heat>
that still sucks
<heat>
don't jump over code
<Ameisen>
well, I have to jump at some point
<heat>
you really don't
<Ameisen>
how do I handle exceptions, then?
<heat>
i mean
<heat>
don't jump over code in the hotpath
<heat>
you only need those exception: thunks once every ~2GB on x86
<heat>
(because of jmp imm32)
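The layout heat is pushing for can be sketched in C++ (the names, the `unlikely` macro, and the handler are illustrative, not from the actual emulator): the hot path falls through with no taken jump, and the overflow case takes a rarely-taken forward branch to cold, shared code.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hint the compiler that the branch is almost never taken, so the
// cold path gets laid out away from the fall-through hot path.
#define unlikely(x) __builtin_expect(!!(x), 0)

// Out-of-line cold handler, shared by many emitted check sites.
[[noreturn]] static void overflow_thunk() {
    std::fputs("overflow exception\n", stderr);
    std::exit(1);
}

static int32_t emulated_add(int32_t a, int32_t b) {
    int32_t r;
    if (unlikely(__builtin_add_overflow(a, b, &r)))
        overflow_thunk();  // rarely-taken forward branch to cold code
    return r;              // hot path: straight-line fall-through, no jump
}
```

The point is that the common case never jumps over anything; only the exceptional case leaves the straight-line sequence.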
<Ameisen>
yes, those jumps for exceptions go to those thunks.
<Ameisen>
I put them every generated chunk, but meh
<heat>
that's just crap
<heat>
you're adding extra jmps and filling your icache with stuff that's almost never going to run
<heat>
mjg would say it is PESSIMAL but the guy's sleeping
<Ameisen>
written out, since this isn't what the patches actually look like in source :|
<zid>
why do you need the overflow case btw?
<Ameisen>
there are some instructions - especially load/stores - which are more complex though, and sometimes involve calls back into the interpreter. They also usually need to pass an operand to the thunk. Right now their order is opposite of that.
<zid>
flags act differently?
<Ameisen>
zid: because the MIPS32r6 spec mandates it.
<zid>
yes but *why*
<zid>
what do you have to implement there
<Ameisen>
because MIPS doesn't have flags.
<zid>
oh it causes an exception interrupt?
<Ameisen>
yes
<zid>
neat
<Ameisen>
it doesn't have an exception interrupt for divide-by-zero, interestingly.
<Ameisen>
it's just 'undefined'
<Ameisen>
so I end up having to turn x86 flags into exceptions, thankfully there are jumps specifically for those flags
<Ameisen>
there are just... a lot of jumps.
<Ameisen>
every `ADD` needs one simply because the flag must be checked, as an example
<Ameisen>
well, almost every `ADD`, some of the conditions generate different patches that cannot overflow.
<zid>
Yea sounds like it would benefit heavily from some optimization
<Ameisen>
yeah, I'm just unsure how. I need to handle the overflow exception, I'm just not sure what the best approach is. As said, certain instructions are more complex and have more jumps - load/stores can be... weird.
<Ameisen>
since I'm also checking for valid address ranges and such
<zid>
well, the 'best' way would be to prove it can be elided
<Ameisen>
patched jumps are probably the weirdest.
<zid>
otherwise you're just emulating it
<Ameisen>
presently, I only perform static analysis on the registers themselves - it doesn't try to introspect on the values.
<Ameisen>
I have an idea on how to do that that won't break things (I cannot do tracing, but I can generate short 'hot paths' that execute as a single unit instead)
<zid>
You need to write an optimizing compiler, effectively
<zid>
where the equivalent C source you're compiling is if(a + (long)b > UINT_MAX) except_overflow();
<zid>
so you can either prove a + b can't be big enough, or that except_overflow has no effects
<zid>
but that's obviously hard
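zid's "equivalent C" can be written without relying on signed-overflow UB (a sketch; MIPS32 ADD traps on signed 32-bit overflow, which is what the emitted check has to detect):

```cpp
#include <cstdint>

// Widen to 64 bits, where the sum cannot overflow, then compare
// against the representable 32-bit range.
static bool add_would_trap(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a + (int64_t)b;
    return wide > INT32_MAX || wide < INT32_MIN;
}
```

Proving this returns false for a given site (e.g. one operand is known to be `$0`) is exactly the elision zid is describing.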
<Ameisen>
right, that requires some level of introspection into what the values can potentially be at that point.
<Ameisen>
which is possible to a point in a tracing optimizer
<Ameisen>
I can know, at least, sometimes if they're zero (since $0 is always zero)
<Ameisen>
surprisingly, the compiler generates that more often than I'd expect... basically just making moves
<Ameisen>
not sure why.
<zid>
It may literally be aliased from 'mov'
<zid>
on mips 1 I did 'ori' for my mov
<Ameisen>
it's not - the table generator masks the instructions out if they're aliased.
<zid>
what?
<Ameisen>
mips32r6 has some instructions that only differ by a bit, and they're often defined as the same instruction. The table generator masks those out when generating lookups.
<Ameisen>
so they will get resolved as different instructions
<zid>
what's a table generator, what table generator, what is 'aliased', how is this a response to what I said?
<zid>
I literally don't understand any of it
<Ameisen>
I define the instructions by mask and masked bits. Something like, say, EHB and SLL which are the same instruction are distinguished by their mask (zero registers and a specific shift size in that case), so they show up to the system as distinct instructions.
<Ameisen>
the table generator just generates the lookup table from that
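A miniature of the mask/bits scheme being described (the SLL/EHB/NOP encodings are real MIPS32, but the table and linear lookup here are a toy stand-in for the generated table): fully-specified aliases are listed before the generic pattern so they win.

```cpp
#include <cstdint>

struct InsnDef { uint32_t mask, bits; const char *name; };

// SLL is opcode 0 / funct 0. The all-zero word is NOP (sll $0,$0,0),
// and EHB is the specific form sll $0,$0,3 (word 0x000000C0).
static const InsnDef defs[] = {
    { 0xFFFFFFFFu, 0x00000000u, "NOP" },  // fully-specified aliases first
    { 0xFFFFFFFFu, 0x000000C0u, "EHB" },
    { 0xFFE0003Fu, 0x00000000u, "SLL" },  // generic: opcode, rs, funct all 0
};

static const char *decode(uint32_t word) {
    for (const InsnDef &d : defs)
        if ((word & d.mask) == d.bits) return d.name;
    return "UNKNOWN";
}
```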
<zid>
I didn't ask you what you defined things as
<zid>
you said the compiler generated a lot of add n, r, 0
<zid>
I said yes, that's a common way to implement the instruction 'mov'
<zid>
I either use that or 'ori' for mov
<zid>
or addiu I guess, for mips
<Ameisen>
Well, the funny thing is that I see _both_
<Ameisen>
multiple instructions used like that
<Ameisen>
that's what's weird to me.
<zid>
might be different peepholes?
<zid>
idk what compiler it is
<Ameisen>
Clang, so LLVM backend.
<zid>
and smips has 'li' meaning addiu apparently
<Ameisen>
I even saw a few shift instructions with shifts of zero.
<zid>
padding? filling delay slots?
<zid>
you said no flags so probably not that
<Ameisen>
it's possible, though any instruction would do then. They were operating as moves still, not identity-writes.
<Ameisen>
There are internal flags that are 'defined' but not quite physical, like delay branches
<zid>
I can only assume they just get picked by different codepaths in their optimizer
<Ameisen>
they're not user-visible though
<zid>
unless something clever pops into my face
<Ameisen>
I do have the compiler for the toolchain set to avoid delay branches, though - they're slower than compact branches because there's more logic associated with them.
<Ameisen>
that's my guess, I just wasn't expecting it.
<Ameisen>
I was thinking "do I really need to write the optimal patches for things like shifting by zero? Why would they do that?"
<zid>
mark mips down as "not amenable to JIT" and run it in an interp :P
<Ameisen>
I mean, I do have an interpreter as well, they interplay (can and do switch between them)
<Ameisen>
even the current dynamic recompiler is vastly faster than it, though.
<Ameisen>
with it, performance test runs in 33.5 seconds right now (i have some test logic in, normally closer to 30). Fully interpreted, it takes... well, it's still running.
<Ameisen>
Natively, it takes around 4-5 seconds to run.
<Ameisen>
interpreted: 819s
<zid>
your interp is bad and it should feel bad
<zid>
you should be getting a few cycles per cycle, not whatever that is
<Ameisen>
though I'm not sure how you can possibly get a few cycles per cycle emulating MIPS in an interpreter. The overhead should at least be an order of magnitude worse.
<zid>
why would it be, it should be like, a mask, a jump, then two more masks, then a single instruction, in 99% of cases
<zid>
and x86 is magic and will run half of that before you even asked it to
<zid>
while it waits for the previous write to settle
<heat>
computed goto
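heat's suggestion, sketched as a toy bytecode loop using the GNU labels-as-values extension (works in GCC/Clang; the opcodes and accumulator machine are invented for illustration). Each handler ends in its own indirect jump, which predicts better than one central switch:

```cpp
#include <cstdint>

enum Op : uint8_t { OP_ADD, OP_SUB, OP_HALT };

static int64_t run(const uint8_t *code) {
    // One dispatch table entry per opcode; &&label is the GNU extension.
    static void *dispatch[] = { &&do_add, &&do_sub, &&do_halt };
    int64_t acc = 0;
    #define NEXT() goto *dispatch[*code++]
    NEXT();
do_add:  acc += *code++; NEXT();  // each handler has its own jump site
do_sub:  acc -= *code++; NEXT();
do_halt: return acc;
    #undef NEXT
}
```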
<geist>
getting near 10:1 is about as good as you
<geist>
you'll get with an interpreter, from what i understand
<geist>
an emulator i wrote years ago transcoded into a more expanded, kinda VLIW looking instruction set that ends up being basically a big switch statement and everything computed
<geist>
got fairly close to 10:1 average
<geist>
well more like 10:1 overhead of the loop
<nikolar>
so you got the original code and transpiled it into your own representation?
<heat>
transmeta but geist
<nikolar>
kek
<nikolar>
and in software
<mjg>
OY netbsd landed O_CLOFORK
<mjg>
This is Ricardo Branco's implementation of O_CLOFORK (and
<mjg>
associated fcntl, etc) for NetBSD (with a few minor changes
<zid>
I wish my gameboy was 10:1 but it has to update all its devices every subcycle because I was too lazy to write the version with scheduling :P
<Ameisen>
geist - see, what I call an interpreter itself would be 'it looks up each instruction, and executes it, each time'. It doesn't do anything else.
<Ameisen>
So, performance is poor. It's interpreting things. Inbetween, you can start doing more work like precomputing function calls, dynamic recompilation, etc...
<zid>
I bet mips is pretty amenable to avx tricks too
<Ameisen>
my interpreter is a very dumb interpreter since it's not really intended to be used except as a fallback or for specific tasks, but it's pretty simple. It has to look up each instruction, execute it, check state, etc.
<Ameisen>
the dynamic recompiler is not optimal and I don't think that I can make it such within the constraints it has
<Ameisen>
I think it's... 'closer' to what you're calling an interpreter though.
<Ameisen>
I don't like calling mine a JIT simply because most people seem to use that just to refer to things like tracing JITs
<Ameisen>
I need instruction-level accuracy, so I can't start merging instructions together (except in certain cases that I'm looking into)
<heat>
mjg: lol lol lol lol lol lol lol
<heat>
lol
<heat>
xd xd xd xd xd xd xd
<mjg>
chill dawg
<nikolar>
i think you said that already heat
<heat>
Good
<zid>
nikolar: Hey don't make fun of heat for his low mental clockspeed
<nikolar>
lol
<mjg>
heat is a arschloch
<mjg>
an
<zid>
UMA MUSUME, THEY WERE BORN TO RUN
<zid>
That's the clockspeed limit of my brain ^
<Ameisen>
zid: yeah, if you're rebuilding the binary as a whole into something new, you can do a lot of tricks. My goal has just had specific constraints and I keep butting my head into the issues those constraints cause.
<zid>
Ameisen: that's a dynarec/jit, not an interp
<Ameisen>
though if those issues weren't there it'd be boring.
<zid>
I'm saying I bet you can avx the *interp*
<Ameisen>
I'm not sure how
<heat>
mjg: same as you hunny bun <3
<zid>
decode multiple instructions at the same time
<zid>
as one example
<Ameisen>
but how are you executing them, following the mandated specification requirements during them, and also allowing the user to interrupt execution after 2 instructions with the state maintained?
<zid>
mips is fixed width right
<mjg>
heat: true
<zid>
???
<zid>
what's that got to do with anything
<zid>
you still *retire* them in-order
<Ameisen>
you're just talking about instruction lookup?
<Ameisen>
yeah, my approach for that is awful, I just haven't bothered to improve it because the interpreter isn't intended to be used in that way.
<zid>
but when you do op = INSTR&OPMASK; reg_dst = (INSTR & REG_DST_MASK) >> REG_DST_MASK_CRAP; ...
<Ameisen>
ah, so full decoding you mean
<zid>
you could probably just fetch 8 instructions and do all 8 at once, generating reg_dst[8] and then just looping over those
<Ameisen>
I'd have to write a new interpreter to do things like that; the current one is intended to be fully portable and relatively simple.
<Ameisen>
it wouldn't be impossible to do
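zid's batch idea as a sketch: the field offsets are the real MIPS R-type bit positions, but the batch size and struct-of-arrays layout around it are hypothetical. Because the fields are fixed-position, the loop is straight-line shift/mask work the compiler can unroll or vectorize:

```cpp
#include <cstdint>

struct DecodedBatch {
    uint8_t op[8], rs[8], rt[8], rd[8];
};

static void decode8(const uint32_t insn[8], DecodedBatch &d) {
    for (int i = 0; i < 8; ++i) {          // trivially vectorizable
        d.op[i] = (insn[i] >> 26) & 0x3F;  // opcode, bits 31-26
        d.rs[i] = (insn[i] >> 21) & 0x1F;  // source reg, bits 25-21
        d.rt[i] = (insn[i] >> 16) & 0x1F;  // target reg, bits 20-16
        d.rd[i] = (insn[i] >> 11) & 0x1F;  // dest reg, bits 15-11
    }
}
```

Execution still retires one instruction at a time; only the decode is batched.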
<zid>
nikolar: I have downloaded the horseime.
<zid>
It is 80GB for S01
<nikolar>
why is it 80 gigs
<Ameisen>
I care more about the performance of the dynamic recompiled code itself, since that's where the time is really being spent (unless the interpreter is getting used for some reason, heavily)
<zid>
cus it's old so it has blurays instead of webrips :P
<nikolar>
kek
<nikolar>
i thought someone would've reencoded or something
<zid>
yea the main 'speedup' of a JIT, even if you're non-optimizing, is that you delete all of the decoder code
<zid>
nikolar: but every umapixel is important
<nikolar>
good counter point
<zid>
the decodes stay decoded
<zid>
because you're writing them to a buffer
<Ameisen>
right now, during the test run, the dynamic recompiler doesn't drop to the interpreter fully at all (until exit). It processes a handful of emulated instructions - like 30 out of 45 billion.
<Ameisen>
the decoder isn't hit at all when the dynamic recompiled code is running.
<Ameisen>
it only gets hit when chunks are being generated
<Ameisen>
I don't think it's ever come up in a profile (though profiling this is a pain).
<zid>
exactly
<zid>
Hence
<zid>
> the main 'speedup' of a JIT, even if you're non-optimizing, is that you delete all of the decoder code
<Ameisen>
oh, I thought you meant completely unloading it in a literal sense.
<nikolar>
well, if you're going really fancy, you can dynamically optimize
<nikolar>
but yeah, yeeting the decoder code is the first obvious speed up
<Ameisen>
There are other gains to, mainly in that I can control state sharing between instructions more precisely in generated code than I can from even the flattest C++.
<Ameisen>
too*
<nikolar>
eww c++
<zid>
The fuck is a flattest C++ and what state
<zid>
???
<Ameisen>
the registers, the state of various things (like the IP, DBP, etc), and not needing things to get pushed back into memory until necessary.
<zid>
"jits are faster than interps because of the huge gain of not using flat C++"
<zid>
what you just said
<Ameisen>
I'm not sure how to explain what flattened code means at 6:30 AM
<Ameisen>
:|
<zid>
then don't pretend it's a thing we'll understand
<zid>
but yea, mips doesn't get to benefit from pinned regs, rip
<Ameisen>
It's a term we used in game development. Recursive inlining so your call tree is flat.
<Ameisen>
along with intraprocedural optimizations, you can get some very nice code generation that way. Or very bad.
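The flattening Ameisen describes, in miniature (toy functions; `flatten` is a real GCC/Clang attribute that inlines the annotated function's entire call tree into it, rather than inlining the function into its callers):

```cpp
static int step_a(int x) { return x * 2; }
static int step_b(int x) { return step_a(x) + 1; }

__attribute__((flatten))
static int hot_loop(int n) {
    // With flatten, step_b (and, recursively, step_a) are inlined here,
    // leaving one flat body the optimizer can treat as a single unit.
    int acc = 0;
    for (int i = 0; i < n; ++i) acc += step_b(i);
    return acc;
}
```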
<zid>
yea I'm not sure C++ workarounds are super relevant to describing a jit
<nikolar>
lol
<Ameisen>
They're common to C or C++, but obviously not.
<Ameisen>
I'm just saying I cannot get the compiler to do a lot of things regardless of what I do when using C or C++, that you can when generating code.
<zid>
It doesn't need a name in C, it's just called "The compiler will do it"
<Ameisen>
the compiler is surprisingly bad at doing it a lot :)
<zid>
it really isn't lol
<zid>
my entire gameboy emulator is in a single god damn function
<Ameisen>
especially if you need portability and have to target compilers that optimize poorly in these regards.
<zid>
with labels like lto.38493
<zid>
thanks gcc
<Ameisen>
I'm surprised that it did that without `__attribute__((__flatten__))` everywhere.
<Ameisen>
whenever I test these things with GCC, Clang, or MSVC... they really don't want to inline in a lot of cases.
<nikolar>
yeah, because it was written in c
<nikolar>
not c++
<Ameisen>
the backends don't care if it's C or C++.
<zid>
That's just how C works Ameisen, I get to define my interfaces properly
<nikolar>
the backends don't care no
<nikolar>
but the frontends differ, widely
<Ameisen>
the optimizer doesn't either.
<nikolar>
in the code they emit
<Ameisen>
The Clang frontends for both are identical
<Ameisen>
they're the same code
<nikolar>
sure
<zid>
C++ frontend is in a huge battle against de-virtualization and blah blah blah
<Ameisen>
GCC is different, no idea about MSVC.
<zid>
that C just gets to entirely skip
<nikolar>
^
<nikolar>
i said they emit different code
<Ameisen>
devirtualization is an optimization pass
<nikolar>
not that they are different code
<Ameisen>
it's not a part of the frontend.
<nikolar>
either way
<nikolar>
c is easier to optimize
<Ameisen>
the only thing the frontend does is generate the IL, which, if you're not using those features, doesn't really differ between C and C++.
<zid>
C++ has to make way way more shit public
<zid>
which just ruins a lot of optimizations
<Ameisen>
eh? Everything is public in C++...
<Ameisen>
err, C
<zid>
no, basically nothing is
<nikolar>
that's not the kind of "public" he's talking about i imagine
<nikolar>
it's the same in c++ anyway
<Ameisen>
I'm not sure what he means by 'public' in that context. Do you mean 'C++ has more features available that the compiler has to take into account'?
<zid>
no
<zid>
what do the C++ weenies call it
<zid>
pimpl
<Ameisen>
I mean, I don't see any private implementations in my code.
<Ameisen>
So I'm not sure how they're relevant...
<Ameisen>
I can implement the same paradigm in C anyways
<Ameisen>
though it'll be worse.
<zid>
right, but then they're literally the same
<Ameisen>
Correct, that's my point.
<zid>
if you're talking C++, that means C++ features
<zid>
not C
<Ameisen>
Just because you're using C++ doesn't mean that you're using all of C++'s features at all times everywhere.
<zid>
C++ is x means nothing if you're just writing .c files but compiling them with -x cpp-with-preprocessor
<Ameisen>
That would be stupid.
<Ameisen>
I heavily use C++ features, usually templates and constexpr.
<Ameisen>
I don't use the particular ones you're talking about very often because they're not really relevant to my use-cases.
<zid>
great, but that doesn't change what we were talking about really
<zid>
C compilers are very good at inlining C
<Ameisen>
GCC is an odd one in that the frontends are different (though I haven't looked at the IL for it). Clang generates basically the same thing for comparable C and C++.
<zid>
C++ (actual C++ code, not C in a disguise) is much harder to inline, because of things like having to *also* pull off devirtualization.
<zid>
before the flat C re-eappears
<Ameisen>
To put it another way: I'm not using any features in my code that are particularly hard for the compiler to process in that sense.
<zid>
re-appears*
<Ameisen>
and when I do, I'd be doing something worse in C anyways.
<nikolar>
optimizers are limited in the number of things they can do
<nikolar>
the more things it needs to see through and optimize, the worse it gets
<Ameisen>
the optimizer is literally the same in this case, though.
<nikolar>
so c, being simpler, is far easier to optimize
<zid>
like if I write T<x<<"bob"::f(j &)> crap, it has to turn that into flat C, *then* apply optimizations you'd consider for C
<Ameisen>
The optimizer doesn't know if you're optimizing C or C++
<Ameisen>
it's optimizing IL
<zid>
like inlining
<nikolar>
Ameisen: i am not saying it knows
<nikolar>
i am saying it needs to wade through more shit to get to the core if it's optimizing what was emitted for c++
<zid>
> if I write T<x<<"bob"::f(j &)> crap, it has to turn that into flat C, *then* apply optimizations you'd consider for C
<nikolar>
it doesn't need to know or care for that to be a fact
<zid>
nikolar ^
<Ameisen>
I mean, that's wholly untrue unless you're talking about weird contexts.
<Ameisen>
Anyways, this argument is dumb and religious, and I really don't want to have it since it won't go anywhere, so I'm going to go to bed.
<nikolar>
i mean you said it yourself
<zid>
wholly untrue? lol. It's just a basic fact of language
<nikolar>
you need to convince compiler to inline things for you
<nikolar>
i am just explaining why you don't have to do that when you're working in c
<nikolar>
at least not as much
<zid>
nikolar: And given C++ is largely a superset, they suffer from the same things that they *can't* optimize well, generally. C++ obviously just has points of failure *on top*. Because it has more language.
<Ameisen>
I should note that I have a _lot_ of familiarity with C++ and compiler optimization passes in these senses - I did a lot of work on more constrained targets like AVR specifically with C++ to understand what the compiler and optimization passes struggled with and what they didn't. The things they struggled with were there, but they were generally limited to things like `virtual` (and that _specifically_), but the equivalent constructs in C generally
<Ameisen>
resulted in worse code.
<zid>
Ameisen: Were any of those passes only relevant to the C++ language?
<Ameisen>
exceptions also were problematic, ESPECIALLY on AVR.
<zid>
Such as, for example, devirtualization
<Ameisen>
I already answered your question.
<Ameisen>
I specifically mentioned `virtual`, which is the only reason that devirtualization exists.
<zid>
No it isn't
<zid>
It just means turning a class call into a function call
<zid>
bypassing the vtable
<Ameisen>
and... what causes a vtable to exist?
<zid>
by recognizing that the class hasn't been inherited etc
<zid>
vtables exist because that's the way you implement C++ classes, because of inheritance and stuff meaning the pointers might change at runtime
<Ameisen>
That's... not correct.
<zid>
you *need to do an optimization pass* to prove they *won't* change, called devirtualization, which turns them back into flat calls
<Ameisen>
vtables exist (though not mandated by the spec) simply because it's a convenient way to implement virtual dispatch.
<zid>
That's literally what I just said
<Ameisen>
No class has a vtable if it isn't inheriting virtually.
<zid>
That's literally what I just said
<Ameisen>
and it's literally what I said first.
<Ameisen>
if you aren't using `virtual`, devirtualization isn't relevant.
<Ameisen>
It literally is a pass with nothing to do.
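For concreteness, a minimal case of the pass being argued over (types invented for illustration): when the compiler can see the dynamic type, here because the class is `final`, the virtual call needs no vtable lookup and can be inlined like any direct call.

```cpp
struct Cpu {
    virtual ~Cpu() = default;
    virtual int step() = 0;
};

struct Mips final : Cpu {     // 'final': no further overrides are possible
    int pc = 0;
    int step() override { return pc += 4; }
};

static int run_known(Mips &m) {
    // Static type is the final class, so this call is devirtualizable:
    // the compiler may emit a direct (and then inlined) call to Mips::step.
    return m.step();
}
```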
<zid>
You're agreeing with me 100% then telling me I am wrong
<Ameisen>
read what I wrote.
<Ameisen>
then what you asked.
<zid>
I did read it
<zid>
You agree that it can elide all this crap when certain things aren't happening
<zid>
You agree that these things are part of C++ and not C
<Ameisen>
> but they were generally limited to things like `virtual` (and that _specifically_)
<Ameisen>
which implies devirtualization
<zid>
but you disagree that.. C++ has to elide it when certain things aren't happening
<zid>
as an optimization
<Ameisen>
that optimization pass runs whether it's C or C++.
<Ameisen>
it is performed on IL.
<zid>
good for it
<zid>
That's a weird implementation detail
<zid>
and 100% irrelevant
<Ameisen>
if your C++ doesn't contain virtual dispatch, then that optimization pass has nothing to do.
<Ameisen>
and `virtual` is actually pretty rare in most contexts.
<zid>
so now the optimization DOES exist, but it's "rare"
<zid>
so does it exist or not? You've had it both ways now
<Ameisen>
*sigh* I'm going to sleep, this is stupid.
<Ameisen>
I can't tell if you're being obstinate or if there's a language barrier.
<heat>
Oh nice, the not-c++ people are explaining c++
<Ameisen>
heat: I had an argument on the C programming subreddit once. it was... fun. They were arguing things like 'all objects in C++ are dynamically allocated', and other things. Then they started saying things that sounded suspiciously like C#... and they linked to a page that was clearly AI copied from C# to C++.
<Ameisen>
my brain was being very loud and making it hard to sleep
<Ameisen>
zid: if you'd like, when my shoulder is in better shape I can write up a report about C and C++ optimization issues in this regard, since I've done a _lot_ of work into it. It might be more productive than bickering on an IRC channel.
<Ameisen>
otherwise, I'm thinking of ways I can portably use things like AVX (there are generic ways to do similar) to prefetch instructions as ye suggested.
<nikolar>
Depends on what you mean by probably
<nikolar>
Since requiring avx limits portability
<nikolar>
*portability
<heat>
*portability
<zid>
*portability*
<Ameisen>
*portabello
<Ameisen>
AVX does, though there are similar-ish extensions on other archs, and there are generic SIMD libraries that can expand to it
<nikolar>
simde is nice, but I don't know if it does what you want
<zid>
bonus points: Disregard all the mov [ebp+reg0] stuff, keep all the mips regs in avx regs :P
<zid>
There's enough regs on amd64 that I had considered making my z80 emulator just pin all its regs to real regs with register asm("r9"); type stuff, but never got around to bothering to test it
<zid>
(I'd have to write thunks around the sdl code to conform to the C abi again, so I couldn't just do it as a one-liner)
<nikolar>
How many registers does z80 have
<zid>
real z80 has more cus of IX/IY and some dram refresh reg and things, but gbz80 has AF, BC, DE, HL, SP, PC
<Ameisen>
zid: most things I've seen have suggested that if you can keep your register file in a cache line well-enough, it will be faster than trying to insert/extract things from SIMD registers.
<zid>
yea for sure
<zid>
It is however, hilarious
<nikolar>
That's a good argument
<heat>
you're making me want to work on eBPF again :/
<nikolar>
Because it's hilarious?
<heat>
because i get to work on a jitter
<heat>
i have a cBPF implementation and x86 jitter i'm yet to integrate in the kernel
<nikolar>
cbpf was the earlier one right
<heat>
yeah
<heat>
eBPF has lots more regs, and shit like atomics
<heat>
and if you want to be real correct, you need a verifier as well (though, lol, i disagree that you even need it)
<nikolar>
Lol
<heat>
as far as I understand the eBPF verifier is kind of a leftover from the times where they thought unprivileged eBPF would be fine
<heat>
and now only exists because ring 0 doesn't imply code loading privileges as well
<heat>
besides helping prove correctness, but...
<nikolar>
Interesting
<Ermine>
"jitter"
<Ermine>
that scares real time streaming people...
<kof673>
:D double meaning
* kof673
awards meta points
<kof673>
a jit is liable to create jitter
<kof673>
as well as jitterish
<Ameisen>
in a store, for instance, most of the time is spent just validating things and checking if an exception needs to be thrown, which sucks.
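A sketch of the per-store validation being lamented (the names, error taxonomy, and flat-RAM model are hypothetical): the actual write is one instruction, but it sits behind alignment and range checks that each feed an exception path.

```cpp
#include <cstdint>
#include <cstring>

enum class StoreResult { Ok, Unaligned, OutOfRange };

static StoreResult emulated_sw(uint8_t *ram, uint32_t ram_size,
                               uint32_t addr, uint32_t value) {
    if (addr & 3)
        return StoreResult::Unaligned;      // -> address-error exception path
    if (addr >= ram_size || ram_size - addr < 4)
        return StoreResult::OutOfRange;     // -> bus-error exception path
    std::memcpy(ram + addr, &value, 4);     // the actual work
    return StoreResult::Ok;
}
```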
<heat>
i would try to move those thunks to a designated place
<Ameisen>
though I really need to break out of getting stuck in the weeds, since there are actual features I need to add and bugs I need to fix, and the little performance work is just very addictive but not very useful.
<Ameisen>
the thunks themselves are, those are just the jumps to them
<heat>
you're jumping over the thunks though?
<heat>
ah ok i see
<heat>
well that's still a thunk :p
<heat>
also, proper register allocation would be great -- though, yes, hard
<Ameisen>
it's sorta a thunk. The thing they're jumping to is a thunk outright, though
<Ameisen>
yeah, it's more difficult because of the fact that the instructions need to stay discrete. I have an idea for how to do it, but it's going to take a lot of work.
<Ameisen>
and I'm not 100% sure it will always be beneficial.
<heat>
you gotta test
<heat>
sometimes with performance work you spend a lot of timing doing something just to figure out it isn't worth it
<heat>
such is life
<Ameisen>
yeah. I know that my idea, if there's a loop that jumps across chunk boundaries or even just jumps into a weird place in the same chunk, and it does so a -lot-, will make it worse.
<Ameisen>
I can easily reserve/maintain register-cached values across the chunk itself, assuming linear execution. I just have to push/pop that state on a jump.
<Ameisen>
I suspect that allowing for hot paths in code (allowing optimized cross-instruction paths where there's no instruction count or jump hazard) will be more beneficial
<Ameisen>
sorta-almost tracing
<Ameisen>
but the performance of it right now is acceptable, I really need to work on getting a few things implemented.
<Ameisen>
the biggest one is getting the ability to update chunks, and thus invalidate patches, so that self-modifying code will work right.
<Ameisen>
there's also two edge cases associated with that (needing to add/remove delay branch flags from the start of the _next_ chunk if relevant, and also handling when the chunk triggered its own update, which means that I have to `ret` to somewhere else)
<Ameisen>
then I need to fix a performance issue regarding jumping to invalid memory, and implementing LL/SC properly.
<Ameisen>
then it should be stable
<heat>
LL/SC? yikes
<Ameisen>
yeah. They don't behave correctly right now.
<Ameisen>
they just act like normal load/stores. I am probably going to do the most relaxed version the spec allows me to do.
<Ameisen>
that's basically 'if there was a store anywhere, the linked operation fails'
<Ameisen>
the spec does in fact allow me to do that
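The maximally-relaxed LL/SC scheme described above, sketched with invented names: one global store counter, snapshotted by LL; SC fails if *any* store happened in between.

```cpp
#include <cstdint>

struct EmuState {
    uint64_t store_count = 0;   // bumped by every emulated store
    uint64_t link_count  = 0;   // snapshot taken by LL
    bool     link_valid  = false;
    uint32_t mem[64]     = {};  // toy guest memory, word-indexed
};

static uint32_t ll(EmuState &s, unsigned addr) {         // load-linked
    s.link_count = s.store_count;
    s.link_valid = true;
    return s.mem[addr];
}

static bool sc(EmuState &s, unsigned addr, uint32_t v) { // store-conditional
    if (!s.link_valid || s.link_count != s.store_count)
        return false;            // some store intervened: SC fails
    s.mem[addr] = v;
    s.store_count++;
    s.link_valid = false;
    return true;
}

static void store(EmuState &s, unsigned addr, uint32_t v) { // ordinary store
    s.mem[addr] = v;
    s.store_count++;             // invalidates any outstanding link
}
```

This is deliberately pessimistic (unrelated stores also break the link), which is the conservative behavior the spec permits.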
<Ameisen>
regarding that jmp to the thunk - in equivalentish code, clang keeps it in the middle, gcc puts it at the end. They both generate very similar code to me, at least.