michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
IndecisiveTurtle has quit [Ping timeout: 260 seconds]
thilo has quit [Ping timeout: 276 seconds]
thilo has joined #ffmpeg-devel
HideoSugai has quit [Ping timeout: 240 seconds]
cone-421 has quit [Quit: transmission timeout]
minimal has quit [Quit: Leaving]
Marth64 has joined #ffmpeg-devel
MisterMinister has joined #ffmpeg-devel
Xaldafax has quit [Quit: Bye...]
HideoSugai has joined #ffmpeg-devel
Martchus has joined #ffmpeg-devel
Martchus_ has quit [Ping timeout: 248 seconds]
Traneptora has quit [Quit: Quit]
jamrial has quit []
HideoSugai has quit [Quit: Client closed]
MisterMinister has quit [Ping timeout: 252 seconds]
<jkqxz>
jamrial: Isn't there a penalty for mixing xmm and ymm access to the same register? (I have been careful to avoid that but your suggested change does it.)
<kurosu>
the bottom of a ymm is an xmm, but you incur a penalty (implicit vzeroupper) whenever you use a ymm
<kurosu>
another penalty is domain transition (between int and float) but no idea if that's that big even nowadays
<kurosu>
jkqxz: looked at the code, no penalty
<jkqxz>
Is there any penalty for having written a ymm register and then later addressing the xmm half only?
<kurosu>
no
<kurosu>
I mean, in the same function. What you refer to is likely conflict resolution and false dependency that vzeroupper solves
<kurosu>
If you're going to do pure xmm in a 2nd part of the function, or a later one, then sure
<jkqxz>
A false dependency on the upper half is fine as long as there isn't some unexpected lane penalty that I'm missing.
<kurosu>
I kind of recall it depends on the instruction encoding as well (the vex evex etc)
<jkqxz>
Trying to work this out from the Intel optimisation manual it does look like the problems are all in mixed VEXed/unVEXed cases.
<kurosu>
and I think the xmm operations in an INIT_YMM avx function will be auto-encoded as the VEX form
<jkqxz>
And the implicit upper-zeroing of writes to xmm registers doesn't cost anything.
<kurosu>
So, no, I don't think it will cause a problem
<kurosu>
RET will auto-insert a vzeroupper
<jkqxz>
(Which does make sense when thinking about the lane split, because you just rename the upper half to be a reference to your zero register.)
<jkqxz>
I think it makes sense to rewrite the final normalisation in pairs anyway, which avoids any xmm in the >8-bit case (write to memory with vextracti128).
<jkqxz>
That doesn't quite work with the 8-bit case because it needs 64-bit writes, though maybe vpermq + movq.
<linkmauve>
jkqxz, kurosu, until Skylake there was a very high performance cliff for using a ymm register as xmm without vzeroupper, that got fixed in Skylake which made vzeroupper more or less a noop IIRC.
<fflogger>
[newticket] Levan: Ticket #11550 ([ffmpeg] Simple commend -c copy -t ** file.mp4 no longer works) created https://trac.ffmpeg.org/ticket/11550
<kurosu>
jkqxz: btw, I imagine it's a high bitrate codec, but what is the nz count high enough that traditional dequant during inverse zz scan is slower?
<kurosu>
-what
<jkqxz>
The default compression ratio target is ~7 and that seems to average something like 10 nonzero coefficients per block (i.e. ~9 bits per nonzero coefficient).
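As a rough check of these figures, assuming 8x8 blocks (64 samples) and 10-bit samples (the bit depth is an assumption, it is not stated in the log):

```latex
\frac{64 \times 10\ \text{bits}}{7} \approx 91\ \text{bits per block},
\qquad
\frac{91\ \text{bits}}{10\ \text{nonzero coefficients}} \approx 9\ \text{bits per nonzero coefficient}
```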
<jkqxz>
Having put the unzigzag inside the entropy the combination felt better this way around, but I admit I have not actually compared against the reverse.
mkver has quit [Ping timeout: 276 seconds]
cone-731 has quit [Quit: transmission timeout]
<jkqxz>
kurosu: Any idea whether there would be value in checking for zero rows to help that? It can do that in the dequant easily, but if the row transform is first then it can happen there as well. (As "ptest mN ; jz skip_row_N".)
<jkqxz>
Those branches would be somewhat predictable, too.
<jkqxz>
I guess in the entropy you'd know what nonzero coefficients you had written and then it could dispatch to one of 8x8, 4x8, 8x4, 4x4 or DC-only for the transform.
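A minimal sketch of that dispatch idea, not code from any actual decoder: the entropy stage records the largest column and row at which it wrote a nonzero coefficient, and the transform is picked from that extent. All names here (NzExtent, note_coeff, pick_transform, the TX_* values) are hypothetical, positions are assumed to be raster indices 0..63 after unzigzag, and "8x4" is assumed to mean 8 wide by 4 tall.

```c
/* Hypothetical transform variants; WxH is read as width x height. */
enum TxKind { TX_DC_ONLY, TX_4X4, TX_8X4, TX_4X8, TX_8X8 };

/* Extent of the nonzero coefficients seen so far in one 8x8 block.
 * Initialise to {0, 0} before decoding each block. */
typedef struct NzExtent {
    int max_col;
    int max_row;
} NzExtent;

/* Called by the (hypothetical) entropy decoder for every nonzero
 * coefficient it writes; pos is the raster position 0..63. */
static void note_coeff(NzExtent *e, int pos)
{
    int col = pos & 7;
    int row = pos >> 3;
    if (col > e->max_col)
        e->max_col = col;
    if (row > e->max_row)
        e->max_row = row;
}

/* Pick the smallest transform that covers every nonzero coefficient.
 * An all-zero block also lands on TX_DC_ONLY here. */
static enum TxKind pick_transform(const NzExtent *e)
{
    if (e->max_col == 0 && e->max_row == 0)
        return TX_DC_ONLY;
    if (e->max_col < 4 && e->max_row < 4)
        return TX_4X4;
    if (e->max_row < 4)
        return TX_8X4;   /* full width, top four rows only */
    if (e->max_col < 4)
        return TX_4X8;   /* left four columns, full height */
    return TX_8X8;
}
```

The tracking costs two compares per nonzero coefficient, so whether it pays off depends on how often the smaller transforms are actually selected.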
<jkqxz>
Does dav1d do anything like that for the large transforms? (AV1 mandates the 64x64 to only have coefficients in the top-left 32x32, but I mean for others where it can do it opportunistically.)
<another|>
yes
quietvoid has quit []
<another|>
at least if I understand you correctly
<another|>
you mean like early exits in case there is only data in the top left corner?
<kurosu>
I kind of remember an older idct skipping empty rows, but again much lower ratio
<kurosu>
If you get a gain by that, it's worth benchmarking doing the dequant in the entropy decoding loop. I think that's what is done again in older codecs
<kurosu>
implementations in FFmpeg
<kurosu>
(sorry no PC to easily and quickly look that up)
<jkqxz>
It seems worth trying. At the higher compression ratios it's plausible that a decent proportion of blocks will fit in 4x4. Probably people won't use the sort of artificial content which gives you DC-only, though.
<jkqxz>
And yes, dequant in entropy would be wanted to go with that.
<kurosu>
Yep see COND macros in simple_idct asm
<kurosu>
Anyway, maybe just go with whichever is simpler, get it merged, then experiment
<kurosu>
Though simple idct explicitly ors coeffs to know what it can skip, and doesn't get the info from the entropy decoding
<kurosu>
(like dav1d does)
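The OR-based approach mentioned above can be sketched in plain C as follows. This is an illustration of the idea only, not the simple_idct code; the function name nonzero_row_mask and the row-major int16_t layout are assumptions.

```c
#include <stdint.h>

/* Returns a bitmask in which bit r is set when row r of a row-major
 * 8x8 coefficient block contains at least one nonzero coefficient.
 * Hypothetical helper, written for illustration. */
static unsigned nonzero_row_mask(const int16_t block[64])
{
    unsigned mask = 0;
    for (int r = 0; r < 8; r++) {
        int acc = 0;
        /* OR all eight coefficients: acc is zero only if all are zero. */
        for (int c = 0; c < 8; c++)
            acc |= block[r * 8 + c];
        if (acc != 0)
            mask |= 1u << r;
    }
    return mask;
}
```

A row pass could then skip any row whose bit is clear, at the cost of reading every coefficient once, which is the work that taking the information from the entropy decoder would avoid.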
paulk has quit [Ping timeout: 252 seconds]
paulk has joined #ffmpeg-devel
paulk has quit [Ping timeout: 260 seconds]
Luna_Rabbit has quit [Quit: Do you believe in magic?]
Moon_Rabbit has joined #ffmpeg-devel
<Lynne>
jkqxz: how fast is the decoder, for typical video?
<haasn>
ramiro: I may have to add some fudge code for 3-element writes (by assuming the actual write size is rounded up to some reasonable power of 2)
<haasn>
did you implement that yet in your code?
<haasn>
I guess with your strided writes it's easy?
<haasn>
on x86 it's really hard to handle packed 3-element writes without any overwrite, though I'm sure it's possible (maybe I can just write the last 96 bits of the last XMM reg as two scalar writes)
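The two-scalar-write idea can be sketched portably as below, assuming three 32-bit elements held in the low 96 bits of a 128-bit value. The function name store_3x32 and the array stand-in for an XMM register are hypothetical; on x86 the two copies would be one 64-bit and one 32-bit store.

```c
#include <stdint.h>
#include <string.h>

/* Writes the low 96 bits of a 128-bit value (modelled here as four
 * 32-bit lanes) to dst without touching dst[12] and beyond. */
static void store_3x32(uint8_t *dst, const uint32_t v[4])
{
    memcpy(dst, v, 8);          /* lanes 0 and 1: one 64-bit write */
    memcpy(dst + 8, v + 2, 4);  /* lane 2: one 32-bit write */
}
```

Lane 3 is never written, so the bytes that follow the 12-byte destination stay intact.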