michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
thilo has quit [Ping timeout: 260 seconds]
thilo has joined #ffmpeg-devel
Kei_N_ has joined #ffmpeg-devel
Kei_N has quit [Ping timeout: 244 seconds]
<jamrial>
jkqxz: openapv just accepts any kind of input and will always report profile_idc 33, lol
<jamrial>
maybe we should remove the yuv444p10 support until the library is a bit more mature
<fflogger>
[newticket] SYamaguchi: Ticket #11579 ([ffplay] The tints appear to be different when the same image with different resolutions is played with ffplay.) created https://trac.ffmpeg.org/ticket/11579
<fflogger>
[editedticket] MasterQuestionable: Ticket #11579 ([ffplay] The tints appear to be different when the same image with different resolutions is played with ffplay.) updated https://trac.ffmpeg.org/ticket/11579#comment:2
kunkku has joined #ffmpeg-devel
TheVibeCoder has quit [Quit: Client closed]
TheVibeCoder has joined #ffmpeg-devel
minimal has joined #ffmpeg-devel
<Lynne>
I think it was too early to give softworkz push access
av500 has quit [Remote host closed the connection]
av500 has joined #ffmpeg-devel
<fflogger>
[newticket] juanitotc: Ticket #11580 ([build system] ffmpeg-7.1.1 build fails with nasm) created https://trac.ffmpeg.org/ticket/11580
<haasn>
wow
<haasn>
2 bit gray is almost indistinguishable from 10 bit gray on my display, with temporal dithering on a 64x64 blue noise texture
<haasn>
at 1 bit depth you can definitely tell the dither pattern but the 240 Hz temporal dithering smooths it out so well that even 2 bits is basically visually transparent
<haasn>
that's kinda insane, you definitely don't get that level of smoothness at 60 Hz
<haasn>
but the eye just fuses the temporal dither pattern into a single dither pattern with a much higher resolution
<haasn>
anyway, I've determined experimentally that 64x64 blue noise dither provides equivalent or better quality compared to error diffusion
<haasn>
especially in motion
<haasn>
with temporal dither, it's even better than ED
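[editor's note] The ordered and temporal dithering haasn describes can be sketched roughly as below. This is only an illustration under stated assumptions: a 4x4 Bayer matrix stands in for the 64x64 blue noise texture, the temporal variation is a simple horizontal shift of the tile per frame, and `dither_sample` is a hypothetical name, not swscale code.

```c
#include <assert.h>
#include <stdint.h>

#define TILE 4 /* stand-in size; the chat used a 64x64 blue noise texture */

/* 4x4 Bayer thresholds (0..15), used here in place of real blue noise. */
static const uint8_t tile[TILE][TILE] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* Quantize a 10-bit sample to `bits` bits with ordered dithering.
 * The tile is shifted by one column per frame (assumed scheme), so a
 * given pixel sees a different threshold on successive frames. */
static unsigned dither_sample(unsigned v10, int bits,
                              unsigned x, unsigned y, unsigned frame)
{
    assert(v10 <= 1023 && bits >= 1 && bits <= 10);
    const unsigned max_out = (1u << bits) - 1;
    /* Output level in fixed point with 4 fractional bits. */
    const unsigned scaled = v10 * max_out * 16 / 1023;
    const unsigned base   = scaled >> 4;
    const unsigned frac   = scaled & 15;
    const unsigned thresh = tile[y % TILE][(x + frame) % TILE];
    /* Round up only where the fraction exceeds the local threshold. */
    return base + (frac > thresh);
}
```

With this tile, mid-gray at 1 bit turns on half of the pixels in each 4x4 block, and a single pixel is on for half of any four consecutive frames, which is the spatial and temporal averaging the eye is relied on to fuse.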
<jkqxz>
jamrial: Yeah, I think only 422-10 profile is correct.
<jkqxz>
They also need to sort out the stability, they can't have regular ABI breaks if it is being packaged. (E.g. see top PR now.)
<jamrial>
yeah, saw that
<jamrial>
jkqxz: their headers also don't even report a version, so...
<jamrial>
could be a good opportunity for them to clean them up and introduce a version define once it's stable
<haasn>
well, with non-temporal dithering even 64x64 blue noise is a bit "rougher" than ED at 1 bit depth, particularly for midtones
<haasn>
though this is a flawed comparison anyway because we are not doing gamma aware dithering (I'm working on that)
cone-227 has joined #ffmpeg-devel
<cone-227>
ffmpeg James Almer master:244ad944e947: avcodec/liboapvenc: remove 4:4:4 support until it's properly handled
<ramiro>
haasn: when inputting yuv444p12le, the max values are set to 65535. shouldn't it be 4095, or are we accepting that the input might be invalid?
<haasn>
I’m undecided on the issue
<haasn>
I think I may split it into max possible and max legal
<haasn>
Well, no, that wouldn’t really help
<ramiro>
because, currently, -src yuv444p12le -dst yuv444p16le goes through f32 and scale, but it could be done with no converting and a simple left shift.
<ramiro>
haasn: I have 3 classes of converters that are still slower: 1) simple shuffles (but I'm almost done with this one using your shuffle_solver), 2) the issue I mentioned above, converting from a smaller yuv444p to one with more bits (where a simple left shift would be faster), and 3) conversions that only add or remove alpha planes (such as yuv444p -> yuva444p), where a wrapper with memcpy and memset would be faster.
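[editor's note] A minimal sketch of the memcpy and memset wrapper ramiro mentions for yuv444p -> yuva444p, assuming 8-bit planes and a fully opaque alpha of 0xFF. The function name and signature are hypothetical and do not come from swscale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the three colour planes unchanged and fill a fourth (alpha)
 * plane with 0xFF. Only `width` bytes per row are touched, so the
 * stride padding of the destination is left alone. */
static void add_opaque_alpha(uint8_t *dst[4], const ptrdiff_t dst_stride[4],
                             const uint8_t *src[3], const ptrdiff_t src_stride[3],
                             int width, int height)
{
    assert(width > 0 && height > 0);
    for (int p = 0; p < 3; p++)
        for (int y = 0; y < height; y++)
            memcpy(dst[p] + y * dst_stride[p],
                   src[p] + y * src_stride[p], (size_t)width);
    for (int y = 0; y < height; y++)
        memset(dst[3] + y * dst_stride[3], 0xFF, (size_t)width);
}
```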
jamrial has quit [Ping timeout: 244 seconds]
jamrial has joined #ffmpeg-devel
<haasn>
I'll revive the dedicated memcpy backend
<haasn>
the only case when it's not faster is when a plane needs to be duplicated, e.g. gray -> yuvj
<haasn>
gray -> gbrp rather
<haasn>
ramiro: I think the best way around the unnecessary clamp issue is to require SWS_OP_LUT to clamp its own input if it may exceed the LUT range
<haasn>
e.g. doing a 10-bit LUT lookup on a uint16_t input
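[editor's note] The idea of a LUT op clamping its own input can be sketched as follows, assuming a 1024-entry (10-bit) table indexed by uint16_t samples. `apply_lut_clamped` is a hypothetical helper for illustration, not the SWS_OP_LUT implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LUT_BITS 10
#define LUT_SIZE (1 << LUT_BITS)

/* Look up each sample in a 10-bit LUT. The input type can hold values
 * up to 65535, so the index is clamped to the last LUT entry before
 * the lookup instead of relying on a separate clamp step upstream. */
static void apply_lut_clamped(uint16_t *dst, const uint16_t *src, size_t n,
                              const uint16_t lut[LUT_SIZE])
{
    assert(dst && src && lut);
    for (size_t i = 0; i < n; i++) {
        uint16_t idx = src[i];
        if (idx > LUT_SIZE - 1)
            idx = LUT_SIZE - 1;
        dst[i] = lut[idx];
    }
}
```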
<haasn>
now that we have access to the SwsFormat in the SwsOpList I can actually cleanly infer the expected signal range inside the optimizer
<haasn>
that's something we simply didn't have access to before, which is why I made it based on the pixel range instead of the legal range
<haasn>
for the shuffle solver, should we lift some portions of it to the common code?
<haasn>
since I assume you're copy/pasting it atm
<haasn>
ramiro: something we also don't handle completely atm is the alpha_blend_mode, e.g. blending to checkerboard
<haasn>
currently we just completely drop the alpha channel when converting e.g. rgba -> rgb24; which is obviously not desirable in practice
<haasn>
ramiro: pushed ff_sws_solve_shuffle() and the above fix to haasn/swscale6
th3synth4x has joined #ffmpeg-devel
th3synth4x has quit [Client Quit]
minimal has quit [Quit: Leaving]
<haasn>
ramiro: implemented a memcpy backend, 25% faster for yuv444p -> yuva444p (now matches reference)
<haasn>
and 32% faster for gray -> yuvj444p (memset chroma)
<haasn>
I wonder how we can handle e.g. gray -> yuv444p, which still wants memset on the chroma; one of the things I'm thinking in the back of my mind is that we want some mechanism for trying to separate planes from each other
<haasn>
we want this anyway for e.g. turning gray -> yuvj into a refcopy
<haasn>
pushed it to swscale6 as well, give it a try
<ramiro>
haasn: thanks for "swscale/optimizer: use legal value range for determining clamp requirements", now yuv444p10 -> yuv444p16 is much faster, but it's still not as fast as legacy. I believe it's because legacy does it one plane at a time. do we already have in place a mechanism to detect when the planes don't depend on each other?
<haasn>
not yet, that's what I was just talking about :)
<haasn>
how is doing it one plane at a time faster?
<ramiro>
haasn: oh, I started writing that message before I read your last messages :P
<haasn>
also how much faster are we talking about?
<ramiro>
haasn: I guess it's faster to do one plane at a time because then you don't have to access 3 memory regions at once. do the entire first plane, then entire second plane, and then entire third plane...
<ramiro>
but that's just a guess, I haven't written code to test that yet
<ramiro>
haasn: asmjit code is currently 0.855x slower compared to planarCopyWrapper (which is pure c and does one pixel at a time, so I suspect we could be much faster)
<haasn>
I will add a 4x4 dependency matrix for starters
<haasn>
that way we can try to split planes in general, maybe it's always faster to process one plane at a time?
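[editor's note] One way to picture the 4x4 dependency matrix haasn proposes, as a sketch only. The struct layout and helper names are hypothetical; the chat does not say how the matrix will actually be represented.

```c
#include <assert.h>
#include <stdbool.h>

/* dep[out][in] is true when output plane `out` reads input plane `in`. */
typedef struct PlaneDeps {
    bool dep[4][4];
} PlaneDeps;

/* An output plane can be processed on its own if it reads no input
 * plane other than the one with the same index (or none at all, as
 * with a constant fill). */
static bool plane_is_separable(const PlaneDeps *d, int out)
{
    assert(d && out >= 0 && out < 4);
    for (int in = 0; in < 4; in++)
        if (in != out && d->dep[out][in])
            return false;
    return true;
}

/* True if every output plane can be handled one plane at a time. */
static bool all_planes_separable(const PlaneDeps *d)
{
    for (int out = 0; out < 4; out++)
        if (!plane_is_separable(d, out))
            return false;
    return true;
}
```

Under this representation a per-plane bit depth change such as yuv444p10 -> yuv444p16 is a diagonal matrix, while a colour matrix conversion fills in off-diagonal entries and is not separable.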
<ramiro>
perhaps people more knowledgeable in the inner workings of many different CPUs can give us a better answer
<ramiro>
Lynne: ^^
<haasn>
fun, memcpy backend doesn't pass checkasm because it "over-writes" into the stride area
<haasn>
I guess we actually _don't_ want to check for that
<haasn>
or rather, we should check for over-write only after the last line
<BtbN>
Isn't that exactly what that area is there for? :D
<BtbN>
including after the last line
<ramiro>
haasn: "backend_murder" :P
<BtbN>
if I see a frame with a linesize/stride of 1024, I'd expect to be able to write up to the full 1024 bytes each line without any adverse effects
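[editor's note] A sketch of why a memcpy backend ends up writing into the stride area. It assumes source and destination share the same positive stride and that both buffers hold a full `stride * height` bytes, including the padding of the last line, which is the expectation BtbN describes. `copy_plane_whole` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a whole plane with one memcpy. This is the cheapest form of a
 * plane copy, but it also copies the padding bytes between the visible
 * width and the stride, on every line including the last one. */
static void copy_plane_whole(uint8_t *dst, const uint8_t *src,
                             ptrdiff_t stride, int height)
{
    assert(stride > 0 && height > 0);
    memcpy(dst, src, (size_t)stride * (size_t)height);
}
```

A checker that fills the destination padding with a guard pattern will see it overwritten by this kind of copy, even though the visible pixels are correct.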
<haasn>
Lynne: tl;dr ramiro suspects that for (i < size) { y[i] <<= 6; u[i] <<= 6; v[i] <<= 6; } is slower than for (i < size) y[i] <<= 6; for (i < size) u[i] <<= 6; for (i < size) v[i] <<= 6;
<ramiro>
Lynne: is it always faster to process one plane at a time, when they're independent, than processing them all at once? for example ld1/lsl/st1 per plane, or ld1/ld1/ld1/lsl/lsl/lsl/st1/st1/st1
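[editor's note] The two loop shapes being compared, written out as compilable C. The function names are hypothetical, and the sketch makes no claim about which variant is faster; that is exactly the open question in the discussion and would need benchmarking on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All three planes in one loop: three load/store streams are active
 * at the same time. */
static void shift_planes_fused(uint16_t *y, uint16_t *u, uint16_t *v,
                               size_t size, int shift)
{
    assert(shift >= 0 && shift < 16);
    for (size_t i = 0; i < size; i++) {
        y[i] = (uint16_t)(y[i] << shift);
        u[i] = (uint16_t)(u[i] << shift);
        v[i] = (uint16_t)(v[i] << shift);
    }
}

/* A single plane: one load/store stream. */
static void shift_plane(uint16_t *p, size_t size, int shift)
{
    assert(shift >= 0 && shift < 16);
    for (size_t i = 0; i < size; i++)
        p[i] = (uint16_t)(p[i] << shift);
}

/* One plane at a time: finish y completely, then u, then v. */
static void shift_planes_split(uint16_t *y, uint16_t *u, uint16_t *v,
                               size_t size, int shift)
{
    shift_plane(y, size, shift);
    shift_plane(u, size, shift);
    shift_plane(v, size, shift);
}
```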
<Lynne>
for arm, the mantra is "unused registers are wasted; instruction decoding and binary is cheap"
<Lynne>
can you do both?
<ramiro>
Lynne: sure, but what about memory access? reading from 3 planes and writing to 3 planes at once, or reading from and writing to 1 plane at a time?
<Lynne>
as for the if+shifts, the former is faster imho unless you're in branch-heavy code (predictor has limited res) and somehow the compiler can cmov all
<ramiro>
because currently planarCopyWrapper is faster than a very tight neon loop that does 3 planes at once.
<Lynne>
cpu should be able to pipeline that, shouldn't it
<Lynne>
also if on an in-order cpu, you can just manually move the loads and spread them out
IndecisiveTurtle has joined #ffmpeg-devel
<jkqxz>
I would guess the two approaches are equal on a big-core CPU, but the all-in-one-loop might have pathological edge cases because of memory aliasing being caught out.
<haasn>
ramiro: what is that solving?
<jkqxz>
Each step is short and independent, so the CPU can happily fill up its rename capacity with however many of them fit regardless of whether they are together or not.
<jkqxz>
But you could fall over in the together loop if it ever accidentally thinks that some of the paths alias (because most of the address bits are the same or something), and that will have huge negative consequences.
Teukka has quit [Read error: Connection reset by peer]
Teukka has joined #ffmpeg-devel
Teukka has quit [Changing host]
Teukka has joined #ffmpeg-devel
<ramiro>
haasn: for a clear_val of 0x80, the end result would be 0x81, 0x82, 0x83... with this patch it always sets 0x80 as clear_val.
<haasn>
ah gotcha
<haasn>
that case was ignored in the x86 backend because only the high bit mattered
<ramiro>
haasn: not much of an issue since they're all treated the same by pshufb and tbl, but still. it makes the assembly cleaner.
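[editor's note] A scalar emulation of why 0x80, 0x81, 0x82... all behave the same as a "clear" index. It models the 128-bit pshufb (a set high bit in the mask byte zeroes the output byte) and the single-register AArch64 tbl (any index of 16 or more yields zero). The emulation functions are illustrative helpers, not part of any backend.

```c
#include <assert.h>
#include <stdint.h>

/* x86 pshufb, 128-bit: high bit set -> 0, else low 4 bits pick a byte. */
static void emu_pshufb(uint8_t dst[16], const uint8_t src[16],
                       const uint8_t mask[16])
{
    assert(dst != src); /* the emulation does not support in-place use */
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* AArch64 tbl with one table register: out-of-range index -> 0. */
static void emu_tbl1(uint8_t dst[16], const uint8_t src[16],
                     const uint8_t idx[16])
{
    assert(dst != src); /* the emulation does not support in-place use */
    for (int i = 0; i < 16; i++)
        dst[i] = idx[i] < 16 ? src[idx[i]] : 0;
}
```

Both a mask of sixteen 0x80 bytes and a mask of 0x80, 0x81, 0x82... produce an all-zero result here, so a constant clear value changes nothing functionally and only makes the generated tables easier to read.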
iive has joined #ffmpeg-devel
rvalue- has joined #ffmpeg-devel
rvalue has quit [Ping timeout: 272 seconds]
Guest71 has joined #ffmpeg-devel
rvalue- is now known as rvalue
mkver has quit [Ping timeout: 252 seconds]
Traneptora has joined #ffmpeg-devel
Guest71 has quit [Quit: Client closed]
mkver has joined #ffmpeg-devel
novaphoenix has quit [Quit: i quit]
novaphoenix has joined #ffmpeg-devel
lemourin has quit [Ping timeout: 245 seconds]
IndecisiveTurtle has quit [Ping timeout: 265 seconds]
IndecisiveTurtle has joined #ffmpeg-devel
IndecisiveTurtle has quit [Ping timeout: 265 seconds]
novaphoenix has quit [Quit: i quit]
lemourin has joined #ffmpeg-devel
novaphoenix has joined #ffmpeg-devel
<fflogger>
[newticket] cus: Ticket #11581 ([avformat] WAV demuxer codec probe misdetects PCM data as MP3) created https://trac.ffmpeg.org/ticket/11581
<fflogger>
[newticket] Anton1699: Ticket #11582 ([ffmpeg] Please add an option to make the new "elapsed" stat optional) created https://trac.ffmpeg.org/ticket/11582