michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
thilo has quit [Ping timeout: 248 seconds]
thilo has joined #ffmpeg-devel
iive has quit [Quit: They came for me...]
\\Mr_C\\ has quit [Remote host closed the connection]
kasper93 has quit [Remote host closed the connection]
kasper93 has joined #ffmpeg-devel
realies9 has joined #ffmpeg-devel
realies has quit [Ping timeout: 276 seconds]
realies9 is now known as realies
mkver has quit [Ping timeout: 248 seconds]
Martchus_ has joined #ffmpeg-devel
Martchus has quit [Ping timeout: 252 seconds]
jamrial has quit []
_whitelogger has joined #ffmpeg-devel
secondcreek has joined #ffmpeg-devel
MisterMinister has joined #ffmpeg-devel
MisterMinister has quit [Remote host closed the connection]
MisterMinister has joined #ffmpeg-devel
secondcreek1 has joined #ffmpeg-devel
kepstin has quit [Ping timeout: 252 seconds]
kepstin has joined #ffmpeg-devel
secondcreek has quit [Ping timeout: 276 seconds]
secondcreek1 is now known as secondcreek
bcheng has quit [Ping timeout: 252 seconds]
bcheng has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
Kei_N_ has quit [Remote host closed the connection]
Kei_N has joined #ffmpeg-devel
Kei_N has quit [Ping timeout: 248 seconds]
Kei_N has joined #ffmpeg-devel
System_Error has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
cone-078 has joined #ffmpeg-devel
<cone-078> ffmpeg Manuel Lauss master:9369ebf23806: avcodec/sanm: recognize common FOBJ sizes
<cone-078> ffmpeg Manuel Lauss master:7f0b7b049696: avcodec/sanm: ignore codec48 compression type 6
<cone-078> ffmpeg Manuel Lauss master:37064b2d1661: avcodec/sanm: support "StarWars - Making Magic" video
<fflogger> [newticket] Size4658: Ticket #11583 ([undetermined] ffmpeg with certain files will output getting a 1kB output file) created https://trac.ffmpeg.org/ticket/11583
cone-078 has quit [Quit: transmission timeout]
j45 has quit [Ping timeout: 276 seconds]
j45 has joined #ffmpeg-devel
j45 has quit [Changing host]
j45 has joined #ffmpeg-devel
aaabbb has quit [Changing host]
aaabbb has joined #ffmpeg-devel
\\Mr_C\\ has joined #ffmpeg-devel
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #ffmpeg-devel
<haasn> if we were to implement temporal dithering in swscale, should the dither offset index be based on the frame PTS or the output frame count? (or configurable as an expression?)
<haasn> basing it on the frame PTS would lead to a reproducible result when seeking
<haasn> (most likely in terms of the swscale API it would just be an arbitrary 8-bit integer, so this is more of a question of how vf_scale should set this integer)
<nevcairiel> is there a real difference ultimately?
<haasn> probably not, I guess you wouldn't enable temporal dither when you need reproducible results
<haasn> then maybe it should not even be a configurable expression and libswscale should just directly use the frame PTS internally, which saves us from even having to add API for it
\\Mr_C\\ has quit [Remote host closed the connection]
<Lynne> haasn: I'd prefer basing it on the PTS, since it's technically more correct, and you would get bitexact results between chunks extracted from streams
<haasn> input frame or output frame?
<haasn> probably input frame, since you're more likely to cut off the input than the output stream
<haasn> for vf_scale it doesn't matter since output pts = input pts
<Lynne> yeah, input frame
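(For illustration, a minimal sketch in C of how vf_scale might derive such an 8-bit dither index from the input frame PTS; the helper name and fallback behavior are hypothetical, not actual swscale API.)

    #include <stdint.h>

    /* Hypothetical helper: map the input frame's PTS to an 8-bit temporal
     * dither offset. Keying off the PTS rather than a frame counter keeps
     * the result reproducible across seeks and extracted stream chunks. */
    static uint8_t dither_index_from_pts(int64_t pts)
    {
        if (pts == INT64_MIN)   /* i.e. AV_NOPTS_VALUE */
            return 0;           /* fall back to a fixed pattern */
        return (uint8_t)(pts & 0xff);
    }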
<BtbN> haasn: would you say what you're doing with swscale could also be used to make it work on CUDA frames, by generating PTX code or something? Not sure how it all works yet.
<haasn> BtbN: yes
<haasn> that's a stated design goal
<haasn> well, not 100% sure about CUDA internals and how easy it is to autogenerate code at runtime there
<haasn> but on vulkan you can pretty straightforwardly generate SPIR-V kernels
<haasn> and I'm sure you can probably ingest SPIR-V kernels on CUDA if you try hard enough
<BtbN> Auto-generating actual C++ CUDA code is in theory possible, but needs a closed-source compiler library (or interfacing with libllvm, I guess?)
<BtbN> But the pure driver API absolutely can consume PTX "assembly" at runtime and build it
<haasn> at worst you can probably stitch together individual kernels for each op
<haasn> like runtime linking of precompiled modules / kernels
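(A minimal sketch of what BtbN describes, using the CUDA driver API: cuModuleLoadData JIT-compiles a PTX string for the installed GPU at runtime; error handling is collapsed for brevity.)

    #include <cuda.h>   /* CUDA driver API */

    /* Load a runtime-generated PTX string and fetch one kernel from it.
     * The driver JIT-compiles the PTX for whatever GPU is present. */
    static CUfunction load_ptx_kernel(const char *ptx, const char *name)
    {
        CUmodule   mod;
        CUfunction fn;

        if (cuModuleLoadData(&mod, ptx) != CUDA_SUCCESS)
            return NULL;
        if (cuModuleGetFunction(&fn, mod, name) != CUDA_SUCCESS)
            return NULL;
        return fn;
    }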
<Lynne> I'd prefer to interface with LLVM and generate LLVM IR directly
<BtbN> rcombs has built something like this for Plex before, so it's 100% possible
<Lynne> that can then be used to compile into ptx and spirv
<BtbN> i.e. scale_cuda in Plex-Ffmpeg works very differently from ours
<BtbN> The problem with interfacing with LLVM is that it's a royal mess
<Lynne> it's just a C++ API?
<BtbN> I'm dealing with this at work via numba/llvmlite
<BtbN> and the llvm folks break ABI and API every release
<BtbN> And not just a little bit
<Lynne> ah, opaque pointers?
<BtbN> not sure, since I only see that the numba folks need over a year to implement support for each new llvm major version
<BtbN> But they almost have to start over from scratch each time
<BtbN> In theory it's a great idea, but very maintenance-intensive
<BtbN> I thought about that for scale_cuda before, but discarded it as massive overkill
<haasn> damn, SSIM really hates blue noise
<Lynne> I hate glsl with the force of 10000 suns, and my plan has been to convert all glsl code to opencl, which can be partially compiled and linked at runtime with llvm
<haasn> since it's higher frequency than ordered dither
<haasn> and I guess a bit less mathematically precise?
<haasn> my eyes prefer blue noise though
<BtbN> it might unironically be easier to invoke clang via the CLI than to interface with libllvm. The CLI interface is more stable.
<haasn> Lynne: I still have high hopes for directly generating SPIR-V in swscale2
<Lynne> the thing is that this involves going through spir-v assemblers if you do it textually
<Lynne> ...which is khronos code
<haasn> maybe we will have to steal put_bits.h from avcodec
<JEEB> BtbN: yea that sounds quite possible
<haasn> no, I mean obviously as binary
<JEEB> or libclang I guess?
<BtbN> Does that exist?
<Lynne> yeah
<Lynne> that's what I meant by LLVM
<BtbN> How is it different from libllvm?
<Lynne> it runs a frontend
<BtbN> ah, fair
<BtbN> maybe it's so bad for the numba folks since they are effectively building a frontend for Python there, out of tree from llvm
<JEEB> yea so you're not directly interfacing with LLVM, but instead I think the interface is on the C / clang layer.
<JEEB> at least for editor stuff it seems stable enough, since I've used it both for sublime text as well as vscode
<BtbN> If sws can then generate something llvm can compile, it should be trivially possible to make it output nvptx
<BtbN> well, maybe not trivially
<Lynne> I'll write the vulkan backend with that in mind
<BtbN> if it can output spir-v via llvm, it can output nvptx
<BtbN> All that's needed then is some scaffolding that generated ptx code can be plugged into
minimal has joined #ffmpeg-devel
microlappy has joined #ffmpeg-devel
kasper93_ has joined #ffmpeg-devel
kasper93 is now known as Guest8861
Guest8861 has quit [Killed (osmium.libera.chat (Nickname regained by services))]
microlappy has quit [Quit: Konversation terminated!]
mkver has joined #ffmpeg-devel
<IndecisiveTurtle> There are lightweight libraries for emitting spirv directly in binary, but for c++
<kierank> •haasn> if we were to implement temporal dithering in swscale, should the dither offset index be based on the frame PTS or the output frame count? (or configurable as an expression?)
<kierank> surely this is more of an avfilter thing
<kierank> sws doesn't care about pts right now
<haasn> s/swscale/vf_scale/
<Lynne> IndecisiveTurtle: grats to you too
<Lynne> do you know when you might have time to fix the issues from the last rounds of review for the vc2 encoder?
<IndecisiveTurtle> Thx xD, I've been chipping away at those the past week
<IndecisiveTurtle> I finished most of the cosmetic ones, fixed the leak and implemented the LUT optimizations in the shader. Now I'm looking into interlacing support
<IndecisiveTurtle> I believe I also found a bug in the cpu encoder; for some interlaced videos it crashes with "Assertion n <= s->buf_end - s->buf_ptr failed at libavcodec/put_bits.h:390"
<Lynne> I think you can skip interlacing for now
<IndecisiveTurtle> I felt kinda bad gutting all that code for something seemingly simple to add, but okay
<IndecisiveTurtle> Other than that there is the const-casting comment, which I can't really solve without a separate patch, as the ff_vk functions require a mutable pointer
<Lynne> which functions?
jamrial has joined #ffmpeg-devel
<IndecisiveTurtle> ff_vk_exec_add_dep_frame, ff_vk_create_imageviews, ff_vk_frame_barrier etc., all of them receive a non-const AVFrame*
<Lynne> just cast it
<Lynne> we do that in the ffv1enc_vulkan.c code too
<Lynne> it's unavoidable
jamrial has quit [Ping timeout: 252 seconds]
<IndecisiveTurtle> Andres suggested rewriting the vulkan code to avoid it, so I thought of going over them and seeing which ones can be made const-correct, and simply casting in those that need a mutable pointer. For a lot of them it seems possible at first look
<IndecisiveTurtle> But I'll leave it to you whether that should be done in the end or not
<IndecisiveTurtle> (If I need to do it, I mean)
<Lynne> mkver: accessing vulkan frames requires locking a mutex and incrementing a semaphore, so it's not really possible to make the avframe truly const
<Lynne> we could make the frame const but access the payload buffer as non-const; still, I think it's fine leaving it as-is
jamrial has joined #ffmpeg-devel
jamrial has quit [Ping timeout: 244 seconds]
minimal has quit [Quit: Leaving]
rvalue has quit [Read error: Connection reset by peer]
rvalue has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<fflogger> [newticket] jr_clifton: Ticket #11584 ([ffprobe] ffprobe returns "n/a" for bitrate in opus audio file) created https://trac.ffmpeg.org/ticket/11584
<BtbN> What state is the sws work actually in? Like, could I start helping test/develop it for CUDA, or is that too early?
zsoltiv has quit [Ping timeout: 260 seconds]
zsoltiv_ has quit [Ping timeout: 248 seconds]
Traneptora has quit [Quit: Quit]
<BtbN> What I'm slightly afraid of is llvm becoming a dependency, since it's HUGE and would quadruple the size of the static builds
<fflogger> [newticket] paulpacifico: Ticket #11585 ([ffmpeg] Error converting 10bit to 8bit and vice-versa (Operation not supported)) created https://trac.ffmpeg.org/ticket/11585
<rcombs> BtbN: is there any reason the codegen would need to happen at runtime instead of compile-time? the clang integration in the main ffmpeg build system is perfectly fine
<BtbN> It's how I understand new-swscale is working
<BtbN> it generates code for the desired conversion, builds it for the desired target (local CPU, SPIR-V, NVPTX, ...), and runs it
<BtbN> With the code apparently being LLVM IR?
<BtbN> Pre-generating all possible conversions would be a GIGANTIC ball of code
<BtbN> scale_cuda is also to a degree suffering from that, with its relatively small subset of conversions
<rcombs> so the problem there is combinatorial explosion, right?
<rcombs> there's a pretty straightforward trick I used for that in CUDA tonemapping, which reduces the problem massively
<BtbN> I think the goal for swscale is maximum performance
<BtbN> so it can build a conversion kernel optimized for the local CPU
<rcombs> in the CUDA code, I declare some `extern const __constant__` variables, and branch on them
<BtbN> and as a side effect, it can also build those kernels for GPUs
<rcombs> and then I generate a PTX assembly string defining those variables' values
<Lynne> BtbN: llvm is a hard dep of mesa, so I'm not too worried about it on linux
<BtbN> If the OS provides it, sure
<rcombs> the compiler then sees that the values will be constant at runtime and inlines them
<BtbN> but for static builds, I think the binary size would crack 1GB, up from 100MB or so
<rcombs> so you get the same performance as if doing the branching in C++ with templates (or with generated code)
<rcombs> but without needing to generate any actual CUDA source (or LLVM IR or whatever)
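(A sketch of the runtime half of that trick, assuming the kernels were precompiled to PTX with an extern __constant__ selector; the variable name tonemap_algo and the PTX header are hypothetical. cuLinkCreate/cuLinkAddData/cuLinkComplete are the driver-API linker calls; the generated snippet defines the constant, and the driver's JIT folds the now-known value into the code.)

    #include <stdio.h>
    #include <string.h>
    #include <cuda.h>

    /* Link precompiled kernel PTX against a generated PTX snippet that
     * pins down a branch-selector constant declared extern by the kernel. */
    static CUmodule link_with_constant(const char *kernel_ptx, int algo)
    {
        char        consts[256];
        CUlinkState link;
        void       *cubin;
        size_t      cubin_size;
        CUmodule    mod = NULL;

        snprintf(consts, sizeof(consts),
                 ".version 7.0\n.target sm_52\n.address_size 64\n"
                 ".visible .const .align 4 .u32 tonemap_algo = %d;\n", algo);

        cuLinkCreate(0, NULL, NULL, &link);
        cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)kernel_ptx,
                      strlen(kernel_ptx) + 1, "kernel", 0, NULL, NULL);
        cuLinkAddData(link, CU_JIT_INPUT_PTX, consts,
                      strlen(consts) + 1, "consts", 0, NULL, NULL);
        if (cuLinkComplete(link, &cubin, &cubin_size) == CUDA_SUCCESS)
            cuModuleLoadData(&mod, cubin);   /* module copies the image */
        cuLinkDestroy(link);
        return mod;
    }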
<BtbN> Yeah, for a pure CUDA implementation, something like that is fine
<BtbN> but I think the goals for new swscale are bigger
<rcombs> similar things are straightforwardly possible for vulkan and such, afaik?
<rcombs> for the local CPU, uhhhhhhhhh
<rcombs> you have a number of problems there, including "the system won't always allow you to JIT anyway"
<BtbN> you can jit all you like
<BtbN> every emulator under the sun would be defunct if you couldn't
<Lynne> it's GPU code, so you can JIT
<BtbN> It's about the CPU "kernels" swscale would generate
<Lynne> it's not generating anything now
<BtbN> Well, but it's the goal, isn't it?
<Lynne> no, x86 is regular SIMD
<BtbN> ah
<rcombs> JIT on macOS and iOS requires an entitlement; you can opt into it on macOS (it's a bit annoying but it's not too bad) but on iOS it requires special permission
<BtbN> Apple can just get lost then
<rcombs> look, I don't like it either, but it's a major target platform
<BtbN> They get dumb C implementations then, nothing else to be done
<rcombs> and the macOS approach here is pretty reasonable (needing to explicitly opt in to enable the JIT API)
<rcombs> I mean, there are entirely reasonable assembly implementations today
<Lynne> ramiro went on to experiment with runtime JIT for aarch64 SIMD, but that's just a side project; I wouldn't be inclined to accept it
<BtbN> On both Linux and Windows you just need to explicitly allocate executable memory
<haasn> BtbN: bit late to the party; let me clarify some things
<haasn> 1) now would be as good a time as any to attempt a gpu backend, yes; especially as we will probably need to change at least some parts of the API for it
<rcombs> the entitlement means that it's harder to exploit a compromised process, since just calling mmap/mprotect with the relevant flags would fail unless the executable itself was flagged as "yes I actually use that feature"
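(Concretely, the macOS opt-in looks roughly like this: the binary is signed with the com.apple.security.cs.allow-jit entitlement, memory is mapped with MAP_JIT, and write access is toggled per thread. A sketch; real code would also flush the instruction cache, e.g. via sys_icache_invalidate.)

    #include <sys/mman.h>
    #include <pthread.h>   /* pthread_jit_write_protect_np() on macOS */
    #include <string.h>

    /* Copy generated machine code into an executable JIT region. Without
     * the allow-jit entitlement, the MAP_JIT mmap below fails outright. */
    static void *emit_code(const void *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANON | MAP_JIT, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        pthread_jit_write_protect_np(0);  /* this thread: writable view */
        memcpy(buf, code, len);
        pthread_jit_write_protect_np(1);  /* back to executable-only */
        return buf;
    }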
<haasn> best get that out of the way
<haasn> but I already committed and wrote a full x86 SIMD backend so I don't expect major changes at this point
<BtbN> I'm curious though if that SIMD would be beaten by llvm-compiled native code for the current platform
<haasn> it's not really comparable to LLVM IR and we don't have any LLVM or JIT code in my branch currently
<haasn> an LLVM backend would be a fun project but not something I'm relying on
mkver has joined #ffmpeg-devel
<BtbN> I think manually generating NVPTX from that is more realistic
<haasn> (the ideal case here would be if we could use the same backend to also generate CUDA and SPIR-V code)
<BtbN> It's not that hard of an assembly language
<haasn> I think manually generated SPIR-V binary is also more realistic than linking against LLVM
<rcombs> can confirm nvptx is pretty easy
<haasn> FFmpeg loves to NIH C format generators, and SPIR-V looks like a particularly friendly binary format
<rcombs> helps that it targets an abstract machine with, like, hundreds of registers
<rcombs> and that branching on constants is literally free
<haasn> 3) we don't need to JIT for SIMD backends, we could also just stitch together kernels "by hand" - they have a pretty regular / obvious shape; just look for the JMP instruction and instead put the next kernel there
<haasn> though I'm not convinced it will give any performance benefit on modern CPUs with working branch predictors
<BtbN> Lynne: something you might be interested in is new stuff in CUDA 12.9 regarding Vulkan interop. Apparently one can now create a CUDA context from a Vulkan one, so they are effectively the same context, making stuff more efficient.
<haasn> since the jumps are all non-dependent and point to the same destination 100% of the time
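(A sketch of that stitching, under two assumptions that are hypothetical here: each precompiled kernel ends in a 5-byte x86 jmp rel32, and its length is recorded at build time. The bodies are copied back to back so each kernel falls through into the next, and the chain ends in a ret.)

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const uint8_t *code;  /* kernel body ending in a 5-byte jmp rel32 */
        size_t         len;   /* length including that trailing jmp */
    } Kernel;

    /* Stitch kernels into one flat routine inside an executable buffer:
     * drop each trailing jmp so execution falls through, append a ret. */
    static size_t stitch(uint8_t *exec_buf, const Kernel *k, int n)
    {
        size_t off = 0;
        for (int i = 0; i < n; i++) {
            memcpy(exec_buf + off, k[i].code, k[i].len - 5);
            off += k[i].len - 5;
        }
        exec_buf[off++] = 0xc3;  /* x86 ret */
        return off;
    }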
<Lynne> I'd still rather link against LLVM, since the compiler can optimize the code further and we'd get both spirv and ptx out of it, and I'm intending to switch the existing vulkan code anyway
<Lynne> BtbN: but what about the underlying objects?
<haasn> 4) even if you can't gen the binary at runtime, you don't necessarily need to pre-generate kernels for every possible format conversion at compile time, if you can generate a kernel for every possible operation (a few dozen) and stitch/link them together
<BtbN> Lynne: what do you mean?
<Lynne> can you export buffers and images between APIs freely?
<haasn> rcombs: you don't even need to branch at all, everything is SSA, every operation is strictly linear with only one input and one output
<rcombs> in theory a jump has a slight cost even if it's perfectly predicted, but in practice unless your EU utilization is *crazy* good, it's fine
<BtbN> To a degree, yeah
<rcombs> haasn: I mean using branching to "stitch together" different operations
<haasn> ah, sure
<haasn> we can also quite easily provide a guarantee about the maximum number of ops
<rcombs> like, in tonemap_cuda there are a bunch of different tonemap algorithms, and they all get inlined into a single function with branching on a constant, which is free at runtime
<haasn> so you could unroll every kernel into a fixed-size loop of switch/case statements and plug in the constants later
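(In C, that non-JIT variant could look like the sketch below: a fixed-maximum loop over an op list whose switch is perfectly predictable, since the list is constant for a given scaling context. All names are illustrative, not real swscale code.)

    enum { OP_END, OP_SHIFT, OP_MUL, OP_ADD };  /* a real set has dozens */

    typedef struct { int tag, arg; } Op;

    #define MAX_OPS 16

    /* Run one value through a fixed-size chain of ops; each branch is
     * taken the same way every iteration, so prediction is near-perfect. */
    static int run_ops(int px, const Op ops[MAX_OPS])
    {
        for (int i = 0; i < MAX_OPS; i++) {
            switch (ops[i].tag) {
            case OP_END:   return px;
            case OP_SHIFT: px >>= ops[i].arg; break;
            case OP_MUL:   px  *= ops[i].arg; break;
            case OP_ADD:   px  += ops[i].arg; break;
            }
        }
        return px;
    }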
<rcombs> Lynne: if we wanted, we could probably build out a small collection of C++ templates that would let us write shader files that target both NVPTX and SPIR-V in a single shader
<haasn> BtbN: we can make the C implementation "fast enough" if we want to, in my original design for swscale3 I had _only_ the C code
<haasn> just requires compiling the template with -mfpu=neon or w/e
<Lynne> rcombs: it's nothing that we couldn't do with C
<haasn> and it was still faster than the old x86 code (with -mavx2)
<haasn> obviously the new hand-written AVX2 is way faster
<rcombs> ¯\_(ツ)_/¯ the compiler's C++ anyway, no reason not to make use of templates
<Lynne> what compiler?
<rcombs> clang targeting NVPTX or SPIR-V
<haasn> BtbN: I still think we should write a NEON backend though that doesn't rely on JIT, but uses the same approach as the current x86 backend
<haasn> (If STF accepts my proposal I would get contracted to do that anyways)
<BtbN> We absolutely should have a NEON backend. What'd be the alternative for fast ARM code?
<Lynne> rcombs: if the API is C++, sure
<rcombs> I mean, for NVPTX the underlying API is assembly, but using templates was still very valuable
<rcombs> think of it like macros but better
<Lynne> though I foresee a libclang implementation accepting opencl code strings rather than C
<Lynne> but that's if they fix it in time
<Lynne> I'd like to get STF to fund a Vulkan backend, and right now it segfaults when I ask it to output vulkan-flavoured spirv out of opencl
<Lynne> so runtime SPIRV gen looks more likely
<Lynne> unlike other vulkan code we have, it's less likely to change, as it builds up everything from small primitives
cone-126 has quit [Quit: transmission timeout]
<haasn> ramiro: how do you handle dithering in your backend atm?
<haasn> I am thinking about how to handle sizes larger than 16x16, which requires access to the current x coordinate, something I don't currently track at all
<ramiro> haasn: can't you use exec.x and exec.y?
<haasn> I made exec immutable to the backend
<haasn> I guess the obvious answer is to change that
<ramiro> why would you make it immutable?
<linkmauve> > 20:08:24 Lynne> BtbN: llvm is a hard dep of mesa, so I'm not too worried about it on linux
<linkmauve> Only for radeonsi (and even there it can work with ACO nowadays) and llvmpipe, and the rusticl frontend. You can compile Mesa without it.
<ramiro> haasn: sorry, my internet went down. I saw your replies in some irclogs online
<ramiro> this way I can do any combination of input/output formats by always writing full vectors, even if it means working on chunks of 96 bytes (input or output).
<ramiro> 240 combinations use the shuffle solver. Some are much, much faster, most are faster, a few are the same speed, and a couple are a tiny little bit slower.
<ramiro> there is no over-reading or over-writing, no unused chunks of vector
<Lynne> linkmauve: I know, but for most purposes it's a hard dep, as most distros compile with lavapipe and such
cone-710 has joined #ffmpeg-devel
<cone-710> ffmpeg Michael Niedermayer master:9230c93cc9fd: avcodec/rv60dec: inter also fails with qp >= 32
<cone-710> ffmpeg Michael Niedermayer master:43926e026dd8: avcodec/mmvideo: fix palette index
<cone-710> ffmpeg Michael Niedermayer master:4e5523c98597: avcodec/hevc/ps: Fix dependant layer id check
<cone-710> ffmpeg Michael Niedermayer master:ce1fd73d637a: avformat/iff: Check nb_channels == 0 in MHDR
<averne> Lynne: Hey, thanks :) I've been away for a bit and haven't been able to work on stuff, but things are finally aligning for me to get some free time
<averne> By the way, do you know of any projects doing shader-based/assisted media decoding (aside from ffmpeg and nvidia's jpeg thing)? I'm supposed to study prior art during the first 3 weeks.
<Lynne> not really, the ffv1 code is representative enough
<averne> Yeah, I was also planning to look at the VC2 code from last year's gsoc
mkver has quit [Remote host closed the connection]
mkver has joined #ffmpeg-devel
<averne> Ideally I'd also like to reverse-engineer nvidia's jpeg shaders, to get acquainted with low-level shader stuff. I dumped them from the cuvid library a while ago, but that might be a lot of effort
<Lynne> most of the other projects I've seen (jpeg) implemented codecs with floats, which is fine if you're presenting on screen once, but we want the output from the software and compute decoders to match
<averne> nvidia's thing does NV12 decode iirc
<Lynne> not a great idea tbh
<Lynne> the issue is that unless you decode components simultaneously, you would have to load, insert the value into the vector, store
<Lynne> so for ffv1 decoding we use yuv420p instead
<averne> Oh, they do decode to yuv420p, then launch a separate shader to merge the chroma planes into the final output
<Lynne> btw when it's time to optimize I highly recommend using mesa+rgp
<Lynne> nvidia's tools do not support pure compute programs
<Lynne> only mesa can debug compute-only code, by dumping an rgp trace which can then be read
<averne> Thanks for the tip, I admit I have little experience in that area
<averne> Does Geo Ster hang around here? I see that he was accepted into the prores encoder project, might be worth exchanging early to see if we can commonize some stuff
<Lynne> yeah, IndecisiveTurtle
<Lynne> there's also pmozil from last year, who wrote a vc2 decoder, but he hasn't had time to work on it (due to being in a literal warzone)
<averne> Nice, and yeah, I've seen the VC2 stuff when browsing prior gsoc editions
user23 has joined #ffmpeg-devel
user23 has quit [Excess Flood]
user23 has joined #ffmpeg-devel
<IndecisiveTurtle> averne: Hi xd Yes, I did the vc2 encoder last year as well
<averne> Oh right, I knew I'd seen your name in the previous edition
<averne> Anyway, I was thinking there might be some code the decoder and encoder could share, e.g. the (i)DCT
<IndecisiveTurtle> Is it the exact same operation? Because on vc2 the encoder applies the wavelet transform while the decoder reverses it
<IndecisiveTurtle> Or at least that is what I understood, I haven't looked at the diracdec.c file much
<averne> I think? The matrices will change, but in essence it's the same operation. So you could upload different matrices as UBOs or use a #define
MetaNova has quit [Ping timeout: 276 seconds]
<Lynne> no, they're completely different
<Lynne> we don't use matrices for wavelets
<IndecisiveTurtle> Ideally I want to use subgroups again for this DCT, need to study the code and see if it's possible
<Lynne> prores uses regular 8x8 DCTs, which you can get away with doing as a matrix mult (but you shouldn't, full matrix mults are almost always slower than a traditional 2D DCT)
<averne> Yeah, I wasn't talking about wavelets or VC2, I don't know anything about it. But I thought the DCT was representable by matmuls?
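(For reference, the 2D DCT is indeed expressible as matrix products, which makes the cost comparison concrete; the operation counts below are standard textbook figures, not anything from this discussion.)

    % 8x8 2D DCT of a block X, with C the 8x8 DCT-II basis matrix
    % (c_k is the usual orthonormalization factor):
    \[
      Y = C \, X \, C^{\top}, \qquad
      C_{kn} = c_k \cos\!\left(\frac{(2n+1)k\pi}{16}\right)
    \]
    % As two dense matrix multiplies this costs 2 * 8^3 = 1024 multiplications,
    % while a factorized 1D DCT (e.g. AAN, 5 multiplies per 8-point transform)
    % applied to 8 rows and 8 columns needs only 16 * 5 = 80.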
<IndecisiveTurtle> Not sure about nvidia, but amd gpus translate matrix ops into a dozen fma instructions
<Lynne> I guess you could try to use the coop matrix extension, which would probably make it worth it
<averne> Yeah, that would be very nice to use
<Lynne> using it is very much a total pain
<IndecisiveTurtle> I'm not sure if my GTX 1650 even supports that
<averne> Heh, why? Is it because it runs on accelerators separate from the main shader cores?
<IndecisiveTurtle> The vk extension mentions SPV_NV_tensor_addressing being a dependency, so yeah, it probably needs tensor cores
<averne> IndecisiveTurtle: I have a 3050 so I should be able to do it. One of the issues I identified is precision, nvidia only supports up to 16-bit floats iirc
<IndecisiveTurtle> I think we need to use integers for the transforms to maintain bit accuracy
<Lynne> averne: no, the interface is an utter pain to use
<Marth64> Hello all. I am very sorry for disappearing due to moving, which was then followed by some IRL problems; I was underwater. I plan to be active again.
<IndecisiveTurtle> We probably need full 32-bit integers, but I'm not super sure
<Lynne> IndecisiveTurtle: never, they expect you to statically link the libraries
<Lynne> because the language changes every 5 minutes
<Lynne> you can get away with 8x8 -> 16-bit DCTs
<IndecisiveTurtle> I see
<Lynne> but really, whatever's fastest
<IndecisiveTurtle> On amd gpus a whole 8x8 block can fit inside a wave, hmm
<IndecisiveTurtle> I think it could work with 4 threads per row too, so it fills an nvidia subgroup. But it will be easiest to implement it first as one thread per row and then optimize
kasper93_ has joined #ffmpeg-devel
kasper93 has quit [Killed (tantalum.libera.chat (Nickname regained by services))]