xiphmont

Next-Next Generation Video: Introducing Daala

Xiph.Org has been working on Daala, a new video codec, for some time now, though Opus work had overshadowed it until just recently. With Opus finalized and much of the mop-up work well in hand, Daala development has taken center stage.

I've started work on 'demo' pages for Daala, just like I've done demos of other Xiph development projects. Daala aims to be rather different from other video codecs (and it's the first from-scratch design attempt in a while), so the first few demo pages are going to be mostly concerned with what's new and different in Daala.

I've finished the first 'demo' page (about Daala's lapped transforms), so if you're interested in video coding technology, go have a look!


> A DCT is a circular transform
Objection! DFT is circular, not DCT!

Ah, yes! DCT is symmetric, not circular! Braino there, correcting that immediately.
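The distinction is easy to see numerically: the DFT implicitly extends a block periodically (creating a jump at the block boundary), while the DCT extends it symmetrically (no jump). A quick numpy sketch, using a hand-rolled orthonormal DCT-II rather than any Daala code, shows how much better the DCT compacts a simple ramp:

```python
import numpy as np

n = 32
x = np.arange(n, dtype=float)  # a linear ramp

# The DFT treats x as one period of a periodic signal, so the wrap
# from x[-1] back to x[0] is a discontinuity: energy leaks into many bins.
X_dft = np.fft.fft(x)

# The DCT-II treats x as half of a symmetric extension: the ramp
# continues into a smooth triangle wave, so energy compacts.
# Naive O(n^2) orthonormal DCT-II matrix.
k = np.arange(n)
C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
C[0] *= np.sqrt(1.0 / n)
C[1:] *= np.sqrt(2.0 / n)
X_dct = C @ x

def frac_energy_in_first(coeffs, m):
    e = np.abs(coeffs) ** 2
    return e[:m].sum() / e.sum()

frac_dct = frac_energy_in_first(X_dct, 4)  # nearly all the energy
frac_dft = frac_energy_in_first(X_dft, 4)  # noticeably less
```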

Didn't VC-1 use lapped transforms and frequency-domain prediction, and turn out to be heavily patented?

Why are lapped transforms compared to a plain DCT without intra prediction (and maybe ADST), if this is supposed to be better than H.265?

I am disappointed with the amount of bullshit in this demo. It's worse than all the previous ones. I'll have to ascribe it to Mozillian corporate influence, I guess.

There are scattered patents surrounding every technique, but they're not nearly as thick as what surrounds more traditional (block-DCT-based) designs.

I don't compare with prediction, block filtering, adaptive quant, motion compensation, or anything else because I'm comparing _only_ the transforms. The demo page is an illustration of what a lapped transform is, and why you might want to use it. It's not a comparison of a finished Daala coder vs any other coder.

>I am disappointed with amount of bullshit in this demo

Well, now I'm sorry I bothered replying :-(


Lapped comparison

(Anonymous)

2013-06-22 08:10 pm (UTC)

"Comparison of original image to images transformed and quantized using the DCT and Daala's lapped transform. Both transforms used the same scaling and coarse flat quantizer to simulate low-bitrate encoding."
This is a bit misleading without knowing the kind of magnitudes the transform produces - in other words, I'd rather see the results at the same bitrate and not with the same quantization. This of course involves entropy encoding and probably prediction, but it still would be nice to know the general ballpark.

Re: Lapped comparison

xiphmont

2013-06-24 01:27 pm (UTC)

>This is a bit misleading without knowing the kind of magnitudes the transform produces

The DCT-only and lapped transforms had the same scales. Both were producing coefficients in the range of roughly +/-1500 quantized to a range of +/-48. The summed log-energy after quantization was the same for both. They were on entirely equal footing (otherwise it would have been misleading, just as you worried). And again-- I was illustrating only the relative performance of the naked transforms.
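For readers wondering what "same scaling and coarse flat quantizer" means concretely: a flat quantizer divides every coefficient by one fixed step size and rounds. A toy sketch (numpy, not Daala's actual code) using the roughly +/-1500 to +/-48 ranges quoted in this reply:

```python
import numpy as np

def flat_quantize(coeffs, step):
    """Uniform ('flat') quantization: every coefficient uses the same
    step size, simulating coarse low-bitrate coding."""
    return np.round(coeffs / step).astype(int)

def dequantize(q, step):
    return q * step

# Coefficients roughly in +/-1500, as in the discussion; a step of
# 1500/48 (~31.25) maps them into the +/-48 integer range mentioned.
rng = np.random.default_rng(0)
coeffs = rng.uniform(-1500, 1500, size=64)
step = 1500 / 48

q = flat_quantize(coeffs, step)
recon = dequantize(q, step)
# Quantization error is bounded by half a step per coefficient.
```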

> I'd rather see the results at the same bitrate and not with the same quantization.

That's the entire problem-- then you're not measuring the transform, you're measuring the entire encoder. If you simply drop a lapped transform into an existing format (like H.264 or VP8) it performs rather poorly-- those formats are not designed to exploit it. Similarly, if we take the lapped transform out of Daala and replace it with a plain DCT, the performance takes a similar hit. And neither case would illustrate the differences in the transform itself.

There will be complete end-to-end codec comparisons in due course, but Daala is not complete enough to do that yet. For now, one variable at a time.


Re: Lapped comparison

(Anonymous)

2013-06-24 07:13 pm (UTC)

Thanks for the clarification. It'd have been good to mention the important bit ("the summed log-energy after quantization was the same for both") in the article.

That said, that particular example uses very high quantization, and from what I've heard the problem with lapping is that while it helps at low rates it hurts at higher ones. Some formats that use it turn it off for low quantizers (WMV9 certainly does this, and I think JPEG XR does too). Apparently the issue is that they spread detail across more coefficients, so the end result is more expensive to code - hence my original question.

Re: Lapped comparison

xiphmont

2013-06-25 02:34 am (UTC)

>It'd have been good to mention the important bit ("the summed log-energy after quantization was the same for both") in the article.

Perhaps you're right. I was trying to err on the side of not going down too many technical ratholes, but I might have kept it too simple there. What I was worried about: the figure is dominated by the DC components, and so that quickly devolves into other questions about 'why no DC prediction?' or 'why didn't you leave the DC out of the figure, since it's the AC behavior that's interesting?' and so on.

>That said, that particular example uses very high quantization, and from what I've heard the problem with lapping is that while it helps at low rates it hurts at higher ones.

We've gotten far enough to see that it helps with both, at least in Daala. You can find a number of researchers on the net who say 'lapped transforms don't really help' but not much documentation of what they tried that didn't work... Tim mentions a bit of this in his slide deck.

>Apparently the issue is that they spread detail across more coefficients, so the end result is more expensive to code - hence my original question.

I would think the problem is not that they spread detail across more coefficients in a block (this is the problem with wavelets), but that they duplicate edge detail into multiple blocks because blocks overlap. Perhaps we're winning because our predictors are able to account for that.

Another potential drawback is that lapping is going to spread ringing farther; an 8x8 lapped transform operating with 16x16 support is going to have ringing somewhere between an 8x8 and a 16x16 DCT. That said, a 4x4 or 8x8 lapped transform is going to have as much coding gain as an 8x8 or 16x16 unlapped DCT, so we still theoretically come out ahead. Of course, you have to use a transform coefficient set that actually delivers that much coding gain, and we've spent quite a lot of time looking for them.

Re: Lapped comparison

(Anonymous)

2013-06-25 06:40 pm (UTC)

> Of course, you have to use a transform coefficient set that actually delivers that much coding gain, and we've spent quite a lot of time looking for them.

I don't understand this. Isn't the goal to find the coefficients that give the best perceptual quality? Otherwise it's tuning for PSNR again. I hope it was done on real images at least.

(Yes, I know that you're not finished yet. Still, I don't understand why you expect that tuning isolated pieces against perceptually dubious metrics will yield a good result.)

Re: Lapped comparison

xiphmont

2013-06-25 07:31 pm (UTC)

>I don't understand this. Isn't the goal to find coefficients which give best perceptual quality?

In general for a transform, the higher the coding gain and the more coherent the coefficient grouping, the better the end quality will be. It's a direct [if simplistic] measure of how many bits you save when you code the image to a given precision using that transform.
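Coding gain has a standard closed form for an orthonormal transform over a correlated source: the ratio of the arithmetic mean to the geometric mean of the transform-coefficient variances. A sketch (numpy, using the textbook AR(1) image model with correlation 0.95 as an assumption; nothing here is Daala-specific):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, rows indexed by frequency."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

def coding_gain_db(T, rho=0.95):
    """Coding gain of an orthonormal transform T over an AR(1) source
    with correlation rho: arithmetic over geometric mean of the
    coefficient variances, in dB."""
    n = T.shape[0]
    # Autocorrelation matrix of the AR(1) source: R[i,j] = rho^|i-j|.
    R = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    var = np.diag(T @ R @ T.T)
    return 10 * np.log10(var.mean() / np.exp(np.log(var).mean()))

g4 = coding_gain_db(dct_matrix(4))
g8 = coding_gain_db(dct_matrix(8))
g16 = coding_gain_db(dct_matrix(16))
# Larger DCTs have higher coding gain; a lapped transform aims to reach
# the next size up's gain while keeping the smaller block size.
```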

That said, the transform does not exist in isolation. We want to have a transform that both doesn't inherently 'look bad' when it loses precision, and interacts well with other parts of the system design. You'd be right to point out that, eg, wavelets have high-ish coding gain but don't perform well on either count.

There are _many_ possible lapped transforms. Restricting things to just our structure and six-bit dyadic coefficients with a denominator of 64, that's IIRC 442 possibilities for 4x4, tens of millions for 8x8 and TNTC for 16x16. Some of those transforms have high coding gain and look bad. Some look good [in isolation] but have poor coding gain. Some have high coding gain _and_ look like what we want. We're looking for those.

> Otherwise it's tuning for PSNR again.

heh. PSNR has some valid uses when comparing a codec _to itself_. You can't use it to compare different codecs or different techniques directly.
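For reference, PSNR itself is trivial to compute; the caveat is only about which comparisons it is valid for. A minimal numpy version:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images.
    Only meaningful for comparing a codec/technique against itself."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak ** 2 / mse)
```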

>I hope it was done on real images at least.

Yes, we're hoping to establish these as standard testing sets:
https://people.xiph.org/~tterribe/daala/subset1-y4m.tar.gz
https://people.xiph.org/~tterribe/daala/subset3-y4m.tar.gz
PNG versions of the same images:
https://people.xiph.org/~tterribe/daala/subset1-8bit.tar.gz
https://people.xiph.org/~tterribe/daala/subset3-8bit.tar.gz


Nice!

(Anonymous)

2013-06-22 09:17 pm (UTC)

Very interesting post, as always! And thanks to all the Xiph team!

Just a complaint... it seems that Theora and Vorbis are no longer actively developed. It would be nice to put out one last Theora release with all the improvements in SVN (IIRC Firefox already ships Theora from SVN) and finally merge aoTuV into Xiph Vorbis. Linux distros usually ship the official Xiph releases, which miss all those improvements.

Thanks!

I agree actually. I would count myself the king of infinite ambition, were it not that I needed sleep.

I don't understand why video codecs don't treat time as a third dimension and just do a DCT on 8x8x8 blocks. That seems like it would be much more effective than hacky motion compensation, and compressing frame by frame seems like it leaves too much on the table.
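For concreteness, a 3-D separable DCT over an 8x8x8 space-time block is easy to write down (a numpy sketch of the commenter's idea, not anything Daala or Tarkin actually ships); the objection in the reply is about the artifacts it produces, not the math:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, rows indexed by frequency."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

C = dct_matrix(8)
rng = np.random.default_rng(0)
block = rng.standard_normal((8, 8, 8))   # 8x8 pixels x 8 frames

# Separable 3-D DCT: apply the 1-D transform along each axis.
coeffs = np.einsum('ia,jb,kc,abc->ijk', C, C, C, block)
# The transform is orthonormal, so the inverse uses the transpose.
recon = np.einsum('ia,jb,kc,ijk->abc', C, C, C, coeffs)
```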


That's what the old Tarkin codec tried to do. As it turns out, temporal artifacts ('pre-echo' in images, where motion begins... before the motion begins) are rather disturbing as artifacts go.

Hm, and isn't it possible to do a "lapped-transform" even in the time-domain?

Sure, but that doesn't eliminate time-spread artifacts, it just reduces them. The time vector in video is simply not smooth enough to handle entirely through a traditional transform.


Is there any code or example so far? I would like to see that on my own eyes.

You mean for Tarkin? The old Tarkin code is still in Xiph.Org SVN. Look for both 'Tarkin' and 'W3D' (they were competing ideas).

Or do you mean for Daala? It's in Xiph Git: http://git.xiph.org/?p=daala.git;a=summary

Lossless helps with automated testing?

(Anonymous)

2013-06-25 10:19 am (UTC)

Does having a lossless mode help with automated QA of codec implementations since you can feed it randomly generated data and know exactly what to expect at the other end? Or is this already covered by generating some kind of checksum similar to what Opus does?

Re: Lossless helps with automated testing?

xiphmont

2013-06-25 07:33 pm (UTC)

>Does having a lossless mode help with automated QA of codec implementations since you can feed it randomly generated data and know exactly what to expect at the other end?

Short answer is yes, though the testing isn't usually randomized unless it's fuzzing.
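Such a round-trip harness is simple to sketch. Here zlib stands in as a hypothetical lossless codec under test (Daala's actual API is not shown); a real harness would substitute the encoder and decoder being QA'd:

```python
import hashlib
import os
import zlib

def roundtrip_ok(encode, decode, frame):
    """Generic lossless-mode QA check: encode, decode, and compare a
    digest of the output against the input frame."""
    out = decode(encode(frame))
    return hashlib.sha256(out).digest() == hashlib.sha256(frame).digest()

# Randomized inputs, as in fuzzing; a lossless mode must reproduce
# every frame bit-exactly.
for _ in range(100):
    frame = os.urandom(1024)
    assert roundtrip_ok(zlib.compress, zlib.decompress, frame)
```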

Greetings and Thanks

(Anonymous)

2013-07-04 01:32 am (UTC)

Greetings!

Many thanks to xiph.org team for creating these patent-free codecs. You're on the right track aiming for the best quality. This is what Free/Open Source is all about, the next best (or next-next best in this case) thing by the community. FOSS must set the quality standard, not software patent trolls.

Cheers! d^_^b (2 thumbs up)

From an English non-speaker

(Anonymous)

2013-08-03 06:30 pm (UTC)

My English dictionary doesn't have the word "matricies". It lists two plural forms for "matrix": "matrices" and "matrixes".

Re: From an English non-speaker

xiphmont

2013-10-27 02:00 am (UTC)

:-P Fair enough, fixed.
