Mailing List Archive: Maemo: Developers
N800 & Video playback



siarhei.siamashka at gmail

Mar 18, 2007, 10:57 AM


N800 & Video playback

Hello All,

I did some tests with the framebuffer when trying to find a way to reduce
tearing effect in MPlayer. Here are the results.

I made a mistake when I assumed that screen updates are synchronous for video
planes. They are actually asynchronous, just as on the Nokia 770, but a lot
slower, so it is not possible to keep up with the framerate in real time. So
we block when trying to display the next frame while the previous screen
update is still in progress.

If we look at the framebuffer API, there are two ioctls that are important for
screen updates and tearing synchronization, if I understand them correctly now:

* ioctl(fd, OMAPFB_UPDATE_WINDOW, &update);

This ioctl initiates an asynchronous screen update from 'framebuffer' memory
(actually that's just system memory) to the graphics chip. If a previous
screen update request is still incomplete, this call blocks and waits until it
is done. The structure 'update' can have the OMAPFB_FORMAT_FLAG_TEARSYNC
flag set, which instructs the framebuffer driver to wait internally (the ioctl
call itself does not block) and start the data transfer at the start of the
next internal LCD screen refresh (apparently the refresh rate is ~60Hz, but
the numbers are below).

* ioctl(fd, OMAPFB_SYNC_GFX);

This ioctl call ensures that the current screen update is done (it blocks
until the data transfer is complete).
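
The two calls above can be paired to time one complete update. Here is a
minimal sketch of such a timing loop. The callback type update_fn and the
helper names are made up for illustration; on the device the callback would
issue OMAPFB_UPDATE_WINDOW followed by OMAPFB_SYNC_GFX, but it is kept abstract
here so the sketch builds without the OMAP framebuffer header.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: on the device it would issue
 * ioctl(fd, OMAPFB_UPDATE_WINDOW, &update) followed by
 * ioctl(fd, OMAPFB_SYNC_GFX) and return 0 on success. */
typedef int (*update_fn)(void *ctx);

/* Average wall-clock time of one update in milliseconds,
 * or -1.0 if the callback reports a failure. */
double average_update_ms(update_fn update, void *ctx, int runs)
{
    struct timespec t0, t1;

    assert(update != NULL && runs > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < runs; i++) {
        if (update(ctx) != 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_ms = (double)(t1.tv_sec - t0.tv_sec) * 1000.0
                      + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
    return elapsed_ms / runs;
}

/* Stand-in for the real update: only counts how often it was called. */
int count_update(void *ctx)
{
    ++*(int *)ctx;
    return 0;
}
```

Wall-clock time is used on purpose: the process sleeps while the ioctl
blocks, so CPU time would not show the wait.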

Performing both these ioctls consecutively, we can benchmark screen update
performance. The results are the following.

On N800 (OS2007 2.2006.51-6), every YUV screen update (OMAPFB_COLOR_YUY422)
takes about 41ms without tearsync enabled and 41-58ms with tearsync. It does
not matter what video resolution we try to watch, the result is the same. So
the maximum screen update rate is ~24fps without tearsync and ~20fps on
average with tearsync. That's not enough to watch 30 fps video, and achieving
24 fps is theoretically possible, but very tricky and unrealistic. With
tearsync enabled, watching full framerate videos is currently impossible.
Analyzing the difference in screen update times with vsync, it looks like a
full cycle of LCD internal refresh takes ~17ms (that's ~60Hz, but as the
precision is not good, it may be something else, 50Hz for example).
Nevertheless, a screen update on N800 can't be completed within these 17ms,
so tearsync does not work perfectly (most likely it manages to fill the screen
only down to the horizontal line observed near the bottom of playing video).
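
The frame rates quoted above follow directly from the update times. A small
sketch of the arithmetic (the function name is only illustrative):

```c
#include <assert.h>

/* Highest sustainable frame rate when every screen update takes update_ms. */
double max_fps(double update_ms)
{
    assert(update_ms > 0.0);
    return 1000.0 / update_ms;
}
```

max_fps(41.0) is about 24.4. Taking the midpoint of the 41-58ms range (49.5ms,
an assumption about what "on average" means) gives about 20.2, and
max_fps(17.0) gives the ~59Hz refresh estimate.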

If we try RGB screen updates, we can see that the time needed for a screen
update gets lower when updating smaller screen regions. The numbers are
the following (without tearsync enabled):
640x480: 33.7ms
400x230: 10.2ms
320x240: 8.5ms

Of course RGB screen updates are not very suitable for video as we would lose
much more time doing YUV->RGB conversion.

If we benchmark the screen update performance on Nokia 770, the numbers are:
640x480: 11.1ms
320x240: 2.9ms (that's fullscreen playback with pixel doubling)

If we estimate bus performance on Nokia 770, it is ~55MB/s, which is more than
enough to display 800x480 sized video frames at 30 fps. Adding tearsync
would be nice, as the 11.1ms screen update time for 640x480 is lower
than the 17ms LCD refresh cycle. And in the worst case of video sync, when we
get 11.1ms+17ms=28.1ms for a single frame, it will still be capable of
displaying 35 fps at the very least. So video of any resolution can be played
with perfect quality given enough cpu performance for video decoding (that's
the real bottleneck on Nokia 770).
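
The ~55MB/s estimate can be reproduced from the 640x480 timing, assuming
2 bytes per pixel (16-bit frames) and 1 MB = 10^6 bytes. Both assumptions and
the function names are mine, so treat this as a sketch:

```c
#include <assert.h>

/* Bus throughput in MB/s implied by moving one width x height frame
 * of bytes_per_pixel bytes per pixel in update_ms milliseconds. */
double bus_mb_per_s(int width, int height, int bytes_per_pixel, double update_ms)
{
    assert(update_ms > 0.0);
    double bytes = (double)width * height * bytes_per_pixel;
    return bytes / (update_ms / 1000.0) / 1e6;
}

/* Bandwidth in MB/s needed to push frames of this size at the given fps. */
double required_mb_per_s(int width, int height, int bytes_per_pixel, double fps)
{
    return (double)width * height * bytes_per_pixel * fps / 1e6;
}
```

bus_mb_per_s(640, 480, 2, 11.1) comes out at roughly 55.4, while
required_mb_per_s(800, 480, 2, 30.0) is roughly 23.0, which is where the
"more than enough" conclusion comes from.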

Looks like the graphics bus on N800 is 3x slower than on Nokia 770. It might
be caused by an inefficient framebuffer driver implementation in its initial
revision. But if it is a hardware issue, getting normal video playback at
native framerate may be troublesome. Performing software downscaling of
video before sending data to the graphics chip may be a solution, but it
sacrifices image quality. Switching to 12bit YUV format from 16bit will save
~33% of bus bandwidth, but it can't compensate for a 3x performance regression
and may not be enough for 30 fps fullscreen video playback.
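
For reference, the 3x figure can be checked against the RGB timings listed
earlier (33.7ms vs. 11.1ms for 640x480). The sketch below also works out the
12bit saving: counted in bytes per frame, 12 bits instead of 16 is 25% less
data, which is the same thing as fitting ~33% more frames into the same
bandwidth. Function names are illustrative only.

```c
#include <assert.h>

/* How many times longer measured_ms is than baseline_ms. */
double slowdown_factor(double baseline_ms, double measured_ms)
{
    assert(baseline_ms > 0.0);
    return measured_ms / baseline_ms;
}

/* Fraction of per-frame bytes saved by a narrower pixel format. */
double bytes_saved(int old_bits_per_pixel, int new_bits_per_pixel)
{
    assert(old_bits_per_pixel > 0);
    return 1.0 - (double)new_bits_per_pixel / old_bits_per_pixel;
}
```

slowdown_factor(11.1, 33.7) is about 3.04 and bytes_saved(16, 12) is 0.25, so
even with the narrower format the N800 would remain roughly 2.3x slower per
frame than the 770.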

Right now, I can work around tearing somewhat, but some frames will have to
be skipped, resulting in somewhat jerky playback (even for transcoded video,
unless the framerate is halved).

Apparently the same issue applies to emulators, games and other software.

As Daniel explained, the next firmware will bring a big improvement in this
area. I'm not sure whether it is worth releasing the next version of MPlayer
before that, since it will still be far from perfect on N800.

A preview of the next kernel for beta testing might reduce time needed to get
MPlayer fully working on N800, but I'm not demanding or expecting anything. It
is just a matter of time anyway and I'm not so impatient :)

I would be grateful for any comments and corrections. Some things are not
yet clear to me, and figuring them out myself is just a waste of time that
could be spent on something more useful. Even a small hint may save a huge
amount of time.

PS. The last 'inefficient' period of time was when I was struggling with
gstreamer API (with no prior experience with it) to get MP3 playback in
MPlayer working on DSP for a few months. Looks like the history repeats.
Once again, I'm not demanding anything, it is just a matter of 'optimizing'
development and spending scarce amounts of spare time more efficiently.
I know that Nokia developers are too busy with their primary work, and
really appreciate what they are doing. So consider this as a polite request
for a favour (not necessary to fulfil right now or fulfil at all).
_______________________________________________
maemo-developers mailing list
maemo-developers [at] maemo
https://maemo.org/mailman/listinfo/maemo-developers


marius at pov

Mar 19, 2007, 9:24 AM


Re: N800 & Video playback

On Sun, Mar 18, 2007 at 07:57:36PM +0200, Siarhei Siamashka wrote:
> I did some tests with the framebuffer when trying to find a way to reduce
> tearing effect in MPlayer. Here are the results.
<snip>

This is a very interesting post. Thanks!

Marius Gedminas
--
... Another nationwide organization's computer system crashed twice in less
than a year. The cause of each crash was a computer virus....
-- Paul Mungo, Bryan Glough _Approaching_Zero_
(in 1986 computer crashes were something out of the ordinary. Win95 anyone?)


abos at hanno

Mar 19, 2007, 9:41 AM


Re: N800 & Video playback

Siarhei Siamashka wrote:
> Looks like graphics bus on N800 is 3x slower than on Nokia 770. It might
> be caused by inefficient framebuffer driver implementation in its initial
> revision. But if it is a hardware issue, getting normal video playback at
> native framerate may be troublesome.

It would be a major disappointment if this turns out to be a hardware
issue...

Regards,

Hanno


daniel.stone at nokia

Mar 19, 2007, 1:34 PM


Re: N800 & Video playback

Hi,

On Sun, Mar 18, 2007 at 07:57:36PM +0200, ext Siarhei Siamashka wrote:
> If we look at the framebuffer API. There are two ioctl important for screen
> updates and tearing synchronization if I understand them correctly now:
>
> [...]

You do indeed understand them correctly.

> Looks like graphics bus on N800 is 3x slower than on Nokia 770. It might
> be caused by inefficient framebuffer driver implementation in its initial
> revision. But if it is a hardware issue, getting normal video playback at
> native framerate may be troublesome. Performing software downscaling of
> video before sending data to the graphics chip may be a solution, but it
> sacrifices image quality. Switching to 12bit YUV format from 16bit will save
> ~33% of bus bandwidth, but it can't compensate 3x performance regression
> and may be not enough for 30 fps fullscreen video playback.

Unfortunately, it's a hardware issue. What we can do is get the LCD
controller to perform colourspace conversion from a custom planar format
('YUV420') and the scaling as well. Unfortunately this isn't a
colourkey, but only a simple rectangle, so the semantics are actually
quite complex. But it works well enough that we've shipped an X server
and kernel with this support. We've tried jacking the RFBI frequency up
a bit, and the most we could get was a ~10% improvement, with a loss in
stability: anything above that would kill your device quick smart,
whereas this one only crashed it every day or so.

> As Daniel explained, the next firmware will bring a big improvement in this
> area. I'm not sure whether it is worth to release the next version of MPlayer
> before that, since it will still be far from perfect on N800.

I'd hold your breath, to be honest.

> A preview of the next kernel for beta testing might reduce time needed to get
> MPlayer fully working on N800, but I'm not demanding or expecting anything. It
> is just a matter of time anyway and I'm not so impatient :)

Unfortunately, again, it's not my call: there are various processes to
get things released (legal, in particular), and I can't really pre-empt
those.

> I would be grateful for any comments and corrections. Some things are not
> yet clear to me, figuring them out myself is just a waste of time that could
> be spent on something more useful. Even a small hint may save a huge
> amount of time.

Anything in particular? I thought my last mails on the subject would've
been reasonably exhaustive.

> PS. The last 'inefficient' period of time was when I was struggling with
> gstreamer API (with no prior experience with it) to get MP3 playback in
> MPlayer working on DSP for a few months. Looks like the history repeats.
> Once again, I'm not demanding anything, it is just a matter of 'optimizing'
> development and spending scarce amounts of spare time more efficiently.
> I know that Nokia developers are too busy with their primary work, and
> really appreciate what they are doing. So consider this as a polite request
> for a favour (not necessary to fulfil right now or fulfil at all).

Again, if there are any particular questions I can answer, don't be
subtle: ask me straight up. If I can answer them (some things I can't
necessarily say, some things I don't necessarily know), I will.

Cheers,
Daniel


klaus at rotters

Mar 20, 2007, 1:31 AM


Re: N800 & Video playback

Daniel Stone wrote:
> On Sun, Mar 18, 2007 at 07:57:36PM +0200, ext Siarhei Siamashka wrote:
>> Looks like graphics bus on N800 is 3x slower than on Nokia 770. It might
>> be caused by inefficient framebuffer driver implementation in its initial
>> revision. But if it is a hardware issue, getting normal video playback at
>> native framerate may be troublesome. [...]

> Unfortunately, it's a hardware issue. What we can do is get the LCD

The memory bandwidth to the N800 LCD framebuffer is 3 times slower than
the bandwidth in the N770? Is it really _that_ big?

What is limiting the bandwidth: the OMAP interface, the LCD controller
itself, or was it a design issue?

-Klaus

--
Klaus Rotter * klaus <at> rotters <dot> de * www.rotters.de


daniel.stone at nokia

Mar 20, 2007, 2:58 AM


Re: N800 & Video playback

On Tue, Mar 20, 2007 at 09:31:00AM +0100, ext Klaus Rotter wrote:
> Daniel Stone wrote:
> >On Sun, Mar 18, 2007 at 07:57:36PM +0200, ext Siarhei Siamashka wrote:
> >>Looks like graphics bus on N800 is 3x slower than on Nokia 770. It might
> >>be caused by inefficient framebuffer driver implementation in its initial
> >>revision. But if it is a hardware issue, getting normal video playback at
> >>native framerate may be troublesome. [...]
>
> >Unfortunately, it's a hardware issue. What we can do is get the LCD
>
> The memory bandwidth to the N800 LCD framebuffer is 3 times slower that
> the bandwidth in the N770? Is it really _that_ big?

Siarhei's calculations were correct, so, yes.

> What is limiting the bandwidth: The OMAP interface, the LCD controller
> itself or was it a design issue.

a) and c). It's just not stable at higher frequencies.

Cheers,
Daniel


klaus at rotters

Mar 20, 2007, 6:03 AM


Re: N800 & Video playback

Daniel Stone wrote:
> On Tue, Mar 20, 2007 at 09:31:00AM +0100, ext Klaus Rotter wrote:
>> The memory bandwidth to the N800 LCD framebuffer is 3 times slower that
>> the bandwidth in the N770? Is it really _that_ big?
>
> Siarhei's calculations were correct, so, yes.

Bad... the N770 interface wasn't the fastest either. So we have an even
bigger slowdown. On the N770 there was the feature (with SDL games) of
doubling the pixels in hardware with an X server extension. Will this
feature be available in the new kernel / X11 server for the N800? It
would be great if it used the same API.

--
Klaus Rotter * klaus <at> rotters <dot> de * www.rotters.de


daniel.stone at nokia

Mar 20, 2007, 7:05 AM


Re: N800 & Video playback

On Tue, Mar 20, 2007 at 02:03:16PM +0100, ext Klaus Rotter wrote:
> Daniel Stone wrote:
> >On Tue, Mar 20, 2007 at 09:31:00AM +0100, ext Klaus Rotter wrote:
> >>The memory bandwidth to the N800 LCD framebuffer is 3 times slower that
> >>the bandwidth in the N770? Is it really _that_ big?
> >
> >Siarhei's calculations were correct, so, yes.
>
> Bad... the N770 interface wasn't the fasted either. So we have even a
> more slow down. On the N770 there was the feature (with SDL games) of
> doubling the pixels by hardware with a X-server extension. Will this
> feature be available in the new kernel / X11 server for the N800? It
> would be great if it would use the same API.

Yes, pixel doubling has been fixed, and still uses the XSP API for now.
Future releases (long-term, as I haven't implemented this yet) will use
the standard XRandR API.

Cheers,
Daniel


siarhei.siamashka at gmail

Mar 21, 2007, 2:20 PM


Re: N800 & Video playback

On Tuesday 20 March 2007 15:03, Klaus Rotter wrote:

> > On Tue, Mar 20, 2007 at 09:31:00AM +0100, ext Klaus Rotter wrote:
> >> The memory bandwidth to the N800 LCD framebuffer is 3 times slower that
> >> the bandwidth in the N770? Is it really _that_ big?
> >
> > Siarhei's calculations were correct, so, yes.
>
> Bad... the N770 interface wasn't the fasted either. So we have even a
> more slow down.

There is one important thing to note. Screen updates are asynchronous and
are performed while the CPU is doing other useful things. Screen updates do
not introduce any overhead or affect performance (at least I did not notice
any such effect). So boosting graphics bus performance further will not
provide any improvement once it is capable of sustaining an acceptable
framerate. And what is acceptable depends on the application. Video may
require a higher framerate, but only movies that are both high resolution
and high framerate may exceed graphics bus capabilities. In this case the
video will still be played (if the cpu is fast enough to decode it, but
that's another story) with some frames skipped, and many people will not
even notice any problems. Quite a lot of people are even satisfied with
15fps transcoded video, so getting maybe 20-25fps (random guess) on some
videos instead of 30fps is not so bad.

Tearing at the bottom is most likely caused by the screen update time being
longer than two LCD refresh cycles. With tearsync enabled, both the screen
update and the refresh cycle start at the same time. The refresh is faster, so
we still see the previous frame on the screen. When the first refresh cycle
completes, the screen buffer is slightly less than half updated.
The second LCD refresh cycle starts displaying data from the new image
while the screen buffer is still being updated, but not fast enough to
complete before this second LCD refresh cycle catches up not too far
from the bottom of the screen. If the screen update took less than two
refresh cycles, there would be no visible tearing. The screen update only
needs to be 15-20% faster to achieve this. If improving graphics bus
performance does not work, I wonder if it is possible to reduce the LCD
refresh rate instead?
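
The reasoning above can be put into a small linear model. This is only a
sketch under simplifying assumptions (constant scan and transfer speed, both
starting at the same instant), not a description of what the hardware really
does. With refresh period R and update time U, the second scan pass reaches
the write position at the screen fraction R / (U - R).

```c
#include <assert.h>

/* Simplified linear model (an assumption, not a measurement): the LCD
 * scans the whole screen once every refresh_ms, the bus fills the
 * controller's buffer top to bottom in update_ms, and both start together.
 * Returns the screen fraction (0 = top, 1 = bottom) at which the second
 * scan pass overtakes the still-running update; 1.0 means no tear. */
double tear_position(double update_ms, double refresh_ms)
{
    assert(update_ms > 0.0 && refresh_ms > 0.0);
    if (update_ms <= 2.0 * refresh_ms)
        return 1.0;   /* the update finishes before the second pass catches up */
    return refresh_ms / (update_ms - refresh_ms);
}
```

For U = 41ms and R = 17ms this gives 17/24, about 0.71 of the screen height,
which fits a tear line in the lower part of the picture. The tear disappears
once U drops to 34ms, i.e. about 17% below 41ms, in line with the 15-20%
estimate.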

Anyway, I think it is better to believe Daniel and wait for the new
firmware update :)

> On the N770 there was the feature (with SDL games) of
> doubling the pixels by hardware with a X-server extension. Will this
> feature be available in the new kernel / X11 server for the N800? It
> would be great if it would use the same API.

Doubling pixels will definitely reduce the load on the graphics bus, so that
its bandwidth should no longer be an issue.


siarhei.siamashka at gmail

Apr 19, 2007, 11:41 PM


Re: N800 & Video playback

On Monday 19 March 2007 22:34, you wrote:

<snip>

> Again, if there are any particular questions I can answer, don't be
> subtle: ask me straight up. If I can answer them (some things I can't
> necessarily say, some things I don't necessarily know), I will.

Thanks, here we go, and sorry for the long delay with this answer.

First, thanks for the Xv update, which makes it really usable now; MPlayer now
uses Xv video output on N800 by default. But there are still some problems.
Using unmodified upstream MPlayer code for Xv (N800 with 3.2007.10-7
firmware at the moment) does not work well. It has at least two problems:

1. Lockups which look like cycling between two sequential frames, very similar
to or the same problem as https://maemo.org/bugzilla/show_bug.cgi?id=991
Also, keypresses are not very responsive. A fix (or workaround) required
changing XFlush to XSync in the screen update code; now it looks a lot better.

2. Switching between windowed and fullscreen mode generally makes mplayer
terminate with the following error messages:
"X11 error: BadValue (integer parameter out of range for operation)"
"Xlib: unexpected async reply (sequence 0x5db)!"
A workaround that makes this problem less frequent was a code addition which
prevents screen updates until we get an Expose event notification.

All these Xv patches for MPlayer code can be viewed here:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php?root=mplayer&diff_format=h&view=rev&rev=166

I really don't know much about X11 programming and have only started learning
it, so your advice may be very useful. It looks like MPlayer's X11/Xv output
code is a big mess with many tricks and workarounds added over time to
work on different systems. Maybe it contains some bugs which get
triggered on N800 only, but apparently this code is used on other systems
without any problems. Can you try experimenting a bit with MPlayer (upstream
release) yourself to check how it works with the N800 xserver? Maybe it can
reveal some xserver bugs which need to be fixed? Also, if MPlayer has some
apparently bad X11 code, preparing a clean patch and submitting it upstream
may be a good idea.

One more strange thing with Xv on N800 can be reproduced by trying to watch
the standard N800 demo video in MPlayer. It has the old familiar tearing line
in the bottom part of the screen and the performance is very poor. The same
file plays fine in the standard video player. The only difference is that
mplayer respects the video aspect ratio (this video is not precisely 15:9 but
slightly off) and shows small black bands above and below the picture, while
the default video player scales it to fit the whole screen. Disabling aspect
ratio handling in mplayer with the -noaspect option also 'fixes' this problem.

Using benchmark option we get the following numbers:

# mplayer -benchmark -quiet Nokia_N800.avi
[...]
BENCHMARKs: VC: 33,271s VO: 66,768s A: 0,490s Sys: 5,703s = 106,232s
BENCHMARK%: VC: 31,3189% VO: 62,8517% A: 0,4614% Sys: 5,3681% = 100,0000%
BENCHMARKn: disp: 1732 (16,30 fps) drop: 778 (30%) total: 2510 (23,63 fps)

# mplayer -benchmark -quiet -noaspect Nokia_N800.avi
[...]
BENCHMARKs: VC: 32,226s VO: 14,350s A: 0,456s Sys: 55,699s = 102,731s
BENCHMARK%: VC: 31,3694% VO: 13,9687% A: 0,4439% Sys: 54,2180% = 100,0000%
BENCHMARKn: disp: 2501 (24,35 fps) drop: 0 (0%) total: 2501 (24,35 fps)

So when showing video with proper aspect ratio, we get tearing back and more
than 4x slowdown in video output code (66,768s vs. 14,350s). This all results
in 30% of frames dropped.

These were the 'usability' problems with Xv. Now we get to performance
related issues. As YV12 is not natively supported by the hardware, some
color format conversion and byte shuffling in the video output code is
unavoidable. It is a good idea to optimize this code if we need good
performance for high resolution video playback. Color format conversion
can be optimized using assembly; for example, the maemo port of mplayer
has a patch for assembly optimized yv12 -> yuy2 (yuv420p -> yuyv422)
nonscaled conversion which provides a very noticeable ~50% improvement
on Nokia 770:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php?root=mplayer&rev=129&view=rev

Also, here is a JIT accelerated scaler for yv12 -> yuy2 (yuv420p -> yuyv422)
conversion. It is very fast and supports pixel interpolation (good for image
quality):
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer
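
To make clear what the nonscaled planar -> packed conversion mentioned above
has to do, here is a plain C reference version. It is not the code from the
linked patches (those are assembly and JIT based); it is an unoptimized
sketch that assumes tightly packed planes and even dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimized) conversion from planar YUV 4:2:0 to packed YUY2
 * (bytes Y0 U Y1 V for every pair of pixels). The planes are passed as
 * separate pointers, so the same code serves YV12 (V plane stored first)
 * and I420 (U plane stored first). Each chroma sample is reused for a 2x2
 * block of luma samples, with no chroma interpolation. Planes are assumed
 * tightly packed (stride == width), which real decoders do not guarantee. */
void yuv420p_to_yuyv422(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                        uint8_t *dst, int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    for (int row = 0; row < height; row++) {
        const uint8_t *yrow = y + (size_t)row * width;
        const uint8_t *urow = u + (size_t)(row / 2) * (width / 2);
        const uint8_t *vrow = v + (size_t)(row / 2) * (width / 2);
        uint8_t *out = dst + (size_t)row * width * 2;

        for (int x = 0; x < width; x += 2) {
            out[0] = yrow[x];
            out[1] = urow[x / 2];
            out[2] = yrow[x + 1];
            out[3] = vrow[x / 2];
            out += 4;
        }
    }
}
```

The output is 2 bytes per pixel versus 1.5 for the planar input, which is why
the packed path also costs more bus bandwidth.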

I have seen your code in xserver which does the same job for downscaling, but
in nonoptimized C and with a much higher impact on quality. Using the JIT
scaler there can improve both image quality and performance a lot. My only
concern is about instruction cache coherency. As ARM requires an explicit
instruction cache flush for self-modifying or dynamically generated code, I
wonder if using just mmap is safe (does it flush the cache for the allocated
region of memory?). Maybe maemo kernel hackers/developers can help with this
information?

It should be noted that all this assembly optimized code was developed for
Nokia 770. N800 has much faster memory (up to 190MB/s memory copy
performance vs. 110MB/s on Nokia 770) but requires slightly different
optimizations (it seems to need explicit prefetch with the PLD instruction
for reading data). I'm going to try making N800 optimized color format
conversion functions a bit later.
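
As a rough illustration of explicit prefetching without writing assembly,
the sketch below uses GCC's __builtin_prefetch, which I would expect to be
lowered to PLD on ARM cores that have the instruction (otherwise it is simply
ignored). The look-ahead distance and the once-per-8-words interval are
guesses that would need tuning on the real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD_WORDS 16  /* 64 bytes ahead; a guess, not a tuned value */

/* Word-wise copy that hints the CPU to start loading source data
 * some distance ahead of the word currently being copied. */
void copy_with_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert(words == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < words; i++) {
        /* Once per 8 words (32 bytes), request data further ahead,
         * staying inside the source buffer. */
        if ((i % 8) == 0 && i + PREFETCH_AHEAD_WORDS < words)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD_WORDS], 0, 0);
        dst[i] = src[i];
    }
}
```

The prefetch is only a hint, so the copy gives the same result with or
without it; only the timing may differ.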

But here is one more problem. As color format conversion is done in xserver,
it will take a really long time before any such optimizations can be delivered
to end users. Nokia seems to have an unpredictable (to outsiders) and slow
release schedule.

So for any performance optimization experiments which result in an immediate
video performance improvement, either direct framebuffer access should be
used again, or it would be very nice if xserver could provide direct access to
the framebuffer (video planes) in yuy2 and that custom yuv420 format in one of
the next firmware updates. The xserver itself should not do any excess memory
copy operations as they degrade performance (and it does such a copy for yuy2
at least).

Also I'm curious about that yuv420 format. From the comments in your code, it
looks like it is different from what is described in Epson docs. That seems a
bit weird.

Thanks for doing a great job supporting the maemo community. Your comments
have always been very informative and helpful here.


daniel.stone at nokia

Apr 20, 2007, 12:39 AM


Re: N800 & Video playback

Hi,

On Fri, Apr 20, 2007 at 09:41:45AM +0300, ext Siarhei Siamashka wrote:
> 1. Lockups which look like cycling two sequential frames, very similar or the
> same problem as https://maemo.org/bugzilla/show_bug.cgi?id=991
> Also keypresses are not very responsive. A fix (or workaround) required
> changing XFlush to XSync in screen update code, now it looks a lot better.

I assume this is basically just a race condition, and it doesn't trigger
on other systems, because they're a lot quicker.

> 2. Switching windowed/fullscreen mode generally makes mplayer terminate
> with the following error messages:
> "X11 error: BadValue (integer parameter out of range for operation)"
> "Xlib: unexpected async reply (sequence 0x5db)!"
> A workaround to make this problem less frequent was a code addition which
> prevents screen updates until we get Expose even notification.

Ditto.

> I really don't know much about X11 programming and only started to learning
> it, so your help with some advice may be very useful.

I mainly lurk on the server side, however.

> Looks like MPlayer code
> X11/Xv output code is a big mess with many tricks and workarounds added to
> work on different systems over time. Maybe it contains some bugs which get
> triggered on N800 only, but apparently this code is used for other systems
> without any problems. Can you try experimenting a bit with MPlayer (upstream
> release) yourself to check how it works with N800 xserver? Maybe it can reveal
> some xserver bugs which need to be fixed? Also if MPlayer has some apparently
> bad X11 code, preparing a clean patch and submitting it upstream maybe a
> good idea.

Unfortunately, I don't have the time to do this. Sorry.

> One more strange thing with Xv on N800 can be reproduced by trying to watch
> standard N800 demo video in MPlayer. It has an old familiar tearing line in
> the bottom part of the screen and the performance is very poor. The same file
> plays fine in the standard video player. The only difference is that mplayer
> respects video aspect ratio (this video is not precisely 15:9 but slightly
> off) and shows some small black bands above and below picture and
> default video player scales it to fit the whole screen. Disabling aspect ratio
> in mplayer with -noaspect option also 'fixes' this problem.
>
> Using benchmark option we get the following numbers:
>
> # mplayer -benchmark -quiet Nokia_N800.avi
> [...]
> BENCHMARKs: VC: 33,271s VO: 66,768s A: 0,490s Sys: 5,703s = 106,232s
> BENCHMARK%: VC: 31,3189% VO: 62,8517% A: 0,4614% Sys: 5,3681% = 100,0000%
> BENCHMARKn: disp: 1732 (16,30 fps) drop: 778 (30%) total: 2510 (23,63 fps)
>
> # mplayer -benchmark -quiet -noaspect Nokia_N800.avi
> [...]
> BENCHMARKs: VC: 32,226s VO: 14,350s A: 0,456s Sys: 55,699s = 102,731s
> BENCHMARK%: VC: 31,3694% VO: 13,9687% A: 0,4439% Sys: 54,2180% = 100,0000%
> BENCHMARKn: disp: 2501 (24,35 fps) drop: 0 (0%) total: 2501 (24,35 fps)
>
> So when showing video with proper aspect ratio, we get tearing back and more
> than 4x slowdown in video output code (66,768s vs. 14,350s). This all results
> in 30% of frames dropped.

Okay, I'll take a look at this. My guess is that the scaling we're
seeing prevents us from using the LCD controller's overlay, possibly
because it's done in software.

> These were the 'usability' problems with Xv. Now we get to performance
> related issues. As YV12 is not natively supported by hardware, some
> color format conversion and bytes shuffling in video output code is
> unavoidable. It is a good idea to optimize this code if we need a good
> performance for high resolution video playback. Color format conversion
> can be optimized using assembly, for example maemo port of mplayer
> has a patch for assembly optimized yv12-> yuy2 (yuv420p -> yuyv422)
> nonscaled conversion which provides a very noticeable ~50% improvement
> on Nokia 770:
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php?root=mplayer&rev=129&view=rev
>
> Also here is a JIT accelerated scaler for yv12-> yuy2 (yuv420p -> yuyv422)
> conversion, it is very fast and supports pixels interpolation (good for image
> quality) :
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer

The primary conversion we do isn't planar -> packed (this is a fallback
for when the video is obscured), but from planar to another custom
planar format. It would be good to get ARM assembly for the fallback
path, but most of the problem when using packed lies in having to
transfer the much larger amount of data over the bus.

There's one optimisation that could be done for the YUV420 conversion
(the custom planar format that Hailstorm takes), which removes a branch,
ensures 32-bit writes always (instead of one 32-bit and one 16-bit per
pixel), and unrolls a loop by half. Might be interesting to see what
effect this has, but I think it'll still be rather small.

> I have seen your code in xserver which does the same job for downscaling, but
> in nonoptimized C and with much higher impact on quality. Using JIT scaler
> there can improve both image quality and performance a lot. The only my
> concern is about instruction cache coherency. As ARM requires explicit
> instructions cache flush for self modyfying or dynamically generated code, I
> wonder if using just mmap is safe (does it flush cache for allocated region
> of memory?). Maybe maemo kernel hackers/developers can help with this
> information?

'Downscaling' is overstating it: it just removes enough lines to get the
job done. I don't believe we have enough CPU power to do proper
interpolation on that path.

Again, this is basically a 'fallback' path, and doesn't hit performance
in the normal case.
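
For readers who want a picture of what "removing lines" means, here is a
sketch of vertical downscaling by line dropping. It is not the xserver code,
just one plausible way to do it; the function name and the line selection
rule are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shrink an image vertically by copying only some source lines.
 * Destination line d takes source line d * src_h / dst_h; the lines in
 * between are dropped without any filtering or interpolation. */
void drop_lines(const uint8_t *src, int src_h, uint8_t *dst, int dst_h,
                size_t stride)
{
    assert(src_h > 0 && dst_h > 0 && dst_h <= src_h);
    for (int d = 0; d < dst_h; d++) {
        int s = (int)((long long)d * src_h / dst_h);
        memcpy(dst + (size_t)d * stride, src + (size_t)s * stride, stride);
    }
}
```

The cost is one memcpy per kept line, which is why it is cheap on the CPU but
visibly worse than interpolation.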

Off the top of my head, an mmap will only flush the dcache, not the
icache. But I haven't tried this out.

> It should be noted, that all this assembly optimized code was developed for
> Nokia 770. N800 has a much faster memory (up to 190MB/s memory copy
> performance vs. 110MB/s on Nokia 770) but requires a bit different
> optimizations (seems to need explicit prefetch with PLD instruction for
> reading data). I'm going to try making N800 optimized color format conversion
> functions a bit later.

Okay, cool.

> But here is one more problem. As color format conversion is done in xserver,
> it will take a really long time before any such optimizations can be delivered
> to end users. Nokia seems to have unpredictable (to outsiders) and slow
> releases schedule.

Don't look at me. :)

> So for any performance optimizations experiments which result in immediate
> video performance improvement, either direct framebuffer access should be
> used again or it would be very nice if xserver could provide direct access to
> framebuffer (video planes) in yuy2 and that custom yuv420 format in one of the
> next firmware updates. The xserver itself should not do any excess memory copy
> operations as they degrade performance (and it does such copy for yuy2 at
> least).

'Direct framebuffer access'? As in, just hand you a pointer to a
framebuffer somewhere and let you write straight to it? As this would
require a firmware update anyway, I don't really see how this would
improve matters too much, and I really don't want to write any more
Maemo-specific extensions (I've been working very hard to kill XSP).

> Also I'm curious about that yuv420 format. From the comments in your code, it
> looks like it is different from what is described in Epson docs. That seems a
> bit weird.

Which Epson docs?

> Thanks for doing a great job supporting maemo community, your comments have
> been always very informative and helpful here.

No worries. :) Thanks for your work on the media player; the fallback
paths are, as you've noticed, not necessarily optimal.

Cheers,
Daniel


dufkaf at seznam

Apr 20, 2007, 3:55 AM


Re: N800 & Video playback

Daniel Stone wrote:

>
> Which Epson docs?
>

fanoush.wz.cz/maemo/S1D13745A01SpecRev1.0.gm.zip
Got it from Epson Electronics like the one mentioned here
http://maemo.org/pipermail/maemo-developers/2006-December/006638.html

_______________________________________________
maemo-developers mailing list
maemo-developers [at] maemo
https://maemo.org/mailman/listinfo/maemo-developers


ulysses.huang at gmail

Apr 20, 2007, 9:04 AM


Views: 13383
Re: N800 & Video playback

Siarhei Siamashka wrote:
> I have seen your code in xserver which does the same job for downscaling, but
> in nonoptimized C and with much higher impact on quality. Using JIT scaler
> there can improve both image quality and performance a lot. My only
> concern is about instruction cache coherency. As ARM requires an explicit
> instruction cache flush for self-modifying or dynamically generated code, I
> wonder if using just mmap is safe (does it flush the cache for the allocated
> region of memory?). Maybe maemo kernel hackers/developers can help with this
> information?
>
ARM Linux supports flushing the icache through the "cacheflush" syscall.

QEMU has this function:
static inline void flush_icache_range(unsigned long start, unsigned long stop)
{
    /* Pin the arguments to r0-r2 (a1-a3), where the syscall expects them. */
    register unsigned long _beg __asm ("a1") = start;
    register unsigned long _end __asm ("a2") = stop;
    register unsigned long _flg __asm ("a3") = 0;
    /* 0x9f0002 is the ARM-private cacheflush syscall (old-ABI swi encoding). */
    __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg));
}

For reference, see arch/arm/kernel/traps.c and include/asm-arm/unistd.h in the
kernel source.
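
For code generated at run time from C, the same idea can be sketched without inline assembly, assuming GCC or Clang: the compiler builtin __builtin___clear_cache synchronises the instruction cache for an address range (on ARM Linux it typically ends up in the cacheflush syscall, on other targets it may compile to nothing). The helper name jit_publish and the buffer handling are hypothetical, not taken from QEMU or from the JIT scaler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy freshly generated machine code into 'dst'
 * (for a real JIT, a region obtained from mmap with PROT_EXEC) and make
 * it visible to instruction fetch before anything jumps to it.
 * Returns the number of bytes published.
 */
static size_t jit_publish(void *dst, const void *code, size_t len)
{
    assert(dst != NULL && code != NULL);
    memcpy(dst, code, len);
    /* GCC/Clang builtin: flush the instruction cache for [begin, end). */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
    return len;
}
```

The point of the sketch is only the ordering: write the code first, flush the range second, jump to it last. mmap by itself should not be relied on to do the flush for data written later.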



siarhei.siamashka at gmail

Apr 23, 2007, 11:46 PM


Views: 13359
Re: N800 & Video playback

On Friday 20 April 2007 10:39, you wrote:

> The primary conversion we do isn't planar -> packed (this is a fallback
> for when the video is obscured), but from planar to another custom
> planar format. It would be good to get ARM assembly for the fallback
> path, but most of the problem when using packed lies in having to
> transfer the much larger amount of data over the bus.

It is only a problem of definition :) Whatever it is, packed or planar, this
YUV420 format is not YV12. So it still needs conversion which is
performed by only reordering bytes and is not much different from
packed YUY2 (except that it requires less space and bandwidth).

> There's one optimisation that could be done for the YUV420 conversion
> (the custom planar format that Hailstorm takes), which removes a branch,
> ensures 32-bit writes always (instead of one 32-bit and one 16-bit per
> pixel), and unrolls a loop by half. Might be interesting to see what
> effect this has, but I think it'll still be rather small.

My main performance concern is exactly about this 'omapCopyPlanarDataYUV420'
function. My experience from Nokia 770 video output code optimization shows
that optimization effect can be really huge (it was 1.5x improvement on Nokia
770 for unscaled YV12 -> YUY2 conversion going from a simple loop in C to
optimized assembly code, I provided a link to the relevant code in my previous
post). But N800 code can be probably improved more because now it contains
unnecessary branch in the inner loop and branches are expensive on long
pipeline CPUs. Such color format conversion performance should be
comparable to that of memcpy if done right (it is about half memcpy speed on
Nokia 770 for unscaled YV12 -> YUY2 conversion).

But only benchmarks can be a real proof, any premature speculations are
useless and even harmful. Do you remember the times when nobody from
Nokia believed that ARM core could be good for video decoding on 770? ;-)

Testing with Nokia_N800.avi video on N800:
# mplayer -benchmark -quiet -noaspect Nokia_N800.avi

BENCHMARKs: VC: 29,525s VO: 15,029s A: 0,453s Sys: 59,919s = 104,925s
BENCHMARK%: VC: 28,1390% VO: 14,3232% A: 0,4313% Sys: 57,1065% = 100,0000%
BENCHMARKn: disp: 2511 (23,93 fps) drop: 0 (0%) total: 2511 (23,93 fps)

Enabling direct rendering (avoids an extra memcpy in mplayer, but requires
disabling the OSD menu):
# mplayer -benchmark -quiet -noaspect -dr -nomenu Nokia_N800.avi

BENCHMARKs: VC: 29,826s VO: 12,365s A: 0,437s Sys: 60,555s = 103,182s
BENCHMARK%: VC: 28,9058% VO: 11,9833% A: 0,4236% Sys: 58,6873% = 100,0000%
BENCHMARKn: disp: 2504 (24,27 fps) drop: 0 (0%) total: 2504 (24,27 fps)

Testing the same video on Nokia 770:
# mplayer -benchmark -quiet -noaspect Nokia_N800.avi

BENCHMARKs: VC: 44,982s VO: 7,998s A: 0,884s Sys: 47,936s = 101,801s
BENCHMARK%: VC: 44,1862% VO: 7,8568% A: 0,8688% Sys: 47,0882% = 100,0000%
BENCHMARKn: disp: 2502 (24,58 fps) drop: 0 (0%) total: 2502 (24,58 fps)


So the Nokia 770, despite having a slower CPU, slower memory and a less
efficient output format (16bpp vs. 12bpp), still requires less time for video
output than the N800 (7,998s vs. 12,365s). Graphics bus performance is
unrelated here, as the transfer is an asynchronous operation and it is fast
enough. Surely the N800 also has some extra overhead because of interprocess
communication with xserver, but it looks like the YV12 -> YUV420 conversion is
quite a bottleneck here too.

It should be noted that while Nokia_N800.avi video has low resolution and
N800 has no problems decoding and displaying it, our goal is higher resolution
videos such as 640x480. Getting to higher resolutions will increase color
format conversion overhead. As it can be seen from these benchmarks, video
output on N800 takes quite a significant time when compared with time needed
for decoding (29,826s for decoding, 12,365s for video output).
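
To put those totals in per-frame terms, the arithmetic can be sketched in C. The helper name ms_per_frame is hypothetical; the inputs are the VO times and displayed frame counts from the benchmark output above.

```c
#include <assert.h>

/* Average time spent per displayed frame, in milliseconds. */
static double ms_per_frame(double seconds, unsigned frames)
{
    assert(frames > 0);
    return seconds * 1000.0 / (double)frames;
}
```

With the figures above, ms_per_frame(12.365, 2504) comes to roughly 4.9 ms of video output per frame on the N800 (with direct rendering), against roughly 3.2 ms for ms_per_frame(7.998, 2502) on the Nokia 770. For comparison, the whole frame budget at 24 fps is about 41.7 ms.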

I can write assembly-optimized code for the YV12 -> YUV420 conversion. Is
there any chance that such an optimization could also be integrated into
xserver in one of the next firmware updates if it really provides a
significant performance improvement?

The N800 is almost able to play VGA resolution videos properly, it only needs
a few more optimizations. Color format conversion performance for video output
is one of the important things that can be improved.

> > So for any performance optimizations experiments which result in
> > immediate video performance improvement, either direct framebuffer access
> > should be used again or it would be very nice if xserver could provide
> > direct access to framebuffer (video planes) in yuy2 and that custom
> > yuv420 format in one of the next firmware updates. The xserver itself
> > should not do any excess memory copy operations as they degrade
> > performance (and it does such copy for yuy2 at least).
>
> 'Direct framebuffer access'? As in, just hand you a pointer to a
> framebuffer somewhere and let you write straight to it? As this would
> require a firmware update anyway, I don't really see how this would
> improve matters too much, and I really don't want to write any more
> Maemo-specific extensions (I've been working very hard to kill XSP).

Direct framebuffer access will eliminate the need for an extra memcpy while
still allowing the OSD menu and subtitles to be used, and will make everything
much easier (currently this is how MPlayer works on Nokia 770). You can compare
the benchmark results with direct rendering enabled and disabled above. It
saves ~3 seconds of CPU time when playing the Nokia_N800.avi video.

Direct rendering allows using Xv buffers and decoding video in place. But
unfortunately, as data from these buffers is used as reference frames for
decoding the next frames, the buffers must not be modified. And this all makes
implementing OSD and subtitles tricky.

Having access directly to framebuffer eliminates the need to use this direct
rendering technique and saves us from the complexities associated with it.

> > Also I'm curious about that yuv420 format. From the comments in your
> > code, it looks like it is different from what is described in Epson docs.
> > That seems a bit weird.
>
> Which Epson docs?

The one mentioned by Frantisek. Well, it was just that the comment for the
'omapCopyPlanarDataYUV420' function is wrong and misleading, never mind :-)
Now everything is clear.


daniel.stone at nokia

Apr 24, 2007, 2:36 AM


Views: 13352
Re: N800 & Video playback

On Tue, Apr 24, 2007 at 09:46:52AM +0300, ext Siarhei Siamashka wrote:
> On Friday 20 April 2007 10:39, you wrote:
> > There's one optimisation that could be done for the YUV420 conversion
> > (the custom planar format that Hailstorm takes), which removes a branch,
> > ensures 32-bit writes always (instead of one 32-bit and one 16-bit per
> > pixel), and unrolls a loop by half. Might be interesting to see what
> > effect this has, but I think it'll still be rather small.
>
> My main performance concern is exactly about this 'omapCopyPlanarDataYUV420'
> function. My experience from Nokia 770 video output code optimization shows
> that optimization effect can be really huge (it was 1.5x improvement on Nokia
> 770 for unscaled YV12 -> YUY2 conversion going from a simple loop in C to
> optimized assembly code, I provided a link to the relevant code in my previous
> post). But N800 code can be probably improved more because now it contains
> unnecessary branch in the inner loop and branches are expensive on long
> pipeline CPUs. Such color format conversion performance should be
> comparable to that of memcpy if done right (it is about half memcpy speed on
> Nokia 770 for unscaled YV12 -> YUY2 conversion).

Right, the branch is a problem, and as I said, the branch can be avoided
and the writes optimised to be three 32-bit writes for two macroblocks,
instead of two 32-bit writes and two 16-bit writes.

However, I don't think the lessons from the 770 are necessarily
_directly_ applicable to the N800: on the 770, our bottleneck is
decoding speed. The bottleneck on the N800 is exactly the opposite:
video output.

> But only benchmarks can be a real proof, any premature speculations are
> useless and even harmful. Do you remember the times when nobody from
> Nokia believed that ARM core could be good for video decoding on 770? ;-)

Actually, I don't, since I've always mainly worked on the N800. ;) But
still, if there's dedicated hardware we can use to remove load from the
ARM and let it get on with tasks, and it can perform to an adequate
level, there's no reason to avoid it.

> So Nokia 770, having slower CPU, slower memory and using less efficient
> output format (16bpp vs. 12bpp), still requires less time for video output
> than N800 (7,998s vs. 12,365s). Graphics bus performance is unrelated here
> as it is asynchronous operation and it is fast enough. Surely N800 also has
> some extra overhead because of interprocess communication with xserver, but
> looks like YV12 -> YUV420 conversion is quite a bottleneck here too.

Bear in mind that, unless you explicitly disable it (the Xv attribute is
something like XV_OMAP_VSYNC), the X server _will_ flush all pending
writes before the next frame is put through. Else you get tearing,
because you can be halfway through an update, and writing the next frame
to the framebuffer, so which frame is being picked up, changes halfway
through.

Try forcing XV_OMAP_VSYNC (or whatever it is) to 0, and comparing the
results.

> I can make an assembly optimized code for YV12 -> YUV420 conversion. Is there
> any chance that such optimization could be also integrated into xserver in one
> of the next firmware updates if it really provides a significant performance
> improvement?

Yeah, if there's measurable benefit, I'll include it.

> N800 is almost able to play VGA resolution videos properly, it only needs a
> bit more optimizations. Color format conversion performance for video output
> is one of the important things that can be improved.

I don't believe it's on the critical path. The optimisation I mentioned
before will bring us up to the point where any improvement that we can
make in that conversion will be eclipsed by the time taken to send it
over the bus, I believe. But I can't prove that.

> > Which Epson docs?
>
> The one mentioned by Frantisek. Well, it was just a comment
> for 'omapCopyPlanarDataYUV420' function wrong and misleading,
> nevermind :-) Now everything is clear.

Hmm, is it? Because, unless I was _really_ tired at the time I wrote it
(which is entirely possible), that's what the code does, and it works,
so ...

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


siarhei.siamashka at gmail

Apr 26, 2007, 5:14 PM


Views: 13327
Re: N800 & Video playback

On Tuesday 24 April 2007 12:36, Daniel Stone wrote:

> > My main performance concern is exactly about this
> > 'omapCopyPlanarDataYUV420' function. My experience from Nokia 770 video
> > output code optimization shows that optimization effect can be really
> > huge (it was 1.5x improvement on Nokia 770 for unscaled YV12 -> YUY2
> > conversion going from a simple loop in C to optimized assembly code, I
> > provided a link to the relevant code in my previous post). But N800 code
> > can be probably improved more because now it contains unnecessary branch
> > in the inner loop and branches are expensive on long pipeline CPUs. Such
> > color format conversion performance should be comparable to that of
> > memcpy if done right (it is about half memcpy speed on Nokia 770 for
> > unscaled YV12 -> YUY2 conversion).
>
> Right, the branch is a problem, and as I said, the branch can be avoided
> and the writes optimised to be three 32-bit writes for two macroblocks,
> instead of two 32-bit writes and two 16-bit writes.

I did not have much free time to do complete tests, but initial benchmarks
show that actually even removing this branch and using three 16-bit writes
improves performance quite significantly. The test program is here:
http://ufo2000.sourceforge.net/files/yuv420test.c

It produces the following results if compiled with optimization
options "-O3 -fomit-frame-pointer -mcpu=arm1136j-s":

# ./yuv420test
test: 'yv12toyuv420_xomap', time=5.220, memory bandwidth=61.576MB/s
test: 'yv12toyuv420_yv12toyuv420_branch_removed', time=3.503, memory bandwidth=91.754MB/s

An interesting thing about this test is that it uses 2504 frames 400x240
each, that's the same number of frames as Nokia_N800.avi video has.
And mplayer spent 12,365s on video output when playing this video while
YV12->YUV420 conversion should have taken 5.220s as benchmarked in
this test. So now color conversion is roughly half of the time spent on video
output for this resolution. Some tests with higher resolution videos will be
done later.

As you see from the benchmark results, we can get 1.5x improvement
already for color conversion with just a trivial removal of a piece of
redundant code. Was that branch in the code supposed to improve
performance? Seems like it resulted in quite the opposite effect.

I'll make a really optimized version of the YV12 -> YUV420 converter this
weekend (removing the branch is good, but I feel that it can be improved
more) and will try to use it on Nokia 770, where any extra video performance
improvement will be useful. I hope that the framebuffer driver on
Nokia 770 supports the YUV420 color format properly.

By the way, does anybody know if it is possible to enable tearsync support
on Nokia 770 (by backporting some changes from N800 kernel or in some
other way)?

> However, I don't think the lessons from the 770 are necessarily
> _directly_ applicable to the N800: on the 770, our bottleneck is
> decoding speed. The bottleneck on the N800 is exactly the opposite:
> video output.

I can't agree here. Memory speed is actually a lot faster on N800, the only
trouble is graphics bus performance, but sending data to LCD controller
through this bus does not introduce any load on ARM core and it can freely
decode the next frame of video at the same time. At least this was the case
with the previous version of firmware (I did not have enough time to see what
was changed in framebuffer API and do any video tests with it).

But color conversion is done by ARM core and it consumes precious cpu
cycles which could be used for decoding higher resolution/bitrate video.
Optimizing color conversion will improve video performance. The
improvement will be most likely only within a few percents overall, but
every little bit helps.

> Bear in mind that, unless you explicitly disable it (the Xv attribute is
> something like XV_OMAP_VSYNC), the X server _will_ flush all pending
> writes before the next frame is put through. Else you get tearing,
> because you can be halfway through an update, and writing the next frame
> to the framebuffer, so which frame is being picked up, changes halfway
> through.
>
> Try forcing XV_OMAP_VSYNC (or whatever it is) to 0, and comparing the
> results.

OK, thanks, I'll try this test too and check if it affects Xv performance.
But I thought that using 12bpp color format _and_ sending only as much
data as needed should solve the problem. Of course 800x480 * 16bpp * 30fps
would be 23MB/s and it is too much. But for example 640x480 * 12bpp * 30fps =
12.3MB/s. Is the graphics bus fast enough to handle this?

Or is there some other problem I'm not aware of?
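
The bandwidth estimate above can be reproduced with a few lines of C. This is only a sketch: bus_bytes_per_sec is a hypothetical helper, and it assumes that 12bpp means exactly 1.5 bytes per pixel, that MB means 10^6 bytes, and that transfers have no per-update overhead.

```c
#include <assert.h>

/* Bytes per second needed to push every frame over the graphics bus. */
static double bus_bytes_per_sec(unsigned width, unsigned height,
                                unsigned bits_per_pixel, unsigned fps)
{
    assert(bits_per_pixel > 0 && fps > 0);
    return (double)width * height * bits_per_pixel / 8.0 * fps;
}
```

Under these assumptions 800x480 at 16bpp and 30fps comes to about 23.0 MB/s, matching the figure above, while 640x480 at 12bpp and 30fps comes to about 13.8 MB/s, somewhat more than the 12.3MB/s estimate. The question about the bus stays the same either way.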

> > N800 is almost able to play VGA resolution videos properly, it only needs
> > a bit more optimizations. Color format conversion performance for video
> > output is one of the important things that can be improved.
>
> I don't believe it's on the critical path. The optimisation I mentioned
> before will bring us up to the point where any improvement that we can
> make in that conversion will be eclipsed by the time taken to send it
> over the bus, I believe. But I can't prove that.

Well, I believe that every optimization which can provide a visible
improvement (at least a few percents) is worth it. Optimizations are
cumulative, a number of small 1-3% improvements added together result
in a significant performance boost.

The opposite is also true, if one is lazy and adds inefficient code here and
there, these small performance regressions accumulate and the program
starts to crawl :)

Of course everything depends on the task that is being solved, sometimes
optimizations do not make much sense and are too expensive. For example,
waiting for 10% more time to get some data processed may be even unnoticeable.
But video should be decoded in realtime, and the same 10% of performance
difference may have a huge effect (result in a watchable or totally
unwatchable movie).

> > Well, it was just a comment for 'omapCopyPlanarDataYUV420' function
> > wrong and misleading, nevermind :-) Now everything is clear.
>
> Hmm, is it? Because, unless I was _really_ tired at the time I wrote it
> (which is entirely possible), that's what the code does, and it works,
> so ...

Yes, this part seems wrong to me (maybe it was an old stale comment?):
/*
* Copy I420 data to the custom 'YUV420' format, which is actually:
* y11 u11,u12,u21,u22 u13,u14,u23,u24 y12 y14 y13
* y21 v11,v12,v21,v22 v13,v14,v23,v24 y22 y24 y23
* ...
*/

I was pretty much confused until I actually looked at the code. Shouldn't it
be something like this:

/*
* ... 'YUV420' format, which is actually:
* | u/v1 y1 y2 | u/v2 y3 y4 | ...
* ('u/v' means 'u' for even lines and 'v' for odd lines)
* ...
*/


daniel.stone at nokia

Apr 26, 2007, 6:43 PM


Views: 13303
Re: N800 & Video playback

On Fri, Apr 27, 2007 at 03:14:43AM +0300, ext Siarhei Siamashka wrote:
> On Tuesday 24 April 2007 12:36, Daniel Stone wrote:
> > Right, the branch is a problem, and as I said, the branch can be avoided
> > and the writes optimised to be three 32-bit writes for two macroblocks,
> > instead of two 32-bit writes and two 16-bit writes.
>
> I did not have much free time to do complete tests, but initial benchmarks
> show that actually even removing this branch and using three 16-bit writes
> improves performance quite significantly. The test program is here:
> http://ufo2000.sourceforge.net/files/yuv420test.c
>
> It produces the following results if compiled with optimization
> options "-O3 -fomit-frame-pointer -mcpu=arm1136j-s":
>
> # ./yuv420test
> test: 'yv12toyuv420_xomap', time=5.220, memory bandwidth=61.576MB/s
> test: 'yv12toyuv420_yv12toyuv420_branch_removed', time=3.503, memory bandwidth=91.754MB/s
>
> An interesting thing about this test is that it uses 2504 frames 400x240
> each, that's the same number of frames as Nokia_N800.avi video has.
> And mplayer spent 12,365s on video output when playing this video while
> YV12->YUV420 conversion should have taken 5.220s as benchmarked in
> this test. So now color conversion is roughly half of the time spent on video
> output for this resolution. Some tests with higher resolution videos will be
> done later.

Good news! Thanks for checking it out.

> As you see from the benchmark results, we can get 1.5x improvement
> already for color conversion with just a trivial removal of a piece of
> redundant code. Was that branch in the code supposed to improve
> performance? Seems like it resulted in quite the opposite effect.

You can't do unaligned writes to that area, so you need all 16-bit
writes or the branch. The branch has always been a giant FIXME, but it
never got properly sorted thanks to the magic of deadlines, and a ton of
corner cases popping up at the last minute. Sigh.

> I'll make a really optimized version of YV12 -> YUV420 convertor on this
> weekend (removing branch is good, but I feel that it can be improved
> more) and will try to use it on Nokia 770, any extra video performance
> improvement will be useful there. I hope that the framebuffer driver on
> Nokia 770 supports YUV420 color format properly.

I don't think Tornado supports YUV420, but I can check in the specs
tomorrow. My better C version basically does two macroblocks at a time,
ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
believe me). This eliminates the branch, since your surface is
guaranteed to be word-aligned, so if you do all 32-bit writes, you can
just drop the branch as you know every write will be aligned.

This will be really fast.

> By the way, does anybody know if it is possible to enable tearsync support
> on Nokia 770 (by backporting some changes from N800 kernel or in some
> other way)?

You can build the 770 kernel from the linux-omap tree, and support will
be there.

> > However, I don't think the lessons from the 770 are necessarily
> > _directly_ applicable to the N800: on the 770, our bottleneck is
> > decoding speed. The bottleneck on the N800 is exactly the opposite:
> > video output.
>
> I can't agree here. Memory speed is actually a lot faster on N800, the only
> trouble is graphics bus performance, but sending data to LCD controller
> through this bus does not introduce any load on ARM core and it can freely
> decode the next frame of video at the same time. At least this was the case
> with the previous version of firmware (I did not have enough time to see what
> was changed in framebuffer API and do any video tests with it).

Right.

> But color conversion is done by ARM core and it consumes precious cpu
> cycles which could be used for decoding higher resolution/bitrate video.
> Optimizing color conversion will improve video performance. The
> improvement will be most likely only within a few percents overall, but
> every little bit helps.

Indeed. However, once we remove stupid things like the branch which are
in the direct critical path of writes to the framebuffer, then I don't
think there's a hell of a lot left to gain. But I'm more than happy to
be proven wrong. :)

> > Bear in mind that, unless you explicitly disable it (the Xv attribute is
> > something like XV_OMAP_VSYNC), the X server _will_ flush all pending
> > writes before the next frame is put through. Else you get tearing,
> > because you can be halfway through an update, and writing the next frame
> > to the framebuffer, so which frame is being picked up, changes halfway
> > through.
> >
> > Try forcing XV_OMAP_VSYNC (or whatever it is) to 0, and comparing the
> > results.
>
> OK, thanks, I'll try this test too and check if it affects Xv performance.
> But I thought that using 12bpp color format _and_ sending only as much
> data as needed should solve the problem. Of course 800x480 * 16bpp * 30fps
> would be 23MB/s and it is too much. But for example 640x480 * 12bpp * 30fps =
> 12.3MB/s. Is the graphics bus fast enough to handle this?

I think we can do 12.3MB/s, yes.

> > before will bring us up to the point where any improvement that we can
> > make in that conversion will be eclipsed by the time taken to send it
> > over the bus, I believe. But I can't prove that.
>
> Well, I believe that every optimization which can provide a visible
> improvement (at least a few percents) is worth it. Optimizations are
> cumulative, a number of small 1-3% improvements added together result
> in a significant performance boost.

Sure. Unfortunately my job has other functions than to make video
decoding really, really fast, so I'm happy to merge, review, offer
feedback, and help you out where I can be useful, but I can't throw much
time at this myself.

> The opposite is also true, if one is lazy and adds inefficient code here and
> there, these small performance regressions accumulate and the program
> starts to crawl :)

Indeed, I already fixed quite a few of these cases all through the X
server ...

> > > Well, it was just a comment for 'omapCopyPlanarDataYUV420' function
> > > wrong and misleading, nevermind :-) Now everything is clear.
> >
> > Hmm, is it? Because, unless I was _really_ tired at the time I wrote it
> > (which is entirely possible), that's what the code does, and it works,
> > so ...
>
> Yes, this part seems wrong to me (maybe it was an old stale comment?):
> /*
> * Copy I420 data to the custom 'YUV420' format, which is actually:
> * y11 u11,u12,u21,u22 u13,u14,u23,u24 y12 y14 y13
> * y21 v11,v12,v21,v22 v13,v14,v23,v24 y22 y24 y23
> * ...
> */
>
> I was pretty much confused until actually looked at the code. Shouldn't it be
> something like this:
>
> /*
> * ... 'YUV420' format, which is actually:
> * | u/v1 y1 y2 | u/v2 y3 y4 | ...
> * ('u/v' means 'u' for even lines and 'v' for odd lines)
> * ...
> */

No?

Aligned:
*d2++ = (*sy & 0x000000ff) | (*sc << 8) | ((*sy & 0x0000ff00) << 16);

Unaligned:
*d1++ = (*sy & 0x000000ff) | ((*sc & 0x00ff) << 8);
*d1++ = ((*sc & 0xff00) >> 8) | (*sy & 0x0000ff00);

(Luma, chroma, chroma, luma.)
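
Read together with the unaligned path above, one group of four luma samples and two chroma samples comes out as three 16-bit values. A minimal C sketch of that packing follows, with a hypothetical function name and plain arrays instead of the real xserver pointers. The third halfword (fourth luma in the low byte, third luma in the high byte) is an assumption here, as the two lines above only show the first two writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: pack one line into 16-bit units, three per group of
 * four pixels, with no alignment branch. 'y' holds 'width' luma samples,
 * 'c' holds width/2 chroma samples, 'dst' receives width/4*3 halfwords.
 * 'width' must be a multiple of 4.
 */
static void pack_line_16bit(const uint8_t *y, const uint8_t *c,
                            uint16_t *dst, size_t width)
{
    assert(width % 4 == 0);
    for (size_t x = 0; x < width; x += 4) {
        *dst++ = (uint16_t)(y[0] | (c[0] << 8));   /* luma 1, chroma 1 */
        *dst++ = (uint16_t)(c[1] | (y[1] << 8));   /* chroma 2, luma 2 */
        *dst++ = (uint16_t)(y[3] | (y[2] << 8));   /* luma 4, luma 3 (assumed) */
        y += 4;
        c += 2;
    }
}
```

Every store is a 16-bit write to a 16-bit aligned address, so no alignment test is needed in the inner loop; the price is three writes per group instead of fewer, wider ones.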

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


siarhei.siamashka at gmail

Apr 30, 2007, 4:27 AM


Views: 13273
Re: N800 & Video playback

On Friday 27 April 2007 04:43, Daniel Stone wrote:

> > I'll make a really optimized version of YV12 -> YUV420 convertor on this
> > weekend (removing branch is good, but I feel that it can be improved
> > more) and will try to use it on Nokia 770, any extra video performance
> > improvement will be useful there. I hope that the framebuffer driver on
> > Nokia 770 supports YUV420 color format properly.
>
> I don't think Tornado supports YUV420, but I can check in the specs
> tomorrow. My better C version basically does two macroblocks at a time,
> ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
> believe me). This eliminates the branch, since your surface is
> guaranteed to be word-aligned, so if you do all 32-bit writes, you can
> just drop the branch as you know every write will be aligned.
>
> This will be really fast.

The optimized YV12 -> YUV420 converter is done. The sources can be found here:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer

Take a look at 'arm_colorconv.h' and 'arm_colorconv.S' files. Also there is a
test program ('test_colorconv') which can ensure that everything works
correctly and fast:

~ $ ./test_colorconv
test: 'yv12_to_yuv420_xomap',
time=7.332s, speed=32.878MP/s, memwritespeed=43.838MB/s

test: 'yv12_to_yuv420_xomap_nobranch',
time=5.679s, speed=42.448MP/s, memwritespeed=56.597MB/s

test: 'yv12_to_yuv420_line_arm_',
time=4.706s, speed=51.223MP/s, memwritespeed=68.297MB/s

test: 'yv12_to_yuv420_line_armv5_',
time=3.356s, speed=71.824MP/s, memwritespeed=95.765MB/s

test: 'yv12_to_yuv420_line_armv6_',
time=2.826s, speed=85.298MP/s, memwritespeed=113.731MB/s

The ARMv6-optimized YV12->YUV420 converter is about 2.5x faster
than the code currently used in the N800 xserver. So it should provide a nice
improvement for video :)

I doubt that your better C version can beat it or even get anywhere close.
There are two important optimizations in this code:
1. Cache prefetch with the PLD instruction (added in the '_armv5' version),
which boosts performance to 70 megapixels per second. The inner loop is
unrolled to process 32 pixels per iteration (the cache line size is 32 bytes
on these ARM cores, so such unrolling is convenient). This is the most
important improvement. You can try using __builtin_prefetch() from C code to
do the same optimization.
2. The use of the ARMv6 instruction REV16 to swap bytes within the high and
low 16-bit halves of a register. This optimization was added in the '_armv6'
version and boosted performance even further, to 85 megapixels per second.
It is highly unlikely, and probably impossible, to get this from a C version.
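
For the C route mentioned in point 1, a minimal sketch of a loop with explicit prefetch, assuming GCC or Clang (__builtin_prefetch is a compiler builtin, and on ARMv5TE and later it is expected to turn into PLD). The function name, the plain memcpy standing in for the real conversion, and the choice to prefetch exactly one line ahead are illustrative assumptions, not the code from 'arm_colorconv.S'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32   /* bytes, the cache line size quoted above */

/*
 * Hypothetical sketch: process 'len' bytes in cache-line sized chunks and
 * ask the CPU to start fetching the next source line while the current
 * one is being handled. The memcpy stands in for the conversion work.
 */
static void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL);
    for (; i + CACHE_LINE <= len; i += CACHE_LINE) {
        /* Read prefetch (0), no temporal locality (0). It is only a hint
         * and does not fault, even when it points just past the buffer. */
        __builtin_prefetch(src + i + CACHE_LINE, 0, 0);
        memcpy(dst + i, src + i, CACHE_LINE);
    }
    memcpy(dst + i, src + i, len - i);   /* tail shorter than one line */
}
```

Whether this gets anywhere near the assembly version has to be measured; the prefetch distance in particular usually needs tuning per CPU.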

I was a bit wrong about YUV420 format in my previous post.

Suppose we have planar YV12 image with the following data.
Y plane: Y1 Y2 Y3 Y4 ...
U plane: U1 __ U2 __ ...

Normal YUV420 (according to pictures in Epson docs) would be the following:
U1 Y1 Y2 U2 Y3 Y4 ...

But it appears (most likely because of the 16-bit interface and some endian
differences between ARM and the Epson chip) that each pair of bytes is
swapped and we actually get the following somewhat weird layout:
Y1 U1 U2 Y2 Y4 Y3 ...

To do this byteswapping, ARMv6 instruction REV16 is very handy.
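
The layout described above can be sketched in portable C as follows. This is only an illustration under assumptions: the function names are hypothetical, rev16 is a plain C stand-in for the ARMv6 REV16 instruction, and the pattern is assumed to repeat unchanged for every group of four pixels. It is not the code from 'arm_colorconv.S'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for REV16: swap the two bytes inside each 16-bit half. */
static uint32_t rev16(uint32_t x)
{
    return ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
}

/*
 * Hypothetical sketch: convert one line to the byte-swapped YUV420 layout.
 * 'y' has 'width' luma samples, 'c' has width/2 chroma samples for this
 * line, 'dst' receives width*3/2 bytes. 'width' must be a multiple of 8 so
 * that the output is a whole number of 32-bit words.
 */
static void yv12_line_to_swapped_yuv420(const uint8_t *y, const uint8_t *c,
                                        uint8_t *dst, size_t width)
{
    assert(width % 8 == 0);
    for (size_t x = 0; x < width; x += 8, y += 8, c += 4, dst += 12) {
        /* "Normal" order for two groups: U1 Y1 Y2 U2 Y3 Y4 U3 Y5 Y6 U4 Y7 Y8 */
        uint8_t normal[12] = { c[0], y[0], y[1], c[1], y[2], y[3],
                               c[2], y[4], y[5], c[3], y[6], y[7] };
        for (size_t w = 0; w < 12; w += 4) {
            uint32_t word;
            memcpy(&word, normal + w, 4);   /* alignment-safe load */
            word = rev16(word);             /* swap each pair of bytes */
            memcpy(dst + w, &word, 4);      /* alignment-safe store */
        }
    }
}
```

For luma 1..8 and chroma 0x11..0x14 this produces 1, 0x11, 0x12, 2, 4, 3, 5, 0x13, 0x14, 6, 8, 7, which is the Y1 U1 U2 Y2 Y4 Y3 pattern repeated twice. Swapping bytes inside each halfword gives the same memory image on little-endian and big-endian hosts, so the sketch does not depend on the host byte order.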

The assembly sources for ARMv6 code look a bit messy because
instruction reordering was needed to correctly schedule them and avoid
ARM11 pipeline interlocks which negatively affect performance. Now this
code is really fast with very little or no interlocks in the inner loop. And
gcc does not do a good job optimizing code on ARM, so C implementation
would be also at disadvantage here.

By the way, the benchmarks posted in my previous message should be
discarded. I did not initialize the source buffers that time, and it looks
like the ARM11 CPU has some 'cheat' which allows it to treat empty data pages
in some special way and avoid reading from memory. So the numbers posted in
the previous benchmark were higher than usual. Now it is corrected.

As for the other possible Xv optimizations. You mentioned that fallback code
is not important at all. But imagine 640x480 video playback in windowed
mode. Decoding it will require quite a lot of resources, but additionally
scaling it down using a slow fallback code will be a finishing blow. In
addition, a solution (fast JIT accelerated YV12->YUY2 scaler) for this
problem already exists. I can also modify this scaler to support
YV12->YUV420 scaling. An interesting thing here is that this scaler
could be also used by xserver to solve graphics bus bandwidth
issues. Imagine that we have some high resolution video with high
framerate which exceeds graphics bus capabilities. In this case
this video can be downscaled in software using JIT scaler to lower
resolution before sending data to LCD controller. What do you think?

> Sure. Unfortunately my job has other functions than to make video
> decoding really, really fast, so I'm happy to merge, review, offer
> feedback, and help you out where I can be useful, but I can't throw much
> time at this myself.

That's fine. Now I'm waiting for further instructions :) Should I try to
prepare a complete patch for xserver? I'm really interested in getting
this optimization into xserver as it would help to play high resolution
videos. If you have any extra questions about the code or anything
else (for example I wonder what free license would be appriopriate
for it), don't hesitate to contact me.

I did not try to build xserver sources yet as I did not have enough time
for that and xserver requires quite a number of build dependencies. Can
you share some tips and tricks about maemo xserver development? Is it
difficult to compile (do I need any extra build scripts, tools, or
configuration options) and install on N800 (is it safe to upgrade
xserver on N800 from .deb file)?


I also tried to use YUV420 on Nokia 770, but it did not work well. According
to Epson, this format should be supported by hardware. Also there is a
constant OMAPFB_COLOR_YUV420 defined in omapfb.h in Nokia 770 kernel
sources. But actually using YUV420 was not very successful. Full screen update
800x480 in YUV420 seems to deadlock Nokia 770. Playback of centered
640x480 video in YUV420 format was a bit better, at least I could decipher
what's on the screen. But anyway, it looked like an old broken TV :) Image was
not fixed but floating up and down, there were mirrors, tearings, some color
distortion, etc. After video playback finished, the screen remained in
inconsistent state with a striped garbage displayed on it. Starting video
playback with YUY2 output fixed it. But anyway, looks like YUV420 is not
supported properly in the framebuffer driver from the latest OS2006 kernel.
That's bad, it could provide ~30% improvement in video output performance
for Nokia 770. Maybe upgrading framebuffer driver can fix this issue (and add
tearsync support).
_______________________________________________
maemo-developers mailing list
maemo-developers [at] maemo
https://maemo.org/mailman/listinfo/maemo-developers


daniel.stone at nokia

Apr 30, 2007, 7:49 AM


Views: 13270
Re: N800 & Video playback

Hi,

On Mon, Apr 30, 2007 at 02:27:49PM +0300, ext Siarhei Siamashka wrote:
> On Friday 27 April 2007 04:43, Daniel Stone wrote:
> > I don't think Tornado supports YUV420, but I can check in the specs
> > tomorrow. My better C version basically does two macroblocks at a time,
> > ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
> > believe me). This eliminates the branch, since your surface is
> > guaranteed to be word-aligned, so if you do all 32-bit writes, you can
> > just drop the branch as you know every write will be aligned.
> >
> > This will be really fast.
>
> Optimized YV12 -> YUV420 convertor is done. The sources can be found here:
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer
>
> Take a look at 'arm_colorconv.h' and 'arm_colorconv.S' files. Also there is a
> test program ('test_colorconv') which can ensure that everything works
> correctly and fast:
>
> ~ $ ./test_colorconv
> [results follow]
>
> ARMv6 optimized YV12->YUV420 convertor is about 2.5x faster
> than current code used in N800 xserver. So it should provide a nice
> improvement for video :)

Indeed. Unfortunately this is slightly misleading in that it only shows
the raw write speed. RFBI can't deal with the sorts of speeds that your
hyper-optimised version is pumping out, e.g. So it's mainly just about
cutting the latency into the critical path to low enough that it makes
no difference.

> I doubt that your better C version can beat it or even get any close.

Of course not.

> There are two important optimizations in this code:
> 1. Cache prefetch with PLD instruction (added in '_armv5' version) which
> boosts performance to 70 megapixels per second. Inner loop is unrolled
> to process 32 pixels per iteration (cache line size is 32 bytes on ARM, so
> such unrolling is convenient). This is the most important improvement.
> You can try using __builtin_prefetch() from C code to do the same
> optimization.

Ah, sounds useful. From what Dan Amelang's been saying on xorg@, gcc
should coalesce four 32-bit reads into one 128-bit read, but this sounds
promising as well.
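
As a rough illustration of the __builtin_prefetch() suggestion, here is a minimal C sketch. It is not the code from arm_colorconv.S: the function name copy_with_prefetch is hypothetical, the loop is a plain byte copy rather than a color conversion, and the 32-byte chunk size simply follows the cache line size quoted above. __builtin_prefetch() is a GCC/Clang builtin and only a hint, so any gain has to be measured on the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n bytes in 32-byte chunks (one cache line per
 * chunk on ARM11), asking the cache to start loading the next chunk while
 * the current one is being processed. */
void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {
        /* Only prefetch when a full next chunk exists inside the buffer. */
        if (i + 64 <= n)
            __builtin_prefetch(src + i + 32, 0, 0); /* read, low temporal locality */

        for (size_t k = 0; k < 32; k++)
            dst[i + k] = src[i + k];
    }

    /* Tail shorter than one cache line. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

The prefetch distance (one line ahead here) is an assumption; a real conversion loop would need tuning.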

> 2. The use of ARMv6 instruction REV16 to do bytes swapping for high and low
> 16-bit register parts, this optimization was added in '_armv6' version and
> boosted performance even more to 85 megapixels per second. This
> optimization is highly unlikely, probably impossible, for a C version.

Sounds useful.
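
For reference, the effect of REV16 (swapping the two bytes inside each 16-bit half of a 32-bit word) can be written in portable C as a small sketch. The function name is hypothetical, and whether a given compiler turns this expression into a single REV16 instruction is an assumption that would have to be checked in the generated assembly.

```c
#include <stdint.h>

/* Swap the bytes within each 16-bit half of x:
 * bytes [A B C D] become [B A D C]. */
uint32_t swap_bytes_in_halfwords(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}
```

For example, 0x11223344 becomes 0x22114433.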

> I was a bit wrong about YUV420 format in my previous post.
>
> Suppose we have planar YV12 image with the following data.
> Y plane: Y1 Y2 Y3 Y4 ...
> U plane: U1 __ U2 __ ...
>
> Normal YUV420 (according to pictures in Epson docs) would be the following:
> U1 Y1 Y2 U2 Y3 Y4 ...
>
> But it appears (most likely because of 16-bit interface and some endian
> differences between ARM and Epson chip) that each pair of bytes is
> swapped and we actually get the following somewhat weird layout:
> Y1 U1 U2 Y2 Y4 Y3 ...

Right, hence the comment in the code is correct. ;)
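
The swapped layout quoted above (Y1 U1 U2 Y2 Y4 Y3) can be sketched in plain C as follows. This is only an illustration of the byte order, not the xserver or arm_colorconv.S code: the function name is hypothetical, and which chroma plane feeds a given line is not stated in this thread, so the chroma input is passed as a generic pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: pack one line of 'width' pixels (width must be a
 * multiple of 4) into the byte-swapped 12bpp layout.
 *   y:   'width' luma bytes for this line
 *   c:   'width / 2' chroma bytes for this line
 *   dst: receives 'width * 3 / 2' bytes, ordered Y1 C1 C2 Y2 Y4 Y3 ... */
void pack_yuv420_line_swapped(uint8_t *dst, const uint8_t *y,
                              const uint8_t *c, size_t width)
{
    for (size_t i = 0; i + 4 <= width; i += 4) {
        dst[0] = y[0];  /* unswapped pair would be C1 Y1 */
        dst[1] = c[0];
        dst[2] = c[1];  /* unswapped pair would be Y2 C2 */
        dst[3] = y[1];
        dst[4] = y[3];  /* unswapped pair would be Y3 Y4 */
        dst[5] = y[2];
        dst += 6;
        y += 4;
        c += 2;
    }
}
```

Four pixels produce six bytes, which matches the 12 bits per pixel of this format.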

> As for the other possible Xv optimizations. You mentioned that fallback code
> is not important at all. But imagine 640x480 video playback in windowed
> mode. Decoding it will require quite a lot of resources, but additionally
> scaling it down using a slow fallback code will be a finishing blow. In
> addition, a solution (fast JIT accelerated YV12->YUY2 scaler) for this
> problem already exists. I can also modify this scaler to support
> YV12->YUV420 scaling. An interesting thing here is that this scaler
> could be also used by xserver to solve graphics bus bandwidth
> issues. Imagine that we have some high resolution video with high
> framerate which exceeds graphics bus capabilities. In this case
> this video can be downscaled in software using JIT scaler to lower
> resolution before sending data to LCD controller. What do you think?

IMO this is a policy issue, and X is 'mechanism, not policy'. If you
want to adapt the scaler, I'm more than happy to include it, but I'm not
about to start doing automatic scaling.

IOW, 'ask a stupid question, get a stupid answer'.

> That's fine. Now I'm waiting for further instructions :) Should I try to
> prepare a complete patch for xserver? I'm really interested in getting
> this optimization into xserver as it would help to play high resolution
> videos. If you have any extra questions about the code or anything
> else (for example I wonder what free license would be appropriate
> for it), don't hesitate to contact me.

If you wanted to prepare a complete patch for the server, that would be
great, as I don't have time to get to it right now (trying to finish off
the merge with upstream, among others). As for the license, just the
standard MIT boilerplate in hw/kdrive/omap/* is fine, but replace Nokia
Corporation/Daniel Stone with Siarhei Siamashka, obviously.

> I did not try to build xserver sources yet as I did not have enough time
> for that and xserver requires quite a number of build dependencies. Can
> you share some tips and tricks about maemo xserver development. Is it
> difficult to compile (do I need any extra build scripts, tools, or
> configuration options) and install on N800 (is it safe to upgrade
> xserver on N800 from .deb file)?

It's completely safe to upgrade from a deb if it's not broken. If you
set up a standard Maemo build environment and run apt-get source
xorg-server and apt-get build-dep xorg-server, it should work just fine,
in theory.

I don't have any tips, per se. Once I get it all integrated it'll be in
git, but for now, the only public source is the packages.

> I also tried to use YUV420 on Nokia 770, but it did not work well. According
> to Epson, this format should be supported by hardware. Also there is a
> constant OMAPFB_COLOR_YUV420 defined in omapfb.h in Nokia 770 kernel
> sources. But actually using YUV420 was not very successful. Full screen update
> 800x480 in YUV420 seems to deadlock Nokia 770. Playback of centered
> 640x480 video in YUV420 format was a bit better, at least I could decipher
> what's on the screen. But anyway, it looked like an old broken TV :) Image was
> not fixed but floating up and down, there were mirrors, tearings, some color
> distortion, etc. After video playback finished, the screen remained in
> inconsistent state with a striped garbage displayed on it. Starting video
> playback with YUY2 output fixed it. But anyway, looks like YUV420 is not
> supported properly in the framebuffer driver from the latest OS2006 kernel.
> That's bad, it could provide ~30% improvement in video output performance
> for Nokia 770. Maybe upgrading framebuffer driver can fix this issue (and add
> tearsync support).

SoSSI is relatively quick, so you won't see much of a bandwidth win from
using YUV420 over YUV422. Aside from that, I don't know, though.

Thanks again for working on this; glad to see someone cares enough to
help sort it out. :)

Cheers,
Daniel


barbieri at gmail

Apr 30, 2007, 11:29 PM


Views: 13254
Re: N800 & Video playback

On 4/30/07, Siarhei Siamashka <siarhei.siamashka [at] gmail> wrote:
> On Friday 27 April 2007 04:43, Daniel Stone wrote:

[...]

Daniel, Siarhei, Eero: I always find that your mails provide a great deal
of technical information about the N800. However, we do not have a central place
for this information. It would be great if you guys set up a wiki
page with technical details about drivers, optimizations and weaknesses of
the current implementations, so others could base their work on it.

I see that Eero has a how to at:
http://maemo.org/platform/docs/howtos/howto_performance_test_process.html

Other docs would help too: ones describing the best fetch size, which
instructions that are usually cheap are badly implemented or slow on the
OMAP2420, etc.

Tools would be great as well. I see an OProfile kernel was suggested to Siarhei,
so it would be great to have it available for download on this wiki page as
well.

Thank you all for your great work! Keep it coming :-)

--
Gustavo Sverzut Barbieri


siarhei.siamashka at gmail

May 1, 2007, 1:51 AM


Views: 13277
Re: N800 & Video playback

On Monday 30 April 2007 17:49, Daniel Stone wrote:

> > ARMv6 optimized YV12->YUV420 convertor is about 2.5x faster
> > than current code used in N800 xserver. So it should provide a nice
> > improvement for video :)
>
> Indeed. Unfortunately this is slightly misleading in that it only shows
> the raw write speed. RFBI can't deal with the sorts of speeds that your
> hyper-optimised version is pumping out, e.g. So it's mainly just about
> cutting the latency into the critical path to low enough that it makes
> no difference.

The 'framebuffer' is just ordinary system memory, so converting the color format
and copying data to the framebuffer will run at the same speed as
simulated in this test. RFBI performance is only critical for the asynchronous
DMA data transfer to the LCD controller, which does not introduce any overhead
and is performed at the same time as the ARM core is doing some other work
(decoding the next frame). RFBI performance matters only if the data transfer to
the LCD is still not complete at the time when the next frame is already decoded
and is ready to be displayed. When playing video, the ARM core and the LCD
controller are almost always working at the same time, performing different
tasks in parallel. I think I have already explained these details in [1].
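
The overlap described above can be modelled with a small C sketch. This is only a simplified illustration, assuming one transfer in flight at a time and CPU work that runs fully in parallel with it; the function names are hypothetical and the numbers in the note below are illustrative, not measurements.

```c
/* Time the CPU spends blocked waiting for the previous screen update to
 * finish, given the transfer time and the CPU work (decoding plus color
 * conversion) done in parallel with it. */
double blocked_ms(double transfer_ms, double cpu_work_ms)
{
    return transfer_ms > cpu_work_ms ? transfer_ms - cpu_work_ms : 0.0;
}

/* Effective time per frame when the transfer overlaps with CPU work. */
double frame_period_ms(double transfer_ms, double cpu_work_ms)
{
    return cpu_work_ms + blocked_ms(transfer_ms, cpu_work_ms);
}
```

With a 41 ms transfer and 30 ms of CPU work per frame, the CPU is blocked for 11 ms and the bus is the limit. With 50 ms of CPU work it is never blocked, and only the CPU work counts.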

Since the xomap server is probably compiled for Thumb, I also compiled this
test program for the Thumb instruction set and got the following results
(Thumb is slower than normal ARM). I also fixed a bug in the test program
which made the memory throughput statistics slightly off, so
the following results should be final now:

# gcc -o test_colorconv -O2 -mthumb test_colorconv.c arm_colorconv.S

# ./test_colorconv
test: 'yv12_to_yuv420_xomap',
time=9.493s, speed=25.394MP/s, memwritespeed=38.091MB/s
test: 'yv12_to_yuv420_xomap_nobranch',
time=8.516s, speed=28.306MP/s, memwritespeed=42.460MB/s
test: 'yv12_to_yuv420_line_arm_',
time=4.736s, speed=50.895MP/s, memwritespeed=76.343MB/s
test: 'yv12_to_yuv420_line_armv5_',
time=3.395s, speed=71.011MP/s, memwritespeed=106.517MB/s
test: 'yv12_to_yuv420_line_armv6_',
time=2.876s, speed=83.817MP/s, memwritespeed=125.726MB/s

If you remember the information posted in [2], mplayer used 12 seconds
for video output when playing Nokia_N800.avi (it contains the same number
of frames of the same size as used in this test for benchmarking). Color
format conversion code taken from xserver and compiled for thumb uses
9.5 seconds for doing the same amount of work.

So now the results of the tests are consistent - when doing video output, most
of ARM core cycles are spent in this 'omapCopyPlanarDataYUV420' function.
Optimizing it using 'yv12_to_yuv420_line_armv6' will definitely provide a huge
effect, video output overhead when using Xv will be at least halved providing
more cpu resources for video decoding.
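
The benchmark figures above are consistent with YUV420 taking 12 bits (1.5 bytes) per pixel: each 'memwritespeed' value is 1.5 times the 'speed' value. A minimal C sketch of that arithmetic follows; the function names are hypothetical and are not part of test_colorconv.

```c
/* Bytes written per second for a 12 bits per pixel output format,
 * given the pixel rate in megapixels per second. */
double yuv420_write_mb_per_s(double megapixels_per_s)
{
    return megapixels_per_s * 12.0 / 8.0;
}

/* Ratio of two run times for the same amount of work. */
double speedup(double baseline_s, double optimized_s)
{
    return baseline_s / optimized_s;
}
```

For the Thumb build, 25.394 MP/s gives about 38.09 MB/s, and 9.493s against 2.876s is a speedup of roughly 3.3x, which agrees with the estimate that the overhead would be at least halved.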

> > That's fine. Now I'm waiting for further instructions :) Should I try to
> > prepare a complete patch for xserver? I'm really interested in getting
> > this optimization into xserver as it would help to play high resolution
> > videos. If you have any extra questions about the code or anything
> > else (for example I wonder what free license would be appropriate
> > for it), don't hesitate to contact me.
>
> If you wanted to prepare a complete patch for the server, that would be
> great, as I don't have time to get to it right now (trying to finish off
> the merge with upstream, among others). As for the license, just the
> standard MIT boilerplate in hw/kdrive/omap/* is fine, but replace Nokia
> Corporation/Daniel Stone with Siarhei Siamashka, obviously.
>
> > I did not try to build xserver sources yet as I did not have enough time
> > for that and xserver requires quite a number of build dependencies. Can
> > you share some tips and tricks about maemo xserver development. Is it
> > difficult to compile (do I need any extra build scripts, tools, or
> > configuration options) and install on N800 (is it safe to upgrade
> > xserver on N800 from .deb file)?
>
> It's completely safe to upgrade from a deb if it's not broken. If you
> set up a standard Maemo build environment and run apt-get source
> xorg-server and apt-get build-dep xorg-server, it should work just fine,
> in theory.
>
> I don't have any tips, per se. Once I get it all integrated it'll be in
> git, but for now, the only public source is the packages.

OK, thanks. It may take some time though. I'm still using old scratchbox
with mistral SDK here (did not have enough free time to upgrade yet). Until I
clean up my scratchbox mess, I can only provide some patch without testing, if
anybody courageous can try to build it :)

> > I also tried to use YUV420 on Nokia 770, but it did not work well.
> > According to Epson, this format should be supported by hardware. Also
> > there is a constant OMAPFB_COLOR_YUV420 defined in omapfb.h in Nokia 770
> > kernel sources. But actually using YUV420 was not very successful. Full
> > screen update 800x480 in YUV420 seems to deadlock Nokia 770. Playback of
> > centered 640x480 video in YUV420 format was a bit better, at least I
> > could decipher what's on the screen. But anyway, it looked like an old
> > broken TV :) Image was not fixed but floating up and down, there were
> > mirrors, tearings, some color distortion, etc. After video playback
> > finished, the screen remained in inconsistent state with a striped
> > garbage displayed on it. Starting video playback with YUY2 output fixed
> > it. But anyway, looks like YUV420 is not supported properly in the
> > framebuffer driver from the latest OS2006 kernel. That's bad, it could
> > provide ~30% improvement in video output performance for Nokia 770. Maybe
> > upgrading framebuffer driver can fix this issue (and add tearsync
> > support).
>
> SoSSI is relatively quick, so you won't see much of a bandwidth win from
> using YUV420 over YUV422. Aside from that, I don't know, though.

I do know that I will get this 30% improvement for video output, considering
all the information I have and initial test results. I just need an updated
Nokia 770 kernel with a proper YUV420 support. I also hope that this kernel
(if it becomes available) will be included into one of the next "unofficial"
hackers edition firmware updates eventually.

Anyway, after having failed to use YUV420 with direct framebuffer access on
Nokia 770, I tried the same code on N800 and surprisingly it worked perfectly.
I only had to figure out some information about the framebuffer layout. It is
actually quite simple. When working with the framebuffer and performing
YUV420 screen updates, the framebuffer can be treated as having the same
layout as in RGB565 mode (two bytes for each pixel). Any rectangular area
within this 16bpp framebuffer can be updated in YUV420 mode. Each line
of pixels in this rectangular area can be filled with YUV420 data. Of course,
this YUV420 data will be shorter than the length of the line (the end of the
line will be unused), but the screen update ioctl works fine. It works in a
similar way to pixel doubling, where a rectangular block of pixels is expanded
twofold and covers much more area on the screen than in the framebuffer.
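
One reading of the layout described above can be sketched in C as offset arithmetic. This is an assumption-laden illustration, not driver code: the function names are hypothetical, the framebuffer is assumed to be 2 bytes per pixel with no extra padding between lines, and YUV420 is taken as 12 bits per pixel.

```c
#include <stddef.h>

/* Byte offset of line 'row' of a rectangle whose top-left corner is at
 * (x, y), inside a 16bpp framebuffer that is fb_width pixels wide. */
size_t rect_line_offset(size_t fb_width, size_t x, size_t y, size_t row)
{
    return ((y + row) * fb_width + x) * 2;
}

/* Bytes of YUV420 data stored at the start of each rectangle line. */
size_t yuv420_line_bytes(size_t rect_width)
{
    return rect_width * 3 / 2;
}

/* Bytes left unused at the end of each rectangle line. */
size_t unused_line_bytes(size_t rect_width)
{
    return rect_width * 2 - yuv420_line_bytes(rect_width);
}
```

For a 640 pixel wide rectangle, each line holds 960 bytes of YUV420 data and leaves 320 bytes unused.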

Well, anyway, everything worked perfectly and I could play 640x480 video
on N800 with the following statistics:

VIDEO: [DIVX] 640x480 12bpp 23.976 fps 886.7 kbps (108.2 kbyte/s)
...
BENCHMARKs: VC: 87,757s VO: 8,712s A: 1,314s Sys: 3,835s = 101,618s
BENCHMARK%: VC: 86,3592% VO: 8,5736% A: 1,2932% Sys: 3,7740% = 100,0000%
BENCHMARKn: disp: 2044 (20,11 fps) drop: 355 (14%) total: 2399 (23,61 fps)

As you can see, mplayer took 8.712 seconds to display 2044 VGA resolution
frames. Doing the calculation 2044 * 640 * 480 / 8.712s, that's 72 million
pixels per second, quite close to the limit of 'yv12_to_yuv420_line_armv6', so
this function is the only major contributor to video output time. Video output
took much less time than decoding, which shows that video output
overhead can be reduced to a minimum (tearsync was not used in this test,
though).

The same file played with Xv video output and also tearsync disabled
(XV_OMAP_VSYNC explicitly set to 0):

BENCHMARKs: VC: 77,176s VO: 19,550s A: 1,880s Sys: 3,851s = 102,457s
BENCHMARK%: VC: 75,3260% VO: 19,0809% A: 1,8346% Sys: 3,7586% = 100,0000%
BENCHMARKn: disp: 1637 (15,98 fps) drop: 762 (31%) total: 2399 (23,41 fps)

Performing the calculation 1637 * 640 * 480 / 19.550s, we get 26 million
pixels per second, which is also more or less consistent
with the 'yv12_to_yuv420_xomap' benchmark statistics.
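
The pixel rate estimates for both runs follow the same formula, shown here as a minimal C sketch. The function name is hypothetical; the inputs are the 'disp' frame count and the 'VO' time from the MPlayer benchmark output.

```c
/* Displayed pixels per second of video output time. */
double pixels_per_second(unsigned long frames, unsigned width,
                         unsigned height, double vo_seconds)
{
    return (double)frames * width * height / vo_seconds;
}
```

With 2044 frames in 8.712s this gives about 72 million pixels per second for direct framebuffer output, and with 1637 frames in 19.550s about 26 million for Xv.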

When tearsync comes into action, everything gets a bit more complicated. I'm
still investigating its impact on video playback performance.

I'm going to continue working on YUV420 direct framebuffer video output
for N800 for the next build of mplayer, as this code could also be used on
Nokia 770 if it gets YUV420 support. Also, while this method of video output
does not support hardware scaling, it seems to be quite good for unscaled VGA
resolution videos and may serve as a temporary solution until we get an upgrade
to a new xserver with yv12->yuv420 conversion optimizations.


1. http://maemo.org/pipermail/maemo-developers/2007-March/009202.html
2. http://maemo.org/pipermail/maemo-developers/2007-April/009925.html


kalle.vahlman at gmail

May 1, 2007, 3:36 AM


Views: 13233
Re: N800 & Video playback

2007/5/1, Siarhei Siamashka <siarhei.siamashka [at] gmail>:
> On Monday 30 April 2007 17:49, Daniel Stone wrote:
> > It's completely safe to upgrade from a deb if it's not broken. If you
> > set up a standard Maemo build environment and run apt-get source
> > xorg-server and apt-get build-dep xorg-server, it should work just fine,
> > in theory.
> >
> > I don't have any tips, per se. Once I get it all integrated it'll be in
> > git, but for now, the only public source is the packages.
>
> OK, thanks. It may take some time though. I'm still using old scratchbox
> with mistral SDK here (did not have enough free time to upgrade yet). Until I
> clean up my scratchbox mess, I can only provide some patch without testing, if
> anybody courageous can try to build it :)

Given that I fear not the perils of building a X server with
nonstandard options[1], I shall be more than happy to conduct such
adventurous acts :)

And unless Mr. Kulve has objections, the results could be installed
from a repository as well.

[1] http://syslog.movial.fi/archives/47-Shadows-for-everyone-well,-not-really.html

--
Kalle Vahlman, zuh [at] iki
Powered by http://movial.fi
Interesting stuff at http://syslog.movial.fi


siarhei.siamashka at gmail

May 1, 2007, 5:38 AM


Views: 13253
Re: N800 & Video playback

On Tuesday 01 May 2007 13:36, Kalle Vahlman wrote:
> 2007/5/1, Siarhei Siamashka <siarhei.siamashka [at] gmail>:
> > OK, thanks. It may take some time though. I'm still using old scratchbox
> > with mistral SDK here (did not have enough free time to upgrade yet).
> > Until I clean up my scratchbox mess, I can only provide some patch
> > without testing, if anybody courageous can try to build it :)
>
> Given that I fear not the perils of building a X server with
> nonstandard options[1], I shall be more than happy to conduct such
> adventurous acts :)
>
> And unless Mr. Kulve has objections, the results could be installed
> from a repository as well.
>
> [1] http://syslog.movial.fi/archives/47-Shadows-for-everyone-well,-not-really.html

OK, here is an untested patch for xserver to add ARMv6 optimized
YUV420 color format conversion. Theoretically it should compile
(I did not try to build xserver myself though) and work. If it refuses to
compile, fixing the patch should not be too difficult.

In the worst case only video playback may be broken. But if everything works
as expected, video output performance should become a lot better.

Video output performance can be tested by mplayer using -benchmark
option, 'VO:' stat shows how much time was used for video output, 'VC:' stat
shows how much time was used for video decoding.

The built-in video player should also become faster. I don't know if this
improvement can be 'scientifically' benchmarked, but it should drop fewer
frames on high resolution video playback.

If any of you can build an xserver package with this patch, please put it up for
download somewhere or send it directly to me.

Thanks.
Attachments: xomap_yuv420patch.diff (13.0 KB)


dufkaf at seznam

May 1, 2007, 6:19 AM


Views: 13237
Re: N800 & Video playback

Siarhei Siamashka wrote:

> OK, here is an untested patch for xserver to add ARMv6 optimized
> YUV420 color format conversion. Theoretically it should compile
> (I did not try to build xserver myself though) and work. If it refuses to
> compile, fixing the patch should be not too difficult.

It does not apply for me but should be trivial to fix.

[sbox-SDK_ARMEL: ~/x/xorg-server-1.1.99.3] > patch -p1 < ../xomap_yuv420patch.diff
patching file hw/kdrive/omap/Makefile.am
Hunk #1 FAILED at 1.
Hunk #2 FAILED at 34.
2 out of 2 hunks FAILED -- saving rejects to file hw/kdrive/omap/Makefile.am.rej
patching file hw/kdrive/omap/omap_colorconv.h
patching file hw/kdrive/omap/omap_colorconv.S
patching file hw/kdrive/omap/omap_video.c
Hunk #1 FAILED at 39.
Hunk #2 FAILED at 468.
Hunk #3 FAILED at 491.
3 out of 3 hunks FAILED -- saving rejects to file hw/kdrive/omap/omap_video.c.rej

Will try this evening. I wonder who has an older X server version.

Frantisek


kalle.vahlman at gmail

May 1, 2007, 7:49 AM


Views: 13273
Re: N800 & Video playback

2007/5/1, Siarhei Siamashka <siarhei.siamashka [at] gmail>:
> On Tuesday 01 May 2007 13:36, Kalle Vahlman wrote:
> > 2007/5/1, Siarhei Siamashka <siarhei.siamashka [at] gmail>:
> > > OK, thanks. It may take some time though. I'm still using old scratchbox
> > > with mistral SDK here (did not have enough free time to upgrade yet).
> > > Until I clean up my scratchbox mess, I can only provide some patch
> > > without testing, if anybody courageous can try to build it :)
> >
> > Given that I fear not the perils of building a X server with
> > nonstandard options[1], I shall be more than happy to conduct such
> > adventurous acts :)
> >
> > And unless Mr. Kulve has objections, the results could be installed
> > from a repository as well.
> >
> > [1] http://syslog.movial.fi/archives/47-Shadows-for-everyone-well,-not-really.html
>
> OK, here is an untested patch for xserver to add ARMv6 optimized
> YUV420 color format conversion. Theoretically it should compile
> (I did not try to build xserver myself though) and work. If it refuses to
> compile, fixing the patch should be not too difficult.

Applied and built without problems for me.

For testing, I fabricated some video with gstreamer:

gst-launch-0.10 videotestsrc num-buffers=300 \
! "video/x-raw-yuv, width=640, height=480" \
! ffenc_mpeg4 ! avimux \
! filesink location=640x480.avi

which resulted in 640x480 @ 30fps and 800x480 @ 30fps videos. For some
reason 320x240 and 352x288 refused to play with:

X11 error: BadValue (integer parameter out of range for operation)
MPlayer interrupted by signal 6 in module: flip_page

while gstreamer did play them just fine. Also the Nokia_N800.avi and
NokiaN93.avi died in the same way. My mplayer is compiled from the svn
trunk of the garage project, with some additional cflags I use (so
maybe those were the problem...).

Anyway, then I shut down af-base-apps and matchbox (to avoid scaling
the video) and ran "mplayer -benchmark <file>".

> In the worst case only video playback may be broken. But if everything works
> as expected, video output performance should become a lot better.
>
> Video output performance can be tested by mplayer using -benchmark
> option, 'VO:' stat shows how much time was used for video output, 'VC:' stat
> shows how much time was used for video decoding.

There's something fishy in the decoding or something as the color bars
in the test video were broken (yellow and cyan to be precise), but
that seemed to be the case in a "vanilla" image too so nothing to do
with this patch. I could not see any other glitches in the output.

But on to the results:

VIDEO: [DX50] 640x480 24bpp 30.000 fps 1597.6 kbps (195.0 kbyte/s)

Original:
V: 10.0 300/300 44% 74% 0.0% 0 0 0%
BENCHMARKs: VC: 4.387s VO: 7.436s A: 0.000s Sys: 0.482s = 12.305s
BENCHMARK%: VC: 35.6503% VO: 60.4311% A: 0.0000% Sys: 3.9185% = 100.0000%


Patched:
V: 10.0 300/300 42% 72% 0.0% 0 0 0%
BENCHMARKs: VC: 4.213s VO: 7.265s A: 0.000s Sys: 0.381s = 11.859s
BENCHMARK%: VC: 35.5296% VO: 61.2604% A: 0.0000% Sys: 3.2100% = 100.0000%

---

VIDEO: [DX50] 800x480 24bpp 30.000 fps 1976.5 kbps (241.3 kbyte/s)

Original:
V: 10.0 300/300 54% 114% 0.0% 0 0 0%
BENCHMARKs: VC: 5.466s VO: 11.456s A: 0.000s Sys: 0.366s = 17.287s
BENCHMARK%: VC: 31.6179% VO: 66.2677% A: 0.0000% Sys: 2.1144% = 100.0000%

Patched:
V: 10.0 300/300 53% 70% 0.0% 0 0 0%
BENCHMARKs: VC: 5.346s VO: 7.043s A: 0.000s Sys: 0.449s = 12.838s
BENCHMARK%: VC: 41.6414% VO: 54.8602% A: 0.0000% Sys: 3.4984% = 100.0000%

There is a clear drop in the amount of time used to output the videos for
800x480 (the numbers were stable through multiple runs).

So I gather from the >10s benchmark time that we didn't get to real
time yet, but close to it? And of course this is just video, audio
decoding should be considered for real video playback performance
measurement.
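
The real time question can be checked with simple arithmetic, sketched here in C. The function names are hypothetical, and the check ignores audio decoding and frame dropping, so it is only a first approximation: playback keeps up if the total benchmark time does not exceed the duration of the clip.

```c
/* Duration of the clip at its nominal frame rate, in seconds. */
double clip_duration_s(unsigned frames, double fps)
{
    return frames / fps;
}

/* Returns 1 if the total processing time fits within the clip duration. */
int plays_in_real_time(unsigned frames, double fps, double total_s)
{
    return total_s <= clip_duration_s(frames, fps);
}
```

For the 300 frame, 30 fps test clips the budget is 10 seconds, so the totals of 11.859s (640x480, patched) and 12.838s (800x480, patched) are both above it.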

> If any of you can build xserver package with this patch, please put it for
> download somewhere or send directly to me.

I put the deb up at:

http://iki.fi/zuh/xserver-xomap_1.1.99.3-0.zuh2_armel.deb

until I get it to the repository. This version also has the composite
extension enabled, but AFAIK it does not depend on the libs or change
server behaviour if composite is not specifically used.

The server *should* be compiled with '-mcpu=arm1136j-s -mfpu=vfp
-mfloat-abi=softfp -O2', but as I had troubles with the
SBOX_EXTRA_COMPILER_ARGS env var being honored some time ago I'm not
guaranteeing it at the moment ;)

--
Kalle Vahlman, zuh [at] iki
Powered by http://movial.fi
Interesting stuff at http://syslog.movial.fi


kalle.vahlman at gmail

May 1, 2007, 7:54 AM


Views: 9927
Re: N800 & Video playback

2007/5/1, Kalle Vahlman <kalle.vahlman [at] gmail>:
> The server *should* be compiled with '-mcpu=arm1136j-s -mfpu=vfp
> -mfloat-abi=softfp -O2', but as I had troubles with the
> SBOX_EXTRA_COMPILER_ARGS env var being honored some time ago I'm not
> guaranteeing it at the moment ;)

Actually, it seems that I had added the env var to the rules file, so it
*is* built with those options.

I can produce a build without them if need be (it does affect
performance in my experience, so if one wants to see the impact of
that patch on a more "normal" version...).

--
Kalle Vahlman, zuh [at] iki
Powered by http://movial.fi
Interesting stuff at http://syslog.movial.fi


siarhei.siamashka at gmail

May 1, 2007, 10:49 AM


Views: 9946
Re: N800 & Video playback

On Tuesday 01 May 2007 17:49, Kalle Vahlman wrote:
> > OK, here is an untested patch for xserver to add ARMv6 optimized
> > YUV420 color format conversion. Theoretically it should compile
> > (I did not try to build xserver myself though) and work. If it refuses to
> > compile, fixing the patch should be not too difficult.
>
> Applied and build without problems for me.

Thanks a lot for building the package and putting it for download, everything
seems to be fine, but more details will follow below.

> For testing, I fabricated some video with gstreamer:
>
> which resulted in 640x480 @ 30fps and 800x480 @ 30fps videos. For some
> reason 320x240 and 352x288 refused to play with:
>
> X11 error: BadValue (integer parameter out of range for operation)
> MPlayer interrupted by signal 6 in module: flip_page
>
> while gstreamer did play them just fine. Also the Nokia_N800.avi and
> NokiaN93.avi died in the same way.

This X11 error on video playback start and also sometimes on switching
fullscreen/windowed mode is a known problem [1] reported in this mailing list.

If MPlayer dies on start, usually trying to start it again succeeds. So these
320x240 and 352x288 videos could be played as well if you were a bit more
persistent :)

As Daniel replied in one of the followup messages, it is most likely some race
condition. The question is which code is a suspect. Is it MPlayer Xv video
output code that has been around for ages and worked fine on different systems
or relatively new Xv extension code from N800 xserver? In addition, a previous
revision of N800 firmware had a serious bug [2] related to video playback. It
should be noted that MPlayer needed only about 1 minute to freeze on the
initial N800 firmware. So the problem could be identified much more easily
if MPlayer was included in the standard set of tests done by Nokia QA staff
before each new IT OS release. Surely, Nokia is only interested in a
properly working xvimagesink for the software included in IT OS by default.
But testing with more client applications can improve overall xserver quality.

With all that said, I don't know if MPlayer Xv code is bugfree, it wasn't me
who developed it.

> My mplayer is compiled from the svn
> trunk of the garage project, with some additional cflags I use (so
> maybe those were the problem...).

Do you have a set of cflags settings which work better than the default set?
Can you share this information?

> There's something fishy in the decoding or something as the color bars
> in the test video were broken (yellow and cyan to be precise), but
> that seemed to be the case in a "vanilla" image too so nothing to do
> with this patch. I could not see any other glitches in the output.
>
> But on to the results:
>
> VIDEO: [DX50] 640x480 24bpp 30.000 fps 1597.6 kbps (195.0 kbyte/s)
[snip]
> VIDEO: [DX50] 800x480 24bpp 30.000 fps 1976.5 kbps (241.3 kbyte/s)
[snip]
> There is a clear drop in amount of time used to output the videos for
> 800x480 (the numbers were stable through multiple runs).
>
> So I gather from the >10s benchmark time that we didn't get to real
> time yet, but close to it? And of course this is just video, audio
> decoding should be considered for real video playback performance
> measurement.

These videos are way too heavy for the N800 to decode and play in realtime. We
can expect realtime playback for videos up to 640x480 resolution with <1000kbps
bitrate and 24fps. This is probably the current realistic limit. Some minor
variations to these parameters are possible (for example, we can get 30fps, but
then the resolution or bitrate has to be reduced, etc.).

If you want a guaranteed video playback with divx/xvid/mpeg4 codecs, you
should restrict to 512x384 resolution or lower and keep bitrate reasonable.

The results you have posted for these 'insane' videos are somewhat weird.
Complete statistics would also need the number of dropped frames, otherwise
we don't know how much work was done by the player. Probably the missing audio
track prevented MPlayer from providing a proper report, but I don't know for
sure. It is also strange that you did not see any improvement at all for the
640x480 video. Are you sure you really tested it with the patched xserver?

Anyway, the new xserver package works really well. If we do some tests with
the standard Nokia_N800.avi video clip, we get the following results with the
patched xserver:

# mplayer -benchmark -quiet -noaspect Nokia_N800.avi
BENCHMARKs: VC: 29,764s VO: 7,666s A: 0,468s Sys: 64,635s = 102,534s
BENCHMARK%: VC: 29,0287% VO: 7,4767% A: 0,4565% Sys: 63,0381% = 100,0000%
BENCHMARKn: disp: 2504 (24,42 fps) drop: 0 (0%) total: 2504 (24,42 fps)

# mplayer -benchmark -quiet -noaspect -dr -nomenu Nokia_N800.avi
BENCHMARKs: VC: 30,266s VO: 5,490s A: 0,467s Sys: 66,286s = 102,509s
BENCHMARK%: VC: 29,5255% VO: 5,3554% A: 0,4560% Sys: 64,6631% = 100,0000%
BENCHMARKn: disp: 2501 (24,40 fps) drop: 0 (0%) total: 2501 (24,40 fps)
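
For anyone who wants to compare such runs, here is a minimal Python sketch that
parses a 'BENCHMARKs' line and recomputes the average frame rate from the
number of displayed frames. The function names are hypothetical helpers and not
part of MPlayer; the only assumptions are the line layout shown above and the
comma used as decimal separator.

```python
import re

def _to_float(text):
    # The benchmark output above uses a comma as decimal separator.
    return float(text.replace(",", "."))

def parse_benchmark_seconds(line):
    """Parse an MPlayer 'BENCHMARKs:' line into a dict of seconds."""
    # Per-component entries such as 'VC: 29,764s'
    parts = {name: _to_float(value)
             for name, value in re.findall(r"(\w+):\s*([\d,]+)s", line)}
    # Grand total after the '=' sign
    total = re.search(r"=\s*([\d,]+)s", line)
    parts["total"] = _to_float(total.group(1))
    return parts

def average_fps(frames, total_seconds):
    return frames / total_seconds

line = "BENCHMARKs: VC: 29,764s VO: 7,666s A: 0,468s Sys: 64,635s = 102,534s"
stats = parse_benchmark_seconds(line)
print(stats)
print(round(average_fps(2504, stats["total"]), 2))
```

The recomputed value agrees with the fps figure MPlayer prints on the
'BENCHMARKn' line for the first run.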

Results with unpatched xserver and some more explanations can be found in [3].
Yes, now N800 is faster than Nokia 770 for video output performance at last :)

Video output overhead on N800 is really at least halved. Of course, video
output takes only some fraction of time in video player. So overall
performance improvement for Nokia_N800.avi playback is approximately 20%
but not 250%-300% which can be observed for 'omapCopyPlanarDataYUV420'
function alone.

In my opinion, that's a good result for less than a week of work (a few
evenings for initial research, full weekend of intensive coding, some
more time for additional testing, final tweaks and communication on the
mailing list) :)

I have also submitted this patch to maemo bugzilla, hopefully it (or its
modification) can get included into the next version of N800 firmware:
https://maemo.org/bugzilla/show_bug.cgi?id=1278


1. http://maemo.org/pipermail/maemo-developers/2007-April/009864.html
2. https://maemo.org/bugzilla/show_bug.cgi?id=991
3. http://maemo.org/pipermail/maemo-developers/2007-April/009925.html
_______________________________________________
maemo-developers mailing list
maemo-developers [at] maemo
https://maemo.org/mailman/listinfo/maemo-developers


dufkaf at seznam

May 1, 2007, 12:35 PM


Views: 9910
Re: N800 & Video playback

Frantisek Dufka wrote:

> [sbox-SDK_ARMEL: ~/x/xorg-server-1.1.99.3] > patch -p1
> <../xomap_yuv420patch.diff
> patching file hw/kdrive/omap/Makefile.am
> Hunk #1 FAILED at 1.
> Hunk #2 FAILED at 34.
> 2 out of 2 hunks FAILED -- saving rejects to file
> hw/kdrive/omap/Makefile.am.rej
> patching file hw/kdrive/omap/omap_colorconv.h
> patching file hw/kdrive/omap/omap_colorconv.S
> patching file hw/kdrive/omap/omap_video.c
> Hunk #1 FAILED at 39.
> Hunk #2 FAILED at 468.
> Hunk #3 FAILED at 491.
> 3 out of 3 hunks FAILED -- saving rejects to file
> hw/kdrive/omap/omap_video.c.rej
>


Sorry, my fault, mystery solved. I saved the attachment in Thunderbird on
Windows XP, then moved it to Ubuntu inside VMware. The problem was caused
by DOS CR+LF line endings, which patch doesn't like. After recoding to unix
linefeeds it applies cleanly. I use Windows a lot, so it is strange that
this never happened to me before.
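
For the record, the recoding can be sketched in a few lines of Python (tools
like dos2unix do the same job). The function names and the file names in the
usage note are made up for illustration.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Convert DOS CR+LF line endings to unix LF, leaving other bytes alone."""
    return data.replace(b"\r\n", b"\n")

def convert_file(src_path: str, dst_path: str) -> None:
    # Binary mode, so Python does not translate line endings on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(crlf_to_lf(data))
```

Usage would be something like convert_file("patch_dos.diff", "patch_unix.diff")
before running patch -p1 on the result.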


daniel.amelang at gmail

May 1, 2007, 9:51 PM


Views: 9896
Re: N800 & Video playback

On 4/30/07, Daniel Stone <daniel.stone [at] nokia> wrote:
>
> > There are two important optimizations in this code:
> > 1. Cache prefetch with PLD instruction (added in '_armv5' version) which
> > boosts performance to 70 megapixels per second. Inner loop is unrolled
> > to process 32 pixels per iteration (cache line size is 32 bytes on ARM, so
> > such unrolling is convenient). This is the most important improvement.
> > You can try using __builtin_prefetch() from C code to do the same
> > optimization.
>
> Ah, sounds useful. From what Dan Amelang's been saying on xorg@, gcc
> should coalesce four 32-bit reads into one 128-bit read, but this sounds
> promising as well.

To expand on this: I was referring to the fact that gcc is pretty smart
about using ldmia/stdmia instructions to cluster sequential
reads/writes. I see that Siarhei is already using this technique in
his assembler code, so nothing new here.

Dan


daniel.amelang at gmail

May 1, 2007, 9:52 PM


Views: 9924
Re: N800 & Video playback

On 5/1/07, Daniel Amelang <daniel.amelang [at] gmail> wrote:
>
> about using ldmia/stdmia instructions to cluster sequential

that was supposed to be "ldmia/sdmia", sorry.

Dan


daniel.amelang at gmail

May 1, 2007, 9:53 PM


Views: 9914
Re: N800 & Video playback

On 5/1/07, Daniel Amelang <daniel.amelang [at] gmail> wrote:
> On 5/1/07, Daniel Amelang <daniel.amelang [at] gmail> wrote:
> >
> > about using ldmia/stdmia instructions to cluster sequential
>
> that was supposed to be "ldmia/sdmia", sorry.

Gah, "ldmia/stmia", final answer.

Dan


siarhei.siamashka at gmail

May 1, 2007, 11:16 PM


Views: 9926
Re: N800 & Video playback

On Tuesday 01 May 2007 20:49, Siarhei Siamashka wrote:

Looks like I have to reply to myself.

> On Tuesday 01 May 2007 17:49, Kalle Vahlman wrote:
> > Applied and build without problems for me.
>
> Thanks a lot for building the package and putting it for download,
> everything seems to be fine, but more details will follow below.

[snip]

> Anyway, the new xserver package works really good. If we do some tests with
> the standard Nokia_N800.avi video clip, we get the following results with
> the patched xserver:
>
> # mplayer -benchmark -quiet -noaspect Nokia_N800.avi
> BENCHMARKs: VC: 29,764s VO: 7,666s A: 0,468s Sys: 64,635s = 102,534s
> BENCHMARK%: VC: 29,0287% VO: 7,4767% A: 0,4565% Sys: 63,0381% = 100,0000%
> BENCHMARKn: disp: 2504 (24,42 fps) drop: 0 (0%) total: 2504 (24,42 fps)
>
> # mplayer -benchmark -quiet -noaspect -dr -nomenu Nokia_N800.avi
> BENCHMARKs: VC: 30,266s VO: 5,490s A: 0,467s Sys: 66,286s = 102,509s
> BENCHMARK%: VC: 29,5255% VO: 5,3554% A: 0,4560% Sys: 64,6631% = 100,0000%
> BENCHMARKn: disp: 2501 (24,40 fps) drop: 0 (0%) total: 2501 (24,40 fps)
>
> Results with unpatched xserver and some more explanations can be found in
> [3].
> Yes, now N800 is faster than Nokia 770 for video output performance at
> last :)

Well, still not everything is so good until the following bug gets fixed:
https://maemo.org/bugzilla/show_bug.cgi?id=1281

The patch for optimized Xv performance will not help to watch widescreen
video which triggers this tearing bug. If you see tearing on the screen, you
should know that the YUV420 color format conversion optimization patch
does not get used at all and xserver most likely uses a slow nonoptimized
YUV422 fallback code with software scaling.

Fixing this bug is critical for video playback performance. I hope it will be
solved in the next version of N800 firmware too. But if we get some patch to
solve this problem for testing earlier, that would be nice too.

> Video output overhead on N800 is really at least halved. Of course, video
> output takes only some fraction of time in video player. So overall
> performance improvement for Nokia_N800.avi playback is approximately 20%
> but not 250%-300% which can be observed for 'omapCopyPlanarDataYUV420'
> function alone.

Before anybody noticed, correcting myself :) This 'omapCopyPlanarDataYUV420'
has a 2.5x-3x speedup, which equals a 150%-200% improvement in percent.
Elementary arithmetic is tough when you are tired.
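
The conversion between a speedup factor and a percentage is easy to get wrong,
so here is a tiny Python sketch of it. The function name is only an
illustration.

```python
def speedup_to_percent(factor):
    """Percent improvement for a speedup factor (2.0x faster means +100%)."""
    return (factor - 1.0) * 100.0

for factor in (2.5, 3.0):
    print(f"{factor}x faster = {speedup_to_percent(factor):.0f}% improvement")
```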


dufkaf at seznam

May 2, 2007, 12:06 AM


Views: 9895
Re: N800 & Video playback

Kalle Vahlman wrote:

> I put the deb up at:
>
> http://iki.fi/zuh/xserver-xomap_1.1.99.3-0.zuh2_armel.deb
>
> until I get it to the repository. This version also has the composite
> extension enabled, but AFAIK it does not depend on the libs or change
> server behaviour if composite is not specifically used.
>
> The server *should* be compiled with '-mcpu=arm1136j-s -mfpu=vfp
> -mfloat-abi=softfp -O2', but as I had troubles with the
> SBOX_EXTRA_COMPILER_ARGS env var being honored some time ago I'm not
> guaranteeing it at the moment ;)
>

I also succeeded in making the deb:
http://fanoush.wz.cz/maemo/xserver-xomap_1.1.99.3-0osso31_armel.deb

This one is compiled as thumb (except the ASM code) and with no special CPU
flags, so it can be verified whether there is any slowdown. Thumb mode saves
approx. 300kb of executable size. It seems to be used by default in
firmware images.

Kalle, did it link properly for you? With the patch the final Xomap link
did not add the ASM code, I had to do it by hand. I didn't find the proper
place in the Makefile for it to be added to libomap.a; the place patched by
Siarhei was ignored by the build process for me.

Frantisek



daniel.stone at nokia

May 2, 2007, 2:39 AM


Views: 9935
Re: N800 & Video playback

On Wed, May 02, 2007 at 09:16:01AM +0300, ext Siarhei Siamashka wrote:
> On Tuesday 01 May 2007 20:49, Siarhei Siamashka wrote:
> > Results with unpatched xserver and some more explanations can be found in
> > [3].
> > Yes, now N800 is faster than Nokia 770 for video output performance at
> > last :)
>
> Well, still not everything is so good until the following bug gets fixed:
> https://maemo.org/bugzilla/show_bug.cgi?id=1281
>
> The patch for optimized Xv performance will not help to watch widescreen
> video which triggers this tearing bug. If you see tearing on the screen, you
> should know that the YUV420 color format conversion optimization patch
> does not get used at all and xserver most likely uses a slow nonoptimized
> YUV422 fallback code with software scaling.

Indeed. And the reason the code is there is because Hailstorm can only
downscale at fixed ratios (half and one-quarter), and even then, it
locked up when we tried. Similarly, the display controller's
downscaling didn't work, either. So we can optimise the fallback path,
but you'll still be screwed by sending 16bpp (instead of 12bpp) through
RFBI.
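
To put the 16bpp versus 12bpp remark into numbers, a small Python sketch can
compute how many bytes one frame takes in each format. The helper name is made
up for illustration, and 640x480 is just an example size.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size of one frame in bytes for the given average bits per pixel."""
    return width * height * bits_per_pixel // 8

yuv420 = frame_bytes(640, 480, 12)   # 12bpp, the optimized YUV420 path
yuv422 = frame_bytes(640, 480, 16)   # 16bpp, the YUV422 fallback path
print(yuv420, yuv422, yuv422 / yuv420)
```

So the fallback path has to push one third more data through RFBI for every
frame, no matter how fast the conversion code itself is.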

> Fixing this bug is critical for video playback performance. I hope it will be
> solved in the next version of N800 firmware too. But it we get some patch to
> solve this problem for testing earlir, that would be nice too.

The only patch is optimising that function, really. Even if we did work
out a way to make Hailstorm happy, you can still only scale at those
exact multiples, which doesn't make it a viable general solution.

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


daniel.stone at nokia

May 2, 2007, 2:47 AM


Views: 9903
Re: N800 & Video playback

On Tue, May 01, 2007 at 08:49:20PM +0300, ext Siarhei Siamashka wrote:
> On Tuesday 01 May 2007 17:49, Kalle Vahlman wrote:
> > For testing, I fabricated some video with gstreamer:
> >
> > which resulted in 640x480@30fps and 800x480@30fps videos. For some
> > reason 320x240 and 352x288 refused to play with:
> >
> > X11 error: BadValue (integer parameter out of range for operation)
> > MPlayer interrupted by signal 6 in module: flip_page while gstreamer did
> > play them just fine. Also the Nokia_N800.avi and NokiaN93.avi died in the
> > same way.
>
> This X11 error on video playback start and also sometimes on switching
> fullscreen/windowed mode is a known problem [1] reported in this mailing list.
>
> If MPlayer dies on start, usually trying to start it again succeeds. So these
> 320x240 and 352x288 videos could be played as well if you were a bit more
> persistent :)

Resizing is a bit tricky. Most video hardware lets you use the hardware
to clip, so if you move it beyond the edge of the screen, it just
happily ignores anything beyond the hardware's bounds. Unfortunately
for us, attempting to move a video surface off-screen (even by just a
few pixels) triggers a hardware lockup.

Given that we can't display the frame at all, we send BadValue (there
are a couple of other conditions where this is possible, but this is the
main one). I don't see the point in returning Success when no video is
drawn at all. So, I guess you could hack mplayer's error handler to
just ignore BadValues from Xv(Shm)PutImage, unless you get more than
five or ten in a row, say.
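
The counting logic for such an error handler could look roughly like the
following Python sketch. It only models the bookkeeping; the class and method
names are hypothetical, and wiring it into MPlayer's actual X11 error handler
is not shown.

```python
class BadValueFilter:
    """Tolerate isolated BadValue errors, give up on a long run of them."""

    def __init__(self, max_in_a_row=5):
        self.max_in_a_row = max_in_a_row
        self.in_a_row = 0

    def frame_result(self, bad_value):
        """Record one PutImage result; return True if playback should abort."""
        if bad_value:
            self.in_a_row += 1
        else:
            self.in_a_row = 0          # a successful frame resets the run
        return self.in_a_row > self.max_in_a_row
```

With max_in_a_row=5, five consecutive failures are still ignored and the sixth
one makes frame_result() return True.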

> As Daniel replied in one of the followup messages, it is most likely some race
> condition. The question is which code is a suspect. Is it MPlayer Xv video
> output code that has been around for ages and worked fine on different systems
> or relatively new Xv extension code from N800 xserver? In addition, a previous
> revision of N800 firmware had a serious bug [2] related to video playback. It
> should be noted, that MPlayer needed only about 1 minute to freeze on the
> initial N800 firmware. So the problem could be identified much more easily
> if MPlayer was included in the standard set of tests done by Nokia QA staff
> before each new IT OS release. Surely, Nokia is only interested in a
> properly working xvimagesink for the software included in IT OS by default.
> But testing with more client applications can improve overall xserver quality.

Bear in mind that, as you've hinted at, the only part of the Xv code
which is custom is the _output_ code. We're using the standard X server
implementation (as used by tens of millions of people) for the protocol
decode and standard semantics, the standard KDrive layer for extended
stuff (as used by god-knows-how-many embedded and consumer devices), and
then the only part we have to play is taking frames and putting them on
the screen.

Due to some restrictions (as above), we have to deliberately error out
on some operations. But errors like that tend to say 'you've hit a
hardware restriction, I can't do this', rather than 'you hit one of the
many random return BadValues we put in this weird code just to confuse
people'.

Also, bear in mind that a lot of the initial instability was due to the
DSP. The video was actually rather stable when you played without
sound, although now the situation is somewhat reversed, with the DSP
being pretty steady and the new YUV420 code having complicated
semantics.

> I have also submitted this patch to maemo bugzilla, hopefully it (or its
> modification) can get included into the next version of N800 firmware:
> https://maemo.org/bugzilla/show_bug.cgi?id=1278

I'll merge it with some changes.

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


daniel.stone at nokia

May 2, 2007, 2:54 AM


Views: 9909
Re: N800 & Video playback

On Tue, May 01, 2007 at 11:51:50AM +0300, ext Siarhei Siamashka wrote:
> On Monday 30 April 2007 17:49, Daniel Stone wrote:
> > Indeed. Unfortunately this is slightly misleading in that it only shows
> > the raw write speed. RFBI can't deal with the sorts of speeds that your
> > hyper-optimised version is pumping out, e.g. So it's mainly just about
> > cutting the latency into the critical path to low enough that it makes
> > no difference.
>
> The 'framebuffer' is just the ordinary system memory, converting color format
> and copying data to framebuffer will be done with the same performance as
> simulated in this test. RFBI performance is only critical for asynchronous
> DMA data transfer to LCD controller which does not introduce any overhead
> and is performed at the same time as ARM core is doing some other work
> (decoding the next frame). RFBI performance matters only if data transfer to
> LCD is still not complete at the time when the next frame is already decoded
> and is ready to be displayed. When playing video, ARM core and LCD controller
> are almost always working at the same time performing different tasks in
> parallel. I think I had already explained these details in [1]

Right. My point is that the numbers you're showing -- while very good,
don't get me wrong -- won't necessarily have a huge direct impact on
video playback. Particularly if you want to avoid tearing.

> So now the results of the tests are consistent - when doing video output, most
> of ARM core cycles are spent in this 'omapCopyPlanarDataYUV420' function.

Well, either that, or just waiting for RFBI transfers to complete.

> Optimizing it using 'yv12_to_yuv420_line_armv6' will definitely provide a huge
> effect, video output overhead when using Xv will be at least halved providing
> more cpu resources for video decoding.

Yes, this is one good aspect.

> > I don't have any tips, per se. Once I get it all integrated it'll be in
> > git, but for now, the only public source is the packages.
>
> OK, thanks. It may take some time though. I'm still using old scratchbox
> with mistral SDK here (did not have enough free time to upgrade yet). Until I
> clean up my scratchbox mess, I can only provide some patch without testing, if
> anybody courageous can try to build it :)

I'm still using Scratchbox 0.9.8.5 for day-to-day stuff ...

> Well, anyway, everything worked perfectly and I could play 640x480 video
> on N800 with the following statistics:
>
> VIDEO: [DIVX] 640x480 12bpp 23.976 fps 886.7 kbps (108.2 kbyte/s)
> ...
> BENCHMARKs: VC: 87,757s VO: 8,712s A: 1,314s Sys: 3,835s = 101,618s
> BENCHMARK%: VC: 86,3592% VO: 8,5736% A: 1,2932% Sys: 3,7740% = 100,0000%
> BENCHMARKn: disp: 2044 (20,11 fps) drop: 355 (14%) total: 2399 (23,61 fps)
>
> As you see, mplayer took 8.712 seconds to display 2044 VGA resolution frames.
> If we do the necessary calculations, that's 72 millions pixels per second,
> quite close to 'yv12_to_yuv420_line_armv6' capabilities limit, so this
> function is the only major contributor to video output time. Video output
> took much less time than decoding, so it proves that video output
> overhead can be reduced to minimum (in this test tearsync was not used
> though).
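
The 'necessary calculations' quoted above can be reproduced with a short Python
sketch. The function name is only an illustration; the inputs are the figures
from the quoted benchmark.

```python
def pixels_per_second(frames, width, height, seconds):
    """Average pixel throughput of the video output stage."""
    return frames * width * height / seconds

rate = pixels_per_second(2044, 640, 480, 8.712)
print(f"{rate / 1e6:.1f} million pixels per second")
```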

I'd be curious to see the results from this with tearsync _enabled_?
i.e., after your OMAPFB_UPDATE_WINDOW call, issue an OMAPFB_SYNC_GFX
ioctl before you start writing to memory again. This is basically the
limiter for us at this stage.

> When tearsync comes into action, everything gets a bit more complicated. I'm
> still investigating its impact on video playback performance.

'Not good'. :)

Thanks again for your work.

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


siarhei.siamashka at gmail

May 2, 2007, 10:15 PM


Views: 9932
Re: N800 & Video playback

On Wednesday 02 May 2007 12:54, Daniel Stone wrote:
> > The 'framebuffer' is just the ordinary system memory, converting color
> > format and copying data to framebuffer will be done with the same
> > performance as simulated in this test. RFBI performance is only critical
> > for asynchronous DMA data transfer to LCD controller which does not
> > introduce any overhead and is performed at the same time as ARM core is
> > doing some other work (decoding the next frame). RFBI performance matters
> > only if data transfer to LCD is still not complete at the time when the
> > next frame is already decoded and is ready to be displayed. When playing
> > video, ARM core and LCD controller are almost always working at the same
> > time performing different tasks in parallel. I think I had already
> > explained these details in [1]
>
> Right. My point is that the numbers you're showing -- while very good,
> don't get me wrong -- won't necessarily have a huge direct impact on
> video playback. Particularly if you want to avoid tearing.

I have no idea what other proof would be enough for you. You already got all
the numbers, and even benchmarks with patched xserver. They all confirm
video output performance improvement.

> > So now the results of the tests are consistent - when doing video output,
> > most of ARM core cycles are spent in this 'omapCopyPlanarDataYUV420'
> > function.
>
> Well, either that, or just waiting for RFBI transfers to complete.

You need to wait a bit before displaying the next frame anyway, and
the period between frames for 30 fps video usually eclipses transfer
completion time. If you want some numbers, a 640x480 YUV420 (12bpp)
screen update now takes 25ms without the tearsync flag enabled
(OMAPFB_FORMAT_FLAG_TEARSYNC for OMAPFB_UPDATE_WINDOW
ioctl) and 25-42ms with tearsync. For 30 fps video, period between
performing screen updates is normally 33ms. For playing video, we
initiate RFBI transfer, wait till it completes, perform YV12->YUV420 color
format conversion (which should take less than 4ms for 640x480
considering benchmark results), wait till it is time to display the next
frame and start RFBI transfer again. For 30 fps video 25ms+4ms is less
than 33ms, so without tearsync enabled, any 640x480 video should play
fine (considering video output performance). With tearsync enabled, we
should add the time needed for performing vertical sync in LCD controller
which breaks our nice numbers. Worst case (17ms wait for retrace + 25ms
for actual data transfer) takes more time than 33ms between frames.
We can be saved if the LCD controller internal refresh rate is really 60Hz:
in this case video playback will automagically synchronize to the LCD refresh
rate and each frame will be processed within exactly 2 LCD refresh
cycles (by the time we want to display a video frame, the next vertical retrace
will be near and we will not lose much time waiting for it). If decoding time for
each frame will never exceed 28-29ms (which is a tough limitation, cpu
usage is not uniform), video playback without dropping any frames will be
possible even with tearsync enabled. That's what I'm investigating now.
In any case, getting ideal 24 fps playback will be a bit easier.
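
The timing budget above can be written down as a small Python sketch. The
function names are made up, the millisecond figures are the ones from this
message, and the decode budget is one reading of the '2 LCD refresh cycles'
argument (the RFBI transfer is assumed to run in parallel with decoding, so
only the color conversion is subtracted).

```python
def frame_period_ms(fps):
    return 1000.0 / fps

def update_fits(fps, transfer_ms, convert_ms, retrace_wait_ms=0.0):
    """True if one screen update plus conversion fits into a frame period."""
    return retrace_wait_ms + transfer_ms + convert_ms <= frame_period_ms(fps)

def decode_budget_ms(refresh_hz, cycles_per_frame, convert_ms):
    """CPU time left for decoding when each frame spans whole refresh cycles."""
    return cycles_per_frame * 1000.0 / refresh_hz - convert_ms

# 30 fps, 640x480: 25ms transfer, 4ms conversion, up to 17ms wait for retrace
print(update_fits(30, transfer_ms=25, convert_ms=4))
print(update_fits(30, transfer_ms=25, convert_ms=4, retrace_wait_ms=17))
print(round(decode_budget_ms(60, 2, convert_ms=4), 1))
```

The first check passes and the second one fails, which is the difference
between playing with and without tearsync described above.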

I hope all these explanations are clear now. And this is not just a theory,
but already confirmed by some experiments and practical tests.

> I'm still using Scratchbox 0.9.8.5 for day-to-day stuff ...

Thanks, that is what I would consider 'additional tips and tricks' :)

It is good to know that maemo 3.x development can be also done with
older scratchbox (I have 0.9.8.8 installed now), I'll try it without upgrading
scratchbox then.

> > Well, anyway, everything worked perfectly and I could play 640x480 video
> > on N800 with the following statistics:
> >
> > VIDEO: [DIVX] 640x480 12bpp 23.976 fps 886.7 kbps (108.2 kbyte/s)
> > ...
> > BENCHMARKs: VC: 87,757s VO: 8,712s A: 1,314s Sys: 3,835s =
> > 101,618s BENCHMARK%: VC: 86,3592% VO: 8,5736% A: 1,2932% Sys: 3,7740%
> > = 100,0000% BENCHMARKn: disp: 2044 (20,11 fps) drop: 355 (14%) total:
> > 2399 (23,61 fps)
> >
> > As you see, mplayer took 8.712 seconds to display 2044 VGA resolution
> > frames. If we do the necessary calculations, that's 72 millions pixels
> > per second, quite close to 'yv12_to_yuv420_line_armv6' capabilities
> > limit, so this function is the only major contributor to video output
> > time. Video output took much less time than decoding, so it proves that
> > video output overhead can be reduced to minimum (in this test tearsync
> > was not used though).
>
> I'd be curious to see the results from this with tearsync _enabled_?
> i.e., after your OMAPFB_UPDATE_WINDOW call, issue an OMAPFB_SYNC_GFX
> ioctl before you start writing to memory again. This is basically the
> limiter for us at this stage.

That's exactly how MPlayer works. It always waits on OMAPFB_SYNC_GFX
before filling framebuffer with the data for the next frame. Not issuing
OMAPFB_SYNC_GFX would introduce *artificial* tearing not related to sync
with LCD refresh. Actually for this 24 fps video, OMAPFB_SYNC_GFX is not a
problem. The detailed explanation with some numbers was posted above.

When I'm talking about tearsync, I'm talking exclusively about
OMAPFB_FORMAT_FLAG_TEARSYNC for screen updates ioctls.

> > When tearsync comes into action, everything gets a bit more complicated.
> > I'm still investigating its impact on video playback performance.
>
> 'Not good'. :)

Video quality is still quite good even without tearsync (in my definition),
but not perfect. With your definition, tearsync is always enabled in MPlayer
anyway, on Nokia 770 too :)


siarhei.siamashka at gmail

May 2, 2007, 10:26 PM


Views: 9908
Re: N800 & Video playback

On Wednesday 02 May 2007 12:47, Daniel Stone wrote:
> > > X11 error: BadValue (integer parameter out of range for operation)
> > > MPlayer interrupted by signal 6 in module: flip_page while gstreamer
> > > did play them just fine. Also the Nokia_N800.avi and NokiaN93.avi died
> > > in the same way.
> >
> > This X11 error on video playback start and also sometimes on switching
> > fullscreen/windowed mode is a known problem [1] reported in this mailing
> > list.
> >
> > If MPlayer dies on start, usually trying to start it again succeeds. So
> > these 320x240 and 352x288 videos could be played as well if you were a
> > bit more persistent :)
>
> Resizing is a bit tricky. Most video hardware lets you use the hardware
> to clip, so if you move it beyond the edge of the screen, it just
> happily ignores anything beyond the hardware's bounds. Unfortunately
> for us, attempting to move a video surface off-screen (even by just a
> few pixels) triggers a hardware lockup.
>
> Given that we can't display the frame at all, we send BadValue (there
> are a couple of other conditions where this is possible, but this is the
> main one). I don't see the point in returning Success when no video is
> drawn at all. So, I guess you could hack mplayer's error handler to
> just ignore BadValues from Xv(Shm)PutImage, unless you get more than
> five or ten in a row, say.

Thanks for the hint, I'll try it.

> Bear in mind that, as you've hinted at, the only part of the Xv code
> which is custom is the _output_ code. We're using the standard X server
> implementation (as used by tens of millions of people) for the protocol
> decode and standard semantics, the standard KDrive layer for extended
> stuff (as used by god-knows-how-many embedded and consumer devices), and
> then the only part we have to play is taking frames and putting them on
> the screen.
>
> Due to some restrictions (as above), we have to deliberately error out
> on some operations. But errors like that tend to say 'you've hit a
> hardware restriction, I can't do this', rather than 'you hit one of the
> many random return BadValues we put in this weird code just to confuse
> people'.

That's interesting information, thanks.

> Also, bear in mind that a lot of the initial instability was due to the
> DSP. The video was actually rather stable when you played without
> sound, although now the situation is somewhat reversed with the DSP
> being pretty steady now, and the new YUV420 code having complicated
> semsnatics.

Well, I was planning to raise this issue later (after the xserver/Xv things are
clear), but it looks like the DSP still has some problems on N800. In MPlayer
it can be triggered by a number of very fast sequential gstreamer pipeline
start/stop operations, which usually happen on seeking. Audio playback just
hangs. Right now MPlayer artificially introduces a 100ms pause to work around
this problem. I tried to reproduce the same issue with a small test program,
but have not succeeded yet.

> > I have also submitted this patch to maemo bugzilla, hopefully it (or its
> > modification) can get included into the next version of N800 firmware:
> > https://maemo.org/bugzilla/show_bug.cgi?id=1278
>
> I'll merge it with some changes.

Thanks a lot.


siarhei.siamashka at gmail

May 2, 2007, 10:48 PM


Views: 9915
Re: N800 & Video playback

On Wednesday 02 May 2007 12:39, Daniel Stone wrote:
> On Wed, May 02, 2007 at 09:16:01AM +0300, ext Siarhei Siamashka wrote:
> > On Tuesday 01 May 2007 20:49, Siarhei Siamashka wrote:
> > > Results with unpatched xserver and some more explanations can be found
> > > in [3].
> > > Yes, now N800 is faster than Nokia 770 for video output performance at
> > > last :)
> >
> > Well, still not everything is so good until the following bug gets fixed:
> > https://maemo.org/bugzilla/show_bug.cgi?id=1281
> >
> > The patch for optimized Xv performance will not help to watch widescreen
> > video which triggers this tearing bug. If you see tearing on the screen,
> > you should know that the YUV420 color format conversion optimization
> > patch does not get used at all and xserver most likely uses a slow
> > nonoptimized YUV422 fallback code with software scaling.
>
> Indeed. And the reason the code is there is because Hailstorm can only
> downscale at fixed ratios (half and one-quarter), and even then, it
> locked up when we tried. Similarly, the display controller's
> downscaling didn't work, either. So we can optimise the fallback path,
> but you'll still be screwed by sending 16bpp (instead of 12bpp) through
> RFBI.

The only thing which is unclear here is that Hailstorm does not need to
downscale video in this situation. The bug can be reproduced with 512x288
video which just needs upscaling to 800x450. Also even standard
Nokia_N800.avi video with proper aspect ratio causes a huge
performance regression and tearing.

Please give this #1281 issue another look. It looks like a bug in xserver,
but not a hardware limitation. I can probably try to workaround it by
requesting not 512x288 buffer from Xv, but something like 512x308, use
only 512x288 part of it and artificially add black bands above and below.
After that, Xv can be asked to expand it to 800x480 to get expected result
But if it is a bug in xserver, it would be better to get it fixed, preferably
before the next firmware update :)
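
The arithmetic behind that workaround can be sketched as follows. This is only an illustration, not MPlayer code: the helper names padded_height and top_band are made up, and rounding the height up to an even number of lines is my own assumption (it keeps the chroma planes of a YV12 buffer whole).

```c
#include <assert.h>

/*
 * Smallest even buffer height for which a src_w pixels wide buffer is at
 * least as tall as the dst_w:dst_h screen aspect ratio requires.
 * Hypothetical helper for illustration only.
 */
static unsigned padded_height(unsigned src_w, unsigned dst_w, unsigned dst_h)
{
    assert(dst_w > 0);
    /* ceil(src_w * dst_h / dst_w) in integer arithmetic */
    unsigned h = (src_w * dst_h + dst_w - 1) / dst_w;
    /* round up to an even number of lines (assumption, see above) */
    return (h + 1) & ~1u;
}

/* Number of black lines above the picture; the remainder goes below. */
static unsigned top_band(unsigned padded_h, unsigned src_h)
{
    assert(padded_h >= src_h);
    return (padded_h - src_h) / 2;
}
```

For a 512x288 video on the 800x480 screen, padded_height(512, 800, 480) gives 308 and top_band(308, 288) gives 10, so the picture would sit between two 10-line black bands.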

> > Fixing this bug is critical for video playback performance. I hope it
> > will be solved in the next version of N800 firmware too. But if we get
> > some patch to solve this problem for testing earlier, that would be nice
> > too.
>
> The only patch is optimising that function, really. Even if we did work
> out a way to make Hailstorm happy, you can still only scale at those
> exact multiples, which doesn't make it a viable general solution.

I will write an optimized software YV12->YUV420 JIT scaler a bit later (next
weekend?). It will be only a minor modification of the YV12->YUV422 scaler
which already exists and works fine. If it can be useful for xserver, it might
be added there at any time.


kalle.vahlman at gmail

May 2, 2007, 10:57 PM


Views: 9920
Re: N800 & Video playback

2007/5/1, Siarhei Siamashka <siarhei.siamashka [at] gmail>:
> On Tuesday 01 May 2007 17:49, Kalle Vahlman wrote:
> > > OK, here is an untested patch for xserver to add ARMv6 optimized
> > > YUV420 color format conversion. Theoretically it should compile
> > > (I did not try to build xserver myself though) and work. If it refuses to
> > > compile, fixing the patch should not be too difficult.
> >
> > Applied and built without problems for me.
>
> Thanks a lot for building the package and putting it for download, everything
> seems to be fine, but more details will follow below.
>
> > For testing, I fabricated some video with gstreamer:
> >
> > which resulted in 640x480 @ 30fps and 800x480 @ 30fps videos. For some
> > reason 320x240 and 352x288 refused to play with:
> >
> > X11 error: BadValue (integer parameter out of range for operation)
> > MPlayer interrupted by signal 6 in module: flip_page
> >
> > while gstreamer did play them just fine. Also the Nokia_N800.avi and
> > NokiaN93.avi died in the same way.
>
> This X11 error on video playback start and also sometimes on switching
> fullscreen/windowed mode is a known problem [1] reported in this mailing list.
>
> If MPlayer dies on start, usually trying to start it again succeeds. So these
> 320x240 and 352x288 videos could be played as well if you were a bit more
> persistent :)

No, it's actually 100% reproducible in this situation (yes, I tried a
number of times). You see, I didn't have the window manager running. It
breaks with the N800 video too.
Running with the window manager does make it runnable, but it also
changes the window size which I wanted to avoid.

> > My mplayer is compiled from the svn
> > trunk of the garage project, with some additional cflags I use (so
> > maybe those were the problem...).
>
> Do you have a set of cflags settings which work better than the default set?
> Can you share this information?

If by "default set" you mean what the default options in the toolchain
are, then yes (as there are none AFAIK ;). If you mean the default
options for mplayer, I don't know if they add any value. I like to use
my hardware well ;) so I tend to compile everything with VFP enabled
and optimized for the processor:

CFLAGS='-mcpu=arm1136j-s -mfpu=vfp -mfloat-abi=softfp -O2'

Now, whether it works better than thumb code is debatable, as
optimizing code size might be more beneficial than having fast floats.
But at least I was happy with the results we got from our testing,
detailed in

http://syslog.movial.fi/archives/46-N800-VFP-or-not-to-VFP.html

I doubt they will do much good for mplayer, as I assume most critical
operations will be highly optimized already by hand and not left
entirely for the compiler...

> If you want a guaranteed video playback with divx/xvid/mpeg4 codecs, you
> should restrict to 512x384 resolution or lower and keep bitrate reasonable.
>
> The results for these 'insane' videos you have posted are somewhat weird, a
> complete statistics would require also a number of frames dropped, otherwise
> we don't know how much work was done by the player. Probably missing audio
> track resulted in MPlayer not being able to provide a proper report.

Yeah, I guess the fabricated videos weren't that good. Have to do some
more testing with real videos...

> Yes, now N800 is faster than Nokia 770 for video output performance at last :)

This is _very_ cool indeed :)

--
Kalle Vahlman, zuh [at] iki
Powered by http://movial.fi
Interesting stuff at http://syslog.movial.fi


dufkaf at seznam

May 3, 2007, 12:21 AM


Views: 9950
Re: N800 & Video playback

Siarhei Siamashka wrote:
> If decoding time for
> each frame will never exceed 28-29ms (which is a tough limitation, cpu
> usage is not uniform), video playback without dropping any frames will be
> possible even with tearsync enabled.

Would a double or multiple buffering help with this? Does mplayer use
different threads for displaying and decoding and decode frames in advance?


siarhei.siamashka at gmail

May 3, 2007, 1:10 PM


Views: 9909
Re: N800 & Video playback

On Thursday 03 May 2007 08:48, Siarhei Siamashka wrote:
> The only thing which is unclear here is that Hailstorm does not need to
> downscale video in this situation. The bug can be reproduced with 512x288
> video which just needs upscaling to 800x450. Also even standard
> Nokia_N800.avi video with proper aspect ratio causes a huge
> performance regression and tearing.
>
> Please give this #1281 issue another look. It looks like a bug in xserver,
> but not a hardware limitation. I can probably try to workaround it by
> requesting not 512x288 buffer from Xv, but something like 512x308, use
> only 512x288 part of it and artificially add black bands above and below.
> After that, Xv can be asked to expand it to 800x480 to get the expected result.
> But if it is a bug in xserver, it would be better to get it fixed,
> preferably before the next firmware update :)

Well, I found what the matter is and added an explanation at bugzilla:
https://maemo.org/bugzilla/show_bug.cgi?id=1281

The workaround can be easily added to MPlayer, so that it will
never call XvShmPutImage with top left image corner at an odd line.
I'm going to release an updated MPlayer package (maybe even
a bit later today), it is really fast on N800 with the optimized xserver :)
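
A minimal sketch of the idea, assuming the picture is centered vertically on the screen. The function name centered_even_y is made up for illustration and this is not the actual MPlayer change:

```c
#include <assert.h>

/*
 * Top line for a picture img_h lines tall, centered on a screen with
 * screen_h lines, rounded down so that it never lands on an odd line.
 * Hypothetical helper for illustration only.
 */
static unsigned centered_even_y(unsigned screen_h, unsigned img_h)
{
    assert(img_h <= screen_h);
    unsigned y = (screen_h - img_h) / 2;
    /* clear the lowest bit: 15 becomes 14, even values are unchanged */
    return y & ~1u;
}
```

An 800x450 picture centered on the 480-line screen would naturally start at line 15, which is odd and presumably the case that triggers the fallback; the helper moves it up to line 14.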


daniel.stone at nokia

May 4, 2007, 12:49 AM


Views: 9909
Re: N800 & Video playback

On Thu, May 03, 2007 at 11:10:32PM +0300, ext Siarhei Siamashka wrote:
> On Thursday 03 May 2007 08:48, Siarhei Siamashka wrote:
> > The only thing which is unclear here is that Hailstorm does not need to
> > downscale video in this situation. The bug can be reproduced with 512x288
> > video which just needs upscaling to 800x450. Also even standard
> > Nokia_N800.avi video with proper aspect ratio causes a huge
> > performance regression and tearing.
> >
> > Please give this #1281 issue another look. It looks like a bug in xserver,
> > but not a hardware limitation. I can probably try to workaround it by
> > requesting not 512x288 buffer from Xv, but something like 512x308, use
> > only 512x288 part of it and artificially add black bands above and below.
> > > After that, Xv can be asked to expand it to 800x480 to get the expected result.
> > But if it is a bug in xserver, it would be better to get it fixed,
> > preferably before the next firmware update :)
>
> Well, found what's the matter and added explanation at bugzilla:
> https://maemo.org/bugzilla/show_bug.cgi?id=1281
>
> The workaround can be easily added to MPlayer, so that it will
> never call XvShmPutImage with top left image corner at an odd line.
> I'm going to release an updated MPlayer package (maybe even
> a bit later today), it is really fast on N800 with the optimized xserver :)

Aha, that will indeed cause a fallback (x, y, width and height should
all be aligned to 4px).

Cheers,
Daniel
Attachments: signature.asc (0.18 KB)


siarhei.siamashka at gmail

May 6, 2007, 1:25 AM


Views: 9873
Re: N800 & Video playback

On Thursday 03 May 2007 10:21, Frantisek Dufka wrote:

> Siarhei Siamashka wrote:
> > If decoding time for
> > each frame will never exceed 28-29ms (which is a tough limitation, cpu
> > usage is not uniform), video playback without dropping any frames will be
> > possible even with tearsync enabled.
>
> Would a double or multiple buffering help with this?

Yes, most likely it will. N800 has 800x480 virtual size for framebuffer and a
new enhanced screen update ioctl. Now it should be possible (did not try yet,
but will have some results very soon) to specify output position and size for
the rectangle as it gets displayed on the screen.

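struct omapfb_update_window {
        __u32 x, y;                     /* source rectangle in the framebuffer */
        __u32 width, height;
        __u32 format;                   /* color format plus flags, e.g. tearsync */
        __u32 out_x, out_y;             /* where the rectangle lands on the screen */
        __u32 out_width, out_height;    /* size of the rectangle on the screen */
        __u32 reserved[8];
};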

This theoretically allows us to use some kind of double buffering: we can
split the framebuffer into two 400x480 parts and while one part is being
displayed, another one can be freely filled with the data for the next frame.
This will effectively remove the need for OMAPFB_SYNC_GFX, improving
peak framerate.
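
Roughly, filling in the update request for each half could look like the sketch below. It is untested on the device: the structure is a local mirror of the one above so that it compiles without kernel headers, the name page_update is made up, and it assumes that the driver stretches the 400 pixels wide source to the out_width given.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct omapfb_update_window (same field order). */
struct update_window {
    uint32_t x, y;
    uint32_t width, height;
    uint32_t format;
    uint32_t out_x, out_y;
    uint32_t out_width, out_height;
    uint32_t reserved[8];
};

enum { FB_W = 800, FB_H = 480, PAGE_W = FB_W / 2 };

/*
 * Build the update request for page 0 (left half) or page 1 (right half)
 * of the framebuffer. Both pages are shown stretched to the full screen.
 * Hypothetical helper; scaling via out_width is an assumption.
 */
static struct update_window page_update(unsigned page, uint32_t format)
{
    struct update_window u = {0};

    assert(page < 2);
    u.x = page * PAGE_W;    /* source: one 400x480 half */
    u.y = 0;
    u.width = PAGE_W;
    u.height = FB_H;
    u.format = format;
    u.out_x = 0;            /* destination: whole 800x480 screen */
    u.out_y = 0;
    u.out_width = FB_W;
    u.out_height = FB_H;
    return u;
}
```

The player would then decode frame n+1 into page (n + 1) % 2 while the update for page n % 2 is still being transferred.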

But this solution will require support for arbitrary downscaling in YUV420
format for each video frame to fit 400x480 box. The quality will be also
reduced a bit, but on the other hand, graphics bus should have no
performance problems with sending 400x480 through it.

If virtual framebuffer size could be extended to 800x960, this would allow us
to use doublebuffering without sacrificing resolution. Anyway, I'll try to
fix MPlayer framebuffer output module to properly work with the latest
version of N800 firmware and implement this form of doublebuffering. It
should provide the fastest video output performance that is possible.

Regarding Nokia 770, now it uses 800x600 framebuffer virtual size (some
extra waste of RAM?). Anyway, if hwa742 kernel driver could be extended to
support this improved screen update API and respect 'out_x' and 'out_y'
arguments, we could have four video pages in framebuffer memory for
400x240 pixel doubled video output. It could allow us to implement very
efficient double buffering for the accelerated Nokia 770 SDL project if it ever
takes off the ground :)
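
The page counts mentioned above follow from simple integer division. A small sketch (the function name pages_that_fit is made up, and it assumes pages are laid out on a plain grid without overlap):

```c
#include <assert.h>

/*
 * Number of whole page_w x page_h pages that fit on a grid inside a
 * virt_w x virt_h virtual framebuffer. Hypothetical helper.
 */
static unsigned pages_that_fit(unsigned virt_w, unsigned virt_h,
                               unsigned page_w, unsigned page_h)
{
    assert(page_w > 0 && page_h > 0);
    /* integer division drops any partial page at the right or bottom edge */
    return (virt_w / page_w) * (virt_h / page_h);
}
```

With these numbers, pages_that_fit(800, 600, 400, 240) is 4 for Nokia 770, pages_that_fit(800, 480, 400, 480) is 2 for the current N800 framebuffer, and pages_that_fit(800, 960, 800, 480) is 2 for the extended 800x960 case.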

> Does mplayer use different threads for displaying and decoding and decode
> frames in advance?

No, it doesn't have any extra threads now. But video playback on Nokia 770
is already parallel, splitting tasks between the following pieces of hardware
each working simultaneously:
1. ARM core (demuxing and decoding video into framebuffer)
2. DMA + graphics controller (screen update transferring data from framebuffer
into videomemory and performing YUV->RGB conversion on the fly)
3. C55x DSP core (mp3 audio decoding and playback)

There is not much point in creating many threads on ARM, as we only have a
single ARM core and splitting work into several threads will not accelerate
overall performance. Threads could be useful for doing something extra while
waiting for other hardware components to finish their work (waiting for screen
update for example), but decoding ahead will also require storing the decoded
data somewhere. This place for storing decoded ahead frames could be only
some extra space in framebuffer memory, otherwise we would lose some
performance on moving this data to framebuffer later (and increasing battery
power consumption). As framebuffer space is limited, we would not be able to
store many frames ahead, and decoding cpu usage most likely varies not
between frames but more like between different scenes (complicated action
scene will make us run out of decode ahead buffer pretty fast). Anyway,
this is probably worth trying later; there even exists a thread-based
MPlayer fork: http://mplayerxp.sourceforge.net/


siarhei.siamashka at gmail

May 6, 2007, 4:06 AM


Views: 9832
Re: N800 & Video playback

On Friday 04 May 2007 10:49, Daniel Stone wrote:
> On Thu, May 03, 2007 at 11:10:32PM +0300, ext Siarhei Siamashka wrote:
> > Well, found what's the matter and added explanation at bugzilla:
> > https://maemo.org/bugzilla/show_bug.cgi?id=1281
> >
> > The workaround can be easily added to MPlayer, so that it will
> > never call XvShmPutImage with top left image corner at an odd line.
> > I'm going to release an updated MPlayer package (maybe even
> > a bit later today), it is really fast on N800 with the optimized xserver
> > :)
>
> Aha, that will indeed cause a fallback (x, y, width and height should
> all be aligned to 4px).

Could you clarify this information? The code from kernel framebuffer
driver (blizzard.c) suggests that only width should be 4px aligned:

switch (color_mode) {
case OMAPFB_COLOR_YUV420:
        /* Embedded window with different color mode */
        bpp = 12;
        /* X, Y, height must be aligned at 2, width at 4 pixels */
        x &= ~1;
        y &= ~1;
        height = yspan = height & ~1;
        width = width & ~3;
        break;

Does xserver introduce additional limitations?
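
To make the question concrete, here is a standalone sketch that applies the same rounding as the kernel fragment above and reports whether a rectangle would pass through untouched. The struct rect type and both function names are made up for illustration; nothing here is taken from xserver.

```c
#include <assert.h>
#include <stdbool.h>

struct rect { unsigned x, y, width, height; };

/* Same rounding as the YUV420 case quoted above: x, y and height are
 * rounded down to a multiple of 2, width to a multiple of 4. */
static struct rect align_yuv420(struct rect r)
{
    r.x &= ~1u;
    r.y &= ~1u;
    r.height &= ~1u;
    r.width &= ~3u;
    assert(r.width % 4 == 0 && r.height % 2 == 0);
    return r;
}

/* True if the kernel rounding would leave the rectangle as requested. */
static bool is_unchanged(struct rect r)
{
    struct rect a = align_yuv420(r);
    return a.x == r.x && a.y == r.y &&
           a.width == r.width && a.height == r.height;
}
```

By this rule an 800x450 rectangle at y = 14 is fine, while the same rectangle at y = 15 gets moved. If all four values really had to be multiples of 4, neither y = 14 nor a height of 450 would qualify.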


abos at hanno

Oct 18, 2007, 7:30 PM


Views: 9286
Re: N800 & Video playback

Hi,

>> The memory bandwidth to the N800 LCD framebuffer is 3 times slower than
>> the bandwidth in the N770? Is it really _that_ big?
>
> Siarhei's calculations were correct, so, yes.
>
>> What is limiting the bandwidth: The OMAP interface, the LCD controller
>> itself or was it a design issue.
>
> a) and c). It's just not stable at higher frequencies.

Just curious - is there any word out about the N810 regarding this
particular issue?

(As previously mentioned, my personal killer app for Maemo is full
screen 800x480 video @ 30 fps. Will it be possible?)

Thanks!

Hanno