ndt-dev - [ndt-dev] Fwd: Re: Interesting Flash benchmark results

From: Will Hawkins <>
To: Sebastian Kostuch <>, "<>" <>
Subject: [ndt-dev] Fwd: Re: Interesting Flash benchmark results
Date: Fri, 21 Mar 2014 14:50:39 -0400

Per my previous message, here is a compilation of the information we've
found regarding network I/O performance for the Flash runtime.

-------- Original Message --------
Subject: Re: Interesting Flash benchmark results
Date: Tue, 18 Mar 2014 19:47:59 -0400
From: Will Hawkins
To: Matt Mathis
CC: Jordan McCarthy

On 03/18/2014 07:36 PM, Will Hawkins wrote:
> I've experimentally determined that, sadly, there appears to be no way
> to get faster than that 10ms resolution. I ran strace on the runtime to
> see what I could see. Here is the important bit:
>
> 0.000028 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}], 2, 10)
>
> where poll() is prototyped as
>
> int poll(struct pollfd *fds, nfds_t nfds, int timeout);
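
To make the meaning of that third argument concrete, here is a minimal
sketch in Python (not the Flash runtime's own code) of a loop that calls
poll() with a 10 ms timeout on a descriptor that never becomes ready. The
function name, the pipe, and the 0.2 s duration are illustrative
assumptions; it relies on select.poll, which is available on Linux and
most Unix systems.

```python
import os
import select
import time


def idle_poll_wakeups(timeout_ms=10, duration_s=0.2):
    """Count how often a poll() loop wakes up when no descriptor is ready."""
    # Hypothetical stand-in for the runtime's fd set: a pipe nobody writes to.
    read_end, write_end = os.pipe()
    poller = select.poll()
    poller.register(read_end, select.POLLIN)
    wakeups = 0
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            # timeout_ms plays the role of the trailing "10" in the strace line.
            events = poller.poll(timeout_ms)
            if not events:
                wakeups += 1  # timed out with nothing to read
    finally:
        os.close(read_end)
        os.close(write_end)
    return wakeups
```

With a 10 ms timeout the loop cannot wake much more often than about 100
times per second, no matter how quickly data arrives elsewhere.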

Some further information:

lsof returns some interesting information about what is in that fd set:

adl 7822 hawkinsw 3u 0000 0,9 0 6978 anon_inode
adl 7822 hawkinsw 6u unix 0x00000000 0t0 88865 socket
adl 7822 hawkinsw 9u IPv4 89107 0t0 TCP *:5001 (LISTEN)
adl 7822 hawkinsw 10u IPv4 85153 0t0 TCP localhost:5001->localhost:59446 (ESTABLISHED)

In other words, the runtime does not poll for data on active sockets. It
is only polling on those anon_inode and unix socket file descriptors.
This is all the more reason to hope that this behavior is runtime
dependent and that there may be finer-grained resolution on more
"popular" platforms.

I will do some comparison on Mac/Windows tomorrow and let you know what
I find.

Will

>
> So, we can see that the 10ms is built into the runtime. The Linux
> version of the runtime uses GTK and those poll()s are called within the
> GTK main runloop. So, it's possible that a) we can tweak those default
> settings for testing purposes and b) the intervals may vary based on
> platform. Neither seems likely, but I will explore those
> possibilities.
>
> I thought that a workaround might be building a high resolution timer
> and checking for data at that resolution. However, the timer mechanism
> in the runtime is limited to about 16ms (according to the documentation).
>
> I am also going to experiment with the runtime threading support to see
> if that is an option. I will keep you posted.
>
> Will
>
> On 03/17/2014 06:03 PM, Matt Mathis wrote:
>> So it seems that the main thread has a 10 ms timer in it (1564 events
>> per 16 seconds), and each tick nominally reads 64*1024 bytes. Note that
>> it only used about 0.572 seconds of CPU time in 16 seconds of wall time,
>> so it is spending nearly all of its time in some sort of wait (probably
>> a select). It is using only about 3% of your cpu.
>>
>> If you want to reverse engineer the runtime look at the ps "WCHAN"
>> column for clues. This is probably pointless, although it might be
>> worthwhile to try different runtimes.
>>
>> Does this size match any declared buffers? If not, it is probably the
>> built-in (à la stdio) copy buffer.
>>
>> Clearly the tool maxes out at 50 Mb/s. I would be inclined to be
>> suspicious of any rate above half that, because it is completely normal
>> to see peak data rates that are twice the average. If the peak rate is
>> being limited and the average is above 25 Mb/s, you don't know about any
>> secondary effects of limiting the rate. (They may not be a problem, but
>> you do have to consider the possibility).
>>
>> The short version: "Measured rates above 25 Mb/s may be subject to
>> calibration errors due to Flash client limitations".
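
The figures in this message can be checked with a little arithmetic. The
sketch below is in Python purely for illustration; the constant and
function names are assumptions, while the numbers come from this thread
(the 10 ms tick, the nominal 64*1024-byte read, and the 10240-chunk run
that moved 102400000 bytes in 16.0099 s over 1564 events).

```python
TICK_SECONDS = 0.010        # 10 ms poll timeout seen in the strace output
BYTES_PER_TICK = 64 * 1024  # nominal read size per tick


def ceiling_mbit_per_s(bytes_per_tick=BYTES_PER_TICK, tick_seconds=TICK_SECONDS):
    """Upper bound on throughput if exactly one read happens per tick."""
    return bytes_per_tick * 8 / tick_seconds / 1e6


def observed(total_bytes=102_400_000, seconds=16.0099, events=1564):
    """Derive rate, bytes per event, and event spacing from the benchmark run."""
    return {
        "mbit_per_s": total_bytes * 8 / seconds / 1e6,
        "bytes_per_event": total_bytes / events,
        "ms_per_event": seconds / events * 1000,
    }


if __name__ == "__main__":
    print(f"ceiling:  {ceiling_mbit_per_s():.1f} Mb/s")
    for key, value in observed().items():
        print(f"{key}: {value:.1f}")
```

The ceiling works out to roughly 52.4 Mb/s, and the observed run sits just
under it at about 51.2 Mb/s, with close to 64 KiB per event and a little
over 10 ms between events. That is consistent with the "maxes out at
50 Mb/s" estimate.
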
>>
>> Now does this apply to all clients, or are some different? BTW, it is
>> really important to expose the "receiver limited" message.
>>
>> Thanks,
>> --MM--
>> The best way to predict the future is to create it. - Alan Kay
>>
>> Privacy matters! We know from recent events that people are using our
>> services to speak in defiance of unjust governments. We treat privacy
>> and security as matters of life and death, because for some users, they
>> are.
>>
>>
>> On Fri, Mar 14, 2014 at 5:18 PM, Will Hawkins wrote:
>>
>> Matt,
>>
>> Here are some early benchmark results.
>>
>> I first ran a test locally (localhost) using netcat as the sender and
>> the receiver to transmit 100MB of data:
>>
>> Netcat Null Receiver:
>>
>> hawkinsw@worldwide:flash-benchmark$ time dd bs=1024 count=100000 if=/dev/zero | nc 127.0.0.1 5001
>> 100000+0 records in
>> 100000+0 records out
>> 102400000 bytes (102 MB) copied, 0.220695 s, 464 MB/s
>>
>> real 0m0.225s
>> user 0m0.048s
>> sys 0m0.268s
>>
>>
>> hawkinsw@worldwide:flash-benchmark$ nc -vl 0.0.0.0 5001 | wc -c
>> Connection from 127.0.0.1 port 5001 [tcp/*] accepted
>> 102400000
>>
>> Pretty quick! Then, I stripped out the networking code from the flash
>> client and copied it into a standalone flash application that can be
>> run under a standalone runtime. In other words, we can try to isolate
>> the Flash runtime's I/O performance from the browser. The program
>> simply reads the data and discards it in exactly the same way as the
>> flash client.
>>
>> Slow (with existing read algorithm -- one byte at a time)
>>
>> hawkinsw@worldwide:bin$ time dd bs=1024 count=100000 if=/dev/zero | nc 127.0.0.1 5001
>> 100000+0 records in
>> 100000+0 records out
>> 102400000 bytes (102 MB) copied, 43.4306 s, 2.4 MB/s
>>
>> real 0m43.672s
>> user 0m0.012s
>> sys 0m0.264s
>>
>>
>> hawkinsw@worldwide:flash-benchmark$ ./run.sh 2>&1 | tee output.txt
>> Closing read (connection 102400000 bytes in 1564 events).
>>
>> That's really, really slow! One of the first things that jumped out at
>> me is that the code is using its network I/O very inefficiently. It is
>> attempting to read one byte at a time.
>>
>> So, I changed the benchmark program to be more efficient. I made it
>> read bytes a chunk at a time rather than one byte at a time:
>>
>> Better (with 10240-byte chunks)
>> Source:
>>
>> hawkinsw@worldwide:bin$ time dd bs=1024 count=100000 if=/dev/zero | nc 127.0.0.1 5001
>> 100000+0 records in
>> 100000+0 records out
>> 102400000 bytes (102 MB) copied, 16.0099 s, 6.4 MB/s
>>
>> real 0m16.104s
>> user 0m0.044s
>> sys 0m0.528s
>> Sink:
>>
>> hawkinsw@worldwide:flash-benchmark$ ./run.sh 2>&1 | tee output.txt
>> Closing connection (read 102400000 bytes in 1564 events).
>>
>> I experimented with different chunk sizes but never really got any
>> better than that. It's still a significant improvement.
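
The difference between the two read strategies can be sketched outside
Flash. The Python below is an analogy only, not the client's ActionScript:
it drains a local socket pair either one byte per call or in 10240-byte
chunks and counts the receive calls. The names drain and run_once and the
payload size used in the example are assumptions made for illustration.

```python
import socket
import threading


def drain(sock, chunk_size):
    """Read until EOF, discarding the data; return (total_bytes, recv_calls)."""
    total = 0
    calls = 0
    while True:
        data = sock.recv(chunk_size)
        if not data:  # empty result means the peer closed the connection
            break
        total += len(data)
        calls += 1
    return total, calls


def run_once(payload_size, chunk_size):
    """Send payload_size zero bytes over a socket pair and drain them."""
    sender, receiver = socket.socketpair()
    payload = bytes(payload_size)  # zero-filled, like reading /dev/zero

    def send():
        with sender:  # closing the sender signals EOF to the reader
            sender.sendall(payload)

    thread = threading.Thread(target=send)
    thread.start()
    with receiver:
        result = drain(receiver, chunk_size)
    thread.join()
    return result


if __name__ == "__main__":
    print("byte at a time:", run_once(20000, 1))
    print("10240 chunks:  ", run_once(20000, 10240))
```

Reading one byte per call needs as many calls as there are bytes, while
the chunked reader needs far fewer for the same total. The per-call
overhead is what the chunked version avoids.
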
>>
>> Clearly this is a contrived scenario but I think it shows that there is
>> plenty of work to do on the flash client so that the receiver is not
>> negatively affecting results. This is exactly as you surmised!
>>
>> I applied this change to the flash client itself and compared the
>> performance with the original version. There is a noticeable
>> improvement on the receiver limited metric, but not as much as I
>> expected after seeing the results above.
>>
>> In the tests above I tracked the number of "events". Those are the
>> number of times that the runtime alerts (by invoking a callback
>> function) the application that data is available on the socket. It is
>> good to see that the number of events is the same no matter whether we
>> read data one byte at a time or one chunk at a time. However, I am
>> interested in seeing exactly how expensive those event dispatches are
>> and whether they are negatively impacting network I/O performance.
>>
>> I am very eager to hear your thoughts on this! I hope you have a great
>> weekend!
>>
>> Will
>>
>>
>>
