Skip to Content.
Sympa Menu

perfsonar-dev - Re: [pS-dev] perfSONAR protocol modification for high volume data transfers

Subject: perfsonar development work

List archive

Re: [pS-dev] perfSONAR protocol modification for high volume data transfers


Chronological Thread 
  • From: Verena Venus <>
  • To: Jeff Boote <>
  • Cc: Roman Lapacz <>, , , Sibylle Schweizer-Jäckle <>
  • Subject: Re: [pS-dev] perfSONAR protocol modification for high volume data transfers
  • Date: Wed, 12 Mar 2008 11:33:58 +0100

Hi,

Am Dienstag, 11. März 2008 17:02:53 schrieb Jeff Boote:
> Hi All,
>
> Roman is absolutely correct. This is something we have discussed as a
> possibility nearly from the beginning. This type of solution is
> complicated by the fact that you need to communicate a much larger
> amount of information than just a URL for the data connection. (AA,
> binary data format, byte ordering, etc...)
>
> However, I do wonder what specific data services you are seeing a
> problem with at this time? Can you make your problem more concrete?
>

The CNM tool asks every 5 minutes for data from all Hades measurements
collected in these 5 minutes. Obviously, this reqeust has to be processed
within 5 minutes. As we have severeal thousand measruements running, the
request file itself already contains several megabytes of XML, which has to
be parsed. Then, these several thousand measurements have to be loaded in
memory to transform them into an XML document. These XML operations take some
time - we put quite some work into it to improve the performance. Now we
reached the point where we have to question the mechanism itself, as we do
see only insignificant performance gains from further internal optimizations,
which will not solve the problem in the long-term.

Of course, we could leave it at that and live with the fact that the CNM is
not able to do the 5 minutes requests. But I think there is quite a potential
to solve this issue. That's why Andreas and me brought this point into
discussion again.

I know that this idea is not new, and was not considered further so far as
there was no need for it. But right now we have a case where we should at
least consider the possibilities.

> There are several ways to mitigate the issues - and I'm not wondering
> if the DataHandle solution* (as I believe we called it IIRC - it
> could be push or pull) is really needed yet. For example, one
> possible solution to larger data flows we have considered in the pS-
> PS code is base64 encoding the entire data array for high-volume
> data. That can be done within the context of the single control-
> communication that currently exists and would hopefully save on the
> xml parsing. (It would require some additional metadata/parameters,
> but could be done within the same interaction model.)
>
> * The DataHandle was in effect a URL for where the data could be
> retrieved (or could be sent depending on the context (push vs pull) -
> and included additional information on the binary representation and
> AA needs. I actually wrote up a description of how this would work,
> but alas - it doesn't seem to be on the wiki anymore. Evidently lost
> during one of the previous re-organizations of the wiki.
>

I think that's more or less what we had in mind.

> [more comments in-line]
>
> On Mar 11, 2008, at 8:42 AM, Roman Lapacz wrote:
> > This issue takes us again to the push interface to transfer high
> > volume data. Once we had idea to create two communication channels:
> > control one (nmwg format) and data one (no xml structure, just
> > data). The control channel would dynamically set up data connection
> > between a sender and a receiver.
> >
> > Andreas Hanemann wrote:
> >> The difficulties are more or less tied to the flexibility that is
> >> foreseen in the NMWG format as briefly summarized in the
> >> following. There is the concept that data and metadata contain
> >> pointers so that a message typically looks like
> >>
> >> “Metadata2, metadata1 (containing reference to metadata2), data
> >> (containing reference to metadata1)”,
> >>
> >> but can also be organized in a different arbitrary manner. A
> >> further possibility is the potential multiple use of references
> >> (for filter metadata) which may look as follows.
> >>
> >> “Filter metadata, metadata1 (containing reference to filter
> >> metadata), metadata2 (containing reference to filter metadata),
> >> data1 (containing reference to metadata1), data2 (containing
> >> reference to metadata2)”
> >>
> >> The processing would therefore require an interleaved referencing
> >> to previous metadata so that a serial processing of data is not
> >> possible and requires that a complicated XML parsing tree is built
> >> in memory. Even though the latter case has currently not been used
> >> yet (to our knowledge), we have to address the amount of data to
> >> be kept in memory during the XML parsing.
>
> Your concern is keeping this metadata in memory during parsing? The
> metadata should not be very large... Can you show an example where
> this has really been a concern?
>

See above...
To give you figures, it would take us some time to prepare proper performance
tests. I'm not sure if we can afford this time in the near future. However,
this should be also a debate on principles, as we think there are probably
more cases where performance of large data transfers is an issue.

> >> What should be done is to try to allow for a serialized
> >> processing. There could be a metadata flag that says that the data
> >> in the message allow for serialized processing. This means that
> >> possibilities for arbitrary ordering are not used in this case and
> >> also that some metadata are repeated (e.g. the filtering metadata
> >> mentioned above). The idea of the flag is that services which have
> >> no need for high-volume data exchange may ignore the flag and
> >> process data as before so that there is no need for changes. Other
> >> services would be required to have a new library for parsing. We
> >> have to check for potential problems if there are messages
> >> suitable for stream processing and others are not stream
> >> processing-enabled (e.g. one RRD MA instance sends stream
> >> processing-enabled messages and the others not).
>
> So, this method would only save you from having to cache metadata,
> right? Or am I missing something?
>

It would also save us from building large XML files.

> >> A further alternative would be a RESTful architecture which could
> >> make use of a modified MetaDataKeyRequest. A client asks for a key
> >> for a certain parameter set. The key in the answer is then a URL
> >> which is the location where the client can fetch the data together
> >> with a metadata description. The further data transfer is then
> >> done with HTTP which allows for stream processing. The data would
> >> not be wrapped in XML anymore which would minimize the overhead.
> >> However, this solution would require larger modifications to
> >> perfSONAR.
>
> This is much more similar to the DataHandle idea that we discussed
> very early on and have put on the back-burner since it has not been
> needed yet.
>

Maybe the right time has come to dig it out again.

> Since this represents a very large change, I would want to see more
> information on why you think it is needed. If for no other reason,
> then to better be able to evaluate any potential solutions.
>

I hope I could shed some light on the situation.

Kind regards,
Verena

> Thanks,
> jeff
> --
> Jeff Boote
>





Archive powered by MHonArc 2.6.16.

Top of Page