Skip to Content.
Sympa Menu

perfsonar-user - Re: [perfsonar-user] "pscheduler result" raises exception {"result-merged": null}

Subject: perfSONAR User Q&A and Other Discussion

List archive

Re: [perfsonar-user] "pscheduler result" raises exception {"result-merged": null}


Chronological Thread 
  • From: Brian Candler <>
  • To: "" <>
  • Subject: Re: [perfsonar-user] "pscheduler result" raises exception {"result-merged": null}
  • Date: Fri, 6 Sep 2019 15:43:03 +0100
  • Domainkey-signature: a=rsa-sha1; c=nofws; d=pobox.com; h=subject:from:to :references:message-id:date:mime-version:in-reply-to :content-type; q=dns; s=sasl; b=aVYUNdYO1EOM5zDyqX51GcfJfCWMJzJg bLENgLXrXXPdNa5X4KrW+G+L8utdGgtnef2WJJaAbhmZwO2NJmma3fVAXTA4a5NG XUg/C/dUWC3cN93ngkeYUFWqoimo3OOXQzcTh5uuXtK1Di3z/hhktE9tD1xV0RIv 6JTlnDt53OA=

Argh...  I've found it, but it was a damn hard struggle.  In the process I found several places where pscheduler and esmond could have detected the problem, but blithely continued anyway.

TL;DR: The problem was this.  On the remote host ns1.BBBB.com, there was an entry in /etc/hosts for its IPv4 address, but not its IPv6 address.

B.B.B.B  ns1.BBBB.com ns1

This meant that the function pscheduler.ip_normalize_version() was returning null for the source end point address on the test (which was using IPv6):

python
>>> import pscheduler
>>> pscheduler.ip_normalize_version("ns1.BBBB.com", "perf1.home.AAAA.net", ip_version=6)
(None, 'A:A:A:A::45')

This function is called from /usr/lib/pscheduler/classes/archiver/esmond/esmond_util.py

Just to be clear: the host ns1.BBBB.com has valid forward and reverse, for both IPv4 and IPv6, in the DNS.  But putting IPv4 only in /etc/hosts was preventing ip_normalize_version from resolving the name to an IPv6 address.

This in turn meant that the archiver was posting metadata for the test with 'source': None, 'measurement-agent': None

esmond was quite happily accepting this and returning a 201 result, but was creating an archive with no members (event-type: [] - i.e. empty list)

esmond was also quite happily accepting POSTs of test results to this broken archive, also with 201 result, and throwing them away.  So according to all pscheduler logs, all the archiving was working fine, but actually nothing was being stored.

SOLUTION: added the host's own IPv6 address to /etc/hosts on the remote host.

B.B.B.B  ns1.BBBB.com ns1
B:B:B:B:feff:ff:fe00:f16  ns1.BBBB.com ns1

PROPOSED IMPROVEMENTS:

I would like to make the following suggestions to make the stack more robust against this sort of problem.

1. esmond should barf (return 4xx result) if you try to create an archive with missing mandatory attributes, such as "source": null.  I think it should also barf if you try to post test results to such a broken archive.

This would have shown as an archiver error on the lead node.

2. all the archivers, especially iperf3, should barf internally if after resolving the source and destination IP addresses, it finds either of them is None.  This would have shown as an exception on the lead node.

(Interestingly, the owamp archiver didn't seem to have this problem: I think that's probably because it was running on the local side.  I didn't have any /etc/hosts entries for the remote node on the local node)

3. if the test result already includes a resolved source and destination address with the correct address family, then that should be used instead of attempting to re-resolve the names of the endpoints.

For the iperf3 test, it looks like start.connected.local_host and start.connected.remote_host contain those addresses, and since that's the source-of-truth about what addresses were used, I think that "source" and "destination" should prefer to use those values (whereas "input-source" and "input-destination" can of course have the hostnames).  It might work better for multi-homed hosts too.

QUESTION:

Why did the IPv4 /etc/hosts entry mess up ip_normalize_version(), but not the tests themselves?  Maybe one is going via nsswitch but the other is doing DNS lookups?

If anyone wants to investigate further, this is an easy problem to reproduce (even on 4.2).  Add to /etc/hosts:

198.128.153.23 es.net

then type:

# python
Python 2.7.15+ (default, Nov 27 2018, 23:36:35)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pscheduler
>>> pscheduler.ip_normalize_version("es.net", "es.net", ip_version=6)
(None, None)
>>>

Remote the entry from /etc/hosts, and you get

>>> pscheduler.ip_normalize_version("es.net", "es.net", ip_version=6)
('2001:400:210:153::17', '2001:400:210:153::17')
>>>

But it's not important, if the other changes are made.

In case anyone is interested, I'll add the debugging trail to the end of this mail.

Regards,

Brian Candler.


-=-=-=-=-=-=-=-=-=-

On 06/09/2019 11:40, Brian Candler wrote:
But the specified archive on the local node appears to be empty: event-types is empty list.

root@perf1:~# curl -sk https://localhost/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/
{"url":"https://localhost/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/","metadata-key":"d7bd3b9b8f28474b97a730a49364ccb4","subject-type":"point-to-point","event-types":[],"uri":"/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/"}

I found the log (on the remote host) when this metadata was written to esmond on the local host:

Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    No metadata key, so posting to esmond
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Posting metadata to https://[A:A:A:A::45]/esmond/perfsonar/archive/: {'time-duration': 29.0, u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', 'pscheduler-test-type': 'throughput', 'source': None, 'measurement-agent': None, 'input-destination': u'A:A:A:A::45', 'destination': 'A:A:A:A::45', 'event-types': [{'event-type': 'failures'}, {'summaries': [{'summary-type': 'average', 'summary-window': 86400, 'event-type': 'throughput'}], 'event-type': 'throughput'}, {'event-type': 'throughput-subintervals'}, {'event-type': 'packet-retransmits'}, {'event-type': 'packet-retransmits-subintervals'}, {'event-type': 'pscheduler-run-href'}], 'subject-type': 'point-to-point', 'ip-transport-protocol': 'tcp', 'input-source': u'ns1.BBBB.com', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name': u'pscheduler/iperf3'}
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Posting metadata to https://[A:A:A:A::45]/esmond/perfsonar/archive/: {'time-duration': 29.0, u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', 'pscheduler-test-type': 'throughput', 'source': None, 'measurement-agent': None, 'input-destination': u'A:A:A:A::45', 'destination': 'A:A:A:A::45', 'event-types': [{'event-type': 'failures'}, {'summaries': [{'summary-type': 'average', 'summary-window': 86400, 'event-type': 'throughput'}], 'event-type': 'throughput'}, {'event-type': 'throughput-subintervals'}, {'event-type': 'packet-retransmits'}, {'event-type': 'packet-retransmits-subintervals'}, {'event-type': 'pscheduler-run-href'}], 'subject-type': 'point-to-point', 'ip-transport-protocol': 'tcp', 'input-source': u'ns1.BBBB.com', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name': u'pscheduler/iperf3'}
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Esmond returned HTTP 201: {u'url': u'https://[A:A:A:A::45]/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/', u'metadata-key': u'd7bd3b9b8f28474b97a730a49364ccb4', u'uri': u'/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/', u'subject-type': u'point-to-point', u'event-types': []}
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Esmond returned HTTP 201: {u'url': u'https://[A:A:A:A::45]/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/', u'metadata-key': u'd7bd3b9b8f28474b97a730a49364ccb4', u'uri': u'/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/', u'subject-type': u'point-to-point', u'event-types': []}
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Added metadata key d7bd3b9b8f28474b97a730a49364ccb4 for task https://localhost/pscheduler/tasks/00f5197e-6a0f-4b67-b7ef-8b58af3f0465@https://[A:A:A:A::45]/esmond/perfsonar/archive/ to memcache
Sep  5 21:22:57 ns1 archiver-esmond/archive DEBUG    Added metadata key d7bd3b9b8f28474b97a730a49364ccb4 for task https://localhost/pscheduler/tasks/00f5197e-6a0f-4b67-b7ef-8b58af3f0465@https://[A:A:A:A::45]/esmond/perfsonar/archive/ to memcache

This is weird.  In the post data, event-types was a non-empty list.  But in the 201 response, event-types was empty.

In the post data, I notice 'source': None, 'measurement-agent': None, which looks suspicious.

For comparison, here is a similar example on the local host, creating metadata for an outbound test:

Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    No metadata key, so posting to esmond
Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    No metadata key, so posting to esmond
Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    Posting metadata to https://[A:A:A:A::45]/esmond/perfsonar/archive/: {'time-duration': 29.0, u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', 'pscheduler-test-type': 'throughput', 'source': 'A:A:A:A::45', 'measurement-agent': 'A:A:A:A::45', 'input-destination': u'ns1.BBBB.com', 'destination': 'B:B:B:B:feff:ff:fe00:f16', 'event-types': [{'event-type': 'failures'}, {'summaries': [{'summary-type': 'average', 'summary-window': 86400, 'event-type': 'throughput'}], 'event-type': 'throughput'}, {'event-type': 'throughput-subintervals'}, {'event-type': 'packet-retransmits'}, {'event-type': 'packet-retransmits-subintervals'}, {'event-type': 'pscheduler-run-href'}], 'subject-type': 'point-to-point', 'ip-transport-protocol': 'tcp', 'input-source': u'A:A:A:A::45', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name': u'pscheduler/iperf3'}
Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    Posting metadata to https://[A:A:A:A::45]/esmond/perfsonar/archive/: {'time-duration': 29.0, u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', 'pscheduler-test-type': 'throughput', 'source': 'A:A:A:A::45', 'measurement-agent': 'A:A:A:A::45', 'input-destination': u'ns1.BBBB.com', 'destination': 'B:B:B:B:feff:ff:fe00:f16', 'event-types': [{'event-type': 'failures'}, {'summaries': [{'summary-type': 'average', 'summary-window': 86400, 'event-type': 'throughput'}], 'event-type': 'throughput'}, {'event-type': 'throughput-subintervals'}, {'event-type': 'packet-retransmits'}, {'event-type': 'packet-retransmits-subintervals'}, {'event-type': 'pscheduler-run-href'}], 'subject-type': 'point-to-point', 'ip-transport-protocol': 'tcp', 'input-source': u'A:A:A:A::45', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name': u'pscheduler/iperf3'}
Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    Esmond returned HTTP 201: {u'metadata-key': u'06e4abe9da4a445789aa7c51bfbec5d9', u'tool-name': u'pscheduler/iperf3', u'pscheduler-test-type': u'throughput', u'time-duration': u'29.0', u'url': u'https://[A:A:A:A::45]/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/', u'measurement-agent': u'A:A:A:A::45', u'input-destination': u'ns1.BBBB.com', u'destination': u'B:B:B:B:feff:ff:fe00:f16', u'uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/', u'event-types': [{u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits/base', u'event-type': u'packet-retransmits'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/pscheduler-run-href/base', u'event-type': u'pscheduler-run-href'}, {u'time-updated': 1567765553, u'summaries': [{u'summary-type': u'average', u'summary-window': u'86400', u'uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/averages/86400', u'time-updated': 1567765553}], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/base', u'event-type': u'throughput'}, {u'time-updated': None, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/failures/base', u'event-type': u'failures'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput-subintervals/base', u'event-type': u'throughput-subintervals'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits-subintervals/base', u'event-type': u'packet-retransmits-subintervals'}], u'subject-type': u'point-to-point', u'source': u'A:A:A:A::45', u'input-source': u'A:A:A:A::45', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', u'ip-transport-protocol': u'tcp'}
Sep  6 10:43:51 perf1 archiver-esmond/archive DEBUG    Esmond returned HTTP 201: {u'metadata-key': u'06e4abe9da4a445789aa7c51bfbec5d9', u'tool-name': u'pscheduler/iperf3', u'pscheduler-test-type': u'throughput', u'time-duration': u'29.0', u'url': u'https://[A:A:A:A::45]/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/', u'measurement-agent': u'A:A:A:A::45', u'input-destination': u'ns1.BBBB.com', u'destination': u'B:B:B:B:feff:ff:fe00:f16', u'uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/', u'event-types': [{u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits/base', u'event-type': u'packet-retransmits'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/pscheduler-run-href/base', u'event-type': u'pscheduler-run-href'}, {u'time-updated': 1567765553, u'summaries': [{u'summary-type': u'average', u'summary-window': u'86400', u'uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/averages/86400', u'time-updated': 1567765553}], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/base', u'event-type': u'throughput'}, {u'time-updated': None, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/failures/base', u'event-type': u'failures'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput-subintervals/base', u'event-type': u'throughput-subintervals'}, {u'time-updated': 1567765553, u'summaries': [], u'base-uri': u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits-subintervals/base', u'event-type': u'packet-retransmits-subintervals'}], u'subject-type': u'point-to-point', u'source': u'A:A:A:A::45', u'input-source': u'A:A:A:A::45', u'pscheduler-reference-psconfig-created-by-uuid': u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', u'pscheduler-reference-psconfig-created-by-user-agent': u'psconfig-pscheduler-agent', u'ip-transport-protocol': u'tcp'}

This time, source and measurement-agent in the posted data are not null, and the esmond response contains a whole bunch of other URLs for related archives.

This suggests at least two problems:

1. esmond (in 4.2) accepts junk metadata, but happily gives a 201 response anyway as if everything was OK.

2. archiver (in 4.1.6 anyway) is generating junk metadata with null source and measurement-agent.

I thought at first this could be related to the missing reverse DNS for the local host; but actually in this case, the source and measurement agent should be the remote host (ns1.BBBB.com, or B:B:B:B:feff:ff:fe00:f16) - and that *does* have correct reverse DNS.

That leaves one thing: where are the null source and measurement-agent values coming from?  Back to source code. Note: this has changed a bit in 4.2, I am looking at 4.1.6.

=> /usr/lib/pscheduler/classes/archiver/esmond/archive

    #get explicit measurement-agent if se
    measurement_agent = None
    if "measurement-agent" in json['data']:
        measurement_agent = json['data']['measurement-agent']

=> /usr/lib/pscheduler/classes/archiver/esmond/esmond_util.py

            #determine source since its optional
            input_source = lead_participant
            if src_field and src_field in test_spec:
                input_source = test_spec[src_field]

            #determine if we are forcing an ip-version
            ip_version = None
            if ipv_field in test_spec:
                ip_version = test_spec[ipv_field]

            #get dest if this is point-to-point
            src_ip = None
            dest_ip = None
            input_dest = None
            if dst_field:
                self.metadata['subject-type'] = 'point-to-point'
                input_dest = test_spec[dst_field]
                src_ip, dest_ip = pscheduler.ip_normalize_version(input_source, input_dest, ip_version=ip_version)
            else:
                self.metadata['subject-type'] = 'network-element'
                src_ip, tmp_ip = pscheduler.ip_normalize_version(input_source, input_source, ip_version=ip_version)

            #set fields
            self.metadata['source'] = src_ip
            if dest_ip:
                self.metadata['destination'] = dest_ip
...
            #Make measurement-agent the created_by_address if we have it, otherwise the lead participant, with same ip type as source
            if measurement_agent:
                src_ip, self.metadata['measurement-agent'] = pscheduler.ip_normalize_version(src_ip, measurement_agent)
            else:
                src_ip, self.metadata['measurement-agent'] = pscheduler.ip_normalize_version(src_ip, lead_participant)

And aha: ip_normalize_version is broken.

>>> import pscheduler
>>> pscheduler.ip_normalize_version("ns1.BBBB.com", "perf1.home.AAAA.net", ip_version=4)
('B.B.B.B', None)
>>> pscheduler.ip_normalize_version("ns1.BBBB.com", "perf1.home.AAAA.net", ip_version=6)
(None, 'A:A:A:A::45')

It works fine on the local node (4.2).  This is where I discovered that the problem was not the difference in code, but the IPv4 entry in /etc/hosts preventing IPv6 lookup.






Archive powered by MHonArc 2.6.19.

Top of Page