Argh... I've found it, but it was a
damn hard struggle. In the process I found several places where
pscheduler and esmond could have detected the problem, but
blithely continued anyway.
TL;DR: The problem was this. On the
remote host ns1.BBBB.com, there was an entry in /etc/hosts for its
IPv4 address, but not its IPv6 address.
B.B.B.B ns1.BBBB.com ns1
This meant that the function pscheduler.ip_normalize_version() was
returning null for the source end point address on the test (which
was using IPv6):
python
>>> import pscheduler
>>>
pscheduler.ip_normalize_version("ns1.BBBB.com",
"perf1.home.AAAA.net", ip_version=6)
(None, 'A:A:A:A::45')
This function is called from /usr/lib/pscheduler/classes/archiver/esmond/esmond_util.py
Just to be clear: the host ns1.BBBB.com
has valid forward and reverse, for both IPv4 and IPv6, in the
DNS. But putting IPv4 only in /etc/hosts was preventing
ip_normalize_version from resolving the name to an IPv6 address.
This in turn meant that the archiver was posting metadata for the
test with 'source': None, 'measurement-agent': None
esmond was quite happily accepting this and returning a 201
result, but was creating an archive with no members (event-type:
[] - i.e. empty list)
esmond was also quite happily accepting POSTs of test results to
this broken archive, also with 201 result, and throwing them
away. So according to all pscheduler logs, all the archiving was
working fine, but actually nothing was being stored.
SOLUTION: added the host's own IPv6 address to /etc/hosts on the
remote host.
B.B.B.B ns1.BBBB.com ns1
B:B:B:B:feff:ff:fe00:f16 ns1.BBBB.com ns1
PROPOSED IMPROVEMENTS:
I would like to make the following
suggestions to make the stack more robust against this sort of
problem.
1. esmond should barf (return 4xx
result) if you try to create an archive with missing mandatory
attributes, such as "source": null. I think it should also barf
if you try to post test results to such a broken archive.
This would have shown as an archiver
error on the lead node.
2. all the archivers, especially
iperf3, should barf internally if after resolving the source and
destination IP addresses, it finds either of them is None. This
would have shown as an exception on the lead node.
(Interestingly, the owamp archiver
didn't seem to have this problem: I think that's probably because
it was running on the local side. I didn't have any /etc/hosts
entries for the remote node on the local node)
3. if the test result already includes
a resolved source and destination address with the correct address
family, then that should be used instead of attempting to
re-resolve the names of the endpoints.
For the iperf3 test, it looks like start.connected.local_host
and start.connected.remote_host contain those addresses,
and since that's the source-of-truth about what addresses were
used, I think that "source" and "destination" should prefer to use
those values (whereas "input-source" and "input-destination" can
of course have the hostnames). It might work better for
multi-homed hosts too.
QUESTION:
Why did the IPv4 /etc/hosts entry
mess up ip_normalize_version(), but not the tests themselves?
Maybe one is going via nsswitch but the other is doing DNS
lookups?
If anyone wants to investigate
further, this is an easy problem to reproduce (even on 4.2).
Add to /etc/hosts:
198.128.153.23 es.net
then type:
# python
Python 2.7.15+ (default, Nov 27 2018, 23:36:35)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for
more information.
>>> import pscheduler
>>> pscheduler.ip_normalize_version("es.net",
"es.net", ip_version=6)
(None, None)
>>>
Remote the entry from /etc/hosts, and
you get
>>>
pscheduler.ip_normalize_version("es.net", "es.net",
ip_version=6)
('2001:400:210:153::17', '2001:400:210:153::17')
>>>
But it's not important, if the other
changes are made.
In case anyone is interested, I'll add
the debugging trail to the end of this mail.
Regards,
Brian Candler.
-=-=-=-=-=-=-=-=-=-
On 06/09/2019 11:40, Brian Candler
wrote:
But the
specified archive on the local node appears to be empty:
event-types is empty list.
root@perf1:~# curl -sk https://localhost/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/
{"url":"https://localhost/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/","metadata-key":"d7bd3b9b8f28474b97a730a49364ccb4","subject-type":"point-to-point","event-types":[],"uri":"/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/"}
I found the log (on the remote host) when this metadata was
written to esmond on the local host:
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG No
metadata key, so posting to esmond
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Posting metadata to
https://[A:A:A:A::45]/esmond/perfsonar/archive/:
{'time-duration': 29.0,
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', 'pscheduler-test-type':
'throughput', 'source': None, 'measurement-agent': None,
'input-destination': u'A:A:A:A::45', 'destination':
'A:A:A:A::45', 'event-types': [{'event-type': 'failures'},
{'summaries': [{'summary-type': 'average', 'summary-window':
86400, 'event-type': 'throughput'}], 'event-type':
'throughput'}, {'event-type': 'throughput-subintervals'},
{'event-type': 'packet-retransmits'}, {'event-type':
'packet-retransmits-subintervals'}, {'event-type':
'pscheduler-run-href'}], 'subject-type': 'point-to-point',
'ip-transport-protocol': 'tcp', 'input-source': u'ns1.BBBB.com',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name':
u'pscheduler/iperf3'}
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Posting metadata to
https://[A:A:A:A::45]/esmond/perfsonar/archive/:
{'time-duration': 29.0,
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', 'pscheduler-test-type':
'throughput', 'source': None, 'measurement-agent': None,
'input-destination': u'A:A:A:A::45', 'destination':
'A:A:A:A::45', 'event-types': [{'event-type': 'failures'},
{'summaries': [{'summary-type': 'average', 'summary-window':
86400, 'event-type': 'throughput'}], 'event-type':
'throughput'}, {'event-type': 'throughput-subintervals'},
{'event-type': 'packet-retransmits'}, {'event-type':
'packet-retransmits-subintervals'}, {'event-type':
'pscheduler-run-href'}], 'subject-type': 'point-to-point',
'ip-transport-protocol': 'tcp', 'input-source': u'ns1.BBBB.com',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name':
u'pscheduler/iperf3'}
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Esmond returned HTTP 201: {u'url':
u'https://[A:A:A:A::45]/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/',
u'metadata-key': u'd7bd3b9b8f28474b97a730a49364ccb4', u'uri':
u'/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/',
u'subject-type': u'point-to-point', u'event-types': []}
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Esmond returned HTTP 201: {u'url':
u'https://[A:A:A:A::45]/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/',
u'metadata-key': u'd7bd3b9b8f28474b97a730a49364ccb4', u'uri':
u'/esmond/perfsonar/archive/d7bd3b9b8f28474b97a730a49364ccb4/',
u'subject-type': u'point-to-point', u'event-types': []}
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Added metadata key d7bd3b9b8f28474b97a730a49364ccb4 for task
https://localhost/pscheduler/tasks/00f5197e-6a0f-4b67-b7ef-8b58af3f0465@https://[A:A:A:A::45]/esmond/perfsonar/archive/
to memcache
Sep 5 21:22:57 ns1 archiver-esmond/archive DEBUG
Added metadata key d7bd3b9b8f28474b97a730a49364ccb4 for task
https://localhost/pscheduler/tasks/00f5197e-6a0f-4b67-b7ef-8b58af3f0465@https://[A:A:A:A::45]/esmond/perfsonar/archive/
to memcache
This is weird. In the post data, event-types was a non-empty
list. But in the 201 response, event-types was empty.
In the post data, I notice 'source': None, 'measurement-agent':
None, which looks suspicious.
For comparison, here is a similar example on the local host,
creating metadata for an outbound test:
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG No
metadata key, so posting to esmond
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG No
metadata key, so posting to esmond
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG
Posting metadata to
https://[A:A:A:A::45]/esmond/perfsonar/archive/:
{'time-duration': 29.0,
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', 'pscheduler-test-type':
'throughput', 'source': 'A:A:A:A::45', 'measurement-agent':
'A:A:A:A::45', 'input-destination': u'ns1.BBBB.com',
'destination': 'B:B:B:B:feff:ff:fe00:f16', 'event-types':
[{'event-type': 'failures'}, {'summaries': [{'summary-type':
'average', 'summary-window': 86400, 'event-type':
'throughput'}], 'event-type': 'throughput'}, {'event-type':
'throughput-subintervals'}, {'event-type':
'packet-retransmits'}, {'event-type':
'packet-retransmits-subintervals'}, {'event-type':
'pscheduler-run-href'}], 'subject-type': 'point-to-point',
'ip-transport-protocol': 'tcp', 'input-source': u'A:A:A:A::45',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name':
u'pscheduler/iperf3'}
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG
Posting metadata to
https://[A:A:A:A::45]/esmond/perfsonar/archive/:
{'time-duration': 29.0,
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', 'pscheduler-test-type':
'throughput', 'source': 'A:A:A:A::45', 'measurement-agent':
'A:A:A:A::45', 'input-destination': u'ns1.BBBB.com',
'destination': 'B:B:B:B:feff:ff:fe00:f16', 'event-types':
[{'event-type': 'failures'}, {'summaries': [{'summary-type':
'average', 'summary-window': 86400, 'event-type':
'throughput'}], 'event-type': 'throughput'}, {'event-type':
'throughput-subintervals'}, {'event-type':
'packet-retransmits'}, {'event-type':
'packet-retransmits-subintervals'}, {'event-type':
'pscheduler-run-href'}], 'subject-type': 'point-to-point',
'ip-transport-protocol': 'tcp', 'input-source': u'A:A:A:A::45',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6', 'tool-name':
u'pscheduler/iperf3'}
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG
Esmond returned HTTP 201: {u'metadata-key':
u'06e4abe9da4a445789aa7c51bfbec5d9', u'tool-name':
u'pscheduler/iperf3', u'pscheduler-test-type': u'throughput',
u'time-duration': u'29.0', u'url':
u'https://[A:A:A:A::45]/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/',
u'measurement-agent': u'A:A:A:A::45', u'input-destination':
u'ns1.BBBB.com', u'destination': u'B:B:B:B:feff:ff:fe00:f16',
u'uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/',
u'event-types': [{u'time-updated': 1567765553, u'summaries': [],
u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits/base',
u'event-type': u'packet-retransmits'}, {u'time-updated':
1567765553, u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/pscheduler-run-href/base',
u'event-type': u'pscheduler-run-href'}, {u'time-updated':
1567765553, u'summaries': [{u'summary-type': u'average',
u'summary-window': u'86400', u'uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/averages/86400',
u'time-updated': 1567765553}], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/base',
u'event-type': u'throughput'}, {u'time-updated': None,
u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/failures/base',
u'event-type': u'failures'}, {u'time-updated': 1567765553,
u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput-subintervals/base',
u'event-type': u'throughput-subintervals'}, {u'time-updated':
1567765553, u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits-subintervals/base',
u'event-type': u'packet-retransmits-subintervals'}],
u'subject-type': u'point-to-point', u'source': u'A:A:A:A::45',
u'input-source': u'A:A:A:A::45',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6',
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', u'ip-transport-protocol': u'tcp'}
Sep 6 10:43:51 perf1 archiver-esmond/archive DEBUG
Esmond returned HTTP 201: {u'metadata-key':
u'06e4abe9da4a445789aa7c51bfbec5d9', u'tool-name':
u'pscheduler/iperf3', u'pscheduler-test-type': u'throughput',
u'time-duration': u'29.0', u'url':
u'https://[A:A:A:A::45]/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/',
u'measurement-agent': u'A:A:A:A::45', u'input-destination':
u'ns1.BBBB.com', u'destination': u'B:B:B:B:feff:ff:fe00:f16',
u'uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/',
u'event-types': [{u'time-updated': 1567765553, u'summaries': [],
u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits/base',
u'event-type': u'packet-retransmits'}, {u'time-updated':
1567765553, u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/pscheduler-run-href/base',
u'event-type': u'pscheduler-run-href'}, {u'time-updated':
1567765553, u'summaries': [{u'summary-type': u'average',
u'summary-window': u'86400', u'uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/averages/86400',
u'time-updated': 1567765553}], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput/base',
u'event-type': u'throughput'}, {u'time-updated': None,
u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/failures/base',
u'event-type': u'failures'}, {u'time-updated': 1567765553,
u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/throughput-subintervals/base',
u'event-type': u'throughput-subintervals'}, {u'time-updated':
1567765553, u'summaries': [], u'base-uri':
u'/esmond/perfsonar/archive/06e4abe9da4a445789aa7c51bfbec5d9/packet-retransmits-subintervals/base',
u'event-type': u'packet-retransmits-subintervals'}],
u'subject-type': u'point-to-point', u'source': u'A:A:A:A::45',
u'input-source': u'A:A:A:A::45',
u'pscheduler-reference-psconfig-created-by-uuid':
u'b042dea5-40fd-4c63-9b0b-1e0f63bf59a6',
u'pscheduler-reference-psconfig-created-by-user-agent':
u'psconfig-pscheduler-agent', u'ip-transport-protocol': u'tcp'}
This time, source and measurement-agent in the posted data are
not null, and the esmond response contains a whole bunch of other
URLs for related archives.
This suggests at least two problems:
1. esmond (in 4.2) accepts junk metadata, but happily gives a 201
response anyway as if everything was OK.
2. archiver (in 4.1.6 anyway) is generating junk metadata with
null source and measurement-agent.
I thought at first this could be related to the missing reverse
DNS for the local host; but actually in this case, the source and
measurement agent should be the remote host (ns1.BBBB.com, or B:B:B:B:feff:ff:fe00:f16)
- and that *does* have correct reverse DNS.
That leaves one thing: where are the null source and
measurement-agent values coming from? Back to source code. Note:
this has changed a bit in 4.2, I am looking at 4.1.6.
=> /usr/lib/pscheduler/classes/archiver/esmond/archive
#get explicit measurement-agent if se
measurement_agent = None
if "measurement-agent" in json['data']:
measurement_agent =
json['data']['measurement-agent']
=>
/usr/lib/pscheduler/classes/archiver/esmond/esmond_util.py
#determine source since its optional
input_source = lead_participant
if src_field and src_field in test_spec:
input_source = test_spec[src_field]
#determine if we are forcing an ip-version
ip_version = None
if ipv_field in test_spec:
ip_version = test_spec[ipv_field]
#get dest if this is point-to-point
src_ip = None
dest_ip = None
input_dest = None
if dst_field:
self.metadata['subject-type'] =
'point-to-point'
input_dest = test_spec[dst_field]
src_ip, dest_ip =
pscheduler.ip_normalize_version(input_source, input_dest,
ip_version=ip_version)
else:
self.metadata['subject-type'] =
'network-element'
src_ip, tmp_ip =
pscheduler.ip_normalize_version(input_source,
input_source, ip_version=ip_version)
#set fields
self.metadata['source'] = src_ip
if dest_ip:
self.metadata['destination'] = dest_ip
...
#Make measurement-agent the
created_by_address if we have it, otherwise the lead
participant, with same ip type as source
if measurement_agent:
src_ip,
self.metadata['measurement-agent'] =
pscheduler.ip_normalize_version(src_ip, measurement_agent)
else:
src_ip,
self.metadata['measurement-agent'] =
pscheduler.ip_normalize_version(src_ip, lead_participant)
And aha: ip_normalize_version is broken.
>>> import pscheduler
>>>
pscheduler.ip_normalize_version("ns1.BBBB.com",
"perf1.home.AAAA.net", ip_version=4)
('B.B.B.B', None)
>>>
pscheduler.ip_normalize_version("ns1.BBBB.com",
"perf1.home.AAAA.net", ip_version=6)
(None, 'A:A:A:A::45')
It works fine on the local node (4.2). This is where I
discovered that the problem was not the difference in code, but
the IPv4 entry in /etc/hosts preventing IPv6 lookup.