Excellent news. Thank you.
Doug
From: Andrew Lake <andy@es.net>
Date: Thursday, December 14, 2017 at 3:15 PM
To: Doug Wussler <doug.wussler@fsu.edu>, Shkelzen Rugovac <shkelzen.rugovac@gmail.com>
Cc: "perfsonar-user@internet2.edu" <perfsonar-user@internet2.edu>
Subject: Re: [perfsonar-user] Django error after a fresh install
An update: I think I have an idea on the Django database setup issues after looking around Szymon’s broken host today. It appears a number of the Django-related setup procedures fail if cassandra is not finished starting when they run. This is because of where the cassandra connection is created, and it causes issues even for stuff not touching cassandra :( It’s basically a race condition between when cassandra finishes its startup and when the scripts run, which is why it works just fine in some cases. It’s also why I could recreate it when my host only had an IPv6 address, because cassandra did not boot at all. It’s not strictly related to IPv6 though, it just has a similar effect.
Not sure why it’s rearing its head now, as that all has been pretty much the same for a few years, but it was very clearly the problem on Szymon’s host and I can recreate it locally.
It’s made worse by the fact that /usr/lib/perfsonar/scripts/system_environment/configure_esmond does a cassandra restart right before calling the scripts affected by this. This is what causes the "ProgrammingError: relation "ps_metadata" does not exist" and similar SQL errors, as well as the lack of /usr/share/pscheduler/psc-archiver-esmond.json.
I will work on fixing this and getting a new RPM and ISO. Likely I will have something you could test tomorrow (early next week at the latest) if you are willing. A short-term fix is to edit /usr/lib/perfsonar/scripts/system_environment/configure_esmond around line 21 by commenting out "/sbin/service cassandra restart", then run:
/usr/lib/perfsonar/scripts/system_environment/configure_esmond --force
That should set up your database and give you a /usr/share/pscheduler/psc-archiver-esmond.json. I'm not sure this is the only issue but it's certainly part of it.
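For anyone hitting the same race, one possible way around it is to wait until cassandra accepts connections before re-running the setup script. The following is only a sketch under assumptions, not the fix that will ship: wait_for_port is a hypothetical helper (it is not part of perfSONAR or esmond), it relies on bash's /dev/tcp redirection, and it assumes cassandra listens on localhost:9160, the address that appears in the ConnectionException reported in this thread.

```shell
#!/bin/bash
# Sketch: block until a TCP port accepts connections, or give up after a timeout.
# wait_for_port is a hypothetical helper name, not an existing perfSONAR tool.
wait_for_port() {
    local host=$1 port=$2 timeout=${3:-60}
    local deadline=$(( SECONDS + timeout ))
    while (( SECONDS < deadline )); do
        # bash opens a TCP connection for /dev/tcp/HOST/PORT redirections;
        # the subshell closes the descriptor again when it exits.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Example use (assumes cassandra on localhost:9160; run as root):
#   wait_for_port localhost 9160 120 && \
#       /usr/lib/perfsonar/scripts/system_environment/configure_esmond --force
```

The helper returns non-zero if the port never opens, so the setup script is skipped rather than run against a database connection that is not there yet.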
On December 14, 2017 at 9:56:20 AM, Doug Wussler (doug.wussler@fsu.edu) wrote:
Will follow your advice and wait to hear from you guys after you sort through it. Just FYI – I have performed the full install on two distinct systems and both of them exhibited these problems.
You may also be interested to know of two other problems I encountered that are system-dependent. If others see these issues, here is what I did:
- rngd service fails on a system without TPM. I was able to solve this on my system with this adjustment:
cp /usr/lib/systemd/system/rngd.service /etc/systemd/system
and then modify the ExecStart like this:
ExecStart=/sbin/rngd -f -r /dev/urandom -o /dev/random
Then “systemctl daemon-reload;systemctl restart rngd”
- /etc/init.d/perfsonar-configure_nic_parameters fails because my Broadcom Limited NetXtreme BCM5720 Gigabit PCIe card does not support “rx-usecs = 0”.
Here is the diff of my edits so the script would work with my card:
52c52,56
< IC_OFF $interface
---
> if echo $interface | grep -q '^em[12]$'; then
>     IC_OFF_em $interface
> else
>     IC_OFF $interface
> fi
101a106,124
>
> ###########################################################################
> # The Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe
> # card does not support a value of 0 for the rx-usecs parameter,
> # which says to use the value of rx-frames. So we set it to 1.
> IC_OFF_em() {
>     IC_RET_em=0
>     for i in "$1"; do
>         if ! /usr/sbin/ethtool -c $i | grep -q '^rx-frames: 1$'; then
>             /usr/sbin/ethtool -C $i rx-frames 1 || IC_RET_em=$?
>         fi
>
>         if ! /usr/sbin/ethtool -c $i | grep -q '^rx-usecs: 1$'; then
>             /usr/sbin/ethtool -C $i rx-usecs 1 || IC_RET_em=$?
>         fi
>     done
>     return $IC_RET_em
> }
> ###########################################################################
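The rngd workaround described above (copy the unit file, then change ExecStart) could be scripted roughly as follows. This is a sketch only: override_rngd_unit is a hypothetical helper name, it assumes GNU sed (for the -i option), and it assumes the unit file contains a single ExecStart line.

```shell
#!/bin/bash
# Sketch only: override_rngd_unit is a hypothetical helper, not part of any package.
# It copies a unit file and rewrites its ExecStart line to read entropy from
# /dev/urandom, matching the manual edit described above.
override_rngd_unit() {
    local src=$1 dst=$2
    cp "$src" "$dst" || return 1
    # Replace the whole ExecStart line; every other line is left untouched.
    sed -i 's|^ExecStart=.*|ExecStart=/sbin/rngd -f -r /dev/urandom -o /dev/random|' "$dst"
}

# Example use (as root), followed by the reload and restart from the text:
#   override_rngd_unit /usr/lib/systemd/system/rngd.service /etc/systemd/system/rngd.service
#   systemctl daemon-reload; systemctl restart rngd
```

Keeping the edited copy under /etc/systemd/system leaves the packaged unit file untouched, so the change should survive a package update of rngd.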
Doug Wussler
Florida State University
It’s not a forgotten dependency, it’s installed as part of a python virtualenv in the esmond RPM. The esmond python/django setup has always been a little weird to get around the fact that CentOS 6 does not have python 2.7, and some of that weirdness persists in CentOS 7 even though it does have python 2.7. You are correct that you and the others having this issue all have the same problem that django for whatever reason is not set up correctly. I’m slightly worried you may just inherit a new set of problems by installing the django RPM. I have done a bunch of FullInstalls in the last couple days trying to recreate and have not had that issue, so I suspect some type of race condition somewhere with package install order or similar. I am working on getting access to a broken node from Szymon (in another thread) to see if I can learn more about what went wrong. I do NOT advise anyone else run “yum install django” just yet if they hit a similar problem.
On December 13, 2017 at 6:18:53 PM, Shkelzen Rugovac (shkelzen.rugovac@gmail.com) wrote:
I went a bit further I think. Looking carefully at what /usr/lib/perfsonar/scripts/system_environment/configure_esmond is supposed to do, I found out that $KEY remains empty when running it (by adding a line "echo $KEY").
Running the command "python esmond/manage.py add_api_key_user perfsonar" separately, I got a complaint about a missing module django.core.management.
Then I installed 2 python-django rpms with "yum install django" and rebooted.
After the boot, "/usr/lib/perfsonar/scripts/system_environment/configure_esmond --force" worked and created /usr/share/pscheduler/psc-archiver-esmond.json.
Now, all looks fine. We can see non-empty lists in the archive page.
Is python-django a forgotten dependency of the perfsonar toolkit? Or is part of the django that is contained in esmond missing, e.g. django.core.management?
2017-12-13 20:19 GMT+01:00 Doug Wussler <doug.wussler@fsu.edu>:
Making progress…. I also rebooted. /var/log/messages looks clean now. But the “pscheduler task” command still returns this error:
Unable to read archive file: [Errno 2] No such file or directory: '/usr/share/pscheduler/psc-archiver-esmond.json'
And when I log in as root I see this (which was there earlier too, this is not new):
ABRT has detected 1 problem(s). For more info run: abrt-cli list --since 1513186024
id dbfb005bb360b44b7061efdc82ae2c09130ba258
reason: pidfile.py:48:__exit__:OSError: [Errno 13] Permission denied: '/var/run/pscheduler-scheduler.pid'
time: Wed 13 Dec 2017 12:18:57 PM EST
cmdline: /usr/bin/python /usr/libexec/pscheduler/daemons/scheduler --daemon --pid-file /var/run/pscheduler-scheduler.pid --dsn @/etc/pscheduler/database/database-dsn
package: pscheduler-server-1.0.2-1.el7.centos
uid: 1000 (pscheduler)
count: 6
Directory: /var/spool/abrt/Python-2017-12-13-12:18:57-1926
From: Andrew Lake <andy@es.net>
Date: Wednesday, December 13, 2017 at 1:38 PM
That file that’s missing is really just for convenience, but the fact that it is missing is probably an artifact of the failed setup. Try running "/usr/lib/perfsonar/scripts/system_environment/configure_esmond --force". It should create the file. Not sure how your setup got that broken, must have had some issue getting to postgres as well at one point.
You can ignore those powstream errors about the directory missing, those are harmless.
On December 13, 2017 at 1:06:22 PM, Doug Wussler (doug.wussler@fsu.edu) wrote:
Andy –
I executed the restarts but when I then executed the “pscheduler task” command I got this error:
Unable to read archive file: [Errno 2] No such file or directory: '/usr/share/pscheduler/psc-archiver-esmond.json'
Just to repeat, we are IPv4 only, we don’t use IPv6.
Don’t see any errors in the /var/log/esmond logs anymore but my /var/log/messages contains this:
journal: tool-powstream/run ERROR powstream exited with error, will attempt restart: Nothing on stderr, may have been killed by external command (e.g. something ran 'kill')
journal: tool-powstream/run ERROR powstream exited with error, will attempt restart: Nothing on stderr, may have been killed by external command (e.g. something ran 'kill')
journal: tool-powstream/run ERROR powstream failed to complete execution: <type 'exceptions.IOError'>
journal: tool-powstream/run ERROR powstream failed to complete execution: <type 'exceptions.IOError'>
journal: safe_run/runner ERROR Restarting: ['/usr/libexec/pscheduler/daemons/runner', '--daemon', '--pid-file', '/var/run/pscheduler-runner.pid', '--dsn', '@/etc/pscheduler/database/database-dsn']
journal: runner INFO Started
journal: safe_run/runner ERROR Program threw an exception after 0:00:00.521575
journal: safe_run/runner ERROR Exception: DatabaseError: server closed the connection unexpectedly#012#011This probably means the server terminated abnormally#012#011before or
while processing the request.#012#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/pscheduler/saferun.py", line 72, in safe_run#012 function()#012 File "/usr/libexec/pscheduler/daemons/runner", line 932, in <lambda>#012
pscheduler.safe_run(lambda: main_program())#012 File "/usr/libexec/pscheduler/daemons/runner", line 842, in main_program#012 """, [refresh]);#012DatabaseError: server closed the connection unexpectedly#012#011This probably means the server terminated
abnormally#012#011before or while processing the request.
Dec 13 12:53:46 psonar-ca journal: safe_run/runner ERROR Waiting 22.5 seconds before restarting
And the command “systemctl status -l pscheduler-runner” reports:
run[13589]: tool-powstream/run WARNING Unable to remove powstream temporary directory /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175432202983 due to error reported by OS: 2: No such file or directory (filename: /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175432202983)
python[1924]: detected unhandled Python exception in '/usr/libexec/pscheduler/daemons/runner'
systemd[1]: pscheduler-runner.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Stopped pScheduler server - runner.
systemd[1]: Unit pscheduler-runner.service entered failed state.
systemd[1]: pscheduler-runner.service failed.
run[13762]: tool-powstream/run WARNING Unable to remove powstream temporary directory /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175456196516 due to error reported by OS: 2: No such file or directory (filename: /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175456196516)
run[13901]: tool-powstream/run WARNING Unable to remove powstream temporary directory /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175520196554 due to error reported by OS: 2: No such file or directory (filename: /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175520196554)
run[13762]: tool-powstream/run WARNING Unable to remove powstream temporary directory /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175456196516 due to error reported by OS: 2: No such file or directory (filename: /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175456196516)
run[13901]: tool-powstream/run WARNING Unable to remove powstream temporary directory /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175520196554 due to error reported by OS: 2: No such file or directory (filename: /var/lib/pscheduler/tool/powstream/e664a834-4e5c-468b-901d-f8e552502dd9-2017Dec13T175520196554)
Ok, I’m actually able to recreate these exact exceptions if my host boots with only an IPv6 address. Even after I assign it an IPv4 address things don’t work until I restart some of the daemons. Do your hosts have just an IPv6 address? I see the hostname Shkelzen shared is dual-stack so in theory it should have both and be fine, but maybe there is some race condition with when addresses get assigned vs when the daemons come up? Try the following if your host has an IPv4 address as well, since the commands below seemed to fix things for me after assigning an IPv4 address:
systemctl restart cassandra
systemctl restart pscheduler-scheduler
systemctl restart pscheduler-archiver
systemctl restart pscheduler-ticker
systemctl restart pscheduler-runner
If you want a quick command to add something to your archive run:
pscheduler task --archive @/usr/share/pscheduler/psc-archiver-esmond.json idle --duration PT5S
The second command should return a non-empty list if things are back to working.
On December 12, 2017 at 3:36:50 AM, Shkelzen Rugovac (shkelzen.rugovac@gmail.com) wrote:
Hi all,
Indeed, after applying "yum reinstall esmond" the "internal Server Error" disappeared but not the error in the django.log.
Not knowing what possible service to restart, I just rebooted the whole server and now django logs the following error:
2017-12-12 09:16:43,508 [ERROR] /usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/exception.py: Internal Server Error: /esmond/perfsonar/archive
Traceback (most recent call last):
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/exception.py", line 42, in inner
response = get_response(request)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/base.py", line 244, in _legacy_get_response
response = middleware_method(request)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/middleware/common.py", line 62, in process_request
if self.should_redirect_with_slash(request):
File "/usr/lib/esmond/lib/python2.7/site-packages/django/middleware/common.py", line 80, in should_redirect_with_slash
not is_valid_path(request.path_info, urlconf) and
File "/usr/lib/esmond/lib/python2.7/site-packages/django/urls/base.py", line 157, in is_valid_path
File "/usr/lib/esmond/lib/python2.7/site-packages/django/urls/base.py", line 27, in resolve
return get_resolver(urlconf).resolve(path)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/urls/resolvers.py", line 270, in resolve
for pattern in self.url_patterns:
File "/usr/lib/esmond/lib/python2.7/site-packages/django/utils/functional.py", line 35, in __get__
res = instance.__dict__[self.name] = self.func(instance)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/urls/resolvers.py", line 313, in url_patterns
patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/utils/functional.py", line 35, in __get__
res = instance.__dict__[self.name] = self.func(instance)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/urls/resolvers.py", line 306, in urlconf_module
return import_module(self.urlconf_name)
File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
File "/usr/lib/esmond/esmond/urls.py", line 10, in <module>
from esmond.api.api_v2 import (
File "/usr/lib/esmond/esmond/api/api_v2.py", line 52, in <module>
raise ConnectionException(str(e))
ConnectionException: '"System Manager can\'t connect to Cassandra at localhost:9160 - Could not connect to localhost:9160"'
2017-12-11 17:45 GMT+01:00 Doug Wussler <doug.wussler@fsu.edu>:
Andrew –
That was me. The “yum reinstall esmond” cleared up the “Internal Server Error” I was getting in the “Test Results” section when I visited the toolkit page. But it did not clear up the issue Shkelzen is reporting; I have the same error even after reinstalling esmond and rebooting. We both did a full install of the 4.0.2 ISO and our /var/log/esmond/django.log is showing the same failure. The reinstall has not fixed this problem.
Doug Wussler
Florida State University
Someone else reported a similar issue last week and they were able to correct it by doing a “yum reinstall esmond”.
The dev team is investigating more closely as it doesn't appear to happen in our test builds, so there must be some condition or inconsistent package ordering that is triggering it.
On December 11, 2017 at 3:27:21 AM, Shkelzen Rugovac (shkelzen.rugovac@gmail.com) wrote:
Hi,
After a fresh installation via the FULL 4.0.2 ISO + yum update [1],[2] and [3], I have the following error in /var/log/esmond/django.log :
2017-12-11 09:18:31,761 [ERROR] /usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/exception.py: Internal Server Error: /esmond/perfsonar/archive/
Traceback (most recent call last):
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/exception.py", line 42, in inner
response = get_response(request)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/base.py", line 249, in _legacy_get_response
response = self._get_response(request)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/viewsets.py", line 90, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/views.py", line 489, in dispatch
response = self.handle_exception(exc)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/views.py", line 449, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/views.py", line 486, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/lib/esmond/esmond/api/perfsonar/api_v2.py", line 838, in list
return super(ArchiveViewset, self).list(request)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/mixins.py", line 42, in list
page = self.paginate_queryset(queryset)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/generics.py", line 173, in paginate_queryset
return self.paginator.paginate_queryset(queryset, self.request, view=self)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/pagination.py", line 335, in paginate_queryset
self.count = _get_count(queryset)
File "/usr/lib/esmond/lib/python2.7/site-packages/rest_framework/pagination.py", line 53, in _get_count
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/models/query.py", line 369, in count
return self.query.get_count(using=self.db)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/models/sql/query.py", line 476, in get_count
number = obj.get_aggregation(using, ['__count'])['__count']
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/models/sql/query.py", line 457, in get_aggregation
result = compiler.execute_sql(SINGLE)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 835, in execute_sql
cursor.execute(sql, params)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
return self.cursor.execute(sql, params)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
six.reraise(dj_exc_type, dj_exc_value, traceback)
File "/usr/lib/esmond/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
return self.cursor.execute(sql, params)
ProgrammingError: relation "ps_metadata" does not exist
LINE 1: ...e" AS Col3, "ps_metadata"."checksum" AS Col4 FROM "ps_metada...
Why is ps_metadata not (yet) present?
--
Shkelzen RUGOVAC
srugovac@ulb.ac.be
Université Libre de Bruxelles
Bd du Triomphe, CP230
1050 Bruxelles - Belgium
Office at VUB:
Building G - Level 0 - Room 147
Tel: (+32) 2 629 33 26