wg-multicast - Re: Notes from BOF

Subject: All things related to multicast

From: John Kristoff <>
To:
Subject: Re: Notes from BOF
Date: Tue, 7 Feb 2006 10:03:53 -0600

On Mon, Feb 06, 2006 at 04:29:17PM -0500, Alan Crosswell wrote:
> (John Kristoff will be sending his notes separately.)

As I mentioned to Alan a couple of days ago, I didn't get a chance
to present anything formally and ended up scribbling a few notes
at the start of the BoF. I'll include a bunch of details here and
a couple of others I didn't mention.

I referred to a paper and a tool. The paper is "Failure to thrive:
QoS and the culture of operational networking", which you can find
from the ACM RIPQoS workshop. I like referring to this paper
because the pain it describes in trying to run stable multicast
operations feels very familiar. I spoke with the author after the
BoF and he indicated that multicast in their environment is much
more stable now than it was described in that paper. However, for
me, the situation persists. This, we believe, is due in large
part to the frequency of code upgrades and the introduction of
"new" knobs that relate to multicast protocols. My most recent
environment was frequently going through code upgrades and
adopting new knobs, particularly the "hardening" knobs that help
mitigate unnecessary multicast state and flooding. As I
understand it, these types of changes at LBNL have been few and
far between in recent memory.

The tool I referred to is a very crude Perl script that tries to
summarize some rudimentary multicast state and counters on a
router. It spits out per-interface counts for IGMP joins, IGMP
leaves, and multicast octets in and out, as well as whether MSDP
is enabled and, if so, how many SA cache entries there are. The
idea was to be able to get a quick snapshot of some key numbers
to help spot obviously anomalous multicast load/state. mcastsum
can be found here:

<http://aharp.ittns.northwestern.edu/software/>
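
As a rough illustration of the kind of snapshot described above,
here is a minimal Python sketch. It is not the mcastsum code (which
is Perl) and does not talk to a router; the record layout, the field
names, and the functions format_summary and busiest are all made up
for the example, and the counters are assumed to have been collected
some other way.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class InterfaceCounters:
    """One row of per-interface multicast numbers (hypothetical layout)."""
    name: str
    igmp_joins: int
    igmp_leaves: int
    mcast_octets_in: int
    mcast_octets_out: int


def format_summary(rows: Iterable[InterfaceCounters],
                   msdp_enabled: bool,
                   sa_cache_entries: int = 0) -> str:
    """Render a plain-text snapshot: one line per interface, then MSDP status."""
    lines = [f"{'interface':<12}{'joins':>8}{'leaves':>8}"
             f"{'octets in':>14}{'octets out':>14}"]
    for r in rows:
        lines.append(f"{r.name:<12}{r.igmp_joins:>8}{r.igmp_leaves:>8}"
                     f"{r.mcast_octets_in:>14}{r.mcast_octets_out:>14}")
    # The SA cache count is only meaningful when MSDP is enabled.
    if msdp_enabled:
        lines.append(f"MSDP: enabled, {sa_cache_entries} SA cache entries")
    else:
        lines.append("MSDP: disabled")
    return "\n".join(lines)


def busiest(rows: Iterable[InterfaceCounters],
            top: int = 3) -> List[InterfaceCounters]:
    """Return the interfaces carrying the most multicast octets (in + out)."""
    return sorted(rows,
                  key=lambda r: r.mcast_octets_in + r.mcast_octets_out,
                  reverse=True)[:top]
```

Sorting by total octets is just one simple way to make an unusually
loaded interface stand out; a real tool might compare against a
previous snapshot instead.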

The following is a list of issues we've experienced over the past
year or so, with varying degrees of end-user pain. Generally, all
took up a non-trivial amount of support effort and time. And
except for cases involving our NUTV service, in my estimation, our
local multicast user population is in the single digits.

Note, let me be clear: this is not an attempt to pick on a vendor.
In all cases involving bugs, the support people I worked with were
very good. Bugs happen.

JUNOS bug
PIM logic bug causing sources not directly attached to flap.
It was unclear when this started happening, but it surfaced
about a month or two after the last JUNOS upgrade, and we believe
we hadn't had the problem for that whole period. We never
figured out why it started happening, and it took a while to
track down. It took troubleshooting from Juniper as well as
the router vendor where the source was attached (Cisco). Had to
bring up additional MSDP peers in front of the Juniper to work
around this problem.

JUNOS bug
'show multicast usage' crashed the router; run by JTAC while
troubleshooting the previous bug.

JUNOS bug
mtrace command crashed the router; run by JTAC while
troubleshooting the previous bug.

JUNOS bug
Source-specific SA limiter was rejecting SAs from sources not
actually exceeding the configured limit.

IOS bug
filter-sa-request doesn't work.

IOS bug
Not specifically multicast, but related. If you use certain
modules, in our case a wireless LAN module, multicast packets
to it get processed using the port mirroring feature. These
modules use SPAN sessions starting at #1 and counting up. We
had #1 configured, and when we removed the commands, the router
completely locked up, as did, oddly enough, some of its
neighbors.

IOS oddity and bug
Send a TCP ACK to a multicast address the router is listening
to and you'll get a RST back, with the source address filled
in with the group address you sent to.
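
Part of what makes that RST odd is that a multicast group address is
not supposed to show up as an IP source address at all (RFC 1122
says a host must never send with one). A minimal Python sketch of
the sanity check a receiver or filter could apply; the function name
source_is_multicast is made up for illustration and is not taken
from any router or tool mentioned here.

```python
import ipaddress


def source_is_multicast(src: str) -> bool:
    """True if the given source address is an IPv4 or IPv6 multicast group."""
    return ipaddress.ip_address(src).is_multicast
```

A packet for which this returns True could reasonably be dropped or
logged rather than answered.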

High (90+%) CPU on 6509s with Sup2s when a multicast app is
sending with TTL=1.
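
For reference, the TTL here is something the sending application
picks per socket, and the standard sockets API defaults multicast
sends to a TTL of 1 unless the app asks for more. A minimal Python
sketch of how a sender sets it; the function name make_mcast_sender
is an assumption for the example, and nothing here is taken from the
app that caused the load.

```python
import socket


def make_mcast_sender(ttl: int) -> socket.socket:
    """Create a UDP socket whose multicast sends go out with the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # IP_MULTICAST_TTL applies only to multicast destinations.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

With ttl=1 the packets are not meant to be forwarded past the local
subnet, so an app that expects to reach receivers elsewhere needs a
larger value.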

MREN not accepting routes from Abilene: typo in a route-map in
the BGP peering config.

I had some control plane configs wrong so that an RP and some
PIM interfaces were rejecting valid registers. While tracking
that down, found another multicast-related filter that was
incorrect.

ip sap listen on some interfaces and a totally borked control
plane policer config caused OSPF adjacencies to bounce, because
SA floods were starving OSPF traffic in the control plane policer.
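
The starvation effect can be shown with a toy model. The Python
sketch below is a deliberate simplification (a fixed packet budget
per interval rather than a real token bucket), and the class names,
limits, and the function police are invented for the example; it is
not the policer config we had. It only shows why a flood that shares
a bucket with OSPF can crowd it out, while a separate OSPF class
keeps the hellos flowing.

```python
from collections import Counter
from typing import Dict, Iterable


def police(packets: Iterable[str], limits: Dict[str, int]) -> Counter:
    """Count packets passed per class during one policing interval.

    packets: traffic class names in arrival order.
    limits:  max packets per interval for each policed class; classes
             not listed share the mandatory "default" entry.
    """
    used: Counter = Counter()
    passed: Counter = Counter()
    for cls in packets:
        bucket = cls if cls in limits else "default"
        if used[bucket] < limits[bucket]:
            used[bucket] += 1
            passed[cls] += 1
        # Anything over the bucket's budget is dropped.
    return passed
```

For example, with 1000 flood packets arriving ahead of 5 OSPF
packets, a single shared limit of 100 passes no OSPF at all, whereas
giving OSPF its own small limit lets all 5 through.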

A generic UDP multicast rate limit on ingress for subnets caused
some ghost-like file distribution apps to fail completely.

When there is a layer 2 topology change, our layer 2 devices flush
their group/port state cache and cause brief multicast outages and
flooding during these periods.

And finally, one last non-operational problem... Multicast Beacon
code upgrades released on a Friday that require us to upgrade by
Monday. :-)

John
