The netlink route message encode function accessed and used
zebra vrfs to produce debug output. That's not thread-safe, and
that encode code needs to run in (at least) the dplane pthread
and the FPM plugin pthread.
Signed-off-by: Mark Stapp <mjs@cisco.com>
Separate core netlink route message parsing into a new api that
uses a dplane ctx to hold the parsed attribute data. Use the new
api in two paths: the normal netlink update message parsing path,
and in the FPM plugin, which also uses netlink encoding. The FPM
route-notificatin code runs in its own pthread, and only needs
a subset of the route info that zebra ordinarily develops. This
change stops that pthread from accessing zebra's internal
data, such as vrfs and ifps, that are not thread-safe.
Signed-off-by: Mark Stapp <mjs@cisco.com>
Separate zebra's ZAPI server socket handling into two phases:
an early phase that opens the socket, and a later phase that
starts listening for client connections.
Signed-off-by: Mark Stapp <mjs@cisco.com>
During zebra shutdown, the main pthread and the FPM pthread can
deadlock if the FPM pthread is in fpm_reconnect(). Each pthread
tries to use event_cancel_async() to cancel tasks that may be
scheduled for the other pthread - this leads to a deadlock as
neither thread can progress.
This adds an atomic boolean that's managed as each pthread
enters and leaves the cleanup code in question, preventing the
two threads from running into the deadlock.
Signed-off-by: Mark Stapp <mjs@cisco.com>
The functions:
if_get_flags
if_flags_update
if_flags_mangle
are never invoked from a linux netlink build. Put a #ifdef
around those functions so that they are not included on the
linux build as that they are not needed there.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
For interface config:
shutdown
mpls
multicast
These states were never being shown in output, let's show it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When VLAN-VNI mapping is updated, do not set the L2VNI up event
if the associated VXLAN device is not up.
This may result in bgp synced remote routes to skip installing
in Zebra and onwards (Kernel).
Ticket: #4139506
Signed-off-by: Chirag Shah <chirag@nvidia.com>
In zebra_mpls.c it has a usage of MTYPE_NH_LABEL which is
defined in both lib/nexthop.c and zebra/zebra_mpls.c. The
usage in zebra_mpls.c is a realloc. This leads to a crash:
(gdb) bt
0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=126487246404032) at ./nptl/pthread_kill.c:44
1 __pthread_kill_internal (signo=6, threadid=126487246404032) at ./nptl/pthread_kill.c:78
2 __GI___pthread_kill (threadid=126487246404032, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
3 0x0000730a1b442476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
4 0x0000730a1b94fb18 in core_handler (signo=6, siginfo=0x7ffeed1e07b0, context=0x7ffeed1e0680) at lib/sigevent.c:268
5 <signal handler called>
6 __pthread_kill_implementation (no_tid=0, signo=6, threadid=126487246404032) at ./nptl/pthread_kill.c:44
7 __pthread_kill_internal (signo=6, threadid=126487246404032) at ./nptl/pthread_kill.c:78
8 __GI___pthread_kill (threadid=126487246404032, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
9 0x0000730a1b442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
10 0x0000730a1b4287f3 in __GI_abort () at ./stdlib/abort.c:79
11 0x0000730a1b9984f5 in _zlog_assert_failed (xref=0x730a1ba59480 <_xref.16>, extra=0x0) at lib/zlog.c:789
12 0x0000730a1b8f8908 in mt_count_free (mt=0x576e0edda520 <MTYPE_NH_LABEL>, ptr=0x576e36617b80) at lib/memory.c:74
13 0x0000730a1b8f8a59 in qrealloc (mt=0x576e0edda520 <MTYPE_NH_LABEL>, ptr=0x576e36617b80, size=16) at lib/memory.c:112
14 0x0000576e0ec85e2e in nhlfe_out_label_update (nhlfe=0x576e368895f0, nh_label=0x576e3660e9b0) at zebra/zebra_mpls.c:1462
15 0x0000576e0ec833ff in lsp_install (zvrf=0x576e3655fb50, label=17, rn=0x576e366197c0, re=0x576e3660a590) at zebra/zebra_mpls.c:224
16 0x0000576e0ec87c34 in zebra_mpls_lsp_install (zvrf=0x576e3655fb50, rn=0x576e366197c0, re=0x576e3660a590) at zebra/zebra_mpls.c:2215
17 0x0000576e0ecbb427 in rib_process_update_fib (zvrf=0x576e3655fb50, rn=0x576e366197c0, old=0x576e36619660, new=0x576e3660a590) at zebra/zebra_rib.c:1084
18 0x0000576e0ecbc230 in rib_process (rn=0x576e366197c0) at zebra/zebra_rib.c:1480
19 0x0000576e0ecbee04 in process_subq_route (lnode=0x576e368e0270, qindex=8 '\b') at zebra/zebra_rib.c:2661
20 0x0000576e0ecc0711 in process_subq (subq=0x576e3653fc80, qindex=META_QUEUE_BGP) at zebra/zebra_rib.c:3226
21 0x0000576e0ecc07f9 in meta_queue_process (dummy=0x576e3653fae0, data=0x576e3653fb80) at zebra/zebra_rib.c:3265
22 0x0000730a1b97d2a9 in work_queue_run (thread=0x7ffeed1e3f30) at lib/workqueue.c:282
23 0x0000730a1b96b039 in event_call (thread=0x7ffeed1e3f30) at lib/event.c:1996
24 0x0000730a1b8e4d2d in frr_run (master=0x576e36277e10) at lib/libfrr.c:1232
25 0x0000576e0ec35ca9 in main (argc=7, argv=0x7ffeed1e4208) at zebra/main.c:536
Clearly replacing a label stack is an operation that should be owned by
lib/nexthop.c. So lets move this function into there and have
zebra_mpls.c just call the function to replace the label stack.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently when doing a `show ip route table XXXX`, zebra is displaying
the current default vrf as the vrf we are in. We are displaying a
table not a vrf. This is only true if you are not using namespace
based vrf's, so modify the output to display accordingly.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
`ZEBRA_DEBUG_SRV6` is not the correct macro to evaluate if SRv6 debug is enabled or not.
The correct macro is `IS_ZEBRA_DEBUG_SRV6`.
Fix this by replacing `ZEBRA_DEBUG_SRV6` with `IS_ZEBRA_DEBUG_SRV6`.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Currently zebra starts the graceful restart timer as well as
allows connections from clients before all data is read in
from the kernel as well as the possiblity of allowing client
connections before this happens as well.
Let's move the graceful restart timer start till after this is
done as well as not allowing client connections till then as well.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The `rib_update_handle_kernel_route_down_possibility()` didn't consider
the kernel routes ( blackhole ) without interface. When some other
interfaces are down, these kernel routes will be wrongly removed.
Signed-off-by: anlan_cs <anlan_cs@126.com>
Expanded the cli command to include an mrib flag for importing to
the main table MRIB instead of the main table URIB.
Piped through specifying the safi through the import table functions
rather than hardcoding to SAFI_UNICAST.
Import still only import routes from the URIB subtable, only added the
ability to import into the main table MRIB.
Signed-off-by: Nathan Bahr <nbahr@atcorp.com>
Currently the mroute code was not allowing the mroute
to be sent to the dataplane. This leaves us with a
situation where the routes being installed where never
being set as installed and additionally nht against
the mrib would not work if the route came into existence
after the nexthop tracking was asked for.
Turns out all the pieces where there to let this work.
Modify the code to pass it to the dplane and to send
it back up as having worked.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add new per-NS interface typerb tree, using external linkage, to
replace the use of the if_table table.
Add apis to iterate the per-NS collection - which is not public.
Signed-off-by: Mark Stapp <mjs@cisco.com>
Add a specific debug command to handle srv6 troubleshooting.
Move the srv6 traces that initially were under 'debug zebra packet'
debug.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:44
1 __pthread_kill_internal (signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:78
2 __GI___pthread_kill (threadid=130719886083648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
3 0x000076e399e42476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
4 0x000076e39a34f950 in core_handler (signo=6, siginfo=0x76e3985fca30, context=0x76e3985fc900) at lib/sigevent.c:258
5 <signal handler called>
6 __pthread_kill_implementation (no_tid=0, signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:44
7 __pthread_kill_internal (signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:78
8 __GI___pthread_kill (threadid=130719886083648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
9 0x000076e399e42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
10 0x000076e399e287f3 in __GI_abort () at ./stdlib/abort.c:79
11 0x000076e39a39874b in _zlog_assert_failed (xref=0x76e39a46cca0 <_xref.27>, extra=0x0) at lib/zlog.c:789
12 0x000076e39a369dde in cancel_event_helper (m=0x5eda32df5e40, arg=0x5eda33afeed0, flags=1) at lib/event.c:1428
13 0x000076e39a369ef6 in event_cancel_event_ready (m=0x5eda32df5e40, arg=0x5eda33afeed0) at lib/event.c:1470
14 0x00005eda0a94a5b3 in bgp_stop (connection=0x5eda33afeed0) at bgpd/bgp_fsm.c:1355
15 0x00005eda0a94b4ae in bgp_stop_with_notify (connection=0x5eda33afeed0, code=8 '\b', sub_code=0 '\000') at bgpd/bgp_fsm.c:1610
16 0x00005eda0a979498 in bgp_packet_add (connection=0x5eda33afeed0, peer=0x5eda33b11800, s=0x76e3880daf90) at bgpd/bgp_packet.c:152
17 0x00005eda0a97a80f in bgp_keepalive_send (peer=0x5eda33b11800) at bgpd/bgp_packet.c:639
18 0x00005eda0a9511fd in peer_process (hb=0x5eda33c9ab80, arg=0x76e3985ffaf0) at bgpd/bgp_keepalives.c:111
19 0x000076e39a2cd8e6 in hash_iterate (hash=0x76e388000be0, func=0x5eda0a95105e <peer_process>, arg=0x76e3985ffaf0) at lib/hash.c:252
20 0x00005eda0a951679 in bgp_keepalives_start (arg=0x5eda3306af80) at bgpd/bgp_keepalives.c:214
21 0x000076e39a2c9932 in frr_pthread_inner (arg=0x5eda3306af80) at lib/frr_pthread.c:180
22 0x000076e399e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
23 0x000076e399f26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 12
12 0x000076e39a369dde in cancel_event_helper (m=0x5eda32df5e40, arg=0x5eda33afeed0, flags=1) at lib/event.c:1428
1428 assert(m->owner == pthread_self());
In this decode the attempt to cancel the connection's events from
the wrong thread is causing the crash. Modify the code to create an
event on the bm->master to cancel the events for the connection.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
NEWLINK is only registered by the dplane thread, the main thread
doesn't care about it. So remove the real process of `netlink_link_change()`
for NEWLINK event in main thread.
And move NEWLINK/DELLINK event to the block where the dplane messages
are kept together.
Signed-off-by: anlan_cs <anlan_cs@126.com>
For some reasons the Linux kernel associates the ipv6 blackhole of non
default table the lo interface.
> root@r1# ip -6 route show table 100
> root@r1# ip -6 route add unreachable default metric 4278198272 table 100
> root@r1# ip -6 route show table 100
> unreachable default dev lo metric 4278198272 pref medium
As a consequence, the VRF default that owns the lo interface is shown as
the nexthop VRF:
> r1# show ipv6 route table 20
> Table 20:
> K>* ::/0 [255/8192] unreachable (ICMP unreachable) (vrf default), 00:18:12
Do not display the nexthop VRF of a blackhole. It does not make sense
for a blackhole and it was not displayed in the past.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
`dirfd()` can theoretically return an error. Call it once and check the
result.
clang-SA: technically correct™. Ain't that the best kind of correct?
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
`errno` has its own semantics. Sometimes it is correct to write to it.
This is not one of those cases - just use a separate `nl_errno`.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
`FILE *` objects are theoretically in an invalid state if you try to use
them past their reporting EOF. Adjust the code to make it correct.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
In these cases the value assigned by the switch block is used directly
rather than returned. Mark the initial/default value as used so
clang-SA doesn't complain about it.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
clang-19's SA complains about unused initializers for this kind of
"switch (enum) { return string }" kind of code. Use direct string
return values to avoid the issue.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
I got asked today what was going on in the rnh code. I
had to take time off of what I was doing and rewrap my
head around this code, since it's been a long time.
As that this question may come up again in the future
I am trying to document this better so that someone
coming behind us will be able to read this and get
a better idea of what the algorithm is attempting
to do.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently FRR needs to send a uint16_t value for the number
of nexthops as well it needs the ability to properly decode
all of this. Find and handle all the places that this happens.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There exists a series of events where a kernel route is learned
first( that happens to be exactly what a connected route should be )
and FRR ends up with both a kernel route and a connected route,
leaving us in a very strange spot. This code change just mirrors
the existing code of if there is a connected route drop the kernel
route. Here we just do the reverse, if we have a kernel route
already and a connected should be created, remove the kernel and
keep the connected.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When the fpm_process_queue has run out of space
but has written to the fpm output buffer, schedule
it to wake up immediately, as that the write will go out
pretty much immediately, since it was scheduled first.
If the fpm_process_queue has not written to the output
buffer then delay the processing by 10 milliseconds to
allow a possibly backed up write processing to have a
chance to complete it's work.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The fpm_nl_process function was getting the count
of the total number of ctx's processed. This leads
to after having processed 1 context to always signal
the dataplane that there is work to do. Change the
code to only notify the dplane worker when a context
was actually added to the outgoing context queue.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The following ASAN issue has been observed:
> ERROR: AddressSanitizer: heap-use-after-free on address 0x6160000acba4 at pc 0x55910c5694d0 bp 0x7ffe3a8ac850 sp 0x7ffe3a8ac840
> READ of size 4 at 0x6160000acba4 thread T0
> #0 0x55910c5694cf in ctx_info_from_zns zebra/zebra_dplane.c:3315
> #1 0x55910c569696 in dplane_ctx_ns_init zebra/zebra_dplane.c:3331
> #2 0x55910c56bf61 in dplane_ctx_nexthop_init zebra/zebra_dplane.c:3680
> #3 0x55910c5711ca in dplane_nexthop_update_internal zebra/zebra_dplane.c:4490
> #4 0x55910c571c5c in dplane_nexthop_delete zebra/zebra_dplane.c:4717
> #5 0x55910c61e90e in zebra_nhg_uninstall_kernel zebra/zebra_nhg.c:3413
> #6 0x55910c615d8a in zebra_nhg_decrement_ref zebra/zebra_nhg.c:1919
> #7 0x55910c6404db in route_entry_update_nhe zebra/zebra_rib.c:454
> #8 0x55910c64c904 in rib_re_nhg_free zebra/zebra_rib.c:2822
> #9 0x55910c655be2 in rib_unlink zebra/zebra_rib.c:4212
> #10 0x55910c6430f9 in zebra_rtable_node_cleanup zebra/zebra_rib.c:968
> #11 0x7f26f275b8a9 in route_node_free lib/table.c:75
> #12 0x7f26f275bae4 in route_table_free lib/table.c:111
> #13 0x7f26f275b749 in route_table_finish lib/table.c:46
> #14 0x55910c65db17 in zebra_router_free_table zebra/zebra_router.c:191
> #15 0x55910c65dfb5 in zebra_router_terminate zebra/zebra_router.c:244
> #16 0x55910c4f40db in zebra_finalize zebra/main.c:249
> #17 0x7f26f2777108 in event_call lib/event.c:2011
> #18 0x7f26f264180e in frr_run lib/libfrr.c:1212
> #19 0x55910c4f49cb in main zebra/main.c:531
> #20 0x7f26f2029d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
> #21 0x7f26f2029e3f in __libc_start_main_impl ../csu/libc-start.c:392
> #22 0x55910c4b0114 in _start (/usr/lib/frr/zebra+0x1ae114)
It happens with FRR using the kernel. During shutdown, the
namespace identifier is attempted to be obtained by zebra, in an
attempt to prepare zebra dataplane nexthop messages.
Fix this by accessing the ns structure.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Currently FRR is limiting the nexthop count to a uint8_t not a
uint16_t. This leads to issues when the nexthop count is 256
which results in the count to overflow to 0 causing problems
in the code.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the zebra pw code has setup a retry to install the
pw after 30 seconds when it is decided that reachability to
the pw is gone. This causes a failure mode where the
pw code just goes and re-installs the pw after 30 seconds
in the non-reachability case. Instead it should just be
reinstalling after reachability is restored.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the pw code sets the status of the pw for install
and uninstall immediately when notifying the dplane. This
is incorrect in that we do not actually know the status at
this point in time. When we get the result is when to set
the status.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
fpm_enqueue_rmac_table expects an fpm_rmac_arg* as its argument.
The issue can be reproduced by dropping the TCP session using:
ss -K dst 127.0.0.1 dport = 2620
I used Fedora 40 and frr 9.1.2 and I got the gdb backtrace:
(gdb) bt
0 0x00007fdd7d6997ea in fpm_enqueue_rmac_table (bucket=0x2134dd0, arg=0x2132b60) at zebra/dplane_fpm_nl.c:1217
1 0x00007fdd7dd1560d in hash_iterate (hash=0x21335f0, func=0x7fdd7d6997a0 <fpm_enqueue_rmac_table>, arg=0x2132b60) at lib/hash.c:252
2 0x00007fdd7dd1560d in hash_iterate (hash=0x1e5bf10, func=func@entry=0x7fdd7d698900 <fpm_enqueue_l3vni_table>,
arg=arg@entry=0x7ffed983bef0) at lib/hash.c:252
3 0x00007fdd7d698b5c in fpm_rmac_send (t=<optimized out>) at zebra/dplane_fpm_nl.c:1262
4 0x00007fdd7dd6ce22 in event_call (thread=thread@entry=0x7ffed983c010) at lib/event.c:1970
5 0x00007fdd7dd20758 in frr_run (master=0x1d27f10) at lib/libfrr.c:1213
6 0x0000000000425588 in main (argc=10, argv=0x7ffed983c2e8) at zebra/main.c:492
Signed-off-by: Igor Zhukov <fsb4000@yandex.ru>
Trigger: Zebra core seen when we convert l2vni to l3vni and back
BackTrace:
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_zlog_assert_failed+0xe9) [0x7f4af96989d9]
/usr/lib/frr/zebra(zebra_vxlan_if_vni_up+0x250) [0x5561022ae030]
/usr/lib/frr/zebra(netlink_vlan_change+0x2f4) [0x5561021fd354]
/usr/lib/frr/zebra(netlink_parse_info+0xff) [0x55610220d37f]
/usr/lib/frr/zebra(+0xc264a) [0x55610220d64a]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x7d) [0x7f4af967e96d]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xe8) [0x7f4af9637588]
/usr/lib/frr/zebra(main+0x402) [0x5561021f4d32]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f4af932624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f4af9326305]
/usr/lib/frr/zebra(_start+0x21) [0x5561021f72f1]
Root Cause:
In working case,
- We get a RTM_NEWLINK whose ctx is enqueued by zebra dplane and
dequeued by zebra main and processed i.e.
(102000 is deleted from vxlan99) before we handle RTM_NEWVLAN.
- So in handling of NEWVLAN (vxlan99) we bail out since find with
vlan id 703 does not exist.
root@leaf2:mgmt:/var/log/frr# cat ~/raja_logs/working/nocras.log | grep "RTM_NEWLINK\|QUEUED\|vxlan99\|in thread"
2024/07/18 23:09:33.741105 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=616, seq=0, pid=0
2024/07/18 23:09:33.744061 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7f2244000cf0, op INTF_INSTALL, ifindex (65), result QUEUED
2024/07/18 23:09:33.767240 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=508, seq=0, pid=0
2024/07/18 23:09:33.767380 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7f2244000cf0, op INTF_INSTALL, ifindex (73), result QUEUED
2024/07/18 23:09:33.767389 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(73)
2024/07/18 23:09:33.767404 ZEBRA: [TQR2A-H2RFY] Vlan-Vni(1186:1186-6000002:6000002) update for VxLAN IF vxlan99(73)
2024/07/18 23:09:33.767422 ZEBRA: [TP4VP-XZ627] Del L2-VNI 102000 intf vxlan99(73)
2024/07/18 23:09:33.767858 ZEBRA: [QYXB9-6RNNK] RTM_NEWVLAN bridge IF vxlan99 NS 0
2024/07/18 23:09:33.767866 ZEBRA: [KKZGZ-8PCDW] Cannot find VNI for VID (703) IF vxlan99 for vlan state update >>>>BAIL OUT
In failure case,
- The NEWVLAN is received first even before processing RTM_NEWLINK.
- Since the vxlan id 102000 is not removed from the vxlan99,
the find with vlan id 703 returns the 102000 one and we invoke
zebra_vxlan_if_vni_up where the interfaces don't match and assert.
root@leaf2:mgmt:/var/log/frr# cat ~/raja_logs/noworking/crash.log | grep "RTM_NEWLINK\|QUEUED\|vxlan99\|in thread"
2024/07/18 22:26:43.829370 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=616, seq=0, pid=0
2024/07/18 22:26:43.829646 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7fe07c026d30, op INTF_INSTALL, ifindex (65), result QUEUED
2024/07/18 22:26:43.853930 ZEBRA: [QYXB9-6RNNK] RTM_NEWVLAN bridge IF vxlan99 NS 0
2024/07/18 22:26:43.853949 ZEBRA: [K61WJ-XQQ3X] Intf vxlan99(73) L2-VNI 102000 is UP >>> VLAN PROCESSED BEFORE INTF EVENT
2024/07/18 22:26:43.853951 ZEBRA: [SPV50-BX2RP] RAJA zevpn_vxlanif vxlan48 and ifp vxlan99
2024/07/18 22:26:43.854005 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=508, seq=0, pid=0
2024/07/18 22:26:43.854241 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=516, seq=0, pid=0
2024/07/18 22:26:43.854251 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-dp-in (NS 0) type RTM_NEWLINK(16), len=544, seq=0, pid=0
ZEBRA: in thread kernel_read scheduled from zebra/kernel_netlink.c:505 kernel_read()
Fix:
Similar to #13396, where link change
handling was offloaded to dplane, same is being done for vlan events.
Note: Prior to this change, zebra main thread was interested in the
RTNLGRP_BRVLAN. So all the kernel events pertaining to vlan was
handled by zebra main.
With this change change as well the handling of vlan events is still
with Zebra main. However we offload it via Dplane thread.
Ticket :#3878175
Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Relocating functions used by vlan in if_netlink into zebra vxlan
Note: Static variable to the functions will be added back in the next
commit.
Ticket :#3878175
Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Report the routes metric in IPFORWARDMETRIC1 and return
-1 for the other metrics as required by the IP-FORWARD-MIB.
inetCidrRouteMetric2 OBJECT-TYPE
SYNTAX Integer32
MAX-ACCESS read-create
STATUS current
DESCRIPTION
"An alternate routing metric for this route. The
semantics of this metric are determined by the routing-
protocol specified in the route's inetCidrRouteProto
value. If this metric is not used, its value should be
set to -1."
DEFVAL { -1 }
::= { inetCidrRouteEntry 13 }
I've included metric2 but it's the same for all of them.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The snmp walk of the zebra rib was skipping entries
because in_addr_cmp was replaced with a prefix_cmp
which worked slightly differently causing parts
of the zebra rib tree to be skipped.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
zebra_nhg_install_kernel takes a route type. We don't
know it at that particular spot but we should not be passing
in `true`. Let's use ZEBRA_ROUTE_MAX to indicate we do not
know, so that the correct thing is done.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Adding comments that tell what a variable is doing in
the middle of a function call makes it extremely hard
to read the formatting. Remove.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The previous commit modified zebra to reinstall the singleton
nexthops for a nexthop group when a interface event comes up.
Now let's modify zebra to attempt to reuse the nexthop group
when this happens and the upper level protocol resends the
route down with that. Only match if the protocol is the same
as well as the instance and the nexthop groups would match.
Here is the new behavior:
eva(config)# do show ip route 9.9.9.9/32
Routing entry for 9.9.9.9/32
Known via "static", distance 1, metric 0, best
Last update 00:00:08 ago
* 192.168.99.33, via dummy1, weight 1
* 192.168.100.33, via dummy2, weight 1
* 192.168.101.33, via dummy3, weight 1
* 192.168.102.33, via dummy4, weight 1
eva(config)# do show ip route nexthop-group 9.9.9.9/32
% Unknown command: do show ip route nexthop-group 9.9.9.9/32
eva(config)# do show ip route 9.9.9.9/32 nexthop-group
Routing entry for 9.9.9.9/32
Known via "static", distance 1, metric 0, best
Last update 00:00:54 ago
Nexthop Group ID: 57
* 192.168.99.33, via dummy1, weight 1
* 192.168.100.33, via dummy2, weight 1
* 192.168.101.33, via dummy3, weight 1
* 192.168.102.33, via dummy4, weight 1
eva(config)# exit
eva# conf
eva(config)# int dummy3
eva(config-if)# shut
eva(config-if)# no shut
eva(config-if)# do show ip route 9.9.9.9/32 nexthop-group
Routing entry for 9.9.9.9/32
Known via "static", distance 1, metric 0, best
Last update 00:00:08 ago
Nexthop Group ID: 57
* 192.168.99.33, via dummy1, weight 1
* 192.168.100.33, via dummy2, weight 1
* 192.168.101.33, via dummy3, weight 1
* 192.168.102.33, via dummy4, weight 1
eva(config-if)# exit
eva(config)# exit
eva# exit
sharpd@eva ~/frr1 (master) [255]> ip nexthop show id 57
id 57 group 37/43/50/58 proto zebra
sharpd@eva ~/frr1 (master)> ip route show 9.9.9.9/32
9.9.9.9 nhid 57 proto 196 metric 20
nexthop via 192.168.99.33 dev dummy1 weight 1
nexthop via 192.168.100.33 dev dummy2 weight 1
nexthop via 192.168.101.33 dev dummy3 weight 1
nexthop via 192.168.102.33 dev dummy4 weight 1
sharpd@eva ~/frr1 (master)>
Notice that we now no longer are creating a bunch of new
nexthop groups.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If a interface down event caused a nexthop group to remove
one of the entries in the kernel, have it be reinstalled
when the interface comes back up. Mark the nexthop as
usable.
new behavior:
eva# show nexthop-group rib 181818168
ID: 181818168 (sharp)
RefCnt: 1
Uptime: 00:00:23
VRF: default(bad-value)
Valid, Installed
Depends: (35) (38) (44) (51)
via 192.168.99.33, dummy1 (vrf default), weight 1
via 192.168.100.33, dummy2 (vrf default), weight 1
via 192.168.101.33, dummy3 (vrf default), weight 1
via 192.168.102.33, dummy4 (vrf default), weight 1
eva# conf
eva(config)# int dummy3
eva(config-if)# shut
eva(config-if)# do show nexthop-group rib 181818168
ID: 181818168 (sharp)
RefCnt: 1
Uptime: 00:00:44
VRF: default(bad-value)
Depends: (35) (38) (44) (51)
via 192.168.99.33, dummy1 (vrf default), weight 1
via 192.168.100.33, dummy2 (vrf default), weight 1
via 192.168.101.33, dummy3 (vrf default) inactive, weight 1
via 192.168.102.33, dummy4 (vrf default), weight 1
eva(config-if)# no shut
eva(config-if)# do show nexthop-group rib 181818168
ID: 181818168 (sharp)
RefCnt: 1
Uptime: 00:00:53
VRF: default(bad-value)
Valid, Installed
Depends: (35) (38) (44) (51)
via 192.168.99.33, dummy1 (vrf default), weight 1
via 192.168.100.33, dummy2 (vrf default), weight 1
via 192.168.101.33, dummy3 (vrf default), weight 1
via 192.168.102.33, dummy4 (vrf default), weight 1
eva(config-if)# exit
eva(config)# exit
eva# exit
sharpd@eva ~/frr1 (master) [255]> ip nexthop show id 181818168
id 181818168 group 35/38/44/51 proto 194
sharpd@eva ~/frr1 (master)>
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Current code when a link is set down is to just mark the
nexthop group as not properly setup. Leaving situations
where when an interface goes down and show output is
entered we see incorrect state. This is true for anything
that would be checking those flags at that point in time.
Modify the interface down nexthop group code to notice the
nexthops appropriately ( and I mean set the appropriate flags )
and to allow a `show ip route` command to actually display
what is going on with the nexthops.
eva# show ip route 1.0.0.0
Routing entry for 1.0.0.0/32
Known via "sharp", distance 150, metric 0, best
Last update 00:00:06 ago
* 192.168.44.33, via dummy1, weight 1
* 192.168.45.33, via dummy2, weight 1
sharpd@eva:~/frr1$ sudo ip link set dummy2 down
eva# show ip route 1.0.0.0
Routing entry for 1.0.0.0/32
Known via "sharp", distance 150, metric 0, best
Last update 00:00:12 ago
* 192.168.44.33, via dummy1, weight 1
192.168.45.33, via dummy2 inactive, weight 1
Notice now that the 1.0.0.0/32 route now correctly
displays the route for the nexthop group entry.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Trying to debug some cross vrf stuff in zebra and frankly
it's hard to grep the file for the routes you are interested
in. Let's clean this up some and get a bit better
information for us developers
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The show zebra dplane provider command was ommitting
the input and output queues to the dplane itself.
It would be nice to have this insight as well.
New output:
r1# show zebra dplane providers
dataplane Incoming Queue from Zebra: 100
Zebra dataplane providers:
Kernel (1): in: 6, q: 0, q_max: 3, out: 6, q: 14, q_max: 3
dplane_fpm_nl (2): in: 6, q: 10, q_max: 3, out: 6, q: 0, q_max: 3
dataplane Outgoing Queue to Zebra: 43
r1#
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane providers have a concept of input queues
and output queues. These queues are chained together
during normal operation. The code in zebra also has
a feedback mechanism where the MetaQ will not run when
the first input queue is backed up. Having the dplane_fpm_nl
code grab all contexts when it is backed up prevents
this system from behaving appropriately.
Modify the code to not add to the dplane_fpm_nl's internal
queue when it is already full. This will allow the backpressure
to work appropriately in zebra proper.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently when the dplane_thread_loop is run, it moves contexts
from the dg_update_list and puts the contexts on the input queue
of the first provider. This provider is given a chance to run
and then the items on the output queue are pulled off and placed
on the input queue of the next provider. Rinse/Repeat down through
the entire list of providers. Now imagine that we have a list
of multiple providers and the last provider is getting backed up.
Contexts will end up sticking in the input Queue of the `slow`
provider. This can grow without bounds. This is a real problem
when you have a situation where an interface is flapping and an
upper level protocol is sending a continous stream of route
updates to reflect the change in ecmp. You can end up with
a very very large backlog of contexts. This is bad because
zebra can easily grow to a very very large memory size and on
restricted systems you can run out of memory. Fortunately
for us, the MetaQ already participates with this process
by not doing more route processing until the dg_update_list
goes below the working limit of dg_updates_per_cycle. Thus
if FRR modifies the behavior of this loop to not move more
contexts onto the input queue if either the input queue
or output queue of the next provider has reached this limit.
FRR will naturaly start auto handling backpressure for the dplane
context system and memory will not go out of control.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The ctx queue data structures already have a counter
associated with them. Let's just use them instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When trying to track down a MTYPE_TMP memory leak
it's harder to search for it when you happen to
have some usage of ttable_dump. Let's just give
it it's own memory type so that we can avoid
confusion in the future.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the FRR code will receive both kernel and
connected routes that do not actually have an underlying
nexthop group at all. Zebra turns around and creates
a `matching` nexthop hash entry and installs it.
For connected routes, this will create 2 singleton
nexthops in the dplane per interface (v4 and v6).
For kernel routes it would just create 1 singleton
nexthop that might be used or not.
This is bad because the dplane has a limited amount
of space available for nexthop entries and if you
happen to have a large number of interfaces then
all of a sudden you have 2x(# of interfaces) singleton
nexthops.
Let's modify the code to delay creation of these singleton
nexthops until they have been used by something else in the
system.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There is a code path that could theoretically get you
to a point where the ng->nexthop is a NULL value.
Let's just make sure the SA system believes that
cannot happen anymore.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
A blackhole nexthop, according to the linux kernel,
can be v4 or v6. A v4 blackhole nexthop cannot be
used on a v6 route, but a v6 blackhole nexthop can
be used with a v4 route. Convert all blackhole
singleton nexthops to v6 and just use that.
Possibly reducing the number of active nexthops by 1.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Let's display the afi of the nexthop hash entry. Right
now it is impossible to tell the difference between v4 or
v6 nexthops, especially since it is important for the kernel.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Move the prefix lookup/comparison to outside the re loop
and into the rn loop, since that is where the code should
actually be.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There exists a path in rib_add_multipath where if a decision
is made to not use the passed in re, we just drop the memory
instead of freeing it. Let's free it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Current code intentionally ignores kernel routes. Modify
zebra to allow these routes to be read in on linux. Also
modify zebra to look to see if a route should be treated
as a connected and mark it as such.
Additionally this should properly handle some of the issues
being seen with NOPREFIXROUTE.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The prefix'es p and src_p are not const. Let's make
them so. Useful to signal that we will not change this
data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently FRR when it has two nexthop groups:
A
nexthop 1 weight 5
nexthop 2 weight 6
nexthop 3 weight 7
B
nexthop 1 weight 3
nexthop 2 weight 4
nexthop 3 weight 5
We end up with 5 singleton nexthops and two groups:
ID: 181818168 (sharp)
RefCnt: 1
Uptime: 00:04:52
VRF: default
Valid, Installed
Depends: (69) (70) (71)
via 192.168.119.1, enp13s0 (vrf default), weight 182
via 192.168.119.2, enp13s0 (vrf default), weight 218
via 192.168.119.3, enp13s0 (vrf default), weight 255
ID: 181818169 (sharp)
RefCnt: 1
Uptime: 00:02:08
VRF: default
Valid, Installed
Depends: (71) (127) (128)
via 192.168.119.1, enp13s0 (vrf default), weight 127
via 192.168.119.2, enp13s0 (vrf default), weight 170
via 192.168.119.3, enp13s0 (vrf default), weight 255
id 69 via 192.168.119.1 dev enp13s0 scope link proto 194
id 70 via 192.168.119.2 dev enp13s0 scope link proto 194
id 71 via 192.168.119.3 dev enp13s0 scope link proto 194
id 127 via 192.168.119.1 dev enp13s0 scope link proto 194
id 128 via 192.168.119.2 dev enp13s0 scope link proto 194
id 181818168 group 69,182/70,218/71,255 proto 194
id 181818169 group 71,255/127,127/128,170 proto 194
This is not a desirable state to be in. If you have a
link flapping in the network and weights are changing
rapidly you end up with a large number of singleton
nexthops that are being used by the nexthop groups.
This fills up asic space and clutters the table.
Additionally singleton nexthops cannot have any weight
and the fact that you attempt to create a singleton
nexthop with different weights means nothing to the
linux kernel( or any asic dplane ). Let's modify
the code to always create the singleton nexthops
without a weight and then just creating the
NHG's that use the singletons with the appropriate
weight.
ID: 181818168 (sharp)
RefCnt: 1
Uptime: 00:00:32
VRF: default
Valid, Installed
Depends: (22) (24) (28)
via 192.168.119.1, enp13s0 (vrf default), weight 182
via 192.168.119.2, enp13s0 (vrf default), weight 218
via 192.168.119.3, enp13s0 (vrf default), weight 255
ID: 181818169 (sharp)
RefCnt: 1
Uptime: 00:00:14
VRF: default
Valid, Installed
Depends: (22) (24) (28)
via 192.168.119.1, enp13s0 (vrf default), weight 153
via 192.168.119.2, enp13s0 (vrf default), weight 204
via 192.168.119.3, enp13s0 (vrf default), weight 255
id 22 via 192.168.119.1 dev enp13s0 scope link proto 194
id 24 via 192.168.119.2 dev enp13s0 scope link proto 194
id 28 via 192.168.119.3 dev enp13s0 scope link proto 194
id 181818168 group 22,182/24,218/28,255 proto 194
id 181818169 group 22,153/24,204/28,255 proto 194
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
All routes received by zebra from upper level protocols have a weight
of 1. Let's just make everything extremely consistent in our code.
Lot's of tests needed to be fixed up to make this work.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The linux kernel adds 1 upon receipt of a weight, if you
send a 255 it gets unhappy. Let's Limit range to 254 as
that kernel does not like sending of 255.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Commit 605df8d44 zebra: Use zebra dplane for RTM link and addr broke loading of
kernel routes at startup for configurations without netlink.
It rearranged the startup sequence in zebra_ns_enable() to pass via
zebra_ns_startup_continue(), triggered through zebra_dplane_startup_stage()
calls. However, it neglected to make these calls in the non-netlink code path.
As a result zebra failed to load kernel routes at startup on platforms such
as FreeBSD.
Insert these calls so we run through all of the expected startup stages.
Signed-off-by: Kristof Provost <kprovost@netgate.com>
The function zebra_nhg_hash_equal is only used
as a hash function for storage of NHG's and retrieval.
If you have say two nhg's:
31 (25/26)
32 (25/26)
This function would return them as being equal. Which
of course leads to the problem when you attempt to
hash_release 32 but release 31 from the hash. Then later
when you attempt to do hash comparisons 32 has actually
been freed leaving to use after free situations and shit
goes down hill fast.
This hash is only used as part of the hash comparison
function for nexthop group storage. Since this is so
let's always return the 31/32 nhg's are not equal at all.
We possibly have a different problem where we are creating
31 and 32 ( when 31 should have just been used instead of 32 )
but we need to prevent any type of hash release problem at all.
This supercedes any other issue( that should be tracked down
on it's own ). Since you can have use after free situation
that leads to a crash -vs- some possible nexthop group duplication
which is very minor in comparison.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When the prefix-list is not found, show which AFI is the real one we are
looking for.
E.g.: looking at this output is not clear:
```
[RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List p1 specified does not exist defaulting to NO_MATCH
```
route_match_address_prefix_list() is called by route_match_ipv6_address_prefix_list(),
and route_match_ip_address_prefix_list().
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
It is possible that right before an upper level protocol dies
or is killed routes would be installed into zebra. These routes
could be on the Meta-Q for early route-processing. Leaving us with
a situation where the client is removed, and all it's routes that are
in the rib at that time, and then after that the MetaQ is run and the
routes are reprocessed leaving routes from an upper level daemon
post daemon going away from zebra's perspective. These routes will
be abandoned.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
During internal testing, when the following sequence is followed, two
non default vrfs end up pointing to the same table-id
- Initially vrf201 has table id 1002
- ip link add dev vrf202 type vrf table 1002
- ip link set dev vrf202 up
- ip link set dev <intrerface> master vrf202
This will ideally lead to zebra exit since this is a misconfiguration as
expected.
However if we perform a restart frr.service at this point, we end up
having two vrfs pointing to same table-id and bad things can happen.
This is because in the interface_vrf_change, we incorrectly check for
vrf_lookup_by_id() to evaluate if there is a misconfig. This works well
for a non restart case but not for the startup case.
root@mlx-3700-20:mgmt:/var/log/frr# sudo vtysh -c "sh vrf"
vrf mgmt id 37 table 1001
vrf vrf201 id 46 table 1002
vrf vrf202 id 59 table 1002 >>>>
Fix: in all cases of misconfiguration, exit zebra as expected.
Ticket :#3970414
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
The bgp_duplicate_nexthop test installs routes with nexthop's
flags set to both DUPLICATE and FIB: this should not happen.
The DUPLICATE flag of a nexthop indicates this nexthop is already
used in the same nexthop-group, and there is no need to install it
twice in the system; having the FIB flag set indicates that the
nexthop is installed in the system. This is why both flags should
not be set on the same nexthop.
This case happens at installation time, but can also happen
at update time.
- Fix this by not setting the FIB flag value when the DUPLICATE
flag is present.
- Modify the bgp_duplicate_test to check that the FIB flag is not
present on duplicated nexthops.
- Modify the bgp_peer_type_multipath_relax test.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
In L3 BGP-EVPN, if there are both IPv4 and IPv6 routes in the VPN, zebra
maintains two instances of `struct zebra_neigh` object: one with IPv4
address of the nexthop, and another with IPv6 address that is an IPv4
mapped to IPv6, but only one intance of `struct zebra_mac` object, that
contains a list of nexthop addresses that use this mac.
The code in `zebra_vxlan` module uses the fact that the list is empty as
the indication that the `zebra_mac` object is unused, and needs to be
dropped. However, preexisting code used nexthop address converted to
IPv4 notation for the element of this list. As a result, when two
`zebra_neigh` objects, one IPv4 and one IPv6-mapped-IPv4 were linked to
the `zebra_mac` object, only one element was added to the list.
Consequently, when one of the two `zebra_neigh` objects was dropped, the
only element in the list was removed, making it empty, and `zebra_mac`
object was dropped, and neigbrour cache elements uninstalled from the
kernel.
As a result, after the last route in _one_ family was removed from a
remote vtep, all remaining routes in the _other_ family became
unreachable, because RMAC of the vtep was removed.
This commit makes `zebra_mac` use uncoerced IP address of the `zebra_neigh`
object for the entries in the `nh_list`. This way, `zebra_mac` object no
longer loses track of `zebra_neigh` objects that need it.
Bug-URL: https://github.com/FRRouting/frr/issues/16340
Signed-off-by: Eugene Crosser <crosser@average.org>
Current code when a link is set down is to just mark the
nexthop group as not properly setup. Leaving situations
where when an interface goes down and show output is
entered we see incorrect state. This is true for anything
that would be checking those flags at that point in time.
Modify the interface down nexthop group code to notice the
nexthops appropriately ( and I mean set the appropriate flags )
and to allow a `show ip route` command to actually display
what is going on with the nexthops.
eva# show ip route 1.0.0.0
Routing entry for 1.0.0.0/32
Known via "sharp", distance 150, metric 0, best
Last update 00:00:06 ago
* 192.168.44.33, via dummy1, weight 1
* 192.168.45.33, via dummy2, weight 1
sharpd@eva:~/frr1$ sudo ip link set dummy2 down
eva# show ip route 1.0.0.0
Routing entry for 1.0.0.0/32
Known via "sharp", distance 150, metric 0, best
Last update 00:00:12 ago
* 192.168.44.33, via dummy1, weight 1
192.168.45.33, via dummy2 inactive, weight 1
Notice now that the 1.0.0.0/32 route now correctly
displays the route for the nexthop group entry.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a new start option "-K" to libfrr to denote a graceful start,
and use it in zebra and bgpd.
zebra will use this option to denote a planned FRR graceful restart
(supporting only bgpd currently) to wait for a route sync completion
from bgpd before cleaning up old stale routes from the FIB. An optional
timer provides an upper-bounds for this cleanup.
bgpd will use this option to denote either a planned FRR graceful
restart or a bgpd-only graceful restart, and this will drive the BGP
GR restarting router procedures.
Signed-off-by: Vivek Venkatraman <vivek@nvidia.com>
The `locator` pointer is dereferenced before ensuring it is not NULL.
Fix the issue by checking that the pointer is not NULL before
dereferencing it.
Fixes 1594013
** CID 1594013: Null pointer dereferences (REVERSE_INULL)
/zebra/zebra_srv6.c: 961 in zebra_srv6_sid_compose()
________________________________________________________________________________________________________
*** CID 1594013: Null pointer dereferences (REVERSE_INULL)
/zebra/zebra_srv6.c: 961 in zebra_srv6_sid_compose()
955 struct srv6_locator *locator,
956 uint32_t sid_func)
957 {
958 uint8_t offset, func_len;
959 struct srv6_sid_format *format = locator->sid_format;
960
CID 1594013: Null pointer dereferences (REVERSE_INULL)
Null-checking "locator" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
961 if (!sid_value || !locator)
962 return false;
963
964 if (format) {
965 offset = format->block_len + format->node_len;
966 func_len = format->function_len;
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
The `for` loop starting at line 1848 searches the `func_allocated` array
for a pointer that points to a specific `sid_wide_func` element.
The loop should iterate over all the elements of the `func_allocated`
array and dereference each element to see if it is the one we are
looking for.
Currently, the loop is using the wrong variable to iterate over the
array.
Let's fix this issue by using the correct variable in the loop.
Fixes CID 1594014
Fixes CID 1594016
** CID 1594014: Null pointer dereferences (FORWARD_NULL)
/zebra/zebra_srv6.c: 1860 in release_srv6_sid_func_explicit()
________________________________________________________________________________________________________
*** CID 1594014: Null pointer dereferences (FORWARD_NULL)
/zebra/zebra_srv6.c: 1860 in release_srv6_sid_func_explicit()
1854
1855 /* Lookup SID function in the functions allocated list of EWLIB range */
1856 for (ALL_LIST_ELEMENTS_RO(block->u.usid
1857 .wide_lib[sid_func]
1858 .func_allocated,
1859 node, sid_func_ptr))
CID 1594014: Null pointer dereferences (FORWARD_NULL)
Dereferencing null pointer "sid_wide_func_ptr".
1860 if (*sid_wide_func_ptr == sid_wide_func)
1861 break;
1862
1863 /* Ensure that the SID function is allocated */
1864 if (!sid_wide_func_ptr) {
1865 zlog_warn("%s: failed to release wide SID function %u, function is not allocated",
** CID 1594016: Possible Control flow issues (DEADCODE)
/zebra/zebra_srv6.c: 1871 in release_srv6_sid_func_explicit()
________________________________________________________________________________________________________
*** CID 1594016: Possible Control flow issues (DEADCODE)
/zebra/zebra_srv6.c: 1871 in release_srv6_sid_func_explicit()
1865 zlog_warn("%s: failed to release wide SID function %u, function is not allocated",
1866 __func__, sid_wide_func);
1867 return -1;
1868 }
1869
1870 /* Release the SID function from the EWLIB range */
CID 1594016: Possible Control flow issues (DEADCODE)
Execution cannot reach this statement: "listnode_delete(block->u.us...".
1871 listnode_delete(block->u.usid.wide_lib[sid_func]
1872 .func_allocated,
1873 sid_wide_func_ptr);
1874 zebra_srv6_sid_func_free(sid_wide_func_ptr);
1875 } else {
1876 zlog_warn("%s: function %u is outside ELIB [%u/%u] and EWLIB alloc ranges [%u/%u]",
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
At line 1736, `alloc_mode` is set to `SRV6_SID_ALLOC_MODE_EXPLICIT` or
`SRV6_SID_ALLOC_MODE_DYNAMIC` depending on the `sid_value` variable.
There will never be a case where alloc_mode will be `SRV6_SID_ALLOC_MODE_MAX`
or `SRV6_SID_ALLOC_MODE_UNSPEC`.
Let's replace the `switch(alloc_mode) {...}` with an if-else.
Fixes CID 1594015.
** CID 1594015: (DEADCODE)
/zebra/zebra_srv6.c: 1782 in get_srv6_sid()
/zebra/zebra_srv6.c: 1781 in get_srv6_sid()
________________________________________________________________________________________________________
*** CID 1594015: (DEADCODE)
/zebra/zebra_srv6.c: 1782 in get_srv6_sid()
1776 }
1777
1778 ret = get_srv6_sid_dynamic(sid, ctx, locator);
1779
1780 break;
1781 case SRV6_SID_ALLOC_MODE_MAX:
CID 1594015: (DEADCODE)
Execution cannot reach this statement: "case SRV6_SID_ALLOC_MODE_UN...".
1782 case SRV6_SID_ALLOC_MODE_UNSPEC:
1783 default:
1784 flog_err(EC_ZEBRA_SM_CANNOT_ASSIGN_SID,
1785 "%s: SRv6 Manager: Unrecognized alloc mode %u",
1786 __func__, alloc_mode);
1787 /* We should never arrive here */
/zebra/zebra_srv6.c: 1781 in get_srv6_sid()
1775 return -1;
1776 }
1777
1778 ret = get_srv6_sid_dynamic(sid, ctx, locator);
1779
1780 break;
CID 1594015: (DEADCODE)
Execution cannot reach this statement: "case SRV6_SID_ALLOC_MODE_MAX:".
1781 case SRV6_SID_ALLOC_MODE_MAX:
1782 case SRV6_SID_ALLOC_MODE_UNSPEC:
1783 default:
1784 flog_err(EC_ZEBRA_SM_CANNOT_ASSIGN_SID,
1785 "%s: SRv6 Manager: Unrecognized alloc mode %u",
1786 __func__, alloc_mode);
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
In case of EVPN MH bond, a member port going in
protodown state due to external reason (one case being linkflap),
frr updates the state correctly but upon manually
clearing external reason trigger FRR to reinstate
protodown without any reason code.
Fix is to ensure if the protodown reason was external
and new state is to have protodown 'off' then do no reinstate
protodown.
Ticket: #3947432
Testing:
switch:#ip link show swp1
4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9216 qdisc
pfifo_fast master bond1 state DOWN mode DEFAULT group default qlen
1000
link/ether 1c:34:da:2c:aa:68 brd ff:ff:ff:ff:ff:ff protodown on
protodown_reason <linkflap>
switch:#ip link set swp1 protodown off protodown_reason linkflap off
switch:#ip link show swp1
4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9216 qdisc
pfifo_fast master bond1 state DOWN mode DEFAULT group default qlen
1000
link/ether 1c:34:da:2c:aa:68 brd ff:ff:ff:ff:ff:ff
Signed-off-by: Chirag Shah <chirag@nvidia.com>
If using weighted ECMP, the weight for non-recursive next-hop should be
inherited from recursive next-hop.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
In the near future, some daemons may only register SIDs. This may be
the case for the pathd daemon when creating SRv6 binding SIDs.
When a locator is getting deleted at ZEBRA level, the daemon may have
an easy way to find out the SIds to unregister to.
This commit proposes to add the locator name to the SID_SRV6_NOTIFY
message whenever possible. Only case when an allocation failure happens,
the locator will not be present. In all other places, the notify API
at procol levels has the locator name extra-parameter.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
When removing a large number of routes, the linux kernel can take the
cpu for an extended amount of time, leaving a situation where FRR
detects a starvation event.
r1# sharp install routes 10.0.0.0 nexthop 192.168.44.33 1000000 repeat 10
2024-06-14 12:55:49.365 [NTFY] sharpd: [M7Q4P-46WDR] vty[5]@# sharp install routes 10.0.0.0 nexthop 192.168.44.33 1000000 repeat 10
2024-06-14 12:55:49.365 [DEBG] sharpd: [YP4TQ-01TYK] Inserting 1000000 routes
2024-06-14 12:55:57.256 [DEBG] sharpd: [TPHKD-3NYSB] Installed All Items 7.890085
2024-06-14 12:55:57.256 [DEBG] sharpd: [YJ486-NX5R1] Removing 1000000 routes
2024-06-14 12:56:07.802 [WARN] zebra: [QH9AB-Y4XMZ][EC 100663314] STARVATION: task dplane_thread_loop (634377bc8f9e) ran for 7078ms (cpu time 220ms)
2024-06-14 12:56:25.039 [DEBG] sharpd: [WTN53-GK9Y5] Removed all Items 27.783668
2024-06-14 12:56:25.039 [DEBG] sharpd: [YP4TQ-01TYK] Inserting 1000000 routes
2024-06-14 12:56:32.783 [DEBG] sharpd: [TPHKD-3NYSB] Installed All Items 7.743524
2024-06-14 12:56:32.783 [DEBG] sharpd: [YJ486-NX5R1] Removing 1000000 routes
2024-06-14 12:56:41.447 [WARN] zebra: [QH9AB-Y4XMZ][EC 100663314] STARVATION: task dplane_thread_loop (634377bc8f9e) ran for 5175ms (cpu time 179ms)
Let's modify the loop in dplane_thread_loop such that after a provider
has been run, check to see if the event should yield, if so, stop
and reschedule this for the future.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Instead of keeping a counter that is independent
of the queue's data structure. Just use the queue's
built-in counter. Ensure that it's pthread safe by
keeping it wrapped inside the mutex for adding/deleting
to the queue.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently, when a locator is deleted in zebra, zebra notifies only the
zclient that owns the locator.
With the introduction of SID Manager, the locator is no longer owned by
any client. Instead, the locator is owned by Zebra, and clients can
allocate and release SIDs from the locator using the ZAPI
ZEBRA_SRV6_MANAGER_GET_SID and ZEBRA_SRV6_MANAGER_RELEASE_SID.
Therefore, when a locator is removed in Zebra, we need to notify all
daemons so that they can release/uninstall the SIDs allocated by that
locator.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Send asynchronous notifications to zclients when an SRv6 SID is
allocated/released and when a SID alloc/release operation fails.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Previous commits introduced two new ZAPI operations,
`ZEBRA_SRV6_MANAGER_GET_SRV6_SID` and
`ZEBRA_SRV6_MANAGER_RELEASE_SRV6_SID`. These operations allow a daemon
to interact with the SRv6 SID Manager to get and release an SRv6 SID,
respectively.
This commit extends the SID Manager by adding logic to process the
requests `ZEBRA_SRV6_MANAGER_GET_SRV6_SID` and
`ZEBRA_SRV6_MANAGER_RELEASE_SRV6_SID`, and allocate/release SIDs to
requesting daemons.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Add functions to allocate/release SRv6 SIDs. SIDs can be allocated
either explicitly (allocate a specific SID) or dynamically (allocate any
available SID).
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
The previous commits introduced a new operation,
`ZEBRA_SRV6_MANAGER_GET_LOCATOR`, allowing a daemon to request
information about a specific SRv6 locator from the SRv6 SID Manager.
This commit extends the SID Manager to respond to a
`ZEBRA_SRV6_MANAGER_GET_LOCATOR` request and provide the requested
locator information.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Add a data structure to represent an SRv6 SID context and the related
management functions (allocate/free).
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Add the CLI to choose the SID format of a locator. When the SID format
of a locator is changed, the SIDs allocated from that locator might no
longer be valid (for example, because the new format might involve a
different SID allocation schema). In such a case, it is necessary to
notify all the zclients so that they can withdraw/uninstall the old SIDs
that use the previous format and allocate/install/advertise the new SIDs
based on the new format.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
An SRv6 block is an IPv6 prefix from which SIDs are allocated. This
commit adds support for SRv6 SID blocks. Specifically, it adds a data
structure to store information about an SRv6 block (e.g., its occupancy
status, which SIDs have been allocated and which are available, which
SID format is used for that block, etc.). It also adds some functions to
manage the block (allocate / free / lookup).
These functions will be used in the next commits to support the
allocation of SIDs from a block in the SID Manager.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>