Use mlx5 vf as PF1 proxy #614

Merged on Oct 17, 2024 (11 commits)
1 change: 1 addition & 0 deletions Dockerfile
@@ -100,6 +100,7 @@ ARG DPSERVICE_FEATURES=""
RUN meson setup release_build $DPSERVICE_FEATURES --buildtype=release && ninja -C release_build
RUN CC=clang CXX=clang++ meson setup clang_build $DPSERVICE_FEATURES && ninja -C clang_build
RUN meson setup xtratest_build $DPSERVICE_FEATURES -Denable_tests=true && ninja -C xtratest_build
RUN meson setup pf1_proxy_build $DPSERVICE_FEATURES -Denable_pf1_proxy=true && ninja -C pf1_proxy_build


# Test-image to run pytest
1 change: 1 addition & 0 deletions docs/deployment/help_dpservice-bin.md
@@ -6,6 +6,7 @@
| -v, --version | None | display version and exit | |
| --pf0 | IFNAME | first physical interface (e.g. eth0) | |
| --pf1 | IFNAME | second physical interface (e.g. eth1) | |
| --pf1-proxy | IFNAME | VF representor to use as a proxy for pf1 packets | |
| --ipv6 | ADDR6 | IPv6 underlay address | |
| --vf-pattern | PATTERN | virtual interface name pattern (e.g. 'eth1vf') | |
| --dhcp-mtu | SIZE | set the mtu field in DHCP responses (68 - 1500) | |
5 changes: 5 additions & 0 deletions docs/deployment/mellanox.md
@@ -38,6 +38,11 @@ Set the number of VFs to the needed value (max 126 at the moment) and enable bot
Restart the machine for the changes to take effect.
> These changes are done in the NIC itself, it does not matter if the host is an ephemeral image or if another host OS will boot later.
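
The firmware limit interacts with the VF count requested at setup time: `hack/prepare.sh` creates `NUMVFS` virtual functions, or fewer if the NIC reports a smaller `sriov_totalvfs`. A minimal sketch of that cap, using a hypothetical helper name (`clamp_vfs`) and made-up numbers rather than values read from sysfs:

```sh
# Hypothetical helper: return the smaller of the requested and available VF counts.
# $1: requested number of VFs (NUMVFS in prepare.sh)
# $2: number of VFs the NIC exposes (contents of sriov_totalvfs)
clamp_vfs() {
    requested=$1
    total=$2
    if [ "$requested" -lt "$total" ]; then
        echo "$requested"
    else
        echo "$total"
    fi
}

# Example: 126 VFs requested, but the firmware is configured for only 64.
clamp_vfs 126 64
```

On a real host the second argument would come from `/sys/bus/pci/devices/<pci-address>/sriov_totalvfs`.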

### Multiport-eswitch
For this mode to be functional, an additional firmware setting `LAG_RESOURCE_ALLOCATION=1` is needed.

In some cases (apparently depending on the NIC/switch combination), performance is severely degraded while VM traffic is flowing. Setting `ROCE_CONTROL=1` (this means "disabled"; the default is `2`, meaning "enabled") has been observed to fix this. The actual cause is yet to be discovered.
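
The two firmware settings above are usually changed with `mlxconfig`. The following is a dry-run sketch only, assuming `mlxconfig` from the NVIDIA firmware tools is installed. The PCI address `0000:03:00.0` and the helper name `fw_set_cmd` are placeholders, not part of this repository. The helper prints the commands instead of running them, so they can be reviewed first.

```sh
# Hypothetical helper: print (do not execute) the mlxconfig command for one setting.
# $1: PCI address of the NIC, $2: SETTING=VALUE pair
fw_set_cmd() {
    dev=$1
    setting=$2
    printf 'mlxconfig -d %s set %s\n' "$dev" "$setting"
}

DEV="0000:03:00.0"   # assumption: replace with the PCI address of your NIC

# Required for multiport-eswitch mode.
fw_set_cmd "$DEV" "LAG_RESOURCE_ALLOCATION=1"
# Optional workaround for the performance issue (1 = disabled, 2 = enabled).
fw_set_cmd "$DEV" "ROCE_CONTROL=1"
```

As with the other firmware settings in this document, a reboot is needed before the new values take effect.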


## Dp-service setup
Either `prepare.sh` script or `preparedp.service` systemd unit needs to be run before dp-service can work properly. This should already be done automatically if using the Docker image provided. Make sure this does not produce any errors.
22 changes: 22 additions & 0 deletions docs/sys_design/README.md
@@ -0,0 +1,22 @@
# Graph Framework
This is the graph topology for packets handled by dpservice. Offloaded packets never enter dpservice (and thus the graph) itself.

![dpservice graph schema](dpservice_dataplane.drawio.png "dpservice graph schema")

Note that every graph node also has an edge leading to a **"Drop"** node, which is omitted here for clarity. As the name suggests, that node has no outgoing edges and simply drops packets without sending them anywhere.

## PF1-proxy
When using the (conditionally compiled-in) pf1-proxy feature, all traffic for the host (i.e. not underlay traffic for dpservice) needs to be forwarded to a special VF on PF1 called "pf1-proxy" and back.
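
The `hack/prepare.sh` changes in this PR create and configure that VF: one VF is created on PF1, both the VF representor and the VF itself get an MTU of 9100, and the VF is given PF1's MAC address. A dry-run sketch of those steps follows. The function name `pf1_proxy_setup_cmds`, the interface names and the MAC address are placeholders, and the real script discovers them via `devlink port` and sysfs.

```sh
# Hypothetical dry run: print the commands that set up the pf1-proxy VF.
# $1: PCI address of PF1
# $2: VF representor (the interface later given to --pf1-proxy)
# $3: the VF netdev itself
# $4: MAC address of PF1, copied onto the VF
pf1_proxy_setup_cmds() {
    pf1_pci=$1
    representor=$2
    vf=$3
    pf1_mac=$4

    printf 'echo 1 > /sys/bus/pci/devices/%s/sriov_numvfs\n' "$pf1_pci"
    printf 'ip link set %s mtu 9100\n' "$representor"
    printf 'ip link set %s up\n' "$representor"
    printf 'ip link set %s mtu 9100\n' "$vf"
    printf 'ip link set %s address %s\n' "$vf" "$pf1_mac"
    printf 'ip link set %s up\n' "$vf"
}

# Placeholder values for illustration only.
pf1_proxy_setup_cmds 0000:03:00.1 pf1vf0 enp3s0f1v0 02:00:00:00:00:01
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without the hardware.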

### Traffic from proxy to PF1
Since **all packets** without exception need to be forwarded directly to PF1, an rte-rule is installed to do just that, so all packets are offloaded and never enter the graph.

### Traffic from PF1 to proxy
Only non-underlay IPv6 packets, i.e. IPv6 packets with destination IP matching the host's IP (`--ipv6` command-line argument) are directly forwarded by offloading via an rte-rule. The remaining packets enter dpservice normally and if they are classified as "unusable" (i.e. should be dropped by "Classify" node), they are instead forwarded to pf1-proxy. See the dashed graph edge above.

## Virtual services
If virtual services are compiled-in, there is another path for packets to take. Packets going from a virtual IPv4 and TCP/UDP port to a specific web-service (i.e. specific IPv6 and TCP/UDP port) undergo an IP header replacement (from IPv4 to IPv6 and back) to enable VMs to contact IPv6 web-services without the use of NAT. This is useful for services that are heavily used by many connections, like DNS, k8s api-servers, etc.

For this to work some changes to the graph topology are needed. For simplicity, this schema is separate and should be imagined as an "overlay" over the standard schema above.

![dpservice virtual services schema](dpservice_virtsvc.drawio.png "virtual services graph schema")
Binary file modified docs/sys_design/dpservice_dataplane.drawio.png
Binary file added docs/sys_design/dpservice_virtsvc.drawio.png
9 changes: 9 additions & 0 deletions hack/dp_conf.json
@@ -19,6 +19,15 @@
"type": "char",
"array_size": "IF_NAMESIZE"
},
{
"lgopt": "pf1-proxy",
"arg": "IFNAME",
"help": "VF representor to use as a proxy for pf1 packets",
"var": "pf1_proxy",
"type": "char",
"array_size": "IF_NAMESIZE",
"ifdef": "ENABLE_PF1_PROXY"
},
{
"lgopt": "ipv6",
"arg": "ADDR6",
79 changes: 56 additions & 23 deletions hack/prepare.sh
@@ -140,32 +140,31 @@ process_multiport_eswitch_mode() {
}

function create_vf() {
local pf="${devs[0]}"
local pf0="${devs[0]}"
local pf1="${devs[1]}"

if [[ "$IS_ARM_WITH_BLUEFIELD" == "true" ]]; then
actualvfs=$NUMVFS
log "Skipping VF creation for BlueField card on ARM"
# enable switchdev mode, this operation takes most time
process_switchdev_mode "$pf"
process_switchdev_mode "$pf0"
return
fi

if [[ "$CONFIG_ONLY" == "true" ]]; then
actualvfs=$(cat /sys/bus/pci/devices/$pf/sriov_numvfs)
actualvfs=$(cat /sys/bus/pci/devices/$pf0/sriov_numvfs)
log "Skipping VF creation as requested"
return
fi

# we disable automatic binding so that VFs don't get created, saves a lot of time
# plus we don't need to unbind them before enabling switchdev mode
log "disabling automatic binding of VFs on pf: $pf"
echo 0 > /sys/bus/pci/devices/$pf/sriov_drivers_autoprobe

# calculating amount of VFs to create, 126 if more are available, or maximum available
totalvfs=$(cat /sys/bus/pci/devices/$pf/sriov_totalvfs)
actualvfs=$((NUMVFS<totalvfs ? NUMVFS : totalvfs))
log "creating $actualvfs virtual functions"
echo $actualvfs > /sys/bus/pci/devices/$pf/sriov_numvfs
log "disabling automatic binding of VFs on pf0 '$pf0'"
echo 0 > /sys/bus/pci/devices/$pf0/sriov_drivers_autoprobe
if [[ "$OPT_PF1_PROXY" == "true" ]]; then
log "enabling automatic binding of VFs on pf1 '$pf1'"
echo 1 > /sys/bus/pci/devices/$pf1/sriov_drivers_autoprobe
fi

if [[ "$IS_X86_WITH_MLX" == "true" ]]; then
# enable switchdev mode, this operation takes most time
@@ -174,7 +173,7 @@ function create_vf() {
process_switchdev_mode "$pf"
done
else
process_switchdev_mode "$pf"
process_switchdev_mode "$pf0"
fi
fi

@@ -183,19 +182,60 @@ function create_vf() {
process_multiport_eswitch_mode "$pf"
done
fi

# calculating amount of VFs to create, 126 if more are available, or maximum available
totalvfs=$(cat /sys/bus/pci/devices/$pf0/sriov_totalvfs)
actualvfs=$((NUMVFS<totalvfs ? NUMVFS : totalvfs))
log "creating $actualvfs virtual functions"
echo $actualvfs > /sys/bus/pci/devices/$pf0/sriov_numvfs
if [[ "$OPT_PF1_PROXY" == "true" ]]; then
log "creating pf1-proxy virtual function"
echo 1 > /sys/bus/pci/devices/$pf1/sriov_numvfs
log "configuring pf1-proxy"
local pf1proxy=$(get_pf1_proxy $pf1)
ip link set $pf1proxy mtu 9100
ip link set $pf1proxy up
local pf1_name=$(get_ifname 1)
local pf1_mac=$(cat /sys/class/net/$pf1_name/address)
local pf1proxy_vf=$(get_pf1_proxy_vf)
ip link set $pf1proxy_vf mtu 9100
ip link set $pf1proxy_vf address $pf1_mac
ip link set $pf1proxy_vf up
fi
}

function get_pattern() {
local dev=$1
pattern=$(devlink port | grep pci/$dev/ | grep "virtual\|pcivf" | awk '{print $5}' | sed -rn 's/(.*[a-z_])[0-9]{1,3}$/\1/p' | uniq)
if [ -z "$pattern" ]; then
err "can't determine the pattern for $dev"
err "can't determine the vf pattern for $dev"
elif [ $(wc -l <<< "$pattern") -ne 1 ]; then
err "multiple patterns found for $dev"
err "multiple vf patterns found for $dev"
fi
echo "$pattern"
}

function get_pf1_proxy() {
local dev=$1
proxy=$(devlink port | grep pci/$dev/ | grep "virtual\|pcivf" | awk '{print $5}' | uniq)
if [ -z "$proxy" ]; then
err "can't determine the pf1-proxy vf for $dev"
elif [ $(wc -l <<< "$proxy") -ne 1 ]; then
err "multiple pf1-proxy devices found for $dev"
fi
echo "$proxy"
}

function get_pf1_proxy_vf() {
vf=$(devlink port | grep auxiliary/mlx5_core.eth.2/ | grep virtual | awk '{print $5}' | uniq)
if [ -z "$vf" ]; then
err "can't determine the pf1-proxy vf"
elif [ $(wc -l <<< "$vf") -ne 1 ]; then
err "multiple pf1-proxy vfs found"
fi
echo "$vf"
}

function get_ifname() {
local port=$1
devlink port | grep "physical port $port" | awk '{ print $5}'
@@ -211,13 +251,6 @@ function get_ipv6() {
done < <(ip -6 -o addr show lo | awk '{print $4}')
}


function get_pf_mac() {
local pci_dev=${devs[$1]}
local pf=$(get_ifname $1)
cat /sys/bus/pci/devices/$pci_dev/net/$pf/address
}

function make_config() {
if [[ "$IS_X86_WITH_BLUEFIELD" == "true" ]]; then
log "Skipping config file creation on AMD/Intel 64-bit host with Bluefield"
@@ -233,7 +266,7 @@ function make_config() {
if [[ "$OPT_MULTIPORT" == "true" ]]; then
echo "a-pf0 ${devs[0]},class=rxq_cqe_comp_en=0,rx_vec_en=1,dv_flow_en=2,dv_esw_en=1,fdb_def_rule_en=1,representor=pf[0-1]vf[0-$[$actualvfs-1]]"
if [[ "$OPT_PF1_PROXY" == "true" ]]; then
echo "pf1-proxy $(get_pf_mac 1)"
echo "pf1-proxy $(get_pf1_proxy ${devs[1]})"
fi
echo "multiport-eswitch"
else
@@ -244,7 +277,7 @@ function make_config() {
if [[ "$OPT_MULTIPORT" == "true" ]]; then
log "dpservice configured in multiport-eswitch mode"
if [[ "$OPT_PF1_PROXY" == "true" ]]; then
log "dpservice will create a TAP device to proxy PF1"
log "dpservice will create a PF1-proxy"
fi
else
log "dpservice configured in normal mode"
3 changes: 0 additions & 3 deletions include/dp_conf.h
@@ -49,9 +49,6 @@ const struct dp_conf_dhcp_dns *dp_conf_get_dhcp_dns(void);
const struct dp_conf_dhcp_dns *dp_conf_get_dhcpv6_dns(void);

#ifdef ENABLE_PF1_PROXY
const char *dp_get_eal_pf1_proxy_mac_addr(void);
const char *dp_get_eal_pf1_proxy_dev_name(void);
const char *dp_generate_eal_pf1_proxy_params(void);
bool dp_conf_is_pf1_proxy_enabled(void);
#endif

3 changes: 3 additions & 0 deletions include/dp_conf_opts.h
@@ -27,6 +27,9 @@ enum dp_conf_log_format {

const char *dp_conf_get_pf0_name(void);
const char *dp_conf_get_pf1_name(void);
#ifdef ENABLE_PF1_PROXY
const char *dp_conf_get_pf1_proxy(void);
#endif
const char *dp_conf_get_vf_pattern(void);
int dp_conf_get_dhcp_mtu(void);
int dp_conf_get_wcmp_perc(void);
24 changes: 13 additions & 11 deletions include/dp_port.h
@@ -57,6 +57,10 @@ struct dp_port_async_template {

enum dp_port_async_template_type {
DP_PORT_ASYNC_TEMPLATE_PF_ISOLATION,
#ifdef ENABLE_PF1_PROXY
DP_PORT_ASYNC_TEMPLATE_PF1_FROM_PROXY,
DP_PORT_ASYNC_TEMPLATE_PF1_TO_PROXY,
#endif
#ifdef ENABLE_VIRTSVC
DP_PORT_ASYNC_TEMPLATE_VIRTSVC_TCP_ISOLATION,
DP_PORT_ASYNC_TEMPLATE_VIRTSVC_UDP_ISOLATION,
@@ -67,6 +71,10 @@ enum dp_port_async_template_type {
enum dp_port_async_flow_type {
DP_PORT_ASYNC_FLOW_ISOLATE_IPIP,
DP_PORT_ASYNC_FLOW_ISOLATE_IPV6,
#ifdef ENABLE_PF1_PROXY
DP_PORT_ASYNC_FLOW_PF1_FROM_PROXY,
DP_PORT_ASYNC_FLOW_PF1_TO_PROXY,
#endif
DP_PORT_ASYNC_FLOW_COUNT,
};

@@ -108,11 +116,10 @@ struct dp_ports {
// hidden structures for inline functions to access
extern struct dp_port *_dp_port_table[DP_MAX_PORTS];
extern struct dp_port *_dp_pf_ports[DP_MAX_PF_PORTS];
extern struct dp_ports _dp_ports;

#ifdef ENABLE_PF1_PROXY
extern struct dp_port _dp_pf_proxy_tap_port;
extern struct dp_port _dp_pf1_proxy_port;
#endif
extern struct dp_ports _dp_ports;


struct dp_port *dp_get_port_by_name(const char *pci_name);
@@ -123,7 +130,7 @@ void dp_ports_free(void);

int dp_start_port(struct dp_port *port);
#ifdef ENABLE_PF1_PROXY
int dp_start_pf_proxy_tap_port(void);
int dp_start_pf1_proxy_port(void);
#endif
int dp_stop_port(struct dp_port *port);

@@ -158,11 +165,6 @@ struct dp_port *dp_get_out_port(struct dp_flow *df)
static __rte_always_inline
struct dp_port *dp_get_port_by_id(uint16_t port_id)
{
#ifdef ENABLE_PF1_PROXY
if (unlikely(dp_conf_is_pf1_proxy_enabled() && port_id == _dp_pf_proxy_tap_port.port_id))
return &_dp_pf_proxy_tap_port;
#endif

if (unlikely(port_id >= RTE_DIM(_dp_port_table))) {
DPS_LOG_ERR("Port not registered in dpservice", DP_LOG_PORTID(port_id));
return NULL;
@@ -201,9 +203,9 @@ struct dp_port *dp_get_port_by_pf_index(uint16_t index)

#ifdef ENABLE_PF1_PROXY
static __rte_always_inline
const struct dp_port *dp_get_pf_proxy_tap_port(void)
const struct dp_port *dp_get_pf1_proxy(void)
{
return &_dp_pf_proxy_tap_port;
return &_dp_pf1_proxy_port;
}
#endif

1 change: 1 addition & 0 deletions include/dp_virtsvc.h
@@ -46,6 +46,7 @@ struct dp_virtsvc {
rte_be16_t service_port;
uint8_t proto;
uint16_t last_assigned_port;
union dp_ipv6 ul_addr;
struct rte_hash *open_ports;
struct dp_virtsvc_conn connections[DP_VIRTSVC_PORTCOUNT];
struct rte_flow *isolation_rules[DP_MAX_PF_PORTS];
20 changes: 20 additions & 0 deletions include/nodes/cls_node.h
@@ -0,0 +1,20 @@
// SPDX-FileCopyrightText: 2023 SAP SE or an SAP affiliate company and IronCore contributors
// SPDX-License-Identifier: Apache-2.0

#ifndef __INCLUDE_CLS_NODE_H__
#define __INCLUDE_CLS_NODE_H__

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

#ifdef ENABLE_PF1_PROXY
int cls_node_append_tx(uint16_t port_id, const char *tx_node_name);
#endif

#ifdef __cplusplus
}
#endif
#endif
3 changes: 2 additions & 1 deletion include/rte_flow/dp_rte_async_flow_isolation.h
@@ -22,7 +22,8 @@ int dp_create_virtsvc_async_isolation_templates(struct dp_port *port, uint8_t pr

struct rte_flow *dp_create_virtsvc_async_isolation_rule(uint16_t port_id, uint8_t proto_id,
const union dp_ipv6 *svc_ipv6, rte_be16_t svc_port,
struct rte_flow_template_table *template_table);
struct rte_flow_template_table *template_table,
const union dp_ipv6 *ul_addr);
#endif

#ifdef __cplusplus
24 changes: 24 additions & 0 deletions include/rte_flow/dp_rte_async_flow_pf1_proxy.h
@@ -0,0 +1,24 @@
// SPDX-FileCopyrightText: 2023 SAP SE or an SAP affiliate company and IronCore contributors
// SPDX-License-Identifier: Apache-2.0

#ifndef __INCLUDE_DP_RTE_FLOW_ASYNC_FLOW_PF1_PROXY_H__
#define __INCLUDE_DP_RTE_FLOW_ASYNC_FLOW_PF1_PROXY_H__

#define DP_PF1_PROXY_RULE_COUNT 2

#ifdef __cplusplus
extern "C" {
#endif

#include "dp_port.h"

int dp_create_pf_async_from_proxy_templates(struct dp_port *port);
int dp_create_pf_async_to_proxy_templates(struct dp_port *port);

uint16_t dp_create_pf1_proxy_async_isolation_rules(struct dp_port *port);

#ifdef __cplusplus
}
#endif

#endif
18 changes: 18 additions & 0 deletions include/rte_flow/dp_rte_flow_helpers.h
@@ -38,6 +38,12 @@ union dp_flow_item_l4 {
struct rte_flow_item_icmp6 icmp6;
};

#ifdef ENABLE_PF1_PROXY
static const struct rte_flow_item_ethdev dp_flow_item_ethdev_mask = {
.port_id = 0xffff,
};
#endif

static const struct rte_flow_item_eth dp_flow_item_eth_mask = {
.hdr.ether_type = 0xffff,
};
@@ -62,6 +68,18 @@ static const struct rte_flow_item_ipv6 dp_flow_item_ipv6_dst_mask = {
.hdr.dst_addr = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
.hdr.proto = 0xff,
};
#ifdef ENABLE_VIRTSVC
static const struct rte_flow_item_ipv6 dp_flow_item_ipv6_src_dst_mask = {
.hdr.src_addr = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
.hdr.dst_addr = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
.hdr.proto = 0xff,
};
#endif
#ifdef ENABLE_PF1_PROXY
static const struct rte_flow_item_ipv6 dp_flow_item_ipv6_dst_only_mask = {
.hdr.dst_addr = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
};
#endif

static const struct rte_flow_item_ipv4 dp_flow_item_ipv4_dst_mask = {
.hdr.dst_addr = 0xffffffff,