DevOps.dev

Devops.dev is a community of DevOps enthusiasts sharing insight, stories, and the latest…

Monitor Nvidia Jetson GPU

We run many mobile edge computing (MEC) devices equipped with NVIDIA Jetson GPUs, and we want to monitor their GPU status with Prometheus and Grafana.

We did this in three steps:

  1. Build a custom Jetson GPU exporter to collect the GPU status.
  2. Have Prometheus scrape the metrics from the GPU exporter and store them in its TSDB.
  3. Create a custom Grafana dashboard to show the metrics stored in Prometheus.

Build a custom Jetson GPU exporter

After some searching I found svcavallar/jetson-stats-grafana-dashboard on GitHub, but it did not meet our requirements:

  1. jetson-stats is now at 4.2.3, which differs substantially from older versions, so the fields read from the package have to change.
  2. We want information about the processes that are currently using the GPU.
  3. We want basic auth on the /metrics route for better security.
  4. We prefer a Docker image, in the cloud native way, over a systemd service on the host.

So we forked the repository to taozhi8833998/jetson-stats-grafana-dashboard and added the features above to the exporter:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import atexit
import os
from jtop import jtop, JtopException
from prometheus_client.core import InfoMetricFamily, GaugeMetricFamily, REGISTRY, CounterMetricFamily
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server
from base64 import b64encode

# Expected Authorization header, built once from the USERNAME and PASSWORD env vars.
# os.environ[...] raises KeyError at startup if either variable is missing.
credentials = os.environ["USERNAME"] + ':' + os.environ["PASSWORD"]
HTTP_AUTH = 'Basic ' + b64encode(credentials.encode('utf-8')).decode('utf-8')

class CustomCollector(object):
    def __init__(self):
        atexit.register(self.cleanup)
        self._jetson = jtop()
        self._jetson.start()

    def cleanup(self):
        print("Closing jetson-stats connection...")
        self._jetson.close()

    def collect(self):
        if self._jetson.ok():
            board = self._jetson.board
            info = board.get('info', {})
            hardware = board['hardware']

            #
            # Board info
            #
            i = InfoMetricFamily('gpu_info_board', 'Board sys info', labels=['board_info'])
            i.add_metric(['info'], {
                'machine': info['machine'] if 'machine' in info else hardware['Module'],
                'jetpack': info['jetpack'] if 'jetpack' in info else hardware['Jetpack'],
                'l4t': info['L4T'] if 'L4T' in info else hardware['L4T']
            })
            yield i

            i = InfoMetricFamily('gpu_info_hardware', 'Board hardware info', labels=['board_hw'])
            i.add_metric(['hardware'], {
                'codename': hardware.get('Codename', hardware.get('CODENAME', 'unknown')),
                'soc': hardware.get('SoC', hardware.get('SOC', 'unknown')),
                'module': hardware.get('P-Number', hardware.get('MODULE', 'unknown')),
                'board': hardware.get('699-level Part Number', hardware.get('BOARD', 'unknown')),
                'cuda_arch_bin': hardware.get('CUDA Arch BIN', hardware.get('CUDA_ARCH_BIN', 'unknown')),
                'serial_number': hardware.get('Serial Number', hardware.get('SERIAL_NUMBER', 'unknown')),
            })
            yield i

            #
            # NV power mode
            #
            i = InfoMetricFamily('gpu_nvpmode', 'NV power mode', labels=['nvpmode'])
            i.add_metric(['mode'], {'mode': self._jetson.nvpmodel.name})
            yield i

            #
            # System uptime
            #
            g = GaugeMetricFamily('gpu_uptime', 'System uptime', labels=['uptime'])
            days = self._jetson.uptime.days
            seconds = self._jetson.uptime.seconds
            hours = seconds//3600
            minutes = (seconds//60) % 60
            g.add_metric(['days'], days)
            g.add_metric(['hours'], hours)
            g.add_metric(['minutes'], minutes)
            yield g

            #
            # CPU usage (cores that are not reported count as 0)
            #
            stats = self._jetson.stats
            g = GaugeMetricFamily('gpu_usage_cpu', 'CPU % schedutil', labels=['cpu'])
            for n in range(1, 9):
                g.add_metric(['cpu_%d' % n], stats.get('CPU%d' % n, 0))
            yield g

            #
            # GPU usage
            #
            g = GaugeMetricFamily('gpu_usage_gpu', 'GPU % schedutil', labels=['gpu'])
            g.add_metric(['val'], self._jetson.stats['GPU'])
            yield g

            #
            # Fan usage
            #
            g = GaugeMetricFamily('gpu_usage_fan', 'Fan usage', labels=['fan'])
            g.add_metric(['speed'], self._jetson.fan.get('speed', self._jetson.fan.get('pwmfan', {'speed': [0]})['speed'][0]))
            yield g

            #
            # Sensor temperatures
            #
            g = GaugeMetricFamily('gpu_temperatures', 'Sensor temperatures', labels=['temperature'])
            temperature = self._jetson.temperature
            keys = ['AO', 'GPU', 'Tdiode', 'AUX', 'CPU', 'thermal', 'Tboard']
            for key in keys:
                if key in temperature:
                    g.add_metric([key.lower()], temperature[key]['temp'] if isinstance(temperature[key], dict) else temperature.get(key, 0))
            yield g

            #
            # Power
            #
            g = GaugeMetricFamily('gpu_usage_power', 'Power usage', labels=['power'])
            power = self._jetson.power
            if isinstance(power, dict):
                rail = power['rail']
                g.add_metric(['cv'], rail['VDD_CPU_CV']['avg'] if 'VDD_CPU_CV' in rail else rail.get('CV', {'avg': 0}).get('avg'))
                g.add_metric(['gpu'], rail['VDD_GPU_SOC']['avg'] if 'VDD_GPU_SOC' in rail else rail.get('GPU', {'avg': 0}).get('avg'))
                g.add_metric(['sys5v'], rail['VIN_SYS_5V0']['avg'] if 'VIN_SYS_5V0' in rail else rail.get('SYS5V', {'avg': 0}).get('avg'))
            if isinstance(power, tuple):
                g.add_metric(['cv'], power[1]['CV']['cur'] if 'CV' in power[1] else 0)
                g.add_metric(['gpu'], power[1]['GPU']['cur'] if 'GPU' in power[1] else 0)
                g.add_metric(['sys5v'], power[1]['SYS5V']['cur'] if 'SYS5V' in power[1] else 0)
            yield g

            #
            # Processes
            #
            try:
                processes = self._jetson.processes
                i = InfoMetricFamily('gpu_processes', 'Process usage', labels=['process'])
                for proc in processes:
                    i.add_metric(['info'], {
                        'pid': str(proc[0]),
                        'user': proc[1],
                        'gpu': proc[2],
                        'type': proc[3],
                        'priority': str(proc[4]),
                        'state': proc[5],
                        'cpu': str(proc[6]),
                        'memory': str(proc[7]),
                        'gpu_memory': str(proc[8]),
                        'name': proc[9],
                    })
                yield i
            except AttributeError:
                # The jtop object has no 'processes' attribute; skip the process metrics
                pass

class Auth():
    def __init__(self, app):
        self._app = app

    def __call__(self, environ, start_response):
        if self._authenticated(environ.get('HTTP_AUTHORIZATION')):
            return self._app(environ, start_response)
        return self._login(environ, start_response)

    def _authenticated(self, header):
        if not header:
            return False
        return header == HTTP_AUTH

    def _login(self, environ, start_response):
        start_response('401 Authentication Required',
                       [('Content-Type', 'text/html'),
                        ('WWW-Authenticate', 'Basic realm="Login"')])
        return [b'Login']

if __name__ == '__main__':
    port = os.environ.get('PORT', 9998)
    REGISTRY.register(CustomCollector())
    app = make_wsgi_app()
    httpd = make_server('', int(port), Auth(app))
    print('Serving on port: ', port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Goodbye!')
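
The chained .get() calls in the exporter are how it copes with field names that changed between jetson-stats releases. As a minimal sketch of the same idea, the helper below returns the value of the first key that is present. The name first_present and the sample dictionaries are hypothetical; they are not part of jetson-stats or of our exporter.

```python
def first_present(mapping, keys, default="unknown"):
    """Return the value of the first key from `keys` that exists in `mapping`."""
    for key in keys:
        if key in mapping:
            return mapping[key]
    return default


# Hypothetical sample data, shaped like a newer and an older hardware dict.
new_style = {"Codename": "Galen", "SoC": "tegra194"}
old_style = {"CODENAME": "galen", "SOC": "tegra194"}

print(first_present(new_style, ["Codename", "CODENAME"]))      # Galen
print(first_present(old_style, ["Codename", "CODENAME"]))      # galen
print(first_present({}, ["Serial Number", "SERIAL_NUMBER"]))   # unknown
```

Listing the newest field name first keeps the lookup order the same as in the nested .get() calls.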

The /metrics route is protected with basic auth, using the username and password read from the USERNAME and PASSWORD environment variables.
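
Comparing the header with == works, but a constant-time comparison is a common hardening step for credential checks. The sketch below shows one way to do that with the standard library only. The names expected_header and is_authorized are illustrative and do not come from our exporter.

```python
import hmac
from base64 import b64encode


def expected_header(username, password):
    """Build the Authorization header value for HTTP basic auth."""
    token = b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def is_authorized(header, expected):
    """Compare the received header with the expected one in constant time."""
    if not header:
        return False
    return hmac.compare_digest(header.encode("utf-8"), expected.encode("utf-8"))


expected = expected_header("user", "pass")
print(is_authorized("Basic dXNlcjpwYXNz", expected))  # True
print(is_authorized(None, expected))                  # False
```

In the exporter this check would sit where _authenticated compares the header with HTTP_AUTH.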

With valid credentials, a request to /metrics returns the collected metrics.

Add a Dockerfile to build the Jetson GPU exporter image:

FROM python:3-buster
RUN pip install --upgrade pip && pip install -U jetson-stats prometheus-client
RUN mkdir -p /root
COPY jetson_stats_prometheus_collector.py /root/jetson_stats_prometheus_collector.py
WORKDIR /root
USER root
ENTRYPOINT ["python3", "/root/jetson_stats_prometheus_collector.py"]

Prometheus scrapes the metrics

We want Prometheus to find its targets through service discovery: we add the exporter addresses to a Kubernetes Endpoints object through the Kubernetes API, and Prometheus scrapes the metrics from every endpoint behind the service.

In prometheus.yml, we have the following job config:


- job_name: 'edge-node-exporter'
  basic_auth:
    username: username
    password: password
  static_configs:
    - targets: ['asteria-edge-node-exporter']
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
          - citybrain
      selectors:
        - role: endpoints
          label: "name=asteria-edge-node-exporter"
  relabel_configs:
    - source_labels: [__param_target]
      target_label: instance
    - source_labels: [__address__]
      regex: '(.+):(.*)'
      replacement: '${1}'
      target_label: ip
      action: replace
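
The relabel rule on __address__ copies the host part of the address into an ip label. Prometheus anchors relabel regexes at both ends, so Python's re.fullmatch is a close stand-in for trying the pattern out. The helper name relabel_ip is hypothetical, and Python's re module is not the RE2 engine that Prometheus uses, so treat this as an approximation:

```python
import re


def relabel_ip(address, regex=r"(.+):(.*)"):
    """Mimic replacement '${1}': return the first capture group.

    Returns None when the regex does not match, which corresponds to
    Prometheus leaving the target label untouched.
    """
    match = re.fullmatch(regex, address)
    if match is None:
        return None
    return match.group(1)


print(relabel_ip("30.232.136.90:9998"))  # 30.232.136.90
print(relabel_ip("no-port-here"))        # None
```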

In Kubernetes, we have an asteria-edge-node-exporter Service that has no selector field, so Kubernetes does not create its endpoints automatically and we maintain them ourselves.


apiVersion: v1
kind: Service
metadata:
  name: asteria-edge-node-exporter
  namespace: citybrain
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  type: ClusterIP

Besides the Service, we have a matching Endpoints object:

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    name: asteria-edge-node-exporter
  name: asteria-edge-node-exporter
  namespace: citybrain
subsets:
  - addresses:
      - ip: 30.232.136.90
      - ip: 30.232.140.65
    ports:
      - name: gpu-exporter
        port: 9998
        protocol: TCP

With this in place, Prometheus requests http://30.232.136.90:9998/metrics and http://30.232.140.65:9998/metrics automatically.

When we deploy a new MEC, we add its IP to the Endpoints object, and Prometheus starts scraping it automatically.
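
Adding a device then comes down to regenerating the Endpoints object with one more address. The sketch below builds the manifest as a plain dictionary and prints it as JSON. The function name build_endpoints and the third IP address are made up for illustration; the other names and values mirror the Endpoints object in this article.

```python
import json


def build_endpoints(ips, name="asteria-edge-node-exporter",
                    namespace="citybrain", port=9998):
    """Return an Endpoints manifest that lists one address per exporter IP."""
    return {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"name": name},
        },
        "subsets": [{
            "addresses": [{"ip": ip} for ip in ips],
            "ports": [{"name": "gpu-exporter", "port": port, "protocol": "TCP"}],
        }],
    }


# 30.232.141.10 is a hypothetical new MEC used only for this example.
manifest = build_endpoints(["30.232.136.90", "30.232.140.65", "30.232.141.10"])
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest can be applied as it is, for example by piping it to kubectl apply -f -.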

Show in Grafana

Now that the GPU metrics are in Prometheus, we can add Prometheus as a data source in Grafana and build a dashboard on top of them. The final dashboard is the one shown at the beginning of this article.
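
Before building panels, it can help to check that a metric is really queryable. A Grafana panel would use a PromQL expression such as gpu_usage_gpu, and the same expression can be sent to the Prometheus HTTP API at /api/v1/query. The sketch below only builds the request URL; the base URL prometheus.example:9090 and the helper name instant_query_url are assumptions, not values from our setup.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL of a Prometheus instant query for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# The job label matches the job_name used in the scrape config.
url = instant_query_url(
    "http://prometheus.example:9090",  # hypothetical Prometheus address
    'gpu_usage_gpu{job="edge-node-exporter"}',
)
print(url)
```

Opening the printed URL in a browser or with any HTTP client should return a JSON document with one sample per scraped device.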

Conclusions

I am not familiar with Python, so writing the GPU exporter took me a lot of time. With cloud native service discovery, new endpoints are picked up by Prometheus automatically. In the end, the Grafana dashboard works well.

Published in DevOps.dev
