Because the goal is a hands-off system, anything short of a hard crash might go undetected for quite some time. Therefore the Raspberry Pi (4B) will be monitored, with alerts set for certain events.
For this purpose Telegraf (gather data), InfluxDB (store data) and Grafana (visualize data & alerts) will be used. They were chosen because they seem popular, rather mature, feature-rich and stable. Alternatives exist for each component. All of them are free and can be self-hosted. For installation see my previous post.
Configure InfluxDB
To interact with InfluxDB (2) I strongly advise using its excellent web UI (http://[IP RPi]:8086).
Create a new bucket with a retention policy of 1d (or as desired). I have called it “telegraf”.
Create at least one read/write access token, or two if you want to split Telegraf and Grafana.
In Telegraf's config file (/etc/telegraf/telegraf.conf), comment out the output plugin [[outputs.influxdb]] (enabled by default) and enable [[outputs.influxdb_v2]]. Fill in the urls (http://127.0.0.1:8086), token (see the InfluxDB web UI), organization and bucket (e.g. “telegraf”).
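As a rough sketch of what the enabled output section could look like, the helper below writes it to a separate file instead of editing the main config. The function name write_influx_output and the file name outputs.conf are made up for this example, and the token and organization are placeholders you have to supply yourself. The default [[outputs.influxdb]] section still has to be commented out in the main config.

```shell
#!/bin/sh
# Hypothetical helper (not part of Telegraf): writes an
# [[outputs.influxdb_v2]] section to <dir>/outputs.conf.
# $1: target directory   $2: token   $3: organization   $4: bucket
write_influx_output() {
    cat > "$1/outputs.conf" <<EOF
[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]
  token = "$2"
  organization = "$3"
  bucket = "$4"
EOF
}

# Example call (as root, values are placeholders):
#   write_influx_output /etc/telegraf/telegraf.d "<token>" "<organization>" telegraf
```

Keeping the output in its own file under /etc/telegraf/telegraf.d/ is only one option; putting the same section directly in the main config file works just as well.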
Telegraf permissions
This section only applies if you want to run custom scripts/programs to gather data.
Telegraf runs as a dedicated user (telegraf) which does not have sudo rights, nor access to common folders (e.g. home folders of other users). If you want to execute custom scripts, I advise you to mark the script as executable (chmod +x script.sh) and move it to /var/lib/telegraf, the telegraf user's home folder. That way Telegraf will always have access. To debug a script, run it manually as the telegraf user: sudo -u telegraf /full/path/to/script.sh
To let Telegraf execute a script which requires elevated privileges (sudo), add a rule for the script to /etc/sudoers, e.g. telegraf ALL=(ALL) NOPASSWD: /path/to/script.sh (it must go after all other rules, since the last matching rule takes precedence). Note: this is not secure; anyone who can change the script effectively has root privileges. For our purposes elevated privileges will not be required.
GPU
For the telegraf user to be able to gather information about the GPU, one first has to add telegraf to the video group: sudo usermod -aG video telegraf (without -a, usermod replaces the user's existing supplementary groups instead of appending to them). To test if it works use sudo -u telegraf /opt/vc/bin/vcgencmd measure_temp.
If that fails, check the permissions of /dev/vchiq, which might be incorrect.
sudo chgrp video /dev/vchiq
sudo chmod 0660 /dev/vchiq
ls -al /dev/vchiq
crw-rw---- 1 root video 245, 0 Jul 25 17:10 /dev/vchiq
vcgencmd
The Raspberry Pi comes with a tool, /opt/vc/bin/vcgencmd, to query various system parameters; it is intended for development. Examples are voltage, temperature, frequency and throttled state. Some parsing/formatting is required; this can be done through a shell script and fed to Telegraf. Fortunately fivesixzero and robcowart had the same thought and have already created such scripts.
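To give an idea of the parsing involved, here is a minimal sketch for a single value. It turns the raw output of vcgencmd measure_temp (for example temp=48.3'C) into one line of InfluxDB line protocol. The function name format_soc_temp and the host name and timestamp in the example are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: format one vcgencmd temperature reading as
# InfluxDB line protocol (measurement,tag=value field=value timestamp).
# $1: raw vcgencmd output   $2: host tag   $3: timestamp in nanoseconds
format_soc_temp() {
    # Strip the "temp=" prefix and the "'C" suffix, leaving only the number
    temp=$(printf '%s\n' "$1" | sed -e "s/temp=//" -e "s/'C//")
    printf 'vcgencmd,host=%s soc_temp=%s %s\n' "$2" "$temp" "$3"
}

format_soc_temp "temp=48.3'C" raspberrypi 1627228800000000000
# prints: vcgencmd,host=raspberrypi soc_temp=48.3 1627228800000000000
```

On a real system the first argument would come from /opt/vc/bin/vcgencmd measure_temp and the timestamp from date +%s%N.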
On Arch Linux ARM one does not need sudo to run /opt/vc/bin/vcgencmd, so I made slight modifications and combined both scripts. The only thing one has to do is mark it as executable (chmod +x vcgencmd.sh) and move it to /var/lib/telegraf; that’s it.
#!/bin/bash
# ==========================================================================================
# Read out all useful information from vcgencmd
# This could be converted into a proper telegraf plug-in
#
# Uses influx output format. Comment/uncomment what you want to measure.
#
# Combination of
# - https://github.com/robcowart/raspberry_pi_stats
# - https://github.com/fivesixzero/telegraf-pi-bash
#
# See https://www.raspberrypi.org/documentation/raspbian/applications/vcgencmd.md
# ==========================================================================================

host=$(cat /proc/sys/kernel/hostname)
data="vcgencmd,host=${host} "

# ---------------------------------------------------------------
## telegraf-pi-get-throttled.sh
#
# Get and parse RPi throttled state to produce single Grok-ready string.

### Data Acquisition
# Poll the VideoCore mailbox using vcgencmd to get the throttle state, then use sed to grab just the hex digits.
# dc only accepts upper-case hex digits, hence the tr.
throttle_state_hex=$(/opt/vc/bin/vcgencmd get_throttled | sed -e 's/.*=0x\([0-9a-fA-F]*\)/\1/' | tr 'a-f' 'A-F')

### Example vars for testing
## Debug: uv=1 uvb=1 afc=0 afcb=0 tr=1 trb=1 str=0 strb=0
# throttle_state_hex=50005
## Debug: uv=0 uvb=1 afc=0 afcb=0 tr=0 trb=1 str=0 strb=0
# throttle_state_hex=50000
## Debug: uv=1 uvb=1 afc=1 afcb=1 tr=1 trb=1 str=1 strb=1
# throttle_state_hex=F000F
## Debug: uv=0 uvb=1 afc=0 afcb=1 tr=0 trb=1 str=1 strb=1
# throttle_state_hex=F0008
## Debug: uv=1 uvb=1 afc=0 afcb=1 tr=0 trb=1 str=0 strb=1
# throttle_state_hex=F0001

### Command Examples
## dc: Convert hex to dec or binary (useful for get_throttled output)
# Example: dc -e 16i2o50005p
# Syntax: dc -e <base_in>i<base_out>o<input>p
# i = set input base to base_in (previous value on the stack)
# o = set output base to base_out (previous value on the stack)
# p = print conversion output with newline to stdout
get_base_conversion () {
    args=${1}i${2}o${3}p
    binary=$(dc -e "$args")
    # Left-pad to 20 digits as a string; printf %020d overflows once bit 19 is set
    printf '%20s' "$binary" | tr ' ' '0'
}

throttle_state_bin=$(get_base_conversion 16 2 "$throttle_state_hex")
binpattern="[0-1]{20}"
if [[ $throttle_state_bin =~ $binpattern ]]; then
    ## Reference: get_throttled bit flags
    # https://github.com/raspberrypi/firmware/commit/404dfef3b364b4533f70659eafdcefa3b68cd7ae#commitcomment-31620480
    #
    # NOTE: The [index] numbers are string positions, reversed compared to the bit numbers of vcgencmd.
    # Bit meanings follow the Raspberry Pi documentation; under-voltage and throttled were swapped before.
    #
    # Since Boot           Now
    #  |                    |
    # 0101 0000 0000 0000 0101
    # ||||                ||||_ [19] under-voltage (bit 0)
    # ||||                |||_ [18] arm frequency capped (bit 1)
    # ||||                ||_ [17] throttled (bit 2)
    # ||||                |_ [16] soft temperature limit active (bit 3)
    # ||||_ [3] under-voltage has occurred since last reboot (bit 16)
    # |||_ [2] arm frequency capped since last reboot (bit 17)
    # ||_ [1] throttling has occurred since last reboot (bit 18)
    # |_ [0] soft temperature limit reached since last reboot (bit 19)
    strb=${throttle_state_bin:0:1}
    trb=${throttle_state_bin:1:1}
    afcb=${throttle_state_bin:2:1}
    uvb=${throttle_state_bin:3:1}
    str=${throttle_state_bin:16:1}
    tr=${throttle_state_bin:17:1}
    afc=${throttle_state_bin:18:1}
    uv=${throttle_state_bin:19:1}
    # Note: first field, do not start with a ,
    data+="under_volted=${uv:-0}i,under_volted_boot=${uvb:-0}i,arm_freq_capped=${afc:-0}i,arm_freq_capped_boot=${afcb:-0}i,throttled=${tr:-0}i,throttled_boot=${trb:-0}i,soft_temp_limit=${str:-0}i,soft_temp_limit_boot=${strb:-0}i"
fi

# ---------------------------------------------------------------
# SOC temp
soc_temp=$(/opt/vc/bin/vcgencmd measure_temp | sed -e "s/temp=//" -e "s/'C//")
data+=",soc_temp=${soc_temp}"

# Clock speeds
arm_f=$(/opt/vc/bin/vcgencmd measure_clock arm | sed -e "s/^.*=//")
data+=",arm_freq=${arm_f}i"
core_f=$(/opt/vc/bin/vcgencmd measure_clock core | sed -e "s/^.*=//")
data+=",core_freq=${core_f}i"
#h264_f=$(/opt/vc/bin/vcgencmd measure_clock h264 | sed -e "s/^.*=//")
#data+=",h264_freq=${h264_f}i"
#isp_f=$(/opt/vc/bin/vcgencmd measure_clock isp | sed -e "s/^.*=//")
#data+=",isp_freq=${isp_f}i"
#v3d_f=$(/opt/vc/bin/vcgencmd measure_clock v3d | sed -e "s/^.*=//")
#data+=",v3d_freq=${v3d_f}i"
uart_f=$(/opt/vc/bin/vcgencmd measure_clock uart | sed -e "s/^.*=//")
data+=",uart_freq=${uart_f}i"
#pwm_f=$(/opt/vc/bin/vcgencmd measure_clock pwm | sed -e "s/^.*=//")
#data+=",pwm_freq=${pwm_f}i"
#emmc_f=$(/opt/vc/bin/vcgencmd measure_clock emmc | sed -e "s/^.*=//")
#data+=",emmc_freq=${emmc_f}i"
#pixel_f=$(/opt/vc/bin/vcgencmd measure_clock pixel | sed -e "s/^.*=//")
#data+=",pixel_freq=${pixel_f}i"
#vec_f=$(/opt/vc/bin/vcgencmd measure_clock vec | sed -e "s/^.*=//")
#data+=",vec_freq=${vec_f}i"
#hdmi_f=$(/opt/vc/bin/vcgencmd measure_clock hdmi | sed -e "s/^.*=//")
#data+=",hdmi_freq=${hdmi_f}i"
#dpi_f=$(/opt/vc/bin/vcgencmd measure_clock dpi | sed -e "s/^.*=//")
#data+=",dpi_freq=${dpi_f}i"

# Voltages
core_v=$(/opt/vc/bin/vcgencmd measure_volts core | sed -e "s/volt=//" -e "s/0*V//")
data+=",core_volt=${core_v}"
#sdram_c_v=$(/opt/vc/bin/vcgencmd measure_volts sdram_c | sed -e "s/volt=//" -e "s/0*V//")
#data+=",sdram_c_volt=${sdram_c_v}"
#sdram_i_v=$(/opt/vc/bin/vcgencmd measure_volts sdram_i | sed -e "s/volt=//" -e "s/0*V//")
#data+=",sdram_i_volt=${sdram_i_v}"
#sdram_p_v=$(/opt/vc/bin/vcgencmd measure_volts sdram_p | sed -e "s/volt=//" -e "s/0*V//")
#data+=",sdram_p_volt=${sdram_p_v}"

# Memory
# Note: do not use get_mem arm on RPi 4 (> 1GB)
#arm_m=$(($(/opt/vc/bin/vcgencmd get_mem arm | sed -e "s/arm=//" -e "s/M//")*1000000))
#data+=",arm_mem=${arm_m}i"
#gpu_m=$(($(/opt/vc/bin/vcgencmd get_mem gpu | sed -e "s/gpu=//" -e "s/M//")*1000000))
#data+=",gpu_mem=${gpu_m}i"
#malloc_total_m=$(($(/opt/vc/bin/vcgencmd get_mem malloc_total | sed -e "s/malloc_total=//" -e "s/M//")*1000000))
#data+=",malloc_total_mem=${malloc_total_m}i"
#malloc_m=$(($(/opt/vc/bin/vcgencmd get_mem malloc | sed -e "s/malloc=//" -e "s/M//")*1000000))
#data+=",malloc_mem=${malloc_m}i"
#reloc_total_m=$(($(/opt/vc/bin/vcgencmd get_mem reloc_total | sed -e "s/reloc_total=//" -e "s/M//")*1000000))
#data+=",reloc_total_mem=${reloc_total_m}i"
#reloc_m=$(($(/opt/vc/bin/vcgencmd get_mem reloc | sed -e "s/reloc=//" -e "s/M//")*1000000))
#data+=",reloc_mem=${reloc_m}i"

# Config
# Note: this data is static; there are more options available
#config_arm_f=$(($(/opt/vc/bin/vcgencmd get_config arm_freq | sed -e "s/arm_freq=//")*1000000))
#data+=",config_arm_freq=${config_arm_f}i"
#config_core_f=$(($(/opt/vc/bin/vcgencmd get_config core_freq | sed -e "s/core_freq=//")*1000000))
#data+=",config_core_freq=${config_core_f}i"
#config_gpu_f=$(($(/opt/vc/bin/vcgencmd get_config gpu_freq | sed -e "s/gpu_freq=//")*1000000))
#data+=",config_gpu_freq=${config_gpu_f}i"

# Out Of Memory events in VC4 memory space
# Note: lifetime oom required is skipped
#oom_c=$(/opt/vc/bin/vcgencmd mem_oom | grep "oom events" | sed -e "s/^.*: //")
#data+=",oom_count=${oom_c}i"
#oom_t=$(/opt/vc/bin/vcgencmd mem_oom | grep "total time" | sed -e "s/^.*: //" -e "s/ ms//")
#data+=",oom_total_time=${oom_t}i"
#oom_max_t=$(/opt/vc/bin/vcgencmd mem_oom | grep "max time" | sed -e "s/^.*: //" -e "s/ ms//")
#data+=",oom_max_time=${oom_max_t}i"

# Relocatable memory allocator on the VC4
#mem_reloc_alloc_fail_c=$(/opt/vc/bin/vcgencmd mem_reloc_stats | grep "alloc failures" | sed -e "s/^.*:[^0-9]*//")
#data+=",mem_reloc_alloc_fail_c=${mem_reloc_alloc_fail_c}i"
#mem_reloc_compact_c=$(/opt/vc/bin/vcgencmd mem_reloc_stats | grep "compactions" | sed -e "s/^.*:[^0-9]*//")
#data+=",mem_reloc_compact_c=${mem_reloc_compact_c}i"
#mem_reloc_leg_blk_fail_c=$(/opt/vc/bin/vcgencmd mem_reloc_stats | grep "legacy block fails" | sed -e "s/^.*:[^0-9]*//")
#data+=",mem_reloc_leg_blk_fail_c=${mem_reloc_leg_blk_fail_c}i"

# Ring oscillator
# Indicates how slow/fast the silicon of this particular RPi is.
# https://www.raspberrypi.org/forums/viewtopic.php?p=582078
#osc_output=$(/opt/vc/bin/vcgencmd read_ring_osc)

time_stamp=$(date +%s%N)
data+=" ${time_stamp}"
echo "${data}"
As the original author already noted, one could write a proper plug-in for Telegraf instead of using a shell script.
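Short of a full plug-in, the throttled flags can also be decoded with plain shell arithmetic, without dc or string slicing. The sketch below is one possible approach, not the script's method. The function name decode_throttled is made up; the bit positions are those listed in the Raspberry Pi documentation for get_throttled (bits 0 to 3: under-voltage, ARM frequency capped, throttled, soft temperature limit; bits 16 to 19: the same events since boot). The field names match those used in the script.

```shell
#!/bin/sh
# Hypothetical helper: decode a get_throttled value into line protocol fields.
# $1: hex value with or without 0x prefix, e.g. 0x50005 (either case works)
decode_throttled() {
    v=$(( 0x${1#0x} ))
    printf 'under_volted=%di,under_volted_boot=%di,' \
        $(( $v & 1 ))        $(( ($v >> 16) & 1 ))
    printf 'arm_freq_capped=%di,arm_freq_capped_boot=%di,' \
        $(( ($v >> 1) & 1 )) $(( ($v >> 17) & 1 ))
    printf 'throttled=%di,throttled_boot=%di,' \
        $(( ($v >> 2) & 1 )) $(( ($v >> 18) & 1 ))
    printf 'soft_temp_limit=%di,soft_temp_limit_boot=%di\n' \
        $(( ($v >> 3) & 1 )) $(( ($v >> 19) & 1 ))
}

decode_throttled 0x50005
# prints: under_volted=1i,under_volted_boot=1i,arm_freq_capped=0i,arm_freq_capped_boot=0i,throttled=1i,throttled_boot=1i,soft_temp_limit=0i,soft_temp_limit_boot=0i
```

On the Pi itself the argument would be the part after the = sign in the output of /opt/vc/bin/vcgencmd get_throttled.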
Telegraf input configuration
Telegraf defines data sources as “inputs”. It has quite a lot of default input plugins available, for example cpu, memory and network; they all work out of the box. To enable them, uncomment or add them in Telegraf’s config file. The config file is quite large by itself, therefore it is advised to create a separate conf file (e.g. inputs.conf) and add it to /etc/telegraf/telegraf.d/. This way you retain a better overview of your Telegraf config; some people even prefer a separate config per input plugin. Technically it does not matter whether or how you split them up: they are effectively all appended.
Some input plugins might require other software to be installed; for example [[inputs.hddtemp]] requires the hddtemp daemon to be running (install the hddtemp package). Note: not all drives have a temperature sensor, YMMV.
Configuring plug-ins is very straightforward; only for the “topk” processes plugin is some additional configuration done to properly format the values (copied from akavel). Other than that everything should be self-explanatory (if not, have a look at the official documentation).
# --------------------------------------------------- Defaults ---------------------------------------------
# Read CPU metrics
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false

# Gather metrics from disks
[[inputs.disk]]
  interval = "5m"
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]

# Read HDD temp metric from the hddtemp daemon
#[[inputs.hddtemp]]

# Read metrics about system load & uptime
# (listed only once; the second, identical [[inputs.system]] entry was dropped)
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  fielddrop = ["uptime_format"]

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
#[[inputs.swap]]
  # no configuration

# Read metrics about the IRQ
[[inputs.interrupts]]
  cpu_as_tag = false

#[[inputs.linux_sysctl_fs]]

# Gather metrics about network interfaces
[[inputs.net]]
  ignore_protocol_stats = false

# Collect TCP connections state and UDP socket counts
[[inputs.netstat]]
  # no configuration

# Gather metrics from /proc/net/* files
#[[inputs.nstat]]

# ----------------------------------------------------- Top processes (default) ----------------------------------
[[inputs.procstat]]
  pattern = "."
  pid_tag = false
  pid_finder = "native"
  fieldpass = [
    "num_threads",
    "cpu_usage",
    "memory_rss",
  ]

[[processors.topk]]
  # see: https://github.com/influxdata/telegraf/blob/release-1.17/plugins/processors/topk/README.md
  namepass = ["*procstat*"]
  aggregation = "max"
  k = 10
  fields = [
    "num_threads",
    "cpu_usage",
    "memory_rss",
  ]

[[processors.regex]]
  namepass = ["*procstat*"]
  [[processors.regex.tags]]
    key = "process_name"
    pattern = "^(.{60}).*"
    replacement = "${1}..."

# ------------------------------------------------------ Raspberry Pi ---------------------------------------------
# Read RPi CPU temperature
[[inputs.exec]]
  commands = [ '''sed -e 's/^\([0-9]\{2\}\)\(.*\)$/\1.\2/' /sys/class/thermal/thermal_zone0/temp''' ]
  name_override = "sys"
  data_format = "grok"
  grok_patterns = ["%{NUMBER:thermal_zone0:float}"]

# Vcgencmd input
[[inputs.exec]]
  commands = ["/var/lib/telegraf/vcgencmd.sh"]
  timeout = "7s"
  data_format = "influx"
Remember to reload the telegraf service when the conf file has been changed.
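A note on the RPi CPU temperature input: /sys/class/thermal/thermal_zone0/temp reports millidegrees Celsius (e.g. 48312), and the sed expression simply inserts a decimal point after the first two digits. That only gives the right result for five-digit readings, i.e. between 10 and just under 100 °C. A minimal sketch of an alternative that divides by 1000 instead is shown below; the function name millideg_to_celsius is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: convert millidegrees Celsius on stdin to degrees.
# LC_ALL=C keeps the decimal separator a dot regardless of the system locale.
millideg_to_celsius() {
    LC_ALL=C awk '{ printf "%.3f\n", $1 / 1000 }'
}

echo 48312 | millideg_to_celsius
# prints: 48.312

# On the Pi itself:
#   millideg_to_celsius < /sys/class/thermal/thermal_zone0/temp
```

For the temperatures a Raspberry Pi normally sees both approaches give the same value, so this is a matter of taste rather than a required change.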
Configure Grafana
Its configuration is located at /etc/grafana/grafana.ini; however, creating and adjusting dashboards is done through its web interface: http://[Raspberry Pi IP]:4000/
To set up Grafana with InfluxDB as data source: in the web interface go to “Data Sources”, select “Add new Data Source” and find InfluxDB under “Timeseries Databases”. Set the query language to Flux, fill in the url (http://127.0.0.1:8086) and fill in the “InfluxDB Details” (you can find the token in the InfluxDB web UI).
Create your own dashboard or use one of the many community dashboards available, including mine (see bottom of this page). Dashboards can be imported as JSON or by Grafana id. Often when you load a dashboard for the first time some variables, like the data source, have to be configured. For more info see the official documentation.
To debug queries it is strongly advised to try them out in the InfluxDB explorer (see web UI).
If the dashboard is not shown 24/7, something might need your attention without you noticing. Grafana has a feature for this built in: alerts. There are many “notification channels” (e.g. Microsoft Teams, Slack, Discord, PagerDuty, …) which can be configured in the config file; I will use SMTP (email).
Of course Grafana runs on the Raspberry Pi itself, so if the RPi goes down grafana goes down as well. Classic case of “who is watching the watcher”.
Unfortunately there are some non-obvious limitations in Grafana’s alerting system: alerts can only be applied to graphs and the queries cannot use any variables. This issue has been raised since 2016 but so far no dice. They recently started working on it though. Let’s hope I can rewrite this section soon.
Using Grafana: create dedicated graph panels for every metric you want to set an alert on and do not use any variables in your queries. You can put them all in the same “row” and collapse it. While not ideal, it does work, has few side effects and does not require further attention once set up. If you get tsdb.HandleRequest() error time: invalid duration $interval, make sure there is no variable used in the query options.
Using InfluxDB dashboard: the InfluxDB web UI also allows one to create dashboards; it is very similar to Grafana. One can copy the queries verbatim. It also supports alerts; however, SMTP (email) as notification channel is not (yet?) supported.
Therefore I simply did not configure any alerts for the time being.
Gotchas
A few gotchas I have encountered while exploring Grafana:
One cannot adjust the refresh rate of individual panels. This would be useful to fine-tune performance and allow faster refresh rates for certain applicable metrics.
If you want 2 y-axes with different units do NOT set the “Unit” under “Standard options” since it will overwrite the individual axis settings. Unfortunately once set you can no longer clear it through the UI, you have to open the JSON file of the panel and set the standard unit to "". You can now set a unit per y-axis again.
When using Flux, disjoint graph series (show nothing if there is no data) seem broken. It should be enough to set Null values (under “Display”) to null; however, this does not seem to work. This affects the Processes graphs.
Result
Without further ado, here is the result. Of course you can always customize it further to your liking. To try it yourself, import the attached JSON below or the Grafana id 13982.
(the old version of this board, for InfluxDB 1.X, is available under Grafana id 13044 but is no longer maintained)
License: The text and content is licensed under CC BY-NC-SA 4.0.
All source code I wrote on this page is licensed under The Unlicense; do as you please, I'm not liable nor provide warranty.
Noticed an error in this post? Corrections are appreciated.