I'm trying to get the Grafana integration working and I'm getting a little stuck. I see that it requires the OnDemand Clusters dashboard, so I have that installed and have been working to get it functional. I have a Prometheus configuration set up, and that piece looks correct: the nodes are exporting data and Prometheus is storing it. I configured the prometheus.yml file with a scrape job for each node (this wasn't documented anywhere; I figured it out by looking at the variables in the Grafana dashboard). Also, the relabel_configs example in the documentation has a value in quotes, and Prometheus (2.27.1) didn't like that, but taking the quotes out made it work.

Now the CPU Load and Memory Usage graphs are loading, but CPU Usage has no data: it's querying node_cpu_load_system, which isn't a metric my Prometheus serves. It has node_cpu_seconds_total in various modes, but not that particular metric. Is this an incompatibility with my version of node exporter? (I have version 1.1.2.)

Also, for the Moab graphs: I'm using Slurm, and there's no info in the documentation anywhere on what these panels are looking for. I'm sure I can make them work for Slurm, but what Prometheus config is in use, or what collection should be set up, to get valid data?

Lastly, the Active Jobs dashboard in OOD (2.0.8) does not show the integrated graphs. If I expand a job I get a blank screen with the job info and a Detailed Metrics link. If I click on that link, I get the Grafana page with the data.

I know that was a lot, but thanks for any help you can offer.

I just pushed a new version of the OnDemand Clusters dashboard with Slurm dashboards instead of Moab, using NVIDIA's DCGM exporter. The SLURM exporter we use is a fork with major modifications; for the OnDemand Clusters dashboard the upstream repo may work, but I can't guarantee that, since we rely on the heavily modified fork.

Here is an example config for the cgroup exporter that filters out process and Go metrics, since we run this on ~1400 compute nodes and don't really care about those metrics:

```yaml
- job_name: cgroup
  scrape_interval: 1m
  file_sd_configs:
    - files:
        - "/etc/prometheus/file_sd_config.d/cgroup_*.yaml"
  metric_relabel_configs:
    # drop the exporter's own Go/process/promhttp metrics
    - source_labels: [__name__]
      regex: "^(go|process|promhttp)_.*"
      action: drop
```

We use Puppet to generate the actual scrape target configs; here is an example:

```yaml
# cat /etc/prometheus/file_sd_config.d/cgroup_cgroup-p0001.yaml
# this file is managed by puppet changes will be overwritten
- targets:
    - "p0001.example.edu:9306"  # target address illustrative
  labels:
    host: "p0001"
    cluster: "pitzer"
    role: "compute"
```

If you put host in the scrape config, you don't need the metric relabeling logic to generate the host label. Most of our exporters follow this pattern for generating the scrape configs.

All our scrapes are 1 minute except GPFS, which is 3 minutes, and we drop more things with GPFS to focus on exactly the metrics we care about:

```yaml
- job_name: gpfs
  scrape_interval: 3m
  file_sd_configs:
    - files:
        - "/etc/prometheus/file_sd_config.d/gpfs_*.yaml"
  metric_relabel_configs:
    # keep only the GPFS status metrics plus the exporter's own
    # error/duration metrics for the mmhealth, mount, and verbs collectors
    - source_labels: [__name__, collector]
      regex: "gpfs_(mount|health|verbs)_status;.*|gpfs_exporter_(collect_error|collector_duration_seconds);(mmhealth|mount|verbs)"
      action: keep
```

For the CPU load we use record rules to speed up the loading, since our nodes have anywhere from 28 to 96 cores, and per-core metrics take a long time to load when querying whole clusters:

```yaml
groups:
  - name: cpu
    rules:
      # record names and the exact selector/range are illustrative
      - record: node:cpus:count
        expr: count by (host,cluster,role) (node_cpu_info)
      - record: node:cpu_load:avg
        expr: avg by (host,cluster,role) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))
```
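For contrast with carrying host in the scrape targets, as in the Puppet-generated file above, here is a minimal sketch of deriving it with relabel_configs at scrape time, which is roughly what the documented example does. The address, port, and regex are placeholders, not values taken from the documentation:

```yaml
scrape_configs:
  - job_name: node
    scrape_interval: 1m
    static_configs:
      - targets:
          - "p0001.example.edu:9100"   # node_exporter's default port
    relabel_configs:
      # derive a short "host" label from the scrape address so the
      # dashboard's $host variable has something to match on
      - source_labels: [__address__]
        regex: "([^.:]+).*"
        target_label: host
        replacement: "$1"
```

Either way, the end result is the same: every series carries the host label the dashboards filter on.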
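As a usage note on those record rules: a dashboard panel can then select the pre-computed series instead of aggregating per-core data at render time. Assuming the illustrative record names above, the two queries compare like this:

```
# what a CPU Load panel would otherwise evaluate on every render
# (expensive across hundreds of nodes at up to 96 cores each):
avg by (host,cluster,role) (irate(node_cpu_seconds_total{cluster="pitzer",mode!="idle"}[5m]))

# with the record rule in place, the panel just reads:
node:cpu_load:avg{cluster="pitzer"}
```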
I did this so there isn't a jobid lookup as well as hiding the field, mostly because at OSC we don't want users getting a form where they can type in other people's job IDs. It's really just historical how we handle users looking at data from other users' jobs. The actual jobid= field in the URL query parameters still works.

There are a few possible reasons the job viewer panels aren't showing up; one could be an incorrect cluster config inside OnDemand. The Grafana config OSC uses lives under custom: in /etc/ood/config/clusters.d/pitzer.yml. The "panels" ID values must match what you have in Grafana, otherwise the direct link won't work. Can you share yours?
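For reference, here is the documented shape of that custom: grafana: block in an OnDemand cluster config. Every value below is a placeholder rather than OSC's actual settings; in particular, the dashboard uid and the panel IDs have to be read out of your own Grafana instance:

```yaml
# /etc/ood/config/clusters.d/pitzer.yml (sketch; values are placeholders)
v2:
  metadata:
    title: "Pitzer"
  custom:
    grafana:
      host: "https://grafana.example.edu"
      orgId: 1
      dashboard:
        name: "ondemand-clusters"
        uid: "aaba6Ahbauquag"   # the dashboard UID from its Grafana URL
        panels:                 # Grafana panel IDs, specific to your dashboard
          cpu: 20
          memory: 24
      labels:                   # Prometheus label names OOD fills in
        cluster: "cluster"
        host: "host"
        jobid: "jobid"
```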
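To make the last two replies concrete: with a config like that, the Detailed Metrics link OnDemand generates is an ordinary Grafana dashboard URL with the job ID passed as the jobid dashboard variable, and the embedded per-job graphs fetch single panels by the configured panel IDs. Both URLs below are purely illustrative:

```
# full dashboard (the "Detailed Metrics" link):
https://grafana.example.edu/d/aaba6Ahbauquag/ondemand-clusters?orgId=1&var-cluster=pitzer&var-host=p0001&var-jobid=1234567

# a single embedded panel (what the per-job graphs render):
https://grafana.example.edu/d-solo/aaba6Ahbauquag/ondemand-clusters?orgId=1&var-cluster=pitzer&var-host=p0001&var-jobid=1234567&panelId=20
```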