I performed my analysis by running a low-spec server with Prometheus and Grafana, scraping my Celestia Full Node, which had Node Exporter installed and running.
This is how I set my analysis stack up. Please feel free to skip to the actual analysis below.
Spec:
Celestia Full Node (run at home)
Distro: Ubuntu 22.04
Processors: 8
RAM: 17GB
Storage: 1TB SSD
Bandwidth: 250MB/s
VPS Analysis server:
Distro: Ubuntu 22.04
Processors: 1
RAM: 2GB
Storage: 40GB SSD
Bandwidth: 250MB/s
Steps to set up:
First, it’s worthwhile opening the relevant ports on each machine:
Analysis machine:
sudo ufw allow ssh
#Port for prometheus
sudo ufw allow 9090
#Port for grafana
sudo ufw allow 3000
sudo ufw enable
sudo ufw status
Full node machine:
# You don't want to be locked out of this server so best open ssh straight away for remote access.
sudo ufw allow ssh
# Port for node_exporter
sudo ufw allow 9100
# Port for exporting celestia metrics (covering all bases)
sudo ufw allow 4318
sudo ufw enable
sudo ufw status
Running Prometheus and Grafana on a dedicated analysis server.
Download Prometheus and Extract (prometheus and promtool binaries)
wget https://github.com/prometheus/prometheus/releases/download/v2.33.0/prometheus-2.33.0.linux-amd64.tar.gz
tar xvf prometheus-2.33.0.linux-amd64.tar.gz
Move the Prometheus binaries to /usr/local/bin
sudo cp prometheus-2.33.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.33.0.linux-amd64/promtool /usr/local/bin/
Grant permissions (this assumes the user is ubuntu):
sudo chown ubuntu:ubuntu /usr/local/bin/prometheus
sudo chown ubuntu:ubuntu /usr/local/bin/promtool
Check version installed:
prometheus --version
promtool --version
Expected output:
Create Prometheus configuration file
sudo mkdir /etc/prometheus/
sudo nano /etc/prometheus/prometheus.yml
Note: it’s important to add the IP of the machine you want to scrape, in this case the Celestia Full Node.
global:
  scrape_interval: 10s
  scrape_timeout: 3s
  evaluation_interval: 5s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node_exporter
    static_configs:
      - targets: ['<your_celestia_node_ip>:9100']
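The systemd unit below points Prometheus at console templates under /etc/prometheus, which the steps above never create. Assuming the extracted archive is still on hand (the 2.x tarballs ship both directories), it's worth copying them across and letting promtool validate the config:
sudo cp -r prometheus-2.33.0.linux-amd64/consoles /etc/prometheus/
sudo cp -r prometheus-2.33.0.linux-amd64/console_libraries /etc/prometheus/
# Validate the scrape config before starting the service
promtool check config /etc/prometheus/prometheus.yml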
Create a system service for Prometheus
sudo tee /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=root
Type=simple
ExecStart=/usr/local/bin/prometheus \\
--config.file /etc/prometheus/prometheus.yml \\
--storage.tsdb.path /var/lib/prometheus/ \\
--web.console.templates=/etc/prometheus/consoles \\
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
Enable and Start Services
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
Check prometheus is running:
sudo systemctl status prometheus
Expected output:
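Beyond systemctl, Prometheus exposes a health endpoint you can query directly; a quick sanity check, assuming the default port 9090:
# Should return a short "Healthy" message
curl http://localhost:9090/-/healthy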
Grafana
Note: you can also use Grafana Cloud to run Grafana. I ran it locally in my case.
Add the Grafana GPG key to the system:
curl https://packages.grafana.com/gpg.key | sudo apt-key add -
Add the Grafana repository to the system:
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
Update the package index and install Grafana:
sudo apt update
sudo apt install grafana
Start the Grafana service:
sudo systemctl start grafana-server
Enable the Grafana service to start at boot:
sudo systemctl enable grafana-server
Check the status of the Grafana service:
sudo systemctl status grafana-server
Expected output:
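Grafana also exposes a health API worth curling to confirm the server is serving, assuming the default port 3000:
# Expect a small JSON response with "database": "ok"
curl http://localhost:3000/api/health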
On the Celestia Full Node machine:
Download and Extract Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvf node_exporter-1.3.1.linux-amd64.tar.gz
Move and Grant Permissions
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
sudo chown ubuntu:ubuntu /usr/local/bin/node_exporter
Check version installed:
node_exporter --version
Expected output:
Create a system service for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=root
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
Enable and Start Services
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Check node_exporter is running:
sudo systemctl status node_exporter
Expected output:
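Before moving back to Prometheus, it's worth confirming the metrics endpoint itself responds; run this on the full node machine:
# Should print the first of a long list of node_* metrics
curl -s http://localhost:9100/metrics | head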
Check Prometheus:
http://<your_analysis_server_ip>:9090/targets
Expected output:
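If you prefer the command line, the Prometheus HTTP API on the analysis server can confirm the same thing; each target should report a value of 1 when its scrape is succeeding:
# Both the prometheus and node_exporter jobs should show "value": [..., "1"]
curl -s 'http://localhost:9090/api/v1/query?query=up'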
At this stage everything should be connected and we can now move to detailed analysis on Grafana.
Grafana can be accessed through a web browser. The default port for Grafana is 3000, so you can access it by typing in a browser:
http://<your_analysis_server_ip>:3000
Once you have accessed the Grafana login page, you can log in with the default username and password, which are both "admin". You will be prompted to change the password upon logging in for the first time.
After logging in, create a Prometheus data source pointing at your Prometheus instance (http://localhost:9090, since Grafana runs on the same server here), then build visualisations and dashboards to display your data.
I used a generic public dashboard for my analysis with a few minor modifications.
It can be imported through the Grafana UI using dashboard ID 10180.
Expected output, something similar to this:
Note: The formula for the “Memory Use” gauge:
1 - node_memory_MemAvailable_bytes{instance=~"$host"} / node_memory_MemTotal_bytes{instance=~"$host"}
I changed the formula from node_memory_MemFree_bytes to node_memory_MemAvailable_bytes, as I felt it gave a more accurate picture: MemFree only counts completely unused pages, while MemAvailable also accounts for reclaimable cache and buffers.
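As a rough worked example using the specs above: with 17GB total and, say, 12GB reported available, the gauge reads 1 - 12/17 ≈ 0.29, i.e. around the 29% average memory use noted in the summary below (the 12GB figure is purely illustrative).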
Analysis
My analysis of my Celestia Full Node follows in a timeline-like fashion, including some screenshots of the dashboard and logs. My dashboard was online from May the 4th (be with us).
My full node started with a hiss and a roar, functioning very well and maintaining near perfect uptime.
However, by 11th April my data folder was filling up fast:
My node wasn’t pruning and thus kept filling up with data. I didn’t have this monitoring stack running at the time and got caught out over a weekend when I was less active. I had to delete the data files, restart, and re-sync to get things going again. This meant my node was offline for a significant length of time, which had an adverse effect on my uptime.
The v0.9.0 update that came the day after my re-sync corrected this.
Lack of storage hasn’t been a problem since:
Since then my node was performing admirably, or so I thought.
Let’s talk about SHREX-BABE-Y 🎶
I started noticing a couple of things. First, my server was working super hard: I could hear the fans running overtime and feel the heat coming off the machine. Second, my logs were overwhelmed with SHREX errors. By this time the team was well aware of the problem and was working on a fix.
The recommended solution was to restart your node; however, with each restart my node couldn’t find any bootstrappers and went into a restart loop. I needed to delete and re-sync to get things going again. This was frustrating, as I didn’t want to keep re-syncing and potentially losing uptime. I wasn’t alone in this; others were noticing the same issue.
Eventually an update corrected this, and I began restarting my node multiple times to get it back in sync.
By this stage I had my analysis stack up and running and could see in more detail how things were going.
Running Celestia version:
4th-5th May: As can be seen in the screenshot below, one of the restarts was around 14:10. Things eased off briefly, but the CPU load increased very quickly soon after to near capacity, with SHREX errors dominating my logs.
Later that day I began to notice that my node was struggling to get in sync: my uptime was gradually dropping, and I could see that the head_of_sampled_chain
was static on each check. I was also moving a lot of data, outputting almost 3x as much as I was consuming. These figures became much more balanced after the updates and re-sync shown in later snapshots.
Effectively the SHREX errors were overloading my machine and it couldn’t catch up with the network.
I started restarting more frequently and managed to get some longer periods of peace:
5th-6th May
Slowly but surely, I was getting closer to the chain head.
After many restarts I finally got fully synced, and my CPU usage plummeted. My machine could finally recover: the cooling fans were slowing and things were looking much better.
Fully synced
6th - 8th May: But the SHREX errors kept returning; my machine hit capacity again and started lagging.
With the latest update, v0.9.3, things just clicked.
My node was sampling headers and synced in no time. The SHREX errors settled down as the other nodes updated, and my CPU could finally settle at a consistently lower work rate. Network traffic started reducing.
9th - 12th May
My machine has been very chilled, averaging 7.4% CPU. Network traffic has normalised and is quite balanced. I am consistently fully synced.
12th May
The v0.9.4 update started well but caused a lot of memory use that eventually led to a shutdown of my machine!
RAM hadn’t been a significant factor until this point. But after the update, as can be seen below, the available memory was degrading quickly; the Celestia node was requiring a lot more RAM to function, and the rate was picking up steam too! None of this was showing in the logs: everything seemed fine, and the shutdown really caught me off guard.
After the update there was less “write” activity but much spikier “read” activity, which subsequently led to the shutdown.
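In hindsight, an alert on available memory would have flagged this before the shutdown. A minimal sketch of a Prometheus alerting rule for the stack above (the file name, threshold, and duration are my own choices, and surfacing the alert assumes an Alertmanager or at least a watch on the Prometheus alerts page):
# /etc/prometheus/alerts.yml -- reference it under rule_files: in prometheus.yml
groups:
  - name: memory
    rules:
      - alert: LowAvailableMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% of memory available on {{ $labels.instance }}"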
To make sure the shutdown didn’t happen again, I decided to restart intermittently to bring the RAM back down to a lower level.
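One minimal way to automate those restarts, assuming the node runs under systemd as a service called celestia-full (a hypothetical unit name; substitute your own), is a root cron entry:
# Restart the node daily at 04:00 to reclaim RAM
0 4 * * * systemctl restart celestia-full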
Summary
Data storage - Filled up super quickly pre-v0.9.0. No issues since the update.
CPU - Super high while the SHREX errors persisted. Normalised after the v0.9.3 update.
Memory - A consistent slow rise throughout, averaging 29%, until v0.9.4, where an unknown problem has been consuming more RAM than previous versions. This led to a shutdown, and the only fix so far has been regular restarts.
Network - Lots of traffic in and out during the SHREX error period, in part due to all the restarts and re-syncs. Normalised thereafter.
This concludes the analysis of my Celestia Full Node. I learned a lot from this exercise. Thank you for the opportunity Celestia team!