Polkassembly - Treasury Proposal - Funding for development of Database Analytics and RPC Endpoint monitoring #2546

Treasury Proposal - Funding for development of Database Analytics and RPC Endpoint monitoring

3 years ago

Dear community,

I am presenting a proposal that seeks to fund a database analytics and RPC monitoring project for a period of 12 months. The estimated schedule spans the period of 01-June to 14-September-2023, and the work would be carried out by the three members of the ParaNodes team.

The proposal seeks to address three gaps of public knowledge:

Database sizing & growth rates
Database restoration times
RPC (WSS) Endpoint monitoring

To achieve these objectives, additional hardware is required, and I have presented two options for its procurement. The first is to rent servers for the duration of the project; the second is to purchase hardware with a lifespan of 5 years. Future costs are lower with option 2, and any 'discarded' hardware would be donated to a charitable cause. At present rates, the all-in cost for option 1 is 2,725.45 KSM, whereas the all-in cost for option 2 is 3,005.15 KSM. Both costs include fees for slippage and are measured against an EMA7 of $32.817/KSM.
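For readers who want the two totals in USD, here is a minimal sketch that converts them at the quoted EMA7 rate. The KSM amounts and the rate come from the post; the function name and the rounding to cents are my own assumptions, and the result only restates the quoted figures rather than the proposal's itemised budget.

```python
from decimal import Decimal

# Rate quoted in the post: EMA7 of $32.817 per KSM.
EMA7_USD_PER_KSM = Decimal("32.817")

# All-in costs quoted in the post (slippage fees already included).
OPTIONS_KSM = {
    "option 1 (rental)": Decimal("2725.45"),
    "option 2 (purchase)": Decimal("3005.15"),
}


def ksm_to_usd(ksm: Decimal, rate: Decimal = EMA7_USD_PER_KSM) -> Decimal:
    """Convert a KSM amount to USD, rounded to cents (rounding is an assumption)."""
    return (ksm * rate).quantize(Decimal("0.01"))


usd = {name: ksm_to_usd(amount) for name, amount in OPTIONS_KSM.items()}
difference_ksm = OPTIONS_KSM["option 2 (purchase)"] - OPTIONS_KSM["option 1 (rental)"]

for name, value in usd.items():
    print(f"{name}: {value} USD")
print(f"option 2 costs {difference_ksm} KSM more than option 1")
```

At this rate, option 1 comes to roughly 89,441 USD and option 2 to roughly 98,620 USD, a gap of 279.70 KSM.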

Full details can be found in the proposal document.

Summary of Objectives 1 & 2

At the core of any Polkadot/Kusama node is a database, which comes in multiple types and configurations. The project seeks to cover all useful combinations of database types and configurations and to monitor their current size and growth rates. This knowledge is useful to validators, collators, RPC providers, developers, researchers or anyone who intends to roll out or maintain a Polkadot node. The information can help prevent unexpected disk consumption and guide new builders in sizing their disks adequately. The collected data would be presented on the ParaNodes.io website as well as on the Polkadot Wiki. This project was initiated ~16 months ago; at launch it received welcoming feedback, with suggestions to chart the data and present growth rates.
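To illustrate how size samples could turn into a growth rate and a disk-sizing estimate, here is a rough sketch. The sample dates, sizes, disk capacity and function names are all hypothetical, and the simple first-to-last average is only one possible way to estimate growth; it is not necessarily how the project would compute it.

```python
from datetime import date
from typing import List, Optional, Tuple


def daily_growth_gib(samples: List[Tuple[date, float]]) -> float:
    """Average growth in GiB/day between the first and the last sample."""
    (start_day, start_size), (end_day, end_size) = samples[0], samples[-1]
    elapsed_days = (end_day - start_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")
    return (end_size - start_size) / elapsed_days


def days_until_full(current_gib: float, disk_gib: float,
                    growth_per_day: float) -> Optional[float]:
    """Days left before the disk fills at a constant growth rate."""
    if growth_per_day <= 0:
        return None  # no growth observed, so no meaningful projection
    return (disk_gib - current_gib) / growth_per_day


# Hypothetical measurements for one database type/configuration.
samples = [
    (date(2023, 6, 1), 400.0),
    (date(2023, 6, 15), 407.0),
    (date(2023, 7, 1), 415.0),
]

rate = daily_growth_gib(samples)
remaining = days_until_full(current_gib=415.0, disk_gib=1000.0, growth_per_day=rate)
print(f"growth: {rate} GiB/day, disk full in about {remaining} days")
```

With these made-up numbers the database grows 0.5 GiB per day, so a 1,000 GiB disk would last about 1,170 more days.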

Summary of Objective 3

Secondly, I wish to establish a system to monitor the availability, performance and consistency of public RPC endpoints on Kusama and Polkadot. This should benefit both end-users and the RPC providers themselves. At present, the treasury is the primary financier of public RPC endpoints, and funds are provided based on self-evaluations. This system is intended to provide a third-party evaluation of those services to further guide funding.
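As a rough illustration of the availability and performance side, the sketch below summarises probe results per endpoint. The `Probe` record, the endpoint names and the sample values are hypothetical, and how the probes themselves are collected is left out.

```python
from statistics import median
from typing import Dict, List, NamedTuple, Optional


class Probe(NamedTuple):
    """One probe result (hypothetical record layout)."""
    endpoint: str
    ok: bool
    latency_ms: Optional[float]  # None when the probe failed


def summarize(probes: List[Probe]) -> Dict[str, Dict[str, Optional[float]]]:
    """Availability ratio and median latency of successful probes, per endpoint."""
    grouped: Dict[str, List[Probe]] = {}
    for probe in probes:
        grouped.setdefault(probe.endpoint, []).append(probe)

    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for endpoint, results in grouped.items():
        latencies = [p.latency_ms for p in results if p.ok and p.latency_ms is not None]
        summary[endpoint] = {
            "availability": sum(p.ok for p in results) / len(results),
            "median_latency_ms": median(latencies) if latencies else None,
        }
    return summary


# Hypothetical probe results for two made-up endpoints.
probes = [
    Probe("rpc-a.example", True, 120.0),
    Probe("rpc-a.example", True, 80.0),
    Probe("rpc-a.example", False, None),
    Probe("rpc-a.example", True, 100.0),
    Probe("rpc-b.example", True, 200.0),
    Probe("rpc-b.example", True, 300.0),
]

summary = summarize(probes)
for endpoint, stats in summary.items():
    print(endpoint, stats)
```

The median is used instead of the mean so that a single slow response does not dominate the figure; that is a design choice of this sketch, not something stated in the proposal.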

Regards,

Will | Paradox

Comments (5)

3 years ago

Thank you, Will, for this interesting post. Here are a few comments that may help the proposal:

Database sizing: this should also take into account that a database that has been synced for a long time uses much more storage than a freshly synced one (a factor of 2-3x over 1 year). This is important information for node operators sizing their hardware.
Database sizing and database restore time can be measured on the same nodes, as these metrics don't require real-time visualization (a metric that is a few days old is acceptable). A scheduled (weekly?) job that wipes these nodes' databases and restores them should help.
DB restore time will depend heavily on machine specs and connectivity/location. You should use the same machine specs for all benchmarks of this metric.
RPC testing: low-spec hardware is enough for the probing, as long as the machines have good connectivity (if you select a cloud provider, in my experience Azure has some of the best connectivity).
RPC uptime: it would be good to shorten the probing interval (response time) to every minute or even less for instant connectivity reporting. A tool like blackbox exporter does that very well with little overhead and is easy to integrate with your own database. This would be a good metric to have permanently rather than for just a limited time.
AMD CPUs are generally less common than Intel ones among node runners. I would suggest using an Intel CPU.
HDD I/O will be saturated if you run many nodes on these; I recommend an SSD.

I would suggest renting the hardware as the execution time of the proposal is limited.

Could you please make a table of all nodes / servers that would be running?
With the above suggestions, I think you can make sustainable savings on the hardware/maintenance part.

3 years ago

Hey bLD,

Thank you for the comments.

Database sizing: this should also take into account that a database that has been synced for a long time uses much more storage than a freshly synced one (a factor of 2-3x over 1 year). This is important information for node operators sizing their hardware.

I refer to this as database bloat, and it was something that I didn't consider within the proposal; thanks for raising it. Usually bloat arises from log files or some unnecessary temp files. Perhaps I can extract the bloat and measure its growth in comparison to the primary database footprint. I'll consult and get more info on this.
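One simple way to express bloat is to compare a long-running database with a freshly synced one of the same type and configuration. This is only a sketch: the sizes and function names are hypothetical, and it assumes both databases are measured at the same block height.

```python
def bloat_factor(long_running_gib: float, fresh_sync_gib: float) -> float:
    """Ratio of a long-running database's size to a freshly synced one."""
    if fresh_sync_gib <= 0:
        raise ValueError("fresh sync size must be positive")
    return long_running_gib / fresh_sync_gib


def bloat_gib(long_running_gib: float, fresh_sync_gib: float) -> float:
    """Absolute extra space used by the long-running database."""
    return long_running_gib - fresh_sync_gib


# Hypothetical sizes for the same database type at the same block height.
factor = bloat_factor(long_running_gib=900.0, fresh_sync_gib=360.0)
extra = bloat_gib(long_running_gib=900.0, fresh_sync_gib=360.0)
print(f"bloat factor: {factor}x, extra space: {extra} GiB")
```

These made-up numbers give a factor of 2.5x, which would sit inside the 2-3x range mentioned above.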

Database sizing and database restore time can be measured on the same nodes, as these metrics don't require real-time visualization (a metric that is a few days old is acceptable). A scheduled (weekly?) job that wipes these nodes' databases and restores them should help.

I don't agree with this point.

Restoration times would be negatively affected due to contention for shared resources. Given the number of nodes shared on a single machine this impact would be significant and skew results presented to users.

Wiping the core databases and having them re-sync would undermine the objective of monitoring natural growth rates. These databases are not synchronized using warp-sync; they would each take a very long time to re-sync.

DB restore time will depend heavily on machine specs and connectivity/location. You should use the same machine specs for all benchmarks of this metric.

Agreed. I am provisioning a single rented bare-metal server from OVH, within the reference specifications, for restoration times. Its specs would be posted on the website so others may gauge expectations by comparing against their own hardware.

RPC testing: low-spec hardware is enough for the probing, as long as the machines have good connectivity (if you select a cloud provider, in my experience Azure has some of the best connectivity).

RPC uptime: it would be good to shorten the probing interval (response time) to every minute or even less for instant connectivity reporting. A tool like blackbox exporter does that very well with little overhead and is easy to integrate with your own database. This would be a good metric to have permanently rather than for just a limited time.

The scope of RPC testing goes beyond port probing (connectivity). It also requires computational and integrity tests, for which I intend to use Polkadot.JS/API. With this I would be able to test many metrics such as metadata, block depth and execution time for common queries related to staking, governance and account rendering. I also want to ensure that none of these providers are unreasonably throttling connections/queries.

Can these be achieved using blackbox?
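To make the block depth check concrete, here is a rough sketch that compares the best block reported by several providers. It uses raw JSON-RPC payloads in Python rather than Polkadot.JS/API, purely for illustration. `chain_getHeader` is a standard Substrate RPC method that returns the block number as a hex string; the provider names, the canned responses and the lag threshold are hypothetical, and the network call itself is left out.

```python
import json
from typing import Dict, List


def header_request(request_id: int = 1) -> str:
    """JSON-RPC payload asking a Substrate node for its latest block header."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "chain_getHeader",
        "params": [],
    })


def block_height(response_text: str) -> int:
    """Extract the block number (a hex string such as "0x1000") from a response."""
    result = json.loads(response_text)["result"]
    return int(result["number"], 16)


def lagging_providers(heights: Dict[str, int], max_lag: int = 3) -> List[str]:
    """Providers more than max_lag blocks behind the highest reported block."""
    best = max(heights.values())
    return sorted(name for name, height in heights.items() if best - height > max_lag)


# Canned responses standing in for real endpoint replies (hypothetical data,
# trimmed to the one header field this sketch reads).
responses = {
    "provider-a": '{"jsonrpc":"2.0","id":1,"result":{"number":"0x1000"}}',
    "provider-b": '{"jsonrpc":"2.0","id":1,"result":{"number":"0xfff"}}',
    "provider-c": '{"jsonrpc":"2.0","id":1,"result":{"number":"0xff0"}}',
}

heights = {name: block_height(text) for name, text in responses.items()}
behind = lagging_providers(heights)
print(heights, "lagging:", behind)
```

In a real run, each payload from `header_request()` would be sent over the provider's WSS endpoint at roughly the same moment, since heights sampled seconds apart will differ naturally.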

AMD CPUs are generally less common than Intel ones among node runners. I would suggest using an Intel CPU.

The AMD CPUs were selected for the database size monitoring aspect of the proposal, and I don't anticipate CPU type having an effect on database sizing. I appreciate that CPU type can affect restoration times, but by how much, really, given that they're both above reference? Regardless, I can rent a dedicated server with an Intel CPU. Something to consider: both OVH and Hetzner offer many server options with AMD CPUs.

HDD I/O will be saturated if you run many nodes on these; I recommend an SSD.

Agreed, I intend to use SSDs, though I should note that the price isn't far off that of lower-spec U.2 NVMe drives.

I would suggest renting the hardware as the execution time of the proposal is limited.

I appreciate the suggestion but, subject to approval, I would like to continue the project into the future. Purchasing the hardware would really start to pay off after year 1.

Could you please make a table of all nodes / servers that would be running? With the above suggestions, I think you can make sustainable savings on the hardware/maintenance part.

Added to the appendices (see the last page).

3 years ago

Hey Paradox,

Thank you for posting your proposal.

Please note that the [Audit report] reflects only the quality of the information presented in the proposal and not the quality of the project/idea itself. Please refer to the feedback in the areas with possible improvements. I will create another report when the proposal goes on-chain.

The following proposal Audit is created as a part of Proposal#67. More information about the Treasury proposal template and the Audit process can be found on the link above. All templates are free for everyone to use. For any questions or feedback regarding the Audit templates use the discussion link above.

Please note that the views and opinions presented in the Audit report are my own and do not represent general community opinion. If you can't access files on the Crust network, the original report can be accessed here.
