VeriHash testnet node operator proposal [January 2022]

bgraham-vh · January 6, 2022, 10:52pm

Our goal is to propose for mainnet after a trial and proof period on the testnet. We would love the opportunity to participate and are eager to answer any questions.

We bring with us a total of 25+ years of experience running infrastructure deployments, from startups to Fortune 500 companies to architecting and integrating infrastructure and processing capabilities for $800MM US Fed Gov contracts. We know we can bring the same level of rigor to helping secure and diversify the StakeWise platform.

We have deep experience running different compute orchestration platforms such as Kubernetes, and pride ourselves on running efficient, secure, and highly available systems.

We have launched a new entity as of today, verihash.io, to focus on the specific needs of the crypto ecosystem in regards to the infrastructure that makes it all possible.

Our testnet proposal, and potential mainnet proposal, will be hosted in AWS. We carry deep experience with the AWS platform, both in certifications and in working experience, having managed everything from greenfield deployments to large datacenter migrations from on-prem to AWS.

We again look forward to questions in this thread, and/or please feel free to reach out to info@, on the Stakewise discord (beegmon/dnlglsn) as we are excited and happy to chat further.

Entity: verihash.io — This is a new domain so content will be up here shortly but we didn’t want to wait for that to get a test net proposal up.

Infrastructure Setup:

Location - Oregon, USA
3 DCs (AWS 3 AZ deployment)
Networking - Redundant ISP links per DC (100+ Gbps per link)
- 10 Gbps per cluster compute workload node
Kubernetes Management/Compute nodes distributed across DCs for availability
Isolated Resources for Kubernetes Management/API Plane and Compute/Pod workloads
Workload compute nodes specialized minimal OS to reduce footprint
At rest encryption of PV volumes

Specification

DAO calls addOperator function of PoolValidators contract with the following parameters:
- operator: 0x5b0DF4Ab7905F7e5098865900819188fA153dD0D
- depositDataMerkleRoot: 0x92197c5aeab4aed6f9e7e6231fae98c86725427c8946124245bb0ab52e1d1563
- depositDataMerkleProofs: /ipfs/QmcePpF3jBBTvoRSrCvBnWHMefp3Vn3fYvmF1VPB3sDtYx
DAO calls setOperator function of Roles contract with the following parameters:
- account: 0x5b0DF4Ab7905F7e5098865900819188fA153dD0D
- revenueShare: 5000
If the proposal is approved, the operator must perform the following steps:
- Call operator-cli sync-vault with the same mnemonic as used for generating the proposal
- Create or update validators and make sure the new keys are added
- Call commitOperator from the 0x5b0DF4Ab7905F7e5098865900819188fA153dD0D address
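As a rough sanity check on the parameters above, the sketch below confirms that the operator address and Merkle root have the expected hex lengths (20 and 32 bytes) and converts revenueShare to a percentage. It assumes revenueShare is expressed out of 10,000 (so 5000 would mean 50%); that denominator is an assumption and is not stated in this thread. The helper names are hypothetical.

```python
import re

# Values copied from the proposal above.
OPERATOR = "0x5b0DF4Ab7905F7e5098865900819188fA153dD0D"
MERKLE_ROOT = "0x92197c5aeab4aed6f9e7e6231fae98c86725427c8946124245bb0ab52e1d1563"
REVENUE_SHARE = 5000


def is_hex_of_length(value: str, n_bytes: int) -> bool:
    """Return True if value is 0x-prefixed hex encoding exactly n_bytes bytes."""
    pattern = r"0x[0-9a-fA-F]{%d}" % (2 * n_bytes)
    return re.fullmatch(pattern, value) is not None


def revenue_share_percent(share: int, denominator: int = 10_000) -> float:
    """Convert a revenue share to a percentage.

    ASSUMPTION: the share is expressed out of `denominator` (10,000 here).
    """
    if not 0 <= share <= denominator:
        raise ValueError("revenue share out of range")
    return 100.0 * share / denominator


if __name__ == "__main__":
    print("operator looks like an address:", is_hex_of_length(OPERATOR, 20))
    print("root looks like a 32-byte hash:", is_hex_of_length(MERKLE_ROOT, 32))
    print("revenue share (%):", revenue_share_percent(REVENUE_SHARE))
```

This only checks formatting. It does not verify that the Merkle root actually matches the deposit data published on IPFS.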

tsudmi · January 10, 2022, 12:39pm

Hi @bgraham-vh ,

Great to hear that you guys are interested in joining testnet. I’ve added your operator to the set on testnet. You can now proceed to syncing the vault.

bgraham-vh · January 10, 2022, 3:05pm

Awesome!

We have been doing failure testing on the cluster last week and over the weekend to gain familiarity with this particular setup/construction.

Once we complete that activity, most likely in the next 24hrs or so, we will begin operations on the testnet.

Thanks again!

brianchilders · January 10, 2022, 6:27pm

Looking forward to seeing your testnet work!

brianchilders · January 11, 2022, 12:32am

Hi @bgraham-vh - can you comment on the distribution of clients that you plan on running? E.g. Prysm vs Lighthouse vs. Teku?

Thank you,

-brian

bgraham-vh · January 11, 2022, 5:34am

For the moment we are running the defaults from the operator helm chart. These defaults are 3 prysm and 1 lighthouse for beacon clients, prysm validators, and geth for eth1.
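To put numbers on the beacon client split described above, here is a minimal sketch that computes each client's share of the beacon nodes. The counts come from the post; the function name is hypothetical.

```python
from collections import Counter

# Beacon node counts as described in the post (operator helm chart defaults).
beacon_nodes = ["prysm", "prysm", "prysm", "lighthouse"]


def client_shares(nodes):
    """Return the fraction of nodes running each client."""
    counts = Counter(nodes)
    total = sum(counts.values())
    return {client: count / total for client, count in counts.items()}


shares = client_shares(beacon_nodes)
for client, share in sorted(shares.items()):
    print(f"{client}: {share:.0%}")
```

With the defaults, three of the four beacon nodes run the same client, which is the imbalance the client diversity question is getting at.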

As we get a better handle on the different trade-offs, we intend to explore client diversity further.

We are also very interested in exploring hardware diversity and have an eye toward a longer term goal of exploring ARM hardware for some components of the infra.

Our goal is to be as diverse as is operationally (simplicity is still king when it comes to reliability and repeatability) and economically possible, as diversity benefits not only the ecosystem as a whole but also the infra we run, from an availability and security standpoint.

bgraham-vh · January 13, 2022, 4:09pm

https://www.verihash.io is now live with some minimal content.

bgraham-vh · January 13, 2022, 10:37pm

Goerli Transaction Hash (Txhash) Details | Etherscan – commitOperator complete

We are including the following graffiti in the blocks we propose – VeriHash - TestNet - Ljz2cF6EqmyUP7vZFrLIEx9IGkfLVzrlVhUz2nPw==

brianchilders · January 14, 2022, 4:51pm

thanks for your response

brianchilders · January 14, 2022, 4:51pm

Nice! Going to check it out!

bgraham-vh · February 17, 2022, 2:48pm

It’s been a while…and we have been very busy!

In prep for a mainnet proposal we have been busy hardening our stack, improving efficiency, testing failure scenarios for DR, and getting all the ducks in a row from a legal entity and business perspective.

That last “businessy” part is just as important as the tech part in our opinions.

Validators need highly available resources, highly available resources take money, and money getting involved usually means taking the necessary steps to ensure solvency and continuity, and forming the legal structures to keep the entire operation (and those involved) safe and humming along smoothly.

Thankfully all the “businessy” stuff is pretty much in place now, so we can get back to the tech, which is really where the fun starts.

Items Completed So Far:

Migration of most infra to ARM64 based hardware for all but the validators. Enables increased efficiency/reduced cost of OPs per clock and improved security via hardware enabled DRAM encryption.

Migration of Vault deployment to auto unseal with non-internal HA backend. Enables auto recovery with lower operation overhead and provides facilities for backup and replication to secondary DR site for validator key data.

Migration of Geth to a lower maintenance deployment. Enables auto pruning and backup of Geth DB and chain data at regular intervals improving time to recovery for failed Geth nodes, bringing increased efficiency in regards to storage use, and facilitating DR/secondary site recovery.

Migration to Lighthouse as the primary beacon chain client with a lower maintenance deployment. Enables improved storage efficiency and auto backup of DB/chain data at regular intervals, lowering time to recovery and facilitating DR.

Improved network architecture enhancing availability, efficiency, and security through removal of unnecessary complexity.

Deployment of a hardened host OS purpose built for hosting containers, which includes the removal of all unnecessary packages/libraries, the removal of SSH, and deployment of AppArmor/SELinux rulesets. The OS image is built at a regular cadence to include updates. CVEs, depending on criticality, may be addressed sooner than the standard update cadence.

Hosts are replaced as new OS versions become available in a rolling-update fashion

All data now encrypted at rest.

All traffic to components external to the validator cluster, those being backup data streams, vault HA backend, and admin side channel are encrypted in transit with at least TLS 1.2

Deployment of automated builds for all deployed containers including sync jobs for “official” external containers like the eth-sidecar deployed by stakewise helm charts (for example)

Privately hosted container and artifact repos

Deployment of enhanced host and container monitoring within the environment. Enables blending of Prometheus data with finer grained metrics from outside observers of the cluster.

Deployment of a new validator storage backend providing HA for slashing DBs while ensuring key isolation between validator sets. This ensures higher validator availability as it isolates the storage from the compute and allows for backups to be taken in case of corruption or for DR purposes.
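As an illustration of the "at least TLS 1.2" item in the list above, a client connecting to an external component can refuse older protocol versions. This is a minimal sketch using Python's standard `ssl` module, not a description of the actual VeriHash configuration; the function name is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that rejects anything below TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse to negotiate SSLv3, TLS 1.0, or TLS 1.1.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_client_context()
    print("minimum TLS version:", ctx.minimum_version.name)
```

The resulting context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=ctx)`.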

Items left to be completed (ideally before a mainnet proposal):

Migrate validators to Lighthouse. We feel strongly about client diversity, and we also see some advantages Lighthouse brings to the table for us. We will also move validators to ARM64 hardware in this step.

Encryption in transit everywhere.

Signed containers.

Streamlined Prometheus monitoring stack deployed outside cluster.

Full end to end test of recovery from backup/redeployment via infra as code for each component of the infra/cluster.

Post (hopefully) mainnet acceptance:
Establish continuous failure testing of components (excluding validators initially) to continue to harden against edge cases within the system

Begin exploring second client stack (likely Nimbus)

Test full DR failover to secondary site (in dev)

Move to readonly hosts for containers and replace some hosts with serverless options

brianchilders · February 17, 2022, 4:01pm

Hey @bgraham-vh - I really appreciate the transparency and updates that you’re giving to the DAO. Much appreciated.

I also appreciate that you have client diversity as a core value of your operations. Thank you for considering this as part of your proposal.

What tool are you using to build the hardened OS images? Packer (packer.io)? Or?

Thank you,

-brian

bgraham-vh · February 17, 2022, 4:56pm

Given we are AWS native, at present we use AWS-provided hardened AMIs based on Amazon Linux 2 with CIS level 1 compliance. We then enable the additional SELinux and AppArmor rules for Docker and containers on the host, and disable SSH completely (it already isn’t provided a key). This is done on initial host bring-up along with the needed disk formatting and mounting. Encryption of data at rest on disk is already provided in hardware via the XTS-AES-256 block cipher.

The CIS image is already about as minimal as you can get, at least in terms of Amazon Linux 2, while still being able to deploy EKS on it.

Amazon Linux 2 CIS level 1 is usually patched once monthly for updates, and CVEs are addressed, depending on criticality, either in the monthly builds or in an ad-hoc build done by AWS.

If we need to spin our own, AWS EC2 image builder (a native service) allows us to revision/craft AMIs the same way we craft our containers…via code and through a pipeline.

We don’t have a need for custom AMIs at the moment given we have a nicely hardened image that passes an audit now with the Amazon Linux 2 CIS level 1. At the end of the day, if we don’t have to run it and our provider has a tool for it that meets the security requirements we are internally aiming for (PCI, HIPAA, CIS, etc.), then we won’t, as it is just more operational overhead and security surface we have to manage.

Ideally, once Bottlerocket is mature enough, we will move to that since it is even more hardened and purpose built for containers. However, there are some teething problems, especially around ephemeral disks, their formatting and management, along with mounting shared filesystems.

Once those hurdles are cleared by the community, hopefully soon, we can move to Bottlerocket, which has some really nice features in terms of host immutability and workload isolation. That is why we have slated this for after mainnet go-live.

brianchilders · February 18, 2022, 3:51pm

@bgraham-vh - you’re doing a great job with this! I work with cloud services in my job - so great to see that you’re taking the extra steps here as a staking provider. When you’re mentioning mounting shared file systems are you talking about the EFS?

Agree that Bottlerocket is not quite mature yet - but glad to see that you’re keeping on top of what AWS is providing.

Thank you again for being candid and transparent about what you’re doing.

bgraham-vh · February 18, 2022, 4:10pm

Well when the team is made up of people who have been in AWS since S3 was the only service…the resulting solutions trend toward things that work well with your long standing cloud provider of choice.

We are at present investigating EFS, FSx (OpenZFS), and FSx (Lustre) as options for a multi-AZ HA storage layer for very specific workloads that don’t need extreme amounts of IOPS, and where state is not easily recoverable, or not recoverable at all, via a fresh sync. Validators are perfect targets for this, and are at present deployed in a way that binds the compute to a single AZ due to the nature of EBS.

Separating the compute and storage layers and allowing each of those layers to scale independently while maintaining their failure boundaries allows us to handle single AZ failures in a given region, and opens up opportunities for easier backup schemes and replication to secondary regions for DR.

Bottlerocket is interesting to us for numerous reasons beyond just the smaller OS/security scope, immutability, and reduction of instance resource utilization. It just needs to move faster!

bgraham-vh · February 19, 2022, 10:49pm

Quick weekend update:

Prometheus redeployment completed for Lighthouse Beacon client, validators, and Geth. This will make it easier to alert and take automated actions within our infrastructure for specific events like backup and host replacements.

We have also decided to commit to running slashers as well on all our lighthouse beacon nodes for the good of the eco-system.

bgraham-vh · February 22, 2022, 1:57pm

First deployment of ARM64 based validators completed. We have a few kinks to work out but expect this week to see the migration of presently running prater validators to this new deployment.

We will also be holding off on container signing for a bit. At present our private registry doesn’t support signatures very well. There is a solution but it is still in the draft phase. For now we will stick with simple SHA verification of our images when they are pulled to run, and will take another look at this once we have a mainnet deploy.
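The idea behind simple SHA verification can be sketched as follows: compute the SHA-256 digest of the pulled content and compare it with a pinned value. This is only an illustration on raw bytes with hypothetical helper names. In practice, container runtimes perform the equivalent check themselves when an image is pulled by digest, and nothing here describes the actual VeriHash pipeline.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the digest of data in the 'sha256:<hex>' form used for image digests."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_digest(data: bytes, expected: str) -> bool:
    """Return True only if data hashes to the pinned digest."""
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sha256_digest(data), expected)


if __name__ == "__main__":
    blob = b"example image layer bytes"  # stand-in for pulled content
    pinned = sha256_digest(blob)         # normally recorded at build time
    print("untampered content verifies:", verify_digest(blob, pinned))
    print("modified content verifies:", verify_digest(blob + b"!", pinned))
```

A digest check confirms the content is the same as what was pinned, but unlike a signature it says nothing about who produced it, which is why signing remains on the list.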

bgraham-vh · February 22, 2022, 6:40pm

Starting at 12pm PST 02/22/2022 we will be taking extended validator downtime to migrate them over to the new ARM64 deployment and to lighthouse at the same time.

We expect no more than 24hrs of downtime for this one time migration effort (likely much less time but we are taking it slow and testing a few other things as well during this effort)

bgraham-vh · February 23, 2022, 4:11am

Migration completed.

We have a few nagging issues to sort out but have finished the migration of validators to ARM + Lighthouse. This, along with a slew of other changes and improvements that should lower recovery times for failed components, should have us nearly ready to go live.

We will be sorting out the issues for the rest of the week and into next, but expect validators to remain fairly stable, as before, from here on out for this deployment.