<?xml version="1.0" encoding="utf-8" ?>
<!--
# \\ SPIKE: Secure your secrets with SPIFFE.
# \\\\\ Copyright 2024-present SPIKE contributors.
# \\\\\\\ SPDX-License-Identifier: Apache-2.0
-->
<stuff>
<purpose>
<target>Our goal is to have a minimally delightful product.</target>
<target>Strive not to add features just for the sake of adding features.
</target>
<target>Half-assed features shall be completed before adding more
features.
</target>
</purpose>
<next>
<issue>
we don't need state.ReadAppState()
and other state enums for keepers anymore;
keepers are just dummy stateless keepers.
</issue>
<issue>
this is for policy creation:
allowed := state.CheckAccess(
spiffeid.String(), "*",
[]data.PolicyPermission{data.PermissionSuper},
)
instead of a wildcard, maybe have a predefined path
for the access check, like "/spike/system/acl".
also disallow people from creating secrets etc. under
/spike/system
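a minimal sketch of how that could look (the constant name and the
reserved-prefix guard below are assumptions, not decided API):
const SystemAclPath = "/spike/system/acl" // hypothetical reserved path
// policy creation: check against the reserved path instead of "*":
allowed := state.CheckAccess(
    spiffeid.String(), SystemAclPath,
    []data.PolicyPermission{data.PermissionSuper},
)
// secret creation: reject writes under the reserved subtree:
if strings.HasPrefix(secretPath, "/spike/system") {
    return errors.New("paths under /spike/system are reserved")
}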
</issue>
<issue>
Invert shard generation flow.
</issue>
<issue>
control these with flags.
i.e. the starter script can optionally NOT automatically
start nexus or keepers.
#echo ""
#echo "Waiting before SPIKE Keeper 1..."
#sleep 5
#run_background "./hack/start-keeper-1.sh"
#echo ""
#echo "Waiting before SPIKE Keeper 2..."
#sleep 5
#run_background "./hack/start-keeper-2.sh"
#echo ""
#echo "Waiting before SPIKE Keeper 3..."
#sleep 5
#run_background "./hack/start-keeper-3.sh"
#echo ""
#echo "Waiting before SPIKE Nexus..."
#sleep 5
#run_background "./hack/start-nexus.sh"
</issue>
<issue>
validations:
along with the error code, also return an explanatory message.
instead of this, for example:
err = validation.ValidateSpiffeIdPattern(spiffeIdPattern)
if err != nil {
responseBody := net.MarshalBody(reqres.PolicyCreateResponse{
Err: data.ErrBadInput,
}, w)
net.Respond(http.StatusBadRequest, responseBody, w)
return err
}
do this:
err = validation.ValidateSpiffeIdPattern(spiffeIdPattern)
if err != nil {
responseBody := net.MarshalBody(reqres.PolicyCreateResponse{
Err: data.ErrBadInput,
Reason: "Invalid SPIFFE ID pattern. Matcher should be a regex that can match a SPIFFE ID",
}, w)
net.Respond(http.StatusBadRequest, responseBody, w)
return err
}
</issue>
<issue>
implement keeper crash recovery:
i.e. ask shards from nexus
</issue>
<issue>
implement nexus crash recovery
i.e. ask shards from keepers
</issue>
<issue>
keeper does not need to store multiple shards;
each keeper should keep its own shard.
// Store decoded shard in the map.
state.Shards.Store(id, decodedShard)
log.Log().Info(fName, "msg", "Shard stored", "id", id)
</issue>
<issue>
implement doomsday recovery
i.e. operator saves shards in a secure enclave.
</issue>
<issue>
admin should not be able to create two policies with the same name.
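a rough sketch of the guard, reusing the response helpers shown above
(GetPolicyByName is a hypothetical lookup, not existing API):
// Reject duplicates before creating the policy.
if _, exists := state.GetPolicyByName(request.Name); exists {
    responseBody := net.MarshalBody(reqres.PolicyCreateResponse{
        Err:    data.ErrBadInput,
        Reason: "A policy with the same name already exists",
    }, w)
    net.Respond(http.StatusConflict, responseBody, w)
    return errors.New("duplicate policy name")
}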
</issue>
<issue>
exponentially back off here
log.Log().Info("tick", "msg", "Waiting for keepers to initialize")
time.Sleep(5 * time.Second)
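a minimal stdlib-only sketch; the cap and multiplier are placeholders,
and keepersInitialized is a hypothetical readiness check:
wait := time.Second
for !keepersInitialized() {
    log.Log().Info("tick", "msg", "Waiting for keepers to initialize")
    time.Sleep(wait)
    wait *= 2
    if wait > time.Minute {
        wait = time.Minute // cap the backoff
    }
}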
</issue>
<issue>
Store root key in Nexus' memory (will be required for recovery later)
We can also keep shards in Nexus' memory for convenience
(reasoning: if we are keeping the root key, securely erasing shards
does not increase the security posture much)
</issue>
<issue>
consider the db backend as untrusted,
i.e., encrypt everything you store there, including policies.
(that might already be the case actually) -- if so, document it
on the website.
</issue>
<issue>
// TODO: this check will change once we make #keepers configurable.
if len(keepers) < 3 {
log.FatalLn("Tick: not enough keepers")
}
</issue>
<issue>
For the in-memory store, bypass initialization, shard creation, etc.
</issue>
<issue>
SPIKE defaults to sqlite backing store
</issue>
<issue>
nexus tracks its keeper-initialization state.
(time shards were created, number of shards, etc.)
</issue>
<issue>
1. Nexus maintains its initialization state
(i.e., if it successfully initialized the keepers, it should recompute
the root key from the keepers instead of auto-generating it.
for that, it will set a tombstone indicating it has initialized the
keepers. the tombstone will be in SQLite, since shards do not make
sense in an in-memory backing store)
2. Keepers advertising their status to Nexus regularly
IF Nexus has initialized keepers already Nexus will recompute and
provide the shard to the keeper.
(Nexus will keep the root key in its memory. The threat model of SPIKE
does not protect Nexus against memory-based attacks and it's up to the
user to harden and ensure that nexus runs with non-root privileges
(this threat model is the same for Vault and other secret stores too))
3. Nexus crashes; it figures out it already initialized keepers; asks for
shards and rekeys itself.
4. Nexus crashes, but a quorum of keepers that know their shards cannot
be reached.
Nexus transitions to "locked" state and a manual unlock will be required
(this will be a separate user story)
</issue>
<issue>
// 3. spike policy list gives `null` for no policies instead of a message
// also the response is json rather than a more human readable output.
// also `createdBy` is empty.
// we can create "good first issue"s for these.
</issue>
<issue>
func computeShares(finalKey []byte) (group.Scalar, []secretsharing.Share) {
// Initialize parameters
g := group.P256
// TODO: these will be configurable
t := uint(1) // Need t+1 shares to reconstruct
n := uint(3) // Total number of shares
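one way to make these configurable from the environment; the env var
names below are invented, and the defaults keep the current behavior:
t := uint(1) // Need t+1 shares to reconstruct
n := uint(3) // Total number of shares
if v, err := strconv.ParseUint(os.Getenv("SPIKE_SHAMIR_THRESHOLD"), 10, 32); err == nil {
    t = uint(v)
}
if v, err := strconv.ParseUint(os.Getenv("SPIKE_SHAMIR_SHARES"), 10, 32); err == nil {
    n = uint(v)
}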
</issue>
<issue>
dr: keeper crash
waiting-for: shard generation inversion.
</issue>
<issue>
Check the entire codebase and implement the `TODO:` items.
</issue>
<issue>
Policy creation feature does not work due to validation error.
</issue>
<issue>
Deploy Keycloak and make sure you can initialize it.
</issue>
<issue>
Create a video about this new shamir secret sharing workflow.
</issue>
<issue>
DR: devise a DR scenario when a keeper crashes.
(depends on the new inverted sharding workflow)
</issue>
<issue>
<task>
Outline a disaster recovery scenario for when Nexus and all keepers are down.
</task>
<task>The system should work with three keepers by default.</task>
</issue>
</next>
<low-hanging-fruits>
<issue>
something similar for SPIKE too:
Dev mode
The Helm chart may run an OpenBao server in development. This installs a
single OpenBao server with a memory storage backend.
For dev mode:
- no keepers
- no backing store (everything is in memory)
</issue>
<issue>
Consider using google kms, azure keyvault, and other providers
(including an external SPIKE deployment) for root key recovery.
the first question to consider is whether it's really needed.
the second question is what to link kms to (keepers or nexus?)
keepers would be better because only then will we back up the shards.
or google kms can be used as an alternative to keepers
(i.e., store encrypted dek, with the encrypted root key on nexus;
only kms can decrypt it -- but, to me, it does not provide any
additional advantage since if you are on the machine, you can talk to
google kms anyway)
</issue>
<issue>
enable SQLite by default
and test it (i.e., crash Nexus and ensure both secrets and policies can be recovered)
</issue>
<issue>
Multiple Keeper instances will be required for fan-in fan-out of the
shards.
configure the current system to work with multiple keepers.
the demo setup should initialize 3 keepers by default.
the demo setup should use sqlite as the backing store by default.
</issue>
<issue>
Install Keycloak locally and experiment with it.
This is required for "named admin" feature.
</issue>
<issue>
One-way token flow:
keeper provides the root key to nexus;
nexus init pushes the root key to keeper.
that's it.
</issue>
<issue>
If SPIKE is not initialized, `spike` or `spike --help` should display
a reminder to initialize SPIKE first and exit.
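a sketch of the guard at the top of the CLI entry point
(state.Initialized is a made-up predicate for illustration):
if !state.Initialized() {
    fmt.Println("SPIKE is not initialized.")
    fmt.Println("Please run `spike init` first.")
    os.Exit(1)
}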
</issue>
<issue>
The paths that we set in get, put, etc. should look like a unix path.
this will require sanitization!
Check how other secrets stores manage those paths.
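a first-pass sanitizer sketch using only the standard library; the
exact rules (allowed characters, max depth) still need deciding:
import (
    "errors"
    "path"
    "strings"
)
// sanitizeSecretPath normalizes a user-supplied secret path to a
// rooted, canonical unix-style path and rejects traversal attempts.
func sanitizeSecretPath(p string) (string, error) {
    // conservative: reject any ".." outright instead of resolving it.
    if strings.Contains(p, "..") {
        return "", errors.New("path traversal is not allowed")
    }
    return path.Clean("/" + p), nil // collapses "//" and "." segments
}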
</issue>
<issue>
read policies from a yaml or a json file and create them.
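a sketch of the JSON variant; the file schema below is an assumption,
and createPolicy stands in for the existing policy-creation API:
type PolicyFile struct {
    Policies []struct {
        Name            string   `json:"name"`
        SpiffeIdPattern string   `json:"spiffeIdPattern"`
        PathPattern     string   `json:"pathPattern"`
        Permissions     []string `json:"permissions"`
    } `json:"policies"`
}
raw, err := os.ReadFile("policies.json")
if err != nil {
    log.FatalLn(err.Error())
}
var pf PolicyFile
if err := json.Unmarshal(raw, &pf); err != nil {
    log.FatalLn(err.Error())
}
for _, p := range pf.Policies {
    createPolicy(p) // hypothetical wrapper over the policy API
}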
</issue>
<issue>
have sqlite as the default backing store.
(until we implement the S3 backing store)
</issue>
</low-hanging-fruits>
<later>
<issue>
ability to lock nexus programmatically.
when locked, nexus will deny almost all operations
locking is done by executing nexus binary with a certain command line flag.
(i.e. there is no API access, you'll need to physically exec the ./nexus
binary -- regular svid verifications are still required)
only a superadmin can lock or unlock nexus.
</issue>
<issue>
consider using NATS for cross-trust-boundary (or not) secret federation
</issue>
<issue>
wrt: secure erasing shards and the root key >>
It would be interesting to try and chat with some of the folks under the cncf
(That's a good idea indeed; I'm noting it down.)
</issue>
<issue>
over the break, I dusted off https://github.com/spiffe/helm-charts-hardened/pull/166 and started playing with the new k8s built-in CEL-based mutation functionality.
the k8s CEL support is a little rough, but I was able to do a whole lot in it, and think I can probably get it to work for everything. once 1.33 hits, I think it will be even easier.
I mention this, as I think spike may want similar functionality?
csi driver, specify secrets to fetch to volume automatically, keep it up to date, and maybe poke the process once refreshed
</issue>
<issue>
set SQLite on by default and make sure everything works.
</issue>
<issue>
volkan@spike:~/Desktop/WORKSPACE/spike$ spike secret get /db
Error reading secret: post: Problem connecting to peer
^ I get an error instead of a "secret not found" message.
</issue>
<issue>
this is from SecretReadResponse, so maybe its entity should be somewhere
common too.
return &data.Secret{Data: res.Data}, nil
</issue>
<issue>
these may come from the environment:
DataDir: ".data",
DatabaseFile: "spike.db",
JournalMode: "WAL",
BusyTimeoutMs: 5000,
MaxOpenConns: 10,
MaxIdleConns: 5,
ConnMaxLifetime: time.Hour,
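a sketch of env-with-default lookups (the env var names below are
invented, not existing configuration):
func envStr(key, def string) string {
    if v := os.Getenv(key); v != "" {
        return v
    }
    return def
}
func envInt(key string, def int) int {
    if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
        return v
    }
    return def
}
cfg.DataDir = envStr("SPIKE_NEXUS_DATA_DIR", ".data")
cfg.DatabaseFile = envStr("SPIKE_NEXUS_DB_FILE", "spike.db")
cfg.MaxOpenConns = envInt("SPIKE_NEXUS_DB_MAX_OPEN_CONNS", 10)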
</issue>
</later>
<reserved>
<issue waitingFor="shamir-to-be-implemented-first">
use case: shamir
1. `spike init` verifies that there are 3 healthy keeper instances.
it creates 3 shamir shards (2 of which will be enough to
reassemble the root key) and sends one shard to each keeper.
2. SPIKE nexus regularly polls all keepers and if it can assemble a secret
all good.
3. `spike init` will also save the 2 shards (out of 3) in
`~/.spike/recovery/*`
The admin will be "highly encouraged" to delete those from the machine and
securely back up the keys and distribute them to separate people etc.
[2 and 3 are configurable]
</issue>
<issue waitingFor="shamir-to-be-implemented">
<workflow>
1. `spike init` initializes keeper(s). From that point on, SPIKE Nexus
pulls the root key whenever it needs it.
2. nexus and keeper can use e2e encryption with one time key pairs
to have forward secrecy and defend the transport in the VERY unlikely
case of a SPIFFE mTLS breach.
3. ability for nexus to talk to multiple keepers
4. ability for a keeper to talk to nexus to recover its root key if it
loses it.
5. ability for nexus to talk to and initialize multiple keepers.
(phase 1: all keepers share the same key)
6. `spike init` saves its shards (2 out of 3 or similar) to
`~/.spike/recovery/*`
The admin will be "highly encouraged" to delete those from the machine
and securely back up the keys and distribute them to separate people,
etc.
`spike init` will also save the primary key used in shamir's secret
sharing to `~/.spike/recovery/*` (this is not as sensitive as the root
key, but still should be kept safe)
- it is important to note that, without the recovery material, your
only option to restore the root key relies on the possibility that
more than N keepers remain operational at all times. -- that's a good
enough possibility anyway (say 5 keepers in 3 AZs, and you need only 2
to recover the root key; then it will be extremely unlikely for all of
them to go down at the same time) so in an ideal scenario you save
your recovery material in a secure encrypted enclave and never ever
use it.
7. `spike recover` will reset a keeper cluster by using the recovery
material.
`spike recover` will also recover the root key.
to use `spike recover` you will need a special SVID (even a super
admin could not use it without prior authorization).
the SVID that can execute `spike recover` will not be able to execute
anything else.
8. At phase zero, `spike recover` will just save the root key to disk,
also warning that this is not secure and that the key should be backed
up safely and wiped from the disk.
9. maybe double-encrypt keeper-nexus communication with one-time key
pairs, because the root key is very sensitive and we would want to
make sure it's secure even if the SPIFFE mTLS is compromised.
</workflow>
<details>
say user sets up 5 keeper instances.
in nexus, we have a config
keepers:
- nodes: [n1, n2, n3, n4, n5]
nexus can reach out with its own spiffe id to each node in the list.
it can call the assembly lib with whatever secrets it gets back, as it
gets them back, and so long as it gets enough, "it just works".
recovery could even be: users have a copy of some of the keepers'
secrets. they rebuild a secret server and load that piece back in.
nexus can then recover.
that api could also allow for backup configurations
</details>
<docs>
WAITINGFOR: shamir to be implemented
To documentation (Disaster Recovery)
Is it like
Keepers have 3 shares.
I get one share
you get one share.
We keep our shares secure.
none of us alone can assemble a keeper cluster.
But we two can join our forces and do an awesome DR at 3am in the
morning if needed?
or if you're not that paranoid, you can keep both shares on one
thumbdrive, or 2 shares on two different thumbdrives in two different
safes, and rebuild.
it gives a lot of options on just how secure you want to try and make
things vs. how painful it is to recover.
</docs>
</issue>
<issue waitingFor="shamir-to-be-implemented">
func RouteInit(
w http.ResponseWriter, r *http.Request, audit *log.AuditEntry,
) error {
// This flow will change after implementing Shamir Secrets Sharing
// `init` will ensure there are enough keepers connected, and then
// initialize the keeper instances.
//
// We will NOT need the encrypted root key; instead, an admin user will
// fetch enough shards to back up. Admin will need to provide some sort
// of key or password to get the data in encrypted form.
</issue>
</reserved>
<immediate-backlog>
</immediate-backlog>
<runner-up>
<issue>
double-encryption of nexus-keeper comms (in case mTLS gets compromised, or
SPIRE is configured to use an upstream authority that is compromised, this
will provide end-to-end encryption and an additional layer of security
over the existing PKI)
</issue>
<issue>
Minimally Delightful Product Requirements:
- A containerized SPIKE deployment
- A Kubernetes SPIKE deployment
- Minimal policy enforcement
- Minimal integration tests
- A demo workload that uses SPIKE to test things out as a consumer.
- A golang SDK (we can start at github/zerotohero-dev/spike-sdk-go
and then move it under spiffe once it matures)
</issue>
<issue>
Kubernetification
</issue>
<issue>
v.1.0.0 Requirements:
- Having S3 as a backing store
</issue>
<issue>
Consider a health check / heartbeat between Nexus and Keeper.
This can be more frequent than the root key sync interval.
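a sketch of a ticker-driven heartbeat loop on the Nexus side
(pingKeeper is a hypothetical mTLS call to a keeper health endpoint):
ticker := time.NewTicker(30 * time.Second) // more frequent than key sync
defer ticker.Stop()
for range ticker.C {
    for _, keeper := range keepers {
        if err := pingKeeper(keeper); err != nil {
            log.Log().Info("heartbeat", "msg", "keeper unreachable", "keeper", keeper)
        }
    }
}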
</issue>
<issue>
Unit tests and coverage reports.
Create a solid integration test before.
</issue>
<issue>
Test automation.
</issue>
<issue>
Assigning secrets to SPIFFE IDs or SPIFFE ID prefixes.
</issue>
<issue>
SPIKE CSI Driver
the CSI Secrets Store driver enables users to create
`SecretProviderClass` objects. These objects define which secret provider
to use and what secrets to retrieve. When pods requesting CSI volumes are
created, the CSI Secrets Store driver sends the request to the OpenBao CSI
provider if the provider is `vault`. The CSI provider then uses the
specified `SecretProviderClass` and the pod’s service account to retrieve
the secrets from OpenBao and mount them into the pod’s CSI volume. Note
that the secret is retrieved from SPIKE Nexus and populated to the CSI
secrets store volume during the `ContainerCreation` phase. Therefore, pods
are blocked from starting until the secrets are read from SPIKE and
written to the volume.
</issue>
<issue>
shall we implement rate limiting, or should that be out of scope
(i.e., to be implemented by the user)?
</issue>
<issue>
to docs: the backing store is considered untrusted and it stores
encrypted information
todo: if it's "really" untrusted then maybe it's better to encrypt everything
(including metadata) -- check how other secrets managers do this.
</issue>
<issue>
more fine grained policy management
1. an explicit deny will override allows
2. have allowed/disallowed/required parameters
3. etc.
# This section grants all access on "secret/*". further restrictions can be
# applied to this broad policy, as shown below.
path "secret/*" {
capabilities = ["create", "read", "update", "patch", "delete", "list", "scan"]
}
# Even though we allowed secret/*, this line explicitly denies
# secret/super-secret. this takes precedence.
path "secret/super-secret" {
capabilities = ["deny"]
}
# Policies can also specify allowed, disallowed, and required parameters. here
# the key "secret/restricted" can only contain "foo" (any value) and "bar" (one
# of "zip" or "zap").
path "secret/restricted" {
capabilities = ["create"]
allowed_parameters = {
"foo" = []
"bar" = ["zip", "zap"]
}
}
but also, instead of going deep down into the policy rabbit hole, maybe
it's better to rely on well-established policy engines like OPA.
A rego-based evaluation will give allow/deny decisions, which SPIKE Nexus
can then honor.
Think about pros/cons of each approach. -- SPIKE can have a good-enough
default policy engine, and for more sophisticated functionality we can
leverage OPA.
</issue>
<issue>
If Nexus has not started, SPIKE Pilot should give a more informative
error message (i.e., Nexus is not ready, not initialized, or
unreachable; please check yadda yadda yadda)
</issue>
<issue>
key rotation
NIST rotation guidance
Periodic rotation of the encryption keys is recommended, even in the
absence of compromise. Due to the nature of the AES-256-GCM encryption
used, keys should be rotated before approximately 2^32 encryptions
have been performed, following the guidelines of NIST publication
800-38D.
SPIKE will automatically rotate the backend encryption key prior to
reaching 2^32 encryption operations by default.
also support manual key rotation
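a sketch of the counting side (the atomic counter and the rotation
hook are assumptions, not existing code):
var encryptionOps atomic.Uint64
const rotateThreshold = uint64(1) << 32 // 2^32 ops per NIST SP 800-38D
// countAndMaybeRotate is called once per encryption operation.
func countAndMaybeRotate() {
    if encryptionOps.Add(1) == rotateThreshold {
        go rotateBackendKey() // hypothetical rotation entry point
    }
}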
</issue>
<issue>
Do an internal security analysis / threat model for spike.
</issue>
<issue>
TODO in-memory "dev mode" for SPIKE #spike (i.e., in-memory mode will not be the default)
nexus --dev or something similar (maybe an env var)
</issue>
<issue>
Use SPIKE in lieu of encryption as a service (similar to transit secrets)
</issue>
<issue>
dynamic secrets
</issue>
<issue>
document how to do checksum verification to ensure that the binaries
you download are authentic.
</issue>
<issue>
docs:
Since the storage backend resides outside the barrier, it’s considered
untrusted so SPIKE will encrypt the data before it sends them to the
storage backend. This mechanism ensures that if a malicious attacker
attempts to gain access to the storage backend, the data cannot be
compromised since it remains encrypted, until SPIKE decrypts the data.
The storage backend provides a durable data persistent layer where data
is secured and available across server restarts.
</issue>
<issue>
use case:
one time access to an extremely limited subset of secrets
(maybe using a one time, or time-bound token)
but also consider if SPIKE needs tokens at all; I think we can piggyback
most of the authentication onto SPIFFE and/or JWT -- having to convert
various kinds of tokens into internal secrets store tokens is largely
unnecessary.
</issue>
<issue>
- TODO Telemetry
- core system metrics
- audit log metrics
- authentication metrics
- database metrics
- policy metrics
- secrets metrics
</issue>
<issue>
"token" secret type
- will be secure random
- will have expiration
</issue>
<issue>
spike dev mode
</issue>
<issue>
document limits and maximums of SPIKE (such as key length, path length, policy size, etc.)
</issue>
<issue>
double encryption when passing secrets around
(can be optional for client-nexus interaction; and can be mandatory for
tools that transcend trust boundaries (as in a relay / message queue that
may be used for secrets federation)
</issue>
<issue>
active/standby HA mode
</issue>
<issue>
document the built-in SPIFFE IDs used by the system.
</issue>
<issue>
pattern-based random secret generation
</issue>
<issue>
admin ui
</issue>
<issue>
guidelines about how to backup and restore
</issue>
<issue>
- AWS KMS support for keepers
- Azure keyvault support for keepers
- GCP kms support for keepers
- HSM support for keepers
- OCI kms support for keepers
- keepers storing their shards in a separate SPIKE deployment
(i.e., SPIKE using another SPIKE to restore the root key)
</issue>
<issue>
postgresql backend
</issue>
<issue>
audit targets:
- file
- syslog
- socket
(if audit targets are enabled, then a command will not execute unless
an audit trail is started)
</issue>
<issue>
OIDC authentication for named admins.
</issue>
<issue>
SPIKE Dynamic secret sidecar injector
</issue>
<issue>
To docs: Why not kubernetes?
ps: also inspire a bit from VSecM docs too.
- Kubernetes is not a secrets management solution. It does have native
support for secrets, but that is quite different from a dedicated
secrets management solution. Kubernetes secrets are scoped to the cluster
only, and many applications will have some services running outside
Kubernetes or in other Kubernetes clusters. Having these applications
use Kubernetes secrets from outside a Kubernetes environment will be
cumbersome and introduce authentication and authorization challenges.
Therefore, considering the secret scope as part of the design process
is critical.
- Kubernetes secrets are static in nature. You can define secrets by using
kubectl or the Kubernetes API, but once they are defined, they are stored
in etcd and presented to pods only during pod creation. Defining secrets
in this manner may create scenarios where secrets get stale, outdated,
or expired, requiring additional workflows to update and rotate the
secrets, and then re-deploy the application to use the new version, which
can add complexity and become quite time-consuming. Ensure consideration
is given to all requirements for secret freshness, updates, and rotation
as part of your design process.
- The secret access management security model is tied to the Kubernetes
RBAC model. This model can be challenging for users who are not familiar
with Kubernetes. Adopting a platform-agnostic security governance model
can enable you to adapt workflows for applications regardless of how and
where they are running.
</issue>
</runner-up>
<backlog>
<issue kind="v1.0-requirement">
- Run SPIKE in Kubernetes too.
</issue>
<issue kind="v1.0-requirement">
- Postgres support as a backing store.
</issue>
<issue kind="v1.0-requirement">
- Ability to channel audit logs to a log aggregator.
</issue>
<issue kind="v1.0-requirement">
- OIDC integration: Ability to connect to an identity provider.
</issue>
<issue kind="v1.0-requirement">
- ESO (External Secrets Operator) integration
</issue>
<issue kind="v1.0-requirement">
- An ADMIN UI (linked to OIDC probably)
</issue>
<issue kind="v1.0-requirement">
- Ability to use the RESTful API without needing an SDK.
That could be hard though since we rely on SPIFFE authentication and
SPIFFE workload API to gather certs: We can use a tool to automate that
part. But it's not that hard either if I know where my certs are:
`curl --cert /path/to/svid_cert.pem --key /path/to/svid_key.pem
https://mtls.example.com/resource`
</issue>
<issue kind="v1.0-requirement">
> 80% unit test coverage
</issue>
<issue kind="v1.0-requirement">
Fuzzing for the user-facing API
</issue>
<issue kind="v1.0-requirement">
100% Integration test (all features will have automated integration tests
in all possible environments)
</issue>
<issue>
By design, we regard memory as the source of truth.
This means that the backing store might miss some secrets.
Find ways to reduce the likelihood of this happening.
1. Implement exponential retries.
2. Implement a health check to ensure backing store is up.
3. Create background jobs to sync the backing store.
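for item 3, a sketch of a periodic re-sync job (the store names and
methods are illustrative, not existing API):
go func() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        for path, secret := range memoryStore.Snapshot() {
            if err := backingStore.StoreSecret(path, secret); err != nil {
                log.Log().Info("sync", "msg", "store failed", "path", path)
            }
        }
    }
}()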
</issue>
<issue>
Test the db backing store.
</issue>
<issue>
Ability to add custom metadata to secrets.
</issue>
<issue>
We need use cases in the website
- Policy-based access control for workloads
- Secret CRUD operations
- etc
</issue>
<issue>
Fleet management:
- There is a management plane cluster
- There is a control plane cluster
- There are workload clusters connected to the control plane
- All of those are their own trust domains.
- There is MP-CP connectivity
- There is CP-WL connectivity
- MP has a central secrets store
- WL and CP need secrets
- Securely dispatch them without "ever" using Kubernetes secrets.
- Have an alternative that uses ESO and a restricted secrets namespace
that no one other than SPIKE components can see into.
</issue>
<issue>
To docs:
how do we manage the root key.
i.e., it never leaves the memory and we keep it alive via replication.
</issue>
<issue>
API for SPIKE nexus to save its shard encrypted with a passphrase for
emergency backup
This will be optional; the admin will be advised to save it securely
outside the machine.
(requires the shamir secret sharing to be implemented)
</issue>
<issue>
Postgresql support for backing store.
</issue>
<issue>
maybe a default auditor SPIFFEID that can only read stuff (for Pilot;
not for named admins; named admins will use the policy system instead)
</issue>
<issue>
Optionally skip creating tables and other DDL during backing store creation.
</issue>
<issue>
What if a keeper instance crashes and goes back up?
if there is an "initialized" Nexus, the keeper can hint Nexus to send
its share again.
</issue>
<issue>
Think about DR scenarios.
</issue>
<issue>
SPIKE Pilot to ingest a policy YAML file(s) to create policies.
(similar to kubectl)
</issue>
<issue>
- SPIKE Keep Sanity Tests
- Ensure that the root key is stored in SPIKE Keep's memory.
- Ensure that SPIKE Keep can return the root key back to SPIKE Nexus.
</issue>
<issue>
Demo: root key recovery.
</issue>
<issue>
If there is a backing store, load all secrets from the backing store
upon crash, which will also populate the key list.
after recovery, all secrets will be there and the system will be
operational.
after recovery admin will lose its session and will need to re-login.
</issue>
<issue>
Test edge cases:
* call api method w/o token.
* call api method w/ invalid token.
* call api method w/o initializing the nexus.
* call init twice.
* call login with bad password.
^ all these cases should return meaningful errors and
the user should be informed of what went wrong.
</issue>
<issue>
Try SPIKE on a Mac.
</issue>
<issue>
Try SPIKE on an x-86 Linux.
</issue>
<issue>
based on the following, maybe move SQLite "create table" ddls to a
separate file.
Still a "tool" or a "job" can do that out-of-band.
update: for SQLite it does not matter as SQLite does not have a concept
of RBAC; creating a db is equivalent to creating a file.
For other databases, it can be considered, so maybe write an ADR for that.
ADR:
It's generally considered better security practice to create the schema
out-of-band (separate from the application) for several reasons:
Principle of Least Privilege:
- The application should only have the permissions it needs for runtime
(INSERT, UPDATE, SELECT, etc.)
- Schema modification rights (CREATE TABLE, ALTER TABLE, etc.) are not
needed during normal operation
- This limits potential damage if the application is compromised
Change Management:
- Database schema changes can be managed through proper migration tools
- Changes can be reviewed, versioned, and rolled back if needed
- Prevents accidental schema modifications during application restarts
Environment Consistency:
- Ensures all environments (dev, staging, prod) have identical schemas
- Reduces risk of schema drift between environments
- Makes it easier to track schema changes in version control
</issue>
<qa>
<issue>
- SPIKE Nexus Sanity Tests
- Ensure SPIKE Nexus caches the root key in memory.
- Ensure SPIKE Nexus reads from SPIKE Keep if it does not have the root
key.
- Ensure SPIKE Nexus saves the encrypted root key to the database.
- Ensure SPIKE Nexus caches the user's session key.
- Ensure SPIKE Nexus removes outdated session keys.
- Ensure SPIKE Nexus does not re-init (without manual intervention)
after being initialized.
- Ensure SPIKE Nexus adheres to the bootstrapping sequence diagram.
- Ensure SPIKE Nexus backs up the admin token by encrypting it with
the root key and storing it in the database.
- Ensure SPIKE Nexus stores the initialization tombstone in the
database.
</issue>
<issue>
- SPIKE Pilot Sanity Tests
- Ensure SPIKE Pilot denies any operation if SPIKE Nexus is not
initialized.
- Ensure SPIKE Pilot can warn if SPIKE Nexus is unreachable.
- Ensure SPIKE Pilot does not hang indefinitely if SPIRE is not there.
- Ensure SPIKE Pilot can get and set a secret.
- Ensure SPIKE Pilot can do a force reset.
- Ensure SPIKE Pilot can recover the root password.
- Ensure that after `spike init` you have a password-encrypted root key
in the db.
- Ensure that you can recover the password-encrypted root key.
</issue>
</qa>
</backlog>
<future>
<issue>
multiple keeper clusters:
keepers:
- nodes: [n1, n2, n3, n4, n5]
- nodes: [dr1, dr2]
if it can't assemble the key back from the first pool, it could try
the next pool, which could be stood up only during disaster recovery.
</issue>
<issue>
a tool to read from one cluster of keepers to hydrate a different
cluster of keepers.
</issue>
<issue>
since OPA knows REST, can we expose a policy evaluation endpoint to
help OPA augment/extend SPIKE policy decisions?
</issue>
<issue>
maybe create an interface for kv, so we can have thread-safe variants too.
</issue>
<issue>
maybe create a password manager tool as an example use case
</issue>
<issue>
A `stats` endpoint to show the overall
system utilization
(how many secrets; how much memory, etc)
</issue>
<issue>
maybe inspire admin UI from keybase
https://keybase.io/v0lk4n/devices
for that, we need an admin ui first :)
for that we need keycloak to experiment with first.
</issue>
<issue>
the current docs are good and all, but they are not good for SEO; we
might want to convert to something like Zola later down the line.
</issue>
<issue>
wrt ADR-0014:
Maybe we should use something S3-compatible as primary storage
instead of sqlite.
But that can wait until we implement other features.
Besides, Postgres support will be something that some of the stakeholders
want to see too.
</issue>
<issue>
SPIKE Dev Mode:
* Single binary
* `keeper` functionality runs in memory
* `nexus` uses an in-memory store, and its functionality is in the single
binary too.
* only networking is between the binary and SPIRE Agent.
* For development only.
The design should be maintainable with code reuse and should not turn into
maintaining two separate projects.
</issue>
<issue>
rate limiting to api endpoints.
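a sketch using golang.org/x/time/rate; a single global limiter for
illustration (per-SPIFFE-ID limiters would be the real design), and
the numbers are placeholders:
import "golang.org/x/time/rate"
var limiter = rate.NewLimiter(rate.Limit(10), 20) // 10 req/s, burst 20
func rateLimit(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            http.Error(w, "too many requests", http.StatusTooManyRequests)
            return
        }
        next(w, r)
    }
}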
</issue>
<issue>
* super admin can create regular admins and other super admins.
* super admin can assign backup admins.
(see drafts.txt for more details)
</issue>
<issue>