Creating a Usage-Based Billing Package: A Comprehensive Guide

by StackCamp Team

Hey guys! Today, we're diving deep into the creation of a usage-based billing package, a crucial step in our journey to revamp our billing system. This is part of phase 1 of the Usage-based Billing RFC, and it's going to be super exciting. We'll be covering everything from the initial setup to the nitty-gritty details of event tracking and publishing. So, grab your favorite beverage, and let's get started!

Introducing the usage Package

The core of our new system is the usage package. This package is where all the magic happens when it comes to storing and publishing usage events. Think of it as the central hub for everything related to tracking how our users are utilizing the system. This is a significant shift, and it's going to give us a much clearer picture of our resource consumption.

The usage package will encapsulate all the logic required to record, process, and publish usage data. This includes defining event types, creating data structures for events, and setting up the mechanisms for collecting and transmitting this data. By centralizing these functionalities within a dedicated package, we ensure a more organized and maintainable codebase. This approach also makes it easier to extend and modify the usage tracking system in the future, accommodating new event types and usage metrics as needed.

One of the key benefits of introducing the usage package is the improved clarity and modularity it brings to our system. Instead of scattering usage tracking logic across different parts of the codebase, we now have a single, well-defined module responsible for all usage-related tasks. This makes it easier to understand how usage is being tracked and to make changes or improvements to the system. Moreover, the usage package promotes code reuse and reduces the risk of inconsistencies or errors in usage tracking.

Enterprise-Licensed Code and the Collector Interface

For our enterprise clients, the real collector implementation lives in enterprise-licensed code. This is where things get a bit technical, but stick with me! In the AGPL (GNU Affero General Public License) code, we're defining a set of interfaces along with a no-op (no operation) stub, so the open-source build compiles and runs without the enterprise implementation. This keeps our code compliant and flexible.

Let's break down the code snippets:

package usage

import "context"

// EventType enumerates the kinds of usage events we track.
type EventType string

// UsageEvent is the contract every usage event struct must satisfy.
type UsageEvent interface {
	usageEvent() // unexported, to enforce that events come from this package
	EventType() EventType
}

// UsageCollector records usage events; implementations can write to a
// database, publish to a queue, or (in the AGPL build) do nothing at all.
type UsageCollector interface {
	RecordUsage(ctx context.Context, event UsageEvent) error
}

Here's what each part means:

  • EventType: This is essentially an enumeration (enum) that defines the different types of usage events we'll be tracking. Think of it as a way to categorize events, such as workspace creations, task executions, or data storage usage. Each event type will have a unique identifier, allowing us to differentiate and analyze them effectively.

  • UsageEvent: This interface is a contract that all usage event structs must adhere to. It includes a usageEvent() method to ensure that all events originate from this package, and an EventType() method to return the event type. This ensures consistency and makes it easier to work with different types of usage events in a uniform way. The UsageEvent interface acts as a common ground for all usage events, enabling us to handle them generically and avoid type-specific logic in many cases.

  • UsageCollector: This interface defines the contract for collecting usage events. It has a RecordUsage method that takes a context and an event as input. The context lets implementations honor cancellation and timeouts, and the error return lets callers handle failures gracefully. The UsageCollector interface is crucial for decoupling the event recording process from the rest of the system. It allows us to easily swap out different event collection mechanisms, such as writing to a database, sending to a message queue, or using a third-party analytics service.

The collector interface is crucial for decoupling our system. It allows us to switch out the underlying implementation without affecting the rest of the codebase. For example, we can easily swap out the database connection or use a different event publishing mechanism. This flexibility is essential for maintaining a robust and scalable system. By defining a clear interface, we ensure that any changes to the event collection process are isolated and do not introduce unintended side effects elsewhere in the system.
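To make the AGPL no-op stub concrete, here's a minimal sketch of what it might look like. The name NoopCollector is an illustrative assumption, not necessarily the actual type in the codebase:

// NoopCollector is a hypothetical no-op stub satisfying UsageCollector.
type NoopCollector struct{}

// RecordUsage intentionally does nothing, so AGPL builds can depend on
// the UsageCollector interface without the enterprise implementation.
func (NoopCollector) RecordUsage(ctx context.Context, event UsageEvent) error {
	return nil
}

With this in place, callers always talk to a UsageCollector; whether events actually go anywhere depends on which implementation the deployment wires in.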

Adding the usage_events Table

As per our RFC, we're adding a new table called usage_events. This table will be the central repository for all our usage data. It's going to be structured to efficiently store and query event information. This is a critical component as it forms the backbone of our usage-based billing system.

The usage_events table will likely include columns such as:

  • event_id: A unique identifier for each event.

  • event_type: The type of the event (e.g., workspace creation, task execution).

  • event_data: A JSON blob containing event-specific data.

  • timestamp: The time the event occurred.

  • user_id: The ID of the user associated with the event.

  • workspace_id: The ID of the workspace associated with the event.

This structure allows us to efficiently query and analyze usage data across different dimensions, such as event type, time range, user, and workspace. We can use this data to generate reports, identify usage patterns, and, of course, calculate billing amounts. The design of the usage_events table is crucial for performance and scalability. We need to ensure that the table is properly indexed and optimized for the types of queries we will be running. This may involve choosing appropriate data types, creating indexes on frequently queried columns, and partitioning the table based on time or other criteria.
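To make that concrete, here's a rough Go representation of a single usage_events row. The column names and types below are assumptions derived from the list above, not the actual schema:

import (
	"encoding/json"
	"time"

	"github.com/google/uuid"
)

// UsageEventRow is a hypothetical mapping of one usage_events row.
type UsageEventRow struct {
	EventID     uuid.UUID       // event_id: unique identifier for the event
	EventType   EventType       // event_type: e.g. workspace creation
	EventData   json.RawMessage // event_data: event-specific JSON blob
	Timestamp   time.Time       // timestamp: when the event occurred
	UserID      uuid.UUID       // user_id: the associated user
	WorkspaceID uuid.UUID       // workspace_id: the associated workspace
}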

The creation of the usage_events table is a significant step towards a more granular and accurate billing system. By storing detailed usage data, we can move away from fixed-price plans and offer our users more flexible and cost-effective options. This also opens up opportunities for usage-based pricing, where users are charged based on their actual consumption of resources. This approach aligns our pricing with the value our users are receiving, making our services more attractive and competitive. Furthermore, the data stored in the usage_events table can be used for a variety of purposes beyond billing, such as monitoring system usage, identifying performance bottlenecks, and understanding user behavior. This makes the table a valuable asset for our entire organization.

Versioning Event Structs

To maintain consistency and prevent breaking changes, all event structs must be versioned. This means that each time we change the structure of an event, we'll create a new version of the struct. This is super important because it allows us to evolve our event schema without disrupting existing data or processes. Think of it like version control for our events!

Versioning event structs ensures backward compatibility. If we need to add a new field to an event, we can create a new version of the struct without affecting the processing of older events. This is crucial for maintaining the integrity of our data and preventing errors. Without versioning, changes to event structures could lead to data corruption, processing failures, and inaccurate billing calculations.

Versioning also provides a clear audit trail of event schema changes. We can easily track the evolution of our event structures over time, understanding what fields were added, removed, or modified. This is invaluable for debugging, troubleshooting, and understanding the history of our system. A clear audit trail also makes it easier to comply with regulatory requirements and industry best practices.

There are several strategies we can use for versioning event structs:

  • Suffixing the struct name with a version number: For example, WorkspaceCreatedV1, WorkspaceCreatedV2. This is a simple and straightforward approach (see the sketch below).

  • Using a dedicated version field in the struct: This allows us to store the version number explicitly within the event data.

  • Using a schema registry: A schema registry is a centralized repository for storing and managing event schemas. This provides a more robust and scalable solution for versioning events.

The choice of versioning strategy depends on the complexity of our system and our specific requirements. Regardless of the strategy we choose, it's essential to establish clear guidelines and processes for versioning events. This will ensure consistency and prevent errors. By implementing versioning, we future-proof our system and make it easier to adapt to changing requirements.
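Here's a minimal sketch of the suffix-versioning approach. The event names and fields are illustrative assumptions, not structs from the RFC:

// WorkspaceCreatedV1 is a hypothetical first version of an event.
type WorkspaceCreatedV1 struct {
	WorkspaceID string `json:"workspace_id"`
}

func (WorkspaceCreatedV1) usageEvent()          {}
func (WorkspaceCreatedV1) EventType() EventType { return "workspace_created_v1" }

// WorkspaceCreatedV2 adds a field as a brand-new struct, so code and
// stored data that rely on V1 keep working unchanged.
type WorkspaceCreatedV2 struct {
	WorkspaceID string `json:"workspace_id"`
	OwnerID     string `json:"owner_id"` // new in V2
}

func (WorkspaceCreatedV2) usageEvent()          {}
func (WorkspaceCreatedV2) EventType() EventType { return "workspace_created_v2" }

Because both structs satisfy UsageEvent, the rest of the pipeline can handle them generically while consumers branch on EventType when the version matters.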

CODEOWNERS and Event Struct Protection

We're adding @deansheather to CODEOWNERS to prevent unintentional changes to the structs. This is a safeguard to ensure that only authorized personnel can modify the event structures. It's like having a gatekeeper for our critical data structures!

The CODEOWNERS file in a repository specifies individuals or teams that are responsible for specific parts of the codebase. When a pull request is created that modifies a file covered by a CODEOWNERS entry, the specified owners are automatically requested to review the changes. This ensures that changes are reviewed by those with the most expertise and context. By including @deansheather in the CODEOWNERS file for the usage package, we ensure that any changes to event structs are reviewed and approved by Dean. This helps prevent accidental or incorrect modifications that could have serious consequences.
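For reference, a CODEOWNERS entry for this might look something like the following; the path to the usage package is an assumption:

# Hypothetical entry: route reviews of the usage package to Dean.
/usage/ @deansheather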

Protecting event structs is crucial for several reasons:

  • Data Integrity: Event structs define the structure of our usage data. Changes to these structs can lead to data corruption or inconsistencies, which can impact billing accuracy and reporting.

  • Backward Compatibility: As mentioned earlier, versioning event structs is essential for maintaining backward compatibility. Unauthorized changes to structs can break existing processes and applications that rely on the previous structure.

  • System Stability: Incorrectly modified event structs can cause runtime errors and system instability. By protecting these structs, we reduce the risk of such issues.

In addition to CODEOWNERS, we can use other mechanisms to protect event structs, such as:

  • Code Reviews: Requiring thorough code reviews for any changes to the usage package.

  • Automated Testing: Implementing comprehensive unit and integration tests to ensure that changes to event structs do not introduce errors.

  • Access Control: Restricting access to the usage package to a limited number of developers.

By combining these measures, we can create a robust system for protecting our event structs and ensuring the integrity of our usage data. This is essential for maintaining the reliability and accuracy of our usage-based billing system.

Publishing to Tallyman

We'll be publishing usage events to Tallyman, our billing system, using a goroutine. This is an asynchronous process, meaning it won't block request handling. We'll only do this if the deployment has a license with publish_usage_data=true. This ensures that we're only publishing data for deployments that are authorized to do so.

Publishing to Tallyman involves several steps:

  1. Collecting Usage Events: As events occur in our system, they are recorded and stored in the usage_events table.
  2. Transforming Events: The events may need to be transformed into a format that Tallyman can understand. This may involve mapping fields, aggregating data, or applying business logic.
  3. Publishing Events: The transformed events are then sent to Tallyman using a goroutine. This ensures that the publishing process does not block the main thread and allows our system to continue processing requests.
  4. Handling Errors: If an error occurs during the publishing process, we need to handle it gracefully. This may involve logging the error, retrying the publication, or notifying the appropriate personnel.

Using a goroutine for publishing to Tallyman provides several benefits:

  • Asynchronous Processing: As mentioned earlier, it prevents the publishing process from blocking the main thread.

  • Scalability: Goroutines are lightweight and efficient, allowing us to handle a large volume of events concurrently.

  • Resilience: If one publishing operation fails, it does not affect other operations.

The publish_usage_data flag in the license is a crucial safeguard. It ensures that we are only publishing usage data for deployments that have explicitly authorized this. This is important for privacy, security, and compliance reasons. We need to ensure that this flag is properly set and enforced. We also need to have mechanisms in place to audit and monitor the publishing process. This will help us identify and address any issues that may arise.
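Here's a hedged sketch of what that publishing goroutine might look like. The License type, its PublishUsageData field, and publishToTallyman are simplified stand-ins for the real pieces, not the actual API:

import (
	"context"
	"log"
)

// License is a simplified stand-in for the deployment license; the real
// publish_usage_data flag lives on the actual license type.
type License struct {
	PublishUsageData bool
}

// publishToTallyman is a placeholder for the real call to Tallyman.
func publishToTallyman(ctx context.Context, ev UsageEvent) error {
	return nil
}

// StartPublisher launches a background goroutine that drains events and
// sends them to Tallyman, but only for licensed deployments.
func StartPublisher(ctx context.Context, lic License, events <-chan UsageEvent) {
	if !lic.PublishUsageData {
		return // deployment is not licensed to publish usage data
	}
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case ev := <-events:
				if err := publishToTallyman(ctx, ev); err != nil {
					// Log and move on; a retry queue could be layered on here.
					log.Printf("publish usage event %s: %v", ev.EventType(), err)
				}
			}
		}
	}()
}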

Generating the Initial dc_ai_workspaces_v1_genesis Event

We're generating an initial dc_ai_workspaces_v1_genesis event with existing data ONCE. This is a one-time event that will bootstrap our usage data. It's like setting the baseline for our new system.

This event will likely contain information about the existing workspaces in our system, such as their creation date, owner, and resource usage. It's essential to generate this event accurately and only once to avoid duplicating data or skewing our usage metrics.

The data for this event will be sourced from our existing data stores, such as the workspace_builds table. We need to ensure that the data is consistent and accurate before generating the event. This may involve data cleaning, transformation, and validation. We also need to have a clear process for generating the event and verifying its accuracy. This may involve manual checks, automated scripts, and data quality reports.

The dc_ai_workspaces_v1_genesis event serves as a snapshot of our system at a specific point in time. It provides a historical baseline for tracking usage and billing. This baseline is essential for comparing current usage with past usage and identifying trends. It also allows us to reconcile our new usage-based billing system with our existing billing system.
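As a sketch of how that once-only guarantee might be enforced, here's some hypothetical Go; the query, the use of the event type as a marker, and the table layout are all assumptions:

import (
	"context"
	"database/sql"
)

// generateGenesisEvent backfills the one-time genesis event, skipping
// the work entirely if a genesis event has already been recorded.
func generateGenesisEvent(ctx context.Context, db *sql.DB, c UsageCollector) error {
	var exists bool
	err := db.QueryRowContext(ctx,
		`SELECT EXISTS(SELECT 1 FROM usage_events WHERE event_type = $1)`,
		"dc_ai_workspaces_v1_genesis",
	).Scan(&exists)
	if err != nil {
		return err
	}
	if exists {
		return nil // already bootstrapped; never generate the event twice
	}
	// Build the event from existing data (e.g. workspace_builds) and record
	// it via c.RecordUsage. The construction is elided here because the
	// real struct fields are defined by the RFC, not by this sketch.
	return nil
}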

Recording Events for New has_ai_task Workspaces

We're adding code to the provisionerd server to record an event when new has_ai_task workspaces are built. This is crucial for tracking the usage of AI-powered workspaces. It allows us to accurately measure the resources consumed by these workspaces and bill users accordingly.

The provisionerd server is responsible for provisioning new workspaces. By adding code to this server, we can ensure that usage events are recorded automatically whenever a new has_ai_task workspace is created. The event will likely include information such as the workspace ID, creation time, owner, and resource allocation. This information is essential for tracking the cost and performance of these workspaces.

Recording events for new has_ai_task workspaces is a key step towards a more granular and accurate billing system. It allows us to charge users based on their actual consumption of AI resources. This approach aligns our pricing with the value users are receiving and makes our services more competitive. It also provides valuable data for understanding the usage patterns of AI-powered workspaces. This data can be used to optimize resource allocation, identify performance bottlenecks, and improve the overall user experience.

We need to ensure that the event recording code is efficient and reliable. It should not introduce any performance overhead or disrupt the provisioning process. We also need to have mechanisms in place to monitor and audit the event recording process. This will help us identify and address any issues that may arise.
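Here's a rough sketch of where such a hook could live, reusing the hypothetical WorkspaceCreatedV1 from the versioning section; the build type and its fields are also illustrative assumptions, not the actual provisionerd code:

// WorkspaceBuild is a simplified stand-in for the real build record.
type WorkspaceBuild struct {
	WorkspaceID string
	HasAITask   bool
}

// recordBuildUsage records a usage event for has_ai_task workspaces once
// a build completes. The collector is whichever implementation the
// deployment wired in (enterprise or no-op).
func recordBuildUsage(ctx context.Context, c UsageCollector, build WorkspaceBuild) {
	if !build.HasAITask {
		return
	}
	ev := WorkspaceCreatedV1{WorkspaceID: build.WorkspaceID} // illustrative event
	if err := c.RecordUsage(ctx, ev); err != nil {
		// A failed usage record should be logged, not fail the build.
		log.Printf("record usage event for workspace %s: %v", build.WorkspaceID, err)
	}
}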

Counting Managed Agent Events

Finally, we're changing the managed agent count function to count events from the new table rather than from the workspace_builds table. This ensures that we're using the most accurate and up-to-date data for our calculations. By switching to the usage_events table, we're leveraging the detailed usage data we've been collecting. This provides a more precise count of managed agent events compared to relying on the workspace_builds table.

The managed agent count is a key metric for billing and resource allocation. It represents the number of active agents in our system. By accurately counting these agents, we can ensure that we're charging users fairly and allocating resources efficiently. The workspace_builds table may not always provide an accurate count of managed agents. For example, it may not capture agents that are created outside of the workspace building process. By switching to the usage_events table, we address this issue and ensure that our count is more reliable.

This change also simplifies our codebase. We no longer need to rely on multiple data sources for counting managed agents. The usage_events table becomes the single source of truth for this metric. This makes our system easier to understand, maintain, and troubleshoot.

We need to ensure that the new managed agent count function is thoroughly tested and validated. This will help us identify and address any issues before they impact our billing or resource allocation. We also need to monitor the new function closely to ensure that it continues to provide accurate results.
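A minimal sketch of what the new count might look like, assuming a SQL backend; the event type string and the query are assumptions, not the actual implementation:

// countManagedAgents counts managed agent events from usage_events
// instead of deriving the number from workspace_builds.
func countManagedAgents(ctx context.Context, db *sql.DB) (int64, error) {
	var n int64
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM usage_events WHERE event_type = $1`,
		"dc_managed_agents_v1", // hypothetical event type name
	).Scan(&n)
	return n, err
}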

Conclusion

So, there you have it! We've walked through the creation of our usage-based billing package, covering everything from the usage package and enterprise-licensed code to publishing to Tallyman and counting managed agent events. This is a huge step forward in our journey to a more flexible and accurate billing system. By implementing these changes, we're not only improving our billing process but also gaining valuable insights into how our users are interacting with our platform. This will allow us to make more informed decisions about resource allocation, pricing, and feature development. Remember, this is just phase 1, and there's more to come. But with this foundation in place, we're well-positioned to build a world-class usage-based billing system. Thanks for joining me on this journey, and stay tuned for more updates!