Feature Infrastructure Engineering: A Comprehensive Guide
How raw events become real-time signals in Machine Learning models
While much of the ML literature focuses almost exclusively on model architecture, features are often taken for granted. Yet in many real-world applications, it’s improvements in features — not models — that can drive the majority of performance gains. In my own experience, I’ve seen countless new model proposals fall flat, while even modest feature improvements consistently deliver impact.
Feature infrastructure engineering is the discipline of turning raw events into high-signal inputs that can be consumed in near real time by models, while also being stored offline for experimentation and training. In this article, we’ll explore how this remarkable engineering feat works: from event streaming with Kafka, to feature computation with Spark and Flink, to online serving with Redis and Cassandra. We’ll also unpack common feature types in the wild — including counter and cross features, user history features, and embeddings — and discuss the design trade-offs that come with each.
Let’s get started.
Engineering components
Event streaming
The origin of any feature is an event, such as a user clicking on an ad, liking a post, or swiping to the next video in their video feed. Note that events themselves have no duration — they simply represent a single point in time at which something happened. For example, a user watching a video for 30 seconds has a duration, so it would technically be a feature (derived from playback start and stop events), not an event.
In order to turn these raw events into features, we need an event streaming platform — a service that lets information about each event be published in one place so that it can be consumed in many other places. The most widely adopted open-source event streaming platform in the industry today is Apache Kafka, used by “more than 80% of all Fortune 100 companies” (Kafka website, as of 2025). In Big Tech, you’ll often see in-house implementations instead, such as Scribe at Meta, Kinesis at Amazon, or Pub/Sub at Google.
Kafka introduces the concepts of producers, consumers, and topics, where:
A producer is a job running inside the web or app backend that sends event metadata into a Kafka topic.
A topic is a distributed, append-only commit log of events — essentially an event stream that multiple consumers can read from. Each topic can be partitioned, allowing Kafka to scale horizontally and support multiple parallel consumers.
A consumer is a job that subscribes to one or more topics and computes derived values from the streamed event data.
For example, in a Newsfeed application, we might have a user_item_clicks topic that streams events containing timestamp, user_id, and item_id for every user click. A Kafka consumer could subscribe to this topic and compute a feature like the rolling count of clicks per user over the past 24 hours. In practice, a single event stream like this can power hundreds of downstream features, each consumed by different ML models or analytics services. This kind of flexible, fan-out architecture would be difficult to achieve with traditional point-to-point messaging systems like RabbitMQ, where messages are typically routed from one producer to one consumer only.
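To make this concrete, here is a minimal Python sketch of such a consumer using the kafka-python client, maintaining an in-memory rolling 24-hour click count per user. The broker address and JSON event schema are assumptions, and a production feature job would write the result to a feature store rather than print it.

```python
# Minimal sketch (not production code) of a Kafka consumer that maintains a
# rolling 24-hour click count per user, assuming the user_item_clicks topic
# from the example above. Broker address and JSON schema are hypothetical.
import json
from collections import defaultdict, deque
from kafka import KafkaConsumer  # pip install kafka-python

WINDOW_SECONDS = 24 * 60 * 60

consumer = KafkaConsumer(
    "user_item_clicks",
    bootstrap_servers="localhost:9092",          # assumption: local broker
    value_deserializer=lambda b: json.loads(b),  # events serialized as JSON
    auto_offset_reset="latest",
)

# Per-user deque of click timestamps inside the rolling window.
clicks_by_user = defaultdict(deque)

for message in consumer:
    event = message.value  # {"timestamp": ..., "user_id": ..., "item_id": ...}
    ts, user_id = event["timestamp"], event["user_id"]

    window = clicks_by_user[user_id]
    window.append(ts)
    # Evict clicks older than 24 hours.
    while window and window[0] < ts - WINDOW_SECONDS:
        window.popleft()

    feature_value = len(window)  # rolling count of clicks in the past 24h
    # In a real feature job this value would be written to the online feature
    # store (see the Redis/Cassandra section below) rather than printed.
    print(f"user {user_id}: {feature_value} clicks in the last 24h")
```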
While many useful events originate from direct user interactions, not all do. Amazon, for example, can stream user actions like clicks or purchases directly from its own backend. But it could also collect and ingest third-party signals — such as payment response codes (from payment processors), chargebacks (from issuing banks), or package scan events (from UPS, FedEx, etc.). Once ingested (via APIs, file drops, or other integrations), these third-party events can then be streamed internally, making them available to any team that needs them in near real-time.
Feature computation
A feature job is a Kafka consumer that computes a value and writes that value into a feature store. Note that at this stage there is technically no difference between features and labels — in the Kafka world both are simply values computed by consumers, and the distinction only comes into play at the modeling stage. Two commonly used tools for launching these consumer jobs are Apache Flink and Apache Spark:
Apache Flink enables streaming processing in near real time. This means feature values are updated almost immediately after an event is logged — often within a few hundred milliseconds.
Apache Spark, on the other hand, uses micro-batching, which increases throughput but introduces slightly higher latency depending on the batch size. That said, Spark jobs can still achieve sub-second latency for many use cases.
The trade-off comes down to latency vs throughput. For low-latency applications like real-time recommendations or fraud detection, Flink is often the better choice. For batch-oriented or less latency-sensitive tasks — think product classification on Amazon or artist recommendation on Spotify — Spark may be sufficient.
That said, latency also depends heavily on the complexity of the feature itself. Simple aggregations can be computed in milliseconds regardless of framework, but features involving joins, large state, or heavy computation (e.g. NLP, embedding projections, or more complex statistics) can dominate overall latency. The framework helps — but not if the feature logic itself is the bottleneck.
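For illustration, here is a hedged sketch of what a micro-batched feature job might look like with Spark Structured Streaming, computing sliding 24-hour click counts from the same user_item_clicks topic. The broker address, event schema, and trigger interval are assumptions, and the console sink stands in for a real feature-store writer.

```python
# Sketch of a Spark Structured Streaming feature job (micro-batching) that
# computes sliding 24-hour click counts per user. Requires the
# spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("user_click_counter").getOrCreate()

schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("user_id", StringType()),
    StructField("item_id", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
    .option("subscribe", "user_item_clicks")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

clicks_24h = (
    events.withWatermark("timestamp", "1 hour")
    .groupBy(window(col("timestamp"), "24 hours", "1 hour"), col("user_id"))
    .count()
)

query = (
    clicks_24h.writeStream.outputMode("update")
    .format("console")                      # stand-in for a feature-store sink
    .trigger(processingTime="30 seconds")   # micro-batch interval
    .start()
)
query.awaitTermination()
```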
Online feature stores
A feature store, as the name implies, stores features so that they can be consumed by downstream ML models and analytics applications. Two data stores commonly used as feature stores are Apache Cassandra and Redis:
Apache Cassandra is a wide-column, distributed NoSQL database optimized for write-heavy workloads. The data is primarily stored on disk, but Cassandra uses in-memory caching to speed up access. Typical latencies range from a few to a few tens of ms.
Redis is an in-memory key-value store optimized for low-latency reads and writes. Because all data is stored inside in-memory hash maps, reads can be extremely fast, often in the sub-ms latency range.
The trade-off is serving cost: while Redis can reduce feature lookup latency by an order of magnitude, it is also significantly more expensive, as RAM storage is typically 10-100x more expensive than disk storage. For that reason, many production systems adopt a hybrid approach, using Redis for “hot keys” (e.g. user IDs with the most frequent requests), and falling back to Cassandra or another disk-backed store for “cold” or infrequent keys.
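A hybrid lookup might look roughly like the following sketch, which tries Redis first and falls back to Cassandra, backfilling the hot store with a TTL. Hosts, keyspace, table, and key naming are all hypothetical.

```python
# Illustrative sketch of a hybrid hot/cold feature lookup: Redis for hot keys,
# Cassandra as the disk-backed fallback, with backfill into Redis on a miss.
import redis                           # pip install redis
from cassandra.cluster import Cluster  # pip install cassandra-driver

r = redis.Redis(host="localhost", port=6379)
session = Cluster(["127.0.0.1"]).connect("feature_store")  # hypothetical keyspace

def get_user_clicks_24h(user_id: str) -> int:
    key = f"user_clicks_24h:{user_id}"

    cached = r.get(key)  # sub-millisecond path for hot keys
    if cached is not None:
        return int(cached)

    # Cold path: disk-backed store, typically a few to tens of milliseconds.
    row = session.execute(
        "SELECT value FROM user_features WHERE feature = %s AND user_id = %s",
        ("clicks_24h", user_id),
    ).one()
    value = int(row.value) if row else 0

    # Promote to hot storage for an hour so repeated lookups stay cheap.
    r.set(key, value, ex=3600)
    return value
```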
Offline feature stores
In addition to online feature stores, we typically persist features in longer-term storage systems such as data lakes (e.g., S3, HDFS, Google Cloud Storage) or data warehouses (e.g., Redshift, Snowflake, BigQuery) to enable offline experimentation on historical data. For example, while Redis might store features for minutes to hours (“hot storage”) and Cassandra for hours to days (“warm storage”), offline stores often retain data for weeks to months (“cold storage”). The data is usually stored in columnar binary formats such as Parquet or ORC to speed up reads and writes, and is typically queryable using SQL-based engines. Typical query latencies range from seconds to minutes, depending on the underlying storage and the complexity of the workload.
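As a small illustration, the sketch below persists feature rows to date-partitioned Parquet (a local directory standing in for S3 or HDFS) and reads one partition back, the way an offline experiment would. Paths and column names are made up.

```python
# Sketch of writing features to cold storage as date-partitioned Parquet and
# reading a single partition back for offline analysis.
import pandas as pd  # uses pyarrow under the hood for Parquet

features = pd.DataFrame({
    "ds": ["2025-06-01", "2025-06-01", "2025-06-02"],  # partition column
    "user_id": ["u1", "u2", "u1"],
    "clicks_24h": [12, 3, 7],
})

# Writes one directory per partition: offline_features/ds=2025-06-01/...
features.to_parquet("offline_features", partition_cols=["ds"])

# Read back only one day, as an offline experiment or training job would.
day = pd.read_parquet("offline_features", filters=[("ds", "=", "2025-06-01")])
print(day)
```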
A critical requirement for any feature platform is offline/online consistency, that is, features in hot storage (i.e., at serving time) need to be identical to features in cold storage (i.e., at training time). If they aren’t, then the model will not be as good at serving time as we thought based on offline experimentation, which can be very difficult to debug.
When new features are onboarded, it is therefore a good practice to log the online features for a certain period of time (a few hours to a few days) and then manually confirm consistency with the offline feature store. A more holistic (although technically difficult) way to confirm offline/online parity is to evaluate the ML model using both offline and online features on the same eval data, and then explicitly check for parity in the model predictions or in the predictive performance metrics such as AUC.
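One way to automate the first of these checks is a simple parity report that joins logged online values against the offline store and flags disagreements. The sketch below assumes Parquet logs, a clicks_24h feature, and a mismatch-rate threshold, all of which are hypothetical.

```python
# Sketch of an offline/online parity check: join logged online feature values
# with offline recomputations on (user_id, timestamp) and measure disagreement.
import pandas as pd

online = pd.read_parquet("logged_online_features.parquet")   # logged at serving time
offline = pd.read_parquet("offline_features.parquet")        # recomputed offline

joined = online.merge(
    offline,
    on=["user_id", "timestamp"],
    suffixes=("_online", "_offline"),
)

mismatch = joined["clicks_24h_online"] != joined["clicks_24h_offline"]
rate = mismatch.mean()
print(f"offline/online mismatch rate: {rate:.4%}")
assert rate < 0.001, "offline/online consistency check failed"
```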
Finally, managed feature stores provide a high-level API that abstracts away the complexity of storage, feature lookup, and offline/online consistency. The most well-known open-source solution is Feast, used by companies such as Robinhood, NVIDIA, Discord, Walmart, Shopify, and many others. Cloud providers also offer commercial managed options, including AWS SageMaker Feature Store, Databricks Feature Store, and Azure Feature Store. In larger organizations, it’s common to build custom in-house solutions — for example, Palette at Uber and Zipline at Airbnb.
Feature types
It’s useful to distinguish between offline and online features:
Offline features are typically static or slowly changing attributes, such as a user’s country, age, spoken languages, or a product’s category and artist genre.
Online features are rapidly changing behavioral signals, such as the rolling number of user clicks, merchant purchases, or artist plays over a recent time window.
With that distinction in mind, let’s take a closer look at some of the most common feature types you’ll encounter in production systems: counter and cross features, user history features, and embedding features — along with their respective design trade-offs.
Counter features
Counter features do exactly what the name implies: they count specific types of events for specific keys over a given lookback window. A simple example we saw earlier is the total number of user clicks in the past 24 hours.
Cross features, which count events over combinations of entities (e.g. user × brand), can be especially valuable. For example, when predicting whether a user will click on an ad, it’s far more useful to know how many times that user has clicked on ads from the same brand in the past 30 days, rather than just total clicks overall. Cross features are so impactful that Google’s landmark Wide & Deep model (Cheng et al., 2016) feeds them into the wide (linear) part of the network, whose output is combined with the deep component’s output directly in the final logit.
A key design choice for counter features is the window size. If the window is too short, the feature may become too noisy (e.g. spiky user behavior); too long, and the signal becomes stale. In practice, it’s common to compute multiple rolling windows — such as past 1 hour, 24 hours, 7 days, and 30 days — and feed all of them into the model. This allows the model to benefit from both short-term freshness and long-term stability.
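The toy sketch below illustrates the idea with an in-memory computation of user × brand click counts over the four windows mentioned above. Field names and window choices are illustrative; a real system would compute these incrementally in the streaming job rather than rescanning events.

```python
# Toy sketch of multi-window counter / cross features: counting (user, brand)
# click events over several lookback windows at feature-computation time.
from datetime import datetime, timedelta

WINDOWS = {"1h": timedelta(hours=1), "24h": timedelta(hours=24),
           "7d": timedelta(days=7), "30d": timedelta(days=30)}

def cross_counters(events, user_id, brand_id, now):
    """events: iterable of dicts with 'ts', 'user_id', 'brand_id'."""
    features = {}
    for name, span in WINDOWS.items():
        features[f"user_brand_clicks_{name}"] = sum(
            1 for e in events
            if e["user_id"] == user_id and e["brand_id"] == brand_id
            and now - span <= e["ts"] <= now
        )
    return features

now = datetime(2025, 6, 1, 12, 0)
events = [
    {"ts": now - timedelta(minutes=30), "user_id": "u1", "brand_id": "b9"},
    {"ts": now - timedelta(days=2),     "user_id": "u1", "brand_id": "b9"},
    {"ts": now - timedelta(days=20),    "user_id": "u1", "brand_id": "b9"},
]
print(cross_counters(events, "u1", "b9", now))
# {'user_brand_clicks_1h': 1, 'user_brand_clicks_24h': 1,
#  'user_brand_clicks_7d': 2, 'user_brand_clicks_30d': 3}
```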
User history features
User history features — such as the list of clicked posts, purchased items, or watched videos — are arguably the most powerful signals in modern recommender systems. In general, the richer and longer the user history, the better we can predict what type of content a user is likely to engage with next.
As with counter features, a key design consideration is the window size. Longer histories tend to provide more signal, but they also increase storage and serving cost — especially when served from in-memory stores like Redis. Whether to increase history length ultimately becomes a return-on-investment question: does the gain in model performance justify the infra cost? Unfortunately, this can turn into a chicken-and-egg problem — how do we estimate the value of a longer history window before we’ve logged and trained on it?
Another design choice is the event granularity. In a Newsfeed, for instance, we might log each user engagement type (like, share, comment, etc.) as separate features, or we might bundle them into a single sequence with event-type metadata. The bundled approach is typically more efficient to store and serve, but it requires more complex downstream parsing if we want to isolate specific types of engagement.
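The toy example below contrasts the two layouts and shows the extra parsing step the bundled approach requires; all field names are made up.

```python
# Two ways to represent user history, as discussed above.

# Option A: one feature (sequence) per engagement type.
history_likes    = ["post_12", "post_98"]
history_comments = ["post_45"]

# Option B: one bundled sequence with event-type metadata.
history_bundled = [
    {"ts": 1717240000, "event": "like",    "item_id": "post_12"},
    {"ts": 1717240100, "event": "comment", "item_id": "post_45"},
    {"ts": 1717240200, "event": "like",    "item_id": "post_98"},
]

# Downstream parsing needed to isolate one engagement type from the bundle.
likes = [e["item_id"] for e in history_bundled if e["event"] == "like"]
assert likes == history_likes
```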
Embedding features
In addition to learned embedding tables (which map hashed ids into dense vectors), a model may also consume precomputed embeddings. For example, a Spark job might periodically pool a user’s engagement history into a single user embedding, which is then stored in the feature store for use during inference. In practice, a model can take multiple embeddings as input — some learned end-to-end, others precomputed — and treat all of them as features.
Retrieval models are a particularly interesting case here because embeddings serve a dual role. During offline training, we might learn user and item embedding tables using a two-tower architecture trained on interaction data (e.g., clicks or purchases). Once trained, we precompute item embeddings and index them with an approximate nearest neighbor library or vector database (e.g., FAISS) to support fast similarity search at inference time. When a user request comes in, we compute the corresponding user embedding with the user tower, and then retrieve the top-k most similar items from the index using a similarity metric such as dot product or cosine similarity.
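The sketch below shows the serving-side mechanics with FAISS, using random vectors as stand-ins for the trained item and user tower outputs.

```python
# Minimal retrieval sketch: precomputed item embeddings in a FAISS index,
# a user embedding computed at request time, and top-k retrieval by inner
# product (cosine similarity after L2 normalization).
import faiss  # pip install faiss-cpu
import numpy as np

d, num_items, k = 64, 10_000, 5

# Offline: item tower outputs, indexed for similarity search.
item_embeddings = np.random.rand(num_items, d).astype("float32")
faiss.normalize_L2(item_embeddings)   # normalize so inner product == cosine
index = faiss.IndexFlatIP(d)          # exact inner-product index
index.add(item_embeddings)

# Online: user tower output for the incoming request.
user_embedding = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_embedding)

scores, item_ids = index.search(user_embedding, k)
print("top-k item ids:", item_ids[0], "scores:", scores[0])
```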
Outlook
Feature infrastructure engineering is one of the highest-leverage investments a modern ML team can make. Increasing the number, quality, lookback window, and serving speed of production features is often the fastest path to more accurate and more responsive models.
But feature infra is far from a solved problem. One of the hardest challenges is point-in-time correctness — ensuring that features (and labels) only access events that occurred before the prediction time, not after. This sounds simple, but in practice is hard to guarantee because Kafka consumers aren’t always in sync. For instance, if a feature job updates a click counter before the label job logs that same click as a label, the feature “leaks the future,” inflating model accuracy. Point-in-time–correct join libraries are essential to avoid this.
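A common building block here is an as-of join that, for each label row, picks only the latest feature value computed before the prediction timestamp. The sketch below uses pandas.merge_asof with illustrative column names; dedicated point-in-time join libraries apply the same idea at scale.

```python
# Sketch of a point-in-time-correct join: each label row only sees the most
# recent feature value computed strictly before its prediction timestamp.
import pandas as pd

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "feature_ts": pd.to_datetime(
        ["2025-06-01 10:00", "2025-06-01 11:00", "2025-06-01 12:00"]),
    "clicks_24h": [3, 4, 5],
}).sort_values("feature_ts")

labels = pd.DataFrame({
    "user_id": ["u1"],
    "prediction_ts": pd.to_datetime(["2025-06-01 11:30"]),
    "clicked": [1],
}).sort_values("prediction_ts")

training_rows = pd.merge_asof(
    labels, features,
    left_on="prediction_ts", right_on="feature_ts",
    by="user_id",
    direction="backward",        # only features from before prediction time
    allow_exact_matches=False,   # exclude features computed at the same instant
)
print(training_rows)  # picks clicks_24h == 4 (the 11:00 value), not the 12:00 one
```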
Another common pitfall is housekeeping. Creating new features is easy once event streams are in place — but keeping track of what features exist, what they compute, and who uses them is much harder. In large orgs, this metadata often lives only in code or tribal knowledge. Every feature should be documented with a description, creation date, and change log — a task that could be potentially automated with LLMs.
Housekeeping also includes deprecation. Removing unused features is surprisingly hard, since usage extends beyond the production model to A/B tests, dashboards, and backtests that may run for months. Without careful dependency tracking, unused features linger.
Let’s return to the premise of this article: if feature infrastructure is so impactful, why is modeling work so popular? I’d argue it’s because modeling is easy — or at least easier. Anyone familiar with Python and modern ML frameworks can tweak a neural architecture in a few lines of code. Feature infrastructure engineering, on the other hand, demands a much deeper understanding not just of modeling, but of the full stack, from event generation to stream processing to feature serving. It’s harder, messier, and less visible — but also where much of the real leverage lives.
Models are only as good as the features they train on. Go forth and engineer them wisely.
Special thanks to Warren Lemmon for a thoughtful review of this article.


