by Da Dalt, Francesco, and Perrig, Adrian
Abstract:
Measuring the frequency of items in data streams is a relevant and wide-spread problem in stream analysis and Internet traffic monitoring. This paper studies the problem of sketch-based frequency estimation from a Bayesian statistics point of view which captures uncertainties regarding the frequencies of items in a more flexible and quantitative way compared to the state of the art. We design and implement, based on Markov chain Monte Carlo, a Bayesian frequency estimation sketch that provides both state of the art accuracy, as well as greater functionality compared to other sketches such as confidence bounds for arbitrary levels, and error-function aware frequency estimates. In our theoretical work we derive information-theory related equations such as the expected information gain of a sketch, as well as the optimal least-squares Bayesian frequency estimator. In benchmarks comparing the state of the art, the proposed method achieves the lowest absolute error across all real world data streams, as well as outperforming all sketches on 4 out of 5 metrics on synthetic data. We also show that our method can provide, for multiple confidence levels simultaneously, good confidence levels on both synthetic as well as real data.
Reference:
BFES: Towards Optimal Bayesian Frequency Estimation Sketches in Data-Streams . Da Dalt, Francesco, and Perrig, Adrian. In 2025 IEEE 41st International Conference on Data Engineering (ICDE) 2025.
Bibtex Entry:
@INPROCEEDINGS {dadalt2025bfes,
author = { Da Dalt, Francesco and Perrig, Adrian },
booktitle = { 2025 IEEE 41st International Conference on Data Engineering (ICDE) },
title = {{ BFES: Towards Optimal Bayesian Frequency Estimation Sketches in Data-Streams }},
year = {2025},
volume = {},
ISSN = {2375-026X},
pages = {2337-2350},
abstract = { Measuring the frequency of items in data streams is a relevant and wide-spread problem in stream analysis and Internet traffic monitoring. This paper studies the problem of sketch-based frequency estimation from a Bayesian statistics point of view which captures uncertainties regarding the frequencies of items in a more flexible and quantitative way compared to the state of the art. We design and implement, based on Markov chain Monte Carlo, a Bayesian frequency estimation sketch that provides both state of the art accuracy, as well as greater functionality compared to other sketches such as confidence bounds for arbitrary levels, and error-function aware frequency estimates. In our theoretical work we derive information-theory related equations such as the expected information gain of a sketch, as well as the optimal least-squares Bayesian frequency estimator. In benchmarks comparing the state of the art, the proposed method achieves the lowest absolute error across all real world data streams, as well as outperforming all sketches on 4 out of 5 metrics on synthetic data. We also show that our method can provide, for multiple confidence levels simultaneously, good confidence levels on both synthetic as well as real data. },
keywords = {data stream;sketch;bayesian inference},
doi = {10.1109/ICDE65448.2025.00177},
url = {/publications/papers/BFES_ICDE_CRC_CR_final.pdf},
slides = {/publications/slides/2025_ICDE_BFES.pdf},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month =May}