Probabilistic database

<h2 id="terminology">Terminology</h2>
<p>In a probabilistic database, each tuple is associated with a probability between 0 and 1, with 0 representing that the data is certainly incorrect, and 1 representing that it is certainly correct.
</p>
<h3>Possible worlds</h3>
<p>A probabilistic database could exist in multiple states. For example, if there is uncertainty about the existence of a tuple in the database, then the database could be in two different states with respect to that tuple—the first state contains the tuple, while the second one does not. Similarly, if an attribute can take one of the values <i>x</i>, <i>y</i> or <i>z</i>, then the database can be in three different states with respect to that attribute.
</p><p>Each of these <i>states</i> is called a possible world.
</p><p>Consider the following database:
</p>
An Incomplete Database<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a1</td><td>b1</td></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>{b3, b3′, b3′′}</td></tr></tbody></table>
<p>(Here <i>{b3, b3′, b3′′}</i> denotes that the attribute can take any of the values <i>b3</i>, <i>b3′</i> or <i>b3′′</i>)
</p>
<ul><li>Assume that there is uncertainty about the first tuple, certainty about the second tuple, and uncertainty about the value of attribute B in the third tuple.</li></ul>
<p>Then the actual state of the database may or may not contain the first tuple (depending on whether it is correct or not). Similarly, the value of attribute B in the third tuple may be <i>b3</i>, <i>b3′</i> or <i>b3′′</i>.
</p><p>Consequently, the possible worlds corresponding to the database are as follows:
</p>
World 1<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a1</td><td>b1</td></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3</td></tr></tbody></table>
World 2<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a1</td><td>b1</td></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3′</td></tr></tbody></table>
World 3<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a1</td><td>b1</td></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3′′</td></tr></tbody></table>

World 4<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3</td></tr></tbody></table>
World 5<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3′</td></tr></tbody></table>
World 6<table><tbody><tr><th>A</th><th>B</th></tr><tr><td>a2</td><td>b2</td></tr><tr><td>a3</td><td>b3′′</td></tr></tbody></table>
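<p>The enumeration above can be reproduced with a short Python sketch. The row encoding (a value for A, a list of candidate values for B, and a flag marking whether the tuple's existence is uncertain) and the helper names <code>alternatives</code> and <code>possible_worlds</code> are assumptions made for this illustration; they are not taken from any particular probabilistic database system.
</p>

```python
from itertools import product

# The incomplete database from the example above (encoding is hypothetical).
# Each row: (value of A, candidate values of B, True if the tuple may be absent)
rows = [
    ("a1", ["b1"], True),                    # uncertain whether the tuple exists
    ("a2", ["b2"], False),                   # certain tuple
    ("a3", ["b3", "b3′", "b3′′"], False),    # certain tuple, uncertain attribute B
]

def alternatives(a, b_values, maybe_absent):
    """Return every state one row can be in; None stands for 'tuple absent'."""
    options = [(a, b) for b in b_values]
    if maybe_absent:
        options.append(None)
    return options

def possible_worlds(rows):
    """Combine the per-row alternatives into the list of all possible worlds."""
    worlds = []
    for choice in product(*(alternatives(*row) for row in rows)):
        # Drop the None placeholders so each world lists only the tuples it contains.
        worlds.append([t for t in choice if t is not None])
    return worlds

worlds = possible_worlds(rows)
for i, world in enumerate(worlds, start=1):
    print(f"World {i}: {world}")
```

<p>The number of worlds is the product of the number of alternatives per row, here 2 × 1 × 3 = 6, which is why the number of possible worlds can grow exponentially with the number of uncertain items.
</p>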

<h3>Types of Uncertainties</h3>
<p>There are essentially two kinds of uncertainties that could exist in a probabilistic database, as described in the table below:
</p>
Types of Uncertainties<table><tbody><tr><th>Tuple-level uncertainty</th><th>Attribute-level uncertainty</th></tr><tr><td>Uncertainty about whether a tuple is correct, that is, whether it should exist in the database at all.</td><td>Uncertainty about the value of an attribute of a tuple, that is, the attribute could take any one of several possible values.</td></tr><tr><td>Each uncertain tuple gives rise to two possible worlds: one that includes the tuple and one that does not.</td><td>Each uncertain attribute that can take one of the values <i>a1,...,an</i> gives rise to <i>n</i> possible worlds.</td></tr><tr><td>Tuple-level uncertainty can be modeled as a Boolean random variable associated with each uncertain tuple.</td><td>Attribute-level uncertainty can be modeled as a random variable, associated with each uncertain attribute, that takes one of the values <i>a1,...,an</i>.</td></tr></tbody></table>
<p>By assigning values to random variables associated with the data items, different possible worlds can be represented.
</p>
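<p>As a minimal sketch of this idea, the Python code below attaches a random variable to each uncertain item of the example database and derives a probability for every possible world. The probability values are hypothetical, chosen only for illustration, and the two variables are assumed to be independent; the names <code>x1_dist</code>, <code>x3_dist</code>, <code>world_for</code> and <code>probability</code> are likewise assumptions of this example.
</p>

```python
from itertools import product

# Hypothetical distributions, not taken from the article.
# x1: Boolean variable, True if the tuple (a1, b1) exists.
# x3: value of attribute B in the third tuple.
x1_dist = {True: 0.6, False: 0.4}
x3_dist = {"b3": 0.5, "b3′": 0.3, "b3′′": 0.2}

def world_for(x1, x3):
    """Build the possible world selected by one assignment of the variables."""
    world = []
    if x1:
        world.append(("a1", "b1"))
    world.append(("a2", "b2"))      # certain tuple, present in every world
    world.append(("a3", x3))
    return tuple(world)

# Under the independence assumption, the probability of a world is the
# product of the probabilities of the variable values that select it.
world_probs = {}
for (x1, p1), (x3, p3) in product(x1_dist.items(), x3_dist.items()):
    world_probs[world_for(x1, x3)] = p1 * p3

def probability(predicate):
    """Sum the probabilities of all worlds in which the predicate holds."""
    return sum(p for world, p in world_probs.items() if predicate(world))

p_a1 = probability(lambda w: ("a1", "b1") in w)
p_b3_prime = probability(lambda w: ("a3", "b3′") in w)
print(f"P(tuple (a1, b1) exists) = {p_a1:.2f}")
print(f"P(B = b3′ in third tuple) = {p_b3_prime:.2f}")
```

<p>Summing over worlds in this way is only practical for small examples, since the number of worlds grows with every uncertain item.
</p>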
<h2 id="history">History</h2>
<p>The first published use of the term "probabilistic database" was probably in the 1987 VLDB conference paper "The theory of probabilistic databases", by Cavallo and Pittarelli.<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a> The title of the 11-page paper was intended as a bit of a joke, since David Maier's 600-page monograph, <i>The Theory of Relational Databases</i>, would have been familiar at that time to many of the conference participants and readers of the conference proceedings.
</p>

<h2 id="external-links">External links</h2>
<ul><li>The MayBMS project at <a href="/facts/Cornell_University/FGyXuXeI">Cornell University</a> (<a href="https://maybms.sourceforge.net/">sourceforge.net project site</a>)</li>
<li>The <a href="https://homes.cs.washington.edu/~suciu/project-mystiq.html">MystiQ</a> project at the <a href="/facts/University_of_Washington/36PG6qkG">University of Washington</a></li>
<li>The <a href="https://orion.cs.purdue.edu/">Orion</a> project at <a href="/facts/Purdue_University/dRuTdltL">Purdue University</a></li>
<li>The <a href="http://infolab.stanford.edu/trio/">Trio</a> project at <a href="/facts/Stanford_University/JEQeF9z2">Stanford University</a></li>
<li>The <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/102060.html/">BayesStore</a> project at the <a href="/facts/University_of_California%2c_Berkeley/MXVEjklr">University of California, Berkeley</a></li>
<li>The <a href="http://www.cs.umd.edu/~amol/PrDB/">PrDB</a> project at the <a href="/facts/University_of_Maryland%2c_College_Park/Wmc5R1kM">University of Maryland, College Park</a></li>
<li>The <a href="https://odin.cse.buffalo.edu/research/mimir/">Mimir</a> project at the <a href="/facts/University_at_Buffalo/gbIhPu11">University at Buffalo</a></li>
<li>The <a href="https://github.com/PierreSenellart/provsql">ProvSQL</a> project at <a href="/facts/%25C3%2589cole_normale_sup%25C3%25A9rieure_(Paris)/KQtorjQB">École normale supérieure (Paris)</a> (Module for <a href="/facts/PostgreSQL/LHdXuHfl">PostgreSQL</a>)</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Vinod Muthusamy, Haifeng Liu, Hans-Arno Jacobsen: Predictive Publish/Subscribe Matching. University of Toronto. <a href="http://www.eecg.toronto.edu/~jacobsen/ptopss.pdf" target="_blank">http://www.eecg.toronto.edu/~jacobsen/ptopss.pdf</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Nilesh N. Dalvi, Dan Suciu: Efficient query evaluation on probabilistic databases. VLDB J. 16(4): 523–544 (2007) <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Lyublena Antova, Christoph Koch, Dan Olteanu: 10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information. ICDE 2007: 606–615 <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Roger Cavallo, Michael Pittarelli: The Theory of Probabilistic Databases. In VLDB'87, Proceedings of 13th International Conference on Very Large Data Bases, September 1–4, 1987, Brighton: 71–81 (1987) <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
</ol>
