Distributed Sampling on Dask Bags


Distributed Data Structure



import random
seq = [1, 2, 3, 4, 5, 6]
s = random.choices(seq, k=2)
print(s)
>> [2, 6]

import dask.bag as db
from dask.bag import random
seq = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=3)
s = random.choices(seq, k=2)
print(list(s.compute()))
>> [2, 6]

Two-Step Algorithm
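A minimal pure-Python sketch of one way a two-step scheme for sampling with replacement can work on partitioned data. It assumes that step one draws k candidates independently inside each partition, and that step two fills each of the k output slots by picking a partition with probability proportional to its size and taking that partition's candidate for the slot. The helper names (`sample_partition`, `merge_samples`, `two_step_choices`) are hypothetical, and this is not necessarily how Dask implements it.

```python
import random


def sample_partition(partition, k, rng):
    """Step 1: draw k candidates (with replacement) from a single partition.

    Returns the candidates together with the partition size, which the
    merge step needs as a weight.
    """
    return rng.choices(partition, k=k), len(partition)


def merge_samples(partials, k, rng):
    """Step 2: combine per-partition candidates into one sample of size k.

    For each output slot, a partition is chosen with probability
    proportional to its size, and that partition's candidate for the
    slot is taken.
    """
    candidates = [c for c, _ in partials]
    sizes = [n for _, n in partials]
    picks = rng.choices(range(len(partials)), weights=sizes, k=k)
    return [candidates[p][slot] for slot, p in enumerate(picks)]


def two_step_choices(partitions, k, seed=None):
    """Sample k elements with replacement from the union of the partitions."""
    rng = random.Random(seed)
    # Empty partitions carry no elements, so they are skipped.
    partials = [sample_partition(p, k, rng) for p in partitions if p]
    return merge_samples(partials, k, rng)


# Assumed layout: six elements spread over three partitions.
partitions = [[1, 2], [3, 4], [5, 6]]
result = two_step_choices(partitions, k=2, seed=0)
print(result)
```

Under these assumptions an element x in a partition of size n_p, out of N elements overall, lands in a given slot with probability (n_p / N) * (1 / n_p) = 1 / N, so every slot is uniform over the whole collection even when partitions have different sizes.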

Informal Proof

Computational Costs

Statistical Test

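import dask.bag as db
from dask.bag import random

numbers = range(6)
buckets = {i: 0 for i in numbers}  # how often each number is drawn
a = db.from_sequence(numbers, partition_size=4)
for i in range(150):
    # draw a single element from the bag
    s = random.sample(a, k=1).compute()
    for e in s:
        buckets[e] += 1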
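
library(XNomial)  # provides xmulti()

obs = c(25,26,33,22,17,27)  # observed counts per bucket (150 draws in total)
n = length(obs)
expr = rep(1/n, n)  # expected uniform probabilities
xmulti(obs, expr, statName = "LLR", histobins = T, histobounds = c(0, 0), showCurve = T)

P value (LLR)   = 0.3335
P value (Prob)  = 0.3339
P value (Chisq) = 0.3428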
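
As a rough cross-check in Python, the Pearson chi-square statistic for the same observed counts can be computed by hand. This is a sketch of the asymptotic goodness-of-fit test, not the exact multinomial test run by `xmulti`, so it is compared against the standard 5% critical value for 5 degrees of freedom (11.07) rather than against the exact P values above. The variable names are illustrative.

```python
# Observed counts for the six buckets, taken from the R snippet above.
obs = [25, 26, 33, 22, 17, 27]

total = sum(obs)               # 150 draws overall
expected = total / len(obs)    # 25 per bucket under a uniform distribution

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - expected) ** 2 / expected for o in obs)

# Standard 5% critical value of the chi-square distribution with
# len(obs) - 1 = 5 degrees of freedom.
CRITICAL_5PCT_DF5 = 11.07

reject_uniform = chi_sq > CRITICAL_5PCT_DF5
print(f"chi-square = {chi_sq:.2f}")
print("reject uniformity at the 5% level:", reject_uniform)
```

The statistic comes out at 5.68, well below 11.07, so this approximation also finds no evidence against uniform sampling.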





Data Architect — Senior Software Developer http://eracle.me

Antonio Ercole De Luca
