Sampling Methods - All You Need To Know!

Introduction

Sampling is the process of selecting a representative subset of a population. It is rarely feasible to analyze or survey an entire population, so sampling makes large-scale research possible at a realistic cost and in a realistic time. There are various sampling techniques to choose from depending on the use case; this article covers the most commonly used ones.

Probability Sampling

Probability sampling is a technique in which elements are chosen by random selection, so every element in the population has a known, nonzero chance of being chosen. Because such a sample is random, we can estimate the sampling error and the degree of confidence. This technique is best suited to descriptive studies with a large and diverse population.
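
To make the idea of estimating sampling error concrete, here is a minimal sketch. The population is made up for illustration (normally distributed values), and the 95% interval assumes the usual normal approximation with a critical value of 1.96; it is one common way to do this, not the only one.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 made-up measurements
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Random sample of 400 elements, drawn without replacement
sample = rng.choice(population, size=400, replace=False)

sample_mean = sample.mean()
# Standard error of the mean: sample standard deviation / sqrt(n)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

# Approximate 95% confidence interval (normal approximation)
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.2f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The standard error shrinks with the square root of the sample size, so quadrupling the sample roughly halves the width of the interval.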

Simple Random Sampling

This is the most basic sampling method: a subset is selected at random from the population, with every element equally likely to be chosen. It is popular for its simplicity and lack of selection bias.

# Simple random sampling
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load the iris dataset into a DataFrame with a readable species column
iris = load_iris()
df = (
    pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                 columns=iris['feature_names'] + ['target'])
    .astype({'target': int})
    .assign(species=lambda x: x['target'].map(dict(enumerate(iris['target_names']))))
)

# Sample 50 of the 150 rows, without replacement
simple_random = df.sample(n=50)

Systematic Sampling

In this method, members of the sample are selected from the population in a systematic way: at a fixed sampling interval from a random starting point. Compared to simple random sampling, this method offers a higher degree of control and a lower chance of data contamination. The size of the population must be known beforehand for this method to work well.

# Systematic sampling function
# Uses row 0 as the starting point for simplicity, rather than a random start
def systematic_sampling(df, step):
    rows = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[rows]
    return systematic_sample

# Sample every 3rd row
systematic_sample = systematic_sampling(df, 3)
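
The description above calls for a random starting point, so here is a sketch of a variant that picks one. The function name `systematic_sampling_random_start`, its `seed` parameter and the stand-in `population` DataFrame are all hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def systematic_sampling_random_start(df, step, seed=None):
    """Select every `step`-th row, starting from a random offset in [0, step)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random start within the first interval
    rows = np.arange(start, len(df), step)
    return df.iloc[rows]

# Hypothetical stand-in population of 150 rows
population = pd.DataFrame({"value": np.arange(150)})
sample = systematic_sampling_random_start(population, step=3, seed=42)
```

Drawing the start from the first interval gives every row the same chance of selection while keeping the fixed spacing between selected rows.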

Stratified Sampling

This method divides the population into subsets, or strata, and samples from each one. In proportionate sampling, each stratum is represented in the sample in proportion to its share of the population; in disproportionate sampling, certain strata are oversampled or undersampled depending on the research question.

# Proportionate stratified sampling
from sklearn.model_selection import train_test_split

# Keep 30% of the data as the sample, preserving the species proportions
stratified_sample, _ = train_test_split(df, test_size=0.7, stratify=df["species"])
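
To contrast the two flavours, here is a small sketch using pandas' `groupby(...).sample(...)` on a made-up population. The strata labels and sizes are assumptions chosen so the difference is easy to see; they do not come from the iris example.

```python
import numpy as np
import pandas as pd

# Hypothetical population with three strata of unequal size (600 / 300 / 100)
population = pd.DataFrame({
    "stratum": ["A"] * 600 + ["B"] * 300 + ["C"] * 100,
    "value": np.arange(1000),
})

# Proportionate: the same fraction from every stratum keeps the 60/30/10 split
proportionate = population.groupby("stratum").sample(frac=0.1, random_state=0)

# Disproportionate: a fixed count per stratum, which oversamples the small stratum C
disproportionate = population.groupby("stratum").sample(n=30, random_state=0)

print(proportionate["stratum"].value_counts().sort_index())
print(disproportionate["stratum"].value_counts().sort_index())
```

The proportionate sample contains 60, 30 and 10 rows from strata A, B and C, while the disproportionate sample contains 30 rows from each.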

Cluster Sampling

In this method, the population is divided into clusters, and then some of the clusters are randomly selected as the sample. There are variations of this method, such as single-stage, two-stage and multi-stage, each adding a further round of random selection. It is used when the population is too spread out to sample directly, and it can result in a high sampling error if the clusters aren't representative of the population.

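# Single-stage cluster sampling
import random

def cluster_sampling(df, clusters, n):
    # Randomly assign each row to one of `clusters` clusters (labels 0..clusters-1);
    # this is for illustration only, real clusters are usually natural groups
    clustered = df.assign(cluster=np.random.randint(0, clusters, size=len(df)))
    # Randomly pick n of the clusters and keep every row in them
    chosen = random.sample(range(clusters), n)
    cluster_sample = clustered[clustered.cluster.isin(chosen)]
    return cluster_sample

# Split into 10 clusters and keep 3 of them
cluster_sample = cluster_sampling(df, 10, 3)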
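
For the two-stage variation mentioned above, one possible sketch is shown below. It assumes the clusters already exist as a column in the data; the `school` and `score` columns and the function name `two_stage_cluster_sampling` are hypothetical.

```python
import numpy as np
import pandas as pd

def two_stage_cluster_sampling(df, cluster_col, n_clusters, n_per_cluster, seed=None):
    rng = np.random.default_rng(seed)
    # Stage 1: randomly choose whole clusters
    chosen = rng.choice(df[cluster_col].unique(), size=n_clusters, replace=False)
    selected = df[df[cluster_col].isin(chosen)]
    # Stage 2: randomly choose rows within each chosen cluster
    return selected.groupby(cluster_col).sample(n=n_per_cluster, random_state=seed)

# Hypothetical population: 10 schools with 50 students each
population = pd.DataFrame({
    "school": np.repeat(np.arange(10), 50),
    "score": np.arange(500),
})
sample = two_stage_cluster_sampling(
    population, "school", n_clusters=3, n_per_cluster=5, seed=1
)
```

Here the result is 15 rows: 5 students from each of 3 randomly chosen schools.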

Nonprobability Sampling

Nonprobability sampling is a technique in which elements are selected non-randomly, so their chances of being chosen are unequal and generally unknown. This technique is best suited to exploratory studies on a small and specific population. It is not intended to support statistical conclusions about the population.

Convenience Sampling

In this technique, sampling is done based on availability and relative ease of access. Researchers usually use this method to gather feedback on critical issues or products. Online surveys, market research on a university campus, and interviews with people at a mall are examples of this method.

Judgemental Sampling

In this method, sampling is done based on the researcher's judgement or expertise. It is often used when certain participants or elements of the population are more relevant to the research objective. Expert panels, celebrity interviews and clinical trials are examples of this method.

Quota Sampling

This method is similar to stratified sampling: the researcher identifies subsets of the population and then selects members non-randomly until a pre-specified quota for each subset is filled. It is often used when the researcher wants to ensure that the sample is representative of the population on certain attributes such as geographic region, age, gender or income.
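
As a rough illustration, the sketch below fills quotas in arrival order, which is one simple non-random selection rule among many. The respondent table and the quota values are made up for the example.

```python
import pandas as pd

# Hypothetical survey respondents, listed in the order they were reached
respondents = pd.DataFrame({
    "respondent_id": range(1, 13),
    "region": ["north", "south", "north", "north", "east", "south",
               "east", "north", "south", "east", "east", "south"],
})

# Hypothetical quotas: how many respondents to accept per region
quotas = {"north": 2, "south": 3, "east": 1}

# Rank respondents within each region by arrival order (0, 1, 2, ...),
# then keep only those whose rank falls under the region's quota
arrival_rank = respondents.groupby("region").cumcount()
quota_sample = respondents[arrival_rank < respondents["region"].map(quotas)]
```

Unlike stratified sampling, nothing here is random: the first respondents to arrive in each region fill the quota, which is where selection bias can enter.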

Snowball Sampling

This technique involves selecting a small group of people and relying on their referrals to identify additional participants for the sample. It can result in a biased sample, and researchers should be wary that it might not be representative of the population. This method is useful when the population is difficult to reach or hidden.
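
The referral process can be sketched as a breadth-first walk over a referral network. The network, the names in it and the `snowball_sample` function are all hypothetical; real studies rarely know the full network in advance.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Grow a sample by following referrals outward from the seed participants."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        # Each recruited participant refers people they know
        for referred in referrals.get(person, []):
            if referred not in seen:
                seen.add(referred)
                queue.append(referred)
    return sample

# Hypothetical referral network: who each participant can refer
referrals = {
    "ana": ["ben", "cho"],
    "ben": ["dev"],
    "cho": ["dev", "eli"],
    "dev": ["fay"],
    "gus": ["hal"],  # not connected to the seed, so never reached
}
sample = snowball_sample(referrals, seeds=["ana"], max_size=5)
```

Participants who are not connected to the seed group, such as `gus` and `hal` here, can never enter the sample, which is one source of the bias described above.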

Other Sampling Methods

Most ML classification algorithms are designed to work on balanced datasets. When the data is imbalanced, the minority class is often ignored and ends up with a high misclassification rate. This is because most ML algorithms rely on the class distribution to gauge likelihood and make predictions.

Here’s where data sampling helps, by transforming the imbalanced dataset into a balanced one. This allows algorithms to train on the modified dataset without additional data preparation. Random Oversampling, Random Undersampling, SMOTE, ADASYN and Tomek Links are some of the popular sampling methods for imbalanced datasets.
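
The two simplest methods in that list, random oversampling and random undersampling, can be sketched with plain pandas. The dataset below is made up (90 rows of class 0, 10 rows of class 1), and the other methods listed are not covered by this sketch.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 rows of class 0, 10 rows of class 1
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Random oversampling: duplicate minority rows (with replacement) up to the majority size
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Random undersampling: drop majority rows (without replacement) down to the minority size
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])
```

Oversampling keeps all the original data but repeats minority rows, while undersampling discards majority rows; resampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.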

If you found my work useful, please cite it as:

{
  author        = {Tammineedi, Mahitha},
  title         = {Sampling Methods - All you need to know},
  howpublished  = {\url{https://mahi27.github.io/}},
  year          = {2023},
  note          = {Accessed: 2023-09-15},
  url           = {https://mahi27.github.io/}
}

M. Tammineedi, Sampling Methods - All you need to know, https://mahi27.github.io/, 2023, Accessed: Sep 15 2023.