Count repeated values in a dataframe

Lately I’ve been exploring large datasets by querying metadata files. In the metadata, each individual and has a unique identifier in the column sample_id. Some individuals are in the metadata multiple times, and I was curious about the distribution of those frequencies - are most people only in it once? What’s the max? How many people are in it a large number of times?

First, here’s some R code to simulate a dataframe that mimics the metadata file I’m working with:

# Set seed for reproducibility
set.seed(123)

# Create a vector of 20 unique random identifiers
sample_ids <- replicate(20, paste0(sample(LETTERS, 10, replace=TRUE), collapse=""))

# Sample 100 identifiers with replacement from the vector of 20 identifiers
sampled_ids <- sample(sample_ids, 100, replace=TRUE)

# Create a data frame with 100 rows and 4 columns
df <- data.frame(sample_id = sampled_ids,
                 val1 = runif(100),
                 val2 = runif(100),
                 val3 = runif(100))

# Print the first few rows of the data frame
head(df)

To see how many times an individual is in the dataset, in R, table(df$sample_id) will report each unique identifier in the column sample_id, along with the number of times it appears in the dataframe:

DAHTYPXVKP GULOJMGIIJ HZGJISDNQK IGBPMSXTOG JVLTNQNVCH KYWHNUMBKM LYNCNGCWVZ MRAYYFUOIO NFYHLZDMNU NSOXQWKGOW NVYZESYYIC OSNCJRVKET OUEHSJRJLB QVRQBDMEVS THCDTLVQJT 
         7          3          6          5          2          1          9          5          5          6          6          6          4          5          5 
WKDLNSYGVZ WUGUFYBEHL YTVYNYWCHP ZFNGJEFPXU ZPTFKHVVGP 
         1          5          8          3          8 

This is useful by eye, as I can see for example that the individual with sample_id DAHTYPXVKP appears in the dataframe 7 times.

But I also want to see how many individuals were in the dataset once, twice, 10 times, etc. To do that, I can do an additional table wrapper function: table(table(df$sample_id)). This returns the integer counts of appearances in the dataframe, along with the number of individuals making that number of appearances:

1 2 3 4 5 6 7 8 9 
2 1 2 1 6 4 1 2 1 

In this example, we have two individuals present in the dataset one time; one individual present in the dataset four times; and six individuals present in the dataset 5 times. We also see that the maximum number of appearnces in the dataset was 9, and that it was only one individual who made 9 appearances in the dataframe.

The most common number of appearances was 5: there are six individuals who appear in the dataframe five times. If we want to see those six individuals, we can execute which(table(df$sample_id) == 5), which gives 1) each sample_id that appeared in the dataframe 5 times and 2) the position of that sample_id in the initial table command. Running which(table(df$sample_id) == 5) yields:

IGBPMSXTOG MRAYYFUOIO NFYHLZDMNU QVRQBDMEVS THCDTLVQJT WUGUFYBEHL 
         4          8          9         14         15         17 

If we just view head(table(df$sample_id)) we’ll see that the identifier IGBPMSXTOG is the fourth one:

DAHTYPXVKP GULOJMGIIJ HZGJISDNQK IGBPMSXTOG JVLTNQNVCH KYWHNUMBKM 
         7          3          6          5          2          1

This nested table() command has been useful for seeing many times individuals were present in a dataframe, including the minimum and maximum number of appearances for any individual. It’s also great for downstream investigation - like seeing which individuals were present a given number of times. Let me know if you try it!