Getting a set of unique values from a column is a common operation in
data analysis. A commonly suggested technique is to use the
sort
and uniq
commands in combination. This
study compares the performance of this technique with some variations on
a real-world dataset and some contrived test examples.
The dataset is two files derived from a GBIF exported with 89,732,452 rows and 15 columns (GBIF.Org User 2024). The investigation focusses on two columns from this dataset:
publisher
with 30 unique values,license
with 175,716 unique values.These columns were isolated into separate files for each column to ensure the test is not influenced by column separation.
In order to compare with three more theoretical benchmarks three additional files (with the same number of lines as the real world example) were created:
same
: A file where every line reads “BioAcoustica:
Sound Library”unique
: A file with incrementing integersalternate
: Alternating lines of true and falseThe tests were performed on a VMWare virtualised instance of Ubuntu 22.04.05 LTS with 12GB of RAM and 4 Intel Xeon Platinum 8380 CPUs running at 2.30GHz.
The bash time
function was used to measure the time
taken to extract unique values from each column, with 10 repetitions for
each method. As the real time taken (as reported by time
)
may be affected by other processes running on the system, the sum of
user
and sys
time were used as the
total
measure of performance, although the
real
time is also reported as a measure of the time taken
to run the command and assess the impact of multi-thread operations.
Tests of single core performance were not conducted as it’s 2024.
The following methods were tested:
sort -u
uniq | sort -u
sort | uniq
uniq | uniq | sort -u
uniq | sort | uniq
The file being tested was read by the first sort
or
uniq
command, and the output was piped to
/dev/null
.
sort
The sort
command is used to sort the input data. The
-u
option is used to remove duplicate lines.
uniq
The uniq
command is used to remove duplicate lines from
a sorted file. If the file is not sorted uniq
will only
output the current line if it is different from the previous line.
In real-world datasets values in a column often occur in groups, so
the combinations uniq | sort | uniq
and
uniq | uniq | sort -u
may effectively reduce the number of
lines piped to sort
.
license
columnmethod | mean |
---|---|
sort -u | 24.6963 |
uniq | sort -u | 26.9240 |
uniq | sort | uniq | 32.5221 |
sort | uniq | 64.5948 |
method | mean |
---|---|
uniq | sort -u | 16.4382 |
sort -u | 17.2138 |
uniq | sort | uniq | 19.4527 |
sort | uniq | 38.7181 |
The two fastest methods for both user and total time are
uniq | sort -u
and sort -u
.
sort | uniq
is the slowest method, taking 2.36 times longer
than uniq | sort -u
for the user, and 2.62 times longer for
the total time.
publisher
columnmethod | mean |
---|---|
sort -u | 40.8204 |
sort | uniq | 41.6528 |
uniq | sort | uniq | 46.3127 |
uniq | sort -u | 47.5305 |
method | mean |
---|---|
sort -u | 40.8204 |
sort | uniq | 41.6528 |
uniq | sort | uniq | 46.3127 |
uniq | sort -u | 47.5305 |
The fastest times for both user and total time are
sort -u
and sort | uniq
. The calculation is
slower than for the license
file in all cases apart from
the compute time for sort | uniq
.
unique
columnmethod | mean |
---|---|
sort -u | 40.8204 |
sort | uniq | 41.6528 |
uniq | sort | uniq | 46.3127 |
uniq | sort -u | 47.5305 |
method | mean |
---|---|
sort -u | 40.8204 |
sort | uniq | 41.6528 |
uniq | sort | uniq | 46.3127 |
uniq | sort -u | 47.5305 |
The fastest times for both user and total time are
sort -u
and sort | uniq
, although all methods
are relatively closely matched.
alternating
columnmethod | mean |
---|---|
uniq | sort -u | 24.9655 |
sort -u | 30.4055 |
sort | uniq | 34.3696 |
uniq | sort | uniq | 37.9967 |
method | mean |
---|---|
uniq | sort -u | 24.9655 |
sort -u | 30.4055 |
sort | uniq | 34.3696 |
uniq | sort | uniq | 37.9967 |
The fastest times for both user and total time are
uniq | sort -u
and sort -u
.
same
columnmethod | mean |
---|---|
uniq | sort -u | 7.7747 |
uniq | sort | uniq | 7.8341 |
sort -u | 19.6351 |
sort | uniq | 43.1565 |
method | mean |
---|---|
uniq | sort -u | 7.7747 |
uniq | sort | uniq | 7.8341 |
sort -u | 19.6351 |
sort | uniq | 43.1565 |
The fastest times for both user and total time are
uniq | sort -u
and uniq | sort | uniq
, which
are similar in time taken and noticeably faster than the others.
uniq | sort -u
file | mean |
---|---|
publisher | 6.4099 |
same | 7.7747 |
alternate | 24.9655 |
license | 26.9240 |
unique | 47.5305 |
file | mean |
---|---|
publisher | 6.4099 |
same | 7.7747 |
alternate | 24.9655 |
license | 26.9240 |
unique | 47.5305 |
sort | uniq
file | mean |
---|---|
publisher | 21.7443 |
alternate | 34.3696 |
unique | 41.6528 |
same | 43.1565 |
license | 64.5948 |
file | mean |
---|---|
publisher | 21.7443 |
alternate | 34.3696 |
unique | 41.6528 |
same | 43.1565 |
license | 64.5948 |
uniq | sort | uniq
file | mean |
---|---|
same | 7.8341 |
publisher | 8.5796 |
license | 32.5221 |
alternate | 37.9967 |
unique | 46.3127 |
file | mean |
---|---|
same | 7.8341 |
publisher | 8.5796 |
license | 32.5221 |
alternate | 37.9967 |
unique | 46.3127 |
uniq | sort -u
file | mean |
---|---|
publisher | 6.4099 |
same | 7.7747 |
alternate | 24.9655 |
license | 26.9240 |
unique | 47.5305 |
file | mean |
---|---|
publisher | 6.4099 |
same | 7.7747 |
alternate | 24.9655 |
license | 26.9240 |
unique | 47.5305 |
For most real-world scenarios sort -u
is probably a time
efficient choice. For very large datasets which are predominantly single
valued uniq | sort -u
or uniq | sort | uniq
may be more efficient.
The virtual machine used for testing was provided by the Natural History Museum, London.