
AWK-in Around

Yet another edge-case solution for the people, this time involving multiple csv files and unique value counts.

The last few days have been a volatile love affair with awk, a wondrous monster I only recently encountered. This post simply shares a small bit of output that may be useful for someone out there.

What are we doing here?

We are processing csvs and counting unique values. The work I needed to accomplish involved reading hundreds of medium-sized (~100MB) csvs and tracking the evolution of unique identifier counts.
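To make the goal concrete, here's a tiny made-up example (hypothetical file and column names):


# hypothetical input, users.csv (field 2 is the identifier we care about)

date,user_id,action
2020-01-01,42,login
2020-01-01,42,click
2020-01-02,99,login

# desired output: the filename and its count of unique user_id values

users.csv 2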

So how do we do it in awk? Two ways. One directly in the shell and one in a script.

Done in the shell:


# in the shell

awk 'BEGIN{ FPAT = "[^,]*|\"[^\"]+\"" } FNR==1{ next } !seen[FILENAME,$2]++ { unq[FILENAME]++ } END { for (i=1; i<ARGC; i++) print ARGV[i], unq[ARGV[i]] }' your_input_file_1.csv your_input_file_2.csv
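The output is one line per file: the filename followed by its count of unique values in field 2. For two hypothetical inputs, you'd see something like:


# illustrative output (made-up counts)

your_input_file_1.csv 1042
your_input_file_2.csv 987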

I preferred to run this in a script for the sake of testing and ease. The process is only slightly more involved.

  1. Create a file called script.awk (for simplicity, I'm assuming all your files and the script live in the same directory)
  2. Make sure everything you want is saved in it
  3. (Optional): make it executable via chmod +x ./script.awk, which lets you run it directly
  4. Run the script!

Enough preamble. Here’s the script version of the same code.


#!/bin/awk -f

# FPAT (a gawk feature) defines what a field looks like: either a run of
# non-comma characters, or a quoted string that may contain commas
BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }

# skip the header row of each file
FNR == 1 { next }

# first time this file shows this value in field 2, bump the file's count
!seen[FILENAME, $2]++ { unq[FILENAME]++ }

# after all input is read, print each filename and its unique count
END { for (i = 1; i < ARGC; i++) print ARGV[i], unq[ARGV[i]] }

A few notes here. Make sure "#!/bin/awk -f" is written exactly as the first line of your file. This shebang signals to the shell that you're running a program and tells it which interpreter to use. (The path to awk varies by system; adjust it if awk lives elsewhere on your machine.)
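If you hit a "bad interpreter" style error, you can find where awk actually lives with:


# print the path to awk on this system, then adjust the shebang to match

which awk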

To actually run the file (assuming you haven't made it executable), execute the following (assume you have a bunch of csvs whose names end in "user-log.csv"):


# not executable

awk -f script.awk *user-log.csv

# executable (note the leading ./ — without it the shell won't find the
# script unless the current directory is on your PATH)

./script.awk *user-log.csv

What does the script actually do? awk reads input line by line, matching each line against pattern { action } pairs, so we can read the script roughly top to bottom. FPAT is a regex that tells awk what a field looks like (as opposed to FS, which describes what separates fields); note that FPAT is a gawk extension. If you know none of your csvs' fields contain commas, you can simply use FS=',' instead.
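Here's a quick sketch of why FPAT matters, using a made-up record with a quoted, comma-containing field:


# with FS="," the quoted field is wrongly split at its internal comma

echo '1,"Doe, Jane",NYC' | awk 'BEGIN{ FS = "," } { print $2 }'
# prints: "Doe

# with FPAT (gawk only) the quoted field stays whole

echo '1,"Doe, Jane",NYC' | awk 'BEGIN{ FPAT = "[^,]*|\"[^\"]+\"" } { print $2 }'
# prints: "Doe, Jane"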

With this done, and since we know our csvs have headers, we tell awk to skip (next) any row where FNR (the record number within the current file) equals 1, i.e. the header row.
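We use FNR rather than NR because NR counts records across all input files, while FNR resets at the start of each file; that's why FNR==1 catches every file's header, not just the first one. A tiny illustration with two hypothetical files:


# print the filename, the global record number, and the per-file record number

awk '{ print FILENAME, NR, FNR }' a.csv b.csv

# a.csv 1 1
# a.csv 2 2
# b.csv 3 1   <- NR keeps counting; FNR resets per file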

Next up, we use awk's associative arrays as counters. As awk reads the current file line by line, it looks at field #2 (which happens to be consistent across my files) and bumps a per-file counter whenever it encounters a new, not-yet-seen value. FILENAME is a built-in awk variable holding the name of the current input file. The way to think about it: the pattern asks, "have I seen this value in this column, in this file?" If not, increment unq[FILENAME] by one.
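The !seen[key]++ construction is a classic awk idiom: seen[key] starts at 0 (false), so the test is true exactly once per key, and the post-increment bumps it so every later occurrence fails. In isolation it's the standard one-liner for de-duplicating input:


# prints each distinct line once, in order of first appearance

printf 'a\nb\na\n' | awk '!seen[$0]++'

# prints: a, then b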

We close it out by printing the results after awk has parsed all the files. The END block loops over the argument list, printing each filename (accessed via ARGV[i]) alongside its final count from the unq array.
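One subtlety worth seeing for yourself: ARGV[0] holds the program name ("awk"), which is why the loop starts at i = 1. You can dump the argument vector directly:


# show awk's argument vector for two hypothetical files

awk 'BEGIN { for (i = 0; i < ARGC; i++) print i, ARGV[i] }' a.csv b.csv

# 0 awk
# 1 a.csv
# 2 b.csv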

Figuring this bit of code out was great fun. I nearly cracked a correct solution on my own, but this SO post pushed me over the line by adjusting the approach I was taking.

In any case, messing around in awk has been a blast, and I fully intend to keep finding ways to feature it in my ongoing workflows.