Yet another edge case solution for the people involving multiple csv files and unique value counts.
The last few days have been a volatile love affair with awk
. I have only lately encountered the wondrous monster. This post simply shares a small output which may be useful for someone out there.
What are we doing here?
We are processing csvs and counting unique values. The work I needed to accomplish involved reading hundreds of medium-sized (~100MB) csvs and tracking the evolution of unique identifier counts.
So how do we do it in awk
? Two ways. One directly in the shell and one in a script.
Done in the shell:
# in the shell
awk 'BEGIN{ FPAT = "[^,]*|\"[^\"]+\"" } FNR==1{ next; } !seen[FILENAME,$2]++ { unq[FILENAME]++ }END {for (i=1;i<ARGC;i++) print ARGV[i], unq[ARGV[i]]}' your_input_file_1.csv your_input_file_2.csv
I preferred to run this in a script for sake of testing and ease. This process is only slightly more involved.
- Create a file (for sake of simplicity, I’m assuming all your files & script are in the same directory) called script.awk
- Ensure you’ve got everything you want saved in
- (Optional): initialize via chmod +x ./script.awk … allows you to run it as an executable
- Run script!
Enough preamble. Here’s the script version of the same code.
#!/bin/awk -f
BEGIN{ FPAT = "[^,]*|\"[^\"]+\"" }
FNR==1{ next; } # ignore header
!seen[FILENAME,$2]++ { unq[FILENAME]++ }
END {for (i = 1;i < ARGC;i++) print ARGV[i], unq[ARGV[i]]}
A few notes here. Make sure you have properly written “#!/bin/awk -f” as the first line of your file. This signals to the shell that you are running a program and tells it where to look!
To actually run the file (assuming you haven’t made the file executable), execute the following (assume you have a bunch of csvs ending in “user-log”):
# not executable
awk -f script.awk *user-log.csv
# executable
script.awk *user-log.csv
What does the script actually do?? Since awk
reads line by line, looks for a pattern & action (in pattern {action} format), we can read roughly top to bottom. FPAT is a regex which tells awk
what your fields are. If you know you don’t have any commas in your fields for your csvs, you can use FS=','
.
With this done, we know that our csvs have headers so we tell awk
to skip (next:) a row where FNR (file record number) equals 1 (i.e the header row).
Next up, we utilize awks
counters to look in field #2 (happens to be consistent across my files) and increment an array by new (non-seen) values as it reads in the current file (denoted by FILENAME, as special term in awk
which accesses the current input filename), line by line. The way to think about this is, the pattern says, “have I seen this value in this column?” If not, increment unq[FILENAME]
by one.
We close it out by printing the results after awk
has parsed the files. Here we utilized a loop to print each file name (accessed by ARGV[i]) and the final incremented value for the unq
array as the result.
It was a blast figuring this bit of code out. I nearly cracked a correct solution, but was put into final action by this SO post for adjusted the approach I was taking.
In any case, messing around in awk
has been a blast and I will insist on finding ways to feature it in ongoing workflows.