I have not had a chance to post anything in a while so I thought I might discuss a fun issue I ran across.

I will not get in to all the boring details, but the crux of the problem was this: there is a server API process that communicates with a 3rd party that writes a csv file containing invoice information to a web accessible directory on our server. I can see the invoices that failed in the server logs but there is not an easy way to get a list of successful invoices. The information is in a database some where to be sure, but I do not have access to it. I needed a way to go through all the individual csv files and combine them.

Normally this would be where I would whip out Node or ColdFusion and write an app to scrape and parse this information. While that is certainly an option it is not always the best choice; as the old saying goes: "If all you have is a hammer, everything looks like a nail."

I put down my hammer and pulled out the 2lbs sledge: bash. It was a bit of trial and error but I was finally able to get a script together that (mostly) did what I needed it to do.

#!/bin/bash
script='csvinator.sh'

# There rundate and country should always be passed in
if [ $# -lt 2 ];
then
    echo Usage ${script} rundate country
    exit 1;
fi

rundate="$1"
rm -f invoices.txt output.csv
touch invoices.txt
touch output.csv

# The expected header
header="Batch ID,Vendor ID,Vendor Name,Vendor Class,Address 1,Address 2,City,State,Postal Code,Country,Phone,Payment Amount"
echo "$header" > output.csv

# Get the directory list from the web server, filter for files with our run date
lynx -dump -listonly http://site.com/folder/ | grep http | grep "$rundate" | awk '{print $2}' >> invoices.txt

while read url
do
    # save curl response to a variable
    curld=$(curl -s $url 2>&1)
    # strip out the header
    output="${curld//$header/}"
    # if the invoice country matches
    if grep -iq ",$2," <<< $output; then
        echo $url
        # write the csv data (minus the header) to the output file
        echo "$output" >> output.csv
    fi
done < invoices.txt

echo Done.

So there are a few things going on here so I will step through them one at a time.

  • If a rundate and a country are not specified the script exits with a non-zero exit code. These values are used later in the script.
  • Temporary files (invoices.txt and output.csv) are deleted and reinitialized. The header is written out to output.csv.
  • Lynx is used to get the directory listing and piped through grep to get http:// links which is then piped through grep again to filter files that contain the rundate parameter. Finally the output is piped through awk to append the entire http link to invoices.txt. An example line item might look like: http://site.com/some/folder/Invoice_1337_2019-01-17.csv
  • Next loop through each line of invoices.txt calling curl to read each .csv file save the contents to the variable curld. Next the csv header is stripped out using shell parameter expansion and saved to the variable output and grepped for the country. If the country is matched output is written to output.csv.
  • Profit

I am sure the same thing could have been accomplished with a query in a fraction of the time it took me to put this bash script together but it would not have been nearly as enjoyable. It is still a little bit quirky such as adding an extra line feed after every row which I have not been able to fix but it is easy enough just to do a column sort. Not too shabby for a Friday.

Have a great weekend!