Querying CSV files in SQL

I had a need to query a CSV file this morning. It was an export of users in Intercom and I wanted to find all the users with duplicate email addresses.

Numbers

You can kinda do this in Numbers by creating a new column with the formula

=COUNTIF($D, D2)

and applying it to all the cells. This will count the number of instances of the cell D2 in column D. You can then do some selective colouring to colour instances greater than one.

The downside is the this will freeze Numbers for several minutes on a 4000 row file. So I went looking for a better solution

Dave mentioned hack where you create a column in Excel which generates the SQL insert statement. Then export those and run against a database — but that still seemed like a lot of work.

Sqlite

https://twitter.com/pinkkis/status/818797075953500160

I like SQLite but I wanted something that created the schema for me as there’s 70 columns and I’m a very lazy bear.

As it happens, SQLite will auto-create the schema for you (mostly just the CSV column header and the TEXT type). Here’s the section from the .import documentation

when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table

So to import my_file.csv into a table named users, it’s as easy as

sqlite3
.mode csv
.import my_file.csv users
select * from (select count(*) as n, Email from users group by Email order by n DESC) as counts where n > 1;

q

https://twitter.com/peterc/status/818804467755847680

q is a handy tool which essentially uses sqlite behind the scenes and lets you directly query the csv file with sql from the command line.

q -H -d "," "select count(*) as n, Email from ./Coach_Users_259864_export_2017-01-09_17_31.csv group by Email order by n DESC"

The -H tells it to use the CSV header row to name the columns and -d tells it to use a comma as the delimiter (I wish that was the default instead of space)

I wasn’t able to make the subselect version from SQLite work in q but maybe I could rewrite it. q seems more appropriate for automation work than freeform data discovery