Data wrangling

Wrangle data of any size in a powerful, yet intuitive way

You've hit the limits of spreadsheets, but don't want to commit to a data analytics or warehousing platform.

Spreadsheets are powerful tools that are still the bread and butter of many professions. However, it's easy to hit their limits, whether it be in data size, performance or functionality. Your next step could be data analytics and warehousing platforms, but these can often be prohibitive in their pricing or complexity.

People looking at spreadsheet (from https://www.freepik.com)

tap is a middle ground between these tools:

  • Shape and analyse data of any size.
  • Easily share it with downstream teams.
  • Pricing model that you don't have to think twice about.

Case study: Transforming address data

In tap SQL is how you can manipulate the data. To demonstrate its power, let's look at a simple, relatively unclean dataset (an extract of Hospitals in Scotland):

PostcodeUtf8
AddressLine1Utf8
AddressLine2Utf8
"KA278LF""Lamlash ""Isle of Arran"
"KA128SS"" Kilwinning Road"" Irvine"
"KA280HF""College St ""Millport"
"KA2 0BE"" Kilmarnock Road""Kilmarnock"
"KA120DP"" Warrix Avenue""Irvine"
"KA6 6AB""Dalmellington Road ""Ayr"
"KA9 2HQ""Biggart Road ""Prestwick"
"KA6 6DX"" Dalmellington Road""Ayr"
"KA7 4DW""10 Doonfoot Road ""Ayr"
"KA215RF""Nelson Road""Saltcoats"

1. Clean it, standardising the whitespace both within and surrounding our address columns:

SELECT
regexp_replace("Postcode", '\s*(.3$)', ' \1', 'g') AS "Postcode",
trim("AddressLine1") AS "AddressLine1",
trim("AddressLine2") AS "AddressLine2"
FROM data.example
PostcodeUtf8
AddressLine1Utf8
AddressLine2Utf8
"KA27 8LF""Lamlash""Isle of Arran"
"KA12 8SS""Kilwinning Road""Irvine"
"KA28 0HF""College St""Millport"
"KA2 0BE""Kilmarnock Road""Kilmarnock"
"KA12 0DP""Warrix Avenue""Irvine"
"KA6 6AB""Dalmellington Road""Ayr"
"KA9 2HQ""Biggart Road""Prestwick"
"KA6 6DX""Dalmellington Road""Ayr"
"KA7 4DW""10 Doonfoot Road""Ayr"
"KA21 5RF""Nelson Road""Saltcoats"

2. Join with other datasets, in this case the ONS Postcode Lookup data to map postcodes to latitude and longitude:

SELECT model.example_cleaned.*,
data.uk_postcode_lookup.latitude AS "Latitude",
data.uk_postcode_lookup.longitude AS "Longitude"
FROM model.example_cleaned
JOIN data.uk_postcode_lookup ON "Postcode" = postcode
PostcodeUtf8
LatitudeFloat64
LongitudeFloat64
AddressLine1Utf8
AddressLine2Utf8
"KA27 8LF"55.54312-5.115521"Lamlash""Isle of Arran"
"KA12 8SS"55.635056-4.676106"Kilwinning Road""Irvine"
"KA28 0HF"55.761296-4.922755"College St""Millport"
"KA2 0BE"55.61394-4.539409"Kilmarnock Road""Kilmarnock"
"KA12 0DP"55.611863-4.660189"Warrix Avenue""Irvine"
"KA6 6AB"55.434174-4.593108"Dalmellington Road""Ayr"
"KA9 2HQ"55.492576-4.604862"Biggart Road""Prestwick"
"KA6 6DX"55.430332-4.595538"Dalmellington Road""Ayr"
"KA7 4DW"55.449149-4.639765"10 Doonfoot Road""Ayr"
"KA21 5RF"55.63957-4.772409"Nelson Road""Saltcoats"

3. Enrich with H3 indexing, to enable performant geospatial querying:

SELECT
model.example_joined.*,
h3_latlng_to_cell("Latitude", "Longitude", 15::INT) AS "H3Res15"
FROM model.example_joined
PostcodeUtf8
H3Res15Utf8
LatitudeFloat64
LongitudeFloat64
AddressLine1Utf8
AddressLine2Utf8
"KA27 8LF""8f19569a8c5c648"55.54312-5.115521"Lamlash""Isle of Arran"
"KA12 8SS""8f19092e6775410"55.635056-4.676106"Kilwinning Road""Irvine"
"KA28 0HF""8f19092b0d119ad"55.761296-4.922755"College St""Millport"
"KA2 0BE""8f190925084439b"55.61394-4.539409"Kilmarnock Road""Kilmarnock"
"KA12 0DP""8f19092e3d467a6"55.611863-4.660189"Warrix Avenue""Irvine"
"KA6 6AB""8f1954598d46a49"55.434174-4.593108"Dalmellington Road""Ayr"
"KA9 2HQ""8f1954598c58ac3"55.492576-4.604862"Biggart Road""Prestwick"
"KA6 6DX""8f1954598c58ac3"55.430332-4.595538"Dalmellington Road""Ayr"
"KA7 4DW""8f19545932d3a66"55.449149-4.639765"10 Doonfoot Road""Ayr"
"KA21 5RF""8f19092f28342c1"55.63957-4.772409"Nelson Road""Saltcoats"

You can then serve it via a Data API to allow your team to work with the tools of their choosing and development teams to start building data-centric applications right away! Or, just download the data in a CSV, Parquet or JSONL format for further analysis.

Close up of a CSV file by Mika Baumeister

Do you have data files you want to join or are too large?

tap is an easy to use data platform that gives you the power of SQL to work with files of any size.

Start wrangling immediately • No credit card required