Data wrangling
tap allows you to wrangle data of any size in a powerful, yet intuitive way
Spreadsheets are powerful tools that are still the bread and butter of many professions. However, it's easy to hit their limits, whether it be in data size, performance or functionality. Your next step could be data analytics and warehousing platforms, but these can often be prohibitive in their pricing or complexity.
tap is a middle ground between these tools, allowing you to shape and analyse data of any size and then easily share it with downstream teams with a pricing model that you don't have to think twice about
In tap SQL is how you can manipulate the data. To demonstrate its power, let's look at a simple, relatively unclean dataset:
PostcodeUtf8 | AddressLine1Utf8 | AddressLine2Utf8 |
---|---|---|
"KA278LF" | "Lamlash " | "Isle of Arran" |
"KA128SS" | " Kilwinning Road" | " Irvine" |
"KA280HF" | "College St " | "Millport" |
"KA2 0BE" | " Kilmarnock Road" | "Kilmarnock" |
"KA120DP" | " Warrix Avenue" | "Irvine" |
"KA6 6AB" | "Dalmellington Road " | "Ayr" |
"KA9 2HQ" | "Biggart Road " | "Prestwick" |
"KA6 6DX" | " Dalmellington Road" | "Ayr" |
"KA7 4DW" | "10 Doonfoot Road " | "Ayr" |
"KA215RF" | "Nelson Road" | "Saltcoats" |
We can clean it:
SELECT
regexp_replace("Postcode", '\s*(.3$)', ' \1', 'g') AS "Postcode",
trim("AddressLine1") AS "AddressLine1",
trim("AddressLine2") AS "AddressLine2"
FROM data.example
PostcodeUtf8 | AddressLine1Utf8 | AddressLine2Utf8 |
---|---|---|
"KA27 8LF" | "Lamlash" | "Isle of Arran" |
"KA12 8SS" | "Kilwinning Road" | "Irvine" |
"KA28 0HF" | "College St" | "Millport" |
"KA2 0BE" | "Kilmarnock Road" | "Kilmarnock" |
"KA12 0DP" | "Warrix Avenue" | "Irvine" |
"KA6 6AB" | "Dalmellington Road" | "Ayr" |
"KA9 2HQ" | "Biggart Road" | "Prestwick" |
"KA6 6DX" | "Dalmellington Road" | "Ayr" |
"KA7 4DW" | "10 Doonfoot Road" | "Ayr" |
"KA21 5RF" | "Nelson Road" | "Saltcoats" |
Join it with other datasets:
SELECT model.example_cleaned.*,
data.uk_postcode_lookup.latitude AS "Latitude",
data.uk_postcode_lookup.longitude AS "Longitude"
FROM model.example_cleaned
JOIN data.uk_postcode_lookup ON "Postcode" = postcode
PostcodeUtf8 | LatitudeFloat64 | LongitudeFloat64 | AddressLine1Utf8 | AddressLine2Utf8 |
---|---|---|---|---|
"KA27 8LF" | 55.54312 | -5.115521 | "Lamlash" | "Isle of Arran" |
"KA12 8SS" | 55.635056 | -4.676106 | "Kilwinning Road" | "Irvine" |
"KA28 0HF" | 55.761296 | -4.922755 | "College St" | "Millport" |
"KA2 0BE" | 55.61394 | -4.539409 | "Kilmarnock Road" | "Kilmarnock" |
"KA12 0DP" | 55.611863 | -4.660189 | "Warrix Avenue" | "Irvine" |
"KA6 6AB" | 55.434174 | -4.593108 | "Dalmellington Road" | "Ayr" |
"KA9 2HQ" | 55.492576 | -4.604862 | "Biggart Road" | "Prestwick" |
"KA6 6DX" | 55.430332 | -4.595538 | "Dalmellington Road" | "Ayr" |
"KA7 4DW" | 55.449149 | -4.639765 | "10 Doonfoot Road" | "Ayr" |
"KA21 5RF" | 55.63957 | -4.772409 | "Nelson Road" | "Saltcoats" |
We can even enrich with H3 indexing, if we wanted geospatial querying:
SELECT
model.example_joined.*,
h3_latlng_to_cell("Latitude", "Longitude", 15::INT) AS "H3Res15"
FROM model.example_joined
PostcodeUtf8 | H3Res15Utf8 | LatitudeFloat64 | LongitudeFloat64 | AddressLine1Utf8 | AddressLine2Utf8 |
---|---|---|---|---|---|
"KA27 8LF" | "8f19569a8c5c648" | 55.54312 | -5.115521 | "Lamlash" | "Isle of Arran" |
"KA12 8SS" | "8f19092e6775410" | 55.635056 | -4.676106 | "Kilwinning Road" | "Irvine" |
"KA28 0HF" | "8f19092b0d119ad" | 55.761296 | -4.922755 | "College St" | "Millport" |
"KA2 0BE" | "8f190925084439b" | 55.61394 | -4.539409 | "Kilmarnock Road" | "Kilmarnock" |
"KA12 0DP" | "8f19092e3d467a6" | 55.611863 | -4.660189 | "Warrix Avenue" | "Irvine" |
"KA6 6AB" | "8f1954598d46a49" | 55.434174 | -4.593108 | "Dalmellington Road" | "Ayr" |
"KA9 2HQ" | "8f1954598c58ac3" | 55.492576 | -4.604862 | "Biggart Road" | "Prestwick" |
"KA6 6DX" | "8f1954598c58ac3" | 55.430332 | -4.595538 | "Dalmellington Road" | "Ayr" |
"KA7 4DW" | "8f19545932d3a66" | 55.449149 | -4.639765 | "10 Doonfoot Road" | "Ayr" |
"KA21 5RF" | "8f19092f28342c1" | 55.63957 | -4.772409 | "Nelson Road" | "Saltcoats" |
You can then download the data in a CSV, Parquet or JSONL format for further analysis. Or, serve it via a Data API to allow downstream development teams to start building data-centric applications right away!