top of page

Dr. SQL-LOVE or How I Learned to Stop Worrying and Fall in Love with Subqueries

Updated: Jan 4, 2023




Way back in the summer of 2019, a simpler time, a time before covid, a time before masks, a time before hoopla and hysteria, I embarked on my first internship since high school. This one actually paid, thankfully.

The internship was for a Texas based grocery company called H-E-B. A company with a minor monopoly on grocery shopping in Texas. (They lack a strong presence in North and West Texas, but their presence is expanding in the former and the latter is mostly gorgeous desert and mountains.) I worked with one of their data analytics teams. A customer facing one. So, their work revolved around crunching consumer spending habits , mostly using straightforward number crunching and statistics, segmentation models, and some logistic regressions. Nothing too fancy, and why use fancy models when straight-forward ones work? I think the people in that office had a lot of good sense and were on the whole fantastically helpful and amiable.

There was one cheeky fellow who worked remotely, and whose name I don't remember, who I found rather disagreeable. He was very smug and condescending toward myself, other interns, and even his coworkers. The read I got on him was of someone very proud of himself for having the ability to work remotely while all these others had to be stuck in an office. Mind you, this was 2019, so remote work was still a hallowed perk for only a privileged few. My own thoughts on the character were, "if you're so brilliant and capable, why don't you work for yourself? and really be free of any corporate miasma you obviously have. Because after all, you still have to come into the office at least once a week." Well, that's neither here nor there, but I thought it was worth mentioning because I think people who work remotely should have a sacred sense of gratitude, and count their lucky stars because for billions of people work is a major encumbrance on their daily lives and routines. They have to go get their hands dirty in a myriad of tough and rough environments (believe me I know, I've seen it first hand in third world

countries), and don't have the luxury of sipping espressos from their couch while they crank out TPS reports and watch Netflix and wait for Uber food deliveries.

Practically speaking, I learned a lot during the internship. How to attend a lot of meetings that were time consumer. How to build data pipelines. How to perform a omni-channel customer segmentation. How to run machine learning algorithms like k-nearest neighbor at scale. How to make small talk with other introverted nerds (previously I worked as a bureaucrat, so the . How to use JIRA, Hadoop through Cloudera's interface, Databricks, H20.ai, and a whole bunch of other tools. Mostly, how to write SQL queries with big data well. I'd worked with SQL before, but never with huge amounts of data. Below I have included some of the many queries I wrote. At first I found SQL to be a grotesque and antiquated language. Though, over the course of that summer I came to love its simplicity and expressive power. Despite the advances in artificial intelligence, deep learning, and machine learning, the world of analytics would grind to a halt without SQL and all its flavors.

Below I have included one of the queries I wrote back in the summer of 2019, I think it's well-written, elegant, and an excellent demonstration of the power of sub-queries. It's creating a master table based on customer shopping data, which will then be used to run a k means clustering algorithm and from there create a customer segmentation to illicit insights into the use of different sales channels by distinct groups of customers. When an analyst is working with a lot of data, think millions of rows, spread across many tables, often times some tables with more than 30 columns, and the particular analytic question(s) that's being asked only requires a small subset of all those columns from all those tables, then in those cases a sub-query or a set of them as seen below, is often necessary. Of course, for the sake of performance, fine tuning is possible, large queries can be broken up into a series of steps, or new and better tables can be created to custom fit certain analytic tasks and projects.

On a positive note I would like to conclude, as Yoda might say, by saying the customer related data engineering team at H-E-B were the real all stars in my opinion. They didn't get all the glory when it came to recognition for data deliverables in the form of presentations and the like, but they were and are the real gifted problem solvers. They built the backbone system for all the data scientists to play in. They did the ETL work, yeoman's work. They would often step in and help the data scientists when their problems got too technical. Don't get me wrong, the data scientists were all brilliant at their own particular kind of work, a few I found to be intellectually quite exceptional. They helped me solve my problems, and were gracious enough to seek my help on problems of their own at times. And more than that, they were a funny group of guys always telling jokes and ribbing each other in a good natured way. The analytics people, especially the managers were a bunch of stiffs most of the time. Not the data engineering gentlemen. Their intellect and computational wizardry keeps me working on, building my skills, and sharpening my data intelligence so that it can cut through any complexity with ease and grace. C'est la vie, till next time!


 


--table to predict cluster assignment for all 10 million people 

SET MEM_LIMIT = 8g;

DROP TABLE IF EXISTS omni_seg_all_heb_cust;
CREATE TABLE omni_seg_all_heb_cust AS

WITH
  base_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips,
      sum(amt_net_sls_rev) AS sales,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross,
      sum(qty_net_pkg_itms) AS items,
      sum(qty_net_unts_sold) AS units
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                     32002,
                     32004))
      AND (flfmt_by = 'Store'
      OR flfmt_by = 'Curbside'
      OR flfmt_by = 'Delivery'
      OR flfmt_by = 'Instacart'
      OR flfmt_by = 'Shipt'
      OR flfmt_by = 'Ship To Home')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1

  ),

  store_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips_store,
      sum(amt_net_sls_rev) AS sales_store,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross_store,
      sum(qty_net_pkg_itms) AS items_store,
      sum(qty_net_unts_sold) AS units_store
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                  32002,
                  32004))
      AND (flfmt_by = 'Store')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1

  ),


  curbside_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips_curbside,
      sum(amt_net_sls_rev) AS sales_curbside,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross_curbside,
      sum(qty_net_pkg_itms) AS items_curbside,
      sum(qty_net_unts_sold) AS units_curbside
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                  32002,
                  32004))
      AND (flfmt_by = 'Curbside')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1

  ),

  delivery_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips_delivery,
      sum(amt_net_sls_rev) AS sales_delivery,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross_delivery,
      sum(qty_net_pkg_itms) AS items_delivery,
      sum(qty_net_unts_sold) AS units_delivery
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                  32002,
                  32004))
      AND (flfmt_by = 'Delivery')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1
  ),

  instacart_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips_instacart,
      sum(amt_net_sls_rev) AS sales_instacart,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross_instacart,
      sum(qty_net_pkg_itms) AS items_instacart,
      sum(qty_net_unts_sold) AS units_instacart
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                  32002,
                  32004))
      AND (flfmt_by = 'Instacart')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1

  ),

  shipt_query
  AS
  (

    SELECT instnc_id AS instance_id,
      COUNT (DISTINCT id_shopping_trans) AS trips_shipt,
      sum(amt_net_sls_rev) AS sales_shipt,
      (sum(amt_net_sls_rev) - sum(amt_net_cst)) AS gross_shipt,
      sum(qty_net_pkg_itms) AS items_shipt,
      sum(qty_net_unts_sold) AS units_shipt
    FROM ampdb.v_customerorders
    WHERE (id_date BETWEEN '2019-03-18' AND '2019-06-09')
      AND (id_lob IN (32001,
                  32002,
                  32004))
      AND (flfmt_by = 'Shipt')
      AND sw_ex = 'Y'
      AND instnc_id IS NOT NULL
    GROUP BY 1

  ),

  avg_days_query
  AS
  (

    SELECT days_between.livg_unt_id,
      avg(days_between.day_between) avg_days_between
    FROM
      (SELECT dates_before.livg_unt_id,
        dates_before.id_date,
        datediff(dates_before.day_between, dates_before.id_date) day_between
      FROM
        (SELECT hh_and_date.*,
          lag(hh_and_date.id_date,1) OVER (PARTITION BY hh_and_date.livg_unt_id
                                                 ORDER BY hh_and_date.id_date DESC) day_between
        FROM
          (SELECT ccrn.livg_unt_id,
            order_dates.id_date
          FROM
            (SELECT orders.instnc_id,
              orders.id_date
            FROM v_customerorders orders
              INNER JOIN v_dim_product_all product ON orders.id_upc = product.id_upc
            WHERE orders.id_lob IN (32001,32002,32004)
              AND orders.sw_ex = 'Y'
              AND (orders.flfmt_by = 'B&M'
              OR orders.flfmt_by = 'Curbside'
              OR orders.flfmt_by = 'Delivery'
              OR orders.flfmt_by = 'Instacart'
              OR orders.flfmt_by = 'Shipt'
              OR orders.flfmt_by = 'Ship To Home')
              AND orders.id_date IN
                   (SELECT DISTINCT b.id_date
              FROM ampdb.v_dim_fscl_week a
                INNER JOIN ampdb.v_dim_date b ON a.id_fscl_wk = b.id_fscl_wk
              WHERE a.nbr_wks_from_today BETWEEN -52 AND -1) )order_dates
            INNER JOIN v_consolidated_ccrn_hh ccrn ON order_dates.instnc_id = ccrn.instnc
          WHERE ccrn.ap_flag = 1
            AND ccrn.livg_unt_id >= 0
          GROUP BY 1,
                     2) hh_and_date) dates_before) days_between
    WHERE days_between.day_between IS NOT NULL
    GROUP BY 1

  )


(
  
SELECT
  ccrn.livg_unt_id,
  a.trips,
  a.sales,
  a.gross,
  a.items,
  a.units,
  b.trips_store,
  b.sales_store,
  b.items_store,
  c.trips_curbside,
  c.sales_curbside,
  c.items_curbside,
  d.trips_delivery,
  d.sales_delivery,
  d.items_delivery,
  e.trips_instacart,
  e.sales_instacart,
  e.items_instacart,
  f.trips_shipt,
  f.sales_shipt,
  f.items_shipt,
  g.avg_days_between

FROM ampdb.v_consolidated_ccrn_hh ccrn
  INNER JOIN base_query  a ON ccrn.instnc = a.instance_id
  INNER JOIN store_query b ON ccrn.instnc = b.instance_id
  INNER JOIN curbside_query c ON ccrn.instnc = c.instance_id
  INNER JOIN delivery_query d ON ccrn.instnc = d.instance_id
  INNER JOIN instacart_query e ON ccrn.instnc = e.instance_id
  INNER JOIN shipt_query f ON ccrn.instnc = f.instance_id
  INNER JOIN avg_days_query g ON ccrn.livg_unt_id = g.livg_unt_id
--GROUP BY 1
);

COMPUTE STATS f156350.omni_seg_all_heb_cust;


Commenti


bottom of page