Punchh’s Campaign Engine at Scale

A busy Cinco de Mayo despite COVID-19

Authors: Arun Ginjala, Shubham Goyal, Amit Suroliya (May 06, 2020)

Mass Campaigns: Reliability and Performance at Scale

The number of business clients that Punchh serves and the number of individual customers targeted through our platform have grown exponentially over the past two years. In 2019, our clients used our campaign engine to target 100 million+ consumers more than 5 billion times. On Cinco de Mayo 2020, 42 million+ consumers (13% of the US population) were targeted using our mass campaign engine. The Punchh platform thus enabled our clients to engage with their customers and increase their sales on this day amidst COVID-19, emblematic of the triumphant spirit that Cinco de Mayo stands for.

One of the major objectives of the engineering team at Punchh is to scale our application and infrastructure without slowing down campaign delivery during peak loads. Punchh’s clients use campaigns to target their customers via emails, push notifications, or in-app messages and inform them about offers. A ‘mass campaign’ is one sent to a large audience at once; by contrast, Punchh’s ongoing campaigns are triggered for individual customers on special occasions: birthdays, anniversaries, membership milestones, to name a few.

To give a sense of scale: in April 2020, the Punchh platform delivered nearly 5.5K mass campaigns targeting 920+ million consumers, amounting to 430 million emails and 110+ million push notifications. Cinco de Mayo saw further increases in load.

Chart: Cinco de Mayo benchmarked against an average day in April

Performance at Scale

70% of the mass campaigns in April were completed in under 10 minutes, 86% in under 30 minutes, and 94% in under 1 hour. Even on Cinco de Mayo, with an elevated load of 43 million consumers (a 200% increase in guests targeted over the average daily load in 2019), 286 mass campaigns were executed successfully, with 44% completed in under 10 minutes, 78% in under 30 minutes, and 98% in under 1 hour. What made this even more challenging was that most of the mass campaigns had to be delivered at almost the same time, within a span of a few hours.

On the multi-tenant stack that serves our customers, we have reliably handled peak throughputs of 6 million+ messages per hour. We have achieved this through:

  • Augmented monitoring and reporting to ensure that API response times remain stable during elevated volumes, with either no timeouts or smart handling of timeouts; the number of campaign workers and containers is dynamically scaled up when needed (a minimal scaling sketch follows this list).
  • Setting up systems and processes with real-time alerts, anomaly detection, health monitoring, and on-call SRE teams to ensure reliable campaign delivery.
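To make the worker-scaling idea concrete, here is a minimal Go sketch, not Punchh’s production code: an in-memory channel stands in for the real campaign queue, and names and thresholds such as perWorkerTarget and maxWorkers are illustrative assumptions.

```go
// A minimal sketch of queue-depth-based worker scaling, assuming an in-memory
// channel stands in for the real campaign queue. Not Punchh's production code.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// worker drains campaign jobs until the context is cancelled.
func worker(ctx context.Context, id int, jobs <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return
		case job, ok := <-jobs:
			if !ok {
				return
			}
			// In a real system this is where an email or push notification
			// would be handed off to the delivery provider.
			fmt.Printf("worker %d delivered %s\n", id, job)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Fake backlog standing in for a persistent campaign job queue.
	jobs := make(chan string, 10000)
	for i := 0; i < 5000; i++ {
		jobs <- fmt.Sprintf("campaign-job-%d", i)
	}

	var wg sync.WaitGroup
	workers := 0
	const perWorkerTarget = 500 // hypothetical backlog each worker can absorb
	const maxWorkers = 32       // hypothetical worker/container ceiling

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			wg.Wait()
			fmt.Println("final pool size:", workers)
			return
		case <-ticker.C:
			// Scale up when the backlog outgrows the current pool; idle
			// workers simply exit when the context ends.
			want := len(jobs)/perWorkerTarget + 1
			if want > maxWorkers {
				want = maxWorkers
			}
			for workers < want {
				workers++
				wg.Add(1)
				go worker(ctx, workers, jobs, &wg)
			}
		}
	}
}
```

In production the backlog would come from the actual job store and “scaling up” would mean launching containers rather than goroutines, but the control loop has the same shape: observe queue depth, compare against capacity, and add workers when needed.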

Steps Taken to Fortify and Future-Proof Our System

  • Instituted a process of running multiple tests (or drills) before large-scale events, using simulated loads to stress-test systems where necessary.
  • Complex segment evaluation (permutations and combinations of segments using set operations on arrays) was modified to handle larger segments using Redis. This worked well but introduced new problems with Redis scalability, so we have since moved to the next stage using Go, which allows more segments to be evaluated in parallel and quickly, with fewer moving parts (a minimal sketch follows this list).
  • The segmentation architecture is being overhauled to perform more dynamic segmentation over larger sets of data. We continue to evolve the underlying storage layer, evaluating alternatives to our Snowflake warehouse that let us do this at scale.
  • Campaign services are also being re-architected with database sharding, campaign batch processing, a NoSQL database, and regular archival of old data, through which we believe we can double our campaign throughput.
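To illustrate the segment-evaluation point, here is a minimal Go sketch, not Punchh’s actual engine: each base segment is a set of customer IDs, and several campaigns’ segment intersections are evaluated concurrently in a single process. The segment names, campaign names, and customer IDs are invented for illustration; real base segments would be loaded from the warehouse or a cache rather than hard-coded.

```go
// A minimal sketch of parallel segment evaluation via set operations.
// Segment names, campaign names, and IDs are invented; not Punchh's engine.
package main

import (
	"fmt"
	"sync"
)

// Segment is a set of customer IDs.
type Segment map[int64]struct{}

// intersect returns the customers present in both segments.
func intersect(a, b Segment) Segment {
	if len(b) < len(a) {
		a, b = b, a // iterate over the smaller set
	}
	out := make(Segment)
	for id := range a {
		if _, ok := b[id]; ok {
			out[id] = struct{}{}
		}
	}
	return out
}

func main() {
	// Hypothetical base segments; in practice these come from storage.
	base := map[string]Segment{
		"lapsed_90d":    {101: {}, 102: {}, 103: {}},
		"high_spenders": {102: {}, 103: {}, 104: {}},
		"app_users":     {101: {}, 103: {}, 104: {}, 105: {}},
	}

	// Each campaign targets the intersection of several base segments.
	campaigns := map[string][]string{
		"cinco_de_mayo_push":  {"lapsed_90d", "app_users"},
		"loyalty_email_blast": {"high_spenders", "app_users"},
	}

	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		audiences = make(map[string]Segment)
	)
	for name, parts := range campaigns {
		wg.Add(1)
		go func(name string, parts []string) {
			defer wg.Done()
			out := base[parts[0]]
			for _, p := range parts[1:] {
				out = intersect(out, base[p])
			}
			mu.Lock()
			audiences[name] = out
			mu.Unlock()
		}(name, parts)
	}
	wg.Wait()

	for name, audience := range audiences {
		fmt.Printf("%s -> %d customers\n", name, len(audience))
	}
}
```

Evaluating such expressions inside one process, rather than shuttling intermediate sets through Redis, is what reduces the moving parts noted above; unions, differences, and larger fan-out would follow the same pattern.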

From a product perspective, we are driving deeper campaign personalization with AI products that help our customers create data-driven segments and predict the most appropriate omnichannel offers across all engagement touchpoints. Furthermore, Punchh is building a data lake: a centralized repository of all transactions, application logs, user mobile data, and more.

About the Authors

Arun Ginjala is the Chief of Staff of the engineering team at Punchh. He assists the CTO and engineering leaders in tracking all programs and operations and in steering strategic initiatives.

Amit Suroliya leads the production support team at Punchh. His team ensures reliable 24/7 campaign delivery and handles all production issues.

Shubham Goyal is the Senior Analyst on the Business Intelligence team. He works closely with internal stakeholders to understand reporting needs, collect requirements, and design and build BI solutions.

The authors would like to thank Aditya Sanghi (Co-founder and CTO) and Xin Heng (VP, Data) for their guidance and support.

