Most computer systems have state and are likely to depend on a storage system. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though they are critical dependencies, databases compromise on a variety of capabilities in order to feasibly provide others, and then they don’t even talk about it. Databases can be so complex that our understanding of them is often limited to our production experiences and our use cases. This talk is a collection of SRE experiences and discusses what happens when databases lie about their capabilities and performance.
Everybody knows that we need a cache, but where exactly should we place it? Inside your application or as a layer in front of it? In the container or outside the container? In the era of Cloud Native and Microservices these questions get even more complicated. In this session I’ll present different architectural patterns for distributed caching: Embedded, Client-Server, (Kubernetes) Sidecar, and Reverse HTTP Proxy Caching.
In this session you’ll learn:
- What the design options are for including a caching layer
- How to apply a caching layer in Istio (and Service Mesh in general)
- How to use distributed HTTP caching without updating your microservices
- Common pitfalls when setting up caching for your system
Rafał Leszko, Hazelcast
Cloud software engineer at Hazelcast, author of the book “Continuous Delivery with Docker and Jenkins”, trainer, and conference speaker. He specializes in Java development, Cloud environments, and Continuous Delivery. He has previously worked at a number of companies and scientific organizations, including Google, CERN, AGH University, and more.
As infrastructure as code has become the basic standard for DevOps / SRE teams, the need to validate the written code has become more important. Validation also has a cost implication, as actual resources will be created in a commercial cloud environment.
In this talk we will explore using the Go library Terratest together with a combination of other open source tools like kind and LocalStack to write useful tests for your infrastructure as code in your local environment.
Most SRE talks focus on what people should be doing, but there are benefits to considering what NOT to do. This talk covers five organizational behaviors that seem obvious but will actually hurt your system availability, and offers some advice on how to get them right instead.
We’ve all been there: it’s 3 AM, the system is down, everything is on fire and it’s up to us to make it better. We do some digging, deploy a fix and draft a post-mortem. We might even identify some things we could have done differently, or suggest a process to avoid such problems in the future. Everyone sits down for the ceremonial presentation of the post-mortem and nods sagely, going back to their work secure in the knowledge that valuable lessons have been learned… right up until the next time the system crashes and we go through the motions again.
In this session we’ll consider not what could be done differently, but what shouldn’t be done at all: common engineering antipatterns that, if we fail to avoid them, will degrade our system and hurt its availability.
Tomer Gabel, Strigo
A programming junkie and computer history aficionado, Tomer’s been around the block a few times before settling in at Strigo. Over the years he’s built any number of (predominantly back-end) systems, cofounded two major Israeli user groups (Java.IL and Underscore), organized an annual Scala conference (Scalapeño) and is a recurring speaker at software conferences. He secretly still hopes to realize his childhood dream of becoming a lion tamer.
The Prometheus exposition format has always had helpful metadata like a metric type and a help string. This data used to be ignored by the Prometheus server, but it is now gradually becoming more useful. The speaker will explain past, present, and future plans for metadata and then demo how it is already used in the newest Grafana release to display helpful tooltips.
But there is more: Exemplars are an addition in the upcoming OpenMetrics exchange format. With them, you can link a counter increment or an observation in a histogram bucket to something like a trace ID. And again, hot off the press, there will be a demo of how the exemplars make it into Prometheus and then into Grafana, from where they can be used to link into your distributed tracing system.
Perhaps your team’s product is growing quickly, and your current shoot-from-the-hip method of shipping change simply isn’t scaling. Or an antiquated process is drowning you in overhead.
Either way, quality of releases is “generally bad.” How do you break “generally bad” down to actionable items?
Perhaps your team’s product or startup has started hitting heavy growth, and your current shoot-from-the-hip method of shipping change & managing production incidents simply isn’t scaling.
Or perhaps your product has been in existence for years, with large, long-tenured customers averse to change, but your engineers are drowning in antiquated process in desperate need of an overhaul.
Either way, the process of deployment & quality of releases are “generally bad” for both your customers and engineering.
How do you break “generally bad” down to actionable items your company can address to measurably improve the deployment process?
By the end of this presentation, you will know how to make product deployment go from a patchwork of occasional changes to frequent deployment of changes of pristine quality.
Katherine Cass, Salesforce
Katherine Cass has held several roles in Release Management, beginning as a Release Manager for the core Salesforce Platform. She is currently a member of the Einstein team, setting up the Build, Release & Deploy process for Salesforce’s recently released AI product, Einstein Platform.
Katherine has been setting up & refining development, build and release processes to optimize for quality for 3+ years. In this time, she has also consulted & presented on Release Management best practices to both customers and the broader community. This presentation is a distillation of repeat themes she’s encountered in doing so for 20+ different products and services of varying scale.
She is passionate about making technology more accessible, volunteering with Hour of Code and the Vis Valley Middle School maker’s club in her free time in addition to mentoring Hackbright and YearUp program graduates.
Testing distributed systems is different from testing a singleton application. In this talk, Minting will discuss one of the approaches used for testing updates in Aerospike, through which a general approach to testing distributed applications is demonstrated.
Some of the points that will be covered:
Minting Xiao, Aerospike
Minting is a software engineer at Aerospike and part of the quality engineering team, which works to ensure excellent quality of the Aerospike product in a systematic way.
She has been working on test automation and infrastructure development for years.
Before joining Aerospike, she worked at various startups in the San Francisco Bay Area on tools, systems and testing.
Minting has a Master's degree in Industrial & Systems Engineering from the University of Oklahoma and a Bachelor's degree in Automation from the University of Science & Technology of China.
There’s a lot of chatter in the software world about resilient systems. And, like most buzzwords, it all sounds great! But…what does it actually mean for a system to be resilient? And, crucially, how do we actually make our systems resilient - and keep them that way?
This talk will extract truth from the buzzword fog, and suggest practical steps you, and your organization, can take to put resiliency into practice.
Sam Boyer ,
Sam is a software engineer who’s obsessed with ecosystem-class problems: challenges without obvious answers, impacting vast groups of people, and where good solutions often require changing the way we look at the world. He is especially concerned with addressing such problems ethically, through the creation of humane systems. He has applied these principles at companies large and small, and in the world of open source.
Testing in production: it's gotten a bad rap. Lots of people associate it with cowboy coding and lack of due diligence, often with good reason. Others devote obsessive energy to making staging an approximately perfect mirror of prod ... but that's impossible, and they've lost sight of a few super-key facts. Such as: you already test in prod, like it or not ... but you probably haven't invested in enough tooling, so you probably do a shit job of it. Let's talk about how to make better decisions when it comes to testing in production. What to test with CI, what to test in production, why mirroring distributed systems is literally impossible ... and how to burn off your deploy-related angst on profitable improvements to your prod-testing pipeline.
Charity Majors, honeycomb.io
Charity is the co-founder and CTO of honeycomb.io, which brings modern observability tooling to engineers in the era of distributed systems. Charity is the co-author of Database Reliability Engineering (O'Reilly), and is devoted to creating a world where every engineer is on call and nobody thinks on call sucks.
For more information please refer to our Workshops page.