
Chaos Engineering | Type of Attacks

 


Today’s advanced distributed software systems must be tested for potential weaknesses and faults. Chaos engineering is the process of testing a distributed computing system to ensure that it can tolerate unexpected disruptions. It relies on concepts from chaos theory, which focuses on random and unpredictable behavior. If you are interested in knowing more about Chaos Engineering and its history, please refer to this article from Gremlin.

In this article we will discuss the various categories of attack and some use cases for each.



Resource Attack

Resource attacks generate load on CPU, memory, and storage devices.
They help you prepare for sudden load changes, validate auto scaling, and test monitoring and alerting configuration. It is like preparing your system for a Black Friday sale in advance.

CPU Attack

A CPU attack puts heavy load on the system's CPU, which helps identify stability and performance problems under stress. We can also use it to validate auto scaling and alerting mechanisms.
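As a rough illustration of how such load can be generated, here is a minimal sketch that keeps a single core busy for a fixed duration. The function name `burn_cpu` is hypothetical and not part of any chaos engineering tool; a real attack would typically run one such worker per core and cap the target utilization.

```python
import time


def burn_cpu(duration_s: float) -> int:
    """Busy-loop on one core for roughly duration_s seconds.

    Returns the number of loop iterations completed, which gives a crude
    measure of how much work the core managed to do under the attack.
    """
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations
```

While this runs, you would watch whether CPU alerts fire and whether the auto scaler adds capacity within the expected window.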

Memory Attack

Memory leaks are a top cause of "Out Of Memory" errors in production. A memory leak happens when an application keeps allocating memory without releasing it. A memory attack helps validate hypotheses about memory-intensive workloads such as in-memory caches and machine learning models. It also helps in cloud migration by exercising the auto-scaling configuration.
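To make the idea concrete, the sketch below allocates memory in fixed-size chunks and holds on to it, which roughly mimics a process whose footprint keeps growing. The name `consume_memory` and its parameters are assumptions made for this example only.

```python
from typing import List


def consume_memory(total_mb: int, chunk_mb: int = 10) -> List[bytearray]:
    """Allocate about total_mb megabytes in chunks and return the chunks.

    The memory stays allocated for as long as the caller keeps the returned
    list alive. Note that the operating system may commit zero-filled pages
    lazily, so resident memory can lag behind the requested amount.
    """
    chunks: List[bytearray] = []
    allocated = 0
    while allocated < total_mb:
        size = min(chunk_mb, total_mb - allocated)
        chunks.append(bytearray(size * 1024 * 1024))
        allocated += size
    return chunks
```

Dropping the returned list ends the attack, since the chunks then become eligible for garbage collection.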

Disk Attack

Disk attacks are often used to simulate reading or writing a large data set, such as a restored backup or a replicated database. They can also help identify loopholes in automatic disk cleanup processes.
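One simple way to approximate this is to write a large temporary file and see how cleanup jobs and disk alerts react. The sketch below assumes a hypothetical helper named `fill_disk`; the directory and sizes are whatever you choose for the experiment.

```python
import os
import tempfile


def fill_disk(directory: str, total_bytes: int, block_size: int = 1024 * 1024) -> str:
    """Write total_bytes of zeros to a new temp file in directory; return its path."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="chaos_fill_")
    block = b"\0" * block_size
    with os.fdopen(fd, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk
    return path
```

Remember to delete the file once the experiment ends, or leave it in place on purpose to check whether the automatic cleanup process removes it.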

I/O Attack

An I/O attack can help you prepare for slower storage solutions by simulating their performance. It helps validate disk-heavy workloads (for example, batch processes that read from and write to disk) and the effectiveness of an in-memory cache.
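The effect of a cache in front of slow storage can be sketched with a small simulation. `SlowStore` and `make_cached_reader` are hypothetical names used only for this illustration; the injected delay stands in for the slower storage an I/O attack would simulate.

```python
import time
from functools import lru_cache


class SlowStore:
    """A dictionary-backed store that sleeps on every read to mimic slow I/O."""

    def __init__(self, data, delay_s):
        self.data = data
        self.delay_s = delay_s
        self.reads = 0  # number of reads that actually hit the slow backend

    def read(self, key):
        self.reads += 1
        time.sleep(self.delay_s)
        return self.data[key]


def make_cached_reader(store, maxsize=128):
    """Wrap store.read in an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached_read(key):
        return store.read(key)

    return cached_read
```

Comparing `store.reads` with the number of requests served shows how much slow I/O the cache absorbed during the experiment.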

State Attack

State attacks change the state of your environment by terminating processes, shutting down or restarting hosts, and changing the system clock. This lets you prepare your systems for unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes.

Process Killer Attack

Process killer attacks let teams terminate a specific process or set of processes. This verifies that watchdogs restart the application or service effectively, and tests leader re-election in clustered workloads.
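The watchdog behavior this attack is meant to exercise can be sketched as follows. This is a simplified, hypothetical `Watchdog` class, not how any particular supervisor works; production systems usually rely on tools such as systemd or a container orchestrator for restarts.

```python
import subprocess
import sys


class Watchdog:
    """Start a child process and restart it whenever it is found dead."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.restarts = 0
        self.proc = subprocess.Popen(cmd)

    def check(self):
        """Restart the child if it has exited. Return True if a restart happened."""
        if self.proc.poll() is None:
            return False  # child is still running
        self.proc = subprocess.Popen(self.cmd)
        self.restarts += 1
        return True

    def stop(self):
        """Terminate the child and wait for it to exit."""
        if self.proc.poll() is None:
            self.proc.kill()
        self.proc.wait()
```

A process killer experiment would kill the child from outside and then confirm that the next `check()` call brings it back.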

Shutdown Attack 

This is similar to Chaos Monkey: an entire host is shut down, which pushes teams to build highly resilient systems. It helps validate DR scenarios such as automatic workload migration, replication, and high availability of clustered workloads.

Time Travel Attack

Time travel attacks allow you to change the system clock. This lets you prepare for scenarios such as Daylight Saving Time (DST) changes, clock drift between hosts, and expiring SSL/TLS certificates.
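The certificate-expiry scenario can be illustrated without touching the real system clock by passing the current time in as a parameter. The sketch below uses made-up validity dates and a hypothetical `is_cert_valid` helper; it does not parse real certificates.

```python
from datetime import datetime, timedelta, timezone


def is_cert_valid(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Return True if now falls inside the certificate's validity window."""
    return not_before <= now <= not_after


# Hypothetical validity window, chosen only for illustration.
NOT_BEFORE = datetime(2024, 1, 1, tzinfo=timezone.utc)
NOT_AFTER = datetime(2025, 1, 1, tzinfo=timezone.utc)

today = datetime(2024, 12, 25, tzinfo=timezone.utc)
ten_days_later = today + timedelta(days=10)  # the "time travel" step

valid_today = is_cert_valid(NOT_BEFORE, NOT_AFTER, today)
valid_after_jump = is_cert_valid(NOT_BEFORE, NOT_AFTER, ten_days_later)
```

A time travel attack performs the same jump on the host clock itself, which also exposes code paths that read the time directly instead of taking it as an argument.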

Network Attack

Network attacks let you simulate unhealthy network conditions including dropped connections, high latency, packet loss, and DNS outages. This lets you build applications that are resilient to unreliable network conditions.

Blackhole Attack

Blackhole attacks help you simulate outages by dropping network traffic between services. This lets you uncover hard dependencies, test fallback and failover mechanisms, and prepare your applications for unreliable networks. We can also validate the monitoring and alerting mechanisms for a cluster.
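One fallback mechanism that a blackhole attack can exercise is serving the last known value when a dependency becomes unreachable. The sketch below is an assumption-laden illustration: `StaleCacheFallback` is a hypothetical class, and `fetch` stands for any callable that raises `ConnectionError` or `TimeoutError` when the network path is cut.

```python
class StaleCacheFallback:
    """Serve fresh data when the dependency answers, stale data when it does not."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable(key) that talks to the remote dependency
        self.cache = {}     # last successfully fetched value per key

    def get(self, key):
        """Return (value, source) where source is 'fresh' or 'stale'."""
        try:
            value = self.fetch(key)
        except (ConnectionError, TimeoutError):
            if key in self.cache:
                return self.cache[key], "stale"
            raise  # no fallback available: this is a hard dependency
        self.cache[key] = value
        return value, "fresh"
```

Keys that raise during the experiment point to hard dependencies that have no fallback yet.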

Latency Attack

Latency is the amount of time taken for a network request to travel from one network endpoint to another. The latency attack injects a delay into outbound network traffic, letting you validate your system’s responsiveness under slow network conditions. This also helps tune circuit breaker configuration, such as retry and timeout thresholds.
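For readers unfamiliar with the pattern, here is a minimal circuit breaker sketch showing the thresholds a latency experiment helps tune. The class, its parameter names, and the default values are all assumptions for illustration; mature libraries offer richer half-open handling and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None   # time the breaker opened, or None if closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Injecting latency shows whether `failure_threshold` and `reset_after_s` trip early enough to protect callers without flapping under normal jitter.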

DNS Attack

Recently, an Akamai DNS failure made many popular websites unreachable. The DNS attack simulates a DNS outage by blocking network access to DNS servers. This lets you prepare for DNS outages, test your fallback DNS servers, and validate DNS resolver configurations.

Packet Loss Attack

This attack is very helpful for streaming services, such as live video or multiplayer gaming, which rely on a high throughput of data. When there is network congestion, many packets are queued, and some packets may be lost once the queue capacity threshold on your hardware is reached. Packet loss attacks let you replicate this condition, simulate the end-user experience, and tune the replay mechanism to improve it.
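The interaction between loss rate and resending can be explored with a small simulation before running a real experiment. The function below is a hypothetical model, not a network implementation: each send attempt is dropped with probability `loss_rate`, and a packet is retried up to `max_attempts` times.

```python
import random


def send_with_retries(packets, loss_rate, max_attempts, rng):
    """Simulate sending packets over a lossy link with bounded retries.

    Returns (delivered, attempts_total): the packets that got through, in
    order, and the total number of send attempts made.
    """
    delivered = []
    attempts_total = 0
    for packet in packets:
        for _ in range(max_attempts):
            attempts_total += 1
            if rng.random() >= loss_rate:  # this attempt was not dropped
                delivered.append(packet)
                break
    return delivered, attempts_total


# Example run with a seeded generator so the simulation is repeatable.
delivered, attempts = send_with_retries(list(range(100)), 0.3, 5, random.Random(42))
```

The ratio of attempts to delivered packets gives a feel for the extra traffic that retries add as the loss rate grows.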


In the next article we will discuss other Chaos Engineering concepts.

