Every company that builds, or has already built, a hosted solution eventually reaches a critical phase: deploying a stable, flexible, and fault-tolerant infrastructure to run its software and serve the business. This process requires answering some common questions:
- Cloud or on-premise deployment?
- Which third-party services will be used (databases, caches, queues, etc.)?
- What will the configuration and deployment process look like?
- Which approaches and tools will cover continuous maintenance and provisioning?
All these questions should be answered as soon as possible, before real users start using the solution, to avoid issues and pitfalls.
What common issues can you run into if you don't think through each part of your infrastructure? Here is a short list:
Downtime due to maintenance
Let's be honest: there is no perfect software in the world, and bugs can appear at any time. Have you ever thought about how you will redeploy your system to deliver fixes, improvements, or simply new versions? What if, at some point, you need a migration? Migration to another solution or database? Or perhaps to another cloud provider?
One of our clients, an airline company, told us that they had lost about $500K during 30 minutes of downtime while their support team delivered a new UI version containing small improvements. Knowing this made our task a serious challenge: our responsibility was to migrate their solution from a monolith to microservices, and of course, without any downtime.
Performance degradation
Performance is a measurable thing, but how many factors in your system directly affect it? For web applications, the most important performance questions are:
- How many people can use the platform at the same time?
- How long does each user wait for a system response?
Both are easy to understand, and the usual way to manage and improve them is targeted optimization. But what if such issues appear only from time to time, during "hot peaks"? A flexible infrastructure will definitely help a lot and is your best insurance.
Failures and poor availability
This is the real world, and in the real world bad things happen: power outages, poor internet connections, server failures, bugs, and bottlenecks. The question is how you deal with all of this. Which strategies have your DevOps and IT teams prepared? How fast can you bring everything back to life, and how much will your clients lose in the meantime?
The complicated process of issue investigation
We already mentioned that issues may happen; in practice, they will happen. How fast can you detect a problem, determine its cause, and design and implement a fix? What is the quality of the data in your logs and metrics? How do you manage sensitive and personally identifiable data? Do you have any aggregation and data visualization?
Data loss
This is probably the most important point. Data is the most valuable asset of any business nowadays. If your system can lose data, or even a small part of it (through concurrent modification, an unavailable endpoint, or broken communication between services), it is a disaster.
All these issues are just the tip of the iceberg, and if even one of them happens, it can cause serious damage to your client's business or even kill it. To avoid that, below we describe a few common approaches we use to prevent such issues and move a solution in the right direction.
Downtime due to maintenance
It is all about your continuous delivery approach. You should think this through in detail, because deployment, even to production, shouldn't be a rare event. Bugs, improvements, custom client requests: handling them is a continuous process, and so is your deployment.
At Jappware, we practice the Blue-Green deployment approach and highly recommend it to everyone. With this approach, you avoid downtime and keep system availability constant during any maintenance procedure.
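To make the idea concrete, here is a minimal sketch of the traffic switch at the heart of Blue-Green deployment, assuming two identical environments behind a router we control. The environment URLs and the /health endpoint are hypothetical placeholders; in a real setup the "pointer" would be a load-balancer target group or a DNS record.

```python
# Minimal blue-green switch sketch: verify the idle environment,
# then flip the traffic pointer to it. URLs are hypothetical.
import urllib.request

ENVIRONMENTS = {
    "blue": "http://blue.internal:8080",
    "green": "http://green.internal:8080",
}

def is_healthy(base_url: str) -> bool:
    """Hit the environment's health endpoint before sending it traffic."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def switch_traffic(current: str) -> str:
    """Deploy goes to the idle environment; flip only if it is healthy."""
    idle = "green" if current == "blue" else "blue"
    if not is_healthy(ENVIRONMENTS[idle]):
        return current  # new version is broken: keep serving the old one
    return idle         # traffic now goes to the freshly deployed side

if __name__ == "__main__":
    print("Live environment:", switch_traffic("blue"))
```

Because the old environment stays untouched until the new one proves healthy, a rollback is just flipping the pointer back.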
Migration is another aspect that might require downtime while you copy the data and switch the infrastructure over. To do it without any downtime, you need much more than a single tool; you need a strategy. One possible plan looks like this (a sketch of the commit-log step follows the list):
- Build a proxy implementing the Canary Deployment process
- Freeze the database scope and make a copy
- Migrate the database and route traffic in two directions (supporting both the old and the new system)
- Create a commit log to track mutating changes
- Create a delta dump of the database to cover anything missed during the main database recreation
- Apply the commit log on top of the final database version to replay the changes
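To illustrate the commit-log step from the list above, here is a minimal sketch: every mutation applied to the old database during the migration is recorded, and afterwards the log is replayed against the new database. The Database and CommitLog classes are hypothetical stand-ins for your real data store and change-capture mechanism.

```python
# Sketch of capturing mutations during migration and replaying them.
import json
import time

class Database:
    """Hypothetical key-value stand-in for the real data store."""
    def __init__(self):
        self.rows = {}

    def apply(self, op: dict) -> None:
        if op["kind"] == "upsert":
            self.rows[op["key"]] = op["value"]
        elif op["kind"] == "delete":
            self.rows.pop(op["key"], None)

class CommitLog:
    """Append-only log of mutations, ordered by timestamp."""
    def __init__(self):
        self.entries = []

    def record(self, op: dict) -> None:
        self.entries.append({**op, "ts": time.time()})

    def replay(self, target: Database, since: float = 0.0) -> None:
        # Apply every mutation captured after the snapshot was taken.
        for op in sorted(self.entries, key=lambda e: e["ts"]):
            if op["ts"] >= since:
                target.apply(op)

old_db, new_db, log = Database(), Database(), CommitLog()
snapshot_ts = time.time()           # moment the bulk copy was taken
op = {"kind": "upsert", "key": "user:1", "value": {"name": "Ann"}}
old_db.apply(op); log.record(op)    # a write arriving mid-migration
log.replay(new_db, since=snapshot_ts)
print(json.dumps(new_db.rows))      # the new DB now has the missed write
```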
Performance degradation
The best way to handle such issues is to be ready, and to become ready you should run load and stress tests across the whole system, from the external API to internal service communication and third-party integrations (databases, etc.). Collect the metrics and build a data sheet with the test results. Using this information, your DevOps team can build an infrastructure that checks the current system state (request latency, memory and CPU consumption, etc.) and, based on well-known data, decides when to scale the instances behind the load balancer up or down.
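As an illustration, here is a minimal sketch of such a scaling decision. The thresholds are hypothetical and would come from your load-test data sheet; in practice a managed autoscaler usually evaluates rules like these for you.

```python
# Sketch of an upscale/downscale decision based on observed metrics.
from dataclasses import dataclass

@dataclass
class SystemState:
    p95_latency_ms: float   # measured request latency (95th percentile)
    cpu_percent: float      # average CPU across instances
    instances: int          # current instance count

def desired_instances(state: SystemState,
                      min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Pick a target instance count from the current system state."""
    if state.p95_latency_ms > 500 or state.cpu_percent > 80:
        target = state.instances + 1    # under pressure: scale up
    elif state.p95_latency_ms < 100 and state.cpu_percent < 30:
        target = state.instances - 1    # idle capacity: scale down
    else:
        target = state.instances        # steady state: do nothing
    return max(min_instances, min(max_instances, target))

print(desired_instances(SystemState(620.0, 85.0, 4)))  # -> 5, scale up
```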
Failures and poor availability
This is where automated monitoring becomes your best friend. We know plenty of real-world cases where cluster nodes, services, or workers simply stopped responding. Yes, there is always a concrete reason (memory issues, lack of disk space, CPU saturation, etc.), but it happens, it can damage your client's business, and it shows up in customer feedback. Automated monitoring is a tool that can catch such situations and trigger actions to fix them as soon as possible.
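Here is a minimal sketch of such a watchdog, assuming each service exposes an HTTP health endpoint. The service names, URLs, and the restart hook are hypothetical; a real system would delegate the restart to systemd, Kubernetes, or another orchestrator.

```python
# Sketch of an automated monitor that restarts unhealthy services.
import time
import urllib.request

SERVICES = {
    "orders": "http://orders.internal:8080/health",
    "payments": "http://payments.internal:8080/health",
}

def is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def restart(name: str) -> None:
    # Placeholder: call your orchestrator's restart API here.
    print(f"ALERT: {name} is down, triggering restart")

def watch(interval_s: float = 10.0, cycles: int = 3) -> None:
    """Poll every service and trigger a restart for unhealthy ones."""
    for _ in range(cycles):
        for name, url in SERVICES.items():
            if not is_up(url):
                restart(name)
        time.sleep(interval_s)

if __name__ == "__main__":
    watch()
```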
The second tool is Infrastructure as Code (IaC). Once you have an automated way to detect failures and availability issues, you need a tool that can quickly restore the environment and bring things back to work without wasting time. Think about it: IaC is a very powerful mechanism that can rebuild an environment in minutes, even when that requires installing third-party tools and performing network and VPC manipulations.
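As a small sketch of what "restoring the environment from code" can look like in practice, here is a Python wrapper around Terraform, assuming Terraform is installed and the directory holds your committed environment definition. The directory path is a hypothetical placeholder.

```python
# Sketch: rebuild an environment from its declarative definition.
import subprocess

def rebuild_environment(workdir: str = "./infra/production") -> None:
    # "init" downloads providers; "apply -auto-approve" converges the
    # real infrastructure to the committed state without prompting.
    subprocess.run(["terraform", "init"], cwd=workdir, check=True)
    subprocess.run(["terraform", "apply", "-auto-approve"],
                   cwd=workdir, check=True)

if __name__ == "__main__":
    rebuild_environment()
```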
The complicated process of issue investigation
Log aggregation and metrics analysis, together with performance notifications and limits: yes, it all sounds much easier than it is in the real world. Logs should be structured. Logging in all the right places, even with correct priority levels, doesn't mean your logs follow any structure. Structure means you can reconstruct a use-case picture by reading the logs. Each log message should carry a sequence identifier and the details of the case, and, of course, it should avoid any sensitive information. Log messages should describe the specifics of the use case rather than point out who did what.
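A minimal sketch of what such structured logging can look like, using only the Python standard library: every message is a JSON object, and a correlation id ties together all messages of one request. The field names and logger name are hypothetical; use whatever your aggregation stack expects.

```python
# Structured JSON logging with a correlation id, stdlib only.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            # ties all messages of one request/use case together
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

request_id = str(uuid.uuid4())
# The message names the case, not the person: no emails, no card
# numbers, no other sensitive data in the payload.
log.info("order_submitted", extra={"correlation_id": request_id})
log.info("payment_confirmed", extra={"correlation_id": request_id})
```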
Metrics are even more important than logs. Put metrics in the right places and you can automate their monitoring and apply rules that drive alerts and notifications. Your metrics should describe the actions, parameters, or status of the things continuously happening in your system. Metrics should be short and smart. We use the term "key things": everyone on the team should be able to explain what the key things in the system are, what is critical, what matters, what can be ignored, and what can cause issues. Understanding the system answers all of these questions; ask your team and check whether you are comfortable with the answers (each answer should contain a "message" that references a specific key thing).
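As a sketch of instrumenting such "key things", here is an example using the prometheus_client library: one counter and one latency histogram around the operation that matters. The metric names, the port, and the simulated work are hypothetical.

```python
# Sketch: expose "key thing" metrics for automated monitoring.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

ORDERS_TOTAL = Counter("orders_total", "Orders accepted by the system")
ORDER_LATENCY = Histogram("order_latency_seconds",
                          "Time spent processing one order")

def process_order() -> None:
    with ORDER_LATENCY.time():                   # observe the duration
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real work
    ORDERS_TOTAL.inc()                           # count the handled order

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for the scraper
    while True:
        process_order()
```

Alerting rules can then be attached to these series, for example when the histogram's high quantiles cross the limits you found during load testing.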
Data loss
This is a very sensitive topic. It is about your understanding of data distribution, partitioning, consistency, and availability. We hope you know and always apply the CAP theorem. You may prefer a slightly different term, ACID, but ACID is not about big data processing: it belongs to the RDBMS world, which is great for simpler solutions, while the rest belongs to the NoSQL world. The CAP theorem shapes the right mindset and ideology, but it doesn't solve, or even explain how to solve, any particular problem.
Think about your data, and do it continuously; this will answer all of your questions:
- How to store and query it?
- Which DB should I use?
- How to optimize reading or writing?
- How to deal with updates and deletes?
- etc.
You cannot afford to lose even a single character or digit. Think about replication: make the replication factor at least 3 and continuously check the consistency of the cluster. Think about distribution and the partitioning key: this makes your cluster more robust and evenly loaded. Think about your data's natural ordering and the clustering key: they help you reach the exact data in the preferred order. And think about the consensus algorithm: it is how your cluster decides which node holds the right result, and whether the result is correct enough for your client to trust it. The quorum arithmetic behind this is sketched below.
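The quorum idea fits in a few lines: with replication factor N, choosing read (R) and write (W) quorum sizes such that R + W > N guarantees that every read overlaps the most recent write, which is the arithmetic behind the "factor of 3" recommendation above.

```python
# Quorum arithmetic for a replicated store.
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum."""
    return r + w > n

def majority(n: int) -> int:
    """Smallest quorum that survives a minority of node failures."""
    return n // 2 + 1

N = 3                    # replication factor recommended above
R = W = majority(N)      # 2 of 3 nodes must acknowledge each operation
print(is_strongly_consistent(N, R, W))   # True: 2 + 2 > 3
print(is_strongly_consistent(N, 1, 1))   # False: stale reads possible
```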
Summary
The client's business and interests should receive not just a product or a solution; they should receive a continuous strategy that moves things forward in the right way, day after day. Modern approaches require attention at every step. There is no, and shouldn't be, any single point of failure; instead, there should be a system, a mechanism, where even if some piece of it breaks, the remaining parts keep serving the business needs.
Nowadays, modern software requires an understanding of current business needs, future business needs, and possible business needs. With those in mind, the software development process should include CI/CD, monitoring, provisioning, and maintenance strategies in its usual daily workflow, so that clients' businesses run non-stop, save or earn money, and stay safe.