Ultimate checklist for a newly joining fullstack software engineer
I worked as a frontend engineer for quite a while, so when I went back to fullstack, it took some time to pick it up again. One of the challenges was gathering all the information about the setup of the project I joined, so that I could perform my duties at a professional level.
So here I've put together a checklist that every newly joining fullstack engineer should go through to feel confident developing, maintaining and observing a service they are responsible for.
The first and most important thing is to know where to find the logs. You need logs for every microservice, for both Staging and every Live environment in every region. Logs can be managed via Datadog, Loki, Loggly or similar. Just go there, filter by the service name, set the log level to "Error" and choose a time range of the last day. It won't hurt to create some bookmarks for quick access.
That way, you can check the logs at any time and quickly see whether there are any recent errors.
Metrics typically contain:
- Resource consumption (CPU, Memory, DB Size).
- Requests per second.
- Request duration.
- Query duration.
- Error rate.
- Cost of running the cloud-native application, per month, per quarter, per year.
- Any other custom metrics you might be interested in.
You need to keep an eye on those, especially after a new release is deployed.
All of these can be set up in Grafana, although Datadog can also be used for this purpose. Again, the dashboards should be available for every environment, and the quick links should be saved so that they are at your disposal.
If a project uses something like OpenTelemetry, it may gather information about so-called spans. A span typically corresponds to a subroutine in the code, and tracing allows you to measure how long each one takes. It's essentially like a profiler, but for the backend. Several tools can display spans, such as Jaeger, Datadog or Grafana, so make sure you have access to the one your project uses.
Essentially, alerting helps SRE and the rest of the team react to possible outages, so it's essential to have it properly set up. Alerting can be set up via Datadog as well, dumping all relevant information into dedicated Slack channels. Make sure you have joined these channels and set the notification policy to "all messages", not only when mentioned: the channel is automated, so you will never be mentioned.
If the application is containerised, it is most likely running inside a K8s cluster. Typically, there is a Staging cluster and several regional Live clusters. With K8s, a good option could be to use Spinnaker. Spinnaker allows deploying new versions of Docker images into the cluster swiftly and without friction. Whenever you make a release that triggers an image build and, consequently, a Spinnaker pipeline, you might want to go to Spinnaker and see whether there were any errors deploying that new image.
Again, the links must be saved for both Staging and Live environments.
So you should be able to:
- trigger a deployment to stg/canary/live or to a specific country (region),
- make sure the deployment was successful.
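If the pipeline ends up updating a Kubernetes Deployment, the "was it successful" part can also be checked from a terminal. Below is a minimal sketch, assuming kubectl is already configured; the context names, namespace and deployment name are placeholders, and `DRY_RUN` is a hypothetical switch that only prints the commands.

```shell
# Sketch: wait for a rollout to finish in every given cluster.
# Assumption: the service runs as a Kubernetes Deployment.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: check_rollouts <namespace> <deployment> <context>...
# Returns non-zero if at least one rollout did not complete in time.
check_rollouts() {
  local namespace deployment ctx failed
  namespace=$1
  deployment=$2
  shift 2
  failed=0
  for ctx in "$@"; do
    run kubectl --context "$ctx" --namespace "$namespace" \
      rollout status "deployment/$deployment" --timeout=180s || failed=1
  done
  return "$failed"
}
```

For example, `check_rollouts default my-service stg-cluster live-eu` would block until both rollouts finish or time out.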
You should have access to both the staging and live environments of GCP, AWS or whichever cloud provider you host with. The same goes for other services, like a Kafka SaaS or MongoDB. You should have a way to quickly take action if something goes wrong in production.
If the infrastructure is running on GCP, at least two things should be done:
- Get access to the GCP console to see all the resources.
- Set up the gcloud CLI tool in order to perform useful operations from the terminal.
Sometimes it's necessary to interact with the clusters directly. Typically, two tasks are quite frequent:
- Restart a misbehaving container.
- Obtain the logs of a failing container.
Assuming that the gcloud CLI tool is already installed and configured, you can get the credentials for a specific cluster and store them locally:
gcloud container clusters get-credentials <cluster_name> --zone <zone_name> --project <project_name>
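Once the credentials are stored, the two frequent tasks can be done with plain kubectl. This is a minimal sketch, assuming the workload is a Deployment; the namespace, pod and deployment names are placeholders, and `DRY_RUN` is a hypothetical switch that only prints the commands.

```shell
# Sketch: fetch logs of a failing container and restart a misbehaving one.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: crashed_logs <namespace> <pod>
# --previous shows the output of the container instance that crashed,
# not of the one that was started after it.
crashed_logs() {
  run kubectl --namespace "$1" logs "$2" --previous --tail=200
}

# usage: restart_deployment <namespace> <deployment>
# Triggers a rolling restart, so pods are replaced gradually.
restart_deployment() {
  run kubectl --namespace "$1" rollout restart "deployment/$2"
}
```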
I wrote a separate article on how to use K8s, and I try to keep it up to date there.
There is also an amazing CLI tool called k9s for managing K8s clusters. I mostly use it for two things:
- To read the logs of a staging container, to understand why it crashes.
- To open a shell inside a container, in order to do things like reading env variables with printenv.
You can also see the logs in the dashboard, if enabled.
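If k9s is not at hand, the same two things can be approximated with kubectl. A minimal sketch follows; the namespace and pod names are placeholders, and it assumes the container image ships `printenv` and `sh`.

```shell
# Sketch: read env variables of a running container without k9s.

# Keeps only the lines that start with the given prefix (treated as a regex).
# An empty prefix keeps everything; no match is not treated as an error.
filter_env() {
  grep "^$1" || true
}

# usage: pod_env <namespace> <pod> [prefix]
pod_env() {
  kubectl --namespace "$1" exec "$2" -- printenv | filter_env "${3:-}"
}

# usage: pod_shell <namespace> <pod>
# Opens an interactive shell inside the container.
pod_shell() {
  kubectl --namespace "$1" exec -it "$2" -- sh
}
```

For example, `pod_env default my-pod DB_` would list only the variables whose names start with `DB_`.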
You must always have read/write access to the staging database. You'll also need access to all production databases in read-only mode. The best way to do it is to tunnel the connection to a local port.
With GCP it can be done via the cloud_sql_proxy tool. Pick a port that you want to be allocated locally, and then run:
cloud_sql_proxy -instances=<project_name>:<region>:<clouddb_instance_name>=tcp:<local_port>
The best way to automate this would be to create a script, like this one, that allows connecting to a database for an arbitrary country or region, live or staging env.
So make sure you have one. It is company-specific, so you'll have to write that script yourself.
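To give an idea of the shape such a script could take, here is a minimal sketch. Every project, region and instance name in the lookup table is made up, as is the `DRY_RUN` switch that only prints the command; the real values depend entirely on your company's setup.

```shell
# Sketch: open a Cloud SQL tunnel for a given environment and country.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical lookup table: "<env>:<country>" -> instance connection name.
# Live entries deliberately point at read replicas.
instance_for() {
  case "$1:$2" in
    stg:de)  echo "my-project-stg:europe-west3:orders-db" ;;
    live:de) echo "my-project-live:europe-west3:orders-db-replica" ;;
    live:us) echo "my-project-live:us-east1:orders-db-replica" ;;
    *) echo "unknown environment/country: $1/$2" >&2; return 1 ;;
  esac
}

# usage: connect_db <env> <country> [local_port]
connect_db() {
  local instance
  instance=$(instance_for "$1" "$2") || return 1
  run cloud_sql_proxy "-instances=${instance}=tcp:${3:-5433}"
}
```

Running `connect_db live us 6000` would then keep a tunnel to the US replica open on local port 6000 until interrupted.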
You can use any client to access the database. It can be DataGrip, but projects like PgAdmin or PhpMyAdmin can also do fine if you use Postgres or MySQL respectively. One important note: when connecting to the live database, always connect to the read replica. As mentioned before, if you don't have a read replica (which you should), at least make the connection read-only. You don't want to mess up the production data, do you?
Basically, you should have a script that is capable of executing arbitrary SQL across selected environments (live, stg, canary) and the desired countries. Sometimes a company has BigQuery enabled, so all tables from all applications across all environments are kept together to enable analytics. In this case, SELECT queries can be run directly from the GCP console.
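A minimal sketch of such a script is shown below. It assumes Postgres, that a tunnel per environment is already open on a local port, and that credentials come from the usual `PGUSER`/`PGPASSWORD` variables or `~/.pgpass`. The labels, ports, database name and the `DRY_RUN` switch are placeholders of mine.

```shell
# Sketch: run one SQL statement against several tunnelled databases.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: run_sql_everywhere <dbname> <sql> <label>=<local_port>...
run_sql_everywhere() {
  local dbname sql target label port
  dbname=$1
  sql=$2
  shift 2
  for target in "$@"; do
    label=${target%%=*}   # text before the first "="
    port=${target#*=}     # text after the first "="
    echo "== $label"
    # PGOPTIONS asks the server to make every transaction read-only,
    # as a safety net on top of connecting to a replica.
    run env "PGOPTIONS=-c default_transaction_read_only=on" \
      psql --no-psqlrc --set=ON_ERROR_STOP=1 \
      --host 127.0.0.1 --port "$port" --dbname "$dbname" \
      --command "$sql" || echo "query failed on $label" >&2
  done
}
```

For example: `run_sql_everywhere orders 'SELECT count(*) FROM users' stg-de=5433 live-de=5434`.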
Sometimes you may want to run the same command on all of your clusters, or on a subset of them. Examples include killing a cronjob or restarting a pod. You should have a script for such a task.
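One possible shape for it is a small loop over kubectl contexts. This is a sketch under the assumption that every cluster is already registered as a context in your kubeconfig; the context names and the `DRY_RUN` switch are placeholders.

```shell
# Sketch: repeat one kubectl command across several clusters.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: on_each_cluster "<ctx1> <ctx2> ..." <kubectl arguments>...
# A failure in one cluster is reported, and the loop moves on to the next.
on_each_cluster() {
  local contexts ctx
  contexts=$1
  shift
  for ctx in $contexts; do   # unquoted on purpose: split on spaces
    echo "== $ctx"
    run kubectl --context "$ctx" "$@" || echo "failed on $ctx" >&2
  done
}
```

For example, `on_each_cluster "stg live-eu" rollout restart deployment/api` restarts the pods of one deployment everywhere, and `on_each_cluster "stg live-eu" patch cronjob my-job -p '{"spec":{"suspend":true}}'` pauses a cronjob.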
If you have outgoing GCP Pub/Sub or Kafka streams, sometimes you may be asked to re-publish the events. You should have a canned solution for this, in order to be able to do it promptly.
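For Pub/Sub, a canned solution could look like the sketch below. It assumes the events to re-publish were exported beforehand into a file with one message payload per line; the project, topic and file names are placeholders, as is the `DRY_RUN` switch.

```shell
# Sketch: re-publish events from a file to a Pub/Sub topic, one per line.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: republish_events <project> <topic> <file>
# Stops at the first message that fails to publish.
republish_events() {
  local line count
  count=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue   # skip empty lines
    # stdin is detached so the command cannot swallow the rest of the file
    run gcloud pubsub topics publish "$2" --project="$1" \
      --message="$line" </dev/null || return 1
    count=$((count + 1))
  done < "$3"
  echo "published $count message(s)" >&2
}
```

It is worth running it with `DRY_RUN=1` first, to review what is about to be sent.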
It's good practice to have the option of running the whole cloud-native app on your local machine. In my opinion, it's wise not to rely solely on unit testing and TDD, but also to be able to actually test new features before pushing to Staging.
Docker Desktop or Colima to the rescue, if you are on Mac or Windows. There is also a variety of projects that offer mocking of the most popular cloud services: LocalStack for AWS and a handful of projects like gcloud-pubsub-emulator or gcp-storage-emulator for GCP.
Ask your team members for the .env.local and .env.test files, so you don't have to spend time figuring out the right values on your own. It is a good shortcut.
One of the most frequent things that may happen to you is your QA engineer reporting an issue on Staging. Without further ado, you can dump the staging database locally and do the research in a local environment, which is of course extremely transparent and safe. It is better than digging through the logs and trying to figure out the issue. You can even use a debugger if the situation calls for it.
There is always a dumping tool available for your database out there. Use pg_dump for Postgres, mongodump for MongoDB, etc. Make a tunnel to the read replica, dump and then restore locally. You can even make a script to automate this, exactly as I did.
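For Postgres, the dump-and-restore step could be sketched as follows. The assumptions: a tunnel to the remote database is already open on a local port, the local target database already exists, and credentials come from `PGUSER`/`PGPASSWORD` or `~/.pgpass`. `DUMP_DIR`, `LOCAL_PG_PORT` and `DRY_RUN` are hypothetical variables of this sketch.

```shell
# Sketch: copy a tunnelled Postgres database into a local one.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: dump_and_restore <tunnel_port> <remote_db> <local_db>
dump_and_restore() {
  local dump_file
  dump_file="${DUMP_DIR:-/tmp}/$2.dump"
  # Custom format is compressed and can be restored selectively.
  run pg_dump --host 127.0.0.1 --port "$1" --dbname "$2" \
    --format=custom --no-owner --file "$dump_file" || return 1
  # --clean --if-exists drops existing objects before recreating them.
  run pg_restore --clean --if-exists --no-owner \
    --host localhost --port "${LOCAL_PG_PORT:-5432}" \
    --dbname "$3" "$dump_file"
}
```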
Not quite a technical action point, but it won't hurt to put together a short summary of the services that consume data produced by the application you are now in charge of, and also of the data your app consumes from other services. It also makes sense to obtain emergency contacts of an EM and a Lead Engineer for every such service.
If you have a lot of tickets in the works and are constantly asked to switch context to urgently address some burning issue, you must be a sorry engineer in need of a way to cut corners when creating PRs. Here is an example of such a script that can help you out.
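Independently of that script, here is a minimal sketch of the idea. It assumes the repository is hosted on GitHub and the `gh` CLI is installed and authenticated; the ticket key format, the branch naming scheme and the `DRY_RUN` switch are my own placeholders.

```shell
# Sketch: go from a ticket key and a title to a draft PR in one command.

# Prints the command when DRY_RUN is set, otherwise executes it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: branch_name <ticket> <title>
# Lowercases the title and turns every run of other characters into "-".
branch_name() {
  local slug
  slug=$(printf '%s\n' "$2" | tr '[:upper:]' '[:lower:]' |
    sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  printf '%s-%s\n' "$1" "$slug"
}

# usage: open_pr <ticket> <title>
# Commits all tracked changes on a new branch, pushes it, opens a draft PR.
open_pr() {
  local branch
  branch=$(branch_name "$1" "$2")
  run git checkout -b "$branch" &&
    run git commit --all --message "$1: $2" &&
    run git push --set-upstream origin "$branch" &&
    run gh pr create --draft --title "$1: $2" --body "Ticket: $1"
}
```

For example, `open_pr ABC-123 'Fix login bug!'` would create the branch `ABC-123-fix-login-bug` and open a draft PR from it.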
So yeah, as you can see, observability is the key.
This article is a work in progress, so as soon as I find new relevant information, I am going to expand the post.
Sergei Gannochenko
Golang, React, TypeScript, Docker, AWS, Jamstack.
20+ years in dev.