Migration Diary Part 2: Moving Logs from Grafana Cloud to Kubernetes
After moving our metrics to our own cluster (see Part 1), the next step was logs. We were still sending logs to Grafana Cloud, which felt weird now that everything else was local.
The setup was similar - deploy Loki for log storage, use Alloy to collect logs from our pods, view everything in Grafana. But logs are trickier than metrics. You generate way more of them, and storing them gets expensive fast.
What I needed to figure out
With metrics, you can throw them in Prometheus and call it a day. Logs are different. I had a few questions I needed to answer:
- How do you collect logs from dozens of pods without killing performance?
- Where do you store potentially gigabytes of logs per day?
- How do you make old logs searchable without keeping everything in memory?
- How do you keep costs reasonable when you're generating logs constantly?
I knew I wanted Loki because it's designed for Kubernetes logs and integrates with Grafana. But I didn't really understand how it worked internally. That took some reading.
The stack: Loki + Alloy
Loki handles storage and querying. Think of it like Prometheus but for logs. It doesn't index the full log content (which would be expensive). It just indexes labels. This makes it way cheaper to run than something like Elasticsearch.
Alloy is Grafana's telemetry collector - we use it here for logs. It runs as a DaemonSet on every node, picks up container logs, and ships them to Loki. It replaced the older Grafana Agent.
The flow: Container logs -> Alloy (running on each node) -> Loki -> Grafana for viewing
Understanding Loki components
Here's what took me a while to understand. Loki isn't just one thing. It's multiple components working together. When you deploy it, you're running several services under the hood.
Distributor
First stop for incoming logs. Alloy sends logs here. The distributor validates them (label format, timestamps, rate limits) and forwards them to ingesters. It's the front door.
Ingester
Takes logs from the distributor and writes them to storage. But not immediately. It batches logs in memory first, compresses them into "chunks", and then flushes those chunks to storage. This is way more efficient than writing every single log line one at a time.
Ingesters keep recent chunks in memory so queries for recent logs are fast. After a chunk is old enough and compressed, it gets written to long-term storage.
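That batching behavior is tunable. A minimal sketch of the relevant ingester settings - these are real Loki config options, but the values here are illustrative rather than what we actually run:

ingester:
  chunk_idle_period: 30m      # flush a chunk that hasn't received logs for 30 minutes
  max_chunk_age: 2h           # flush chunks after 2 hours even if still receiving logs
  chunk_target_size: 1572864  # aim for ~1.5MB compressed chunks before flushing
  chunk_encoding: snappy      # compression codec for chunk data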
Querier
Handles log queries from Grafana. When you search for logs, the querier figures out where they are. Some might still be in ingesters' memory, others might be in long-term storage. It fetches from both and returns the results.
Query Frontend
Sits in front of queriers and makes them more efficient. It splits large queries into smaller time ranges, caches results, and parallelizes the work. Without this, searching through a day of logs would hammer a single querier.
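How aggressively it splits is configurable. A sketch of the knobs involved (illustrative values, not our tuning):

limits_config:
  split_queries_by_interval: 30m  # break long queries into 30-minute sub-queries
  max_query_parallelism: 16       # cap how many sub-queries run concurrently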
Compactor
This one matters for keeping costs down. Over time you accumulate thousands of small index files from all the log chunks. The compactor runs periodically and merges these into larger, more efficient files. It also handles deleting old logs based on your retention policy.
Without it, your index grows forever and queries get slower as they scan through thousands of tiny files.
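Retention is driven by the compactor plus a global limit. A sketch matching the 30-day policy we ended up with (values assumed from the retention described later in this post):

compactor:
  retention_enabled: true      # let the compactor delete expired chunks
  delete_request_store: azure  # where pending deletion requests are tracked
limits_config:
  retention_period: 720h       # 30 days; older chunks get removed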
Ruler
Evaluates alert rules against your logs - like Prometheus's alerting rules, but written in LogQL, with firing alerts handed off to Alertmanager. We're not using this yet but it's there when we need it.
Index Gateway (optional)
In high-scale setups, this handles index queries separately to reduce load on storage. We don't need it at our scale.
Why all these components?
At first I was confused why Loki needed so many moving parts. But it makes sense when you think about scale.
If you have 50 pods each generating logs, and you're ingesting thousands of log lines per second, you can't just write them one by one to disk. You need batching (ingester), efficient querying across time ranges (query frontend), index maintenance so queries don't slow down over time (compactor), and separation of recent vs old data (ingesters vs storage).
Each component handles one piece of that puzzle.
Deploying Loki in "simple scalable" mode
Loki has different deployment modes. I went with "simple scalable" which groups components into three paths:
- Write path: Distributor + Ingester
- Read path: Query Frontend + Querier
- Backend: Compactor, Ruler, and supporting services like the index gateway and query scheduler
This gives you horizontal scaling (you can add more ingesters or queriers) without the complexity of running every component separately.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: infrastructure
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: loki
    targetRevision: 6.6.4
    helm:
      values: |
        loki:
          commonConfig:
            replication_factor: 1
          storage:
            type: azure
            azure:
              accountName: # from Azure
              accountKey: # from secret
              containerName: loki-chunks
        write:
          replicas: 2
        read:
          replicas: 2
        backend:
          replicas: 1
Storing logs in an Azure Storage Account
This is the part that made everything cost-effective. I'm not storing logs on expensive Kubernetes persistent volumes. I'm using Azure Blob Storage.
Loki writes chunks and indexes to blob storage, which costs way less than keeping everything on SSDs attached to Kubernetes nodes. We're paying about $0.02 per GB per month for blob storage versus $0.10+ per GB for persistent volumes.
The trade-off: querying old logs is slightly slower since they're in blob storage instead of local disk. But most queries are for recent logs, and ingesters have that data in memory anyway.
I created a separate storage account and blob container just for Loki, then passed the credentials via a Kubernetes secret.
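Roughly what that looks like with the Azure CLI and kubectl - the account name, resource group, and namespace below are placeholders:

# Dedicated storage account and container for Loki (placeholder names)
az storage account create -n lokilogs0 -g my-rg --sku Standard_LRS
az storage container create -n loki-chunks --account-name lokilogs0

# Grab the account key and hand it to the cluster as a secret
KEY=$(az storage account keys list -n lokilogs0 -g my-rg --query '[0].value' -o tsv)
kubectl create secret generic loki-azure -n monitoring --from-literal=accountKey="$KEY"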
Setting up Alloy
Alloy was simpler than I expected. It runs as a DaemonSet (one pod per node) and automatically discovers all containers on that node.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: alloy
  namespace: argocd
spec:
  project: infrastructure
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: alloy
    targetRevision: 0.3.2
    helm:
      values: |
        alloy:
          configMap:
            content: |
              discovery.kubernetes "pods" {
                role = "pod"
              }

              discovery.relabel "pods" {
                targets = discovery.kubernetes.pods.targets
                // Add pod labels as Loki labels (see the sketch below)
              }

              loki.source.kubernetes "pods" {
                targets    = discovery.relabel.pods.output
                forward_to = [loki.write.default.receiver]
              }

              loki.write "default" {
                endpoint {
                  url = "http://loki-gateway/loki/api/v1/push"
                }
              }
With loki.source.kubernetes, Alloy tails container logs through the Kubernetes API rather than reading the files under /var/log/pods directly (that file-based approach is the alternative, via loki.source.file). Either way, each log line gets labels like pod name, namespace, and container name before being shipped to Loki.
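To fill in that // Add pod labels as Loki labels placeholder, the relabel rules look something like this - a sketch, where the exact label set is whatever you decide to keep:

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  // Map Kubernetes metadata onto the Loki labels we query by
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label  = "app"
  }
}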
What I learned about labels
Loki's whole design revolves around labels. Unlike traditional logging systems that index every word in your logs, Loki only indexes labels. When you query, you filter by labels first, then grep through the matching logs.
This means choosing good labels matters. We use:
- namespace - which Kubernetes namespace
- pod - pod name
- container - container name within the pod
- app - application label from Kubernetes
You want enough labels to narrow down your search, but not so many that you create tons of unique streams. Each unique combination of labels creates a new "stream" in Loki, and too many streams hurt performance. I had to resist the urge to label everything.
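In practice a query narrows by labels first (cheap, hits the index), then greps the matching streams (a scan). For example, with a made-up app name:

{namespace="production", app="checkout"} |= "error" != "healthcheck"

The label matcher picks the streams; the |= and != line filters do the grep.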
How it's working
The setup's been running for a few weeks. Logs from all our pods flow into Loki, and we can query them in Grafana just like we did with Grafana Cloud.
Retention is set to 30 days. After that, the compactor deletes old chunks from blob storage. We're generating about 10-15GB of logs per day (after compression); at 30-day retention that's roughly 300-450GB in steady state, which at $0.02 per GB per month works out to $6-9/month - call it $0.20-0.30/day to store in Azure.
Compare that to what we were paying Grafana Cloud for log ingestion and storage. The savings are similar to what we got from the metrics migration. And we have full control over retention policies now.
What I'd do differently
If I were doing this again, I'd spend more time on log filtering before they reach Loki. Not every log line is useful. A lot of our pods generate debug logs that we never actually look at. We're storing them anyway, which is a waste.
Alloy can filter logs before sending them to Loki - dropping debug lines, removing sensitive data, that kind of thing. I should set that up to reduce storage costs.
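In Alloy that would be a loki.process component between the source and the writer. A sketch, assuming our logs carry a level=debug token (the regex would need adjusting to the actual log format):

loki.process "filter" {
  forward_to = [loki.write.default.receiver]

  // Drop anything that looks like a debug-level line (format assumption)
  stage.drop {
    expression = "level=(debug|DEBUG)"
  }
}

The loki.source.kubernetes component's forward_to would then point at loki.process.filter.receiver instead of going straight to loki.write.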
I'd also set up log-based alerts earlier. Loki's ruler component can trigger alerts like "notify me if you see ERROR more than 10 times in 5 minutes". We have that for metrics but not for logs yet. Should have done it from the start.
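Ruler rules use the Prometheus alerting format with LogQL expressions. A sketch of that exact rule (the namespace and threshold are assumptions):

groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        # more than 10 ERROR lines per app over the last 5 minutes
        expr: sum by (app) (count_over_time({namespace="production"} |= "ERROR" [5m])) > 10
        for: 5m
        labels:
          severity: warning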
Wrapping up
Moving logs turned out to be more work than metrics, but not as bad as I expected. The key was understanding that Loki is really multiple components working together, and each one solves a specific problem in the pipeline.
Using Azure Blob Storage instead of local volumes made this way cheaper than Grafana Cloud. Having logs in the same Grafana instance as our metrics is convenient too - I can correlate them in the same dashboard.
The observability migration is done now. Metrics and logs both running in our cluster, managed through ArgoCD. Total cost is probably 10% of what we were paying for Grafana Cloud.
For more details on Loki components and architecture, check out the official Loki documentation at https://grafana.com/docs/loki/latest/.