investigate logging solutions #40

Open
rsalmond opened this issue Dec 19, 2024 · 4 comments
Labels
Epic A large project that needs to be broken into parts


@rsalmond
Contributor

rsalmond commented Dec 19, 2024

We need to get logs off prod1 / cache1 / data1 and into something searchable. It would be really nice if we didn't have to run an elasticsearch cluster for this purpose.

Possible options:

Others?

@IanEdington IanEdington added the Epic A large project that needs to be broken into parts label Jan 24, 2025
@rsalmond
Contributor Author

Following @azend's good advice, I abandoned playing with Quickwit and tried some options for shipping logs straight off the DO droplets into CloudWatch. After several fruitless hours I gave up trying to get the CloudWatch agent to actually ship logs. Not sure what was wrong, but on-prem usage doesn't seem particularly important to the maintainers. E.g., if you put the config file where they say you should put it (instead of fetching the config from Parameter Store), the agent deletes it for you 🙃

So I've got a Fluent Bit config that seems to do the job. Since we don't have automation for the droplets, my rough plan is:

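Deploy Fluent Bit:

```shell
# Add the Fluent Bit signing key (tee's stdout copy of the binary key is discarded)
curl https://packages.fluentbit.io/fluentbit.key | gpg --dearmor | sudo tee /usr/share/keyrings/fluentbit-keyring.gpg > /dev/null

# Add the Fluent Bit apt repository for Ubuntu 20.04 (focal)
echo "deb [signed-by=/usr/share/keyrings/fluentbit-keyring.gpg] https://packages.fluentbit.io/ubuntu/focal/ focal main" | sudo tee -a /etc/apt/sources.list

# Install the package
sudo apt update
sudo apt install fluent-bit
```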
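The repository line above hardcodes Ubuntu 20.04 (`focal`). If some droplets run a different release, a small helper could build the line from `/etc/os-release` instead. This is a sketch under assumptions: the function name `fluentbit_apt_line` is hypothetical, and it assumes the droplets run Ubuntu or Debian, for which Fluent Bit publishes repositories under `packages.fluentbit.io/ubuntu/<codename>` and `packages.fluentbit.io/debian/<codename>`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Fluent Bit apt source line for a distro/codename.
# With no arguments it reads ID (e.g. ubuntu, debian) and VERSION_CODENAME
# (e.g. focal, jammy, bookworm) from /etc/os-release on the current host.
fluentbit_apt_line() {
  local distro="${1:-}" codename="${2:-}"
  if [[ -z $distro || -z $codename ]]; then
    distro="$(. /etc/os-release && echo "$ID")"
    codename="$(. /etc/os-release && echo "$VERSION_CODENAME")"
  fi
  printf 'deb [signed-by=/usr/share/keyrings/fluentbit-keyring.gpg] https://packages.fluentbit.io/%s/%s/ %s main\n' \
    "$distro" "$codename" "$codename"
}
```

On a droplet, `fluentbit_apt_line | sudo tee /etc/apt/sources.list.d/fluent-bit.list > /dev/null` would write the line to its own file, which is easier to remove later than a line appended to `/etc/apt/sources.list`.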

Drop an `~/.aws/credentials` file with a profile called `digital_ocean_monitoring` in the fluentbit user's home directory. Then start with a `fluentbit.conf` file that looks roughly like this:

```
[INPUT]
    name tail
    Tag nginx
    Path /var/log/nginx/*.log

[OUTPUT]
    Name cloudwatch_logs
    Match nginx
    region ca-central-1
    profile digital_ocean_monitoring
    log_group_name nginx
    log_stream_name nginx
    auto_create_group On
```

And massage it until it looks right for:

  • nginx
  • php-fpm8.2 (seems to be the one doing the work)
  • others?
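Since nginx, php-fpm, and any other services all follow the same tail-to-CloudWatch shape, one way to do the massaging is to generate an input/output pair per service. A minimal sketch, with assumptions flagged: `fluentbit_stanza` is a hypothetical helper; `/var/log/php8.2-fpm.log` is the usual Debian/Ubuntu location of the php-fpm 8.2 log but should be checked on the droplet; and it names streams `<host>-<tag>` (an assumption, not part of the plan above) so prod1, cache1, and data1 don't all write into one stream.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one tail [INPUT] plus a matching cloudwatch_logs
# [OUTPUT], in the same shape as the nginx example.
#   $1 tag, also used as the log group name (e.g. nginx)
#   $2 path glob for the tail input (e.g. '/var/log/nginx/*.log')
#   $3 host label for the stream name (defaults to the short hostname)
# Assumption: one stream per droplet, named <host>-<tag>.
# The heredoc is unquoted so ${...} expands; heredoc expansion does not glob,
# so a '*' in the path is written literally for Fluent Bit to expand.
fluentbit_stanza() {
  local tag="$1" path="$2" host="${3:-$(hostname -s)}"
  cat <<EOF
[INPUT]
    Name tail
    Tag ${tag}
    Path ${path}

[OUTPUT]
    Name cloudwatch_logs
    Match ${tag}
    region ca-central-1
    profile digital_ocean_monitoring
    log_group_name ${tag}
    log_stream_name ${host}-${tag}
    auto_create_group On

EOF
}
```

For example, `{ fluentbit_stanza nginx '/var/log/nginx/*.log'; fluentbit_stanza php-fpm /var/log/php8.2-fpm.log; } | sudo tee /etc/fluent-bit/cloudwatch.conf > /dev/null`, plus an `@INCLUDE /etc/fluent-bit/cloudwatch.conf` line in the main config and `sudo systemctl restart fluent-bit`, would add both pairs without touching the package's own `[SERVICE]` settings (`cloudwatch.conf` is a hypothetical file name).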

@rsalmond
Contributor Author

Oh, and PR #80 adds an IAM user and policy for sending to CloudWatch.

@azend
Collaborator

azend commented Jan 30, 2025

Looks good to me. I'm not surprised there was some gotcha with the CloudWatch agent, though deleting its own config wouldn't have been my top choice, hah. If the credentials worked to sign in to STS, I guess we could have created a parameter and used that. But Fluent Bit seems like a far more flexible option.

I can't remember how many VMs there are in DO, but Ian suggested there were a lot. If there are a lot, we could create an Ansible playbook, but it probably isn't worth it for fewer than 10. I found a role that does pretty much exactly what's described above, but I don't know whether we want to use it or rob the important parts for our own role: https://github.com/bimdata/ansible_role_fluentbit/blob/main/tasks/main.yml

While I'm thinking about it, we should put a log retention period on the output config. One year of retention is enough for most security and compliance standards, including SOC 2. The exact period doesn't matter much, but having one keeps log storage from building up when the project goes idle or we forget to change the setting later.
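Fluent Bit's `cloudwatch_logs` output also has a `log_retention_days` option (e.g. `log_retention_days 365`; worth checking against the installed version), but it only applies to log groups the plugin creates itself. For groups that already exist, the AWS CLI's `aws logs put-retention-policy` sets retention directly. A sketch, assuming the CLI has credentials allowed to call `logs:PutRetentionPolicy` (the IAM user from #80 may not have that permission) and that groups are named after the tags as in the config above; `set_retention`, `RETENTION_DAYS`, and the `DRY_RUN` switch are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set a retention policy on existing CloudWatch log groups.
# 365 days matches the one-year period suggested above and is one of the
# retention values CloudWatch accepts; override with RETENTION_DAYS.
# With DRY_RUN=1 the commands are printed instead of executed.
set_retention() {
  local days="${RETENTION_DAYS:-365}" group
  for group in "$@"; do
    local cmd=(aws logs put-retention-policy
      --region ca-central-1
      --log-group-name "$group"
      --retention-in-days "$days")
    if [[ ${DRY_RUN:-0} == 1 ]]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}
```

`DRY_RUN=1 set_retention nginx php-fpm` previews the two calls; dropping `DRY_RUN=1` applies them.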

@rsalmond
Contributor Author

Status update: #80 has been applied to prod/stage and merged. The deploy plan above is complete on prod1, and logs are shipping into CloudWatch in the prod AWS account.

That said, they're kinda ugly, particularly the php-fpm logs, which contain multi-line stack traces that aren't grouped together, so they're hard to read. Fluent Bit does support multiline grouping, but I didn't have much time to play with it, so I haven't got that bit figured out yet. My regex game is not strong 😿
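One way Fluent Bit does multiline grouping is a `[MULTILINE_PARSER]` of `type regex`: a `start_state` rule matches the first line of a record, and a `cont` rule matches the lines that belong to it. The sketch below uses hypothetical function names and a guessed record format: it assumes every php-fpm record starts with a bracketed timestamp like `[30-Jan-2025 14:02:11` and that stack-trace lines (`Stack trace:`, `#0 ...`, `  thrown in ...`) never start with `[`. `preview_records` groups a sample of the real log locally using the same rules, so the regex can be checked before it goes into Fluent Bit, and `print_php_multiline_parser` prints the matching parser definition.

```shell
#!/usr/bin/env bash
# Assumed php-fpm record start: "[DD-Mon-YYYY HH:MM:SS" (with or without a timezone).
START_RE='^\[[0-9]{2}-[A-Za-z]{3}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'

# Read log lines on stdin; print one line per grouped record, joining
# continuation lines with a literal " \n ". Mirrors the parser rules below.
preview_records() {
  local line record="" open=0
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $START_RE ]]; then
      # start_state rule: flush any open record and begin a new one.
      if (( open )); then printf '%s\n' "$record"; fi
      record=$line
      open=1
    elif (( open )) && [[ -n $line && ${line:0:1} != "[" ]]; then
      # cont rule: a non-empty line not starting with "[" joins the open record.
      record+=' \n '"$line"
    else
      # Matches neither rule: flush and pass the line through on its own.
      if (( open )); then printf '%s\n' "$record"; fi
      printf '%s\n' "$line"
      open=0
    fi
  done
  if (( open )); then printf '%s\n' "$record"; fi
}

# Print the equivalent Fluent Bit parser. The heredoc is quoted so the
# backslashes reach Fluent Bit unchanged; \d is the Onigmo spelling of [0-9].
print_php_multiline_parser() {
  cat <<'EOF'
[MULTILINE_PARSER]
    name          php_fpm_multiline
    type          regex
    flush_timeout 1000
    rule          "start_state"  "/^\[\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}/"  "cont"
    rule          "cont"         "/^[^\[]/"                                        "cont"
EOF
}
```

Something like `tail -n 200 /var/log/php8.2-fpm.log | preview_records` shows whether each trace comes out as one line. If it does, the parser goes in its own parsers file (Fluent Bit expects multiline parsers there, referenced with `parsers_file` under `[SERVICE]`), and the php-fpm `[INPUT]` gets `multiline.parser php_fpm_multiline`. If php-fpm is relaying worker output (`child N said into stderr`), every line carries its own timestamp prefix and this simple start rule won't group anything; that case needs a different regex.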

One other thing I noticed, which is probably nbd but worth mentioning: the fluent-bit process chews up a bunch of CPU at startup.

[Screenshot: fluent-bit CPU usage at startup]

I didn't notice any issues with gpo.ca or secure.gpo.ca while I was testing it, but if anyone noticed load spikes it was probably that.

I've captured the manual changes in #81 so I can sleep at night. It ain't pretty, but it works.
