# Knowledge Setup Guide
Resolve uses a two-tier knowledge model: org-level knowledge applies company-wide, while team-level knowledge is team-specific. When both exist, team knowledge takes priority. See Team Knowledge (docId 6klfd9jow9rrkqi8fzjr1) for more detail.

Resolve pulls knowledge differently depending on context:

- **Alert investigations:** Resolve examines all knowledge associated with the alert and its team(s). One runbook per alert rule or monitor makes it easier for Resolve to surface the most relevant guidance.
- **Chats:** Resolve pulls in skills and relevant knowledge sections. Concise documents with clearly marked headings help Resolve match the right knowledge to your question faster.

In both cases, clearly stating when a piece of knowledge applies (e.g., "Use this when investigating latency spikes in the checkout service") helps Resolve pull in the right guidance at the right time.

## Getting started

Start with the highest-impact content and refine over time. Here's the recommended order:

1. Create your team and add a brief description so Resolve understands what this team owns. Teams work best when they mirror your on-call rotations, so the right knowledge is accessible to the right responders. Each team member has one of two roles:
   - Team admin: full access to edit team content and configuration and to manage members.
   - Team member: can view team content, configuration, and members, but cannot make changes.
2. Write your team's resolve.md with a system overview, glossary, and key query guidance for your observability tools.
3. Add dashboard guidance for your most-used dashboards. Resolve auto-generates a starting point; you can review and refine it.
4. Attach alert runbooks for your highest-volume or most critical alerts. These can be imported from GitHub or Confluence or written directly in Resolve.
5. Include general documents for specialized knowledge that applies to specific scenarios. Like runbooks, these can be imported from external sources rather than written from scratch.
6. Iterate as you use Resolve. Fill in context gaps as they
come up.

## Knowledge types

There are 4 types of team knowledge, all of which are [Markdown](https://commonmark.org/help/) files.

### 1. resolve.md

Team-specific agent instructions used in all chats and alert investigations.

Tips for writing resolve.md:

- Use Markdown formatting with clear section headers.
- Include a brief system overview and a glossary of your team's terminology.
- Create separate sections for different observability types (e.g., "Metrics guidance" vs. "Logs guidance") and custom tools, if applicable.
- Include context about when the guidance applies (e.g., "When troubleshooting production issues, use 'app3 prod cluster' as the environment").
- Keep it concise, about 10,000 characters. This content is referenced frequently, so overly long files may reduce effectiveness.

Example resolve.md:

> **Clusters**
>
> We have the following clusters: app3 cluster, dev2 cluster, stgl cluster. app3 cluster is the production cluster. If the user asks about particular cluster(s), pass that information to all tools and agents. For example, if a user asks to look in logs for app3, always mention the cluster information during explorations.
>
> **Logs guidance**
>
> Always apply a cluster filter. The logs are fetched from a Grafana Cloud instance that has logs for many other organizations. For this organization, you must always apply one of the following filters: `cluster="app3 cluster"` or `cluster="dev2 cluster"`.
>
> If investigating a Kubernetes pod, you should also use the Kubernetes log integration.
>
> For most services, you can get the logs for the right level using a query like `{cluster="app3 cluster", namespace="checkout assistant", service_name="<service name>"} | detected_level="<level>"`. Levels can be info, warn, debug, or error.
>
> Labels like cluster, namespace, and service_name are indexed and make the query efficient. As much as possible, avoid keyword-only searches without label filters, as they scan a lot of logs.

### 2. Alert runbooks

Alert-specific guidance, always examined during mapped
alert investigations.

Tips for writing alert runbooks:

- Write sequential, actionable steps Resolve can follow in order.
- Add examples of exact or templated queries with clear placeholders (e.g., "Run service name=\<service name> environment\ prod error; replace \<service name> with the actual service").
- When referencing dashboards, add corresponding dashboard guidance for those dashboards.
- Include example queries or custom tool usage where the syntax is unusual (e.g., "If the country is Germany, the log index should be index=index ge").
- Either use titles that match the alert name or link the runbook to the alert rule on the Alerts page.

Example alert runbook:

> 1. Use traces to find the appropriate RDS instances and related services.
> 2. Use the 'RDS overview (us-east-2)' dashboard to get the health of the overall RDS instances.
> 3. Use the 'RDS performance insights (us-east-2)' dashboard to triage further.
> 4. Use the 'pgstats' command via the awscli tool to determine if there are specific queries resulting in high CPU.
> 5. From step 1, use the related services to determine the radius of impact.

### 3. Dashboard guidance

Instructions for interpreting each dashboard, including variable selection, filters, and chart sections. Resolve will auto-generate guidance for the dashboards you add. Review and edit the Markdown file to improve how Resolve uses your dashboard.

Tips for writing dashboard guidance:

- Be specific about how to set filters and variables (e.g., "service name should be set to \<servicename> prod").
- Call out specific charts or sections that are most useful for particular issue types.

Example dashboard guidance:

> **When to use this dashboard**
>
> Troubleshoot any service issues, including latency, performance, etc.
>
> **Dashboard variables**
>
> Replace \<servicename> and \<country code> according to the service and country. All variables are set to \<servicename> \<country code>, except AWS ECS service, which appends " backend".
>
> - Service should be set to \<servicename> \<country code>
> - Host service should be set to \<servicename>
\<country code>
> - AWS ECS service should be set to \<servicename> \<country code> backend
>
> If the service name and country code are both provided, do not append an additional country code.
>
> Example: if the service is "checkout" and the country is Germany, use the following values:
>
> - Service should be set to checkout ge
> - Host service should be set to checkout ge
> - AWS ECS service should be set to checkout ge backend

### 4. Docs

Guidance for specific issue scenarios, which is pulled in based on relevance.

Tips for writing effective docs:

- Clearly state when this guidance applies so Resolve can retrieve it at the right moment.
- Provide rich context for services and systems (e.g., "The recommendation service is rarely mission-critical; failures here are almost always rooted in other services").
- Include guidance on which attributes or fields to use for filtering and grouping in your specific observability tools.
- Describe how and when to use custom tools (e.g., "When investigating payment processing issues, use the paymentdb tool to check transaction status").

Example docs:

> **General guidance for the frontend service**
>
> You can often use the `@http.url`, `@http.path_group`, and `@http.target` attributes to group by or filter for certain spans/traces.
>
> **Investigation guidance**
>
> The frontend is the entry point of our app (sitting behind the frontend proxy, which receives requests from users). Therefore, lots of errors might bubble up from our other microservices and trigger frontend alerts. Just because there is an error or alert triggered on the frontend service does not mean it is the root cause of the problem. In fact, it's likely that it is not the root cause. Error logs in the frontend might reference other services, and we can use traces to determine if any dependencies are the source of our errors.
>
> **Evidence queries guidance**
>
> You can often use the `@http.url`, `@http.path_group`, and `@http.target` attributes to group by or filter for certain spans/traces. The service makes HTTP and gRPC requests to downstream services, so you might see
common error codes from those protocols in any error logs.

## Importing from external sources

Import alert runbooks and docs from GitHub or Confluence instead of duplicating them in Resolve. Resolve fetches the latest version at investigation time, so your guidance always stays in sync with the source of truth. To set this up, go to External Docs on your Team Knowledge page.
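
As an illustration of the kind of query guidance a resolve.md can encode, the "always apply a cluster filter" rule from the example resolve.md earlier in this guide could be sketched as a small helper. This is a minimal sketch, not part of Resolve: `build_log_query` is a hypothetical function, the allowed cluster names are copied from the example as written, and the label names `service_name` and `detected_level` assume underscores where the example shows two words.

```python
# Hypothetical helper illustrating the example resolve.md's log query rules.
# Cluster names are copied from the example as written; the underscores in
# service_name and detected_level are an assumption.
ALLOWED_CLUSTERS = {"app3 cluster", "dev2 cluster"}
LEVELS = {"info", "warn", "debug", "error"}


def build_log_query(cluster, namespace, service_name, level=None):
    """Return a label-filtered log query string, or raise ValueError."""
    # The example guidance requires one of the organization's cluster filters.
    if cluster not in ALLOWED_CLUSTERS:
        raise ValueError(f"cluster must be one of {sorted(ALLOWED_CLUSTERS)}")
    if level is not None and level not in LEVELS:
        raise ValueError(f"level must be one of {sorted(LEVELS)}")

    # Indexed labels go in the stream selector so the query stays efficient.
    selector = (
        f'{{cluster="{cluster}", namespace="{namespace}", '
        f'service_name="{service_name}"}}'
    )
    if level is None:
        return selector
    return f'{selector} | detected_level="{level}"'
```

Calling `build_log_query("app3 cluster", "checkout assistant", "frontend", "error")` yields a query with all three label filters plus the level filter, while a cluster outside the allowed set is rejected instead of producing an unfiltered search.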