Verified Commit 740b5cb0 authored by David Hoese

Update and fix influxdb inserts, tasks, and manual averages

parent c749cc63
@@ -21,6 +21,99 @@ rain01 and rain02 servers.
these should use a Python environment with the mentioned packages installed and use
the full path to that environment's python executable.

Initialize InfluxDB v2 Database
-------------------------------
This should only be needed once when setting up a server with a fresh InfluxDB
database.
Prerequisites
^^^^^^^^^^^^^
Follow the installation instructions for InfluxDB v2.x:
https://docs.influxdata.com/influxdb
For operational use we currently install InfluxDB as a systemd service:
https://docs.influxdata.com/influxdb/v2.6/install/?t=Linux#install-influxdb-as-a-service-with-systemd
We'll also need the influx CLI to initialize the database:
https://docs.influxdata.com/influxdb/v2.6/reference/cli/influx/
Setup database
^^^^^^^^^^^^^^
These commands should be run as the metobs user so that the configuration is
stored in that user's home directory.

.. code-block::

   influx setup -u metobs -p PASSWORD -o metobs -b metobs_raw --retention 0 -n metobs_operator -f
Where PASSWORD should be replaced by the password in the
``/home/metobs/influxdb_operator_password.txt`` file. This creates a user
"metobs" and an organization named "metobs", along with an initial bucket
called "metobs_raw". The retention period for this bucket is initially set to 0 (infinite)
but will be changed later once data has been added and aggregated into the "realtime" bucket.
This configuration is saved under the profile "metobs_operator", so named because it is
an operator account with permissions to all buckets in all organizations. It should not be
used for normal work. Lastly, the ``-f`` flag skips the confirmation prompt.
Since no token was specified in this command, InfluxDB will generate one for us
and store it in ``~/.influxdbv2/configs``.
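
When the time comes to shorten the raw bucket's retention (as mentioned
above), a command along these lines should work. This is only a sketch: the
bucket ID ("abcd") and the 30 day period are placeholders, not values from
this setup.

.. code-block::

   influx bucket update -i abcd -r 30d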
.. note::

   The password file should be read-only by the "metobs" user on the server.
Create realtime bucket
^^^^^^^^^^^^^^^^^^^^^^
.. code-block::

   influx bucket create -o metobs -n metobs_realtime -d "Aggregated data for realtime displays" -r 0
This way we have a "metobs_raw" bucket for all full resolution data and a "metobs_realtime" bucket for all
aggregated/averaged data. List the buckets to get the bucket IDs to be used when creating tokens.

.. code-block::

   influx bucket ls
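
For convenience, each bucket ID can also be captured into a shell variable;
this is a sketch that assumes the CLI's ``--json`` output and an installed
``jq``, neither of which is required by the steps above:

.. code-block::

   RAW_ID=$(influx bucket ls -o metobs --json | jq -r '.[] | select(.name == "metobs_raw") | .id')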
Create operational tokens
^^^^^^^^^^^^^^^^^^^^^^^^^
A read-only token for the metobs_raw and metobs_realtime buckets:
.. code-block::

   influx auth create -o metobs --read-bucket abcd --read-bucket efgh -d "Read-only access to metobs buckets"
This creates a token (printed to the terminal) in the
"metobs" organization that has read access to the buckets with IDs "abcd" and
"efgh". Replace these IDs with the IDs for the raw and realtime buckets from
the previous ``influx bucket ls`` command.
A read-write token for ingest purposes:
.. code-block::

   influx auth create -o metobs --read-bucket abcd --read-bucket efgh --write-bucket abcd --write-bucket efgh --read-tasks --write-tasks -d "Read-write access to metobs buckets"
Make sure to note the tokens printed by these commands; they will not be readable anywhere else.
Store the read-only token in ``/home/metobs/influxdbv2_metobs_ro.txt`` and the
read-write token in ``/home/metobs/influxdbv2_metobs_rw.txt``.
Change the permissions for these two files to read-only for the metobs user:
.. code-block::

   chmod 400 /home/metobs/influxdbv2_metobs_ro.txt
   chmod 400 /home/metobs/influxdbv2_metobs_rw.txt
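
As a quick sanity check, the read-only token can be passed explicitly to a
CLI command. This is a sketch only; exactly what a bucket-scoped token is
permitted to list may vary by InfluxDB version:

.. code-block::

   influx bucket ls -o metobs -t "$(cat /home/metobs/influxdbv2_metobs_ro.txt)"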
Backfill InfluxDB Database
--------------------------
"""Utilities for working with an InfluxDB database for realtime data. """Utilities for working with an InfluxDB database for realtime data.
There should only be one database named 'metobs' There should only be one database named 'metobs'
...@@ -10,10 +8,14 @@ see https://github.com/influxdata/influxdb/issues/6932. ...@@ -10,10 +8,14 @@ see https://github.com/influxdata/influxdb/issues/6932.
See the README for more information about how to use this and other scripts. See the README for more information about how to use this and other scripts.
""" """
from __future__ import annotations

import argparse
import os
import sys
import logging
from typing import Any

import requests
import importlib
import math
@@ -21,21 +23,20 @@ from itertools import zip_longest
from datetime import datetime
from metobscommon.util.calc import wind_vector_components
from influxdb_client import InfluxDBClient, TaskCreateRequest
from influxdb_client.client.write_api import SYNCHRONOUS

LOG = logging.getLogger(__name__)
DB_HOST = 'localhost'
DB_PORT = 8086
ORG_NAME = "metobs"
REALTIME_BUCKET_NAME = "metobs_realtime"
RAW_BUCKET_NAME = "metobs_raw"
# name of the single measurement that all data is written to
MEASUREMENT_NAME = "metobs"
# don't include time when writing data, used for testing CQs and RPs
DEBUG_TIME = bool(os.environ.get("DEBUG_INFLUX_TIME", False))


def _mktime(frame, timestamp="stamp"):
    return int((frame[timestamp] - datetime(1970, 1, 1)).total_seconds() * 10**9)
@@ -81,41 +82,11 @@ def frame_records(frames,
        measurement_name, fields=fields_str, timestamp=nanos, tags=tags_str)


def insert(lines: list[str], influxdb_conn_params: dict[str, str]):
    """Insert InfluxDB 'line protocol' lines into an InfluxDB v2 database."""
    with InfluxDBClient(**influxdb_conn_params) as client:
        write_api = client.write_api(write_options=SYNCHRONOUS)
        write_api.write(bucket=RAW_BUCKET_NAME, record=lines)
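
# Illustrative usage sketch (not part of this commit): the URL, token, and
# line-protocol records below are hypothetical.
#
#     conn_params = {"url": "http://localhost:8086", "org": ORG_NAME,
#                    "token": "EXAMPLE_TOKEN"}
#     lines = ["metobs,site=aoss,inst=tower air_temp=21.4 1672531200000000000"]
#     insert(lines, conn_params)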

DURATIONS = ['1m', '5m', '1h']
@@ -146,90 +117,92 @@ def _get_symbols_list(args):
    return symbols_list


def create_tasks(args):
    """Create recurring tasks that create averaged versions of the raw data.

    The task as configured (see the raw string variable below) assumes realtime
    data never arrives later than 1 minute after the observation time. These
    tasks only apply to newly arriving data and will not be applied to
    backfilled data. The aggregation/averaging function must be called
    separately.
    """
    influxdb_conn_params = convert_influxdb_args_to_kwargs(args)
    with InfluxDBClient(**influxdb_conn_params) as client:
        tasks_api = client.tasks_api()
        for duration in DURATIONS:
            LOG.info(f"Adding task for {duration!r} averaged data")
            task_for_duration = AGGREGATE_FLUX_TASK_FORMAT.format(
                interval=duration,
                raw_bucket=RAW_BUCKET_NAME,
                realtime_bucket=REALTIME_BUCKET_NAME,
                org=ORG_NAME,
            )
            task_request = TaskCreateRequest(
                flux=task_for_duration,
                org=ORG_NAME,
                description=f"Aggregate raw data into {duration} averages",
                status="active",
            )
            tasks_api.create_task(task_create_request=task_request)


# Assume that data coming from realtime sources are no more than 1 minute late
# XXX: Do I need to get 2x the interval of data?
# XXX: Do I need to do anything about per-site/inst pair or per-symbol?
AGGREGATE_FLUX_TASK_FORMAT = """
option task = {{
    name: "Average data ({interval})",
    every: {interval},
    offset: 1m,
}}

from(bucket: "{raw_bucket}")
    |> range(start: -(task.every * 2))
    |> filter(fn: (r) => r["_measurement"] == "metobs")
    |> aggregateWindow(every: {interval}, fn: mean)
    |> set(key: "_measurement", value: "metobs_{interval}")
    |> to(org: "{org}", bucket: "{realtime_bucket}", tagColumns: ["site", "inst"])
"""

MANUAL_AGGREGATE_QUERY = """
from(bucket: "{raw_bucket}")
    |> range(start: {start}, stop: {stop})
    |> filter(fn: (r) => r["_measurement"] == "metobs")
    |> filter(fn: (r) => r["site"] == "{site}" and r["inst"] == "{inst}")
    |> aggregateWindow(every: {interval}, fn: mean)
    |> set(key: "_measurement", value: "metobs_{interval}")
    |> to(bucket: "{realtime_bucket}", tagColumns: ["site", "inst"])
"""


def run_average_queries(args):
    stations = [x.split('.') for x in args.stations]
    if not all(len(x) == 2 for x in stations):
        raise ValueError("Stations must be specified as 2-element pairs separated by a '.'.")

    start_time = args.start_time.strftime("%Y-%m-%dT%H:%M:%SZ")
    end_time = args.end_time.strftime("%Y-%m-%dT%H:%M:%SZ")
    influxdb_conn_params = convert_influxdb_args_to_kwargs(args)
    with InfluxDBClient(**influxdb_conn_params) as client:
        query_api = client.query_api()
        for site, inst in stations:
            for duration in args.durations:
                LOG.info(f"Running average query for {duration!r} averaged data")
                aggregate_query = MANUAL_AGGREGATE_QUERY.format(
                    raw_bucket=RAW_BUCKET_NAME,
                    start=start_time,
                    stop=end_time,
                    site=site,
                    inst=inst,
                    interval=duration,
                    realtime_bucket=REALTIME_BUCKET_NAME,
                )
                if args.debug:
                    print(aggregate_query)
                    continue
                LOG.debug(aggregate_query)
                query_api.query(aggregate_query)


def _dt_convert(datetime_str):
@@ -238,35 +211,16 @@ def _dt_convert(datetime_str):
def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('-v', '--verbose', dest='verbosity', action="count", default=0,
                        help='each occurrence increases verbosity 1 level through ERROR-WARNING-INFO-DEBUG')
    add_influxdb_command_line_arguments(parser)
    subparsers = parser.add_subparsers()

    subparser = subparsers.add_parser("create_tasks", help="Create the recurring averaging tasks")
    subparser.set_defaults(func=create_tasks)

    subparser = subparsers.add_parser("run_manual_average", help="Manually run averaging queries")
    subparser.add_argument('--debug', action='store_true',
                           help='Don\'t actually send the commands, print them to stdout')
    subparser.add_argument('--stations', nargs="+",
@@ -282,10 +236,6 @@ def main():
    subparser.add_argument('-d', '--durations', nargs='+', default=['1m'],
                           choices=DURATIONS,
                           help="Which average intervals to compute")
subparser.add_argument("--symbol-list", nargs="+",
help="Lists of importable paths for the symbols to use in continuous queries")
subparser.add_argument("--symbols", nargs="+",
help="List of symbols to include in the continuous queries (alternative to --symbol-list)")
    subparser.set_defaults(func=run_average_queries)

    args = parser.parse_args()
@@ -297,5 +247,25 @@ def main():
    return args.func(args)


def add_influxdb_command_line_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--influxdb-host", default=DB_HOST,
                        help="Hostname of database connection")
    parser.add_argument("--influxdb-port", default=DB_PORT, type=int,
                        help="Port of database connection")
    parser.add_argument("--influxdb-org", default="metobs",
                        help="InfluxDB organization to do work under")
    parser.add_argument("--influxdb-token", required=True,
                        help="Authentication token for InfluxDB connections. "
                             "This should have write permissions to all metobs buckets.")


def convert_influxdb_args_to_kwargs(args: argparse.Namespace) -> dict[str, Any]:
    return {
        "url": f"http://{args.influxdb_host}:{args.influxdb_port}",
        "org": args.influxdb_org,
        "token": args.influxdb_token,
    }
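
# Example invocations (sketch only; the script path is hypothetical and the
# start/end time options for run_manual_average are elided from this diff):
#
#     python influxdb.py --influxdb-token "$(cat /home/metobs/influxdbv2_metobs_rw.txt)" create_tasks
#     python influxdb.py --influxdb-token "$(cat /home/metobs/influxdbv2_metobs_rw.txt)" run_manual_average \
#         --stations aoss.tower -d 1m 5m ...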
if __name__ == "__main__": if __name__ == "__main__":
sys.exit(main()) sys.exit(main())