
ATS Predictions

This library can be found as part of the ats-project, here. This repository will NOT be kept up to date.

Installation

Install in a Python environment using pip.

Requirements

Python >= 3.8

pip install -U git+https://gitlab.ssec.wisc.edu/mdrexler/ats_predictions.git

This installs a CLI for running the script and a Python library used to create the files that define how to make predictions.

Usage

After installation, run the script with the predict_ats command or with python -m predict_ats.

predict_ats [-h] [-s SKIP [SKIP ...]] [-O OUTPUT] [-o] [--quiet] [-f FILES [FILES ...]] cfg_path

The script requires a configuration (JSON) file that tells it how to make predictions.
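For example, assuming a configuration file named config.json in the current directory (the file name is a placeholder):

predict_ats config.json

or, equivalently:

python -m predict_ats config.json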

Configuration File

The configuration file is a JSON file with metadata on how to make predictions. There is an example here.

Here are all the possible keys in the configuration file:

Key                Value Type (JSON)   Required?   Notes
ocr_database       str                 [✓]         Path to OCR database, see this.
prediction_file    str                 [✓]         Path to the file that contains prediction functions.
predictions        list of objects     [✓]         List of prediction function metadata, see this.
output_database    str                 [✗]         Default: ./predictions.sqlite
data_source_file   str                 [✗]         Path to the file that contains data source functions; default: prediction_file.
data_sources       list of objects     [✗]         List of data source metadata, see this.
ocr_table          str                 [✗]         Table that contains the OCR lines in ocr_database; default: 'lines'.
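A minimal configuration using only the required keys might look like the following sketch (the paths and names are placeholders, not real files):

{
    "ocr_database": "/path/to/scanned.sqlite",
    "prediction_file": "/absolute/path/to/my_predictions.py",
    "predictions": [
        {
            "name": "date_predictions",
            "function": "predict_date"
        }
    ]
}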

Documentation

Getting the OCR Dataset

The OCR database is located on zazu at:
/home/mdrexler/Projects/ats-projects/naming/ocr_database/export/scanned.sqlite

Prediction Functions

A prediction function is a function that takes the OCR data for a single ATS file and returns a prediction about its metadata. A prediction function has the following signature:

from __future__ import annotations  # needed for the dict | None annotation on Python < 3.10

import pandas as pd
from predict_ats import make_prediction

@make_prediction
def prediction(df: pd.DataFrame, *args, **kwargs) -> dict | None:
    ...

*Note that the decorator @make_prediction is required!

In Config File

To specify a prediction function in the configuration file, first make sure that prediction_file is set to the absolute path of the file that contains the prediction function.

Then, add a JSON object to the predictions list with the following format:

{
    "name": "unique_name (used as database table name)",
    "function": "function_name",
    "data_source": "optional data_source name",
    "args": [
        "optional", "list", "of", "arguments", "to", "pass", "to", "function"
    ],
    "kwargs": {
        "optional": "object",
        "of": "keyword",
        "arguments": "to",
        "pass": "to",
        "the": "function"
    }
}
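For example, an entry for a hypothetical prediction function named predict_date might look like this (the name and keyword arguments below are illustrative only):

{
    "name": "date_predictions",
    "function": "predict_date",
    "kwargs": {
        "min_year": 1966,
        "max_year": 1974
    }
}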
Input

df is a dataframe that contains the OCR data. It has the following columns:

Column        Description
filename      name of the ATS file
line_number   0-indexed line number
lines         OCR data for that line

To get a string representation of the OCR data from all lines, use the following:

ocr_str = ''.join(df.loc[:, 'lines'].astype(str))

*args and **kwargs are passed in from the configuration file.

Output

The prediction function should return None if a prediction can't be made, or a dictionary with the predictions made, like:

{
    'satellite': 1, # either 1 or 3
    'day': 1, # between 1 and 31
    'month': 1, # between 1 and 12
    'year': 1966, # between 1966 and 1974
    'time': '0012', # string of length 4 or 6
    'latitude': 90.0, # float
    'longitude': 180.0, # float
    'is_mapped': False, # boolean
}

*Note that no individual key is required in the returned dictionary, but a prediction is only considered valid if it includes satellite, day, month, year, and time.
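As an illustration, a minimal prediction function that scans the OCR text for a four-digit year in the valid range might look like the sketch below. The function name and the matching logic are illustrative only; they are not part of the library.

import re

import pandas as pd
from predict_ats import make_prediction

@make_prediction
def predict_year(df: pd.DataFrame, *args, **kwargs):
    # Combine the OCR data from all lines of the file into one string.
    ocr_str = ''.join(df.loc[:, 'lines'].astype(str))

    # Look for a four-digit year in the valid ATS range (1966-1974).
    match = re.search(r'19(6[6-9]|7[0-4])', ocr_str)
    if match is None:
        return None  # no prediction can be made

    # A partial prediction; no individual key is required.
    return {'year': int(match.group(0))}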

Data Source Functions

A data source function is a function that takes the OCR data for all ATS files, transforms or edits the data in some way, and returns the new data. A data source function has the following signature:

import pandas as pd

def source(df: pd.DataFrame, *args, **kwargs) -> pd.Series:
    ...

*Note that you should return only the edited column as a pandas Series, NOT the entire dataframe.
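For example, a data source that normalizes the OCR text to lowercase could be written as the following sketch (the function name is illustrative):

import pandas as pd

def lowercase_lines(df: pd.DataFrame, *args, **kwargs) -> pd.Series:
    # Return only the transformed 'lines' column, not the whole dataframe.
    return df['lines'].astype(str).str.lower()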

In Config File

To specify data sources, use the optional key data_sources. Its value should be a list of objects in one of the following formats:

Functional sources

{
    "name": "name_of_source (used as reference for predictions)",
    "function": "name of function in data source file",
    "args": [
        "optional", "list", "of", "arguments", "to", "pass", "to", "the", "data", "source", "function"
    ],
    "kwargs":{
        "optional": "object",
        "of": "keyword",
        "arguments": "to",
        "pass": "to",
        "the": "data",
        "source": "function"
    }
}

Compositional sources

You can also specify a composition of data sources, so it's not necessary to repeat code throughout different data source functions.

{
    "name": "name_of_source (used as reference for predictions)",
    "composition": [
        "list", "of", "data", "source", "names", "to", "compose", "in", "order"
    ]
}

*Note: By default, predict_ats looks for data source functions in prediction_file. If you want to keep data source functions separate from prediction functions, specify a data_source_file that contains just the data source functions.
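Putting these together, a data_sources list that wraps the example functions from this README (lowercase_lines is the illustrative sketch above; no_first_lines is defined under Output below) and composes them might look like this; all names are illustrative:

"data_sources": [
    {
        "name": "lowercase",
        "function": "lowercase_lines"
    },
    {
        "name": "no_first",
        "function": "no_first_lines"
    },
    {
        "name": "lowercase_no_first",
        "composition": [
            "no_first", "lowercase"
        ]
    }
]

A prediction entry could then use the composed source by setting its data_source key to "lowercase_no_first".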

Input

df is a dataframe that contains the OCR data. It has the same columns as the prediction function input:

Column        Description
filename      name of the ATS file
line_number   0-indexed line number
lines         OCR data for that line

*args and **kwargs are passed in from the configuration file.

Output

The data source function should return a pandas Series where each element in the series corresponds to a row in the input dataframe.

If you would like to filter out some rows of the dataframe, set the corresponding elements of the returned Series to pandas' NaN value instead of removing them from the Series.

e.g. removing all rows that are the first OCR line of an image:

import pandas as pd

def no_first_lines(df: pd.DataFrame) -> pd.Series:
    is_first_line = df['line_number'] == 0  # True for the first OCR line of each image
    return df['lines'].mask(is_first_line)  # replace those rows with NaN, keeping the Series aligned

Author

Created by Max Drexler.