ATS Predictions
This library can be found as part of the ats-project, here. This repository will NOT be kept up to date.
Installation
Install in a Python environment using pip.
Requirements
Python >= 3.8
```
pip install -U git+https://gitlab.ssec.wisc.edu/mdrexler/ats_predictions.git
```
This will install a CLI for running the script and a Python library for creating the files that define how predictions are made.
Usage
After installation, run the script with the `predict_ats` command or with `python -m predict_ats`.

```
predict_ats [-h] [-s SKIP [SKIP ...]] [-O OUTPUT] [-o] [--quiet] [-f FILES [FILES ...]] cfg_path
```
The script requires a configuration (JSON) file that tells it how to make predictions.
Configuration File
The configuration file is a JSON file with metadata on how to make predictions. There is an example here.
Here are all the possible keys in the configuration file:
| Key | Value Type (JSON) | Required? | Notes |
|---|---|---|---|
| `ocr_database` | str | ✓ | Path to the OCR database, see this. |
| `prediction_file` | str | ✓ | Path to the file that contains prediction functions. |
| `predictions` | list of objects | ✓ | List of prediction function metadata, see this. |
| `output_database` | str | ✗ | Default is `./predictions.sqlite`. |
| `data_source_file` | str | ✗ | Path to the file that contains data source functions; default is `prediction_file`. |
| `data_sources` | list of objects | ✗ | List of data source metadata, see this. |
| `ocr_table` | str | ✗ | Table that contains the OCR lines in `ocr_database`; default is `'lines'`. |
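For reference, a minimal configuration using the keys above might look like the following (all paths and names here are hypothetical placeholders, not values shipped with the library):

```json
{
    "ocr_database": "/path/to/scanned.sqlite",
    "prediction_file": "/absolute/path/to/my_predictions.py",
    "predictions": [
        {
            "name": "year_prediction",
            "function": "prediction"
        }
    ],
    "output_database": "./predictions.sqlite"
}
```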
Documentation
Getting the OCR Dataset
The OCR database is located on zazu at:
```
/home/mdrexler/Projects/ats-projects/naming/ocr_database/export/scanned.sqlite
```
Prediction Functions
A prediction function is a function that takes the OCR data for a single ATS file and returns a prediction about its metadata. A prediction function will have the following signature:
```python
import pandas as pd
from predict_ats import make_prediction

@make_prediction
def prediction(df: pd.DataFrame, *args, **kwargs) -> dict | None:
    ...
```
*Note that the `@make_prediction` decorator is required!
In Config File
To specify a prediction function in the configuration file, first make sure that `prediction_file`'s value is the absolute path to the file that contains the prediction function. Then, add a JSON object to the `predictions` list with the following format:
```json
{
    "name": "unique_name (used as database table name)",
    "function": "function_name",
    "data_source": "optional data_source name",
    "args": [
        "optional", "list", "of", "arguments", "to", "pass", "to", "function"
    ],
    "kwargs": {
        "optional": "object",
        "of": "keyword",
        "arguments": "to",
        "pass": "to",
        "the": "function"
    }
}
```
Input
`df` is a dataframe that contains the OCR data, with the following columns:

| filename | line_number | lines |
|---|---|---|
| name of ATS file | 0-indexed line number | OCR data |
To get a string representation of the OCR data from all lines, use the following:

```python
ocr_str = ''.join(df.loc[:, 'lines'].astype(str))
```
`*args` and `**kwargs` are passed in from the configuration file.
Output
The prediction function should return `None` if a prediction can't be made, or a dictionary with the predictions made, like:
```python
{
    'satellite': 1,      # either 1 or 3
    'day': 1,            # between 1 and 31
    'month': 1,          # between 1 and 12
    'year': 1966,        # between 1966 and 1974
    'time': '0012',      # string of length 4 or 6
    'latitude': 90.0,    # float
    'longitude': 180.0,  # float
    'is_mapped': False,  # boolean
}
```
*Note that no single key is required in order to return a prediction, but a valid prediction must include `satellite`, `day`, `month`, `year`, and `time`.
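As a sketch of the core logic only (this is not part of the library; the `predict_year` helper, its regex, and the string input are all hypothetical, and the required `@make_prediction` decorator plus DataFrame handling are omitted), a prediction function's body might search the joined OCR string for a plausible ATS-era year:

```python
from __future__ import annotations

import re

def predict_year(ocr_str: str) -> dict | None:
    """Hypothetical helper: look for a year in 1966-1974 in the OCR text."""
    match = re.search(r'\b(19(?:6[6-9]|7[0-4]))\b', ocr_str)
    if match is None:
        return None  # no prediction can be made
    return {'year': int(match.group(1))}
```

A real prediction function would build `ocr_str` from `df` (e.g. with the `''.join(...)` snippet above) and would typically combine several such checks before returning.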
Data Source Functions
A data source function is a function that takes the OCR data for all ATS files, transforms or edits the data in some way, and returns the new data. A data source function will have the following signature:
```python
import pandas as pd

def source(df: pd.DataFrame, *args, **kwargs) -> pd.Series:
    ...
```
*Note that you should return only the edited column as a Series, NOT the entire dataframe.
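For example, a minimal data source might normalize the OCR text to uppercase (a sketch, not library code; the `uppercase_lines` name is hypothetical). It returns only the edited `lines` column as a Series, per the contract above:

```python
import pandas as pd

def uppercase_lines(df: pd.DataFrame) -> pd.Series:
    """Hypothetical data source: uppercase the OCR text.

    Returns only the transformed 'lines' column, aligned with the
    input dataframe's index.
    """
    return df['lines'].astype(str).str.upper()
```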
In Config File
To specify a data source, use the optional key `data_sources`. It should be a list of objects in one of the following formats:
Functional sources
```json
{
    "name": "name_of_source (used as reference for predictions)",
    "function": "name of function in data source file",
    "args": [
        "optional", "list", "of", "arguments", "to", "pass", "to", "the", "data", "source", "function"
    ],
    "kwargs": {
        "optional": "object",
        "of": "keyword",
        "arguments": "to",
        "pass": "to",
        "the": "data",
        "source": "function"
    }
}
```
Compositional sources
You can also specify a composition of data sources, so it's not necessary to repeat code throughout different data source functions.
```json
{
    "name": "name_of_source (used as reference for predictions)",
    "composition": [
        "list", "of", "data", "source", "names", "to", "compose", "in", "order"
    ]
}
```
*Note: The default file that predict_ats will use to find the data source functions is `prediction_file`, but if you want to separate data source functions from prediction functions, you can specify a separate `data_source_file` that contains just the data source functions.
Input
`df` is a dataframe that contains the OCR data, with the following columns:

| filename | line_number | lines |
|---|---|---|
| name of ATS file | 0-indexed line number | OCR data |
`*args` and `**kwargs` are passed in from the configuration file.
Output
The data source function should return a pandas Series where each element in the series corresponds to a row in the input dataframe.
If you would like to filter out some rows of the dataframe, set the corresponding rows to pandas' `NaN` value instead of removing them from the Series.
e.g. removing all rows that are the first OCR line of an image:

```python
import pandas as pd

def no_first_lines(df: pd.DataFrame) -> pd.Series:
    # Replace values on first lines (line_number == 0) with NaN,
    # keeping the Series aligned with the input dataframe.
    first_line = df['line_number'] == 0
    return df['lines'].mask(first_line)
```
Author
Created by Max Drexler.