Usage Examples

Launch the tool with no arguments to get a help message. You get the same thing if you launch with --help option.

# nldi-cli

You must supply a COMMAND: ['display', 'download', 'ingest', 'sources', 'validate']

Usage: nldi-cli [OPTIONS] COMMAND [ARGS]...

  CLI to launch NLDI crawler.

Options:
  -v             Verbose mode.
  --config PATH  Location of config file.
  --help         Show this message and exit.

Commands:
  display   Show details for named source.
  download  Download the data associated with a named data source.
  ingest    Download and process data associated with a named data source.
  sources   List all available crawler sources and exit.
  validate  Connect to data source(s) to verify that they can supply data...

Note that you can specify -v to go into verbose mode. Using -vv will be extra verbose.

The tool connects to a database for all source data. The details of that database connection can be configured in two ways: by setting environment variables, or by supplying a config file.

CONFIG Option 1

# export NLDI_DB_HOST=172.18.0.1
# export NLDI_DB_PORT=5432
# export NLDI_DB_USER=username
# export NLDI_DB_PASS=secret
# export NLDI_DB_NAME=nldi
#
# nldi-cli sources

ID : Source Name                                    : Type  : URI
==   ==============================================   =====   ================================================
 1 : Water Quality Portal                           : POINT : https://www.waterqualitydata.us/data/Station/sea...
 2 : HUC12 Pour Points                              : POINT : https://www.sciencebase.gov/catalogMaps/mapping/...
 5 : NWIS Surface Water Sites                       : REACH : https://www.sciencebase.gov/catalog/file/get/60c...
 6 : Water Data Exchange 2.0 Sites                  : POINT : https://www.hydroshare.org/resource/5f665b7b82d7...
 7 : geoconnex.us reference gages                   : REACH : https://www.hydroshare.org/resource/3295a17b4cc2...
 8 : Streamgage catalog for CA SB19                 : REACH : https://sb19.linked-data.internetofwater.dev/col...
 9 : USGS Geospatial Fabric V1.1 Points of Interest : REACH : https://www.sciencebase.gov/catalogMaps/mapping/...
10 : Vigil Network Data                             : REACH : https://www.sciencebase.gov/catalog/file/get/60c...
11 : NWIS Groundwater Sites                         : POINT : https://www.sciencebase.gov/catalog/file/get/60c...
12 : New Mexico Water Data Initative Sites          : POINT : https://locations.newmexicowaterdata.org/collect...
13 : geoconnex contribution demo sites              : REACH : https://geoconnex-demo-pages.internetofwater.dev...

The sources sub-command to the CLI tool will list all of the data sources configured in the database specified using the NNDI_DB_ environment variables.

CONFIG Option #2

You can also specify connection parameters by pointing the tool to a toml file. The contents of that file should look like this:

[nldi-db]
hostname: "172.18.0.1"
port: "5432"
username: "username"
password: "secret"
db_name: "nldi"

Use this option with the --config switch on the CLI tool:

# nldi-cli -vv --config ./crawler.toml sources
INFO:root:VERBOSE is 2
INFO:root: Consulting environment variables for DB connection info...
INFO:root: Parsing TOML config file ./nldi-crawler.toml for DB connection info...
WARNING:root:Password stored as plain text in crawler.toml. Consider passing as env variable instead.
INFO:root: Using DB connection URI: postgresql://username:****@172.18.0.1:5432/nldi

ID : Source Name                                    : Type  : URI
==   ==============================================   =====   ================================================
 1 : Water Quality Portal                           : POINT : https://www.waterqualitydata.us/data/Station/sea...
 2 : HUC12 Pour Points                              : POINT : https://www.sciencebase.gov/catalogMaps/mapping/...
 5 : NWIS Surface Water Sites                       : REACH : https://www.sciencebase.gov/catalog/file/get/60c...
 6 : Water Data Exchange 2.0 Sites                  : POINT : https://www.hydroshare.org/resource/5f665b7b82d7...
 7 : geoconnex.us reference gages                   : REACH : https://www.hydroshare.org/resource/3295a17b4cc2...
 8 : Streamgage catalog for CA SB19                 : REACH : https://sb19.linked-data.internetofwater.dev/col...
 9 : USGS Geospatial Fabric V1.1 Points of Interest : REACH : https://www.sciencebase.gov/catalogMaps/mapping/...
10 : Vigil Network Data                             : REACH : https://www.sciencebase.gov/catalog/file/get/60c...
11 : NWIS Groundwater Sites                         : POINT : https://www.sciencebase.gov/catalog/file/get/60c...
12 : New Mexico Water Data Initative Sites          : POINT : https://locations.newmexicowaterdata.org/collect...
13 : geoconnex contribution demo sites              : REACH : https://geoconnex-demo-pages.internetofwater.dev...

Note the additional output due to the verbose flags being set.

Display a Source

We can get detailed information about a specific source with the display sub-command. For future examples, assume the environment has been set to the correct database connection settings.

# nldi-cli display 10
10 :: Vigil Network Data
  Source Suffix:  vigil
  Source URI:     https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=vigil.geojson
  Feature ID:     SBID
  Feature Name:   Site Name
  Feature URI:    SBURL
  Feature Reach:  nhdpv2_REACHCODE
  Feature Measure:nhdpv2_REACH_measure
  Ingest Type:    reach
  Feature Type    hydrolocation

Validate a Source

This will attempt to connect to the URI of the source, and parse its output to see that it is valid JSON and matches the metadata that the source table expects.

# nldi-cli validate 10
Checking Vigil Network Data...  [FAIL] : Column not found for 'feature_reach' : nhdpv2_REACHCODE

# nldi-cli validate 1
Checking Water Quality Portal...  [FAIL] : Network Timeout

# nldi-cli validate 13
Checking geoconnex contribution demo sites...  [PASS]

You can also validate all sources at once, by not specifying a specific source:

# nldi-cli -v  validate
INFO:root: VERBOSE is 1
INFO:root: Consulting environment variables for DB connection info...
INFO:root: Parsing TOML config file ./nldi-crawler.toml for DB connection info...
INFO:root: Using DB connection URI: postgresql://username:****@172.18.0.1:5432/nldi
INFO:root:Validating data source(s)
1 : Checking Water Quality Portal...  [FAIL] : Network Timeout
2 : Checking HUC12 Pour Points...  [FAIL] : Invalid JSON
5 : Checking NWIS Surface Water Sites...  [PASS]
6 : Checking Water Data Exchange 2.0 Sites...  [PASS]
7 : Checking geoconnex.us reference gages...  [PASS]
8 : Checking Streamgage catalog for CA SB19...  [FAIL] : Network Timeout
9 : Checking USGS Geospatial Fabric V1.1 Points of Interest...  [PASS]
10 : Checking Vigil Network Data...  [FAIL] : Column not found for 'feature_measure' : nhdpv2_REACH_measure
11 : Checking NWIS Groundwater Sites...  [PASS]
12 : Checking New Mexico Water Data Initative Sites...  [FAIL] : Network Timeout
13 : Checking geoconnex contribution demo sites...  [FAIL] : Column not found for 'feature_id' : id

Download Source Data to Local Disk

This is not especially useful on its own, but can be used for debugging. Specifying a source will download the GeoJSON from that source to a local file.

# nldi-cli download 10
Source 10 downloaded to /home/trantham/nldi-crawler-py/CrawlerData_10_w2_08yh5.geojson

The filename is CrawlerData_, plus the the source number, plus a unique random string from the tempfile library, plus .geojson.

Ingesting Source Data

This is the main function of this tool. It will download source data, then parse it for individual features, which it will then insert into the master features table in the nldi database.

# nldi-cli -vv --config ./crawler.toml ingest 10
INFO:root:VERBOSE is 2
INFO:root: Consulting environment variables for DB connection info...
INFO:root: Parsing TOML config file ./nldi-crawler.toml for DB connection info...
WARNING:root:Password stored as plain text in nldi-crawler.toml. Consider passing as env variable instead.
INFO:root: Using DB connection URI: postgresql://username:****@172.18.0.1:5432/nldi
INFO:root: Downloading data from https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=vigil.geojson ...
INFO:root:Writing to tmp file /home/trantham/nldi-crawler-py/CrawlerData_10_wbkl40e2.geojson
INFO:root: Source 10 dowloaded to /home/trantham/nldi-crawler-py/CrawlerData_10_wbkl40e2.geojson
INFO:root: Ingesting from REACH source: 10 / Vigil Network Data
INFO:root: Processed 70 features from Vigil Network Data