Function Reference

routines to manage the table of crawler_sources

class nldi_crawler.source.CrawlerSource(**kwargs)

An ORM reflection of the crawler_source table

The crawler_source table is held in the nldi_data schema in the NLDI PostGIS database. The schema name and table name are hard-coded to reflect this.

This object maps properties to columns for a given row of that table. Once this object is created, the row’s data is instantiated within the object.

>>> stmt = (
...     select(CrawlerSource)
...     .order_by(CrawlerSource.crawler_source_id)
...     .where(CrawlerSource.crawler_source_id == 1)
... )
>>> for src in session.scalars(stmt):
...     print(f"{src.crawler_source_id} == {src.source_name}")

table_name(*args)

Getter-like function to return a formatted string representing the table name.

If an optional positional argument is given, that string is appended to the table name. This lets us do things like:

>>> self.table_name()
feature_suffix
>>> self.table_name("temp")
feature_suffix_temp
>>> self.table_name("old")
feature_suffix_old

Returns

name of the table for this crawler_source

Return type

string
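The suffix behavior described above can be sketched as a standalone helper. This is a hypothetical, free-standing version; the real method builds the base name from the source's feature-table attributes:

```python
def table_name(base: str, *args: str) -> str:
    """Return the feature table name, optionally appending a suffix
    such as "temp" or "old" (standalone sketch of the
    CrawlerSource.table_name() behavior)."""
    if args:
        return f"{base}_{args[0]}"
    return base
```

With `base` set to `feature_suffix`, this reproduces the three examples shown above.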

nldi_crawler.source.download_geojson(source)

Downloads data from the specified source, saving it to a temporary file on local disk.

Parameters

source (CrawlerSource()) – The descriptor for the source.

Returns

path name to temporary file

Return type

str
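A minimal sketch of the download-to-temporary-file pattern, using `urllib` and `tempfile` directly. The function name and URL parameter are illustrative; the real function takes a CrawlerSource and reads the URL from its descriptor:

```python
import tempfile
import urllib.request


def download_to_tmp(url: str) -> str:
    """Stream the resource at `url` into a named temporary file and
    return the file's path (the caller is responsible for cleanup)."""
    with urllib.request.urlopen(url) as resp:
        with tempfile.NamedTemporaryFile(
            suffix=".geojson", delete=False
        ) as tmp:
            tmp.write(resp.read())
            return tmp.name
```

Returning the path (rather than an open handle) matches the documented contract: the caller receives a `str` it can hand to the ingestor and delete when finished.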

nldi_crawler.source.list_sources(dal, selector='')

Fetches a list of crawler sources from the master NLDI-DB database. The returned list holds one or more CrawlerSource() objects, which are reflected from the database using the SQLAlchemy ORM.

Parameters

dal (DataAccessLayer) – The data access layer holding connection details for the database

Returns

A list of sources

Return type

list of CrawlerSource objects

nldi_crawler.source.validate_src(src)

Examines a specified source to ensure that it downloads, and that the returned data is properly formatted and attributed.

Parameters

src (CrawlerSource) – the source to examine

Returns

a tuple of two values: A boolean to indicate if validated, and a string holding a description of the reason for failure. If validated is True, the reason string is zero-length.

Return type

tuple
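The two-value contract can be illustrated with a hypothetical checker that runs a sequence of named checks, returning `(True, "")` on success and `(False, reason)` on the first failure. The check names below are illustrative, not the actual checks validate_src() performs:

```python
def validate(checks) -> tuple:
    """Run (name, passed) checks in order; return (True, "") if all
    pass, else (False, reason) naming the first failure -- the same
    two-value contract validate_src() uses."""
    for name, passed in checks:
        if not passed:
            return (False, f"check failed: {name}")
    return (True, "")
```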

classes and functions surrounding database access

class nldi_crawler.db.DataAccessLayer(uri='postgresql+psycopg2://read_only_user@localhost:5432/nldi')

Abstraction layer to hold connection details for the data we want to access via the DB connection

Session()

Opens a sqlalchemy.orm.Session() using the engine defined at instantiation time.

connect()

Opens a connection to the database and sets self.engine as the handle for that connection.

disconnect()

Closes the open engine.
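The connect / Session / disconnect lifecycle can be sketched with stdlib sqlite3 standing in for the engine. This is an analogy only; the real class wraps sqlalchemy.create_engine() and sqlalchemy.orm.Session, and the method and attribute names below mirror the documented ones:

```python
import sqlite3


class DataAccessLayer:
    """Sketch of the lifecycle: hold a URI, open an engine on
    connect(), hand out sessions, and tear down on disconnect()."""

    def __init__(self, uri: str = ":memory:"):
        self.uri = uri
        self.engine = None

    def connect(self) -> None:
        # Real class: self.engine = sqlalchemy.create_engine(self.uri)
        self.engine = sqlite3.connect(self.uri)

    def session(self):
        if self.engine is None:
            raise RuntimeError("connect() must be called first")
        return self.engine.cursor()

    def disconnect(self) -> None:
        self.engine.close()
        self.engine = None
```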

class nldi_crawler.db.NLDI_Base

Base class used to create reflected ORM objects.

routines to manage the ingestion of crawler sources

class nldi_crawler.ingestor.StrippedString(*args, **kwargs)

Custom type to extend String. We use this to forcefully remove any non-printing characters from the input string. Some non-printables (including backspace and delete), if included in the String, can mess with the SQL submitted by the connection engine.

cache_ok: Optional[bool] = True

Indicate if statements using this ExternalType are “safe to cache”.

The default value None will emit a warning and then not allow caching of a statement which includes this type. Set to False to disable statements using this type from being cached at all without a warning. When set to True, the object’s class and selected elements from its state will be used as part of the cache key. For example, using a TypeDecorator:

class MyType(TypeDecorator):
    impl = String

    cache_ok = True

    def __init__(self, choices):
        self.choices = tuple(choices)
        self.internal_only = True

The cache key for the above type would be equivalent to:

>>> MyType(["a", "b", "c"])._static_cache_key
(<class '__main__.MyType'>, ('choices', ('a', 'b', 'c')))

The caching scheme will extract attributes from the type that correspond to the names of parameters in the __init__() method. Above, the “choices” attribute becomes part of the cache key but “internal_only” does not, because there is no parameter named “internal_only”.

The requirements for cacheable elements is that they are hashable and also that they indicate the same SQL rendered for expressions using this type every time for a given cache value.

To accommodate for datatypes that refer to unhashable structures such as dictionaries, sets and lists, these objects can be made “cacheable” by assigning hashable structures to the attributes whose names correspond with the names of the arguments. For example, a datatype which accepts a dictionary of lookup values may publish this as a sorted series of tuples. Given a previously un-cacheable type as:

class LookupType(UserDefinedType):
    '''a custom type that accepts a dictionary as a parameter.

    this is the non-cacheable version, as "self.lookup" is not
    hashable.

    '''

    def __init__(self, lookup):
        self.lookup = lookup

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ...  works with "self.lookup" ...

Where “lookup” is a dictionary. The type will not be able to generate a cache key:

>>> type_ = LookupType({"a": 10, "b": 20})
>>> type_._static_cache_key
<stdin>:1: SAWarning: UserDefinedType LookupType({'a': 10, 'b': 20}) will not
produce a cache key because the ``cache_ok`` flag is not set to True.
Set this flag to True if this type object's state is safe to use
in a cache key, or False to disable this warning.
symbol('no_cache')

If we did set up such a cache key, it wouldn’t be usable. We would get a tuple structure that contains a dictionary inside of it, which cannot itself be used as a key in a “cache dictionary” such as SQLAlchemy’s statement cache, since Python dictionaries aren’t hashable:

>>> # set cache_ok = True
>>> type_.cache_ok = True

>>> # this is the cache key it would generate
>>> key = type_._static_cache_key
>>> key
(<class '__main__.LookupType'>, ('lookup', {'a': 10, 'b': 20}))

>>> # however this key is not hashable, will fail when used with
>>> # SQLAlchemy statement cache
>>> some_cache = {key: "some sql value"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

The type may be made cacheable by assigning a sorted tuple of tuples to the “.lookup” attribute:

class LookupType(UserDefinedType):
    '''a custom type that accepts a dictionary as a parameter.

    The dictionary is stored both as itself in a private variable,
    and published in a public variable as a sorted tuple of tuples,
    which is hashable and will also return the same value for any
    two equivalent dictionaries.  Note it assumes the keys and
    values of the dictionary are themselves hashable.

    '''

    cache_ok = True

    def __init__(self, lookup):
        self._lookup = lookup

        # assume keys/values of "lookup" are hashable; otherwise
        # they would also need to be converted in some way here
        self.lookup = tuple(
            (key, lookup[key]) for key in sorted(lookup)
        )

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ...  works with "self._lookup" ...

Where above, the cache key for LookupType({"a": 10, "b": 20}) will be:

>>> LookupType({"a": 10, "b": 20})._static_cache_key
(<class '__main__.LookupType'>, ('lookup', (('a', 10), ('b', 20))))

New in version 1.4.14: - added the cache_ok flag to allow some configurability of caching for TypeDecorator classes.

New in version 1.4.28: - added the ExternalType mixin which generalizes the cache_ok flag to both the TypeDecorator and UserDefinedType classes.

See also

sql_caching

impl

alias of String

process_bind_param(value, dialect)

Receive a bound parameter value to be converted.

Custom subclasses of _types.TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.

The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.

Parameters
  • value – Data to operate upon, of any type expected by this method in the subclass. Can be None.

  • dialect – the Dialect in use.

See also

types_typedecorator

_types.TypeDecorator.process_result_value()
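For StrippedString, the conversion that process_bind_param() performs might look like the following sketch. The exact character classes the real implementation removes may differ; this version keeps only printable characters:

```python
def strip_nonprintable(value):
    """Drop non-printing characters (backspace, delete, etc.) that can
    corrupt SQL sent through the connection engine; None passes through
    unchanged, matching the bound-parameter contract."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch.isprintable())
```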

nldi_crawler.ingestor.create_tmp_table(dal, src)

This method of creating the temp table relies entirely on the PostgreSQL dialect of SQL to do the work. We could use SQLAlchemy mechanisms to achieve something similar, but this approach is quick and easy, and it avoids problems that arise if the created table is not truly identical to the features table it is modeled on. This will become important when we establish inheritance among tables later.
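In the PostgreSQL dialect, cloning a table's structure is a single statement. A hypothetical rendering of the DDL (the schema and table names here are illustrative, and the real function's SQL may differ in detail):

```python
def tmp_table_ddl(schema: str, table: str) -> str:
    """Build PostgreSQL DDL that clones `table`'s column definitions,
    defaults, and constraints into a _tmp staging table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {schema}.{table}_tmp "
        f"(LIKE {schema}.{table} INCLUDING ALL)"
    )
```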

nldi_crawler.ingestor.ingest_from_file(src, fname, dal)

Takes in a source dataset and processes it for insertion into the NLDI-DB feature table.

Parameters
  • src (CrawlerSource()) – The source to be ingested

  • fname (str) – The name of the local copy of the dataset.

  • dal (DataAccessLayer) – The data access layer used to reach the database.

Return type

int

nldi_crawler.ingestor.install_data(dal, src)

To ‘install’ the ingested data, we will manipulate table names and inheritance.

The data design has the various sources (named ‘feature_{suffix}’) INHERIT from the feature parent table. Queries against feature will return rows from any child that inherits from it.

The workflow here is to take the already-populated feature_{suffix}_tmp table and shuffle the table names:

  • Remove any stale feature_{suffix}_old table

  • Remove the inheritance relationship between feature and feature_{suffix}

  • Rename feature_{suffix} to feature_{suffix}_old

  • Rename feature_{suffix}_tmp to feature_{suffix}

  • Re-establish inheritance between feature and feature_{suffix}

  • Remove the feature_{suffix}_old table
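The shuffle above maps to a short sequence of PostgreSQL statements. A hypothetical rendering (the real function's SQL may differ in detail; schema and suffix are illustrative):

```python
def install_statements(schema: str, suffix: str) -> list:
    """Render the table-name shuffle described above as a sequence
    of PostgreSQL DDL statements."""
    tbl = f"{schema}.feature_{suffix}"
    parent = f"{schema}.feature"
    return [
        f"DROP TABLE IF EXISTS {tbl}_old",                     # clear stale _old
        f"ALTER TABLE {tbl} NO INHERIT {parent}",              # detach from parent
        f"ALTER TABLE {tbl} RENAME TO feature_{suffix}_old",   # current -> _old
        f"ALTER TABLE {tbl}_tmp RENAME TO feature_{suffix}",   # _tmp -> current
        f"ALTER TABLE {tbl} INHERIT {parent}",                 # re-attach to parent
        f"DROP TABLE IF EXISTS {tbl}_old",                     # drop the _old copy
    ]
```

Note that PostgreSQL's `ALTER TABLE ... RENAME TO` takes an unqualified new name, since a rename cannot move a table between schemas.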

Command Line Interface for launching the NLDI web crawler.