Function Reference

routines to manage the table of crawler_sources

class nldi_crawler.source.CrawlerSource(**kwargs)

An ORM reflection of the crawler_source table

The crawler_source table is held in the nldi_data schema in the NLDI PostGIS database. The schema name and table name are hard-coded to reflect this.

This object maps properties to columns for a given row of that table. Once this object is created, the row’s data is instantiated within the object.

>>> stmt = (
...     select(CrawlerSource)
...     .order_by(CrawlerSource.crawler_source_id)
...     .where(CrawlerSource.crawler_source_id == 1)
... )
>>> for src in session.scalars(stmt):
...     print(f"{src.crawler_source_id} == {src.source_name}")

table_name(*args)

Getter-like function to return a formatted string representing the table name.

If an optional positional argument is given, that string is appended to the table name. This lets us do things like:

>>> self.table_name()
feature_suffix
>>> self.table_name("temp")
feature_suffix_temp
>>> self.table_name("old")
feature_suffix_old

Returns

name of the table for this crawler_source

Return type

string
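The suffix behavior described above can be sketched as a standalone helper. This is a hypothetical, free-standing version; the real method builds the base name from the source's feature-table attributes:

```python
def table_name(base: str, *args: str) -> str:
    """Return the feature table name, optionally appending a suffix
    such as "temp" or "old" (standalone sketch of the
    CrawlerSource.table_name() behavior)."""
    if args:
        return f"{base}_{args[0]}"
    return base
```

With `base` set to `feature_suffix`, this reproduces the three examples shown above.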

nldi_crawler.source.download_geojson(source)

Downloads data from the specified source, saving it to a temporary file on local disk.

Parameters

source (CrawlerSource()) – The descriptor for the source.

Returns

path name to temporary file

Return type

str
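A minimal sketch of the download-to-temporary-file pattern, using `urllib` and `tempfile` directly. The function name and URL parameter are illustrative; the real function takes a CrawlerSource and reads the URL from its descriptor:

```python
import tempfile
import urllib.request


def download_to_tmp(url: str) -> str:
    """Stream the resource at `url` into a named temporary file and
    return the file's path (the caller is responsible for cleanup)."""
    with urllib.request.urlopen(url) as resp:
        with tempfile.NamedTemporaryFile(
            suffix=".geojson", delete=False
        ) as tmp:
            tmp.write(resp.read())
            return tmp.name
```

Returning the path (rather than an open handle) matches the documented contract: the caller receives a `str` it can hand to the ingestor and delete when finished.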

nldi_crawler.source.list_sources(dal, selector='')

Fetches a list of crawler sources from the master NLDI-DB database. The returned list holds one or more CrawlerSource() objects, which are reflected from the database using the SQLAlchemy ORM.

Parameters

dal (DataAccessLayer) – The data access layer holding connection details for the database

Returns

A list of sources

Return type

list of CrawlerSource objects

nldi_crawler.source.validate_src(src)

Examines a specified source to ensure that it downloads, and that the returned data is properly formatted and attributed.

Parameters

src (CrawlerSource) – the source to examine

Returns

a tuple of two values: A boolean to indicate if validated, and a string holding a description of the reason for failure. If validated is True, the reason string is zero-length.

Return type

tuple
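The two-value contract can be illustrated with a hypothetical checker that runs a sequence of named checks, returning `(True, "")` on success and `(False, reason)` on the first failure. The check names below are illustrative, not the actual checks validate_src() performs:

```python
def validate(checks) -> tuple:
    """Run (name, passed) checks in order; return (True, "") if all
    pass, else (False, reason) naming the first failure -- the same
    two-value contract validate_src() uses."""
    for name, passed in checks:
        if not passed:
            return (False, f"check failed: {name}")
    return (True, "")
```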

classes and functions surrounding database access

class nldi_crawler.db.DataAccessLayer(uri='postgresql+psycopg2://read_only_user@localhost:5432/nldi')

Abstraction layer to hold connection details for the data we want to access via the DB connection

Session()

Opens a sqlalchemy.orm.Session() using the engine defined at instantiation time.

connect()

Opens a connection to the database and sets self.engine as the handle for that connection.

disconnect()

Closes the open engine.
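The connect / Session / disconnect lifecycle can be sketched with stdlib sqlite3 standing in for the engine. This is an analogy only; the real class wraps sqlalchemy.create_engine() and sqlalchemy.orm.Session, and the method and attribute names below mirror the documented ones:

```python
import sqlite3


class DataAccessLayer:
    """Sketch of the lifecycle: hold a URI, open an engine on
    connect(), hand out sessions, and tear down on disconnect()."""

    def __init__(self, uri: str = ":memory:"):
        self.uri = uri
        self.engine = None

    def connect(self) -> None:
        # Real class: self.engine = sqlalchemy.create_engine(self.uri)
        self.engine = sqlite3.connect(self.uri)

    def session(self):
        if self.engine is None:
            raise RuntimeError("connect() must be called first")
        return self.engine.cursor()

    def disconnect(self) -> None:
        self.engine.close()
        self.engine = None
```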

class nldi_crawler.db.NLDI_Base

Base class used to create reflected ORM objects.

routines to manage the ingestion of crawler sources

class nldi_crawler.ingestor.StrippedString(*args, **kwargs)

Custom type to extend String. We use this to forcefully remove any non-printing characters from the input string. Some non-printables (including backspace and delete), if included in the String, can mess with the SQL submitted by the connection engine.

cache_ok: Optional[bool] = True

Indicate if statements using this ExternalType are “safe to cache”.

The default value None will emit a warning and then not allow caching of a statement which includes this type. Set to False to disable statements using this type from being cached at all without a warning. When set to True, the object’s class and selected elements from its state will be used as part of the cache key. For example, using a TypeDecorator:

class MyType(TypeDecorator):
    impl = String

    cache_ok = True

    def __init__(self, choices):
        self.choices = tuple(choices)
        self.internal_only = True

The cache key for the above type would be equivalent to:

>>> MyType(["a", "b", "c"])._static_cache_key
(<class '__main__.MyType'>, ('choices', ('a', 'b', 'c')))

The caching scheme will extract attributes from the type that correspond to the names of parameters in the __init__() method. Above, the “choices” attribute becomes part of the cache key but “internal_only” does not, because there is no parameter named “internal_only”.

The requirements for cacheable elements is that they are hashable and also that they indicate the same SQL rendered for expressions using this type every time for a given cache value.

To accommodate for datatypes that refer to unhashable structures such as dictionaries, sets and lists, these objects can be made “cacheable” by assigning hashable structures to the attributes whose names correspond with the names of the arguments. For example, a datatype which accepts a dictionary of lookup values may publish this as a sorted series of tuples. Given a previously un-cacheable type as:

class LookupType(UserDefinedType):
    '''a custom type that accepts a dictionary as a parameter.

    this is the non-cacheable version, as "self.lookup" is not
    hashable.

    '''

    def __init__(self, lookup):
        self.lookup = lookup

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ...  works with "self.lookup" ...

Where “lookup” is a dictionary. The type will not be able to generate a cache key:

>>> type_ = LookupType({"a": 10, "b": 20})
>>> type_._static_cache_key
<stdin>:1: SAWarning: UserDefinedType LookupType({'a': 10, 'b': 20}) will not
produce a cache key because the ``cache_ok`` flag is not set to True.
Set this flag to True if this type object's state is safe to use
in a cache key, or False to disable this warning.
symbol('no_cache')

If we did set up such a cache key, it wouldn’t be usable. We would get a tuple structure that contains a dictionary inside of it, which cannot itself be used as a key in a “cache dictionary” such as SQLAlchemy’s statement cache, since Python dictionaries aren’t hashable:

>>> # set cache_ok = True
>>> type_.cache_ok = True

>>> # this is the cache key it would generate
>>> key = type_._static_cache_key
>>> key
(<class '__main__.LookupType'>, ('lookup', {'a': 10, 'b': 20}))

>>> # however this key is not hashable, will fail when used with
>>> # SQLAlchemy statement cache
>>> some_cache = {key: "some sql value"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

The type may be made cacheable by assigning a sorted tuple of tuples to the “.lookup” attribute:

class LookupType(UserDefinedType):
    '''a custom type that accepts a dictionary as a parameter.

    The dictionary is stored both as itself in a private variable,
    and published in a public variable as a sorted tuple of tuples,
    which is hashable and will also return the same value for any
    two equivalent dictionaries.  Note it assumes the keys and
    values of the dictionary are themselves hashable.

    '''

    cache_ok = True

    def __init__(self, lookup):
        self._lookup = lookup

        # assume keys/values of "lookup" are hashable; otherwise
        # they would also need to be converted in some way here
        self.lookup = tuple(
            (key, lookup[key]) for key in sorted(lookup)
        )

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ...  works with "self._lookup" ...

Where above, the cache key for LookupType({"a": 10, "b": 20}) will be:

>>> LookupType({"a": 10, "b": 20})._static_cache_key
(<class '__main__.LookupType'>, ('lookup', (('a', 10), ('b', 20))))

New in version 1.4.14: - added the cache_ok flag to allow some configurability of caching for TypeDecorator classes.

New in version 1.4.28: - added the ExternalType mixin which generalizes the cache_ok flag to both the TypeDecorator and UserDefinedType classes.

See also

sql_caching

impl

alias of String

process_bind_param(value, dialect)

Receive a bound parameter value to be converted.

Custom subclasses of _types.TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.

The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.

Parameters
  • value – Data to operate upon, of any type expected by this method in the subclass. Can be None.

  • dialect – the Dialect in use.

See also

types_typedecorator

_types.TypeDecorator.process_result_value()
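For StrippedString, the conversion that process_bind_param() performs might look like the following sketch. The exact character classes the real implementation removes may differ; this version keeps only printable characters:

```python
def strip_nonprintable(value):
    """Drop non-printing characters (backspace, delete, etc.) that can
    corrupt SQL sent through the connection engine; None passes through
    unchanged, matching the bound-parameter contract."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch.isprintable())
```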

nldi_crawler.ingestor.create_tmp_table(dal, src)

This method of creating the temp table relies entirely on the PostgreSQL dialect of SQL to do the work. We could use SQLAlchemy mechanisms to achieve something similar, but this approach is quick and easy, and it avoids problems that arise if the created table is not truly identical to the features table it is modeled on. This will become important when we establish inheritance among tables later.
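In the PostgreSQL dialect, cloning a table's structure is a single statement. A hypothetical rendering of the DDL (the schema and table names here are illustrative, and the real function's SQL may differ in detail):

```python
def tmp_table_ddl(schema: str, table: str) -> str:
    """Build PostgreSQL DDL that clones `table`'s column definitions,
    defaults, and constraints into a _tmp staging table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {schema}.{table}_tmp "
        f"(LIKE {schema}.{table} INCLUDING ALL)"
    )
```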

nldi_crawler.ingestor.ingest_from_file(src, fname, dal)

Takes in a source dataset and processes it for insertion into the NLDI-DB feature table.

Parameters
  • src (CrawlerSource()) – The source to be ingested

  • fname (str) – The name of the local copy of the dataset.

  • dal (DataAccessLayer) – The data access layer used to reach the database.

Return type

int

nldi_crawler.ingestor.install_data(dal, src)

To ‘install’ the ingested data, we will manipulate table names and inheritance.

The data design has the various sources (named ‘feature_{suffix}’) INHERIT from the feature parent table. Queries against feature will return rows from any child that inherits from it.

The workflow here is to take the already-populated feature_{suffix}_tmp table and shuffle the table names:

  • Remove any stale feature_{suffix}_old table

  • Remove the inheritance relationship between feature and feature_{suffix}

  • Rename feature_{suffix} to feature_{suffix}_old

  • Rename feature_{suffix}_tmp to feature_{suffix}

  • Re-establish inheritance between feature and feature_{suffix}

  • Remove the feature_{suffix}_old table
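The shuffle above maps to a short sequence of PostgreSQL statements. A hypothetical rendering (the real function's SQL may differ in detail; schema and suffix are illustrative):

```python
def install_statements(schema: str, suffix: str) -> list:
    """Render the table-name shuffle described above as a sequence
    of PostgreSQL DDL statements."""
    tbl = f"{schema}.feature_{suffix}"
    parent = f"{schema}.feature"
    return [
        f"DROP TABLE IF EXISTS {tbl}_old",                     # clear stale _old
        f"ALTER TABLE {tbl} NO INHERIT {parent}",              # detach from parent
        f"ALTER TABLE {tbl} RENAME TO feature_{suffix}_old",   # current -> _old
        f"ALTER TABLE {tbl}_tmp RENAME TO feature_{suffix}",   # _tmp -> current
        f"ALTER TABLE {tbl} INHERIT {parent}",                 # re-attach to parent
        f"DROP TABLE IF EXISTS {tbl}_old",                     # drop the _old copy
    ]
```

Note that PostgreSQL's `ALTER TABLE ... RENAME TO` takes an unqualified new name, since a rename cannot move a table between schemas.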

Command Line Interface for launching the NLDI web crawler.