Function Reference
routines to manage the table of crawler_sources
- class nldi_crawler.source.CrawlerSource(**kwargs)
An ORM reflection of the crawler_source table
The crawler_source table is held in the nldi_data schema in the NLDI PostGIS database. The schema name and table name are hard-coded to reflect this.
This object maps properties to columns for a given row of that table. Once this object is created, the row’s data is instantiated within the object.
>>> stmt = (
...     select(CrawlerSource)
...     .order_by(CrawlerSource.crawler_source_id)
...     .where(CrawlerSource.crawler_source_id == 1)
... )
>>> for src in session.scalars(stmt):
...     print(f"{src.crawler_source_id} == {src.source_name}")
- table_name(*args)
Getter-like function to return a formatted string representing the table name.
If an optional positional argument is given, that string is appended to the table name. This lets us do things like:
>>> self.table_name()        # feature_suffix
>>> self.table_name("temp")  # feature_suffix_temp
>>> self.table_name("old")   # feature_suffix_old
- Returns
name of the table for this crawler_source
- Return type
string
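The suffix-appending behavior described above can be sketched as follows. The class and attribute names here are illustrative stand-ins, not the actual implementation; the real CrawlerSource is an ORM model whose suffix comes from the crawler_source table.

```python
class CrawlerSourceSketch:
    """Hypothetical stand-in for CrawlerSource's table_name() logic."""

    def __init__(self, feature_table_suffix):
        self.feature_table_suffix = feature_table_suffix

    def table_name(self, *args):
        """Return the feature table name, with an optional suffix appended."""
        name = f"feature_{self.feature_table_suffix}"
        if args:
            name = f"{name}_{args[0]}"
        return name


src = CrawlerSourceSketch("wqp")
print(src.table_name())        # feature_wqp
print(src.table_name("temp"))  # feature_wqp_temp
print(src.table_name("old"))   # feature_wqp_old
```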
- nldi_crawler.source.download_geojson(source)
Downloads data from the specified source, saving it to a temporary file on local disk.
- Parameters
source (CrawlerSource()) – The descriptor for the source.
- Returns
path name to temporary file
- Return type
str
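A plausible sketch of what download_geojson() does, assuming a plain HTTP fetch spooled to disk; the real function may stream, retry, or validate along the way. The `opener` parameter is added here only so the sketch can be exercised without a network connection.

```python
import tempfile
from urllib.request import urlopen


def download_geojson_sketch(url, opener=urlopen):
    """Fetch a source's GeoJSON payload and save it to a named
    temporary file on local disk, returning the file's path."""
    with opener(url) as response, tempfile.NamedTemporaryFile(
        mode="wb", suffix=".geojson", delete=False
    ) as tmp:
        tmp.write(response.read())
        return tmp.name
```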
- nldi_crawler.source.list_sources(dal, selector='')
Fetches a list of crawler sources from the master NLDI-DB database. The returned list holds one or more CrawlerSource() objects, which are reflected from the database using the SQLAlchemy ORM.
- Parameters
dal (DataAccessLayer) – The data access layer used to connect to the database
- Returns
A list of sources
- Return type
list of CrawlerSource objects
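The fetch-and-reflect pattern can be illustrated with the standard-library sqlite3 module standing in for the PostGIS connection and the ORM. Everything here is a simplified analogue, and the selector-as-source-id filtering is an assumption about list_sources' selector semantics.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SourceRow:
    """Stand-in for the reflected CrawlerSource ORM object."""
    crawler_source_id: int
    source_name: str


def list_sources_sketch(conn, selector=""):
    """Fetch crawler sources, optionally filtering by id; the real
    selector semantics may differ."""
    sql = "SELECT crawler_source_id, source_name FROM crawler_source"
    params = ()
    if selector:
        sql += " WHERE crawler_source_id = ?"
        params = (int(selector),)
    return [SourceRow(*row) for row in conn.execute(sql, params)]
```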
- nldi_crawler.source.validate_src(src)
Examines a specified source to ensure that it downloads, and that the returned data is properly formatted and attributed.
- Parameters
src (CrawlerSource) – the source to examine
- Returns
a tuple of two values: a boolean indicating whether the source validated, and a string describing the reason for failure. If the boolean is True, the reason string is zero-length.
- Return type
tuple
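The (ok, reason) contract can be sketched with a minimal check of the downloaded payload. This is only an illustration of the tuple shape; the real validate_src performs richer checks (attribution, required fields) than the two shown here.

```python
import json


def validate_geojson_sketch(text):
    """Return (True, "") if the payload is a GeoJSON FeatureCollection,
    else (False, reason). A minimal stand-in for validate_src's checks."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return (False, f"not valid JSON: {err}")
    if doc.get("type") != "FeatureCollection":
        return (False, "root object is not a GeoJSON FeatureCollection")
    return (True, "")
```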
classes and functions surrounding database access
- class nldi_crawler.db.DataAccessLayer(uri=postgresql+psycopg2://read_only_user@localhost:5432/nldi)
Abstraction layer to hold connection details for the data we want to access via the DB connection
- Session()
Opens a sqlalchemy.orm.Session() using the engine defined at instantiation time.
- connect()
Opens a connection to the database, setting self.engine as the way to access that connection.
- disconnect()
Closes the open engine.
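The connect/Session/disconnect life cycle can be illustrated with a stdlib sqlite3 analogue. This is not the real class, which wraps a SQLAlchemy engine and ORM sessions; the shape of the API is the point.

```python
import sqlite3


class DataAccessLayerSketch:
    """stdlib analogue of DataAccessLayer's connect/Session/disconnect
    life cycle, using sqlite3 in place of a SQLAlchemy engine."""

    def __init__(self, uri=":memory:"):
        self.uri = uri
        self.engine = None

    def connect(self):
        """Open the connection and hold it as self.engine."""
        self.engine = sqlite3.connect(self.uri)

    def session(self):
        """Hand back a cursor, analogous to opening an ORM Session."""
        if self.engine is None:
            raise RuntimeError("call connect() before opening a session")
        return self.engine.cursor()

    def disconnect(self):
        """Close the open engine."""
        if self.engine is not None:
            self.engine.close()
            self.engine = None
```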
- class nldi_crawler.db.NLDI_Base
Base class used to create reflected ORM objects.
routines to manage the ingestion of crawler sources
- class nldi_crawler.ingestor.StrippedString(*args, **kwargs)
Custom type to extend String. We use this to forcefully remove any non-printing characters from the input string. Some non-printables (including backspace and delete), if included in the String, can mess with the SQL submitted by the connection engine.
- cache_ok: Optional[bool] = True
Indicate if statements using this ExternalType are "safe to cache".
The default value None will emit a warning and then not allow caching of a statement which includes this type. Set to False to disable statements using this type from being cached at all without a warning. When set to True, the object's class and selected elements from its state will be used as part of the cache key. For example, using a TypeDecorator:

class MyType(TypeDecorator):
    impl = String
    cache_ok = True

    def __init__(self, choices):
        self.choices = tuple(choices)
        self.internal_only = True
The cache key for the above type would be equivalent to:
>>> MyType(["a", "b", "c"])._static_cache_key
(<class '__main__.MyType'>, ('choices', ('a', 'b', 'c')))
The caching scheme will extract attributes from the type that correspond to the names of parameters in the __init__() method. Above, the "choices" attribute becomes part of the cache key but "internal_only" does not, because there is no parameter named "internal_only".
The requirement for cacheable elements is that they are hashable and that they indicate the same SQL rendered for expressions using this type every time for a given cache value.
To accommodate for datatypes that refer to unhashable structures such as dictionaries, sets and lists, these objects can be made “cacheable” by assigning hashable structures to the attributes whose names correspond with the names of the arguments. For example, a datatype which accepts a dictionary of lookup values may publish this as a sorted series of tuples. Given a previously un-cacheable type as:
class LookupType(UserDefinedType):
    """a custom type that accepts a dictionary as a parameter.

    this is the non-cacheable version, as "self.lookup" is not
    hashable.
    """

    def __init__(self, lookup):
        self.lookup = lookup

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ... works with "self.lookup" ...
Where “lookup” is a dictionary. The type will not be able to generate a cache key:
>>> type_ = LookupType({"a": 10, "b": 20})
>>> type_._static_cache_key
<stdin>:1: SAWarning: UserDefinedType LookupType({'a': 10, 'b': 20}) will not produce a cache key because the ``cache_ok`` flag is not set to True. Set this flag to True if this type object's state is safe to use in a cache key, or False to disable this warning.
symbol('no_cache')
If we did set up such a cache key, it wouldn’t be usable. We would get a tuple structure that contains a dictionary inside of it, which cannot itself be used as a key in a “cache dictionary” such as SQLAlchemy’s statement cache, since Python dictionaries aren’t hashable:
>>> # set cache_ok = True
>>> type_.cache_ok = True
>>> # this is the cache key it would generate
>>> key = type_._static_cache_key
>>> key
(<class '__main__.LookupType'>, ('lookup', {'a': 10, 'b': 20}))
>>> # however this key is not hashable, will fail when used with
>>> # SQLAlchemy statement cache
>>> some_cache = {key: "some sql value"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
The type may be made cacheable by assigning a sorted tuple of tuples to the “.lookup” attribute:
class LookupType(UserDefinedType):
    """a custom type that accepts a dictionary as a parameter.

    The dictionary is stored both as itself in a private variable,
    and published in a public variable as a sorted tuple of tuples,
    which is hashable and will also return the same value for any
    two equivalent dictionaries. Note it assumes the keys and values
    of the dictionary are themselves hashable.
    """

    cache_ok = True

    def __init__(self, lookup):
        self._lookup = lookup

        # assume keys/values of "lookup" are hashable; otherwise
        # they would also need to be converted in some way here
        self.lookup = tuple(
            (key, lookup[key]) for key in sorted(lookup)
        )

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect):
        # ... works with "self._lookup" ...
Where above, the cache key for LookupType({"a": 10, "b": 20}) will be:

>>> LookupType({"a": 10, "b": 20})._static_cache_key
(<class '__main__.LookupType'>, ('lookup', (('a', 10), ('b', 20))))
New in version 1.4.14: added the cache_ok flag to allow some configurability of caching for TypeDecorator classes.
New in version 1.4.28: added the ExternalType mixin which generalizes the cache_ok flag to both the TypeDecorator and UserDefinedType classes.
See also
sql_caching
- impl
alias of String
- process_bind_param(value, dialect)
Receive a bound parameter value to be converted.
Custom subclasses of _types.TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.
The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.
- Parameters
value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.
See also
types_typedecorator
_types.TypeDecorator.process_result_value()
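The stripping that StrippedString applies in process_bind_param can be sketched as a plain function. This is an assumption about the implementation, using Python's str.isprintable() as the filter; the real class may draw the line between printing and non-printing characters differently.

```python
def strip_nonprinting(value):
    """Drop characters Python does not consider printable (backspace,
    delete, and other control characters) before the value is bound
    into a SQL statement. None passes through unchanged."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch.isprintable())


print(strip_nonprinting("bad\x08\x7fvalue"))  # badvalue
```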
- nldi_crawler.ingestor.create_tmp_table(dal, src)
This method of creating the temp table relies entirely on the PostgreSQL dialect of SQL to do the work. We could use SQLAlchemy mechanisms to achieve something similar, but this approach is quick and easy, and it avoids problems that would arise if the created table were not truly identical to the features table it is modeled on. This will become important when we establish inheritance among tables later.
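One plausible rendering of that Postgres-dialect DDL, assuming a LIKE-based copy of the parent feature table; the actual statement the crawler issues may differ.

```python
def create_tmp_table_sql(suffix, schema="nldi_data"):
    """Build hypothetical DDL for the temp table, copying the column
    definitions of the parent feature table via LIKE ... INCLUDING ALL."""
    return (
        f"CREATE TABLE IF NOT EXISTS {schema}.feature_{suffix}_tmp "
        f"(LIKE {schema}.feature INCLUDING ALL)"
    )
```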
- nldi_crawler.ingestor.ingest_from_file(src, fname, dal)
Takes in a source dataset and processes it for insertion into the NLDI-DB feature table
- Parameters
src (CrawlerSource()) – The source to be ingested
fname (str) – The name of the local copy of the dataset.
dal (DataAccessLayer) – The data access layer for the target database.
- Return type
int
- nldi_crawler.ingestor.install_data(dal, src)
To ‘install’ the ingested data, we will manipulate table names and inheritance.
The data design has the various sources (named ‘feature_{suffix}’) INHERIT from the feature parent table. Queries against feature will return rows from any child that inherits from it.
The workflow here is to take the already-populated feature_{suffix}_tmp table and shuffle the table names:
1. Remove feature_{suffix}_old
2. Remove the inheritance relationship between feature and feature_{suffix}
3. Rename feature_{suffix} to feature_{suffix}_old
4. Rename feature_{suffix}_tmp to feature_{suffix}
5. Re-establish inheritance between feature and feature_{suffix}
6. Remove the feature_{suffix}_old table
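The table-shuffle steps above could be rendered as PostgreSQL DDL roughly like the following. This is a hypothetical sketch, not the crawler's actual statements; the schema name and naming pattern are taken from the descriptions above.

```python
def install_ddl(suffix, schema="nldi_data"):
    """Render the six shuffle steps as a list of PostgreSQL DDL
    statements for the given source suffix."""
    cur = f"{schema}.feature_{suffix}"
    return [
        # clear out any leftover _old table from a prior run
        f"DROP TABLE IF EXISTS {cur}_old",
        # detach the current table from the parent feature table
        f"ALTER TABLE {cur} NO INHERIT {schema}.feature",
        # shuffle the names: current -> _old, _tmp -> current
        f"ALTER TABLE {cur} RENAME TO feature_{suffix}_old",
        f"ALTER TABLE {cur}_tmp RENAME TO feature_{suffix}",
        # re-attach the (newly renamed) table to the parent
        f"ALTER TABLE {cur} INHERIT {schema}.feature",
        # drop the superseded data
        f"DROP TABLE IF EXISTS {cur}_old",
    ]
```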
Command Line Interface for launching the NLDI web crawler.