By Adam McCune In Recipes
Acquire
Download one or more files from a remote server and store the files in your project recipes so they can be used as data ingredients.
Acquire capabilities include retrying while waiting for files to become available on the server, downloading multiple files, archiving files locally and remotely, and filtering by time range and filename pattern.
Acquire is typically one of the first instructions in a data recipe. Acquire is great for retrieving the latest data from a remote system before your recipe proceeds with the rest of the instructions.
The most commonly used type of Acquire is SFTP (SSH File Transfer Protocol, also called Secure File Transfer Protocol), but other types exist, including queries to REST and SOAP APIs (application programming interfaces).
Specifics
Running Acquire in Flatmap
The Flatmap command is ‘acquire’ or ‘ac’ for short. Commands can be specified on the command line or in the interactive console.
From the command line, in the root directory of a project, you can give a single command in three parts: ‘fm’ for Flatmap, then the recipe directory inside the project (for example, ‘data-demographic’), then the command (for example, ‘acquire’ or ‘ac’), as in ‘fm data-demographic acquire’. Alternatively, give only the first two parts (for example, ‘fm data-demographic’) to launch the Flatmap interactive console, which marks the beginning of every line with ‘fm>’. There you can type one or more commands at a time, such as ‘acquire’ or ‘acquire;attach’. To return to the command line, use the Flatmap command ‘exit’.
When the Acquire command is received, Flatmap scans the data recipe for any files matching the pattern Acquire(.*).conf, then executes the matching Acquire instructions in alphanumeric order by filename.
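The scan-and-order step can be pictured with a short Python sketch. This is only an illustration, not Flatmap's implementation: it assumes the pattern is applied to the whole filename and that "alphanumeric order" is a plain string sort. The function name `acquire_configs` and the filename "Acquire Rest.conf" are hypothetical.

```python
import re

# Pattern quoted in the text above; the dot before "conf" is left
# unescaped, exactly as the documentation writes it.
ACQUIRE_PATTERN = re.compile(r"Acquire(.*).conf")


def acquire_configs(filenames):
    """Return the filenames matching the Acquire pattern, sorted by name.

    Assumption: plain string sorting stands in for "alphanumeric order".
    """
    matches = [name for name in filenames if ACQUIRE_PATTERN.fullmatch(name)]
    return sorted(matches)


# "Acquire Rest.conf" is a made-up filename used only for this example.
files = ["Attach.conf", "Acquire Sftp.conf", "Acquire Rest.conf", "notes.txt"]
print(acquire_configs(files))
```

With these names, only the two Acquire files are kept, and "Acquire Rest.conf" sorts ahead of "Acquire Sftp.conf".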
Multiple Acquire Configurations
If you have multiple Acquire instructions in a recipe, you may need to specify which one Flatmap should run, so that you can run them at different times or in a particular sequence.
To specify which Acquire instruction to run, add an ‘e=’ (endpoint) tag to its filename and use the ‘e=’ command in Flatmap. When the e= endpoint is set, Flatmap only runs Acquire configurations that match the endpoint. For example, if you have two Acquire instructions with the filenames “Acquire Sftp e=test.conf” and “Acquire Sftp e=production.conf”, the command sequence “e=test;acquire” runs only the first instruction.
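As a rough illustration of endpoint selection, the Python sketch below filters configuration filenames by their ‘e=’ tag. How Flatmap actually matches the endpoint is not documented here; the sketch assumes the tag is a whitespace-separated token in the filename, and the function name `select_by_endpoint` is hypothetical.

```python
def select_by_endpoint(filenames, endpoint):
    """Keep only the config filenames tagged with e=<endpoint>.

    Assumption: the tag appears as its own space-separated token,
    as in "Acquire Sftp e=test.conf".
    """
    tag = f"e={endpoint}"
    selected = []
    for name in filenames:
        # Drop the ".conf" extension before splitting into tokens.
        stem = name[: -len(".conf")] if name.endswith(".conf") else name
        if tag in stem.split():
            selected.append(name)
    return selected


configs = ["Acquire Sftp e=test.conf", "Acquire Sftp e=production.conf"]
print(select_by_endpoint(configs, "test"))
```

Matching whole tokens rather than substrings keeps an endpoint such as "test" from also selecting a file tagged "e=test2".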
Configuration file
Acquire instructions exist in a configuration file matching the pattern: Acquire(.*).conf
The filename follows a pattern made up of the name of the instruction (“Acquire”), the method of remote access (e.g., “Sftp”), and, optionally, an additional detail such as the server used for remote access (or, as noted above, the endpoint used to select which Acquire instruction to run). For example, a configuration file might be named one of the following:
- “Acquire Sftp.conf”
- “Acquire Sftp Albertlogin.conf”
Note: the configuration file uses the following types of data:
- “Strings” are sequences of characters marked in quotation marks, typically consisting of brief text.
- “Integers” are whole numbers.
- “Booleans” are true or false.
- “Amount of time” fields consist of an integer, a period, and a unit of time (e.g., 5.minutes or 4.hours).
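To make the “amount of time” format concrete, here is a minimal Python sketch that turns a value such as 5.minutes into a duration. It is not Flatmap's parser. Only minutes and hours appear in this document, so the seconds and days units accepted below are assumptions, and the function name `parse_amount` is hypothetical.

```python
import re
from datetime import timedelta

# An integer, a period, then a unit. Units beyond minutes and hours
# are assumptions added for this sketch.
_AMOUNT = re.compile(r"(\d+)\.(seconds?|minutes?|hours?|days?)")


def parse_amount(text):
    """Convert a value like "5.minutes" into a timedelta."""
    match = _AMOUNT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an amount of time: {text!r}")
    count = int(match.group(1))
    unit = match.group(2)
    if not unit.endswith("s"):
        unit += "s"  # timedelta expects plural keyword names
    return timedelta(**{unit: count})


print(parse_amount("5.minutes"))
print(parse_amount("4.hours"))
```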
The configuration file can determine the following.
Connecting to the remote source (server) which the file comes from
Under the main heading (e.g., “acquire.sftp”)
- hostname (string): the server’s domain name (e.g., “sftp.albertlogin.us”)
- username (string): the username for access to the server
- password (string): the password for access to the server
- port (integer): the port for access to the server; if none is specified, the default port is used
acquire.sftp {
  hostname = "sftp.albertlogin.us"
  username = ""
  password = ""
  port = 2143
}
Finding the file on the server
Under the “rsyncs” subheading
- remote-path (string): the directory on the server where the file can be found
- file-regex (string): a regular expression indicating the pattern for the names of the files that should be acquired (e.g., “gitInbound.+[.]xlsx”, meaning any filename containing “gitInbound” followed by one or more characters and the Excel extension “.xlsx”)
- latest-only (boolean): whether to acquire all files matching “file-regex” (if “latest-only” is false) or only the latest or most recent matching file (if “latest-only” is true)
acquire.sftp {
  rsyncs : [
    {
      remote-path = "attach/test_brainps"
      file-regex = "Inbound.+[.]xlsx"
      latest-only = true
    }
  ]
}
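The interaction between “file-regex” and “latest-only” can be sketched in Python as follows. This is an illustration under stated assumptions, not Flatmap's code: it assumes the regular expression may match anywhere in the filename and that "latest" means the most recent modification time. The function name `select_files` and the sample listing are hypothetical.

```python
import re


def select_files(listing, file_regex, latest_only):
    """Pick filenames from a remote directory listing.

    listing is a list of (filename, modified_timestamp) pairs.
    Assumptions: the regex may match anywhere in the name, and
    "latest" means the largest modification timestamp.
    """
    pattern = re.compile(file_regex)
    matches = [(name, mtime) for name, mtime in listing if pattern.search(name)]
    if latest_only and matches:
        newest = max(matches, key=lambda item: item[1])
        return [newest[0]]
    return [name for name, _ in matches]


# Made-up listing used only for this example.
listing = [
    ("Inbound_2021-06-13.xlsx", 100),
    ("Inbound_2021-06-14.xlsx", 200),
    ("readme.txt", 300),
]
print(select_files(listing, "Inbound.+[.]xlsx", latest_only=True))
print(select_files(listing, "Inbound.+[.]xlsx", latest_only=False))
```

With “latest-only” set to true, only the newer of the two matching spreadsheets is returned; with false, both are.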
Time to spend looking for the file on the server
Under the “retry” heading
- frequency (amount of time): the amount of time to wait between attempts to find the file
- duration (amount of time): the total amount of time to continue looking for the file
(For example, if “frequency” is 5.minutes and “duration” is 4.hours then Flatmap will look for the file every five minutes for four hours.)
acquire.sftp {
  retry {
    frequency = 5.minutes
    duration = 4.hours
  }
}
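The retry behavior described above amounts to a polling loop. The Python sketch below shows one way such a loop could work. It is not Flatmap's implementation, and it assumes the first check happens immediately and that no attempt is made after the duration has elapsed. The function name `wait_for_file` and its parameters are hypothetical.

```python
import time


def wait_for_file(check, frequency_s, duration_s,
                  sleep=time.sleep, clock=time.monotonic):
    """Call check() every frequency_s seconds for up to duration_s seconds.

    Returns True as soon as check() succeeds, or False once the next
    attempt would fall after the deadline. sleep and clock can be
    replaced, which makes the loop easy to try out without waiting.
    """
    deadline = clock() + duration_s
    while True:
        if check():
            return True
        if clock() + frequency_s > deadline:
            return False
        sleep(frequency_s)


# Example with a simulated clock: the file "appears" on the third attempt.
now = [0.0]
attempts = []


def fake_check():
    attempts.append(now[0])
    return len(attempts) == 3


found = wait_for_file(
    fake_check,
    frequency_s=5 * 60,      # frequency = 5.minutes
    duration_s=4 * 60 * 60,  # duration = 4.hours
    sleep=lambda s: now.__setitem__(0, now[0] + s),
    clock=lambda: now[0],
)
print(found, attempts)
```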
Archiving the file on the server
Under the “archives” subheading
(Note: We archive files in a different directory on the same server so we can still access them, but they won’t be automatically acquired the next time the Acquire instruction runs.)
- remote-from-path (string): the directory where the file begins (compare “remote-path” under “rsyncs”)
- remote-to-path (string): the directory (on the same server) where the file will end up when it is archived
- file-regex (string): a regular expression indicating the pattern for the names of the files that should be archived (compare “file-regex” under “rsyncs”)
acquire.sftp {
  archives: [
    {
      remote-from-path = "attach/test_brainps"
      remote-to-path = "attach/test_brainps/processed"
      file-regex = ".+[.]xlsx"
    }
  ]
}
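Archiving can be thought of as a set of moves from “remote-from-path” to “remote-to-path” for every file matching “file-regex”. The Python sketch below only computes that list of moves; it does not contact a server, and it is not Flatmap's implementation. The function name `plan_archive_moves` and the sample filenames are hypothetical, and the regex is assumed to match anywhere in the filename.

```python
import posixpath
import re


def plan_archive_moves(remote_files, remote_from_path, remote_to_path, file_regex):
    """Return (source, destination) pairs for the files to archive.

    remote_files is a list of filenames found in remote_from_path.
    Remote paths are joined with forward slashes.
    """
    pattern = re.compile(file_regex)
    moves = []
    for name in remote_files:
        if pattern.search(name):
            source = posixpath.join(remote_from_path, name)
            destination = posixpath.join(remote_to_path, name)
            moves.append((source, destination))
    return moves


# Made-up filenames used only for this example.
moves = plan_archive_moves(
    ["Inbound_a.xlsx", "notes.txt"],
    remote_from_path="attach/test_brainps",
    remote_to_path="attach/test_brainps/processed",
    file_regex=".+[.]xlsx",
)
for source, destination in moves:
    print(source, "->", destination)
```

Because the archived copy lands outside “remote-path”, the next Acquire run no longer sees it.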
Copying the file into the “attach” directory of the project
Under the main heading (e.g., “acquire.sftp”)
- attach-path (string): the directory into which the file will be copied (inside the “attach” directory of the project); the directory name will get the current date added to the beginning, so there will be a separate directory for each day that the “acquire” instruction is run
acquire.sftp {
  attach-path = "Daily/demographic"
}
Under the “rsyncs” subheading
- attach (boolean): set to true if the file should be copied into the “attach” directory
acquire.sftp {
  rsyncs : [
    {
      attach = true
    }
  ]
}
Sample Instructions
Filename: Acquire Sftp Albertlogin.conf
## 2021Jun14 Sftp Acquire for the Recipe Card
acquire.sftp {
  hostname = "sftp.albertlogin.us"
  username = ""
  password = ""
  port = 2143
  # 2019-Nov-03 Attach path as an alternative to local-replace-file
  attach-path = "Daily/demographic"
  # 2020.4 Sftp Acquire supports retry blocking
  retry {
    frequency = 5.minutes
    duration = 4.hours
  }
  rsyncs : [
    {
      remote-path = "attach/test_brainps"
      // local-replace-path = "./Identity Inbound ISSN"
      attach = true
      file-regex = "Inbound.+[.]xlsx"
      latest-only = true
    }
  ]
  // 2020.2 Verified working
  archives: [
    {
      remote-from-path = "attach/test_brainps"
      remote-to-path = "attach/test_brainps/processed"
      file-regex = ".+[.]xlsx"
    }
  ]
}