Types of Data Supported
Datameer supports the following types of structured, semi-structured, and unstructured data. See Supported Data Sources for additional details.
Databases
Relational databases include Oracle, DB2, and MySQL
Amazon Redshift - a hosted data warehouse product, which is part of the larger cloud computing platform Amazon Web Services.
Oracle – relational database management system designed for grid computing, including CLOB support for importing data.
MySQL – relational database based on structured query language. You need to provide the host name using a syntax such as
123.45.67.89 or anyhost.com. In addition, you need to provide the database name, user name, and password.
MSSQL – relational database based on structured query language.
DB2 – IBM relational database management system
PostgreSQL - an object-relational database management system (ORDBMS).
Greenplum - a massively parallel processing (MPP) database.
HSQL (file) – a lightweight, 100% Java SQL database engine. You need to provide the name of the database you want to use, the user name, and
the password.
HSQL (http) – used when access to the server hosting the database is restricted to the HTTP protocol due to firewalls on the client or server. You
need to provide the host name using a syntax such as 123.45.67.89 or anyhost.com. In addition, you need to indicate the port, database name,
user name, and password.
Please look here for more information about importing data from a database.
Before you can import data from a database, an administrator must install the database drivers (see Install Database Drivers).
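The connection details listed above (host name, port, database name) are usually combined into a JDBC URL that the installed driver understands. The following is a minimal sketch of that step, not Datameer's own code: the helper name `build_jdbc_url` and the kind labels are hypothetical, and the URL patterns are the ones commonly documented for the MySQL and HSQLDB drivers.

```python
def build_jdbc_url(kind, database, host=None, port=None):
    """Assemble a JDBC URL from connection details (illustrative helper only)."""
    if kind == "mysql":
        if not host:
            raise ValueError("MySQL needs a host name such as 123.45.67.89 or anyhost.com")
        # The port is optional here; the driver falls back to its default.
        authority = f"{host}:{port}" if port else host
        return f"jdbc:mysql://{authority}/{database}"
    if kind == "hsql-file":
        # File-based HSQL databases are addressed by name/path, with no host.
        return f"jdbc:hsqldb:file:{database}"
    if kind == "hsql-http":
        if not host or not port:
            raise ValueError("HSQL over HTTP needs both a host name and a port")
        return f"jdbc:hsqldb:http://{host}:{port}/{database}"
    raise ValueError(f"unknown database kind: {kind}")


if __name__ == "__main__":
    # "sales" and the port numbers are placeholder values.
    print(build_jdbc_url("mysql", "sales", host="anyhost.com"))
    print(build_jdbc_url("hsql-http", "sales", host="123.45.67.89", port=8080))
```

User name and password are deliberately left out of the URL in this sketch; they are normally passed to the driver separately.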
Files
Files –
Apache log files - The Apache server records every request it processes in an access log file. The format of the access log is
highly configurable. The location and content of the access log are controlled by the CustomLog directive.
Apache Avro - a data serialization system that provides rich data structures, a compact and fast binary data format, a container file to store
persistent data, remote procedure calls, and simple integration with dynamic languages.
Cobol Copybook - A COBOL copybook is a section of code that defines the data structures of COBOL programs.
Comma-delimited text files (.CSV) - This type of file stores tabular data (numbers and text) in plain-text form. Plain text means that the file
is a sequence of characters, with no data that has to be interpreted as binary numbers. A CSV file consists of any number of
records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most
commonly a literal comma or tab.
Excel Workbooks - A spreadsheet application developed by Microsoft. Datameer supports Excel 2007 and newer versions and uses the
1900 date system.
Fixed Width - A fixed-width file is a text file in which each field occupies a fixed number of character positions in every record, so columns are identified by position rather than by a delimiter.
HTML File Type - HyperText Markup Language, the markup language used to display content in a web browser.
IIS Logs - IIS (Internet Information Services) is a web server application and set of feature extension modules created by Microsoft for
use with Microsoft Windows. IIS 7.5 supports HTTP, HTTPS, FTP, FTPS, SMTP and NNTP.
JSON - An unordered collection of key:value pairs with the ':' character separating the key and the value, comma-separated and enclosed
in curly braces; the keys must be strings and should be distinct from each other.
Key/Value Pair - A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the
value, which is either the data that is identified or a pointer to the location of that data.
Log4j Log File - a log file written by Log4j, a popular logging package for Java.
Mbox - A generic term for a family of related file formats used for holding collections of electronic mail messages.
Netfilter / IP-Tables - Netfilter is the packet filtering framework inside the Linux kernel. Iptables is a user space application that allows a
system administrator to configure the tables provided by the Linux kernel firewall.
ORC (Optimized Row Columnar) - a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file
formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of
data processing framework, data model or programming language.
Regex Parsable Text Files - Specify the file or folder, enter a Regex pattern for processing the data, and specify whether the first row
contains the column headers.
Sequence File with Metadata - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
Unstructured data such as Twitter data - Information that does not have a pre-defined data model and/or does not fit well into
relational tables. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results
in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in
fielded form in databases or annotated in documents.
XML data - Specify the file or folder, the root element, the container element, and XPath expressions for the fields you want to import into
Datameer.
Azure Blob Storage - A Microsoft storage service for large unstructured binary and text data. Available for Datameer HDP 2.0+ and CDH
4+ users. Please contact our services department for the connector plug-in.
You can import or upload individual sheets from a spreadsheet by first converting the file to a .CSV file type.
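To illustrate the Apache log file and Regex Parsable Text Files entries in the list above, here is a small sketch of how a regular expression can turn one access-log line into named fields. It assumes the log uses Apache's Common Log Format; because the format is configurable through CustomLog, a real log may need a different pattern. The names `CLF_PATTERN` and `parse_line` are hypothetical and not part of Datameer.

```python
import re

# Pattern for Apache's Common Log Format:
#   host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')
    print(parse_line(sample))
```

Each named group corresponds to one column of the imported data; lines that do not match return None so they can be skipped or reported.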
Datameer server filesystem – the local filesystem. Use this choice to set up a local filesystem for use by Datameer.
FTP - (File Transfer Protocol) a standard network protocol used to transfer files from one host to another over a TCP-based network,
such as the Internet.
HDFS – (Hadoop Distributed File System) a distributed file system used by Hadoop applications that creates multiple replicas of data blocks and
distributes them on nodes throughout a cluster to allow extremely rapid computations. You need to provide the location of your HDFS
NameNode, such as hdfs://localhost:9000. In addition, you need to indicate the port used by the job tracker, e.g. localhost:9001. The default value is
9000. To learn more about HDFS, see: http://hadoop.apache.org/hdfs/
S3 – (Amazon Simple Storage Service) is a simple web services interface that provides scalable, reliable, secure, fast, and inexpensive
infrastructure for backup or storage of data. Choose this selection if you are using Amazon storage services. You need to provide the S3 Bucket,
the Access key, and the access secret. To learn more about S3, see: http://aws.amazon.com/s3/. Datameer supports the Signature Version 2
signing process.
S3-Block – A block-based file system backed by S3. You need to provide the S3 Bucket, the Access key, and the access secret. To learn more
about S3, see: http://aws.amazon.com/s3/
SFTP - (SSH File Transfer Protocol) Like FTP, it transfers files and has a similar command set, but unlike FTP, it encrypts both commands and
data, preventing passwords and sensitive information from being transmitted openly over the network.
SSH – (Secure Shell) a protocol and set of utilities, including SCP and SFTP, that uses public-key cryptography and
encryption to allow users to securely transfer files between Unix file systems. You need to provide the host name, port, user name, and password.
The default port is 22.
Datameer 3.0 versions do not support SSH/SCP for Windows; SFTP is supported for Windows.
As of v3.1, Datameer supports Bitvise SSH Server/Client for the Windows platform. The root path specified when creating the
connection should look something like: /c:/mydata/folder1
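As a rough illustration of the root path form shown above, the sketch below converts an ordinary Windows path into that shape. The helper name `to_ssh_root_path` is hypothetical, and lowercasing the drive letter is an assumption based only on the /c:/mydata/folder1 example; check the path form your SSH server actually expects.

```python
def to_ssh_root_path(windows_path):
    """Convert e.g. C:\\mydata\\folder1 into /c:/mydata/folder1 (illustrative only)."""
    drive, sep, rest = windows_path.partition(":")
    if not sep or len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"expected a path starting with a drive letter: {windows_path!r}")
    # Use forward slashes and drop leading/trailing separators.
    rest = rest.replace("\\", "/").strip("/")
    # Assumption: the drive letter is written in lowercase, as in the example.
    return f"/{drive.lower()}:/{rest}"


if __name__ == "__main__":
    print(to_ssh_root_path(r"C:\mydata\folder1"))
```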
Datameer is able to split large files across multiple mappers enabling parallel data ingestion. Two requirements must be fulfilled for this
to be possible.
1. The file protocol must support splitting. Currently, splitting is supported for all of the above protocols.
2. The compression type must support splitting. Currently, LZO and Gzip files can be split; zip and Bz2 files cannot.
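The two requirements above can be read as a simple check. The sketch below is only an illustration of that rule, not Datameer's implementation: the function names, the mapping from file extension to compression type, and the treatment of uncompressed files as splittable are all assumptions.

```python
import os

# Compression types the text lists as splittable.
SPLITTABLE = {"lzo", "gzip"}

# Assumed mapping from file extension to compression type.
EXTENSIONS = {".lzo": "lzo", ".gz": "gzip", ".zip": "zip", ".bz2": "bz2"}


def compression_of(filename):
    """Guess the compression type from the extension; None if unrecognized."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


def can_split(filename, protocol_splittable=True):
    """Apply both requirements: protocol support and compression support."""
    if not protocol_splittable:
        return False
    codec = compression_of(filename)
    if codec is None:
        # Assumption: a file with no recognized compression can be split.
        return True
    return codec in SPLITTABLE


if __name__ == "__main__":
    for name in ("events.log.gz", "archive.zip", "data.bz2", "data.lzo"):
        print(name, can_split(name))
```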
See Importing Data for more information.
Others
Hive – a data warehouse infrastructure built on Hadoop that provides data summarization and ad hoc querying. You need to provide the
connection type for the connection where Hive stores its data. This is usually an HDFS or S3 connection. In addition, you need to provide the
warehouse location and the metastore URI in a format such as thrift://host:10000. To learn more about Hive, see http://wiki.apache.org/hadoop/Hive
HiveServer2 - a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation,
based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide
better support for open API clients like JDBC and ODBC.
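The Hive entry above asks for a metastore URI in a format such as thrift://host:10000. As a minimal sketch of how such a value could be checked before it is entered, the helper below splits the URI with Python's standard urllib.parse module. The function name `parse_metastore_uri` is hypothetical, and requiring an explicit port is an assumption made for this example.

```python
from urllib.parse import urlsplit


def parse_metastore_uri(uri):
    """Return (host, port) from a URI like thrift://host:10000, or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme != "thrift":
        raise ValueError(f"expected a thrift:// URI, got: {uri!r}")
    # Assumption for this sketch: both host and port must be given explicitly.
    if not parts.hostname or parts.port is None:
        raise ValueError(f"host and port are both required: {uri!r}")
    return parts.hostname, parts.port


if __name__ == "__main__":
    print(parse_metastore_uri("thrift://host:10000"))
```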