Download How Data Integration Works Introduction to How Data Integration Works 8/24/2014

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Database wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Big data wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Data center wikipedia , lookup

Clusterpoint wikipedia , lookup

Data model wikipedia , lookup

Data analysis wikipedia , lookup

Forecasting wikipedia , lookup

3D optical data storage wikipedia , lookup

Information privacy law wikipedia , lookup

Data vault modeling wikipedia , lookup

Database model wikipedia , lookup

Business intelligence wikipedia , lookup

Transcript
8/24/2014
HowStuffWorks "Data Integration Basics"
How Data Integration Works
by Jonathan Strickland
Brow se the article How Data Integration Works
Introduction to How Data Integration Works
For the average Joe, information technology (IT) is a mysterious universe filled with indecipherable programming languages and expensive hardware. Eavesdropping on
IT technicians is almost like overhearing a conversation spoken in a foreign language. But despite this seemingly impenetrable language barrier, it can be critically
important for decision-makers in businesses and organizations to understand the world of IT. One of the most important IT concepts is data integration.
On the surface, data integration sounds like a simple idea. Because many organizations store information on multiple databases, they need a way to retrieve data
from different sources and assemble it in a unified way. For example, let's imagine that an electronics company is preparing to roll out a new mobile device. The
marketing department might want to retrieve customer information from a sales department database and compare it to information from the product department to
create a targeted sales list. A good data integration system would let the marketing department view information from both sources in a unified way, leaving out any
information that didn't apply to the search.
In reality, data integration is a complicated discipline. There isn't a universal approach to data integration, and many of the techniques IT experts use are still evolving.
Some data integration approaches might work better than others for an organization, depending upon that organization's needs. We'll take a close look at some of the
general strategies IT experts use to integrate multiple sources of data and enter the world of database management.
What are the basics of data integration? Find out in the next section.
adAfterBody
Data Integration Basics
Data integration focuses mainly on databases. A database is an organized collection of data. It's
similar to a file system, which is an organizational structure for files so they're easy to find, access
and manipulate.
There are different ways to categorize databases. Some people prefer to classify them according to
the kind of data the databases store. For example, you might classify a database as a media
database if all the information stored there is contained in video or sound files.
Another classification method looks at how the databases organize data. A database's
organizational arrangement is called a schema. A common organizational technique is to use
tables to show the relationship between different data points. Tables are like spreadsheets. Columns
This sim ple table show s custom er purchases.
How StuffWorks.com
define categories of data, while rows are records. A database using this approach is a relational
database.
Object-oriented programming (OOP) databases take a different approach to organizing data. The
OOP language is a departure from traditional approaches to programming, which follow the pattern of inserting data into a set of instructions and then producing
output. The OOP language focuses instead on defining data as objects and then determining how different objects relate and interact with one another.
http://computer.howstuffworks.com/data-integration1.htm/printable
1/4
8/24/2014
HowStuffWorks "Data Integration Basics"
To create an OOP database, first you'd define all the objects you plan on storing in the database. Then, you'd define the way each object relates to every other object
within the database. After you identify an object, you put it into a class, or set of objects. To define a class you have to determine what data each object within that
class must have and which logic sequences, called methods, will affect those objects. The objects within a system can communicate with you or other objects using
interfaces called messages.
It's easier to understand with an example. Let's say you're building a database containing information about American sports. You decide to start by defining baseball
teams. Once you've created the definition of a baseball team, you can generalize it as a class within the database. The Atlanta Braves would be a specific instance of
that class, also known as an object. The class of baseball teams belongs to a superclass of American sports teams, which would also include other classes like
football and soccer teams.
To access information within a database (no matter how it organizes data), you use a query. A query is just a request for information. People and applications can
submit queries to databases. A database responds to queries by sending data that meets the original request's parameters. Queries rely on special computer
languages such as Structured Query Language (SQL). If you've ever used an Internet search engine, you've submitted a query -- your search terms.
The database responds to queries by creating a view of data. A view is a specific way of displaying
data. In a data integration system, the returned view shows only the data directly related to the
original query. In our table example, if you submitted a query asking for all the customers who
bought more than $100 worth of products, you'd get this result:​
This view shows only the data relevant to the query "customers who purchased more than $100 in
products." Notice that it doesn't show what kind of products were purchased, nor does it display
customers who purchased less than $100 of products.
What are the different approaches to data integration? Find out in the next section.
This view show s only the data relevant to the query "custom ers
w ho purchased m ore than $100 in products."
How StuffWorks.com
WHAT IS DATA?
Data can be any kind of information at all. It can be the contents of a cell on a spreadsheet. It can be a sound file or a video. It can be a string of words in a
document. It can be the raw information created as output from a computer program. Or it can be the information used to describe a file. Data integration focuses
on information, not files.
adAfterBody
THIS WAS A TRIUMPH
Portal pages like iGoogle or My Yahoo are
examples of common user interfaces. Portals pull
information from multiple sources, but don't
integrate the data into a unified view.
adAfterSmallInset
Data Integration Approaches
Based on the previous section, you might think that databases are fairly complex. That's a fair assumption, and
it helps explain why data integration is still a developing discipline even though it's more than 30 years old. The
goal of data integration is to gather data from different sources, combine it and present it in such a way that it
appears to be a unified whole.
Let's say you're about to leave on a trip and you want to see what traffic is like before you decide which route to take out of town. Here's how the different approaches
to data integration would handle your query.
The manual integration approach would leave all the work to you. First, you'd have to know where to look for your data. You would need to know the physical
location for both the traffic report and the map for your town. You would need to retrieve the traffic report and the map data directly from their respective databases,
then compare the two sets of data against each other to figure out what's the best route out of town.
If you used a common user interface approach, you'd have to do a little less work. You'd use an interface such as the World Wide Web to make a query. The query
results would appear as a view on the interface. You'd still have to compare the traffic report to the map to determine the best route, but at least the interface would
take care of locating and retrieving the data.
Some integration approaches rely on applications to do all the work for you. The applications, which are specialized computer programs, would locate, retrieve and
http://computer.howstuffworks.com/data-integration1.htm/printable
2/4
8/24/2014
HowStuffWorks "Data Integration Basics"
integrate the information for you. During the integration process, the applications must manipulate the data so that the information from one source is compatible with
the information from the other source. In our example, that would mean you'd submit a query to an application and it would present a view that combined a map of
your town with data from traffic reports. The problem with this approach is that applications become complex and difficult to program as the number of data sources
and formats increase.
Then there's the common data storage method, also known as data warehousing. Using this method, all the data from the various databases you intend to
integrate are extracted, transformed and loaded. That means that the data warehouse first pulls all the data from the various data sources. Then, the data
warehouse converts all the data into a common format so that one set of data is compatible with another. Then it loads this new data into its own database. When you
submit your query, the data warehouse locates the data, retrieves it and presents it to you in an integrated view. Using our example, the data warehouse would locate
the latest information it has on traffic reports and maps of your town. Then it would integrate the two together and send the view back to you. There are several
advantages and drawbacks to this system, which we'll look into in the next section.
Most data integration system designers assume that the end goal is to create as little work for the end user as possible, so they tend to focus on applications and
data warehousing techniques.
What is it that data warehouses do, exactly? Find out in the next section.
META? I HARDLY KNEW HER!
Descriptions of data are called metadata.
Metadata is useful for naming and defining data
as well as describing the relationship of one set
of data to other sets. Data integration systems
use metadata to locate the information relevant to
queries.
adAfterSmallInset
Data Warehouses
As we saw earlier, a data warehouse is a database that stores information from other databases using a
common format. That's about as specific as you can get when describing data warehouses. There's no unified
definition that dictates what data warehouses are or how designers should build them. As a result, there are
several different ways to create data warehouses, and one data warehouse might look and behave very
differently from another.
In general, queries to a data warehouse take very little time to resolve. That's because the data warehouse has already done the major work of extracting, converting
and combining data. The user's side of a data warehouse is called the front end, so from a front-end standpoint, data warehousing is an efficient way to get integrated
data.
From the back-end perspective, it's a different story. Database managers must put a lot of thought into a data warehouse system to make it effective and efficient.
Converting the data gathered from different sources into a common format can be particularly difficult. The system requires a consistent approach to describing and
encoding the data.
The warehouse must have a database large enough to store data gathered from multiple sources. Some data warehouses include an additional step called a data
mart. The data warehouse takes over the duties of aggregating data, while the data mart responds to user queries by retrieving and combining the appropriate data
from the warehouse.
One problem with data warehouses is that the information in them isn't always current. That's because of the way data warehouses work -- they pull information from
other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse won't result in the most current and accurate
views. If the data in a system rarely changes, this isn't a big deal. For other applications, though, it's problematic.
Going back to our example from before with the traffic report and map, you can see how this would be a problem. While the town's map might not require frequent
updates, traffic conditions can change dramatically in a relatively short amount of time. A data warehouse might not extract data very frequently, which means timesensitive information may not be reliable. For those sort of applications, it's better to take a different data integration approach.
What's the alternative to data warehousing? Find out in the next section.
CLUB FED
Federated database systems (FDBMS) are
collections of networked, autonomous databases.
These systems handle several daunting tasks:
Networked Databases
For data integration systems that rely on information that changes frequently, a data warehouse approach isn't
ideal. One way that IT experts try to address this issue is to design systems that pull data directly from
Accepting user queries
individual data sources. Since there's no centralized database dedicated to analyzing, categorizing and
Dividing them into several sub-queries
integrating the data in preparation for user queries, those responsibilities fall to other parts of the system.
Using special tags called wrappers to define
the sub-queries in a way that each respective
database can understand
IT experts define data integration systems in terms of schemata. The unified view produced from a processed
Sending those wrapped sub-queries to the
appropriate databases
is the source schema. The way the global and source schemata interrelate is called mapping. Think of the
Accepting the data sent back from those
databases
query is the global schema. The structure of the various data sources and the way they relate to one another
source schema as a blueprint for all the data within the system, while the global schema is a blueprint for the
view presented in response to a query.
Integrating all the data into a unified view, and
Presenting that view to the user.
They're difficult to create and maintain, mainly due
to the complex nature of databases.
adAfterSmallInset
There are two main approaches to resolving queries in a data integrated system: global-as-view and local-asview. Each approach focuses on a particular part of the overall system and has its advantages and
disadvantages.
In a global-as-view approach, the focus is on the global schema. As long as the data sources remain consistent,
the global-as-view approach works well. It's easy to change the set-up of the global schema. That means it's not difficult to analyze the same overall set of data in
different ways. However, adding or removing data sources to the system is problematic, because it affects data across the system as a whole.
The local-as-view technique takes the opposite approach. It focuses on the data sources. As long as the global schema remains constant, it's easy to add or remove
http://computer.howstuffworks.com/data-integration1.htm/printable
3/4
8/24/2014
HowStuffWorks "Data Integration Basics"
data sources to the system. The schema looks for the same kinds of data and relationships within the new data sources. In this approach, changing the parameters
of the global schema is difficult. If you want to analyze the data sources in a new way, you'll have to redefine the entire system.
So that's the story on data integration. The next time you look at a weather map or call up a filtered selection of data, you'll be aware of the complex series of
processes going on in the background making it all possible.
To learn more about data integration, migrate on over to the links on the following page.
Lots More Information
Related Articles
How Boolean Logic Works
What are relational databases?
How Domain Name Servers Work
How Home Networking Works
How Internet Infrastructure Works
How Internet Search Engines Work
How Semantic Web Works
How Web Pages Work
How Web Servers Work
How Information Architecture Works
More Great Links
Data Warehousing Information Center
SearchDataManagement
adLastPage
Sources
Baldwin, James R. "The Data Warehouse: An Overview." Spring 1997. http://varietysoftworks.com/jbaldwin/Education/CS615-DataWarehouse.html
Haas, Laura and Lin, Eileen. "IBM Federated Database Technology." IBM. March 1, 2002. http://www128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html
Halevy, Alon Y. et al. "Enterprise Information Integration: Successes, Challenges and Controversies." International Conference on Management of Data. 2005.
Koch, Christoph. "Data Integration against Multiple Evolving Autonomous Schemata." PhD Thesis. Technical University, Vienna. May 16, 2001.
http://www.csd.uoc.gr/~hy562/Papers/thesis_final.pdf
Lenzerini, Maurizio. "Data Integration: A Theoretical Perspective." University of Rome. ACM PODS. 2002. http://delivery.acm.org/10.1145/550000/543644/p233lenzerini.pdf?key1=543644&key2=8368337021&coll=&dl=&CFID=15151515&CFTOKEN=6184618
Poje, Richard J. "Treasury and IT integration in plain English." Sep. 1, 2003. http://www.allbusiness.com/management/channel-management/659425-1.html
SearchDataManagement. http://searchdatamanagement.techtarget.com/home/0,289692,sid91,00.html
Singh, Munindar P. "The Practical Handbook of Internet Computing." CRC Press. 2005.
The Data Warehousing Information Center. http://www.dwinfocenter.org/
Ziegler, Patrick and Dittrich, Klaus R. "Three Decades of Data Integration -- All Problems Solved?" University of Zurich. First International IFIP Conference on
Semantics of a Networked World. 2004.
http://computer.howstuffworks.com/data-integration1.htm/printable
4/4