Preventing Fallen ANGELs: Redundancy, Backup, Recovery

No Fallen ANGELs! Redundancy, Backup, Recovery
Andrea Chappell: University of Waterloo
Adam Hauerwas: Providence College
Ruomiao Wang & Jie Li: Kelley Direct, Indiana University
Terry O'Heron & Crystal Foust: Penn State
Agenda
• How do you back up and archive courses?
• What policies and procedures guide your response to requests to recover a course, a file, an internal ANGEL page, a student upload file?
• How do you protect your system from various failures, and in what time do you “promise” to have it back online?
University of Waterloo (Andrea)
• ANGEL has been the centrally supported LMS since summer 2004.
• Core to university business.
• Need to configure against various types of failures, e.g.:
  - Disaster (fire, flooding, etc.)
  - Partial system failure (ANGEL/IIS or SQL Server systems, disks, etc.)
Constraints (what we can’t change)
• Support coverage is not 24x7: Central IT (IST) provides extended support for critical systems but not 24x7 support.
• Cannot survive lengthy power outages.
• Cannot survive some network outages.
  - Network support is also not 24x7.
Backup Processes
• System data backup
  - Database (dump of db file), transaction logs (cut once per day) and upload files are backed up nightly by the campus backup service.
• Course archives
  - Long term: Archive courses at end of term.
  - Shorter term: Remove from system after 4 terms. (Note: to offer a course again, copy the course rather than reuse the same instance.)
Recovery Process
• Recover data to dev system and copy lost data to production.
  - This can be very complex if the missing data is a quiz that was run, a bulletin board, etc.!
• Currently no policies on what to recover, or promise of time to recovery. Requests are considered on an individual basis.
Protecting against failures
• Current strategy: Buy robust equipment, configure to minimize points of failure.
Production Systems: ANGEL/IIS (Dell server) and SQL Server (Dell server)
• Dual RAID disks
• Dual power supply
• 7x24, 4-hour hardware support (from vendor)
• Housed in access-controlled machine room
• Uninterruptible power supply (UPS)
Development System: ANGEL/IIS and SQL Server
Vulnerabilities in Current Strategy
• The ANGEL/IIS or SQL Server hardware, e.g., system motherboard failure
  - Don’t have a ready back-up machine. Could temporarily use the development system.
  - Likely a minimum half day of downtime.
• Machine room “fire”
  - All hardware lost.
  - Up to one day of lost data (if 24 hours from last backup).
  - Days of downtime!
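The “up to one day of lost data” estimate follows from the nightly backup cadence: anything written after the last completed backup is gone. A minimal Python sketch of that arithmetic, where the function name and the timestamps are hypothetical and chosen only for illustration:

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure):
    """Return how much work is lost if everything after last_backup is gone."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


# Hypothetical times: nightly backup at 01:00, failure just before the next one.
last_backup = datetime(2005, 5, 2, 1, 0)
failure = datetime(2005, 5, 3, 0, 59)
print(data_loss_window(last_backup, failure))  # 23:59:00
```

With one backup per night the window can never exceed about 24 hours, which is why more frequent log backups or replicated storage are the usual ways to shrink it.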
Configurations under Investigation
Looking for faster recovery time, less potential data loss, through increased redundancy.
• Config 1: Identical production and development systems, different locations.
• Config 2: Identical production and dev systems, shared data (data filer), Load Balancer (Cisco), different locations.
Config 1
• Identical production and development systems, different locations.
Diagram: ANGEL/IIS (Dell server); SQL Server (Dell server)
Gains:
• In system failure:
  - If possible, move disks to duplicate system: 4 working hours.
  - Or, recover data to duplicate systems: perhaps 8 working hours.
Issues:
• Human intervention still required.
Cost:
• Two new systems.
Config 2
• Identical prod and dev systems, shared data, load balancer, different locations.
Diagram: Load Balancer; ANGEL/IIS (Dell server) x 2; SQL Server (Dell server); Data Filer
Gains:
• Failure of one ANGEL/IIS system: instantaneous failover to the remaining system.
• Failure of SQL Server: reconfigure dev system to point to data filer.
Issues:
• Single point of failure unless filer clustered.
• Greater complexity may cause downtime.
Cost:
• 3 new systems, plus filer (~$30 USD)
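The failover gain in Config 2 can be pictured with a small Python sketch. It only illustrates round-robin selection that skips unhealthy nodes; it does not describe how the Cisco load balancer is actually configured, and the class, method, and host names are hypothetical.

```python
class FailoverPool:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.up = {name: True for name in self.backends}
        self._cursor = 0

    def mark(self, name, is_up):
        """Record the result of a health check for one backend."""
        if name not in self.up:
            raise KeyError(name)
        self.up[name] = is_up

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        count = len(self.backends)
        for step in range(count):
            index = (self._cursor + step) % count
            name = self.backends[index]
            if self.up[name]:
                self._cursor = (index + 1) % count
                return name
        raise RuntimeError("no healthy ANGEL/IIS backend available")


pool = FailoverPool(["angel-iis-a", "angel-iis-b"])  # hypothetical host names
print([pool.pick() for _ in range(4)])  # alternates between the two nodes
pool.mark("angel-iis-a", False)         # simulate a failed web server
print([pool.pick() for _ in range(3)])  # every request now goes to angel-iis-b
```

The sketch also shows why the filer remains the weak point: the pool can route around a dead web server, but both servers still depend on the same shared data.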
Providence College (Adam)
• Like Waterloo, ANGEL has been our LMS since Fall 2001.
• Support coverage is not 24x7.
• Cannot survive lengthy power outages or network outages.
PC Backup and Recovery
• System data backup
  - Back up database and logs to files once per day.
  - Use Tivoli to back up both DB and file system nightly.
  - Creates “backup of a backup.”
• Course archives
  - Short term: Archive courses 90 days after term end.
  - Long term: Store archives to DVD.
• Recovery
  - Like Waterloo, recover Production database in Development environment.
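The 90-day archive rule is simple date arithmetic. A minimal Python sketch, assuming each course records its term end date; the course names, dates, and function names are hypothetical:

```python
from datetime import date, timedelta

ARCHIVE_DELAY = timedelta(days=90)  # from the slide: 90 days after term end


def archive_due_date(term_end):
    """Date on which a course from this term becomes due for archiving."""
    return term_end + ARCHIVE_DELAY


def courses_due(term_end_by_course, today):
    """Return the names of courses whose archive date has arrived, sorted."""
    return sorted(
        name
        for name, term_end in term_end_by_course.items()
        if archive_due_date(term_end) <= today
    )


# Hypothetical courses and term end dates.
courses = {
    "BIO-101 Fall 2004": date(2004, 12, 17),
    "HIS-210 Spring 2005": date(2005, 5, 13),
}
print(archive_due_date(date(2004, 12, 17)))    # 2005-03-17
print(courses_due(courses, date(2005, 6, 1)))  # ['BIO-101 Fall 2004']
```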
PC’s Redundancy
• Today: Robust Production Server
Production System: ANGEL IIS/SQL (HP DL380)
Development System: ANGEL IIS/SQL (Desktop)
• Multiple RAID disks (System, DB, Data)
• Dual power supplies and NICs
• Access-controlled machine room
• UPS
PC’s Future Architecture
• This Summer: New Server and SAN
Production System: ANGEL IIS/SQL (New HP)
Development System: ANGEL IIS/SQL (Old HP)
Shared storage: IBM Storage Area Network
• Purchase new server and install O/S and SQL Server on local RAID.
• Store database and web files on SAN disk.
• In the event of Production hardware failure, connect Production disk to Development server with little downtime.
Kelley Direct On-Line Programs, Indiana University (Ruomiao)
Road to ANGEL
• Piloted ANGEL as LMS in Fall 2003
• Spring 2004: all courses delivered via ANGEL
• Critical learning platform that connects KD to the students
Current Data Protection Measures
Backup
• System Backups
  - Full backups once a week starting Friday night
  - Differential backups every night around 11 PM
• Database Backups
  - Full ANGEL SQL database backup every night at 10 PM. The database backup output files are then backed up by system tape backups for that night.
  - Transaction log backups every six hours.
• The backup tapes are then taken to an offsite location.
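The database schedule implies a restore sequence: the most recent nightly full backup, followed by each transaction log backup taken since. A minimal Python sketch of that reasoning follows. The 10 PM full backup and the six-hour log interval come from the slide; the clock times at which logs are cut are an assumption (00:00, 06:00, 12:00, 18:00), the function names are hypothetical, and every backup is assumed to have completed and be usable.

```python
from datetime import datetime, timedelta

FULL_BACKUP_HOUR = 22              # from the slide: full backup every night at 10 PM
LOG_INTERVAL = timedelta(hours=6)  # from the slide: log backups every six hours
FIRST_LOG_HOUR = 0                 # ASSUMPTION: logs are cut at 00:00, 06:00, 12:00, 18:00


def last_full_backup(failure):
    """Most recent 10 PM full backup at or before the failure time."""
    candidate = failure.replace(hour=FULL_BACKUP_HOUR, minute=0, second=0, microsecond=0)
    if candidate > failure:
        candidate -= timedelta(days=1)
    return candidate


def restore_plan(failure):
    """Return (full backup, log backups to replay in order, unrecoverable window)."""
    full = last_full_backup(failure)
    log = full.replace(hour=FIRST_LOG_HOUR)  # midnight on the day of the full backup
    while log <= full:                       # skip log cuts at or before the full backup
        log += LOG_INTERVAL
    logs = []
    while log <= failure:
        logs.append(log)
        log += LOG_INTERVAL
    recovered_to = logs[-1] if logs else full
    return full, logs, failure - recovered_to


full, logs, lost = restore_plan(datetime(2005, 5, 4, 15, 30))  # hypothetical failure time
print(full)       # 2005-05-03 22:00:00
print(len(logs))  # 3  (00:00, 06:00 and 12:00 on May 4)
print(lost)       # 3:30:00
```

Under these assumptions the unrecoverable window is always shorter than six hours, compared with up to a day for a schedule that relies on the nightly full backup alone.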
Current System Protection Measures
• Disk
  - Configured with RAID 5 with a spare disk
• Dual power connections
• UPS system connection (30 min.)
• Spare Chassis
  - Test server has identical hardware and serves as a spare chassis
Current Recovery Practices
• File or Database Restore
  - Restore from disk, tape backups, or individual developers’ machines.
• System Component Failure
  - Replace the faulty component(s) from the spare chassis (test server) or move the entire disk array from production to the test server.
• Total System Failure or disk array failure
  - Rebuild entire system, possibly on alternate hardware.
  - All the ANGEL components will either need to be installed from scratch, or restored from backup tapes. Some system components have to be reconfigured manually.
Challenges for KD ANGEL Environment
• Security
• Scalability
  - Limited capability to scale performance based on volume
• Availability
  - ANGEL web server resides on the same physical machine that hosts the ANGEL databases
  - No redundancy built in. Single server design. Any component failure means downtime
  - Shrinking Maintenance Window (or do we still have one?)
(continued on next slide)
Challenges for KD ANGEL Environment
• Storage Capacity
  - Limited expansion capability
• Recoverability
  - Single copy of production data on disk. Tape restoration is time consuming and means data loss
• Availability
  - No redundancy built in. Single server design. Any component failure means downtime
• Growth
  - Significant enrollment growth is expected for the programs in the next three years
• Development Environment
  - Developers are coding on their own machines. Configurations differ from production environment. Less efficient.
Some Questions
• How can backend infrastructure better support the vision of the on-line programs?
• How to plan system capacity when the program changes (such as enrollment growth)?
• How to better protect student data?
• What are the available options for long-term data retention?
• How to better meet the requirements for less service interruption?
• What should we do to ensure a faster ANGEL systems recovery?
Penn State Environment (Terry, Crystal)
• Support coverage is 24x7
• Backup power (generator)
• Redundant network connectivity
• Failover capability
• Mirrored storage
• Daily backups/off-site storage
• Daily maintenance (5-7 am)
• Archive (courses, inactive groups)
Constraints
• Backup
  - SQL: 3 hours
  - File: 3-4 days
• Restoration
  - SQL: 1.5 hours
  - File: 2 min. - ??
ANGEL Production Environment
Role                            Hardware               CPUs           RAM
eND Load Balancer               Dell PE1650 (WIN2K)    2 x 1.4 GHz    2.5 GB
eND Load Balancer (Failover)    Dell PE1650 (WIN2K)    2 x 1.4 GHz    2 GB
Web Server 1                    Dell PE1850            2 x 3.2 GHz    3 GB
Web Server 2                    Dell PE1850            2 x 3.2 GHz    3 GB
Web Server 3                    Dell PE1850            2 x 3.2 GHz    3 GB
Web Server 4                    Dell PE1750            2 x 3.0 GHz    4 GB
Web Server 5                    Dell PE1850            2 x 3.2 GHz    3 GB
Web Server 6                    Dell PE1850            2 x 3.2 GHz    3 GB
Web Server 7                    Dell PE1850            2 x 3.2 GHz    3 GB
File Server                     Dell PE2650            2 x 3.06 GHz   8 GB
File Server (Failover)          Dell PE2650            2 x 2.8 GHz    4 GB
SQL Server                      IBM xSeries 445        8 x 2.7 GHz    16 GB
SQL Server (Failover)           IBM xSeries 445        8 x 2.7 GHz    16 GB