TRANSPARENT APPLICATION FAILOVER FOR RAC
RAC provides near-continuous availability by hiding
failures from end-user clients and application server clients. This provides
continuous, uninterrupted data access. Transparent Application Failover (TAF)
is what applications use to sync up with Oracle RAC availability. TAF in the database reroutes
application clients to an available database node in the cluster when the
connected node fails. Application clients do not see error messages describing
loss of service.
if the users connection to Node 1 dies, their transaction
is rolled back but they can continue work without having to manually reconnect.
To get a good understanding of how the TAF architecture works, it is helpful to
walk through a failover scenario where a user is querying the database to
retrieve 1000 rows from the database. Assume that the user is connected to Node
1/ Instance 1.
1. The heartbeat mechanism between the various nodes in the cluster checks to see if the other node is available and is participating in the cluster configuration.
2. Let’s assume that a user on his/her terminal performs a query against the database to retrieve 1000 rows from the database via Instance 1.
3. The initial 500-rows are retrieved from the database via Instance 1 and returned to the user for browsing through his/her interface.
4. While the user is browsing through the first 500 rows through his browser, Node 1 fails and does not participate in the clustered configuration.
5. Node 2 checks for the heartbeat of the other participating nodes deducts that Node 1 is not available.
6. While the user is unaware of the failure and scrolls through the remaining 500 rows on the window. The process try’s to connect to Instance 1 but detects that the Instance is not available.
7. Using the entries in the tnsnames.ora file the user connection to the other available node is established.
8. User is now connected to Node 2.
9. Oracle re-executes the query using the connection on Instance 2 and displays the remaining rows to the user. If the data was available in the buffer cache, the rows are returned back to the user instantaneously. However, if the rows are not available, Oracle has to perform an I/O operation. This would be delayed until the recovery process has completed.
when Node 1 fails, any SELECT
statements that had partially executed on Instance 1 are migrated as part of
the failover process and are displayed through Instance 2 when the user process
fails over to Node 2. All this happens transparently without any interruption
to the user. It should be noted that along with the ‘SELECT’ statement, the
following are also failed over:
- Client/server connection.
- User session state.
- Prepared statements.
- Active cursors that have begun to return results to the user.
The benefits that such a feature adds in meeting the high
availability requirements in today’s machine critical applications are
overwhelming and the first question that arises is this, why did Oracle not
introduce such a feature before or why is this feature not available among
other databases? Though the mechanism is very useful in meeting today’s high
availability requirements, implementing such a feature is complex, basically
because the database connections are not stateless. This means that during the
moment of failure, the database, the users and the transactions are in a specific
state of operation, such as:
- Retrieving data from the database
- Database has a connection via Oracle Net to the Instance on Node 1
- User connecting to the database has password and other user-authentication information
- The session has language and character set information that is specific to the Instance on which the user has established connection
- There are cursors open and in execution
- 'SELECT’ cursors are open and partially scrolled by the user
- INSERT, UPDATE, and DELETE statements and PL/SQL procedures are being executed
It should be noted from the scenarios above that only
SELECT statements are failed over from one node to another; transactional
statements are not failed over by configuration of TAF. However, transactional
statements can programmatically be transferred from Node 1 to Node 2 by proper
validation of Oracle returned error messages and taking appropriate actions.
This should be a preferred method to avoid any user interruptions as well as
keeping the database or system failures transparent to the user. Among the
transactional statements, the following do not failover or are not protected
when a node fails:
- PL/SQL server-side package variables
- Global temporary tables
- Effect of any ALTER SESSION statement
- Applications not using the JDBC thick driver
- Transactional statements, ie Statements that include INSERT, UPDATE and DELETE operations
While the failover is in process, it is user friendly to
inform the user via the application interface that the activity or command
issued may take some time. This information could be communicated by validating
the various error messages returned by Oracle as part of the node and
connection failure. Some of the common Oracle error codes that should be
handled by the application to track and transfer transactional statements
include:
• ORA-01012: not logged on to Oracle
• ORA-01033: Oracle initialization or shutdown in progress
• ORA-01034: ORACLE not available
• ORA-01089: immediate shutdown in progress – no operations are permitted
• ORA-03113: end-of-file on communication channel
• ORA-03114: not connected to ORACLE
• ORA-12203: TNS: unable to connect to destination
• ORA-12500: TNS: listener failed to start a dedicated server process
• ORA-12571: TNS: packet writer failure
Transparent Application Failover Configuration :
Using Oracle RAC and TAF Together
The continuous availability features of Oracle RAC and TAF come together when these products cooperate in restarting failed transactions. Let's take a closer look at how this works.Within each connected Oracle Net client, tnsnames.ora file parameters define the failover types and methods for that client. The parameters direct Oracle RAC and TAF on how to restart any transactions that may be in-flight during a hardware failure on the node.
It is important to note that TAF failover control is external to the Oracle RAC cluster, and each Oracle Net client may have unique failover types and methods, depending on processing requirements. The following is a client tnsnames.ora file entry for a node, including its current TAF failover parameters:
bubba.world =
(DESCRIPTION_LIST =
(FAILOVER = true)
(LOAD_BALANCE = true)
(DESCRIPTION =
(ADDRESS =
(PROTOCOL = TCP)
(HOST = redneck)(PORT = 1521))
(CONNECT_DATA =
(SERVICE_NAME = bubba)
(SERVER = dedicated)
(FAILOVER_MODE =
(BACKUP=cletus)
(TYPE=select)
(METHOD=preconnect)
(RETRIES=20)
(DELAY=3)
)
)
)
The failover_mode section of the tnsnames.ora file lists the parameters and their values:
BACKUP=cletus. This names the backup node that will take over failed connections when a node crashes. In this example, the primary server is bubba, and TAF will reconnect failed transactions to the clients instance in case of server failure.
TYPE=select. This tells TAF to restart all in-flight transactions from the beginning of the transaction (and not to track cursor states within each transaction).
METHOD=preconnect. This directs TAF to create two connections at transaction startup time: one to the primary bubba database and a backup connection to the clients database. In case of instance failure, the clients database will be ready to resume the failed transaction.
RETRIES=20. This directs TAF to retry a failover connection up to 20 times.
DELAY=3. This tells TAF to wait three seconds between connection retries.
Remember, you must set these TAF parameters in every tnsnames.ora file on every Oracle Net client that needs transparent failover. Putting It All Together
An Oracle Net client can be a single PC or a huge application server. In the architectures of giant Oracle RAC systems, each application server has a customized tnsnames.ora file that governs the failover method for all connections that are routed to that application server.
Watching TAF in Action
The transparency of TAF operation is a tremendous advantage to application users, but DBAs need to quickly see what has happened and where failover traffic is going, and they need to be able to get the status of failover transactions. To provide this capability, the Oracle data dictionary has several new columns in the V$SESSION view that give the current status of failover transactions.The following query calls the new FAILOVER_TYPE, FAILOVER_METHOD, and FAILED_OVER columns of the V$SESSION view. Be sure to note that the query is restricted to nonsystem sessions, because Oracle data definition language (DDL) and data manipulation language (DML) are not recoverable with TAF.
select
username,
sid,
serial#,
failover_type,
failover_method,
failed_over
from
v$session
where
username not in ('SYS','SYSTEM',
'PERFSTAT')
and
failed_over = 'YES';
You can run this script against the backup node after an instance failure to see those transactions that have been reconnected with TAF. Remember, TAF will quickly redirect transactions, so you'll only see entries for a short period of time immediately after the failover. A backup node can have a variety of concurrent failover transactions, because the tnsnames.ora file on each Oracle Net client specifies the backup node, the failover type, and the failover method. Transparent Application Failover (TAF) Configuration For Standby database (Data Guard) :
The following configurations require that the client reconnect to the database server when the primary database fails over to the standby database:
However, if you are using an Oracle Call Interface (OCI) client, you can use transparent application failover (TAF). TAF is the ability of applications to automatically reconnect to the database if the connection fails. If the client application is not involved in a database transaction, then users may not notice the failure of the primary database server.
The following example shows address information for the ProductDB database and the Standby1 database:
ProductDB=( DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(PORT=1521)(HOST=PRMYHOST.foo.com)) (CONNECT_DATA=(SID=db1)(FAILOVER_MODE=(BACKUP=Standby1) (TYPE=session) (METHOD=basic))) ) Standby1 =( DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(PORT=1521)(HOST=STBYHOST.foo.com)) (CONNECT_DATA=(SID=db1)) )
Sequence of events:
- Client application is connected to the ProductDB database.
- The primary database in PRMYHOST.foo.com fails.
- The standby database is activated as the new production database.
- When the client application fails to connect to PRMYHOST.foo.com, it uses the database specified with the BACKUP parameter in the FAILOVER_MODE clause, and automatically connects to STBYHOST.foo.com.
Comments
Post a Comment