One of the things I have appreciated about my “new job” is getting to work with a technically astute large-scale mission-critical enterprise architecture. Part of this includes data warehousing and business intelligence, two areas I have had interest in since reading “Super Crunchers” by Ian Ayres (a very interesting look at data mining to find extremely useful and actionable information).

OLTP vs. OLAP

When first introduced to data warehousing, my mental model was of a super complex, massive farm of database servers.  The reality is that data warehouses sprang up because analytical workloads have different needs than transactional ones:

  • Online Transaction Processing (OLTP). This mirrors how a database would be used in a typical web-based production environment, such as placing orders on Amazon.  It involves a large number of short-lived transactions in a write-oriented database that is probably barely keeping up with its capacity needs.  With limited capacity, do you really want to be executing in-depth analysis queries against your production transaction-oriented database?  Even if you did, how could you possibly tune it to be as fast as possible from both a read and a write perspective?  You cannot.
  • Online Analytical Processing (OLAP). Instead of doing write-oriented update transactions, OLAP focuses on read-oriented queries and statistical analysis.  For this, you want your database to be optimized for read operations, as these queries can take a while to execute and the organization will want to identify trends or other issues from this near real-time data.  For example, think about how Amazon mines your purchase patterns to make suggestions, or to increase or decrease the price of books and other offerings.  This is actionable business intelligence leading to greater agility and returns.

Thus, your data warehouse is typically a separate read-oriented database containing near real-time information (typically anywhere from less than an hour to a day stale) with a much larger time horizon.

ETL

One of the inputs can be your OLTP database, but a data warehouse is typically composed of numerous data feeds.

This is where Extract, Transform, and Load (ETL) typically comes in:

  • Extract.  Extract the data from multiple data sources
  • Transform.  Transform and clean the data. Different data sources can have different representations of the same conceptual entity.  Furthermore, they can contain data errors (e.g., data entry input errors) and other related problems that need to be addressed when trying to put together an integrated picture from many different data sources
  • Load.  Load the transformed data into the data warehouse
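As a rough illustration of the three steps above, here is a minimal in-memory sketch. Everything in it is made up for the example: the two feeds, their row formats, the `Sale` record, and the region lookup table are assumptions, not anything from a real ETL tool.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class EtlSketch {
    /** Hypothetical unified row stored in the warehouse. */
    record Sale(String region, long amountCents) {}

    // Assumed lookup used to normalize state codes to full names.
    private static final Map<String, String> REGION_NAMES =
            Map.of("LA", "Louisiana", "TX", "Texas");

    // Extract: two made-up feeds describing the same concept differently.
    static List<String> extractFeedA() { return List.of("LA,19.99", "tx,5.00", "bad row"); }
    static List<String> extractFeedB() { return List.of("Louisiana|250", "Texas|1000"); }

    // Transform: normalize region names and units, and drop unparseable rows.
    static List<Sale> transform(List<String> feedA, List<String> feedB) {
        List<Sale> clean = new ArrayList<>();
        for (String row : feedA) {            // feed A: state code, dollars
            String[] parts = row.split(",");
            if (parts.length != 2) continue;  // e.g. a data entry error
            String region = REGION_NAMES.get(parts[0].trim().toUpperCase(Locale.ROOT));
            if (region == null) continue;     // unknown state code
            long cents = new BigDecimal(parts[1].trim()).movePointRight(2).longValueExact();
            clean.add(new Sale(region, cents));
        }
        for (String row : feedB) {            // feed B: full name, cents
            String[] parts = row.split("\\|");
            if (parts.length != 2) continue;
            clean.add(new Sale(parts[0].trim(), Long.parseLong(parts[1].trim())));
        }
        return clean;
    }

    // Load: append the cleaned rows to the (in-memory) warehouse table.
    static void load(List<Sale> warehouse, List<Sale> rows) { warehouse.addAll(rows); }

    static List<Sale> run() {
        List<Sale> warehouse = new ArrayList<>();
        load(warehouse, transform(extractFeedA(), extractFeedB()));
        return warehouse;
    }
}
```

Calling `EtlSketch.run()` yields four rows; the malformed "bad row" entry from feed A is dropped during the transform step.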

Business Intelligence (BI) and Data Mining

Business Intelligence is the use of the data in the data warehouse to derive actionable business-level information.  This can include analysis along a number of different dimensions (e.g., sales per region, sales over time, trending) as well as forecasting the future based upon the past.  Analysts can use ad hoc queries to see what is going on and to run “what-if” scenarios.

Data Mining can be employed to identify trends and other relationships within the data that would not be so readily obvious.

Star Schema

For update-oriented operations such as a web site handling placing orders, you want a normalized schema for greater efficiency.  However, for the analysis queries performed in a data warehouse, you typically want a mix between a normalized schema and a star schema.  So what is a star schema?

I am going to use an example from the book “Oracle Essentials”, which I just finished reading.  Here is a typical query (which shows the advantages of a star schema):

Show me how many sales of widgets (a product type) were sold by a store chain (a sales channel) in Louisiana (a geography) over the past 3 months (a time)

This query involves many dimensions:

  • product type
  • sales channel
  • geography
  • time

In the star schema, you would have a central fact table (representing sales transactions) with four connected dimension tables (e.g., Product, Channel, Geography and Time).

For efficiency, data within these dimensions is usually hierarchical (e.g., for the time dimension, day rolls up into week, which rolls up into month, which rolls up into quarter, which rolls up into year).  If your query is looking for a particular quarter, it can be executed against that summary as opposed to all the more granular data related to weeks and days.  Hence these are referred to as summary tables.
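To make the shape of the star schema concrete, here is a small in-memory sketch of the widget query. The record names, the sample rows, and the month-level time dimension are all assumptions for illustration; a real warehouse would express this as SQL joins between the fact table and the dimension tables.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StarSchemaSketch {
    // Dimension rows (hypothetical columns).
    record Product(String type) {}
    record Channel(String name) {}
    record Geography(String state) {}
    record Month(int year, int month) {}

    // Fact row: only dimension keys plus the measure (units sold).
    record SaleFact(int productId, int channelId, int geographyId, int monthId, int units) {}

    static final Map<Integer, Product> PRODUCTS =
            Map.of(1, new Product("widget"), 2, new Product("gadget"));
    static final Map<Integer, Channel> CHANNELS =
            Map.of(1, new Channel("store chain"), 2, new Channel("web"));
    static final Map<Integer, Geography> GEOGRAPHIES =
            Map.of(1, new Geography("Louisiana"), 2, new Geography("Texas"));
    static final Map<Integer, Month> MONTHS = Map.of(
            1, new Month(2009, 10), 2, new Month(2009, 11),
            3, new Month(2009, 12), 4, new Month(2009, 6));

    static final List<SaleFact> SALES = List.of(
            new SaleFact(1, 1, 1, 1, 10),
            new SaleFact(1, 1, 1, 3, 5),
            new SaleFact(1, 1, 2, 2, 7),   // Texas
            new SaleFact(2, 1, 1, 2, 4),   // gadget
            new SaleFact(1, 2, 1, 2, 3),   // web channel
            new SaleFact(1, 1, 1, 4, 8));  // June, outside a Oct-Dec window

    // Each filter corresponds to a join from the fact table to one dimension table.
    static int unitsSold(String productType, String channel, String state, Set<Month> months) {
        return SALES.stream()
                .filter(f -> PRODUCTS.get(f.productId()).type().equals(productType))
                .filter(f -> CHANNELS.get(f.channelId()).name().equals(channel))
                .filter(f -> GEOGRAPHIES.get(f.geographyId()).state().equals(state))
                .filter(f -> months.contains(MONTHS.get(f.monthId())))
                .mapToInt(SaleFact::units)
                .sum();
    }
}
```

The fact rows stay narrow (keys and a measure), while the descriptive attributes live once in each dimension, which is what keeps this kind of multi-dimensional filter cheap.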

Conclusion

Once again, I recommend reading “Super Crunchers” by Ian Ayres for more real world uses of data mining.

After summarizing the JMS Tutorial, I wrote a number of code samples to try to cover the essentials and experiment to make sure things worked the way I thought.  This included at least the following:

  • Using queues and topics
  • Implementing a durable topic subscriber
  • Using local transactions and experimenting with rolling back
  • Using non-transacted mode and experimenting with what happens when messages are not acknowledged
  • Sending multiple messages with different priorities
  • Sending messages with a timeout and making sure they actually expire
  • Using receive in blocking mode and blocking mode with a timeout (synchronous), and listeners (asynchronous)
  • Playing with persistent and non-persistent modes and restarting ActiveMQ to see how it is handled
  • Having multiple consumers for topics vs. queues and seeing if all or some of them get the messages
  • Playing with the request/reply pattern

It is that last item that took a little longer to wrap my head around as I was mixing up the various destinations.

Invoking a method and getting the return value back is one example of the request/reply pattern.  Invoking a RPC via web services or even old-style CORBA is another example.  All of these are synchronous.  As JMS is inherently asynchronous, you can use it to create asynchronous requests and replies.

One approach would be to set up a queue that the requester can send the request message to, and another queue that the replier can use to send the reply message back to the requester.  However, this does not scale.  What if you need to add another requester?  Another reply queue would need to be set up administratively, and the replier logic would need to be updated to differentiate between the two requesters.

Additionally, what if the one requester sends three requests to the replier before any of the replies come back?  There needs to be some way to correlate the replies back to the original requests.

JMS offers the ability to create temporary queues and topics.  Here is how it works:

  1. The requester creates the temporary queue, and adds the queue info to the request message
  2. The requester sends the request message to the queue used by the replier
  3. The requester waits for incoming reply messages on the temporary queue it created
  4. The replier receives the request message on the normal queue it is listening on, processes it, creates the reply message, and uses the message’s temporary queue information to know what queue to send the reply message to
  5. The replier includes the original id of the request message so that the requester can correlate this reply with a particular request, in case the requester has made other requests to this replier
  6. The replier sends the reply message to the temporary queue
  7. The requester retrieves the reply from the temporary queue, and uses the original request id from the reply to properly correlate the reply with its request
  8. Once done using the system, the requester can then delete the temporary queue

Here is part of the requester code:


// Producer that sends requests to the replier's well-known queue
requester = session.createProducer(replierDestination);
TextMessage requestMessage = session.createTextMessage();
requestMessage.setText("This is the request");
// Temporary queue on which this requester will receive replies
TemporaryQueue temporaryQueue = session.createTemporaryQueue();
MessageConsumer responseConsumer = session.createConsumer(temporaryQueue);
requestMessage.setJMSReplyTo(temporaryQueue);
requester.send(requestMessage);
// Blocks until a reply arrives; its correlation id should match the request's message id
TextMessage reply = (TextMessage) responseConsumer.receive();
String correlationId = reply.getJMSCorrelationID();

Here is part of the replier code:


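// Wait for the next request on the replier's normal queue
TextMessage request = (TextMessage) replier.receive();
// Reply to whatever destination the requester named in JMSReplyTo
producer = session.createProducer(request.getJMSReplyTo());
Message response = session.createTextMessage("processed message: " + request.getText());
// Copy the request's message id so the requester can correlate the reply
response.setJMSCorrelationID(request.getJMSMessageID());
producer.send(response);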
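The same flow can be tried out without a broker. The sketch below is only a simulation using `java.util.concurrent` queues in place of JMS destinations: the `Msg` record and its fields are hypothetical stand-ins for a JMS message, its message id, `JMSCorrelationID`, and `JMSReplyTo`.

```java
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class RequestReplySketch {
    /** Stand-in for a JMS message; replyTo plays the role of JMSReplyTo. */
    record Msg(String id, String correlationId, BlockingQueue<Msg> replyTo, String text) {}

    /** Replier: takes one request and answers on the queue named in the request. */
    static void replyOnce(BlockingQueue<Msg> requestQueue) {
        try {
            Msg request = requestQueue.take();
            // The reply carries the request's id as its correlation id.
            Msg reply = new Msg(UUID.randomUUID().toString(), request.id(), null,
                    "processed message: " + request.text());
            request.replyTo().put(reply);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Requester: sends one request and waits (bounded) for the correlated reply. */
    static String request(BlockingQueue<Msg> requestQueue, String text) {
        BlockingQueue<Msg> temporaryQueue = new LinkedBlockingQueue<>(); // like a temporary queue
        Msg request = new Msg(UUID.randomUUID().toString(), null, temporaryQueue, text);
        try {
            requestQueue.put(request);
            Msg reply = temporaryQueue.poll(5, TimeUnit.SECONDS);
            if (reply == null || !request.id().equals(reply.correlationId())) {
                throw new IllegalStateException("no correlated reply received");
            }
            return reply.text();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reply", e);
        }
    }

    static String demo() {
        BlockingQueue<Msg> requestQueue = new LinkedBlockingQueue<>();
        Thread replier = new Thread(() -> replyOnce(requestQueue));
        replier.setDaemon(true);
        replier.start();
        return request(requestQueue, "This is the request");
    }
}
```

Because each requester creates its own reply queue, adding a second requester needs no change on the replier side, which is the scaling point made earlier.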

My new job uses JMS, finally giving me a good reason to really play with this technology.   After reviewing the coding techniques, I needed a JMS Provider to work against that would be quick and easy to set up – Apache ActiveMQ.  It was quick and easy:

  1. Went to http://activemq.apache.org/activemq-530-release.html and downloaded the Unix version for my Mac.
  2. Unzipped the archive.
  3. Made sure the activemq file under bin/macosx was executable.
  4. From the install directory, executed:  bin/macosx/activemq start
  5. Went to the admin UI to set up topics and queues:  http://localhost:8161/admin/
  6. Stopped the server with:  bin/macosx/activemq stop

How easy was that?!  Now, on to the JMS coding, where I ran into a few speed bumps along the way.

First, if you are going to do JMS development outside of a J2EE container, you need to include jms.jar (which you can download standalone from Sun).

Second, this exception pops up while trying to get my JNDI context:

javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file:  java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:645)

Googling on this, Sun’s forum states that when this happens, it is most likely due to installing the JDK before uninstalling the previous JDK.  On Windows this is not as big of a deal, but uninstalling and then installing the JDK is a little more challenging on the Mac.  Fortunately, a little more research found another workaround:


Properties props = new Properties();
props.setProperty(Context.INITIAL_CONTEXT_FACTORY,"org.apache.activemq.jndi.ActiveMQInitialContextFactory");
props.setProperty(Context.PROVIDER_URL,"tcp://localhost:61616");
jndiContext = new InitialContext(props);

Third, it can’t find a number of the classes it needs.  I included the following three jar files from the ActiveMQ installation:

  • activemq-core-5.3.0.jar
  • commons-logging-1.1.jar
  • geronimo-j2ee-management_1.0_spec-1.0.jar

Fourth, what name should I use for the connection factory for jndiContext.lookup?  “ConnectionFactory” ends up working well.

Fifth, it’s time to set up the topics and queues.  I turned to the Admin UI, where I was able to set up the physical queues and topics but saw no way to set up the JNDI logical queues and topics.  Trying to look up destinations without these did not work.  However, it turns out that ActiveMQ makes it really easy to set up with dynamicQueues and dynamicTopics:


queueDestination = (Destination) jndiContext.lookup("dynamicQueues/SyncQueue");
topicDestination = (Destination) jndiContext.lookup("dynamicTopics/SyncTopic");

After these issues, all the other work I did with JMS was pretty straightforward.
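As an alternative to the dynamicQueues and dynamicTopics prefixes, my understanding is that ActiveMQ's JNDI context factory also accepts `queue.<jndiName>` and `topic.<jndiName>` entries in the environment, each mapping a logical JNDI name to a physical destination name. The sketch below only builds that environment; the mapping of SyncQueue and SyncTopic to identically named physical destinations is an assumption for illustration, and I have not run it against this exact ActiveMQ version.

```java
import java.util.Properties;
import javax.naming.Context;

final class ActiveMqJndiSketch {
    /** Builds a JNDI environment with statically named destinations (assumed names). */
    static Properties jndiEnvironment() {
        Properties props = new Properties();
        props.setProperty(Context.INITIAL_CONTEXT_FACTORY,
                "org.apache.activemq.jndi.ActiveMQInitialContextFactory");
        props.setProperty(Context.PROVIDER_URL, "tcp://localhost:61616");
        // Convention: queue.<jndiName> = <physicalName>, topic.<jndiName> = <physicalName>
        props.setProperty("queue.SyncQueue", "SyncQueue");
        props.setProperty("topic.SyncTopic", "SyncTopic");
        return props;
    }
}
```

With the ActiveMQ jar on the classpath, passing these properties to `new InitialContext(...)` should let `jndiContext.lookup("SyncQueue")` resolve without the dynamicQueues/ prefix.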

Okay, I know that these are not all strictly opposed to each other, but I used to find the combination of options for SOAP messages intimidating.  Why so many options to essentially accomplish the same thing?  The following are the basic options:

  • rpc/literal
  • rpc/encoded
  • document/literal
  • document/encoded

Since encoded is not part of the WS-I (Web Service Interoperability) standard, that just leaves rpc/literal and document/literal.   So, you can just use the former for request/response RPC type calls, and the latter for passing business documents, right?  Wrong!

In actuality, the rpc vs. document is misleading as you can make rpc-style calls with either underlying representation.

The default and most commonly used is document/literal.  The underlying WSDL will be more complicated, but WSDL is not supposed to be for humans (so they say).  The rpc style is limited to very simple XSD types such as String and Integer, and the resulting WSDL will not even have a types section to define and constrain the parameters.

Far more complicated typing is allowed with document.  This is because the document style also comes with an XSD that can be used to validate the incoming SOAP messages; if you use rpc you will not have an XSD to validate against.  However, the rpc style’s SOAP messages are easier to look at and understand.

The wrapped variant (used with document), which is a de facto standard, helps to address this.  It re-arranges the SOAP message some so that it is easier to understand from the programmer’s view and looks more like rpc.  It clearly identifies the service operations and the names of each parameter.  The downside is that the client code you have to write is a little more complicated to put together.

In the end, document/literal/wrapped is the default in Java’s web services.

I have my Java web service up and running, and I am deploying it to Tomcat.  However, when Tomcat starts up, the deployment fails with the following exception:

java.lang.ClassNotFoundException: com.sun.xml.ws.transport.http.servlet.WSServletContextListener

I am told that Tomcat should have everything it needs, but that is clearly referring to Tomcat 5 and not Tomcat 6.

According to Techie Gyan, the following is the problem:

“The second change which tells about the cause of the error above is that tomcat 6.0 supports JAX-WS 2.1 and not JAX-WS 2.0 and java 6 supports 2.0 only (till some version, now it started supporting the newer version as well).”

My approach to fixing the problem is different, but it works:

  1. Go to https://jax-ws.dev.java.net/
  2. Download the latest version (2.2 at the time of this writing)
  3. Unzip the file and place the jars in Tomcat’s lib directory
  4. Restart Tomcat

Now I have no issues accessing my web service from Tomcat.

In my last blog on this, I discussed some of the Capacity best practices.  I am now going to briefly touch on the remainder of the book:

  • Networking. The section talks about best practices in a data center.  This includes different networks for different functions (e.g., production, admin) and different NIC cards on different machines to segment traffic.  It also discusses the usage of Virtual IPs.
  • Security.  Here, the discussion centers on the principle of “Least Privilege”.  Strategies for not having to run things as root are discussed, as well as how to deal with outbound passwords so that they cannot be compromised.
  • Availability. He begins with a strategy on how to document required availability to avoid ambiguity.  Strategies for load balancing and clustering are discussed, as well as how to appropriately use reverse proxies (looks like I need to play with Squid).
  • Administration. There are a number of strategies discussed here that will make administrators’ lives a lot easier.  This includes making the QA environment more closely resemble the production environment.  I particularly liked his discussion around zero, one or many.  If you have 20 servers in a farm, it makes a big difference to have more than one in QA (though 20 isn’t needed) in terms of issues discovered.  He discusses strategies for dealing with configuration files, and how to facilitate clean start-ups and shut-downs.
  • Design Summary. This chapter delves into a number of general design considerations for production, including making your application as easy to operate as possible.
  • Transparency. This book goes deep into strategies to reveal as much as possible about the internal operations of the servers and the system as a whole.   After going through some of his Black Friday failure stories, I would want to have as many of these as possible!

I appreciate good computer science books.  This is, seriously, one of the best books I have ever read.  It has made a HUGE difference in my understanding and capabilities in this area.  I strongly recommend it.

In my last blog I discussed some of the capacity anti-patterns of Michael Nygard’s excellent but tragically misnamed book “Release It!”  I finished reading the patterns and best practices portion, some of which are discussed below:

  • Pool Connections. If bad connections are not removed from the pool, they will be given to proportionally more threads since they finish more quickly.  You also need to consider your strategy for checking out resources from a pool.  If you go per-page, you will have better protection against deadlock in multi-pool scenarios, but you will need a larger pool as these connections will be held longer.  You also need to monitor calls to your connection pools to see how long threads are waiting.
  • Use Caching Carefully. You need to be able to set the maximum memory usage for your cache.  You should also be monitoring your hit rates to make sure the cache is buying you something.  Don’t cache things which are cheap to regenerate, and use soft references where possible for when memory gets tight.
  • Precompute Content. There is no need to recompute content that never changes or changes infrequently, yet this happens all the time.  This is a great area to explore for increasing capacity.
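The caching advice above (bound the cache, watch the hit rate) can be sketched with a `LinkedHashMap` in access order. This is only an illustration, not the book's code: it caps the number of entries as a simple stand-in for a memory limit, the class and method names are made up, and it does not use soft references.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class BoundedCache<K, V> {
    private final Map<K, V> entries;
    private long hits;
    private long misses;

    BoundedCache(int maxEntries) {
        // accessOrder=true keeps the least-recently-used entry first, so it is evicted first.
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, computing and storing it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key);
        entries.put(key, value);
        return value;
    }

    /** Fraction of lookups served from the cache; worth monitoring in production. */
    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    synchronized int entryCount() {
        return entries.size();
    }
}
```

If `hitRate()` stays low, the cache is costing memory without buying much, which is the signal the bullet above suggests watching for.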

An excellent resource in Rails is the “RailsLab: Scaling Rails” podcast I downloaded on iTunes.  I have watched it before, but after reading this book, I will likely watch it again and start playing with what they suggest.

In the next blog, I will summarize the remainder of the book.  However, I will likely hold off on publishing that until after the holidays.

I finished Michael Nygard’s “Release It!”.  Having discussed some of the anti-patterns and best practices for stability earlier, I am going to discuss capacity anti-patterns (some of them anyway – buy the book!):

  • Resource Pool Contention. When the ratio of threads to resources in a pool is out of balance, threads waste more and more time in contention for the resource.  On the flip side, if you have an app server farm with large database connection pools, the number of open sessions on the database server may simply overwhelm it.
  • AJAX Overkill. AJAX can improve your user experience, but it does have a cost associated with it.  The requests that will be hitting your server will be more frequent.  Some of the automated polling that is done can overwhelm your server in high usage situations (effectively limiting capacity).  The proper balance needs to be found.
  • Overstaying Sessions. The Java default is 30 minutes, which is often way too long (the book shows normal times for things like retail sites vs. travel sites).  These unused sessions take up memory and other app server resources, limiting your capacity.  Thus, you want to get rid of old sessions as soon as possible without visibly impacting your users. Your sessions should be small, only hold non-transient data, use keys instead of values where possible, and leverage soft references where supported by the language.  In the end, that means more potential sessions and a capacity constraint elsewhere.
  • Excessive White Space. These seem innocuous as browsers ignore extra white space in HTML, right?  You are using more CPU cycles, more bandwidth, and more RAM to serve those pages up – put them on a diet.
  • Data Eutrophication. Your site may be zippy at first, but if you don’t have a strategy for expunging old data, your site will get slower and slower.  It is certainly a bad practice to be doing data mining against the production database.  Lack of partitioning as well as lack of indexes for ORM relationships can also have a big impact.
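One way to see pool contention before it hurts is to measure how long callers wait for a resource. The sketch below is a hypothetical helper, not something from the book: it guards a fixed number of permits with a `Semaphore`, bounds the wait, and records total wait time and timeouts.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class MonitoredPool {
    private final Semaphore permits;
    private final AtomicLong totalWaitNanos = new AtomicLong();
    private final AtomicLong timeouts = new AtomicLong();

    MonitoredPool(int size) {
        this.permits = new Semaphore(size, true); // fair: longest waiter goes first
    }

    /** Tries to check out a resource, waiting at most timeoutMillis. */
    boolean acquire(long timeoutMillis) {
        long start = System.nanoTime();
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        totalWaitNanos.addAndGet(System.nanoTime() - start);
        if (!acquired) {
            timeouts.incrementAndGet();
        }
        return acquired;
    }

    /** Returns a resource to the pool; call only after a successful acquire. */
    void release() {
        permits.release();
    }

    long totalWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalWaitNanos.get());
    }

    long timeouts() {
        return timeouts.get();
    }
}
```

A rising `totalWaitMillis()` or `timeouts()` count suggests the ratio of threads to pooled resources is out of balance.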

In the next post, I will follow up on the capacity best practices.

In my last blog I discussed some of the stability anti-patterns of Michael Nygard’s excellent but tragically misnamed book “Release It!”  I finished reading the patterns and best practices portion, discussed below:

  • Timeouts.  Traditionally a thread blocks until the resource is ready, but this can be destabilizing if the resource is never ready.  Thus, a thread needs to be able to time out.  A mechanism for delayed retries should also be set up.
  • Circuit Breakers.  The author gives an example of early real-world circuit breakers where with too many items plugged in (resulting in too much current and heat), the circuit breaker breaks the circuit instead of the house burning down.  He introduces a logical circuit breaker where too many failures breaks the circuit, rejecting the next series of requests until the system is ready to try again (see the book for the actual algorithm).
  • Bulkheads.  A ship’s hull is divided into different watertight bulkheads so that if the hull is compromised, the failure is limited to that bulkhead as opposed to taking the entire ship down.  By partitioning your system, you can confine errors to one area as opposed to taking the entire system down.  These partitions can be hardware redundancy, binding certain processes to certain CPUs, segmenting different areas of business functionality to different server farms, or partitioning threads into different thread groups for different functionality.  I remember how a previous company’s server would run out of threads, and there were no other threads dedicated to admin functionality to allow an admin to interact with it.
  • Handshaking.  By asking a component if it can take on more before asking it to perform that work, the system has a way to introduce throttling.  If the component is too busy, it can tell the clients to back off until it is able to handle more requests.
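To get a feel for the logical circuit breaker, here is a deliberately simplified sketch. It is not the algorithm from the book: the threshold, the retry delay, and all names are assumptions. After a number of consecutive failures it rejects calls outright until the delay has passed, then lets one trial call through.

```java
import java.util.function.Supplier;

final class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long retryAfterNanos;
    private int consecutiveFailures;
    private long openedAtNanos;
    private boolean open;

    CircuitBreakerSketch(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterNanos = retryAfterMillis * 1_000_000L;
    }

    /** Runs the operation unless the circuit is open and the retry delay has not elapsed. */
    synchronized <T> T call(Supplier<T> operation) {
        if (open && System.nanoTime() - openedAtNanos < retryAfterNanos) {
            throw new IllegalStateException("circuit open: call rejected");
        }
        try {
            T result = operation.get();  // circuit closed, or a trial call after the delay
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;             // trip (or re-trip) the breaker
                openedAtNanos = System.nanoTime();
            }
            throw e;
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

While the circuit is open, callers fail fast instead of tying up threads on a resource that is known to be down.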

I am now moving on to reading the Capacity anti-patterns and patterns and will blog on that later.  Once again, I STRONGLY  recommend purchasing and reading this book!

I am 110 pages into Michael Nygard’s “Release It!”.  I ignored this book earlier because I pictured some namby, pamby book on release best practices, but this book is absolutely fantastic and fascinating to read!

The book focuses on anti-patterns then best practices for stability then capacity in production systems (typically focusing on web commerce).  After that, it delves into system level design best practices.  I am currently up to stability anti-patterns.

Once again, I am tempted to list all of them, but that seems unfair to the author so I will cover only a few.  I will follow-up on this post as I get further into this book (352 pages).  Here are some:

  • Integration Points. This is the number one point of failure as engineers code the interactions expecting the systems to always be up, and when they are not, the resulting cascading failures bring the entire system down.  Making an HTTP request without a specified timeout?  Not a good idea when all your threads block waiting for a request that will never return and new requests are starved.  A properly structured system will expect these types of failures and handle them appropriately to protect the other layers.
  • Third Party Libraries. These little black boxes of functionality are not as robust as you need them to be, and are often not as configurable as they need to be to handle failure scenarios.  This can be a huge weakness in your production system.
  • Scaling Effects. Your system works well, but as you begin to scale certain aspects of it out horizontally, the other elements are not matched appropriately and then crash.  This is particularly true with shared resources.
  • Unbalanced Capabilities. Your overall system has a small percentage of its resources appropriately dedicated to handling home installation for your eCommerce web-site (e.g., like Best Buy).  But your marketing department commits an Attack of Self Denial by offering a free installation promotion without coordinating with production.  Suddenly, the home installation component cannot handle all the requests, and the cascading failures bring down your site, which can normally handle the load under the expected models (even including Black Monday).
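The integration point advice boils down to never waiting forever on a remote call. As a minimal sketch (the helper name and the fallback approach are my own assumptions, not the book's), a call can be run on an executor and abandoned after a deadline:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCall {
    /** Runs remoteCall, returning fallback if it fails or exceeds timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the stuck call
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Creating an executor per call keeps the sketch short; a real system would more likely share a bounded pool and use the timeout settings of its HTTP or database client directly.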

All in all, this book is AWESOME!  I hope to play with some of the patterns and anti-patterns in Java and Ruby and report back on that also.  I will follow up by discussing Stability Best Practices in my next blog.
