Data Technologies Archives - Anuj Varma, Hands-On Technology Architect, Clean Air Activist

Physical Database Design and Tuning – Oracle or SQL Server

Anuj Varma — Wed, 24 Mar 2021 19:25:31 +0000

Troubleshooting Database Performance – 3 Broad Categories

Physical Database Design
Query Statement Tuning
DB Configuration

Physical Database Design

INDEXING – Look for fragmentation. If it is significant, try reorganizing or rebuilding the indices.
Filegroups (File Placement, Object Placement) – .ldf (log files) and .mdf (data files) on separate drives – SQL Server
Partitioning – Horizontal or vertical (columnar).
Denormalization

RE-INDEXING Notes – Look for fragmentation. A common rule of thumb is to reorganize at roughly 5–30% fragmentation and rebuild above roughly 30%. DBCC INDEXDEFRAG allows the database to stay online. DBCC DBREINDEX also rebuilds statistics and works with constraints on indices. Both commands are deprecated in current SQL Server versions in favor of ALTER INDEX REORGANIZE and ALTER INDEX REBUILD.
FILEGROUPS Notes – SYSTEM objects in the primary filegroup, USER objects in a separate filegroup, TRANSACTION log on a separate volume (lessens I/O load). Also, TEXT/IMAGE data is best in a separate filegroup.

Query Statement Tuning

Look primarily for full table scans and nasty joins.
Find queries that have a high execution count (run frequently) – e.g. SELECT execution_count, total_physical_reads, total_logical_reads FROM sys.dm_exec_query_stats ORDER BY execution_count DESC
Subqueries vs Joins – Both can often produce the same result, so compare their execution plans for efficiency. A subquery tends to be the better fit only when an aggregate is calculated and fed back on the fly. A JOIN is better when columns from different tables are needed.
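
A minimal sketch of the subquery-versus-join comparison, assuming SQLite (via Python's standard sqlite3 module) as a stand-in for Oracle or SQL Server. The customers and orders tables are hypothetical, and the plan output format is SQLite's, not SQL Server's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Subquery: an aggregate computed per customer and fed back into each row.
subquery_sql = """
    SELECT c.name,
           (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
    FROM customers c
    ORDER BY c.name
"""

# Join: columns from both tables are combined, then grouped.
join_sql = """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

# Same results; the plans are what you compare for efficiency.
subquery_plan = conn.execute("EXPLAIN QUERY PLAN " + subquery_sql).fetchall()
join_plan = conn.execute("EXPLAIN QUERY PLAN " + join_sql).fetchall()
```

On a real system you would read the engine's own execution plan instead; the point here is only that two equivalent statements should be judged by their plans, not by their shape.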

Physical Reads vs Logical Reads

A physical read should only happen when the requested data is not already in the buffer cache; a read served from the cache is a logical read. A high number of physical reads is therefore a symptom worth investigating.

Truncate vs. Shrink – Reduce Log Sizes

Truncate and shrink a full transaction log. Truncation only frees space inside the log file for reuse; SHRINK is what actually reduces the file size.

Indices – Clustered vs. Non Clustered

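Clustered seeks are usually fastest, because the leaf level of a clustered index holds the row itself. The exception is a non-clustered index that covers the query (it contains every column the query needs), in which case the non-clustered index could be faster. INSERTS and UPDATES are often faster when a table has only a clustered index, since every additional non-clustered index must also be maintained.

Clustering – Active Active vs. Active Passive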

Primary Index vs. Unique Index

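In SQL Server, a primary key creates a clustered index by default, unless the table already has one or NONCLUSTERED is specified. A unique constraint creates a non-clustered index by default.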

Buffer Pool vs. Buffer (Data) Cache Hit Ratio

Overall Process Space – Buffer Pool in SQL Server

Memory used for data cache – Data Cache. The hit ratio here is important (you can obtain it from the Windows performance counter SQLServer:Buffer Manager – Buffer cache hit ratio)
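
As a rough illustration of what the hit ratio expresses, here is a simplified approximation built from logical and physical read counts. The function name and the sample numbers are made up for this sketch, and it is not the exact formula behind the performance counter:

```python
def buffer_cache_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Approximate fraction of page requests served from memory.

    Assumes logical_reads counts every page request and physical_reads
    counts the subset that had to go to disk.
    """
    if logical_reads <= 0:
        raise ValueError("logical_reads must be positive")
    return (logical_reads - physical_reads) / logical_reads


# Hypothetical sample values: 10,000 page requests, 250 of them hit disk.
ratio = buffer_cache_hit_ratio(logical_reads=10_000, physical_reads=250)
print(f"{ratio:.2%}")  # prints 97.50%
```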

DB Configuration – Recovery Model

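Simple – Restore only to the most recent backup
Full – Regular log backups allow restore up to the point of failure
Bulk Logged – Like Full, but bulk operations are minimally logged, so point-in-time restore is not available for log backups that contain them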

Potential Data Type Mismatches (Oracle to SQL Server )

bfile -> not available in SQL Server
nclob -> ntext (deprecated; nvarchar(max) in current versions)
raw -> varbinary

Special Data Types – Spatial Data Types

Need special treatment – user-defined types (UDTs)
e.g. Geometry and Geography. STGeomFromText(‘LINESTRING(…..)

Summary

This is meant to be a quick recap of the first places to look for tuning your database performance.


The post Physical Database Design and Tuning – Oracle or SQL Server appeared first on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.

Couchbase vs DynamoDB

Anuj Varma — Sun, 28 Jul 2019 16:52:24 +0000

Couchbase Advantages

Run on almost any cloud platform – including AWS
Avoid DynamoDB’s item-size restrictions
Speed up performance with in-memory processing and built-in caching
Use your team’s existing SQL skills for writing complex queries
Cut license and support costs by up to 50% compared to DynamoDB


MariaDB auto update statistics

Anuj Varma — Tue, 19 Jun 2018 21:41:00 +0000

To check whether automatic statistics updates are enabled, try this command:

show variables like 'innodb_stats%';

If the output includes:

innodb_stats_auto_recalc set to ON (1), you're all set.

If the output includes:

 innodb_stats_on_metadata | ON    |

it means that statistics get updated whenever metadata on the table is requested, which is typically enough. But you may still need to set the first variable:

innodb_stats_auto_recalc = 1

Here is some more info on this topic – https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_stats_auto_recalc



Types of Non Relational Data

Anuj Varma — Sun, 04 Feb 2018 18:10:00 +0000


The BigData Landscape

Anuj Varma — Thu, 10 Aug 2017 14:18:55 +0000

This is a Work in Progress…

Pre Processing of Data (Un-Structured)

MapReduce

Pre Processing of Data (Structured or Semi-Structured)

Pig
Hive
Hadoop (see below)

Statistical Analysis (After Pre-Processing)

R is used for statistical analysis, which happens after the data has been pre-processed. However, R normally holds its data in memory, so the size of the data it can work with is limited.

Hadoop

Covers both data storage (HDFS) and data processing (MapReduce) at massive scale.
Pig and Hive are tools that belong to the Hadoop ecosystem.


Running out of disk space? Sharding

Anuj Varma — Tue, 20 Jun 2017 21:18:17 +0000

What is automatic sharding?

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards.
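
One common way to decide which shard a record lands on is to hash its key. The sketch below assumes four shards with made-up names and a hypothetical shard_for helper; real systems that shard automatically handle this routing for you:

```python
import hashlib

# Hypothetical shard names; in practice these would be separate servers.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a record key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Group a few sample keys by the shard they would be stored on.
placement = {}
for user_id in ("user-1", "user-2", "user-3", "user-42"):
    placement.setdefault(shard_for(user_id), []).append(user_id)
```

Note that with plain modulo hashing, changing the number of shards moves most keys to a different shard; consistent hashing is the usual way to limit that.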


Data warehouse versus data marts

Anuj Varma — Tue, 03 Jan 2017 16:41:42 +0000

Most data warehousing initiatives fail, mainly because the level of standardization they require slows an agency or company down enough that the project gets derailed (the ‘boiling the ocean’ phenomenon).

Rather than building a data warehouse right away, approach it in a slightly different manner.

Build individual data marts instead; each department gets to own its own data mart. These individual data marts would still follow a common standardized technical architecture; and would be able to talk to each other.

For example, definitions and metadata in each data mart should follow the same convention.

This paves the way for a final data warehouse – which could simply be a loosely coupled conglomerate of these independent data marts.


Multiple copies of data–and editability

Anuj Varma — Wed, 30 Nov 2016 19:00:23 +0000

A book has an author – an author has multiple books. A book would be modeled as a document in NoSQL – as would an author. So we end up with two documents – Book and Author. Each book has a unique title as well as an Author associated with it. This book-author association is part of each Book document.

Case 1 – Changing an Attribute that belongs on multiple documents

Now, suppose the Book title changes – let us say there is a new edition of a textbook. Do you have to update thousands of occurrences of the Book document with the new title? The answer is – yes. However, with a little design change, you can avoid the multiple updates. Instead of having Title as an attribute of the book, suppose you separate Title into its own document (say a document called BookMetaData). Now, each book just has a BookMetaData ID associated with it. If the title of the book changes, one simply needs to update it in the BookMetaData document – and all the associated books will automatically pick up the change.
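
A minimal sketch of the BookMetaData idea, using plain Python dictionaries as stand-ins for document collections. The IDs, field names and helper functions are hypothetical and not tied to any particular NoSQL product:

```python
# Hypothetical in-memory stand-ins for two document collections.
book_metadata = {
    "meta-1": {"title": "Intro to Databases, 1st Edition"},
}
books = {
    "book-1": {"metadata_id": "meta-1", "author": "A. Author", "copy": 1},
    "book-2": {"metadata_id": "meta-1", "author": "A. Author", "copy": 2},
}


def title_of(book_id: str) -> str:
    """Resolve a book's title through its BookMetaData reference."""
    book = books[book_id]
    return book_metadata[book["metadata_id"]]["title"]


def rename(metadata_id: str, new_title: str) -> None:
    """Change the title in exactly one place."""
    book_metadata[metadata_id]["title"] = new_title


rename("meta-1", "Intro to Databases, 2nd Edition")
titles = [title_of(book_id) for book_id in sorted(books)]
# Every book now reports the 2nd Edition title after a single update.
```

The trade-off is that every read of a title now needs a second lookup, which is the price of keeping a single copy.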

Couchbase’s alternative to handle multiple document updates

Couchbase offers something of a shortcut – using a view collation. With collated views, you can have a single query spanning all the documents that you might need. With views, Couchbase Server allows one to keep a single canonical source of an item of data while having it show up in many different places.

NoSQL’s mantra – Denormalize, Denormalize !

The relational data model rigidly ties one to database schemas. One resorts to normalization of data and performs joins to perform complex queries. More recently though, changes in application characteristics have led application developers to non-relational database technologies. One can view distributed document database technology as a natural successor to relational database technology:

It effortlessly scales across virtual machines or cloud instances.
It doesn’t tie you to a rigid schema before inserting data, nor does it require a schema change when different data must be captured and processed.
Its rich data model and view technology allow for complex data modeling, capture and queries.

Summary

One of the most frequent criticisms of NoSQL is that updates of any document element suck! Essentially, an update could require thousands of documents to be updated at once. However, this limitation is easily overcome with a slightly modified design – separating out the ‘frequently updated’ info into its own document.


Tracking relationships in NOSQL

Anuj Varma — Tue, 09 Aug 2016 16:06:22 +0000

In NoSQL, there is no built-in way (such as a foreign key and a join) to ‘relate’ a post with its comments.

So, what do you do?

Well – you essentially store the postId and the commentId for EACH comment (i.e., you store post1,comment1, post1,comment2… and so on)

This storage will work – but will be optimized for one type of query (all comments for a given post)
If you have another type of search (say, all Users who commented on this article), you are screwed. You did not store the userId along with the commentId – so again, you will be back to the drawing board.

However, if all you really care about is getting all comments on a post (first type of query), you are not only set, you will have noticeably faster retrieval times (compared to the relational model), especially as the data set gets larger and larger.
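
The trade-off can be sketched with plain Python dictionaries standing in for a key-value or document store. All names here are hypothetical; the point is that each query you want to be fast needs its own structure, written at the time the comment is stored:

```python
from collections import defaultdict

# One structure per query you want to serve quickly.
comments_by_post = defaultdict(list)  # post_id -> [comment_id, ...]
users_by_post = defaultdict(set)      # post_id -> {user_id, ...}


def add_comment(post_id: str, comment_id: str, user_id: str) -> None:
    """Store the comment under every key a later query will need."""
    comments_by_post[post_id].append(comment_id)
    users_by_post[post_id].add(user_id)


add_comment("post1", "comment1", "alice")
add_comment("post1", "comment2", "bob")
add_comment("post2", "comment3", "alice")

# Both lookups are a single key access, with no join and no scan.
post1_comments = comments_by_post["post1"]
post1_users = sorted(users_by_post["post1"])
```

If users_by_post had not been maintained from the start, answering the second question would mean reprocessing every stored comment.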


NoSQL and data integrity

Anuj Varma — Tue, 14 Jun 2016 20:41:33 +0000

Redundant Data Storage

NoSQL stores many-to-many relationships in the same way that de-normalized tables do – by storing them redundantly. Since you do not base your NoSQL design on relationships between data, your database design is driven by the types of queries that will run against it.

You would use the same design methodology here that you would use to denormalize a relational database: if query performance is of utmost importance, you would flatten (de-normalize) your database – to accommodate the query in question.

This optimizes your tables for one type of query at the expense of other types of queries. If your application needs both types of queries to be equally optimized, you would be better off neither de-normalizing nor moving to NoSQL.

Integrity Violation with DeNormalized Data

There is a risk with denormalization – that data (or entire sets of data) will get out of sync with one another. This is called an integrity violation or a data anomaly. A normalized relational database (RDBMS) is DESIGNED to prevent such integrity violations.

How does NoSQL PREVENT integrity violations?

In a denormalized database and in NoSQL, it becomes the programmer’s responsibility to write application code to prevent integrity violations.
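
Here is a rough sketch of what that application code might look like, assuming an author's name is stored redundantly inside each book document. The collections and helper names are hypothetical, and a real store would also need the multi-document update to be made atomic or retried on failure:

```python
# Hypothetical collections: the author's name is copied into each book.
authors = {"a1": {"name": "Jane Doe"}}
books = {
    "b1": {"title": "Book One", "author_id": "a1", "author_name": "Jane Doe"},
    "b2": {"title": "Book Two", "author_id": "a1", "author_name": "Jane Doe"},
}


def rename_author(author_id: str, new_name: str) -> None:
    """Update the canonical name and every redundant copy together."""
    authors[author_id]["name"] = new_name
    for book in books.values():
        if book["author_id"] == author_id:
            book["author_name"] = new_name


def find_violations() -> list:
    """Return IDs of books whose copied name no longer matches the author."""
    return [
        book_id
        for book_id, book in books.items()
        if book["author_name"] != authors[book["author_id"]]["name"]
    ]


rename_author("a1", "Jane Q. Doe")
violations_after_rename = find_violations()

# Writing to one copy directly, bypassing the helper, creates the anomaly.
authors["a1"]["name"] = "J. Doe"
violations_after_bypass = sorted(find_violations())
```

A periodic check like find_violations is one way to detect drift that the write path failed to prevent.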

Summary
