How Should the Analytics Team Access Data?
The company has a production database with juicy data. Now what?
You are probably the first data person at a growing company. The team is doing an excellent job, people love the product and the future looks good. But everybody is kind of stumbling in the dark, although they suspect that the answers are in the data. This is why you are here! There are so many questions to be asked and so many ways to do better. What aspect of the product to focus on next? How to get more traction and more users? How to make existing users happier? You can probably name dozens of very specific, interesting things to find out off the top of your head.
The devil is in the details. What you want to do is clear: dive into the data. But how do you do it right? In the long run, you intend to build a data warehouse, but that’s months down the line. Small steps are in order for now. There is plenty of low-hanging fruit to be harvested with simple SQL/database queries. Everything you need for impressive results is in the production database. What’s the best way to access it?
Things can go horribly wrong while handling data, so you don’t want to be able to interfere with the production setup. Maybe one of your scripts will completely overload the database one day, and everything will come to a grinding halt. Or there will be unintended data corruption. If your company happens to use something like MongoDB, accidental deletion of a whole collection is only a typo away. Stuff happens. How do you reduce the risk of things going terribly wrong, without living in constant fear and triple-checking every query?
You should have direct access to the company’s production data, but not to the business-critical database itself. The safest bet is to set up a second database, which maintains a copy of all the production data. This is usually called something along the lines of a ‘copy slave’ in a master-slave setup, a ‘read replica’, or ‘read-only replication’. All your queries and scripts should only access this harmless copy of the data, and if things go wrong, the copy can be fixed or rebuilt without touching production. This way, nothing related to your analytics effort is able to disrupt the business and you have complete access to ALL of the data. Happy crunching!
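To make the idea concrete, here is a minimal sketch of what "read-only access to a copy" looks like from an analyst's script. It is only an illustration: it uses SQLite from the Python standard library as a stand-in for the copy, whereas a real setup would point at a replica of whatever database the company runs. The `users` table, its columns, and all function names are made up for this example.

```python
import os
import sqlite3
import tempfile


def make_demo_copy(path):
    """Create a tiny stand-in for the copied data (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
    conn.executemany(
        "INSERT INTO users (plan) VALUES (?)",
        [("free",), ("pro",), ("free",)],
    )
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open the copy so that every write on this connection is rejected."""
    # mode=ro is SQLite's read-only flag; it requires uri=True.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def users_per_plan(conn):
    """A typical harmless analytics query: count users by plan."""
    rows = conn.execute(
        "SELECT plan, COUNT(*) FROM users GROUP BY plan ORDER BY plan"
    ).fetchall()
    return dict(rows)


def try_delete(conn):
    """Simulate the dreaded typo; return True only if the delete went through."""
    try:
        conn.execute("DELETE FROM users")
        return True
    except sqlite3.OperationalError:
        # SQLite refuses the write on a read-only connection.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "analytics_copy.db")
        make_demo_copy(db_path)
        conn = open_read_only(db_path)
        print(users_per_plan(conn))  # {'free': 2, 'pro': 1}
        print(try_delete(conn))      # False
        conn.close()
```

The point of the sketch is the division of labour: the analytics code can read everything, but the connection itself refuses to change anything, so a typo costs you an error message instead of a table. With a real replica you get the same protection one level up, because even a runaway query only slows down the copy.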