Interviewing for a Database Developer

I work for a firm that’s heavily invested in SQL – a team that needs to have developers who know their way around relational databases and MySQL in particular. I want to show you how I run interviews for our development positions.

Method

Everybody has their own methods and opinions on how to conduct technical interviews. I’ve found that I generally dislike interviews that focus either on whiteboard puzzles or obscure technical details, since they don’t really show how well the candidate is at what really matters: building functioning, quality apps. I really like running the interview like we’re talking about the design for a new product. I want to figure out the requirements, mull over the data model, and write some simple queries to make sure we can show the data we need to.

This process should show two things: the candidate has a good enough grip on the MySQL database that they can comfortably build a system on top of that platform, and that they can work with others when they have difficulties. If they don’t ask for clarification for requirements and talk through their thinking, it’s a warning sign. Syntax and data types can be looked up, normalization can be taught, but humility and enthusiasm cannot. The benefit to having a discussion vs. conducting a verbal short response test is that it can’t be memorized or faked. You’re having a conversation, not giving a test.

An interview is a high-pressure meeting. You as the interviewer might be spending 40 hours a week for the next year with this person. You both have to decide if that’s a good thing, and you have maybe an hour or 45 minutes to do that. Help them relax! Tell jokes, even if they’re silly. Smile and look them in the eye. Make small talk and ask them what they like to do for a hobby. This is why putting interests on your resume is useful: it creates a great jumping off point. When I talk about my favorite things, such as Hawaii or The Iliad, it creates a comfort zone that makes conversations like this much less scarier.

Summary: try to make the interview seem like a design session among colleagues and include friendly conversational asides.

An annotated example

I like to begin with data modeling, so we have some tables to work with. This is the beginning of the simulated design meeting. I describe an app and ask the candidate to create tables for it. Some example apps are a Netflix clone, a simple social network, or an e-commerce app. Let’s run through this example with the backend software for a car rental agency, like Hertz or Avis. I say, I want to sign a customer up, assign them a car, allow them to return it, and then present them a bill.

Data modeling

The first step is just creating the tables we’re dealing with. I like to see how the candidate figures out what to create. If all goes well, we’ll come up with something like this:

Create Table Customers
(
     CustomerID Int Primary Key Auto_Increment
    ,FirstName Varchar(50) Not Null
    ,LastName Varchar(50) Not Null
    ,Email Varchar(100) Not Null
    ,Address Varchar(50) Not Null
    ,City Varchar(50) Not Null
    ,State Char(2) Not Null
    ,Zip Varchar(10) Not Null
);
Create Table Rentals
(
     CustomerID Int Not Null
    ,CarID Int Not Null
    ,StartDate DateTime Not Null
    ,EndDate DateTime Null
    ,PricePerDay Decimal(10,2) Not Null
);
Create Table Cars
(
     CarID Int Primary Key Auto_Increment
    ,CarTypeID Int Not Null
    ,PurchasedDate DateTime Not Null
    ,Constraint Cars_CarTypeID_Fk Foreign Key (CarTypeID) References CarTypes(CarTypeID)
);
Create Table CarTypes
(
     CarTypeID Int Primary Key Auto_Increment
    ,Make Varchar(50) Not Null
    ,Model Varchar(50) Not Null
    ,Year Datetime Not Null
    ,Type Varchar(50) Not Null
    ,BasePricePerDay Decimal(10,2) Not Null
);

The candidate and you should go over this design, and explain why it was designed this way. I’ll go over my choices with this design. I normalize the data structures by pulling out the model and model information into a separate table. Prices are always in decimal to avoid rounding errors with floating point numbers. Year on CarTypes is actually a datetime since we may want to perform queries such as finding all cars older than 2010 model year or getting the average age of the cars in the fleet.

CRUD

After we’ve created our model, we can now run through CRUD – that is, questions about how you do create, read, update, delete statements. This is 90% of queries within an application, so it’s pretty important a candidate understands the basics. In SQL terms, this is how to do INSERT, SELECT, UPDATE, and DELETE statements.

Run through Insert, Select, Update, and Delete:

Insert Into CarTypes (Make, Model, Year, Type, BasePricePerDay)
Values ('Honda', 'CRV', '2013-01-01', 'Sedan', 35.00);

Select Make, Model From CarTypes;

Update CarTypes Set Model = 'CR-V' Where Model = 'CRV';

Delete From CarTypes Where Make = 'Honda' And Model = 'CR-V' And Year = '2013-01-01';

Joins

The next level is joins. This is where SQL gets complicated, and shows if a developer has done anything complex with the language. Joining is how you represent complex data and answer complex questions. Ask them how you would find every Honda and when it was purchased. The query would look like this:

Select Make, Model, PurchasedDate
From Cars C
Join CarTypes CT On CT.CarTypeID = C.CarTypeID
Where
Make = 'Honda'

Ask them about the difference between an inner join and an outer join (right inner join is default). As a brief review, an outer join will return rows from the first table even if there are no rows in the second, while an inner join will return rows if and only if a row from the first table matches the second table. Ask them to build a query that would show all Honda models, even if the car rental company doesn’t own one.

Select Make, Model, PurchasedDate
From CarTypes CT
Left Join Cars C On C.CarTypeID = CT.CarTypeID
Where
Make = 'Honda';

Set thinking

It may seem simple, but developers often don’t understand set thinking: in SQL, you are always dealing with a collection of records, even if you only modify or insert one row. Many programming languages and libraries have added the ability to operate on collections in code just like you would with SQL, such as LINQ in C# or Underscore in JavaScript, so the ability to reason about a collection is important for all developers, regardless if they are using a database.

The opposite of this idea is iteration or row-by-row thinking. Although plain SQL doesn’t have looping, developers are tempted to read from one table at a time and perform “the join” in the code, row by row. Looping is perfectly acceptable in application code, but it’s generally wrong when using SQL. I want to see if the candidates can operate in both paradigms.

Ask them to send an e-mail to every customer who has rented a Honda. When finished, the SQL and pseudo-code will look like:

// Get all the customers
$sql = "
Select Distinct
 FirstName
,LastName
,Email
From Customers C
Join Rentals R On R.CustomerID = C.CustomerID
Join Cars CS On CS.CarID = R.CarID
Join CarTypes CT On CT.CarTypeID = CS.CarTypeID
Where
Make = 'Honda'
And
R.EndDate Is Not Null
";

$rows = db_query_with_results($sql);

foreach ($rows as $customer)
{
    send_email($customer)
}

You don't want something that looks like this:

// Get all the Honda cars
$honda_models_sql = "
Select
CarTypeID
From CarTypes CT
Where
Make = 'Honda'
";

$honda_models = db_query_with_results($honda_models_sql);

$honda_cars = array();

foreach ($honda_models as $honda_model)
{
$honda_cars_sql = "
Select
CarID
From Cars C
Where
CarTypeID = ?
";

$honda_models = db_query_with_results($honda_models_sql, $honda_model['CarTypeID']);

}

…

See what the problem is? That’s just a disaster: you have to hit the database for every model and every car that’s a Honda, and then combine that using loops. I’ve seen both production code and interview candidates use this type of solution.

Aggregate Queries

Often, we’re crunched for time in a interview, so I’ll combine questions about joins with aggregate queries. I just want to see if the candidate can think about aggregation (this is a continuation of set-thinking).

Ask them to write a query to get the number of cars that each customer has rented. They’ll often start with something like this:

Select FirstName, LastName, Count(*) RentedCars
From Customers C
Join Rentals R On R.CustomerID = C.CustomerID
Where
R.EndDate Is Not Null
Group By
FirstName, LastName

There’s an error in this query, and so I’ll ask, what’s wrong with this query? What about those customers who’ve signed-up but never rented a car?

Select FirstName, LastName, Count(*) RentedCars
From Customers C
Left Join Rentals R On R.CustomerID = C.CustomerID
Where
R.EndDate Is Not Null
Group By
FirstName, LastName

There’s still another an error: if there are two customers named John Smith, this query will aggregate them together. The next step is to include the CustomerID in the group by clause:

Select FirstName, LastName, Count(*) RentedCars
From Customers C
Left Join Rentals R On R.CustomerID = C.CustomerID
Where
R.EndDate Is Not Null
Group By
CustomerID, FirstName, LastName

I’d also ask if this query would throw an error:

Select FirstName, LastName, Count(*) RentedCars
From Customers C
Left Join Rentals R On R.CustomerID = C.CustomerID
Where
R.EndDate Is Not Null

In its default SQL mode, MySQL will just roll up all rentals to a randomly chosen customer and won’t throw an error. They get bonus points if they can tell me how to fix this issue with only_full_group_by.

The next query is to get all those customers who’ve rented more than five cars:

Select FirstName, LastName, Count(*) RentedCars
From Customers C
Left Join Rentals R On R.CustomerID = C.CustomerID
Where
R.EndDate Is Not Null
Group By
CustomerID, FirstName, LastName
Having Count(*) >= 5

You can then ask them about other aggregate functions, like Sum and Average.

Indexes

I like to ask them what indexes would help in this situation:

Select Make, Model, PurchasedDate
From CarTypes CT
Left Join Cars C On C.CarTypeID = CT.CarTypeID
Where
Make = 'Honda'
And
Type = 'Sedan';

They’ll hopefully come up with something similiar to this:

Create Index CarTypes_Make_Type_ix On CarTypes(Make, Type);

You can then ask if this same index will help the following query:

Select Make, Model, PurchasedDate
From CarTypes CT
Left Join Cars C On C.CarTypeID = CT.CarTypeID
Where
Type = 'Sedan';

The right answer is no, since Make appears first in the index. An index is sorted by the first column, so MySQL can’t use it since it doesn’t have that first column to search by. You can fix that index to cover both queries by changing the order:

Create Index CarTypes_Type_Make_ix On CarTypes(Type, Make);

The other index issue to talk about is SARGEability, or, the ability of the engine to use an index. The query engine cannot use an index if the column being compared is first passed into a function. Ask if the following would be able to use the index:

Select Make, Model, PurchasedDate
From CarTypes CT
Left Join Cars C On C.CarTypeID = CT.CarTypeID
Where
Type = 'Sedan'
And
Year(Year) = '2013'
;

Create Index CarTypes_Year_ix On CarTypes(Year);

Once again, the answer is no. To fix this, use the full date form instead of passing the column into a function:

Select Make, Model, PurchasedDate
From CarTypes CT
Left Join Cars C On C.CarTypeID = CT.CarTypeID
Where
Type = 'Sedan'
And
Year = '2013-01-01'
;

This covers the basics that a developer using a relational database like MySQL should know. Next few sections would cover interviews for a lead developer who would be helping others use MySQL to the fullest.

Routines

If things are going swimmingly and you’d like to see how far your candidate can go, get into stored routines. Ask what the difference is between functions and procedures. The answer is that functions can’t modify data and must return something, a row or a value, while a procedure is under no such obligation.

Next is why you should use them. This is a pretty open-ended question, but routines are generally used when you want to minimize the number of roundtrips to the MySQL server. Since they use a procedural language, you can do things like loop and branch or perform multiple queries in one fell swoop, thus allowing you accomplish a bunch of work with one network call.

The reasons you shouldn’t use them are equally enlightening. The biggest reason is that generally the stored procedure will contain business logic which means you’ll want to test it and MySQL routines are not easily tested. You don’t have the option to mock a table, so creating a test framework for a routine is painful.

The next reason is that it’s also somewhat painful to source control your stored procedures. Even if you have a schema migration script, it’s hard to track down when changes were made, and what features necessitated that change.

Finally, stored procedures are yet another part that determines yours application’s behavior, which makes reading the codebase harder. IntelliJ doesn’t have a “jump to stored procedure definition” option. Your code should be organized in such a way that developers understand where to find business rules, and stored routines just add another place to look.

Views

Although they are not super important, views a part of using a relational database well. I’ll ask what a view is and when you should use one. The basic answer is that it’s a named query that can be used like a regular table in other queries. They’re useful when you want to have a consistent way of defining something (such as what an active customer is) or when you want to stop typing the same joins over and over.

If you want to get fancy, you can ask whether views can be updated (can you write an insert or an update against it). The answer, with caveats, is yes.

Scaling

Something that often comes up is how to deal with growth. On certain applications, maybe you never have to worry about outgrowing a single server. If you are working on a busy system, then it’s good to dig into how you can make MySQL work for thousands of queries a second. This could be its own guide, but the following are good starting points. For each of these, I’d ask what is this and when should you use it? You could also ask, how do you make a system scale, and generally the answers will fall into these categories.

Scaling up or Vertical scaling. This is just buying a bigger, faster box with more storage. It’ll get you a long way. If you’re running a micro instance in AWS and experiencing performance issues, you just need to get a bigger instance size. Likewise, if you’re running a pizza box with 4 gig of RAM, it’s not time to start sharding yourself, you just need to upgrade in order to buy yourself some time.

Replication. This is the process of automatically copying data from one server to another, in a publisher/subscriber pattern. This allows you to create many read-only boxes that mirror a master server, so that you can scale out reads.

Partitioning. When you have a large table, you can split it up into several physical tables while maintaining the same logical table. Your application thinks its running against one table, while in reality, MySQL is writing to many smaller tables.

Denormalization. This is the process of storing redundant data in order to reduce the number of joins. Sometimes, joins can be a performance issue and you need to create a table that has all the data that an application needs for frequently made requests. An example of this would be how Twitter creates your feed (or once did in the past): they copy every tweet for each follower into a feed table. That way, when a user want to see their feed, they just write a query like:

Select Tweet From Feed Where UserID = ?;

vs.

Select Tweet
From Tweets T
Join Users U On U.UserID = T.UserID
Join Followers F On F.FollowedUserID = U.UserID
Where
F.FollowingUserID = ?

Service oriented architecture. Instead of creating one giant codebase with one overworked database, this is the idea of separating your application into leaner services that are responsible for one thing. This allows you spread the work across multiple databases. It also works to scale out your team, as each developer doesn’t have to understand a monolithic application, but can build tighter, smaller applications.

Sharding. This is the process of taking a table and distributing its rows across many servers. A simplistic example would be that for a Users table, usernames that start with A-M go to Server 1 and N-Z go to server 2. This is a last ditch effort because it makes everything much, much more complicated. There are a few systems that sit on top of MySQL to make this process easier, but it’s still a last resort.

Cloud

Most teams are now looking at providers like Amazon AWS to help them make their team more effective and to reduce risks. While nothing is free, services like AWS can take away so much pain that they’re worth it. This is a good discussion topic for a database developer lead since the move to a managed service is fairly recent and requires a good understanding of MySQL. You can’t just copy the same patterns that you’re used to with dedicated hardware over to Amazon.

You must instead realize what’s important and copy the purposes behind the patterns. For instance, Amazon doesn’t have giant boxes. You can get SSD-backed storage and large servers, but you’ll hit a limit (and the big boxes cost a ton). You need to think about scaling out sooner that you would with a physical server.

This is a pretty wide open topic in a field that’s pretty new, but I’d like to see the candidate discuss the tradeoffs of using MySQL in the cloud and how they would mitigate them. These are written with the idea that we’re talking about RDS on AWS.

Pros

  • Instant provisioning
  • MySQL with an API (made for DevOps)
  • Backups by default
  • Failover architecture built-in
  • Turnkey replication
  • Automatic patching
  • Push-button major version upgrades
  • DBA-free scaling
  • Basic, simple monitoring is part of the package

Cons

  • Costly
  • Only Oracle MySQL is available
  • No Nagios (or any other agent-based monitoring)
  • No XtraBackup
  • Dependant on AWS services like EBS
  • Largest instance sizes are equivalent to medium-sized boxes

Growing beyond

I’d also like to see how the candidate grows. Learning doesn’t happen by accident, and I’d like to see what steps they are taking to push themselves further, both in their technical ability and their interpersonal skills. A person that can learn from reading tutorials, blogs, or books, as well as from their own mistakes, is incredibly valuable.

I recently read Ender’s Game, which is about a child that is faced with the daunting task of saving the world. He’s matched up against other children who have the same mission in a rigorous training, but what separates him from the others is that he is constantly learning and changing. He’s not the best when he starts the school, but by the end, he’s far and away the most promising candidate.

This growth must be deliberate. Ask the candidate what book they’re reading right now, or what technology they’re learning. This doesn’t just apply to them, either. You need to provide an environment that encourages developers to get better. You yourself need to grow as well.

If you want to grow, I’m writing a book about the topics I’ve covered in this interview guide, as well as many others. It’s an artifact of my own growth, and I want to save you time my providing a good guide to using MySQL the right way.

Ready to take your MySQL skills to the next level?

You can subscribe below and I’ll send you updates on a new book I’m writing that helps you work with MySQL like a pro. If you sign-up, I’ll also send out stuff on MySQL like tips and tutorials.



 No spam


Tagged with:
Posted in Databases, IT

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>