In this assignment, you will work with data from the airline industry (
airline_industry.db
Download airline_industry.db
). You must download it and add it to your
working directory to write SQL queries against it. You will write some SQL directly. On certain questions, you will need to integrate Python and run SQL queries using pandas that can be returned as a dataframe and analyzed with the tools and techniques from Python. Note that there are multiple tables in the database: flights, airports, planes, weather, airlines.
Using Jupyter notebook, answer each of the following questions. This is an individual assignment. You must submit your own work. You will do your work in Jupyter notebook and must show the code you used to get the answer. To complete the assignment you must download your Jupyter notebook with code and answers as an html file and upload the notebook in html format. All questions should be answered fully when a text answer is required and all code must be shown to indicate how you arrived at the result.
Your report submission should be formatted using Markdown and code in Jupyter notebook as follows:
The name of the assignment as an H1, your name as an H2, and the date as an H3 all in the first cell block. Add a horizontal line/rule after the cell. The questions should then be answered in individual cells indicated by Q1, Q2, Q3…etc. in H4 format and separated from other questions by horizontal lines/rulers.
Note you must write out the SQL query used to generate the answer for all questions unless otherwise specified.
How many records are in each table in the database?
Did any flights depart and arrive exactly on time?
What is the mean arrival delay among flights?
How many flights had above average arrival delays?
What is the relationship between temperature and wind speed? Do changes in the mean temperature throughout the year (by month) move reflect changes in mean wind speed?
Print the first 5 rows of flights and airlines. Does the flights table contain the full name of the airline? If so, print them. If not, write a query to return the first 10 records in the flights data with the following columns: the origin, destination, carrier, flight number, and full name of the airline for each flight in a single table.
Your supervisor is interested in the airlines that offer the most flights in peak periods. Write a query that returns the number of flights each airline operates in December. The results should be a single table that contains five columns: carrier, month, name, and num_flights. The results should be limited to 10 and shown in descending order by number of flights.
Use pandas to connect to the database. Define a function that takes one argument–month_num–called num_flights_fun that uses pandas’ `.read_sql_query()` and runs a query that returns the number of flights each airline operates in whatever month is specified in the function call. The result should be returned as a pandas dataframe that contains five columns: carrier, month, name, and num_flights.
Test your function to ensure it works. Create a new dataframe called sept that contains the results with only records from September and a dataframe called jan that contains the results with only records from January. The only change you should make to get the results is adjusting the value of the input parameter you are passing in the function call in num_flights_fun. Print out the last five rows of sept and jan.
Using sept and jan, print the top five airlines with the greatest number of flights in September and in January, in descending order. Are the top five airlines the same in September and January? If so, what are the airlines? If not, which airlines differ from September to January?