Yadong Li | Na Liu | Xiangwei Shi | Feng Lu |
---|---|---|---|
4608283 | 4615689 | 4614909 | 4619161 |
IN4400 Programming and data science for the 99% (2016-2017 Q2)
Assignment 2 Group 18
Introduction
Traditionally, box office has been the standard used to determine whether a movie is good or bad and to analyze the development of the movie industry. Meanwhile, movie ratings have become popular on the Internet, and checking other people’s reviews and remarks before going to the cinema is a new habit. Can movie ratings reflect the development of the whole movie industry, and can people in the industry learn which kinds of movies audiences like from the ratings those audiences give?
Movies have existed and influenced people’s lives since the 1890s. As movie technology developed, an industry grew out of silent motion pictures. Today there is not only the largest “movie factory”, Hollywood in America, but also other film industries around the world, such as Bollywood. For a long time, box office, meaning the number of tickets sold or the amount of money raised by ticket sales, has been the figure that is projected and analyzed for individual movies and for the whole industry. However, there is now another widespread way to learn about and analyze movies: rating them. Let us find out how movie ratings can reflect the movie industry.
Outline
Our goal is now clear: we want to explore movie information data and movie rating data and, hopefully, get some interesting insights from them. This is essentially an exploratory data analysis. To achieve our goal, we first choose proper tools. After that, we collect raw data from reliable sources and clean it to prepare for analysis. Another very important step is visualizing the data. This matters for exploratory data analysis because we do not know in advance which patterns or conclusions are hidden in the data. With visualization, we expect to make some interesting findings on this journey.
Method and Tools
The tools we used are the combination of Python, the Pandas library and Jupyter Notebook. We believe this combination is of great help for data analysis tasks. First, it is very easy to load data from different sources, or you can create your own. Experienced Python users can also collect data with web crawlers or from available APIs. Second, you can do almost anything you want to clean, manipulate or visualize data with the help of Python and its libraries. One of the biggest advantages of using a programming language such as Python is that you can write your own functions for whatever the analysis requires, which gives a lot of freedom. Finally, it is also a great tool for data visualization.
To summarise, Python, as a programming language, acts like a powerful tool for data acquisition, cleaning and manipulation. Pandas, as a wonderful library, provides a friendly data structure and some helpful functions and features for us. Jupyter Notebook, on the other hand, provides a simple interactive environment for programming in Python and later presenting to the audience.
Data Acquisition and Cleaning
In order to investigate the relationship between movie ratings and the movie industry, we use exploratory data analysis, based on the dataset from the MovieLens website (Movielens). Our dataset consists of more than 24,000,000 ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users, and it includes tag genome data with 12 million relevance scores across 1,100 tags. From the dataset we can get ratings, on a scale from 0 to 5, of movies released from 1874 to 2016. These data were created between January 09, 1995 and October 17, 2016. Since the users were selected at random, the data can be considered representative. All selected users had rated at least one movie, and each user is represented only by an id. The data is more than 1 GB in size (datafile). Because the dataset is so large, it is representative and meaningful to visualize movie ratings and analyze the movie industry with it.
The screenshot below shows the raw data right after it is read in, before any preprocessing.

This is the movie rating data:

To clean the data, we first dropped the movies with only a few ratings. Such movies are the ones that end up with an average rating of 5.0, and they distort the analysis. We then separated the genres into different rows, because some movies have more than one genre and the data source keeps all of a movie’s genres in a single field. Finally, we separated the movie name from the release year, which are also kept in a single field, for the same reason.
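The three cleaning steps above can be sketched in Pandas roughly as follows. This is a minimal illustration on toy data, not the notebook code used for the report: the column names follow the MovieLens CSV files, while `MIN_RATINGS`, the function name `clean` and the sample rows are assumptions made for the example (the report does not state the cutoff it used).

```python
import pandas as pd

# Toy stand-ins for the MovieLens movies and ratings tables.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Obscure Film (2001)"],
    "genres": ["Animation|Children|Comedy", "Action|Crime|Thriller", "Drama"],
})
ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.5, 4.5, 3.5, 5.0],
})

MIN_RATINGS = 2  # hypothetical threshold; the report does not give its cutoff


def clean(movies, ratings, min_ratings=MIN_RATINGS):
    # 1. Keep only movies that received at least `min_ratings` ratings.
    counts = ratings.groupby("movieId")["rating"].count()
    keep = counts[counts >= min_ratings].index
    kept = movies[movies["movieId"].isin(keep)].copy()

    # 2. Split "Title (1995)" into a title column and a numeric year column.
    year = kept["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
    kept["year"] = pd.to_numeric(year)
    kept["title"] = kept["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

    # 3. Put each genre of a movie on its own row.
    kept["genres"] = kept["genres"].str.split("|")
    return kept.explode("genres").reset_index(drop=True)


cleaned = clean(movies, ratings)
print(cleaned)
```

In this toy run the single-rating movie with a perfect 5.0 is dropped, which is exactly the kind of entry the first step is meant to remove.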
This is what we get after cleaning:

Data Visualization
The whole data visualization and analysis process is carried out in Jupyter Notebook, a web application that allows users to create and share documents containing live code, equations, visualizations and explanatory text. It can be used for data cleaning and transformation, numerical simulation and statistical modeling. We therefore use it for the whole data visualization, together with Python, a high-level programming language that is easy to use for data analysis.
For example, the picture above shows the top 100 ranked movies of each year. The color and the position of each circle represent the average rating and the ranking of the movie. The size of each circle represents the variance of the movie’s ratings, which indicates how controversial the movie is. From the picture, we can see that more and more movies receive good ratings and that average ratings rise over the years.
Ratings Reflect What Kind of Genre of Movies Are Popular
In this dataset, every rated movie belongs to one or more genres. From the ratings of the movies in each genre, we calculated the mean rating of every genre.
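A mean rating per genre could be computed along the following lines. The sketch assumes the mean is taken over all individual ratings of the movies in a genre; the report does not say whether it averaged individual ratings or per-movie averages, and the sample data are invented for illustration.

```python
import pandas as pd

# Toy data; column names follow the MovieLens CSV files.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Film-Noir|Crime", "Horror", "Crime|Horror"],
})
ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 3],
    "rating": [5.0, 4.0, 2.0, 3.0, 3.0],
})

# One row per (movie, genre) pair.
by_genre = movies.assign(genres=movies["genres"].str.split("|")).explode("genres")

# Attach every rating to each genre of its movie, then average per genre.
genre_means = (
    ratings.merge(by_genre, on="movieId")
    .groupby("genres")["rating"]
    .mean()
    .sort_values(ascending=False)
)
print(genre_means)
```

A movie with several genres contributes its ratings to each of them, so the genre means are not independent of one another.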
What can we learn from the mean rating of each genre? Let us explore the information together. Each colored bar shows the mean rating of one genre. The highest bar, in the middle of the picture, represents film-noir, which means people like film-noir movies more than other types. By the same reasoning, people prefer animation, crime, documentary, drama, IMAX and war movies to other types. Among all the types, horror movies generally get the lowest mean rating, possibly because scary plots and background music do not appeal to a wide audience. The other types have mean ratings between those of film-noir and horror, which means audiences regard them as ordinary. Hence, we can find out which genres tend to get a high average rating and become popular once they are released. Conversely, horror movies may struggle to earn public praise.
Ratings Reflect the Distribution of the Qualities of Movies
Apart from extreme cases, the ratings given by users represent the popularity of a movie and also reflect its quality to some degree. For instance, a movie with a high rating of 4.5 is much better in quality than one with a rating of 1.5. Therefore, we classified the average ratings into three classes: high quality, medium quality and low quality. High quality movies have ratings from 4.0 to 5.0, medium quality movies have ratings from 2.0 to 4.0, and the remaining movies, with ratings from 0.0 to 2.0, are low quality. The graph below shows the distribution of ratings for each year and thus reflects the distribution of movie quality.
The three colors in the graph represent the three quality classes. The large orange area tells us that most movies are of medium quality, and that good and bad movies are fewer than ordinary ones. The graph also gives the exact number of movies in each of the three classes. Despite the increasing number of movies, the number of high quality movies stays roughly constant, which suggests that there cannot be many truly great movies. If there were many “great quality” movies, they would become the ordinary ones.
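The three-way classification can be sketched as below. The text leaves open which class the boundary values 2.0 and 4.0 fall into; this example assumes they belong to the higher class. The helper name `quality`, the column names and the sample rows are hypothetical.

```python
import pandas as pd

# Toy per-movie average ratings with release years.
movie_stats = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [2015, 2015, 2015, 2016, 2016],
    "avg_rating": [4.5, 3.2, 1.5, 4.0, 2.0],
})


def quality(avg):
    # Assumption: a boundary value counts toward the higher class.
    if avg >= 4.0:
        return "high"
    if avg >= 2.0:
        return "medium"
    return "low"


movie_stats["quality"] = movie_stats["avg_rating"].apply(quality)

# Number of movies per quality class and year, the quantity a stacked
# area chart of the yearly distribution would be drawn from.
counts = movie_stats.groupby(["year", "quality"]).size().unstack(fill_value=0)
print(counts)
```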
Ratings Reflect the Controversial High-Quality Movies
If a movie has a high average rating, you can say it is a good movie. But does everyone agree with you? Let us find out together! In statistics, variance measures how far a set of numbers is spread out from its average, so among movies with the same average rating, a higher variance means more disagreement. Therefore, we took the 10 movies with the highest average rating in each year as examples and calculated the variance of each movie’s ratings. We then plotted the graph below, which answers the question above.
In the graph, each circle represents a movie that is among the top 10 rated movies of its year. The color shows the average rating of the movie; the color bar on the right of the graph gives the relationship between colors and average ratings. The vertical position of each circle represents the variance of the movie’s ratings. A higher circle means a higher variance, which means the movie is more controversial. The exact values can be read from the graph. We can see that, as time passes, the number of movies with a high rating variance increases, the circles become lighter in color, and the average ratings increase. So even though the quality of movies keeps improving, the number of controversial movies still grows. In other words, the movies you believe are great are controversial, and movies from recent decades have more viewers with totally different opinions on their quality than movies from the 1930s to the 1970s.
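Selecting the top-rated movies of each year and measuring their rating variance might look like the sketch below. It uses `TOP_N = 2` on toy data instead of the report’s 10, and it assumes the sample variance (Pandas’ default, `ddof=1`); the report does not say which variance definition was used.

```python
import pandas as pd

# Toy data: three movies from the same year with three ratings each.
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rating": [5.0, 5.0, 5.0, 5.0, 3.0, 5.0, 1.0, 2.0, 3.0],
})
movies = pd.DataFrame({"movieId": [1, 2, 3], "year": [2000, 2000, 2000]})

TOP_N = 2  # the report uses the top 10 per year

# Mean, sample variance and rating count per movie.
stats = (
    ratings.groupby("movieId")["rating"]
    .agg(mean="mean", variance="var", n="count")
    .reset_index()
    .merge(movies, on="movieId")
)

# Within each year, keep the TOP_N movies with the highest mean rating.
top = (
    stats.sort_values(["year", "mean"], ascending=[True, False])
    .groupby("year")
    .head(TOP_N)
)
print(top)
```

Here movie 1 is rated 5.0 by everyone (variance 0), while movie 2 also ranks in the top two but splits its audience, so it would sit higher on the variance axis.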
Ratings Tell Us What Happened in the Movie Industry in 2016
Besides all the visualization and analysis of movies and the movie industry, we can find out what happened in the movie industry in 2016.
The graph above shows what happened in 2016. Each circle is a movie released in 2016. The color and position of each circle represent its average rating, and the size of each circle shows the variance of its ratings. The color bar on the right of the graph relates colors to average ratings. In addition, each movie has a unique number, so you can find the exact information for each movie on the graph.
Conclusion
The data visualization and analysis of movie ratings give us several conclusions about the development of movies and the movie industry.
The average ratings of movies increase over the years.
People prefer film-noir movies, and horror movies receive the lowest ratings.
Despite the increase in the average ratings of movies, the number of high quality movies does not change greatly.
As time passes, high quality movies are becoming more and more controversial.
What happened in the movie industry in 2016 can be found from the users’ ratings of movies.
Through these conclusions based on the analysis of movie ratings, we can learn more about the whole movie industry. Therefore, movie ratings can serve as a method of analyzing the movie industry. At the very least, we can tell directors who are not yet famous that making a film-noir movie has a better chance of success than making a horror movie.