CSE498, Collaborative Design, Spring 2020
Computer Science and Engineering
Michigan State University

Headquartered in Seattle, Amazon is the world’s largest online retailer and, through its Amazon Web Services (AWS) products, the world’s largest cloud services provider.

As a leader in the technology sector, Amazon has access to massive amounts of data. The company employs teams of data scientists who analyze this data to improve Amazon’s various offerings, including its product recommendations.

The task of finding the best dataset for a problem is time-consuming and requires significant manual work, including looking through thousands of individual files that are stored in many different locations. This process takes up a substantial amount of time that could be better used for development.

Our Amazon Data Hub software streamlines dataset acquisition with an easy-to-use website that allows data scientists to automatically search through Amazon’s collection of data.

When an Amazon data scientist uploads a dataset to our Amazon Data Hub repository, it undergoes automated analysis. This includes object detection and speech recognition for images, videos and audio, as well as statistical analysis of numerical data.
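
As a rough illustration of the statistical analysis step, the sketch below summarizes the numeric columns of a small CSV dataset using only the Python standard library. The function name `summarize_numeric_columns` and the choice of statistics are assumptions made for this example; they are not taken from the Amazon Data Hub code, which runs its analysis on AWS services.

```python
import csv
import io
import statistics


def summarize_numeric_columns(csv_text):
    """Return count, mean, min, max, and standard deviation per numeric column.

    Hypothetical helper for illustration only. Cells that cannot be parsed
    as numbers are skipped, so a column with no numeric cells is omitted.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {}
    for row in reader:
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
            columns.setdefault(name, []).append(number)

    summary = {}
    for name, values in columns.items():
        summary[name] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


sample = "name,price,rating\nA,10,4.0\nB,20,5.0\nC,30,\n"
summary = summarize_numeric_columns(sample)
print(summary["price"])
```

A summary like this could be stored alongside the dataset so that it appears later in search results, though how the real system stores its analysis output is not described here.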

Data scientists use the web application to search through our catalog of datasets. Search results include information provided when the dataset was uploaded, as well as information from our automated analysis. Intuitive visualizations of each dataset allow users to quickly evaluate its relevance.
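
The idea of searching over both uploader-provided information and automated analysis results can be sketched with a small in-memory catalog. Everything named here (`DatasetRecord`, `search_catalog`, the sample records, and the simple term-count ranking) is hypothetical and chosen for illustration; it is a stand-in for a real search service, not a description of how our system queries its index.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry used only for this example."""
    name: str
    description: str                            # provided by the uploader
    labels: list = field(default_factory=list)  # produced by automated analysis


def search_catalog(catalog, query):
    """Return dataset names ranked by how many query terms they match."""
    terms = query.lower().split()
    scored = []
    for record in catalog:
        # Combine uploaded metadata and analysis labels into one word set.
        text = " ".join([record.name, record.description, *record.labels])
        words = set(text.lower().split())
        score = sum(1 for term in terms if term in words)
        if score:
            scored.append((score, record.name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


catalog = [
    DatasetRecord("pets", "photos of pets", ["dog", "cat"]),
    DatasetRecord("calls", "customer support audio", ["refund", "dog"]),
    DatasetRecord("sales", "quarterly sales figures"),
]
print(search_catalog(catalog, "dog photos"))
```

In this sketch the "pets" dataset ranks first for the query "dog photos" because it matches on both an uploaded description word and an analysis label, which is the kind of combined match the search results are meant to surface.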

The Amazon Data Hub decreases the time it takes to find suitable datasets from hours to minutes, allowing data scientists to spend their time on more important work.

Our system uses AWS’s scalable products, including S3, DynamoDB, Rekognition, Transcribe, Lambda, Elastic MapReduce, and Elasticsearch, to store, process and search the datasets. Python Flask is used to connect our back end with our ReactJS front end.