Database Design & API User Interface
Junjie Wang
4/29/202414 min read
In response to the fragmented data management challenges faced by Airbnb’s Host Department in New York (AHD-NY), we have developed a sophisticated PostgreSQL-based relational database system. Our solution integrates ETL processes to consolidate disparate data, robust security measures to safeguard sensitive information, and real-time dashboarding tools like Metabase to provide actionable insights. This comprehensive approach enhances data accuracy, streamlines management, and supports strategic decision-making, ultimately empowering AHD-NY to optimize host performance and drive market expansion. Additionally, the introduction of a user-friendly web form facilitates efficient data entry, further advancing operational efficiency and ensuring a secure, scalable data infrastructure.
Problem Statement
The Airbnb Host Department in New York (AHD-NY) struggles with fragmented data management, as crucial information such as reviews, host details, and listings are scattered across disparate systems and manual spreadsheets. This decentralized approach hampers operational efficiency and inhibits data-driven decision-making processes, hindering AHD-NY’s ability to support hosts effectively and unlock new market opportunities.
Moreover, security and scalability concerns loom over the current data infrastructure, posing risks of data breaches and inhibiting future growth potential.
Project Scenario
The Airbnb Host Department in New York (AHD-NY) is responsible for overseeing and supporting hosts within the region, ensuring accurate listings and providing hosts resources for success in the rental market. However, limited market expansion has been observed, recognizing the significance of data-driven decision-making, they are now determined to provide VIP hosts with insightful data information and dashboards. In line with this commitment, they are dedicated to offering compelling VIP membership packages at attractive, lower fees. This strategic initiative aims to empower hosts with the knowledge and insights necessary to unlock new opportunities and drive a larger market presence in New York. To address this, our team has been commissioned by AHD-NY to design a comprehensive relational database system.
Proposal
Our proposed solution involves developing a robust relational database system leveraging PostgreSQL, streamlining data management processes and provide real-time dashboard insights. The data analysts at AHD-NY will utilize our system to make data-driven results for hosts to optimize their rental properties and opportunities. The rationale behind choosing a relational database system lies in its ability to handle structured data efficiently and manage complex relationships between hosts, listings, reviews, and other relevant entities.
For our initial plan of action, we will start by reviewing the Airbnb dataset to understand the structure and relationships between different variables. Next, we will develop ETL processes to extract data from disparate systems and transform it into a format compatible with the relational database. This step is crucial for migrating existing data seamlessly and establishing a centralized repository for all host-related information. Concurrently, we will implement robust user authentication and authorization mechanisms to safeguard sensitive data and ensure compliance with privacy regulations. We will then design an Entity-Relationship (ER) diagram to visualize the database schema. Ultimately, we will create a relational database in 3NF (Third Normal Form), ensuring that it captures all necessary aspects of the Airbnb listings and hosts data.
Once the data migration and security measures are in place, we will integrate reporting and dashboarding tools such as Metabase with the relational database. These tools will enable AHD-NY analysts to visualize key metrics and performance indicators in real-time, empowering them to make data-driven decisions for hosts.
Furthermore, we propose enhancing data input capabilities by introducing a user-friendly web form. This form will be accessible on a dedicated webpage, enabling members of the AHD-NY team to input host information effortlessly. Once submitted, the data will be seamlessly stored in the PostgreSQL database, streamlining the data entry process and ensuring data integrity.
In short, our work has many benefits for Airbnb. Firstly, our construction of the database plays a very important role in the storage, management, and analysis of data information. For example, by breaking the table apart, we can make the data structure clearer and reduce data redundancy and duplication. The establishment of such a standardized database model can ensure the consistency and accuracy of the data and make it more convenient for Airbnb to manage and maintain the data, which improves the efficiency of data management. In addition, the establishment of a segmented database not only improves the efficiency of business decision-making by making data query and analysis more convenient but also simplifies the permission management settings of the database, improves data security, protects the data security of users and platforms, and reduces the risk of data leakage.
Timeline Overview
Task Deadline
🔵 Data Cleaning and Integration March 22, 2024
🔵 Database Understanding and Design March 27, 2024
🔵 Database Implementation March 27, 2024
🔵 ETL Implementation April 10, 2024
🔵 Dashboard Integration April 15, 2024
🔵 Web Form Development April 20, 2024
🔵 Web Form Testing April 23, 2024
1: Data Cleaning and Integration:
Cleanse and preprocess the raw CSV files, addressing missing values, outliers, and inconsistencies.
Standardize data formats and units across all datasets.
Merge/join the cleaned datasets into a unified dataset based on common keys or attributes.
2: Database Understanding and Database Design:
Analyze the Airbnb dataset to understand its structure and relationships between variables.
Identify key entities and attributes for inclusion in the database.
Design the database schema, including tables, relationships, and constraints.
Create an Entity-Relationship (ER) diagram to visualize the proposed database structure.
3: Database Implementation:
Develop the PostgreSQL database schema based on the ER diagram.
Create tables, define data types, constraints, and relationships as per the design.
Ensure database normalization up to the Third Normal Form (3NF).
Define and create appropriate functions and triggers to ensure database integrity.
4: ETL Implementation:
Implement the ETL process to extract data from the cleaned datasets
Transform it according to the database schema
Load it into the PostgreSQL database.
5: Dashboard Integration:
Integrate reporting and dashboarding tools (Metabase) with the relational database.
Develop real-time dashboards to visualize key metrics and performance indicators.
6: Web Form Development:
Design and develop a user-friendly web form for data input, makes it easy for members of the Airbnb Host Department to input new data.
Data Validation: Implement data validation mechanisms within the web form to ensure that the input data meets the required format and quality standards before being submitted to the database.
Ensure seamless integration between the web form and the PostgreSQL database, allowing data entered through the form to be directly stored in the database tables.
7: Web Form Testing
Functional Testing: Conduct functional testing to verify that all features and functionalities of the web form are working as intended. Test data entry, submission, error handling, and database integration.
Usability Testing: Perform usability testing to assess the user-friendliness and intuitiveness of the web form. Gather feedback from team members to identify any usability issues or areas for improvement.
Data Validation Testing: Validate the data validation mechanisms implemented in the web form to ensure that input data meets the required format and quality standards. Test various scenarios to assess the effectiveness of validation rules.
Database Choice
Data Cleaning Python File Link
There are four reasons why we choose this dataset. First, this dataset closely aligns with the goals of the Airbnb Host Department in New York (AHD-NY) to provide VIP hosts with insightful data information and dashboards.
Second, the dataset is extensive, including characteristics such as listing URLs, property types, room types, accommodations, number of reviews, ratings, host information, and pricing, among others. This volume of information makes it possible to analyze the rental market in a variety of ways, revealing patterns and trends that can guide strategic choices and illuminating the relationships between host offerings and customer preferences. The relationships between various entities, including hosts, listings, and reviews, as well as the diversity of data types present in the dataset offer a strong basis for creating a database schema that can adequately represent the intricacies of the New York Airbnb rental market.
Furthermore, the dataset provides a rich source of geographical data, including locations and neighborhood names, which is essential for understanding spatial distribution and regional market differences within New York. This spatial analysis can help identify high-demand areas, inform pricing strategies, and guide the allocation of resources for host support and development.
Finally, the dataset’s inclusion of host information, such as response time, acceptance rate, and super host status, offers a unique opportunity to assess host performance and its impact on guest satisfaction and booking rates. This can help identify best practices and areas for improvement in host management and support.
By using this dataset, we can help AHD-NY make data-driven decisions that can result in better market expansion and host assistance, as well as to provide VIP hosts with practical insights to maximize their rental properties. The dataset is a priceless tool for our project because it has the ability to open up new doors and increase our market presence in New York.
Cleaned Data Overview
Dataset Link For: cleaned_listing.csv
Normalization & ETL Processing
Normalization
The normalization plan aims to efficiently organize the data, reduce redundancy, and ensure consistency and integrity. Our group begin by defining different entities that represent the logical structure of the Airbnb system. These entities include hosts, neighborhoods, listings, calendars, listing_amenities, reviewers, and listing_reviews. For each entity, we extract important properties from the dataset, ensuring that each attribute represents a single data point according to the first normal form (1NF).
The host table contains information about the hosts, including host_id, host_url, host_name, and other personal and account information. The host_id is a unique identifier for each host, meeting the requirement for a primary key to assure unique records.
The neighborhood table contains the location details, which break down the area-specific data in the postings. It uses the neighborhood_id as a unique identifier, eliminating the requirement for duplicating neighborhood information in the listing table while adhering to the third normal form (3NF) and removing transitive dependencies.
In the listing table, each record contains a unique listing_id that uniquely identifies each Airbnb listing. Additionally, foreign keys like host_id and neighborhood_id are included, referencing the host and neighborhood tables, respectively. This design eliminates redundancy and supports efficient data retrieval, following 2NF and 3NF principles by ensuring that all non-key attributes are fully functionally dependent on the primary key and have no transitive dependencies.
The amenities table and the listing_amenity table together manage the many-to-many relationship between listings and their amenities. The amenities table lists each amenity with a unique amenity_id, and the listing_amenity table uses composite keys listing_id and amenity_id to uniquely identify each pairing, which is a sophisticated step beyond 3NF to accommodate many-to-many relationships between entities.
Furthermore, we normalize the reviews by separating them into two tables: reviewer and listing_review. The reviewer table stores unique reviewers, with reviewer_id as the primary key. The listing_review table contains composite keys of reviewer_id, listing_id, and review_date together to manage the many-to-many relationship between listing and review to track the reviews for each listing. This structure prevents duplication of reviewer information and maintains a clear separation between reviewers and their reviews.
Finally, the calendar table records the availability and pricing of each listing over time, with listing_id and calendar_date as a composite primary key. It relates to the listing table directly and is normalized to prevent the duplication of listing information across dates.
Our normalization procedure arranges the data into logically structured tables, enforces relationships through primary and foreign key constraints, and maintains the data model’s integrity. This technique allows for the effective structuring of Airbnb’s data, simplifies data maintenance, and supports precise and robust data analysis.
ETL Processing
The ETL process for the Airbnb dataset is a comprehensive transformation and loading sequence tailored to organize data within a PostgreSQL database to facilitate efficient queries and analysis.
Extract:
We begin by extracting the pre-cleaned CSV files for listings, reviews, and calendar, into pandas DataFrames. This is a straightforward step using the read_csv function from pandas, which effectively captures the data in a format conducive to manipulation and processing.
Transform:
During the transformation stage, we handle each entity separately to ensure the data meets the schema requirements.
To avoid redundancy in the host table, we extract essential fields from the listing dataframe and remove any duplicate entries based on the host ID.
We also extract unique neighborhood names and regions from the listing dataframe to create the neighborhood table. We then generate a new neighborhood_id for each distinct neighborhood-region pair, guaranteeing that each listing can relate to the neighborhood using this ID.
The listing table requires a foreign key(neighborhood_id) from the previously created neighborhood table. We choose important fields from the listing dataframe, also verifying that data types and values match our database schema before loading.
For the calendar table, we chose listing_id, calendar_date, available, price from the calendar dataframe, excepted the unnecessary adjusted_price column.
The transformation of the amenities table involves extracting/stacking individual amenities from the lists, and creating a unique amenity_list. This list is used to populate the amenities table with a unique amenity_id for each amenity.
The listing_amenity table is a junction table that records the many-to-many relationship between listings and amenities. For each amenity listed in a listing, we create a record in this table associating the listing_id with the amenity_id.
To fill the reviewer table, reviewer data is extracted from the review dataframe and duplicates are eliminated.
The listing_review table stores the relationship between listings and reviews, where each review is uniquely identified by its reviewer_id, listing_id, and review_date.
Load:
We establish a connection to the PostgreSQL database using SQLAlchemy and create the tables according to our ERD schema. Once the DataFrame tables are created with the same attributes as the schema, we load the transformed DataFrame into the SQL database. We use the to_sql function, which allows us to specify the table name, connection, and whether to append or replace data. For each table, the data is loaded in the following order to respect the dependencies due to foreign key constraints: host, neighborhood, listing, calendar, amenities, listing_amenity, reviewer, and listing_review. After loading the data into each table, we perform a query using SELECT * FROM tablename to confirm that the data has been inserted correctly. This detailed ETL process ensures that the data in the PostgreSQL database is organized, consistent, and ready for queries and analysis.
Sample ERD
Project Scenario
The Airbnb Host Department in New York (AHD-NY) is responsible for overseeing and supporting hosts within the region, ensuring accurate listings and providing hosts resources for success in the rental market. However, limited market expansion has been observed, recognizing the significance of data-driven decision-making, they are now determined to provide VIP hosts with insightful data information and dashboards. In line with this commitment, they are dedicated to offering compelling VIP membership packages at attractive, lower fees. This strategic initiative aims to empower hosts with the knowledge and insights necessary to unlock new opportunities and drive a larger market presence in New York. To address this, our team has been commissioned by AHD-NY to design a comprehensive relational database system.
Analytics Applications
Our designed database system offers a multifaceted approach to data interaction, catering to the diverse needs of both analysts and C-level officers. By leveraging a combination of direct querying capabilities and user-friendly web page designs, our system enables seamless access to critical business insights and facilitates informed decision-making at all organizational levels.
For Analysts:
Analysts have direct access to the database, allowing them to perform in-depth data exploration and analysis tailored to their specific requirements. Through tools like PostgreSQL, Python, and R, analysts can import our database backup file and execute SQL queries to retrieve relevant datasets and do querying analysis. Our database design enhances usability by seamlessly integrating with Jupyter Notebooks and Python Integrated Development Environments (IDEs), facilitating an interactive environment for iterative analysis and visualization. By integrating programming libraries such as SQLAlchemy or psycopg2, analysts can establish connections to the PostgreSQL database, enabling them to run predefined queries or custom scripts for data retrieval and manipulation. This allows them to perform complex queries, filter data, and derive actionable insights, leveraging the system’s robust querying capabilities to extract meaningful business intelligence to the C-level officers.
For C-level Officers:
C-level officers benefit from streamlined access to critical business insights through interactive dashboards and automated reports powered by Metabase, seamlessly integrated with our database system. Metabase simplifies data exploration and visualization, presenting key metrics and performance indicators in easy-to-understand formats. Executives can effortlessly monitor business metrics, track trends, and identify areas requiring attention, all within a user-friendly interface.
API Web Page Use Case:
Airbnb User Interface Main Page Screenshot
The API Web Page user interface serves as a pivotal tool for database managers, facilitating efficient information retrieval and management tasks within our system. With this interface, database managers can seamlessly access the database and perform various operations such as searching, inserting, and updating hosts. For instance, they can use the API to retrieve specific host information or update existing host data, ensuring simultaneous and accurate information management.
Airbnb Host Searching System Screenshot
Actions performed through the API interface will be immediately updated to the backend PostgreSQL database, ensuring data consistency and integrity across the system. This real-time synchronization mechanism ensures that any changes made via the API, such as inserting or updating new hosts, are promptly propagated to the database, enabling swift updates and modifications without delay.
Airbnb Insert Host Page Screenshot
Furthermore, the API interface incorporates robust security measures, including secure authentication mechanisms and database access controls, to safeguard sensitive data and prevent unauthorized access. Only authorized users can interact with the system and perform permitted actions.
Airbnb Security Check Screenshot
In addition, robust triggers are implemented within the API system, such as the primary key constraint. If the input data does not comply with our database schema, the user interface system will not allow inserts and updates to the backend PostgreSQL, providing error feedback to prevent data manipulation.
Additionally, the user interface enforces not null checks, ensuring that page information cannot be submitted to the backend PostgreSQL if NOT NULL values are missing from the inputs. This comprehensive approach to database system guarantees data integrity and consistency by enforcing strict adherence to predefined database schema, ensuring a seamless and error-free user experience.
Overall, the API Web Page interface provides database managers with a user-friendly toolset for streamlined database management. With intuitive guidance and seamless integration, managers can efficiently access, manipulate, and synchronize data with the backend PostgreSQL database in real-time. This user-centric approach enhances operational efficiency, simplifies database management processes, and promotes data consistency throughout the system.
Tools Used:
This database system relies on PostgreSQL as its backend database, providing robust storage and management capabilities for data. Flask is utilized as the web framework for developing web-based interfaces, enabling direct interaction with the database. SQLAlchemy package acts as the intermediary layer between the API or Python applications and the database, facilitating efficient execution of SQL queries and data manipulation.
Metabase serves as the primary business intelligence tool within the system, enabling the generation of interactive dashboards and automated reports. It offers intuitive data visualization and analysis features, empowering users to gain insights into the data effortlessly.
Redundancy and Performance:
To bolster data integrity and accessibility, our database system implements regular backups of the PostgreSQL database in conjunction with a cache mechanism seamlessly integrated into our API, enhancing both reliability and performance optimization. Through periodic database backups, we mitigate the risk of data loss, while the cache mechanism expedites data retrieval, minimizing latency for users. This approach underscores our commitment to delivering dependable data interactions and a well-designed database system.
Metabase Dashboard
Conclusion
In this project, we aim to design a comprehensive relational database system (RDMS) that improves the efficiency and accuracy of data-driven decision making by integrating and optimizing data management processes and providing real-time dashboard insights. To achieve this goal, we chose PostgreSQL with its high scalability and compatibility as our relational database management system, and designed an ER diagram based on the 3NF standardized benchmark to reduce data redundancy and improve query efficiency. After improving the maintainability and scalability of the system, we migrated the existing data to our relational data system through the ETL (Extract, Transform, Load) process to ensure data consistency and integrity. On this basis, technicians can connect with the database through any channel and conduct complex data queries. Once data migration and security measures were in place, we used Metabase to connect databases to simplify data exploration and visualization and improved the speed and quality of decision making. This allows Airbnb-NY managers to easily monitor business metrics, track trends, identify areas to focus on, and present a range of results to their users. Finally, we set up an API Web Page interface to simplify the data entry process and improve the efficiency and accuracy of data entry. With the API interface seamlessly integrated into our RDBMS, Airbnb-NY will be able to manage data more effectively and increase its market competitiveness via data analysis, and hosts will be able to respond more quickly to market changes, ultimately driving Airbnb’s business growth in the New York market.
Contact Me
+86 13301911207
+1 5512292307
© 2024. All rights reserved.
jw4451@columbia.edu