
With two new data lake products launched back to back, AWS's Zhang Xia analyzes Amazon's core data competitiveness

via: 博客园     time: 2020/3/27 0:02:43


Amazon used to be Oracle's largest database customer in the world. We stored 75 PB of data in more than 7,500 database instances, and there are more than 1,000 different teams across Amazon. From operations, e-commerce and marketing to inventory, almost every part of the business ran on Oracle databases.

One thing we have done over the past year and a half to two years is to move off Oracle completely. In November last year we finished migrating all Oracle databases to our own corresponding products. The migration solved a series of problems, such as the difficulty of scaling and the high cost of Oracle support: database costs fell by 60 percent, administrative overhead fell by 70 percent, and performance improved by 40 percent.

At a recent media briefing, Zhang Xia, Chief Strategic Consultant at AWS, spoke with evident feeling. The Leiphone editor has twice heard a similar account from Amazon CTO Werner Vogels on stage at AWS re:Invent.

However, the big win

On March 24, AWS announced that two new products had officially launched in the AWS China (Ningxia) region, operated by Ningxia Western Cloud Data (NWCD): AWS Glue and Amazon Athena.

In terms of specific functions:

AWS Glue has officially launched in the AWS China (Ningxia) region, operated by Western Cloud Data. AWS Glue is a fully managed extract, transform and load (ETL) service and metadata catalog. It makes it easier for customers to prepare data and load it into databases, data warehouses and data lakes for analysis. With AWS Glue, data can be ready for analysis in minutes. Because AWS Glue is a serverless service, customers pay only for the compute resources they consume while running ETL jobs.

Amazon Athena has officially launched in the AWS China (Ningxia) region, operated by Western Cloud Data. Amazon Athena is an interactive query service that lets customers analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Because Athena is serverless, customers do not need to manage any infrastructure and pay only for the queries they run. Athena scales automatically and executes queries in parallel, so even large datasets and complex queries return results quickly.

Both releases also mean that AWS's data lake and data analysis solutions are becoming more complete.

At the briefing, Zhang Xia also explained AWS's data work in depth, introduced its data services, products, key concepts and operating methods, and revealed the logic behind how AWS builds data lakes and analytics.

Leiphone has edited and arranged his remarks below without changing their original meaning.

How does Amazon realize the concept of data lake?

The concept of the data lake was probably first introduced in May 2011, so it is only eight or nine years old.

Amazon has been promoting the data lake for a long time. First, Amazon has a foundational cloud service called Amazon S3, which was released on March 14, 2006 (White Day) and was the first public cloud service in the world. Amazon S3 can store any binary data, including structured and unstructured data. The services in the top half of the figure on the left perform various operations on this data. Below them are some data transfer tools.


On the right, you can take a rough look at the flow chart, or platform chart, of the data lake. In general, all kinds of data arrive from cameras, mobile phones, databases, cars, wind turbines and so on. We extract the data in some way, store it and register it in a catalog, with S3 at the core of the data lake.

Then, on the right, various analysis methods take the data out. It can be loaded back into a data warehouse, turned into reports, turned into forecasts, or used for machine learning. That is the concept of the data lake as a whole.


Based on this, we can see that the data lake is where all kinds of raw data collect. Like rain water and river water gathering in a basin, it holds every kind of data, and on top of it we can do whatever analysis we need: interactive queries, operational analysis, exchanging or even buying and selling data, visualization, real-time analysis, recommendation, prediction, and any other function the data requires.

At AWS, we have a corresponding service for each of the functions just mentioned. I will give you a brief introduction to them. All of these services are described in Chinese and English on our website. They are all delivered from the cloud, and they are simple and easy to use, with plenty of step-by-step guides.

As mentioned, we have Amazon S3 to store all kinds of data. It offers eleven nines (99.999999999%) of data durability, with six copies across three Availability Zones in the cloud. Behind it there is also a cold storage service called Amazon Glacier. Data that is not used often can be moved there at a much lower cost; it just takes three or four more hours to retrieve.
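To put the eleven-nines figure in perspective, here is a minimal arithmetic sketch. It assumes, as a simplification, that the durability number can be read as an independent annual loss probability per object; the object count is purely illustrative.

```python
from fractions import Fraction

# Eleven nines of durability: 99.999999999% of objects survive a given year.
# Assumption: each object is lost independently with this annual probability.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)  # 1 - 0.99999999999, kept exact


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the stated durability."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


# Illustrative figure: ten million stored objects.
stored_objects = 10_000_000
print(float(expected_annual_losses(stored_objects)))  # 0.0001 objects per year
print(int(years_per_lost_object(stored_objects)))     # 10000 years per lost object
```

Under these assumptions, storing ten million objects would mean losing a single object on average once every ten thousand years.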

Cold storage also has a deeper tier, Deep Archive. Beyond this, there is Amazon DynamoDB, a non-relational database that stores key-value data. In a game, for example, each player's level, remaining health and current weapon are key-value pairs. Large volumes of such data around the world are stored in non-relational databases like Amazon DynamoDB.
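The game example can be sketched as a key-value record. The snippet below is an illustration only: the player fields are hypothetical, and the helper converts a plain Python dict into the typed attribute-value form that DynamoDB's low-level API uses ("S" for strings, "N" for numbers sent as strings, "BOOL" for booleans). It does not call AWS.

```python
def to_attribute_value(value):
    """Encode one Python value in DynamoDB's typed attribute-value form."""
    if isinstance(value, bool):           # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}          # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_item(record: dict) -> dict:
    """Encode a whole record; each key becomes one typed attribute."""
    return {name: to_attribute_value(value) for name, value in record.items()}


# Hypothetical player state, keyed by player_id.
player = {"player_id": "p-1001", "level": 37, "health": 82, "weapon": "longbow"}
item = to_item(player)
print(item["level"])   # {'N': '37'}
print(item["weapon"])  # {'S': 'longbow'}
```

With a real client, an item of this shape would typically be written to a table whose partition key is `player_id`; the table itself is an assumption here.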

Let us look at some other services. One is Amazon RDS. RDS stands for Relational Database Service, a managed relational database. It was the first relational database service AWS launched in the cloud, and it supports engines such as SQL Server, Oracle, and the open-source PostgreSQL and MySQL.


We soon launched our own Amazon Aurora database, a relational database born in the cloud. RDS was a relational database hosted in the cloud; Aurora is a fully managed, cloud-native database that is fully compatible with MySQL and PostgreSQL.

Since its launch, Aurora has been the fastest-growing of all AWS cloud services, with a large number of users. All of the services I have mentioned so far are already available in China.

Another very important product is Amazon Redshift. It is a data warehouse, but a cloud data warehouse: very powerful, very scalable, and about one-tenth the cost of a traditional data warehouse. If you still need to go from databases to a data warehouse, you can see that we can implement that whole path in the cloud as well. We also have many new developments in data warehousing that cannot all be covered in this briefing.

I am pleased to report that Amazon Neptune, a graph database, has also landed in China in the past six months, following its global release. So Chinese users can also use the latest graph database.


There is also Amazon EMR. EMR stands for Elastic MapReduce. It uses clusters and open-source frameworks such as Hadoop, which we often hear about, to do big data cluster analysis. EMR is the way to do big data processing on Amazon's cloud; it is a product we have offered for a long time, and it is already available in China.
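For readers unfamiliar with the MapReduce model that EMR takes its name from, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to a word count. It is an illustration of the programming model only; it does not use Hadoop or EMR, and the sample log lines are made up.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


# Made-up input records.
logs = ["order placed", "order shipped", "order placed again"]
counts = word_count(logs)
print(counts["order"], counts["placed"])  # 3 2
```

On a real cluster the map and reduce phases run in parallel across many machines, which is what lets the same pattern scale to very large datasets.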

Another very important concept is processing real-time data. In the past, most work was batch processing of historical data; today far more of the data is real-time. Our service for this is called Amazon Kinesis. There are four different types of Kinesis service: one processes video streams directly, another delivers data directly to key services, and each has its own use. Real-time data analysis is an important part of the analytics framework.

Two important data lake services: Amazon Athena and AWS Glue

The focus today is on the following two services, which are very important components of the data lake and which we officially released in China only in the first quarter of this year.


The first product is called Amazon Athena, named after the goddess Athena. It is an interactive data query tool. We put all kinds of data in S3 and can then query that data directly with SQL, the standard database query language. Massive amounts of stored data thus get a fast, interactive query tool that works the way traditional SQL does, running directly against S3.
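As a rough sketch of what querying S3 data with SQL looks like in code, the function below submits a query, polls until it finishes, and returns the rows. It is written against the method names of the boto3 Athena client (`start_query_execution`, `get_query_execution`, `get_query_results`), but the client is passed in as a parameter and a small stand-in class is used here so the example runs without AWS access. The database name, table, bucket and returned rows are all hypothetical.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a SQL query, wait for it to finish, return rows as lists of strings.

    Only the first page of results is read; for SELECT queries the first row
    holds the column names.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    result = client.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


class StubAthenaClient:
    """Stand-in for a real Athena client so the sketch runs without AWS access."""

    def __init__(self):
        self._states = iter(["RUNNING", "SUCCEEDED"])

    def start_query_execution(self, **kwargs):
        return {"QueryExecutionId": "example-query-id"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}

    def get_query_results(self, QueryExecutionId):
        return {"ResultSet": {"Rows": [
            {"Data": [{"VarCharValue": "country"}, {"VarCharValue": "orders"}]},
            {"Data": [{"VarCharValue": "CN"}, {"VarCharValue": "2"}]},
        ]}}


rows = run_athena_query(
    StubAthenaClient(),
    sql="SELECT country, COUNT(*) AS orders FROM orders GROUP BY country",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
    poll_seconds=0,
)
print(rows)  # [['country', 'orders'], ['CN', '2']]
```

In real use, the stub would be replaced by a client created with `boto3.client("athena")`, and the table would first need to be defined over files in S3.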


The second service is called AWS Glue. As the name suggests, it is the glue that connects different data services. It has two main functions. The first is ETL: extract, transform and load, the most basic set of operations on data. ETL was originally used to move data from databases into a data warehouse, and Glue performs the same extraction, transformation and loading. The second is a data catalog service. Because the data is stored in the data lake, it needs to be labeled and classified along the way. Glue has a crawler that crawls the massive data in the data lake directly and automatically generates the data catalog.
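The two functions described above can be illustrated with a small local sketch: an extract, transform and load pass over some sample CSV data, followed by a "crawl" step that infers a schema and produces a catalog entry. This is a plain Python illustration of the ideas, not the Glue API; the data, field names and catalog layout are made up, and a real crawler would scan the data in S3 rather than an in-memory list.

```python
import csv
import io

RAW_CSV = """order_id,amount,country
1001,25.50,cn
1002,,us
1003,12.00,cn
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows (all strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, uppercase countries."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        })
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the target store, return how many."""
    target.extend(rows)
    return len(rows)


def crawl(table_name, rows):
    """Crawl: infer a column-to-type schema from the first row as a catalog entry."""
    schema = {column: type(value).__name__ for column, value in rows[0].items()}
    return {"table": table_name, "columns": schema, "row_count": len(rows)}


warehouse = []  # stand-in for the analytics store
loaded = load(transform(extract(RAW_CSV)), warehouse)
catalog_entry = crawl("orders", warehouse)
print(loaded)                    # 2
print(catalog_entry["columns"])  # {'order_id': 'int', 'amount': 'float', 'country': 'str'}
```

Once a catalog entry like this exists, a query tool knows which columns and types a dataset has without anyone describing them by hand.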

We launched these two services overseas some time ago, and this year we have completed their official release in China. They are now available in the Beijing region, operated by Sinnet, and in the Ningxia (Zhongwei) region, operated by Western Cloud Data.

Although the data lake is a very good approach, it still involves some complexity. I will run through the steps of operating a data lake very quickly.

We set up storage and save the data, then move the data and load it into different places, then clean it and build the data catalog. The storage has to be managed, and the data has to be stored securely and compliantly and managed well; then, when needed, tools take the data out for various analyses. These are the steps of operating a data lake.


We also launched a service called AWS Lake Formation, which was announced at our technology conference the year before last. It has not been officially launched in China yet, but we will launch it there very soon this year. By automating the work of building a data lake and helping you operate it, it lets many enterprises complete a data lake build in a few days.

At Amazon we have even more services, such as a quantum ledger database, a time series database, and two or three other new databases that are not yet available.

On top of this is the data lake. It has three main elements: Amazon S3 / Glacier, AWS Glue and AWS Lake Formation. AWS Lake Formation is not offered at present, but it will be soon.

Amazon Redshift is the data warehouse and Amazon EMR handles big data analysis. AWS Glue again plays a key role here, enabling serverless data analysis. Amazon Athena does interactive analysis, Amazon Elasticsearch Service does operational analysis, and Amazon Kinesis does real-time data analysis.

At the top are some of our presentation tools, including Amazon QuickSight, Amazon Polly, Amazon Transcribe and Amazon SageMaker. Amazon SageMaker is an artificial intelligence service that will soon be launched in China.

So this picture shows you the whole panorama of our big data analytics services. The vast majority of these services have landed in China, where we can offer a full range of big data and data lake analytics, and many customers are already using them.

Why use AWS to build data lakes and run analytics?

In short, it is easy to use, efficient, comprehensive and secure, and it can meet a wide variety of needs. AWS innovations are all driven by customer needs.

In Forrester's 2019 big data analytics report, AWS holds the top position. There are other similar reports, such as Gartner's database analysis report and its data management solutions report, and we are well placed in all of them. Around the world, including in China, a large number of companies, whether Internet companies or traditional enterprises, use AWS data analysis and data lake tools.

The full data lake offering alone is used by tens of thousands of companies, including Airbnb, Yelp (the US counterpart of Dianping), travel companies, the largest pharmaceutical companies and so on, covering almost every industry.

Setting other companies aside, let us talk about Amazon itself.


Amazon used to be the largest user of Oracle databases in the world. It stored 75 PB of data in more than 7,500 database instances. There are more than 1,000 different teams across Amazon, and from operations, e-commerce and marketing to inventory, almost every part of the business ran on Oracle databases.

One thing we have done over the past year and a half to two years is to move off Oracle databases completely. By around last November, we had migrated all Oracle databases to our own corresponding products. The migration solved a series of problems, such as the difficulty of scaling and the high cost of Oracle support. It reduced database costs by 60%, reduced management costs by 70%, and increased performance by 40%.

The other example is that Amazon has built a data lake across the whole enterprise. Internally it is called Galaxy. It is not an AWS product but Amazon's own data lake deployment.

The data lake brings Amazon's data together for all kinds of big data analysis. It holds 50 to 100 PB of data, and Amazon runs as many as 600,000 analysis jobs on it every day, covering user recommendations, operations, inventory, purchasing and pricing information, all of which can be done through the data lake.

This is also one of Amazon's core competitive strengths.
