Author: Furunda (富连达) | Published: 2023-07-20
OVERVIEW
Data is generated in many ways, and different applications and components each have their own rules for producing and managing it. The purpose of data collection is to gather this dispersed data in a unified way so that it can be managed centrally.
How to collect data?
Generally speaking, collection in a big data system is handled by dedicated collection machines. Collection is relatively time-consuming, especially real-time collection, which requires setting up listeners and waiting for data to be reported, so these operations should be kept separate from the business machines.
The big data ecosystem has corresponding solutions for data from different sources. Let's go through them one by one.
Log Collection
Suppose an application service on a server produces a .log file every day whose content is a sequence of IP access records, and this .log file needs to be synchronized to the data warehouse. What do we do?
First, if the server authorizes the collector to read the file directory directly over an open port, we can simply grab the file with wget and set up scheduling to control the collection frequency. That alone can meet the need.
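As a rough sketch of this approach (the host, export URL, and staging path below are assumptions, not anything prescribed by the article), a small script run from a daily scheduler might look like this:

```python
# pull_log.py -- minimal sketch of scheduled log pulling with wget
# (host, URL layout, and staging path are assumptions)
import os
import subprocess
from datetime import date, timedelta

def pull_daily_log(day: date) -> None:
    """Fetch one day's access log over HTTP and stage it into a date partition."""
    dt = day.strftime("%Y%m%d")
    url = f"http://app-server.example.com:8080/logs/access.{dt}.log"
    target = f"/data/staging/access_log/dt={dt}/access.log"
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # wget -O writes to the target file; a daily cron entry controls the collection frequency
    subprocess.run(["wget", "-q", "-O", target, url], check=True)

if __name__ == "__main__":
    pull_daily_log(date.today() - timedelta(days=1))  # collect yesterday's file
```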
If there is no fixed port to pull from, we can use the log collection tools of the big data ecosystem, such as flume or logstash, and report the data via the avro protocol.
The following takes flume as an example for a brief description.
flume is an open-source log collection system from Apache. The new version, flume NG, reworks the earlier flume OG to improve ease of use and further improve performance.
flume NG is mainly divided into three parts: source, channel, and sink.
The source is where data enters: it can do logtail-style line-by-line reading of files, consume from message queues such as kafka, or receive data from an upstream agent via the avro protocol. More and more modules are supported as versions advance; see the flume website for the latest list.
The channel is a buffer: when too much data flows in from upstream, it absorbs the surge and keeps the downstream from congesting. The main options today are memory and file channels. A memory channel is faster but occupies a lot of system memory and cannot guarantee data reliability: once the machine goes down, the data cannot be recovered. A file channel persists data to disk as files and keeps a record of the read position, so even after a crash it can resume from the recorded offset, avoiding duplication and loss. It is, however, limited by disk IO, reads more slowly, and can easily build up a large backlog when data is reported at high volume.
The sink is the consumer side: you specify the downstream module, which can write directly to hdfs, pass data to another service via the avro protocol, or write to a message queue such as kafka. See the official website for the supported modules.
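To make the three parts concrete, a minimal single-agent configuration might look like the sketch below; the agent name, file paths, and hdfs location are placeholders, not values from the article.

```properties
# minimal sketch: taildir source -> file channel -> hdfs sink (all paths are placeholders)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: tail application log files line by line
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.channels = c1

# channel: a file channel persists events to disk so they survive a crash
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data

# sink: write into hdfs, one directory per day
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/warehouse/logs/dt=%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```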
Because flume supports tailing logs, data can be captured line by line while the log is still being written, which also meets real-time requirements.
A good flume collection architecture runs an agent on every machine, aggregates through multiple collectors, and writes through multiple sinks. On the one hand this keeps the system stable; on the other it increases parallelism and guarantees write speed.
Since log data can be collected this way, are there other ways to handle real-time data? Of course.
Take clickstream data as an example: users report data to nginx, nginx can push it directly into kafka through a module, and the data is then consumed from kafka and written out on the consumer side. kafka can also feed storm, spark streaming, flink, and other real-time computation engines.
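As a rough illustration of the consumer side (the topic name, broker addresses, and event fields are assumptions), a small consumer using the kafka-python client could look like this:

```python
# clickstream_consumer.py -- minimal sketch of draining nginx clickstream events from kafka
# (topic name, broker list, and event fields are assumptions)
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "clickstream",                                    # topic that nginx reports into
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    group_id="clickstream-sink",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # the consumer side decides what "written" means here:
    # append to a staging file, forward to hdfs, or hand off to a streaming job
    print(event.get("ip"), event.get("url"))
```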
Database synchronization
If you need to synchronize data from a mysql database into the data warehouse, there are two approaches.
First, full synchronization: the entire table is synchronized once a day. This suits databases with little content; a new partition is created every day, and each day's partition holds the full data. The advantage is that the import is completely simple and there is no state to maintain. The disadvantage is that it takes a lot of space, and importing the whole database each time takes a long time.
Second, incremental export with a zipper table, which marks the full life cycle of each record. Because hdfs files are read sequentially as file IO and cannot be modified in place, any update in the database requires reading that day's binlog changes and rewriting the affected records: the new version is marked ACTIVE and the old version is marked HISTORY, and the latest state is assembled by combining data across multiple dt partitions. Maintenance cost is high and the rules are rigid, but the advantage is that only incremental files need to be produced and marked each day, with no full import.
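The marking rule itself is simple enough to sketch. The snippet below only illustrates the ACTIVE/HISTORY logic on in-memory records (field names such as id, start_dt, and end_dt are assumptions); in practice the merge is usually expressed as a hive or spark job over the previous partition plus the day's binlog extract.

```python
# zipper_merge.py -- illustrative sketch of the daily zipper-table merge rule
# (field names "id", "status", "start_dt", "end_dt" are assumptions)

ACTIVE_END = "9999-12-31"   # an open-ended end date marks the ACTIVE version of a record

def merge_zipper(zipper_rows, changed_rows, today):
    """Close out ACTIVE records that changed today and append their new versions."""
    changed = {row["id"]: row for row in changed_rows}   # latest image per key from today's binlog
    merged = []
    for row in zipper_rows:
        if row["end_dt"] == ACTIVE_END and row["id"] in changed:
            # the record was updated today: close its life cycle and mark it HISTORY
            row = dict(row, end_dt=today, status="HISTORY")
        merged.append(row)
    for row in changed.values():
        # append the new ACTIVE version (covers both updated and brand-new keys)
        merged.append(dict(row, start_dt=today, end_dt=ACTIVE_END, status="ACTIVE"))
    return merged
```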
For database synchronization we can use tools such as Sqoop, which handles full data imports. For specific usage, refer to the documentation of the corresponding tool.
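For reference, a daily full-table import might be launched roughly as follows; the connection string, credentials, table, and target directory are placeholders, and a real setup should follow the Sqoop documentation for the exact options.

```python
# sqoop_full_import.py -- sketch of a daily full-table Sqoop import into a dt partition
# (connection string, credentials, table name, and target directory are placeholders)
import subprocess
from datetime import date

dt = date.today().strftime("%Y%m%d")
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/appdb",
    "--username", "reader",
    "--password-file", "/user/etl/.mysql_password",    # keeps the password off the command line
    "--table", "orders",
    "--target-dir", f"/warehouse/ods/orders/dt={dt}",  # one full snapshot per day
    "--num-mappers", "4",
], check=True)
```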
How to store the data after collection?
Data Cleaning
Raw data cannot be imported into the data warehouse directly after collection. It is generated by application services, which may produce incorrect records for all sorts of reasons. Drawing up cleaning rules to screen out invalid data first not only saves storage space but also reduces the error rate of subsequent computations.
Commonly used cleaning rules include non-null checks on key fields, field format validation, content accuracy validation, data source validation, and so on. Different validation rules are formulated for different businesses; the goal is to clean out erroneous data before it enters the warehouse.
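A minimal sketch of this kind of rule-based screening might look like the following; the field names, timestamp format, and allowed sources are assumptions for illustration only.

```python
# clean.py -- minimal sketch of pre-warehouse cleaning rules
# (field names, timestamp format, and the allowed sources are assumptions)
import ipaddress
from datetime import datetime

REQUIRED_FIELDS = ("user_id", "ip", "event_time")   # key fields that must be non-null

def is_valid(record: dict) -> bool:
    """Apply the screening rules; records that fail any rule are dropped before loading."""
    # rule 1: key fields must be present and non-empty
    if any(not record.get(f) for f in REQUIRED_FIELDS):
        return False
    # rule 2: field format validation
    try:
        ipaddress.ip_address(record["ip"])
        datetime.strptime(record["event_time"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    # rule 3: data source validation -- accept only known reporting channels
    return record.get("source") in {"app", "web", "nginx"}

def clean(records):
    return [r for r in records if is_valid(r)]
```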
Data Storage
Currently, most data warehouse data lands on hdfs and is managed through hive.
hdfs is a distributed storage system: spreading data across distributed services avoids the storage ceiling of a single node, and keeping backup copies avoids data loss when equipment fails.
A file data.log, for example, can be split into data1.log, data2.log, and data3.log and stored on server1, server2, and server3. At the same time, each piece is stored on at least two servers, so that if one server goes down the complete data.log can still be reassembled from the copies on the remaining servers.
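This block and replica layout can be inspected with the hdfs fsck command; the wrapper below is only a convenience sketch, and the file path is a placeholder.

```python
# check_blocks.py -- sketch: list how a file is split into blocks and where the replicas live
# (the file path is a placeholder; requires the hdfs client to be installed)
import subprocess

result = subprocess.run(
    ["hdfs", "fsck", "/warehouse/logs/data.log", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # each block of data.log and the datanodes holding its replicas
```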
Scaling hdfs out is also very convenient: just add the new node's storage directory to the configuration and hand it to hdfs to manage.
hive is a database and table management tool built on top of hdfs for developers who are used to relational databases. The data is stored on hdfs, the metadata and basic database/table information are stored in mysql, and SQL syntax is supported for retrieving data.
When a user executes SQL on hive, hive calls a back-end execution engine and converts the query into mr, spark, or tez jobs to retrieve the data. Retrieval speed depends on the table size and the speed of the cluster's compute engine; compared with mysql, hive is much slower to execute, which is why dedicated OLAP engines appeared later.
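As a hedged sketch of what executing SQL on hive looks like from a client (the host, database, table, and the choice of the PyHive client are assumptions, not something the article specifies):

```python
# hive_query.py -- sketch of submitting SQL to hive through HiveServer2
# (host, port, database, and table names are assumptions; uses the PyHive client)
from pyhive import hive  # pip install pyhive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, database="dw")
cursor = conn.cursor()

# choose the back-end engine hive will translate the query into (mr, tez, or spark)
cursor.execute("SET hive.execution.engine=tez")

# the query is plain SQL; hive resolves the table through its metadata stored in mysql
cursor.execute(
    "SELECT ip, COUNT(*) AS pv FROM ods_access_log WHERE dt = '20230720' GROUP BY ip"
)
for ip, pv in cursor.fetchall():
    print(ip, pv)
```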
Summary
Whether for subsequent storage, computation, or etl processing, data collection is the start of the entire data flow, and a collection failure will affect every downstream result. The collection-side agents therefore need solid monitoring, alerting, and fault tolerance to ensure that no data is lost.
Furunda Co., Ltd. is an NI alliance partner, agent, and system integrator, mainly handling NI GPIB, NI LabVIEW, NI DAQ, NI boards, NI data acquisition cards, and other products. The company has a full set of hardware and software solutions for product test applications, with development covering ICT, Boundary Scan, functional testing, and system testing. Its business covers Shenzhen, Guangzhou, Zhuhai, Foshan, Nanjing, Hangzhou, Xiamen, Xi'an, Chengdu, Wuhan, Chongqing, Beijing, and other cities.
If you need anything, feel free to contact our online customer service!